* [PATCH V8 00/11] blk-mq: improvement CPU hotplug
@ 2020-04-24 10:23 Ming Lei
  2020-04-24 10:23 ` [PATCH V8 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone Ming Lei
                   ` (11 more replies)
  0 siblings, 12 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-24 10:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Hi,

Thomas mentioned:
    "
     That was the constraint of managed interrupts from the very beginning:
    
      The driver/subsystem has to quiesce the interrupt line and the associated
      queue _before_ it gets shutdown in CPU unplug and not fiddle with it
      until it's restarted by the core when the CPU is plugged in again.
    "

However, neither drivers nor blk-mq do that before a hctx becomes inactive
(i.e. all CPUs mapped to the hctx are offline). Even worse, blk-mq still
tries to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().

This patchset addresses the issue in two stages (a sketch of stage 1 follows
the list):

1) add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as inactive, and drain all in-flight requests if the hctx
is going to be dead.

2) re-submit IO in the CPUHP_BLK_MQ_DEAD handler after the hctx becomes dead

- clone the remaining requests and re-submit them, so these IOs are mapped to
another live hctx for dispatch (see 10/11)
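
A minimal, purely illustrative sketch of stage 1, condensed from patches
06/11 and 07/11 (the helpers marked "illustrative" do not exist under these
names; cpumask checks, locking and error handling are omitted):

    /* teardown callback: runs before the CPU actually goes offline */
    static int blk_mq_hctx_notify_offline(unsigned int cpu,
                                          struct hlist_node *node)
    {
            struct blk_mq_hw_ctx *hctx =
                    hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);

            if (last_online_cpu_of_hctx(hctx, cpu)) {   /* illustrative */
                    set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
                    drain_inflight_rqs(hctx);           /* illustrative */
            }
            return 0;
    }

    static int __init blk_mq_init(void)
    {
            /*
             * The existing CPUHP_BLK_MQ_DEAD registration is kept as-is;
             * blk_mq_hctx_notify_online() simply clears BLK_MQ_S_INACTIVE.
             */
            cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
                                    blk_mq_hctx_notify_online,
                                    blk_mq_hctx_notify_offline);
            return 0;
    }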

Thanks to John Garry for running lots of tests on arm64 with this patchset
and for co-investigating all kinds of issues.

Thanks to Christoph for his review of V7.

Please comment & review, thanks!

https://github.com/ming1/linux/commits/v5.7-rc-blk-mq-improve-cpu-hotplug

V8:
	- add patches to share code with blk_rq_prep_clone
	- code re-organization as suggested by Christoph, most of it is
	in 04/11, 10/11
	- add reviewed-by tag

V7:
	- fix updating .nr_active in get_driver_tag
	- add hctx->cpumask check in cpuhp handler
	- only drain requests whose tag is >= 0
	- passed more aggressive cpu hotplug & IO tests

V6:
	- simplify getting driver tag, so that we can drain in-flight
	  requests correctly without using synchronize_rcu()
	- handle re-submission of flush & passthrough request correctly

V5:
	- rename BLK_MQ_S_INTERNAL_STOPPED as BLK_MQ_S_INACTIVE
	- re-factor code for re-submit requests in cpu dead hotplug handler
	- address requeue corner case

V4:
	- resubmit IOs in dispatch list in case that this hctx is dead 

V3:
	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
	- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
	- patch 4 & patch 5 in V1 have been merged to the block tree, so
	  remove them
	- address comments from John Garry and Minwoo

Ming Lei (11):
  block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone
  block: add helper for copying request
  blk-mq: mark blk_mq_get_driver_tag as static
  blk-mq: assign rq->tag in blk_mq_get_driver_tag
  blk-mq: support rq filter callback when iterating rqs
  blk-mq: prepare for draining IO when hctx's all CPUs are offline
  blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  block: add blk_end_flush_machinery
  blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead
  blk-mq: re-submit IO in case that hctx is inactive
  block: deactivate hctx when the hctx is actually inactive

 block/blk-core.c           |  29 +--
 block/blk-flush.c          | 141 ++++++++++++---
 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |  39 ++--
 block/blk-mq-tag.h         |   4 +
 block/blk-mq.c             | 353 +++++++++++++++++++++++++++++--------
 block/blk-mq.h             |  22 ++-
 block/blk.h                |  11 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |   6 +
 include/linux/cpuhotplug.h |   1 +
 12 files changed, 479 insertions(+), 133 deletions(-)

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
-- 
2.25.2



* [PATCH V8 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone
  2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
@ 2020-04-24 10:23 ` Ming Lei
  2020-04-24 10:32   ` Christoph Hellwig
                     ` (2 more replies)
  2020-04-24 10:23   ` Ming Lei
                   ` (10 subsequent siblings)
  11 siblings, 3 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-24 10:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	Mike Snitzer, dm-devel

So far blk_rq_prep_clone() is only used to set up one underlying cloned
request from a dm-rq request. Block integrity can be enabled for both dm-rq
and the underlying queues, so it is reasonable to clone the request's
nr_integrity_segments. Also, write_hint comes from the bio, so it should
have been cloned too.

So clone nr_integrity_segments and write_hint in blk_rq_prep_clone().

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7e4a1da0715e..91537e526b45 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1636,9 +1636,13 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 		rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
 		rq->special_vec = rq_src->special_vec;
 	}
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+	rq->nr_integrity_segments = rq_src->nr_integrity_segments;
+#endif
 	rq->nr_phys_segments = rq_src->nr_phys_segments;
 	rq->ioprio = rq_src->ioprio;
 	rq->extra_len = rq_src->extra_len;
+	rq->write_hint = rq_src->write_hint;
 
 	return 0;
 
-- 
2.25.2



* [PATCH V8 02/11] block: add helper for copying request
  2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
@ 2020-04-24 10:23   ` Ming Lei
  2020-04-24 10:23   ` Ming Lei
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-24 10:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	Mike Snitzer, dm-devel

Add a new helper, blk_rq_copy_request(), for copying a request. The helper
will be used later in this patchset for re-submitting requests, so make it
a block layer internal helper.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 33 +++++++++++++++++++--------------
 block/blk.h      |  2 ++
 2 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 91537e526b45..76405551d09e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1587,6 +1587,24 @@ void blk_rq_unprep_clone(struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
 
+void blk_rq_copy_request(struct request *rq, struct request *rq_src)
+{
+	/* Copy attributes of the original request to the clone request. */
+	rq->__sector = blk_rq_pos(rq_src);
+	rq->__data_len = blk_rq_bytes(rq_src);
+	if (rq_src->rq_flags & RQF_SPECIAL_PAYLOAD) {
+		rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
+		rq->special_vec = rq_src->special_vec;
+	}
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+	rq->nr_integrity_segments = rq_src->nr_integrity_segments;
+#endif
+	rq->nr_phys_segments = rq_src->nr_phys_segments;
+	rq->ioprio = rq_src->ioprio;
+	rq->extra_len = rq_src->extra_len;
+	rq->write_hint = rq_src->write_hint;
+}
+
 /**
  * blk_rq_prep_clone - Helper function to setup clone request
  * @rq: the request to be setup
@@ -1629,20 +1647,7 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 			rq->bio = rq->biotail = bio;
 	}
 
-	/* Copy attributes of the original request to the clone request. */
-	rq->__sector = blk_rq_pos(rq_src);
-	rq->__data_len = blk_rq_bytes(rq_src);
-	if (rq_src->rq_flags & RQF_SPECIAL_PAYLOAD) {
-		rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
-		rq->special_vec = rq_src->special_vec;
-	}
-#ifdef CONFIG_BLK_DEV_INTEGRITY
-	rq->nr_integrity_segments = rq_src->nr_integrity_segments;
-#endif
-	rq->nr_phys_segments = rq_src->nr_phys_segments;
-	rq->ioprio = rq_src->ioprio;
-	rq->extra_len = rq_src->extra_len;
-	rq->write_hint = rq_src->write_hint;
+	blk_rq_copy_request(rq, rq_src);
 
 	return 0;
 
diff --git a/block/blk.h b/block/blk.h
index 0a94ec68af32..bbbced0b3c8c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -120,6 +120,8 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
 		rq->rq_disk = bio->bi_disk;
 }
 
+void blk_rq_copy_request(struct request *rq, struct request *rq_src);
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
 bool __bio_integrity_endio(struct bio *);
-- 
2.25.2



* [PATCH V8 03/11] blk-mq: mark blk_mq_get_driver_tag as static
  2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
  2020-04-24 10:23 ` [PATCH V8 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone Ming Lei
  2020-04-24 10:23   ` Ming Lei
@ 2020-04-24 10:23 ` Ming Lei
  2020-04-24 12:44   ` Hannes Reinecke
  2020-04-24 16:13   ` Martin K. Petersen
  2020-04-24 10:23 ` [PATCH V8 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-24 10:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

Now all callers of blk_mq_get_driver_tag are in blk-mq.c, so mark
it as static.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 2 +-
 block/blk-mq.h | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a7785df2c944..79267f2e8960 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1027,7 +1027,7 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
-bool blk_mq_get_driver_tag(struct request *rq)
+static bool blk_mq_get_driver_tag(struct request *rq)
 {
 	struct blk_mq_alloc_data data = {
 		.q = rq->q,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 10bfdfb494fa..e7d1da4b1f73 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -44,7 +44,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
 void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 				bool kick_requeue_list);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
-bool blk_mq_get_driver_tag(struct request *rq);
 struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_ctx *start);
 
-- 
2.25.2



* [PATCH V8 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag
  2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (2 preceding siblings ...)
  2020-04-24 10:23 ` [PATCH V8 03/11] blk-mq: mark blk_mq_get_driver_tag as static Ming Lei
@ 2020-04-24 10:23 ` Ming Lei
  2020-04-24 10:35   ` Christoph Hellwig
  2020-04-24 13:02   ` Hannes Reinecke
  2020-04-24 10:23 ` [PATCH V8 05/11] blk-mq: support rq filter callback when iterating rqs Ming Lei
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-24 10:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

With the none elevator, rq->tag is assigned as soon as the request is
allocated, so there isn't any way to figure out whether a request is being
dispatched. Also the code path wrt. the driver tag becomes a bit different
between the none case and the io scheduler case.

When a hctx becomes inactive, we have to prevent any request from being
dispatched to the LLD, and getting the driver tag provides the perfect
chance to do that. Meanwhile we can drain any such requests by checking
whether rq->tag is assigned.

So assign rq->tag only when blk_mq_get_driver_tag() is called.

This also simplifies the code dealing with the driver tag a lot.
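
In other words, the invariant after this patch can be summarized as follows
(an illustrative summary, not new code):

    /*
     * rq->internal_tag: always valid once the request is allocated
     * rq->tag == -1:    no driver tag yet, the request is not visible
     *                   to the LLD
     * rq->tag >= 0:     driver tag assigned in blk_mq_get_driver_tag(),
     *                   i.e. the request is being (or has been) dispatched
     */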

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-flush.c | 18 ++----------
 block/blk-mq.c    | 75 ++++++++++++++++++++++++-----------------------
 block/blk-mq.h    | 21 +++++++------
 block/blk.h       |  5 ----
 4 files changed, 51 insertions(+), 68 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index c7f396e3d5e2..977edf95d711 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -236,13 +236,8 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
 		error = fq->rq_status;
 
 	hctx = flush_rq->mq_hctx;
-	if (!q->elevator) {
-		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
-		flush_rq->tag = -1;
-	} else {
-		blk_mq_put_driver_tag(flush_rq);
-		flush_rq->internal_tag = -1;
-	}
+	flush_rq->internal_tag = -1;
+	blk_mq_put_driver_tag(flush_rq);
 
 	running = &fq->flush_queue[fq->flush_running_idx];
 	BUG_ON(fq->flush_pending_idx == fq->flush_running_idx);
@@ -317,14 +312,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	flush_rq->mq_ctx = first_rq->mq_ctx;
 	flush_rq->mq_hctx = first_rq->mq_hctx;
 
-	if (!q->elevator) {
-		fq->orig_rq = first_rq;
-		flush_rq->tag = first_rq->tag;
-		blk_mq_tag_set_rq(flush_rq->mq_hctx, first_rq->tag, flush_rq);
-	} else {
-		flush_rq->internal_tag = first_rq->internal_tag;
-	}
-
+	flush_rq->internal_tag = first_rq->internal_tag;
 	flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
 	flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
 	flush_rq->rq_flags |= RQF_FLUSH_SEQ;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 79267f2e8960..65f0aaed55ff 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -276,18 +276,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	struct request *rq = tags->static_rqs[tag];
 	req_flags_t rq_flags = 0;
 
-	if (data->flags & BLK_MQ_REQ_INTERNAL) {
-		rq->tag = -1;
-		rq->internal_tag = tag;
-	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
-			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
-		}
-		rq->tag = tag;
-		rq->internal_tag = -1;
-		data->hctx->tags->rqs[rq->tag] = rq;
-	}
+	rq->internal_tag = tag;
+	rq->tag = -1;
 
 	/* csd/requeue_work/fifo_time is initialized before use */
 	rq->q = data->q;
@@ -472,14 +462,18 @@ static void __blk_mq_free_request(struct request *rq)
 	struct request_queue *q = rq->q;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
-	const int sched_tag = rq->internal_tag;
 
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
-	if (rq->tag != -1)
-		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
-		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
+
+	if (hctx->sched_tags) {
+		if (rq->tag >= 0)
+			blk_mq_put_tag(hctx->tags, ctx, rq->tag);
+		blk_mq_put_tag(hctx->sched_tags, ctx, rq->internal_tag);
+	} else {
+		blk_mq_put_tag(hctx->tags, ctx, rq->internal_tag);
+        }
+
 	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
 }
@@ -527,7 +521,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
 		blk_stat_add(rq, now);
 	}
 
-	if (rq->internal_tag != -1)
+	if (rq->q->elevator && rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq, now);
 
 	blk_account_io_done(rq, now);
@@ -1027,33 +1021,40 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
-static bool blk_mq_get_driver_tag(struct request *rq)
+static bool __blk_mq_get_driver_tag(struct request *rq)
 {
 	struct blk_mq_alloc_data data = {
-		.q = rq->q,
-		.hctx = rq->mq_hctx,
-		.flags = BLK_MQ_REQ_NOWAIT,
-		.cmd_flags = rq->cmd_flags,
+		.q		= rq->q,
+		.hctx		= rq->mq_hctx,
+		.flags		= BLK_MQ_REQ_NOWAIT,
+		.cmd_flags	= rq->cmd_flags,
 	};
-	bool shared;
 
-	if (rq->tag != -1)
-		return true;
+	if (data.hctx->sched_tags) {
+		if (blk_mq_tag_is_reserved(data.hctx->sched_tags,
+				rq->internal_tag))
+			data.flags |= BLK_MQ_REQ_RESERVED;
+		rq->tag = blk_mq_get_tag(&data);
+	} else {
+		rq->tag = rq->internal_tag;
+	}
 
-	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
-		data.flags |= BLK_MQ_REQ_RESERVED;
+	if (rq->tag == -1)
+		return false;
 
-	shared = blk_mq_tag_busy(data.hctx);
-	rq->tag = blk_mq_get_tag(&data);
-	if (rq->tag >= 0) {
-		if (shared) {
-			rq->rq_flags |= RQF_MQ_INFLIGHT;
-			atomic_inc(&data.hctx->nr_active);
-		}
-		data.hctx->tags->rqs[rq->tag] = rq;
+	if (blk_mq_tag_busy(data.hctx)) {
+		rq->rq_flags |= RQF_MQ_INFLIGHT;
+		atomic_inc(&data.hctx->nr_active);
 	}
+	data.hctx->tags->rqs[rq->tag] = rq;
+	return true;
+}
 
-	return rq->tag != -1;
+static bool blk_mq_get_driver_tag(struct request *rq)
+{
+	if (rq->tag != -1)
+		return true;
+	return __blk_mq_get_driver_tag(rq);
 }
 
 static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e7d1da4b1f73..d0c72d7d07c8 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -196,26 +196,25 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
-					   struct request *rq)
+static inline void blk_mq_put_driver_tag(struct request *rq)
 {
-	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+	int tag = rq->tag;
+
+	if (tag < 0)
+		return;
+
 	rq->tag = -1;
 
+	if (hctx->sched_tags)
+		blk_mq_put_tag(hctx->tags, rq->mq_ctx, tag);
+
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
 		atomic_dec(&hctx->nr_active);
 	}
 }
 
-static inline void blk_mq_put_driver_tag(struct request *rq)
-{
-	if (rq->tag == -1 || rq->internal_tag == -1)
-		return;
-
-	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
-}
-
 static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 {
 	int cpu;
diff --git a/block/blk.h b/block/blk.h
index bbbced0b3c8c..446b2893b478 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -26,11 +26,6 @@ struct blk_flush_queue {
 	struct list_head	flush_data_in_flight;
 	struct request		*flush_rq;
 
-	/*
-	 * flush_rq shares tag with this rq, both can't be active
-	 * at the same time
-	 */
-	struct request		*orig_rq;
 	struct lock_class_key	key;
 	spinlock_t		mq_flush_lock;
 };
-- 
2.25.2



* [PATCH V8 05/11] blk-mq: support rq filter callback when iterating rqs
  2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (3 preceding siblings ...)
  2020-04-24 10:23 ` [PATCH V8 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
@ 2020-04-24 10:23 ` Ming Lei
  2020-04-24 13:17   ` Hannes Reinecke
  2020-04-24 10:23 ` [PATCH V8 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-04-24 10:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Currently a request is considered in-flight only when its state has been
updated to MQ_RQ_IN_FLIGHT, which is done by the driver via
blk_mq_start_request().

Actually, from blk-mq's view, a request can be considered in-flight as soon
as its tag is >= 0.

Pass an rq filter callback so that we can iterate over requests much more
flexibly.

Meantime make blk_mq_all_tag_busy_iter() non-static, so that it can be
called from blk-mq internally.
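
As a hedged illustration of the new interface (the real in-tree user is
added in patch 07/11, which counts requests that already own a driver tag),
a caller could combine a filter with the iterator as below; the function
names are made up for the example:

    static bool rq_has_driver_tag(struct request *rq, void *data,
                                  bool reserved)
    {
            return rq->tag >= 0;    /* filter: only driver-tagged rqs */
    }

    static bool count_rq(struct request *rq, void *data, bool reserved)
    {
            unsigned int *count = data;

            (*count)++;
            return true;            /* keep iterating */
    }

    static unsigned int count_driver_tagged_rqs(struct blk_mq_tags *tags)
    {
            unsigned int count = 0;

            blk_mq_all_tag_busy_iter(tags, count_rq, rq_has_driver_tag,
                                     &count);
            return count;
    }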

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c | 39 +++++++++++++++++++++++++++------------
 block/blk-mq-tag.h |  4 ++++
 2 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6e904a..2e43b827c96d 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -255,6 +255,7 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
 struct bt_tags_iter_data {
 	struct blk_mq_tags *tags;
 	busy_tag_iter_fn *fn;
+	busy_rq_iter_fn *busy_rq_fn;
 	void *data;
 	bool reserved;
 };
@@ -274,7 +275,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * test and set the bit before assining ->rqs[].
 	 */
 	rq = tags->rqs[bitnr];
-	if (rq && blk_mq_request_started(rq))
+	if (rq && iter_data->busy_rq_fn(rq, iter_data->data, reserved))
 		return iter_data->fn(rq, iter_data->data, reserved);
 
 	return true;
@@ -294,11 +295,13 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
  *		bitmap_tags member of struct blk_mq_tags.
  */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
-			     busy_tag_iter_fn *fn, void *data, bool reserved)
+			     busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
+			     void *data, bool reserved)
 {
 	struct bt_tags_iter_data iter_data = {
 		.tags = tags,
 		.fn = fn,
+		.busy_rq_fn = busy_rq_fn,
 		.data = data,
 		.reserved = reserved,
 	};
@@ -310,19 +313,30 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
 /**
  * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
  * @tags:	Tag map to iterate over.
- * @fn:		Pointer to the function that will be called for each started
- *		request. @fn will be called as follows: @fn(rq, @priv,
- *		reserved) where rq is a pointer to a request. 'reserved'
- *		indicates whether or not @rq is a reserved request. Return
- *		true to continue iterating tags, false to stop.
+ * @fn:		Pointer to the function that will be called for each request
+ * 		when .busy_rq_fn(rq) returns true. @fn will be called as
+ * 		follows: @fn(rq, @priv, reserved) where rq is a pointer to a
+ * 		request. 'reserved' indicates whether or not @rq is a reserved
+ * 		request. Return true to continue iterating tags, false to stop.
+ * @busy_rq_fn: Pointer to the function that will be called for each request,
+ * 		@busy_rq_fn's type is same with @fn. Only when @busy_rq_fn(rq,
+ * 		@priv, reserved) returns true, @fn will be called on this rq.
  * @priv:	Will be passed as second argument to @fn.
  */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
-		busy_tag_iter_fn *fn, void *priv)
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
+		void *priv)
 {
 	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
+		bt_tags_for_each(tags, &tags->breserved_tags, fn, busy_rq_fn,
+				priv, true);
+	bt_tags_for_each(tags, &tags->bitmap_tags, fn, busy_rq_fn, priv, false);
+}
+
+static bool blk_mq_default_busy_rq(struct request *rq, void *data,
+		bool reserved)
+{
+	return blk_mq_request_started(rq);
 }
 
 /**
@@ -342,7 +356,8 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
-			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
+			blk_mq_all_tag_busy_iter(tagset->tags[i], fn,
+					blk_mq_default_busy_rq, priv);
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 2b8321efb682..fdf095d513e5 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -21,6 +21,7 @@ struct blk_mq_tags {
 	struct list_head page_list;
 };
 
+typedef bool (busy_rq_iter_fn)(struct request *, void *, bool);
 
 extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
 extern void blk_mq_free_tags(struct blk_mq_tags *tags);
@@ -34,6 +35,9 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
+		void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
-- 
2.25.2



* [PATCH V8 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (4 preceding siblings ...)
  2020-04-24 10:23 ` [PATCH V8 05/11] blk-mq: support rq filter callback when iterating rqs Ming Lei
@ 2020-04-24 10:23 ` Ming Lei
  2020-04-24 13:23   ` Hannes Reinecke
  2020-04-24 10:23 ` [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-04-24 10:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Most blk-mq drivers depend on managed IRQs' automatic affinity to set up
their queue mapping. Thomas mentioned the following point[1]:

"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

However, the current blk-mq implementation doesn't quiesce the hw queue
before the last CPU in the hctx is shut down. Even worse, CPUHP_BLK_MQ_DEAD
is a cpuhp state handled after the CPU is already down, so there isn't any
chance for blk-mq to quiesce the hctx wrt. CPU hotplug.

Add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE, for blk-mq to stop queues
and wait for completion of in-flight requests.

We will stop the hw queue and wait for completion of in-flight requests
when one hctx is becoming dead in the following patch. This may cause a
deadlock for some stacking blk-mq drivers, such as dm-rq and loop.

Add the blk-mq flag BLK_MQ_F_NO_MANAGED_IRQ and set it for dm-rq and loop,
so we needn't wait for completion of in-flight requests from dm-rq & loop,
and the potential deadlock can be avoided.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c     |  1 +
 block/blk-mq.c             | 19 +++++++++++++++++++
 drivers/block/loop.c       |  2 +-
 drivers/md/dm-rq.c         |  2 +-
 include/linux/blk-mq.h     |  3 +++
 include/linux/cpuhotplug.h |  1 +
 6 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..8e745826eb86 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -239,6 +239,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 65f0aaed55ff..d432cc74ef78 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2261,6 +2261,16 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	return 0;
+}
+
+static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
+{
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2297,6 +2307,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2356,6 +2369,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+				&hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3610,6 +3626,9 @@ static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				blk_mq_hctx_notify_online,
+				blk_mq_hctx_notify_offline);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e..784f2e038b55 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2037,7 +2037,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13b..5f1ff70ac029 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b45148ba3291..f550b5274b8b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
 	 */
 	atomic_t		nr_active;
 
+	/** @cpuhp_online: List to store request if CPU is going to die */
+	struct hlist_node	cpuhp_online;
 	/** @cpuhp_dead: List to store request if some CPU die. */
 	struct hlist_node	cpuhp_dead;
 	/** @kobj: Kernel object for sysfs. */
@@ -391,6 +393,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 77d70b633531..24b3a77810b6 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -152,6 +152,7 @@ enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-- 
2.25.2



* [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (5 preceding siblings ...)
  2020-04-24 10:23 ` [PATCH V8 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
@ 2020-04-24 10:23 ` Ming Lei
  2020-04-24 10:38   ` Christoph Hellwig
                     ` (2 more replies)
  2020-04-24 10:23 ` [PATCH V8 08/11] block: add blk_end_flush_machinery Ming Lei
                   ` (4 subsequent siblings)
  11 siblings, 3 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-24 10:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Before one CPU goes offline, check whether it is the last online CPU of the
hctx. If so, mark this hctx as inactive, and wait for completion of all
in-flight IOs originating from this hctx. Meantime check in
blk_mq_get_driver_tag() whether this hctx has become inactive, and if so,
release the allocated tag.

This guarantees that there isn't any in-flight IO before the managed IRQ
line is shut down once all CPUs of this IRQ line are offline.
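
A condensed, purely illustrative view of the ordering this relies on (the
real code lives in blk_mq_get_driver_tag() and blk_mq_hctx_notify_offline()
below): either the submission side observes BLK_MQ_S_INACTIVE and backs
off, or the drain side observes the assigned driver tag and waits for it.

    /* hot-unplug side: the last online CPU of the hctx is going away */
    set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
    smp_mb__after_atomic();             /* pairs with smp_mb() below */
    /* ... then iterate tags and wait until no request has rq->tag >= 0 */

    /* submission side: blk_mq_get_driver_tag() */
    rq->tag = blk_mq_get_tag(&data);    /* driver tag assigned (sched case) */
    smp_mb();                           /* only needed if the issuing CPU is
                                         * not one of the hctx's CPUs */
    if (test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)) {
            blk_mq_put_driver_tag(rq);  /* back off; rq will be re-submitted */
    }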

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |   1 +
 block/blk-mq.c         | 124 +++++++++++++++++++++++++++++++++++++----
 include/linux/blk-mq.h |   3 +
 3 files changed, 117 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 8e745826eb86..b62390918ca5 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INACTIVE),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d432cc74ef78..4d0c271d9f6f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1050,11 +1050,31 @@ static bool __blk_mq_get_driver_tag(struct request *rq)
 	return true;
 }
 
-static bool blk_mq_get_driver_tag(struct request *rq)
+static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
 {
 	if (rq->tag != -1)
 		return true;
-	return __blk_mq_get_driver_tag(rq);
+
+	if (!__blk_mq_get_driver_tag(rq))
+		return false;
+	/*
+	 * Add one memory barrier in case that direct issue IO process is
+	 * migrated to other CPU which may not belong to this hctx, so we can
+	 * order driver tag assignment and checking BLK_MQ_S_INACTIVE.
+	 * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
+	 * and driver tag assignment are run on the same CPU in case that
+	 * BLK_MQ_S_INACTIVE is set.
+	 */
+	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
+		smp_mb();
+	else
+		barrier();
+
+	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
+		blk_mq_put_driver_tag(rq);
+		return false;
+	}
+	return true;
 }
 
 static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
@@ -1103,7 +1123,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 		 * Don't clear RESTART here, someone else could have set it.
 		 * At most this will cost an extra queue run.
 		 */
-		return blk_mq_get_driver_tag(rq);
+		return blk_mq_get_driver_tag(rq, false);
 	}
 
 	wait = &hctx->dispatch_wait;
@@ -1129,7 +1149,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 	 * allocation failure and adding the hardware queue to the wait
 	 * queue.
 	 */
-	ret = blk_mq_get_driver_tag(rq);
+	ret = blk_mq_get_driver_tag(rq, false);
 	if (!ret) {
 		spin_unlock(&hctx->dispatch_wait_lock);
 		spin_unlock_irq(&wq->lock);
@@ -1228,7 +1248,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			break;
 		}
 
-		if (!blk_mq_get_driver_tag(rq)) {
+		if (!blk_mq_get_driver_tag(rq, false)) {
 			/*
 			 * The initial allocation attempt failed, so we need to
 			 * rerun the hardware queue when a tag is freed. The
@@ -1260,7 +1280,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			bd.last = true;
 		else {
 			nxt = list_first_entry(list, struct request, queuelist);
-			bd.last = !blk_mq_get_driver_tag(nxt);
+			bd.last = !blk_mq_get_driver_tag(nxt, false);
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
@@ -1853,7 +1873,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (!blk_mq_get_dispatch_budget(hctx))
 		goto insert;
 
-	if (!blk_mq_get_driver_tag(rq)) {
+	if (!blk_mq_get_driver_tag(rq, true)) {
 		blk_mq_put_dispatch_budget(hctx);
 		goto insert;
 	}
@@ -2261,13 +2281,92 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
-static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+struct count_inflight_data {
+	unsigned count;
+	struct blk_mq_hw_ctx *hctx;
+};
+
+static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
+				     bool reserved)
 {
-	return 0;
+	struct count_inflight_data *count_data = data;
+
+	/*
+	 * Can't check rq's state because it is updated to MQ_RQ_IN_FLIGHT
+	 * in blk_mq_start_request(), at that time we can't prevent this rq
+	 * from being issued.
+	 *
+	 * So check if driver tag is assigned, if yes, count this rq as
+	 * inflight.
+	 */
+	if (rq->tag >= 0 && rq->mq_hctx == count_data->hctx)
+		count_data->count++;
+
+	return true;
+}
+
+static bool blk_mq_inflight_rq(struct request *rq, void *data,
+			       bool reserved)
+{
+	return rq->tag >= 0;
+}
+
+static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	struct count_inflight_data count_data = {
+		.count	= 0,
+		.hctx	= hctx,
+	};
+
+	blk_mq_all_tag_busy_iter(hctx->tags, blk_mq_count_inflight_rq,
+			blk_mq_inflight_rq, &count_data);
+
+	return count_data.count;
+}
+
+static void blk_mq_hctx_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	while (1) {
+		if (!blk_mq_tags_inflight_rqs(hctx))
+			break;
+		msleep(5);
+	}
 }
 
 static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 {
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
+	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu) ||
+	    (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids))
+		return 0;
+
+	/*
+	 * The current CPU is the last one in this hctx, S_INACTIVE
+	 * can be observed in dispatch path without any barrier needed,
+	 * cause both are run on one same CPU.
+	 */
+	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	/*
+	 * Order setting BLK_MQ_S_INACTIVE and checking rq->tag & rqs[tag],
+	 * and its pair is the smp_mb() in blk_mq_get_driver_tag
+	 */
+	smp_mb__after_atomic();
+	blk_mq_hctx_drain_inflight_rqs(hctx);
+	return 0;
+}
+
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (cpumask_test_cpu(cpu, hctx->cpumask))
+		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
 	return 0;
 }
 
@@ -2278,12 +2377,15 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_dead);
 	struct blk_mq_ctx *ctx;
 	LIST_HEAD(tmp);
 	enum hctx_type type;
 
-	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f550b5274b8b..b4812c455807 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -403,6 +403,9 @@ enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is inactive after all its CPUs become offline */
+	BLK_MQ_S_INACTIVE	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
-- 
2.25.2



* [PATCH V8 08/11] block: add blk_end_flush_machinery
  2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (6 preceding siblings ...)
  2020-04-24 10:23 ` [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
@ 2020-04-24 10:23 ` Ming Lei
  2020-04-24 10:41   ` Christoph Hellwig
  2020-04-24 13:47   ` Hannes Reinecke
  2020-04-24 10:23 ` [PATCH V8 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead Ming Lei
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-24 10:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Flush requests aren't the same as normal FS requests:

1) one dedicated per-hctx flush rq is pre-allocated for sending flush requests

2) the flush request is issued to hardware via one machinery so that flush
merging can be applied

We can't simply re-submit flush rqs via blk_steal_bios(), so add
blk_end_flush_machinery() to collect flush requests which need to be
resubmitted (a usage sketch follows this list):

- if one flush command without DATA is enough, send one flush and complete
this kind of requests

- otherwise, add the request into a list and let the caller re-submit it.
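
A minimal sketch of how a caller is expected to use the helper (the real
caller is added in patch 10/11; the list names are illustrative):

    LIST_HEAD(flush_in);   /* RQF_FLUSH_SEQ requests taken off the dead hctx */
    LIST_HEAD(flush_out);

    blk_end_flush_machinery(hctx, &flush_in, &flush_out);

    /*
     * Requests which could be finished with a single flush have already
     * been ended inside the helper; whatever is left comes back on
     * flush_out and is re-submitted like any other request.
     */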

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-flush.c | 123 +++++++++++++++++++++++++++++++++++++++++++---
 block/blk.h       |   4 ++
 2 files changed, 120 insertions(+), 7 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 977edf95d711..745d878697ed 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -170,10 +170,11 @@ static void blk_flush_complete_seq(struct request *rq,
 	unsigned int cmd_flags;
 
 	BUG_ON(rq->flush.seq & seq);
-	rq->flush.seq |= seq;
+	if (!error)
+		rq->flush.seq |= seq;
 	cmd_flags = rq->cmd_flags;
 
-	if (likely(!error))
+	if (likely(!error && !fq->flush_queue_terminating))
 		seq = blk_flush_cur_seq(rq);
 	else
 		seq = REQ_FSEQ_DONE;
@@ -200,9 +201,15 @@ static void blk_flush_complete_seq(struct request *rq,
 		 * normal completion and end it.
 		 */
 		BUG_ON(!list_empty(&rq->queuelist));
-		list_del_init(&rq->flush.list);
-		blk_flush_restore_request(rq);
-		blk_mq_end_request(rq, error);
+
+		/* Terminating code will end the request from flush queue */
+		if (likely(!fq->flush_queue_terminating)) {
+			list_del_init(&rq->flush.list);
+			blk_flush_restore_request(rq);
+			blk_mq_end_request(rq, error);
+		} else {
+			list_move_tail(&rq->flush.list, pending);
+		}
 		break;
 
 	default:
@@ -279,7 +286,8 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	struct request *flush_rq = fq->flush_rq;
 
 	/* C1 described at the top of this file */
-	if (fq->flush_pending_idx != fq->flush_running_idx || list_empty(pending))
+	if (fq->flush_pending_idx != fq->flush_running_idx ||
+			list_empty(pending) || fq->flush_queue_terminating)
 		return;
 
 	/* C2 and C3
@@ -331,7 +339,7 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
 
 	if (q->elevator) {
-		WARN_ON(rq->tag < 0);
+		WARN_ON(rq->tag < 0 && !fq->flush_queue_terminating);
 		blk_mq_put_driver_tag(rq);
 	}
 
@@ -503,3 +511,104 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
 	kfree(fq->flush_rq);
 	kfree(fq);
 }
+
+static void __blk_end_queued_flush(struct blk_flush_queue *fq,
+		unsigned int queue_idx, struct list_head *resubmit_list,
+		struct list_head *flush_list)
+{
+	struct list_head *queue = &fq->flush_queue[queue_idx];
+	struct request *rq, *nxt;
+
+	list_for_each_entry_safe(rq, nxt, queue, flush.list) {
+		unsigned int seq = blk_flush_cur_seq(rq);
+
+		list_del_init(&rq->flush.list);
+		blk_flush_restore_request(rq);
+		if (!blk_rq_sectors(rq) || seq == REQ_FSEQ_POSTFLUSH )
+			list_add_tail(&rq->queuelist, flush_list);
+		else
+			list_add_tail(&rq->queuelist, resubmit_list);
+	}
+}
+
+static void blk_end_queued_flush(struct blk_flush_queue *fq,
+		struct list_head *resubmit_list, struct list_head *flush_list)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+	__blk_end_queued_flush(fq, 0, resubmit_list, flush_list);
+	__blk_end_queued_flush(fq, 1, resubmit_list, flush_list);
+	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+}
+
+/* complete requests which just requires one flush command */
+static void blk_complete_flush_requests(struct blk_flush_queue *fq,
+		struct list_head *flush_list)
+{
+	struct block_device *bdev;
+	struct request *rq;
+	int error = -ENXIO;
+
+	if (list_empty(flush_list))
+		return;
+
+	rq = list_first_entry(flush_list, struct request, queuelist);
+
+	/* Send flush via one active hctx so we can move on */
+	bdev = bdget_disk(rq->rq_disk, 0);
+	if (bdev) {
+		error = blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
+		bdput(bdev);
+	}
+
+	while (!list_empty(flush_list)) {
+		rq = list_first_entry(flush_list, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_end_request(rq, error);
+	}
+}
+
+/*
+ * Called when this hctx is inactive and all CPUs of this hctx is dead,
+ * otherwise don't reuse this function.
+ *
+ * Terminate this hw queue's flush machinery, and try to complete flush
+ * IO requests if possible, such as any flush IO without data, or flush
+ * data IO in POSTFLUSH stage. Otherwise, add the flush IOs into @list
+ * and let caller to re-submit them.
+ */
+void blk_end_flush_machinery(struct blk_mq_hw_ctx *hctx,
+		struct list_head *in, struct list_head *out)
+{
+	LIST_HEAD(resubmit_list);
+	LIST_HEAD(flush_list);
+	struct blk_flush_queue *fq = hctx->fq;
+	struct request *rq, *nxt;
+	unsigned long flags;
+
+	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+	fq->flush_queue_terminating = 1;
+	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+
+	/* End inflight flush requests */
+	list_for_each_entry_safe(rq, nxt, in, queuelist) {
+		WARN_ON(!(rq->rq_flags & RQF_FLUSH_SEQ));
+		list_del_init(&rq->queuelist);
+		rq->end_io(rq, BLK_STS_AGAIN);
+	}
+
+	/* End queued requests */
+	blk_end_queued_flush(fq, &resubmit_list, &flush_list);
+
+	/* Send flush and complete requests which just need one flush req */
+	blk_complete_flush_requests(fq, &flush_list);
+
+	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+	/* reset flush queue so that it is ready to work next time */
+	fq->flush_pending_idx = fq->flush_running_idx = 0;
+	fq->flush_queue_terminating = 0;
+	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+
+	list_splice_init(&resubmit_list, out);
+}
diff --git a/block/blk.h b/block/blk.h
index 446b2893b478..9f2ed3331fd5 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -20,6 +20,7 @@ struct blk_flush_queue {
 	unsigned int		flush_queue_delayed:1;
 	unsigned int		flush_pending_idx:1;
 	unsigned int		flush_running_idx:1;
+	unsigned int		flush_queue_terminating:1;
 	blk_status_t 		rq_status;
 	unsigned long		flush_pending_since;
 	struct list_head	flush_queue[2];
@@ -485,4 +486,7 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		bool *same_page);
 
+void blk_end_flush_machinery(struct blk_mq_hw_ctx *hctx,
+		struct list_head *in, struct list_head *out);
+
 #endif /* BLK_INTERNAL_H */
-- 
2.25.2



* [PATCH V8 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead
  2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (7 preceding siblings ...)
  2020-04-24 10:23 ` [PATCH V8 08/11] block: add blk_end_flush_machinery Ming Lei
@ 2020-04-24 10:23 ` Ming Lei
  2020-04-24 10:42   ` Christoph Hellwig
  2020-04-24 13:48   ` Hannes Reinecke
  2020-04-24 10:23 ` [PATCH V8 10/11] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-24 10:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Add a helper, blk_mq_hctx_handle_dead_cpu(), for handling a dead cpu,
and prepare for handling an inactive hctx.

No functional change.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4d0c271d9f6f..0759e0d606b3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2370,22 +2370,13 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
-/*
- * 'cpu' is going away. splice any existing rq_list entries from this
- * software queue to the hw queue dispatch list, and ensure that it
- * gets run.
- */
-static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
+static void blk_mq_hctx_handle_dead_cpu(struct blk_mq_hw_ctx *hctx,
+		unsigned int cpu)
 {
-	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
-			struct blk_mq_hw_ctx, cpuhp_dead);
 	struct blk_mq_ctx *ctx;
 	LIST_HEAD(tmp);
 	enum hctx_type type;
 
-	if (!cpumask_test_cpu(cpu, hctx->cpumask))
-		return 0;
-
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
@@ -2397,13 +2388,29 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	spin_unlock(&ctx->lock);
 
 	if (list_empty(&tmp))
-		return 0;
+		return;
 
 	spin_lock(&hctx->lock);
 	list_splice_tail_init(&tmp, &hctx->dispatch);
 	spin_unlock(&hctx->lock);
 
 	blk_mq_run_hw_queue(hctx, true);
+}
+
+/*
+ * 'cpu' is going away. splice any existing rq_list entries from this
+ * software queue to the hw queue dispatch list, and ensure that it
+ * gets run.
+ */
+static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_dead);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
+	blk_mq_hctx_handle_dead_cpu(hctx, cpu);
 	return 0;
 }
 
-- 
2.25.2



* [PATCH V8 10/11] blk-mq: re-submit IO in case that hctx is inactive
  2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (8 preceding siblings ...)
  2020-04-24 10:23 ` [PATCH V8 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead Ming Lei
@ 2020-04-24 10:23 ` Ming Lei
  2020-04-24 10:44   ` Christoph Hellwig
  2020-04-24 13:55   ` Hannes Reinecke
  2020-04-24 10:23 ` [PATCH V8 11/11] block: deactivate hctx when the hctx is actually inactive Ming Lei
  2020-04-24 15:23 ` [PATCH V8 00/11] blk-mq: improvement CPU hotplug Jens Axboe
  11 siblings, 2 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-24 10:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

When all CPUs of one hctx are offline and this hctx has become inactive, we
shouldn't run this hw queue for completing requests any more.

So allocate a request from one live hctx, then clone & resubmit the request,
no matter whether it comes from the sw queue or the scheduler queue.
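
The resulting request lifecycle, roughly (an illustrative summary of the
code below, not new code):

    /*
     *  request on the dead hctx           clone on a live hctx
     *  ------------------------           --------------------------------------
     *  bios, pdu and attributes   --->    copied via blk_rq_copy_request()
     *                                     issued via blk_insert_cloned_request()
     *  blk_mq_end_request()      <---     completed in blk_mq_resubmit_end_rq()
     */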

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 98 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0759e0d606b3..a4a26bb23533 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2370,6 +2370,98 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
+static void blk_mq_resubmit_end_rq(struct request *rq, blk_status_t error)
+{
+	struct request *orig_rq = rq->end_io_data;
+
+	blk_mq_cleanup_rq(orig_rq);
+	blk_mq_end_request(orig_rq, error);
+
+	blk_put_request(rq);
+}
+
+static void blk_mq_resubmit_rq(struct request *rq)
+{
+	struct request *nrq;
+	unsigned int flags = 0;
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+	struct blk_mq_tags *tags = rq->q->elevator ? hctx->sched_tags :
+		hctx->tags;
+	bool reserved = blk_mq_tag_is_reserved(tags, rq->internal_tag);
+
+	if (rq->rq_flags & RQF_PREEMPT)
+		flags |= BLK_MQ_REQ_PREEMPT;
+	if (reserved)
+		flags |= BLK_MQ_REQ_RESERVED;
+
+	/* avoid allocation failure by clearing NOWAIT */
+	nrq = blk_get_request(rq->q, rq->cmd_flags & ~REQ_NOWAIT, flags);
+	if (!nrq)
+		return;
+
+	blk_rq_copy_request(nrq, rq);
+
+	nrq->timeout = rq->timeout;
+	nrq->rq_disk = rq->rq_disk;
+	nrq->part = rq->part;
+
+	memcpy(blk_mq_rq_to_pdu(nrq), blk_mq_rq_to_pdu(rq),
+			rq->q->tag_set->cmd_size);
+
+	nrq->end_io = blk_mq_resubmit_end_rq;
+	nrq->end_io_data = rq;
+	nrq->bio = rq->bio;
+	nrq->biotail = rq->biotail;
+
+	if (blk_insert_cloned_request(nrq->q, nrq) != BLK_STS_OK)
+		blk_mq_request_bypass_insert(nrq, false, true);
+}
+
+static void blk_mq_hctx_deactivate(struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(sched);
+	LIST_HEAD(re_submit);
+	LIST_HEAD(flush_in);
+	LIST_HEAD(flush_out);
+	struct request *rq, *nxt;
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (!e) {
+		blk_mq_flush_busy_ctxs(hctx, &re_submit);
+	} else {
+		while ((rq = e->type->ops.dispatch_request(hctx))) {
+			if (rq->mq_hctx != hctx)
+				list_add(&rq->queuelist, &sched);
+			else
+				list_add(&rq->queuelist, &re_submit);
+		}
+	}
+	while (!list_empty(&sched)) {
+		rq = list_first_entry(&sched, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_sched_insert_request(rq, true, true, true);
+	}
+
+	/* requests in dispatch list have to be re-submitted too */
+	spin_lock(&hctx->lock);
+	list_splice_tail_init(&hctx->dispatch, &re_submit);
+	spin_unlock(&hctx->lock);
+
+	/* blk_end_flush_machinery will cover flush request */
+	list_for_each_entry_safe(rq, nxt, &re_submit, queuelist) {
+		if (rq->rq_flags & RQF_FLUSH_SEQ)
+			list_move(&rq->queuelist, &flush_in);
+	}
+	blk_end_flush_machinery(hctx, &flush_in, &flush_out);
+	list_splice_tail(&flush_out, &re_submit);
+
+	while (!list_empty(&re_submit)) {
+		rq = list_first_entry(&re_submit, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_resubmit_rq(rq);
+	}
+}
+
 static void blk_mq_hctx_handle_dead_cpu(struct blk_mq_hw_ctx *hctx,
 		unsigned int cpu)
 {
@@ -2398,9 +2490,8 @@ static void blk_mq_hctx_handle_dead_cpu(struct blk_mq_hw_ctx *hctx,
 }
 
 /*
- * 'cpu' is going away. splice any existing rq_list entries from this
- * software queue to the hw queue dispatch list, and ensure that it
- * gets run.
+ * @cpu has gone away. If this hctx is inactive, we can't dispatch requests
+ * to the hctx any more, so clone and re-submit requests from this hctx.
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
@@ -2410,7 +2501,10 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	if (!cpumask_test_cpu(cpu, hctx->cpumask))
 		return 0;
 
-	blk_mq_hctx_handle_dead_cpu(hctx, cpu);
+	if (test_bit(BLK_MQ_S_INACTIVE, &hctx->state))
+		blk_mq_hctx_deactivate(hctx);
+	else
+		blk_mq_hctx_handle_dead_cpu(hctx, cpu);
 	return 0;
 }
 
-- 
2.25.2


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH V8 11/11] block: deactivate hctx when the hctx is actually inactive
  2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (9 preceding siblings ...)
  2020-04-24 10:23 ` [PATCH V8 10/11] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
@ 2020-04-24 10:23 ` Ming Lei
  2020-04-24 10:43   ` Christoph Hellwig
  2020-04-24 13:56   ` Hannes Reinecke
  2020-04-24 15:23 ` [PATCH V8 00/11] blk-mq: improvement CPU hotplug Jens Axboe
  11 siblings, 2 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-24 10:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Running the hw queue on a dead CPU may still be triggered in some corner
cases, such as when a request is requeued after CPU hotplug has been handled.

So handle this corner case during run queue.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 30 ++++++++++--------------------
 1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a4a26bb23533..68088ff5460c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -43,6 +43,8 @@
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
+static void blk_mq_hctx_deactivate(struct blk_mq_hw_ctx *hctx);
+
 static int blk_mq_poll_stats_bkt(const struct request *rq)
 {
 	int ddir, sectors, bucket;
@@ -1376,28 +1378,16 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	int srcu_idx;
 
 	/*
-	 * We should be running this queue from one of the CPUs that
-	 * are mapped to it.
-	 *
-	 * There are at least two related races now between setting
-	 * hctx->next_cpu from blk_mq_hctx_next_cpu() and running
-	 * __blk_mq_run_hw_queue():
-	 *
-	 * - hctx->next_cpu is found offline in blk_mq_hctx_next_cpu(),
-	 *   but later it becomes online, then this warning is harmless
-	 *   at all
-	 *
-	 * - hctx->next_cpu is found online in blk_mq_hctx_next_cpu(),
-	 *   but later it becomes offline, then the warning can't be
-	 *   triggered, and we depend on blk-mq timeout handler to
-	 *   handle dispatched requests to this hctx
+	 * BLK_MQ_S_INACTIVE may not deal with some requeue corner case:
+	 * one request is requeued after cpu unplug is handled, so check
+	 * if the hctx is actually inactive. If yes, deactivate it and
+	 * re-submit all requests in the queue.
 	 */
 	if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
-		cpu_online(hctx->next_cpu)) {
-		printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
-			raw_smp_processor_id(),
-			cpumask_empty(hctx->cpumask) ? "inactive": "active");
-		dump_stack();
+	    cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) >=
+	    nr_cpu_ids) {
+		blk_mq_hctx_deactivate(hctx);
+		return;
 	}
 
 	/*
-- 
2.25.2


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone
  2020-04-24 10:23 ` [PATCH V8 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone Ming Lei
@ 2020-04-24 10:32   ` Christoph Hellwig
  2020-04-24 12:43   ` Hannes Reinecke
  2020-04-24 16:11   ` Martin K. Petersen
  2 siblings, 0 replies; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-24 10:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	Mike Snitzer, dm-devel

On Fri, Apr 24, 2020 at 06:23:41PM +0800, Ming Lei wrote:
> So far blk_rq_prep_clone() is only used for setting up one underlying cloned
> request from a dm-rq request. Block integrity can be enabled for both dm-rq
> and the underlying queues, so it is reasonable to clone rq's
> nr_integrity_segments. Also write_hint comes from the bio, so it should have
> been cloned too.
> 
> So clone nr_integrity_segments and write_hint in blk_rq_prep_clone.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 02/11] block: add helper for copying request
  2020-04-24 10:23   ` Ming Lei
  (?)
@ 2020-04-24 10:35   ` Christoph Hellwig
  -1 siblings, 0 replies; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-24 10:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	Mike Snitzer, dm-devel

On Fri, Apr 24, 2020 at 06:23:42PM +0800, Ming Lei wrote:
> Add one new helper, blk_rq_copy_request(), to copy a request; the helper
> will be used later in this patchset for re-submitting requests, so make it a
> block layer internal helper.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag
  2020-04-24 10:23 ` [PATCH V8 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
@ 2020-04-24 10:35   ` Christoph Hellwig
  2020-04-24 13:02   ` Hannes Reinecke
  1 sibling, 0 replies; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-24 10:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

On Fri, Apr 24, 2020 at 06:23:44PM +0800, Ming Lei wrote:
> Especially for none elevator, rq->tag is assigned after the request is
> allocated, so there isn't any way to figure out if one request is in
> being dispatched. Also the code path wrt. driver tag becomes a bit
> difference between none and io scheduler.
> 
> When one hctx becomes inactive, we have to prevent any request from
> being dispatched to LLD. And get driver tag provides one perfect chance
> to do that. Meantime we can drain any such requests by checking if
> rq->tag is assigned.
> 
> So only assign rq->tag until blk_mq_get_driver_tag() is called.
> 
> This way also simplifies code of dealing with driver tag a lot.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-24 10:23 ` [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
@ 2020-04-24 10:38   ` Christoph Hellwig
  2020-04-25  3:17     ` Ming Lei
  2020-04-24 13:27   ` Hannes Reinecke
  2020-04-24 13:42   ` John Garry
  2 siblings, 1 reply; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-24 10:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Fri, Apr 24, 2020 at 06:23:47PM +0800, Ming Lei wrote:
> Before one CPU becomes offline, check if it is the last online CPU of hctx.
> If yes, mark this hctx as inactive, meantime wait for completion of all
> in-flight IOs originated from this hctx. Meantime check if this hctx has
> become inactive in blk_mq_get_driver_tag(), if yes, release the
> allocated tag.
> 
> This way guarantees that there isn't any inflight IO before shutting down
> the managed IRQ line when all CPUs of this IRQ line are offline.

Can you take a look at all my comments on the previous version here
(splitting blk_mq_get_driver_tag for direct_issue vs not, what is
the point of barrier(), smp_mb__before_atomic and
smp_mb__after_atomic), as none seems to be addressed and I also didn't
see a reply.
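
To make the first point concrete, here is a rough and untested sketch of
the split I mean; the _direct_issue name is only a placeholder:

static bool blk_mq_get_driver_tag(struct request *rq)
{
	if (rq->tag != -1)
		return true;
	if (!__blk_mq_get_driver_tag(rq))
		return false;

	/*
	 * Tag assignment and the BLK_MQ_S_INACTIVE check run on the same
	 * CPU here, so a compiler barrier is enough.
	 */
	barrier();
	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
		blk_mq_put_driver_tag(rq);
		return false;
	}
	return true;
}

static bool blk_mq_get_driver_tag_direct_issue(struct request *rq)
{
	if (rq->tag != -1)
		return true;
	if (!__blk_mq_get_driver_tag(rq))
		return false;

	/*
	 * Direct issue may run on a CPU outside hctx->cpumask, so order
	 * the tag assignment against reading BLK_MQ_S_INACTIVE with a
	 * full barrier.
	 */
	smp_mb();
	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
		blk_mq_put_driver_tag(rq);
		return false;
	}
	return true;
}

That way the bool argument goes away and each call site picks the barrier
it actually needs.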

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 08/11] block: add blk_end_flush_machinery
  2020-04-24 10:23 ` [PATCH V8 08/11] block: add blk_end_flush_machinery Ming Lei
@ 2020-04-24 10:41   ` Christoph Hellwig
  2020-04-25  3:44     ` Ming Lei
  2020-04-24 13:47   ` Hannes Reinecke
  1 sibling, 1 reply; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-24 10:41 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Fri, Apr 24, 2020 at 06:23:48PM +0800, Ming Lei wrote:
> +/* complete requests which just requires one flush command */
> +static void blk_complete_flush_requests(struct blk_flush_queue *fq,
> +		struct list_head *flush_list)
> +{
> +	struct block_device *bdev;
> +	struct request *rq;
> +	int error = -ENXIO;
> +
> +	if (list_empty(flush_list))
> +		return;
> +
> +	rq = list_first_entry(flush_list, struct request, queuelist);
> +
> +	/* Send flush via one active hctx so we can move on */
> +	bdev = bdget_disk(rq->rq_disk, 0);
> +	if (bdev) {
> +		error = blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
> +		bdput(bdev);
> +	}

FYI, we don't really need the bdev to send a bio anymore, this could just
do:

	bio = bio_alloc(GFP_KERNEL, 0); // XXX: shouldn't this be GFP_NOIO??
	bio->bi_disk = rq->rq_disk;
	bio->bi_partno = 0;
	bio_associate_blkg(bio); // XXX: actually, shouldn't this come from rq?
	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
	error = submit_bio_wait(bio);
	bio_put(bio);


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead
  2020-04-24 10:23 ` [PATCH V8 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead Ming Lei
@ 2020-04-24 10:42   ` Christoph Hellwig
  2020-04-25  3:48     ` Ming Lei
  2020-04-24 13:48   ` Hannes Reinecke
  1 sibling, 1 reply; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-24 10:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

> +static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_dead);
> +
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return 0;
> +
> +	blk_mq_hctx_handle_dead_cpu(hctx, cpu);
>  	return 0;

As commented last time:

why not simply:

	if (cpumask_test_cpu(cpu, hctx->cpumask))
		blk_mq_hctx_handle_dead_cpu(hctx, cpu);
	return 0;

?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 11/11] block: deactivate hctx when the hctx is actually inactive
  2020-04-24 10:23 ` [PATCH V8 11/11] block: deactivate hctx when the hctx is actually inactive Ming Lei
@ 2020-04-24 10:43   ` Christoph Hellwig
  2020-04-24 13:56   ` Hannes Reinecke
  1 sibling, 0 replies; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-24 10:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Fri, Apr 24, 2020 at 06:23:51PM +0800, Ming Lei wrote:
> Running the hw queue on a dead CPU may still be triggered in some corner
> cases, such as when a request is requeued after CPU hotplug has been handled.
> 
> So handle this corner case during run queue.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 10/11] blk-mq: re-submit IO in case that hctx is inactive
  2020-04-24 10:23 ` [PATCH V8 10/11] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
@ 2020-04-24 10:44   ` Christoph Hellwig
  2020-04-25  3:52     ` Ming Lei
  2020-04-24 13:55   ` Hannes Reinecke
  1 sibling, 1 reply; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-24 10:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

> +	/* avoid allocation failure by clearing NOWAIT */
> +	nrq = blk_get_request(rq->q, rq->cmd_flags & ~REQ_NOWAIT, flags);
> +	if (!nrq)
> +		return;
> +
> +	blk_rq_copy_request(nrq, rq);
> +
> +	nrq->timeout = rq->timeout;

Shouldn't the timeout also go into blk_rq_copy_request?
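
I.e. something like the below (untested, and assuming blk_rq_copy_request()
from patch 02 keeps the (rq, rq_src) convention of blk_rq_prep_clone()):

	/* inside blk_rq_copy_request(), next to the other field copies */
	rq->timeout = rq_src->timeout;

then the explicit nrq->timeout assignment in blk_mq_resubmit_rq() can go
away.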

Otherwise this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone
  2020-04-24 10:23 ` [PATCH V8 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone Ming Lei
  2020-04-24 10:32   ` Christoph Hellwig
@ 2020-04-24 12:43   ` Hannes Reinecke
  2020-04-24 16:11   ` Martin K. Petersen
  2 siblings, 0 replies; 81+ messages in thread
From: Hannes Reinecke @ 2020-04-24 12:43 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Mike Snitzer, dm-devel

On 4/24/20 12:23 PM, Ming Lei wrote:
> So far blk_rq_prep_clone() is only used for setting up one underlying cloned
> request from a dm-rq request. Block integrity can be enabled for both dm-rq
> and the underlying queues, so it is reasonable to clone rq's
> nr_integrity_segments. Also write_hint comes from the bio, so it should have
> been cloned too.
> 
> So clone nr_integrity_segments and write_hint in blk_rq_prep_clone.
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: dm-devel@redhat.com
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7e4a1da0715e..91537e526b45 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1636,9 +1636,13 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
>   		rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
>   		rq->special_vec = rq_src->special_vec;
>   	}
> +#ifdef CONFIG_BLK_DEV_INTEGRITY
> +	rq->nr_integrity_segments = rq_src->nr_integrity_segments;
> +#endif
>   	rq->nr_phys_segments = rq_src->nr_phys_segments;
>   	rq->ioprio = rq_src->ioprio;
>   	rq->extra_len = rq_src->extra_len;
> +	rq->write_hint = rq_src->write_hint;
>   
>   	return 0;
>   
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 02/11] block: add helper for copying request
  2020-04-24 10:23   ` Ming Lei
  (?)
  (?)
@ 2020-04-24 12:43   ` Hannes Reinecke
  -1 siblings, 0 replies; 81+ messages in thread
From: Hannes Reinecke @ 2020-04-24 12:43 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Mike Snitzer, dm-devel

On 4/24/20 12:23 PM, Ming Lei wrote:
> Add one new helper, blk_rq_copy_request(), to copy a request; the helper
> will be used later in this patchset for re-submitting requests, so make it a
> block layer internal helper.
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: dm-devel@redhat.com
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c | 33 +++++++++++++++++++--------------
>   block/blk.h      |  2 ++
>   2 files changed, 21 insertions(+), 14 deletions(-)
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 03/11] blk-mq: mark blk_mq_get_driver_tag as static
  2020-04-24 10:23 ` [PATCH V8 03/11] blk-mq: mark blk_mq_get_driver_tag as static Ming Lei
@ 2020-04-24 12:44   ` Hannes Reinecke
  2020-04-24 16:13   ` Martin K. Petersen
  1 sibling, 0 replies; 81+ messages in thread
From: Hannes Reinecke @ 2020-04-24 12:44 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, John Garry

On 4/24/20 12:23 PM, Ming Lei wrote:
> Now all callers of blk_mq_get_driver_tag are in blk-mq.c, so mark
> it as static.
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: John Garry <john.garry@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 2 +-
>   block/blk-mq.h | 1 -
>   2 files changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a7785df2c944..79267f2e8960 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1027,7 +1027,7 @@ static inline unsigned int queued_to_index(unsigned int queued)
>   	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>   }
>   
> -bool blk_mq_get_driver_tag(struct request *rq)
> +static bool blk_mq_get_driver_tag(struct request *rq)
>   {
>   	struct blk_mq_alloc_data data = {
>   		.q = rq->q,
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 10bfdfb494fa..e7d1da4b1f73 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -44,7 +44,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
>   void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
>   				bool kick_requeue_list);
>   void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
> -bool blk_mq_get_driver_tag(struct request *rq);
>   struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
>   					struct blk_mq_ctx *start);
>   
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag
  2020-04-24 10:23 ` [PATCH V8 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
  2020-04-24 10:35   ` Christoph Hellwig
@ 2020-04-24 13:02   ` Hannes Reinecke
  2020-04-25  2:54     ` Ming Lei
  1 sibling, 1 reply; 81+ messages in thread
From: Hannes Reinecke @ 2020-04-24 13:02 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, John Garry

On 4/24/20 12:23 PM, Ming Lei wrote:
> Especially for none elevator, rq->tag is assigned after the request is
> allocated, so there isn't any way to figure out if one request is in
> being dispatched. Also the code path wrt. driver tag becomes a bit
> difference between none and io scheduler.
> 
> When one hctx becomes inactive, we have to prevent any request from
> being dispatched to LLD. And get driver tag provides one perfect chance
> to do that. Meantime we can drain any such requests by checking if
> rq->tag is assigned.
> 

Sorry for being a bit dense, but I'm having a hard time following the 
description.
Maybe this would be a bit clearer:

When one hctx becomes inactive, we do have to prevent any request from
being dispatched to the LLD. If we intercept them in blk_mq_get_tag() we 
can also drain all those requests which have no rq->tag assigned.

(With the nice side effect that if above paragraph is correct I've also 
got it right what the patch is trying to do :-)

> So only assign rq->tag until blk_mq_get_driver_tag() is called.
> 
> This way also simplifies code of dealing with driver tag a lot.
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: John Garry <john.garry@huawei.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-flush.c | 18 ++----------
>   block/blk-mq.c    | 75 ++++++++++++++++++++++++-----------------------
>   block/blk-mq.h    | 21 +++++++------
>   block/blk.h       |  5 ----
>   4 files changed, 51 insertions(+), 68 deletions(-)
> 
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index c7f396e3d5e2..977edf95d711 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -236,13 +236,8 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
>   		error = fq->rq_status;
>   
>   	hctx = flush_rq->mq_hctx;
> -	if (!q->elevator) {
> -		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
> -		flush_rq->tag = -1;
> -	} else {
> -		blk_mq_put_driver_tag(flush_rq);
> -		flush_rq->internal_tag = -1;
> -	}
> +	flush_rq->internal_tag = -1;
> +	blk_mq_put_driver_tag(flush_rq);
>   
>   	running = &fq->flush_queue[fq->flush_running_idx];
>   	BUG_ON(fq->flush_pending_idx == fq->flush_running_idx);
> @@ -317,14 +312,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
>   	flush_rq->mq_ctx = first_rq->mq_ctx;
>   	flush_rq->mq_hctx = first_rq->mq_hctx;
>   
> -	if (!q->elevator) {
> -		fq->orig_rq = first_rq;
> -		flush_rq->tag = first_rq->tag;
> -		blk_mq_tag_set_rq(flush_rq->mq_hctx, first_rq->tag, flush_rq);
> -	} else {
> -		flush_rq->internal_tag = first_rq->internal_tag;
> -	}
> -
> +	flush_rq->internal_tag = first_rq->internal_tag;
>   	flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
>   	flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
>   	flush_rq->rq_flags |= RQF_FLUSH_SEQ;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 79267f2e8960..65f0aaed55ff 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -276,18 +276,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>   	struct request *rq = tags->static_rqs[tag];
>   	req_flags_t rq_flags = 0;
>   
> -	if (data->flags & BLK_MQ_REQ_INTERNAL) {
> -		rq->tag = -1;
> -		rq->internal_tag = tag;
> -	} else {
> -		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
> -			rq_flags = RQF_MQ_INFLIGHT;
> -			atomic_inc(&data->hctx->nr_active);
> -		}
> -		rq->tag = tag;
> -		rq->internal_tag = -1;
> -		data->hctx->tags->rqs[rq->tag] = rq;
> -	}
> +	rq->internal_tag = tag;
> +	rq->tag = -1;
>   
>   	/* csd/requeue_work/fifo_time is initialized before use */
>   	rq->q = data->q;
> @@ -472,14 +462,18 @@ static void __blk_mq_free_request(struct request *rq)
>   	struct request_queue *q = rq->q;
>   	struct blk_mq_ctx *ctx = rq->mq_ctx;
>   	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> -	const int sched_tag = rq->internal_tag;
>   
>   	blk_pm_mark_last_busy(rq);
>   	rq->mq_hctx = NULL;
> -	if (rq->tag != -1)
> -		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> -	if (sched_tag != -1)
> -		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
> +
> +	if (hctx->sched_tags) {
> +		if (rq->tag >= 0)
> +			blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> +		blk_mq_put_tag(hctx->sched_tags, ctx, rq->internal_tag);
> +	} else {
> +		blk_mq_put_tag(hctx->tags, ctx, rq->internal_tag);
> +        }
> +
>   	blk_mq_sched_restart(hctx);
>   	blk_queue_exit(q);
>   }
> @@ -527,7 +521,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
>   		blk_stat_add(rq, now);
>   	}
>   
> -	if (rq->internal_tag != -1)
> +	if (rq->q->elevator && rq->internal_tag != -1)
>   		blk_mq_sched_completed_request(rq, now);
>   
>   	blk_account_io_done(rq, now);

One really does wonder: under which circumstances can 'internal_tag' be
-1 now?
The hunk above seems to imply that 'internal_tag' is now always set;
and this is also the impression I got from reading this patch.
Care to elaborate?

> @@ -1027,33 +1021,40 @@ static inline unsigned int queued_to_index(unsigned int queued)
>   	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>   }
>   
> -static bool blk_mq_get_driver_tag(struct request *rq)
> +static bool __blk_mq_get_driver_tag(struct request *rq)
>   {
>   	struct blk_mq_alloc_data data = {
> -		.q = rq->q,
> -		.hctx = rq->mq_hctx,
> -		.flags = BLK_MQ_REQ_NOWAIT,
> -		.cmd_flags = rq->cmd_flags,
> +		.q		= rq->q,
> +		.hctx		= rq->mq_hctx,
> +		.flags		= BLK_MQ_REQ_NOWAIT,
> +		.cmd_flags	= rq->cmd_flags,
>   	};
> -	bool shared;
>   
> -	if (rq->tag != -1)
> -		return true;
> +	if (data.hctx->sched_tags) {
> +		if (blk_mq_tag_is_reserved(data.hctx->sched_tags,
> +				rq->internal_tag))
> +			data.flags |= BLK_MQ_REQ_RESERVED;
> +		rq->tag = blk_mq_get_tag(&data);
> +	} else {
> +		rq->tag = rq->internal_tag;
> +	}
>   
> -	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
> -		data.flags |= BLK_MQ_REQ_RESERVED;
> +	if (rq->tag == -1)
> +		return false;
>   
> -	shared = blk_mq_tag_busy(data.hctx);
> -	rq->tag = blk_mq_get_tag(&data);
> -	if (rq->tag >= 0) {
> -		if (shared) {
> -			rq->rq_flags |= RQF_MQ_INFLIGHT;
> -			atomic_inc(&data.hctx->nr_active);
> -		}
> -		data.hctx->tags->rqs[rq->tag] = rq;
> +	if (blk_mq_tag_busy(data.hctx)) {
> +		rq->rq_flags |= RQF_MQ_INFLIGHT;
> +		atomic_inc(&data.hctx->nr_active);
>   	}
> +	data.hctx->tags->rqs[rq->tag] = rq;
> +	return true;
> +}
>   
> -	return rq->tag != -1;
> +static bool blk_mq_get_driver_tag(struct request *rq)
> +{
> +	if (rq->tag != -1)
> +		return true;
> +	return __blk_mq_get_driver_tag(rq);
>   }
>   
>   static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index e7d1da4b1f73..d0c72d7d07c8 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -196,26 +196,25 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
>   	return true;
>   }
>   
> -static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
> -					   struct request *rq)
> +static inline void blk_mq_put_driver_tag(struct request *rq)
>   {
> -	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> +	int tag = rq->tag;
> +
> +	if (tag < 0)
> +		return;
> +
>   	rq->tag = -1;
> 
> +	if (hctx->sched_tags)
> +		blk_mq_put_tag(hctx->tags, rq->mq_ctx, tag);
> +
I wonder if you need the local variable 'tag' here; might it not be 
better to set 'rq->tag' to '-1' after the call to put_tag?

But then I'm not sure, either, so it might just be cosmetic ..
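
Just to make that concrete, i.e. something like this (untested, and there
may well be a reason for clearing rq->tag before the put which I'm
missing):

static inline void blk_mq_put_driver_tag(struct request *rq)
{
	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

	if (rq->tag < 0)
		return;

	if (hctx->sched_tags)
		blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
	rq->tag = -1;

	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
		atomic_dec(&hctx->nr_active);
	}
}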

>   	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
>   		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
>   		atomic_dec(&hctx->nr_active);
>   	}
>   }
>   
> -static inline void blk_mq_put_driver_tag(struct request *rq)
> -{
> -	if (rq->tag == -1 || rq->internal_tag == -1)
> -		return;
> -
> -	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
> -}
> -
>   static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
>   {
>   	int cpu;
> diff --git a/block/blk.h b/block/blk.h
> index bbbced0b3c8c..446b2893b478 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -26,11 +26,6 @@ struct blk_flush_queue {
>   	struct list_head	flush_data_in_flight;
>   	struct request		*flush_rq;
>   
> -	/*
> -	 * flush_rq shares tag with this rq, both can't be active
> -	 * at the same time
> -	 */
> -	struct request		*orig_rq;
>   	struct lock_class_key	key;
>   	spinlock_t		mq_flush_lock;
>   };
> 
Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 05/11] blk-mq: support rq filter callback when iterating rqs
  2020-04-24 10:23 ` [PATCH V8 05/11] blk-mq: support rq filter callback when iterating rqs Ming Lei
@ 2020-04-24 13:17   ` Hannes Reinecke
  2020-04-25  3:04     ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Hannes Reinecke @ 2020-04-24 13:17 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 4/24/20 12:23 PM, Ming Lei wrote:
> Now request is thought as in-flight only when its state is updated as
> MQ_RQ_IN_FLIGHT, which is done by dirver via blk_mq_start_request().
> 

driver

> Actually from blk-mq's view, one rq can be thought as in-flight
> after its tag is >= 0.
> 
Well, and that we should clarify to avoid any misunderstanding.
To my understanding, 'in-flight' means requests which have been submitted to
the LLD. I.e. we'll have a lifetime rule like

internal_tag >= tag > in-flight

If the existence of a 'tag' would be equivalent to 'in-flight' we could
do away with all the convoluted code managing the MQ_RQ_IN_FLIGHT state, 
wouldn't we?
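
(If that equivalence really held, the whole check would boil down to
something as trivial as this hypothetical helper -- name made up, and I
doubt it holds for requests sitting between driver-tag allocation and
->queue_rq():

static inline bool blk_mq_rq_has_driver_tag(struct request *rq)
{
	return rq->tag >= 0;
}

but that is exactly the point I'd like to see spelled out.)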

> Passing one rq filter callback so that we can iterating requests very
> flexiable.
> 

flexible

> Meantime blk_mq_all_tag_busy_iter is defined as public, which will be
> called from blk-mq internally.
> 
Maybe:

Implement blk_mq_all_tag_busy_iter() which accepts a 'busy_fn' argument
to filter over which commands to iterate, and make the existing 
blk_mq_tag_busy_iter() a wrapper for the new function.

> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-tag.c | 39 +++++++++++++++++++++++++++------------
>   block/blk-mq-tag.h |  4 ++++
>   2 files changed, 31 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 586c9d6e904a..2e43b827c96d 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -255,6 +255,7 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
>   struct bt_tags_iter_data {
>   	struct blk_mq_tags *tags;
>   	busy_tag_iter_fn *fn;
> +	busy_rq_iter_fn *busy_rq_fn;
>   	void *data;
>   	bool reserved;
>   };
> @@ -274,7 +275,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>   	 * test and set the bit before assining ->rqs[].
>   	 */
>   	rq = tags->rqs[bitnr];
> -	if (rq && blk_mq_request_started(rq))
> +	if (rq && iter_data->busy_rq_fn(rq, iter_data->data, reserved))
>   		return iter_data->fn(rq, iter_data->data, reserved);
>   
>   	return true;
> @@ -294,11 +295,13 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>    *		bitmap_tags member of struct blk_mq_tags.
>    */
>   static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
> -			     busy_tag_iter_fn *fn, void *data, bool reserved)
> +			     busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
> +			     void *data, bool reserved)
>   {
>   	struct bt_tags_iter_data iter_data = {
>   		.tags = tags,
>   		.fn = fn,
> +		.busy_rq_fn = busy_rq_fn,
>   		.data = data,
>   		.reserved = reserved,
>   	};
> @@ -310,19 +313,30 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
>   /**
>    * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
>    * @tags:	Tag map to iterate over.
> - * @fn:		Pointer to the function that will be called for each started
> - *		request. @fn will be called as follows: @fn(rq, @priv,
> - *		reserved) where rq is a pointer to a request. 'reserved'
> - *		indicates whether or not @rq is a reserved request. Return
> - *		true to continue iterating tags, false to stop.
> + * @fn:		Pointer to the function that will be called for each request
> + * 		when .busy_rq_fn(rq) returns true. @fn will be called as
> + * 		follows: @fn(rq, @priv, reserved) where rq is a pointer to a
> + * 		request. 'reserved' indicates whether or not @rq is a reserved
> + * 		request. Return true to continue iterating tags, false to stop.
> + * @busy_rq_fn: Pointer to the function that will be called for each request,
> + * 		@busy_rq_fn's type is same with @fn. Only when @busy_rq_fn(rq,
> + * 		@priv, reserved) returns true, @fn will be called on this rq.
>    * @priv:	Will be passed as second argument to @fn.
>    */
> -static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> -		busy_tag_iter_fn *fn, void *priv)
> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> +		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
> +		void *priv)
>   {
>   	if (tags->nr_reserved_tags)
> -		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
> -	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
> +		bt_tags_for_each(tags, &tags->breserved_tags, fn, busy_rq_fn,
> +				priv, true);
> +	bt_tags_for_each(tags, &tags->bitmap_tags, fn, busy_rq_fn, priv, false);
> +}
> +
> +static bool blk_mq_default_busy_rq(struct request *rq, void *data,
> +		bool reserved)
> +{
> +	return blk_mq_request_started(rq);
>   }
>   
>   /**
> @@ -342,7 +356,8 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>   
>   	for (i = 0; i < tagset->nr_hw_queues; i++) {
>   		if (tagset->tags && tagset->tags[i])
> -			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
> +			blk_mq_all_tag_busy_iter(tagset->tags[i], fn,
> +					blk_mq_default_busy_rq, priv);
>   	}
>   }
>   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 2b8321efb682..fdf095d513e5 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -21,6 +21,7 @@ struct blk_mq_tags {
>   	struct list_head page_list;
>   };
>   
> +typedef bool (busy_rq_iter_fn)(struct request *, void *, bool);
>   
>   extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
>   extern void blk_mq_free_tags(struct blk_mq_tags *tags);
> @@ -34,6 +35,9 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>   extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
>   void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>   		void *priv);
> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> +		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
> +		void *priv);
>   
>   static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
>   						 struct blk_mq_hw_ctx *hctx)
> 
I do worry about the performance impact of this new filter function.
From my understanding, the _busy_iter() functions are supposed to be
efficient, such that they can be used as an alternative to having a 
global atomic counter.
(cf the replacement of the global host_busy counter).

But if we're adding ever more functionality to the iterator itself 
there's a good chance we'll kill the performance rendering this 
assumption invalid.

Have you measured the performance impact of this?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-04-24 10:23 ` [PATCH V8 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
@ 2020-04-24 13:23   ` Hannes Reinecke
  2020-04-25  3:24     ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Hannes Reinecke @ 2020-04-24 13:23 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 4/24/20 12:23 PM, Ming Lei wrote:
> Most blk-mq drivers depend on managed IRQ's auto-affinity to set
> up queue mapping. Thomas mentioned the following point[1]:
> 
> "
>   That was the constraint of managed interrupts from the very beginning:
> 
>    The driver/subsystem has to quiesce the interrupt line and the associated
>    queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>    until it's restarted by the core when the CPU is plugged in again.
> "
> 
> However, current blk-mq implementation doesn't quiesce hw queue before
> the last CPU in the hctx is shutdown. Even worse, CPUHP_BLK_MQ_DEAD is
> one cpuhp state handled after the CPU is down, so there isn't any chance
> to quiesce hctx for blk-mq wrt. CPU hotplug.
> 
> Add new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE for blk-mq to stop queues
> and wait for completion of in-flight requests.
> 
> We will stop hw queue and wait for completion of in-flight requests
> when one hctx is becoming dead in the following patch. This way may
> cause dead-lock for some stacking blk-mq drivers, such as dm-rq and
> loop.
> 
> Add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ and mark it for dm-rq and
> loop, so we needn't wait for completion of in-flight requests from
> dm-rq & loop, then the potential dead-lock can be avoided.
> 
> [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-debugfs.c     |  1 +
>   block/blk-mq.c             | 19 +++++++++++++++++++
>   drivers/block/loop.c       |  2 +-
>   drivers/md/dm-rq.c         |  2 +-
>   include/linux/blk-mq.h     |  3 +++
>   include/linux/cpuhotplug.h |  1 +
>   6 files changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index b3f2ba483992..8e745826eb86 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -239,6 +239,7 @@ static const char *const hctx_flag_name[] = {
>   	HCTX_FLAG_NAME(TAG_SHARED),
>   	HCTX_FLAG_NAME(BLOCKING),
>   	HCTX_FLAG_NAME(NO_SCHED),
> +	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
>   };
>   #undef HCTX_FLAG_NAME
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 65f0aaed55ff..d432cc74ef78 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2261,6 +2261,16 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>   	return -ENOMEM;
>   }
>   
> +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> +{
> +	return 0;
> +}
> +
> +static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
> +{
> +	return 0;
> +}
> +
>   /*
>    * 'cpu' is going away. splice any existing rq_list entries from this
>    * software queue to the hw queue dispatch list, and ensure that it
> @@ -2297,6 +2307,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>   
>   static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
>   {
> +	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
> +		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> +						    &hctx->cpuhp_online);
>   	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
>   					    &hctx->cpuhp_dead);
>   }
> @@ -2356,6 +2369,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
>   {
>   	hctx->queue_num = hctx_idx;
>   
> +	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
> +		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> +				&hctx->cpuhp_online);
>   	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
>   
>   	hctx->tags = set->tags[hctx_idx];
> @@ -3610,6 +3626,9 @@ static int __init blk_mq_init(void)
>   {
>   	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
>   				blk_mq_hctx_notify_dead);
> +	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
> +				blk_mq_hctx_notify_online,
> +				blk_mq_hctx_notify_offline);
>   	return 0;
>   }
>   subsys_initcall(blk_mq_init);
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index da693e6a834e..784f2e038b55 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -2037,7 +2037,7 @@ static int loop_add(struct loop_device **l, int i)
>   	lo->tag_set.queue_depth = 128;
>   	lo->tag_set.numa_node = NUMA_NO_NODE;
>   	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
> -	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> +	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
>   	lo->tag_set.driver_data = lo;
>   
>   	err = blk_mq_alloc_tag_set(&lo->tag_set);
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 3f8577e2c13b..5f1ff70ac029 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
>   	md->tag_set->ops = &dm_mq_ops;
>   	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
>   	md->tag_set->numa_node = md->numa_node_id;
> -	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
> +	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
>   	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
>   	md->tag_set->driver_data = md;
>   
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index b45148ba3291..f550b5274b8b 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
>   	 */
>   	atomic_t		nr_active;
>   
> +	/** @cpuhp_online: List to store request if CPU is going to die */
> +	struct hlist_node	cpuhp_online;
>   	/** @cpuhp_dead: List to store request if some CPU die. */
>   	struct hlist_node	cpuhp_dead;
>   	/** @kobj: Kernel object for sysfs. */
> @@ -391,6 +393,7 @@ struct blk_mq_ops {
>   enum {
>   	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
>   	BLK_MQ_F_TAG_SHARED	= 1 << 1,
> +	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
>   	BLK_MQ_F_BLOCKING	= 1 << 5,
>   	BLK_MQ_F_NO_SCHED	= 1 << 6,
>   	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 77d70b633531..24b3a77810b6 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -152,6 +152,7 @@ enum cpuhp_state {
>   	CPUHP_AP_SMPBOOT_THREADS,
>   	CPUHP_AP_X86_VDSO_VMA_ONLINE,
>   	CPUHP_AP_IRQ_AFFINITY_ONLINE,
> +	CPUHP_AP_BLK_MQ_ONLINE,
>   	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
>   	CPUHP_AP_X86_INTEL_EPB_ONLINE,
>   	CPUHP_AP_PERF_ONLINE,
> 
Ho-hum.

I do agree for the loop and the CPUHP part (not that I'm qualified to
judge the latter, but anyway).
For the dm side I'm less certain.
Thing is, we rarely get hardware interrupts directly to the 
device-mapper device, but rather to the underlying hardware LLD.
I'm even not quite sure what exactly the implications of managed 
interrupts with dm are; after all, we're using softirqs here, don't we?

So for DM I'd rather wait for the I/O on the underlying devices' hctx to 
become quiescent, and not kill them ourselves.
Not sure if the device-mapper framework _can_ do this right now, though.
Mike?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-24 10:23 ` [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
  2020-04-24 10:38   ` Christoph Hellwig
@ 2020-04-24 13:27   ` Hannes Reinecke
  2020-04-25  3:30     ` Ming Lei
  2020-04-24 13:42   ` John Garry
  2 siblings, 1 reply; 81+ messages in thread
From: Hannes Reinecke @ 2020-04-24 13:27 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 4/24/20 12:23 PM, Ming Lei wrote:
> Before one CPU becomes offline, check if it is the last online CPU of hctx.
> If yes, mark this hctx as inactive, meantime wait for completion of all
> in-flight IOs originated from this hctx. Meantime check if this hctx has
> become inactive in blk_mq_get_driver_tag(), if yes, release the
> allocated tag.
> 
> This way guarantees that there isn't any inflight IO before shutting down
> the managed IRQ line when all CPUs of this IRQ line are offline.
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-debugfs.c |   1 +
>   block/blk-mq.c         | 124 +++++++++++++++++++++++++++++++++++++----
>   include/linux/blk-mq.h |   3 +
>   3 files changed, 117 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 8e745826eb86..b62390918ca5 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
>   	HCTX_STATE_NAME(STOPPED),
>   	HCTX_STATE_NAME(TAG_ACTIVE),
>   	HCTX_STATE_NAME(SCHED_RESTART),
> +	HCTX_STATE_NAME(INACTIVE),
>   };
>   #undef HCTX_STATE_NAME
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d432cc74ef78..4d0c271d9f6f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1050,11 +1050,31 @@ static bool __blk_mq_get_driver_tag(struct request *rq)
>   	return true;
>   }
>   
> -static bool blk_mq_get_driver_tag(struct request *rq)
> +static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
>   {
>   	if (rq->tag != -1)
>   		return true;
> -	return __blk_mq_get_driver_tag(rq);
> +
> +	if (!__blk_mq_get_driver_tag(rq))
> +		return false;
> +	/*
> +	 * Add one memory barrier in case that direct issue IO process is
> +	 * migrated to other CPU which may not belong to this hctx, so we can
> +	 * order driver tag assignment and checking BLK_MQ_S_INACTIVE.
> +	 * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
> +	 * and driver tag assignment are run on the same CPU in case that
> +	 * BLK_MQ_S_INACTIVE is set.
> +	 */
> +	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
> +		smp_mb();
> +	else
> +		barrier();
> +
> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
> +		blk_mq_put_driver_tag(rq);
> +		return false;
> +	}
> +	return true;
>   }
>   
>   static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
> @@ -1103,7 +1123,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
>   		 * Don't clear RESTART here, someone else could have set it.
>   		 * At most this will cost an extra queue run.
>   		 */
> -		return blk_mq_get_driver_tag(rq);
> +		return blk_mq_get_driver_tag(rq, false);
>   	}
>   
>   	wait = &hctx->dispatch_wait;
> @@ -1129,7 +1149,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
>   	 * allocation failure and adding the hardware queue to the wait
>   	 * queue.
>   	 */
> -	ret = blk_mq_get_driver_tag(rq);
> +	ret = blk_mq_get_driver_tag(rq, false);
>   	if (!ret) {
>   		spin_unlock(&hctx->dispatch_wait_lock);
>   		spin_unlock_irq(&wq->lock);
> @@ -1228,7 +1248,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>   			break;
>   		}
>   
> -		if (!blk_mq_get_driver_tag(rq)) {
> +		if (!blk_mq_get_driver_tag(rq, false)) {
>   			/*
>   			 * The initial allocation attempt failed, so we need to
>   			 * rerun the hardware queue when a tag is freed. The
> @@ -1260,7 +1280,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>   			bd.last = true;
>   		else {
>   			nxt = list_first_entry(list, struct request, queuelist);
> -			bd.last = !blk_mq_get_driver_tag(nxt);
> +			bd.last = !blk_mq_get_driver_tag(nxt, false);
>   		}
>   
>   		ret = q->mq_ops->queue_rq(hctx, &bd);
> @@ -1853,7 +1873,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   	if (!blk_mq_get_dispatch_budget(hctx))
>   		goto insert;
>   
> -	if (!blk_mq_get_driver_tag(rq)) {
> +	if (!blk_mq_get_driver_tag(rq, true)) {
>   		blk_mq_put_dispatch_budget(hctx);
>   		goto insert;
>   	}
> @@ -2261,13 +2281,92 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>   	return -ENOMEM;
>   }
>   
> -static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> +struct count_inflight_data {
> +	unsigned count;
> +	struct blk_mq_hw_ctx *hctx;
> +};
> +
> +static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
> +				     bool reserved)
>   {
> -	return 0;
> +	struct count_inflight_data *count_data = data;
> +
> +	/*
> +	 * Can't check rq's state because it is updated to MQ_RQ_IN_FLIGHT
> +	 * in blk_mq_start_request(), at that time we can't prevent this rq
> +	 * from being issued.
> +	 *
> +	 * So check if driver tag is assigned, if yes, count this rq as
> +	 * inflight.
> +	 */
> +	if (rq->tag >= 0 && rq->mq_hctx == count_data->hctx)
> +		count_data->count++;
> +
> +	return true;
> +}
> +
> +static bool blk_mq_inflight_rq(struct request *rq, void *data,
> +			       bool reserved)
> +{
> +	return rq->tag >= 0;
> +}
> +
> +static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct count_inflight_data count_data = {
> +		.count	= 0,
> +		.hctx	= hctx,
> +	};
> +
> +	blk_mq_all_tag_busy_iter(hctx->tags, blk_mq_count_inflight_rq,
> +			blk_mq_inflight_rq, &count_data);
> +
> +	return count_data.count;
> +}
> +

Remind me again: Why do we need the 'filter' function here?
Can't we just move the filter function into the main iterator and
stay with the original implementation?
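
To make the question concrete, a rough and untested sketch of what I have
in mind -- keep the counting callback as-is and call the original
single-callback iterator, assuming the blk_mq_request_started() check
inside the existing iterator is acceptable here (which may well be exactly
what this patch is trying to avoid):

static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_hw_ctx *hctx)
{
	struct count_inflight_data count_data = {
		.hctx	= hctx,
	};

	/*
	 * Original three-argument form, without the extra busy_rq_fn;
	 * blk_mq_count_inflight_rq() already checks rq->tag >= 0 and
	 * rq->mq_hctx == hctx itself.
	 */
	blk_mq_all_tag_busy_iter(hctx->tags, blk_mq_count_inflight_rq,
				 &count_data);

	return count_data.count;
}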

> +static void blk_mq_hctx_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
> +{
> +	while (1) {
> +		if (!blk_mq_tags_inflight_rqs(hctx))
> +			break;
> +		msleep(5);
> +	}
>   }
>   
>   static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>   {
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_online);
> +
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return 0;
> +
> +	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu) ||
> +	    (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids))
> +		return 0;
> +
> +	/*
> +	 * The current CPU is the last one in this hctx, S_INACTIVE
> +	 * can be observed in dispatch path without any barrier needed,
> +	 * cause both are run on one same CPU.
> +	 */
> +	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> +	/*
> +	 * Order setting BLK_MQ_S_INACTIVE and checking rq->tag & rqs[tag],
> +	 * and its pair is the smp_mb() in blk_mq_get_driver_tag
> +	 */
> +	smp_mb__after_atomic();
> +	blk_mq_hctx_drain_inflight_rqs(hctx);
> +	return 0;
> +}
> +
> +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_online);
> +
> +	if (cpumask_test_cpu(cpu, hctx->cpumask))
> +		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
>   	return 0;
>   }
>   
> @@ -2278,12 +2377,15 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>    */
>   static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>   {
> -	struct blk_mq_hw_ctx *hctx;
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_dead);
>   	struct blk_mq_ctx *ctx;
>   	LIST_HEAD(tmp);
>   	enum hctx_type type;
>   
> -	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return 0;
> +
>   	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
>   	type = hctx->type;
>   
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index f550b5274b8b..b4812c455807 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -403,6 +403,9 @@ enum {
>   	BLK_MQ_S_TAG_ACTIVE	= 1,
>   	BLK_MQ_S_SCHED_RESTART	= 2,
>   
> +	/* hw queue is inactive after all its CPUs become offline */
> +	BLK_MQ_S_INACTIVE	= 3,
> +
>   	BLK_MQ_MAX_DEPTH	= 10240,
>   
>   	BLK_MQ_CPU_WORK_BATCH	= 8,
> 
Otherwise this looks good, and is exactly what we need.
Thanks for doing the work!

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-24 10:23 ` [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
  2020-04-24 10:38   ` Christoph Hellwig
  2020-04-24 13:27   ` Hannes Reinecke
@ 2020-04-24 13:42   ` John Garry
  2020-04-25  3:41     ` Ming Lei
  2 siblings, 1 reply; 81+ messages in thread
From: John Garry @ 2020-04-24 13:42 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner

On 24/04/2020 11:23, Ming Lei wrote:
>   static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>   {
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_online);
> +
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return 0;
> +
> +	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu) ||
> +	    (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids))
> +		return 0;


nit: personally I prefer what we had previously, as it was easier to 
read, even if it did cause the code to be indented:

	if ((cpumask_next_and(-1, cpumask, online_mask) == cpu) &&
	    (cpumask_next_and(cpu, cpumask, online_mask) >= nr_cpu_ids)) {
		// do deactivate
	}

	return 0	

and it could avoid the cpumask_test_cpu() test, unless you want that as 
an optimisation. If so, a comment could help.
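
For instance (a sketch only, keeping the early cpumask_test_cpu() check as
the optimisation and adding the comment it would then need):

	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return 0;

	/*
	 * Only deactivate when 'cpu' is the last online CPU of this hctx;
	 * the cpumask_test_cpu() check above just bails out early for
	 * unrelated hctxs.
	 */
	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) == cpu) &&
	    (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) >= nr_cpu_ids)) {
		// do deactivate
	}

	return 0;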

cheers,
John

> +
> +	/*
> +	 * The current CPU is the last one in this hctx, S_INACTIVE
> +	 * can be observed in dispatch path without any barrier needed,
> +	 * cause both are run on one same CPU.
> +	 */
> +	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> +	/*
> +	 * Order setting BLK_MQ_S_INACTIVE and checking rq->tag & rqs[tag],
> +	 * and its pair is the smp_mb() in blk_mq_get_driver_tag
> +	 */
> +	smp_mb__after_atomic();
> +	blk_mq_hctx_drain_inflight_rqs(hctx);
> +	return 0;
> +}
> +


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 08/11] block: add blk_end_flush_machinery
  2020-04-24 10:23 ` [PATCH V8 08/11] block: add blk_end_flush_machinery Ming Lei
  2020-04-24 10:41   ` Christoph Hellwig
@ 2020-04-24 13:47   ` Hannes Reinecke
  2020-04-25  3:47     ` Ming Lei
  1 sibling, 1 reply; 81+ messages in thread
From: Hannes Reinecke @ 2020-04-24 13:47 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 4/24/20 12:23 PM, Ming Lei wrote:
> Flush requests aren't the same as normal FS requests:
> 
> 1) one dedicated per-hctx flush rq is pre-allocated for sending flush requests
> 
> 2) the flush request is issued to hardware via one machinery so that flush merge
> can be applied
> 
> We can't simply re-submit flush rqs via blk_steal_bios(), so add
> blk_end_flush_machinery to collect flush requests which need to
> be resubmitted:
> 
> - if one flush command without DATA is enough, send one flush and complete this
> kind of request
> 
> - otherwise, add the request to a list and let the caller re-submit it.
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-flush.c | 123 +++++++++++++++++++++++++++++++++++++++++++---
>   block/blk.h       |   4 ++
>   2 files changed, 120 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 977edf95d711..745d878697ed 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -170,10 +170,11 @@ static void blk_flush_complete_seq(struct request *rq,
>   	unsigned int cmd_flags;
>   
>   	BUG_ON(rq->flush.seq & seq);
> -	rq->flush.seq |= seq;
> +	if (!error)
> +		rq->flush.seq |= seq;
>   	cmd_flags = rq->cmd_flags;
>   
> -	if (likely(!error))
> +	if (likely(!error && !fq->flush_queue_terminating))
>   		seq = blk_flush_cur_seq(rq);
>   	else
>   		seq = REQ_FSEQ_DONE;
> @@ -200,9 +201,15 @@ static void blk_flush_complete_seq(struct request *rq,
>   		 * normal completion and end it.
>   		 */
>   		BUG_ON(!list_empty(&rq->queuelist));
> -		list_del_init(&rq->flush.list);
> -		blk_flush_restore_request(rq);
> -		blk_mq_end_request(rq, error);
> +
> +		/* Terminating code will end the request from flush queue */
> +		if (likely(!fq->flush_queue_terminating)) {
> +			list_del_init(&rq->flush.list);
> +			blk_flush_restore_request(rq);
> +			blk_mq_end_request(rq, error);
> +		} else {
> +			list_move_tail(&rq->flush.list, pending);
> +		}
>   		break;
>   
>   	default:
> @@ -279,7 +286,8 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
>   	struct request *flush_rq = fq->flush_rq;
>   
>   	/* C1 described at the top of this file */
> -	if (fq->flush_pending_idx != fq->flush_running_idx || list_empty(pending))
> +	if (fq->flush_pending_idx != fq->flush_running_idx ||
> +			list_empty(pending) || fq->flush_queue_terminating)
>   		return;
>   
>   	/* C2 and C3
> @@ -331,7 +339,7 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
>   	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
>   
>   	if (q->elevator) {
> -		WARN_ON(rq->tag < 0);
> +		WARN_ON(rq->tag < 0 && !fq->flush_queue_terminating);
>   		blk_mq_put_driver_tag(rq);
>   	}
>   
> @@ -503,3 +511,104 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
>   	kfree(fq->flush_rq);
>   	kfree(fq);
>   }
> +
> +static void __blk_end_queued_flush(struct blk_flush_queue *fq,
> +		unsigned int queue_idx, struct list_head *resubmit_list,
> +		struct list_head *flush_list)
> +{
> +	struct list_head *queue = &fq->flush_queue[queue_idx];
> +	struct request *rq, *nxt;
> +
> +	list_for_each_entry_safe(rq, nxt, queue, flush.list) {
> +		unsigned int seq = blk_flush_cur_seq(rq);
> +
> +		list_del_init(&rq->flush.list);
> +		blk_flush_restore_request(rq);
> +		if (!blk_rq_sectors(rq) || seq == REQ_FSEQ_POSTFLUSH )
> +			list_add_tail(&rq->queuelist, flush_list);
> +		else
> +			list_add_tail(&rq->queuelist, resubmit_list);
> +	}
> +}
> +
> +static void blk_end_queued_flush(struct blk_flush_queue *fq,
> +		struct list_head *resubmit_list, struct list_head *flush_list)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&fq->mq_flush_lock, flags);
> +	__blk_end_queued_flush(fq, 0, resubmit_list, flush_list);
> +	__blk_end_queued_flush(fq, 1, resubmit_list, flush_list);
> +	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
> +}
> +
> +/* complete requests which just requires one flush command */
> +static void blk_complete_flush_requests(struct blk_flush_queue *fq,
> +		struct list_head *flush_list)
> +{
> +	struct block_device *bdev;
> +	struct request *rq;
> +	int error = -ENXIO;
> +
> +	if (list_empty(flush_list))
> +		return;
> +
> +	rq = list_first_entry(flush_list, struct request, queuelist);
> +
> +	/* Send flush via one active hctx so we can move on */
> +	bdev = bdget_disk(rq->rq_disk, 0);
> +	if (bdev) {
> +		error = blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
> +		bdput(bdev);
> +	}
> +
> +	while (!list_empty(flush_list)) {
> +		rq = list_first_entry(flush_list, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		blk_mq_end_request(rq, error);
> +	}
> +}
> +
I must admit I'm getting nervous when one mixes direct request 
manipulation with high-level 'blkdev_XXX' calls.
Can't we just requeue everything, including the flush itself, and let 
the requeue algorithm pick a new hctx?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead
  2020-04-24 10:23 ` [PATCH V8 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead Ming Lei
  2020-04-24 10:42   ` Christoph Hellwig
@ 2020-04-24 13:48   ` Hannes Reinecke
  1 sibling, 0 replies; 81+ messages in thread
From: Hannes Reinecke @ 2020-04-24 13:48 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 4/24/20 12:23 PM, Ming Lei wrote:
> Add a helper, blk_mq_hctx_handle_dead_cpu(), for handling a dead CPU,
> and prepare for handling an inactive hctx.
> 
> No functional change.
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 31 +++++++++++++++++++------------
>   1 file changed, 19 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4d0c271d9f6f..0759e0d606b3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2370,22 +2370,13 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
>   	return 0;
>   }
>   
> -/*
> - * 'cpu' is going away. splice any existing rq_list entries from this
> - * software queue to the hw queue dispatch list, and ensure that it
> - * gets run.
> - */
> -static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> +static void blk_mq_hctx_handle_dead_cpu(struct blk_mq_hw_ctx *hctx,
> +		unsigned int cpu)
>   {
> -	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> -			struct blk_mq_hw_ctx, cpuhp_dead);
>   	struct blk_mq_ctx *ctx;
>   	LIST_HEAD(tmp);
>   	enum hctx_type type;
>   
> -	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> -		return 0;
> -
>   	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
>   	type = hctx->type;
>   
> @@ -2397,13 +2388,29 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>   	spin_unlock(&ctx->lock);
>   
>   	if (list_empty(&tmp))
> -		return 0;
> +		return;
>   
>   	spin_lock(&hctx->lock);
>   	list_splice_tail_init(&tmp, &hctx->dispatch);
>   	spin_unlock(&hctx->lock);
>   
>   	blk_mq_run_hw_queue(hctx, true);
> +}
> +
> +/*
> + * 'cpu' is going away. splice any existing rq_list entries from this
> + * software queue to the hw queue dispatch list, and ensure that it
> + * gets run.
> + */
> +static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_dead);
> +
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return 0;
> +
> +	blk_mq_hctx_handle_dead_cpu(hctx, cpu);
>   	return 0;
>   }
>   
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 10/11] blk-mq: re-submit IO in case that hctx is inactive
  2020-04-24 10:23 ` [PATCH V8 10/11] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
  2020-04-24 10:44   ` Christoph Hellwig
@ 2020-04-24 13:55   ` Hannes Reinecke
  2020-04-25  3:59     ` Ming Lei
  1 sibling, 1 reply; 81+ messages in thread
From: Hannes Reinecke @ 2020-04-24 13:55 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 4/24/20 12:23 PM, Ming Lei wrote:
> When all CPUs in one hctx are offline and this hctx becomes inactive, we
> shouldn't run this hw queue for completing requests any more.
> 
> So allocate a request from one live hctx, and clone & resubmit the request,
> whether it comes from the sw queue or the scheduler queue.
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 98 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0759e0d606b3..a4a26bb23533 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2370,6 +2370,98 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
>   	return 0;
>   }
>   
> +static void blk_mq_resubmit_end_rq(struct request *rq, blk_status_t error)
> +{
> +	struct request *orig_rq = rq->end_io_data;
> +
> +	blk_mq_cleanup_rq(orig_rq);
> +	blk_mq_end_request(orig_rq, error);
> +
> +	blk_put_request(rq);
> +}
> +
> +static void blk_mq_resubmit_rq(struct request *rq)
> +{
> +	struct request *nrq;
> +	unsigned int flags = 0;
> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> +	struct blk_mq_tags *tags = rq->q->elevator ? hctx->sched_tags :
> +		hctx->tags;
> +	bool reserved = blk_mq_tag_is_reserved(tags, rq->internal_tag);
> +
> +	if (rq->rq_flags & RQF_PREEMPT)
> +		flags |= BLK_MQ_REQ_PREEMPT;
> +	if (reserved)
> +		flags |= BLK_MQ_REQ_RESERVED;
> +
> +	/* avoid allocation failure by clearing NOWAIT */
> +	nrq = blk_get_request(rq->q, rq->cmd_flags & ~REQ_NOWAIT, flags);
> +	if (!nrq)
> +		return;
> +

Ah-ha. So what happens if we don't get a request here?

> +	blk_rq_copy_request(nrq, rq);
> +
> +	nrq->timeout = rq->timeout;
> +	nrq->rq_disk = rq->rq_disk;
> +	nrq->part = rq->part;
> +
> +	memcpy(blk_mq_rq_to_pdu(nrq), blk_mq_rq_to_pdu(rq),
> +			rq->q->tag_set->cmd_size);
> +
> +	nrq->end_io = blk_mq_resubmit_end_rq;
> +	nrq->end_io_data = rq;
> +	nrq->bio = rq->bio;
> +	nrq->biotail = rq->biotail;
> +
> +	if (blk_insert_cloned_request(nrq->q, nrq) != BLK_STS_OK)
> +		blk_mq_request_bypass_insert(nrq, false, true);
> +}
> +

Not sure if that is a good idea.
With the above code we would have to allocate an additional
tag per request; if we're running full throttle with all tags active, 
where should they be coming from?

And all the while we _have_ perfectly valid tags; the tag of the 
original request _is_ perfectly valid, and we have made sure that it's 
not inflight (because if it were we would have to wait for it to be 
completed by the hardware anyway).

So why can't we re-use the existing tag here?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 11/11] block: deactivate hctx when the hctx is actually inactive
  2020-04-24 10:23 ` [PATCH V8 11/11] block: deactivate hctx when the hctx is actually inactive Ming Lei
  2020-04-24 10:43   ` Christoph Hellwig
@ 2020-04-24 13:56   ` Hannes Reinecke
  1 sibling, 0 replies; 81+ messages in thread
From: Hannes Reinecke @ 2020-04-24 13:56 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 4/24/20 12:23 PM, Ming Lei wrote:
> Running a queue on a dead CPU may still be triggered in some corner cases,
> such as when a request is requeued after CPU hotplug has been handled.
> 
> So handle this corner case during run queue.
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 30 ++++++++++--------------------
>   1 file changed, 10 insertions(+), 20 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a4a26bb23533..68088ff5460c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -43,6 +43,8 @@
>   static void blk_mq_poll_stats_start(struct request_queue *q);
>   static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
>   
> +static void blk_mq_hctx_deactivate(struct blk_mq_hw_ctx *hctx);
> +
>   static int blk_mq_poll_stats_bkt(const struct request *rq)
>   {
>   	int ddir, sectors, bucket;
> @@ -1376,28 +1378,16 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>   	int srcu_idx;
>   
>   	/*
> -	 * We should be running this queue from one of the CPUs that
> -	 * are mapped to it.
> -	 *
> -	 * There are at least two related races now between setting
> -	 * hctx->next_cpu from blk_mq_hctx_next_cpu() and running
> -	 * __blk_mq_run_hw_queue():
> -	 *
> -	 * - hctx->next_cpu is found offline in blk_mq_hctx_next_cpu(),
> -	 *   but later it becomes online, then this warning is harmless
> -	 *   at all
> -	 *
> -	 * - hctx->next_cpu is found online in blk_mq_hctx_next_cpu(),
> -	 *   but later it becomes offline, then the warning can't be
> -	 *   triggered, and we depend on blk-mq timeout handler to
> -	 *   handle dispatched requests to this hctx
> +	 * BLK_MQ_S_INACTIVE may not deal with some requeue corner case:
> +	 * one request is requeued after cpu unplug is handled, so check
> +	 * if the hctx is actually inactive. If yes, deactive it and
> +	 * re-submit all requests in the queue.
>   	 */
>   	if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
> -		cpu_online(hctx->next_cpu)) {
> -		printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
> -			raw_smp_processor_id(),
> -			cpumask_empty(hctx->cpumask) ? "inactive": "active");
> -		dump_stack();
> +	    cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) >=
> +	    nr_cpu_ids) {
> +		blk_mq_hctx_deactivate(hctx);
> +		return;
>   	}
>   
>   	/*
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 00/11] blk-mq: improvement CPU hotplug
  2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (10 preceding siblings ...)
  2020-04-24 10:23 ` [PATCH V8 11/11] block: deactivate hctx when the hctx is actually inactive Ming Lei
@ 2020-04-24 15:23 ` Jens Axboe
  2020-04-24 15:40   ` Christoph Hellwig
  11 siblings, 1 reply; 81+ messages in thread
From: Jens Axboe @ 2020-04-24 15:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 4/24/20 4:23 AM, Ming Lei wrote:
> Hi,
> 
> Thomas mentioned:
>     "
>      That was the constraint of managed interrupts from the very beginning:
>     
>       The driver/subsystem has to quiesce the interrupt line and the associated
>       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>       until it's restarted by the core when the CPU is plugged in again.
>     "
> 
> But no drivers or blk-mq do that before one hctx becomes inactive(all
> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> 
> This patchset tries to address the issue by two stages:
> 
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> 
> - mark the hctx as internal stopped, and drain all in-flight requests
> if the hctx is going to be dead.
> 
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> 
> - steal bios from the request, and resubmit them via generic_make_request(),
> then these IO will be mapped to other live hctx for dispatch
> 
> Thanks John Garry for running lots of tests on arm64 with this patchset
> and co-working on investigating all kinds of issues.
> 
> Thanks Christoph's review on V7.
> 
> Please comment & review, thanks!

Applied for 5.8 - had to do it manually for the first two patches, as
they conflict with the dma drain removal from core from Christoph.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 00/11] blk-mq: improvement CPU hotplug
  2020-04-24 15:23 ` [PATCH V8 00/11] blk-mq: improvement CPU hotplug Jens Axboe
@ 2020-04-24 15:40   ` Christoph Hellwig
  2020-04-24 15:41     ` Jens Axboe
  0 siblings, 1 reply; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-24 15:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Fri, Apr 24, 2020 at 09:23:45AM -0600, Jens Axboe wrote:
> > Please comment & review, thanks!
> 
> Applied for 5.8 - had to do it manually for the first two patches, as
> they conflict with the dma drain removal from core from Christoph.

I'd really like to see the barrier mess sorted out before this is
merged..

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 00/11] blk-mq: improvement CPU hotplug
  2020-04-24 15:40   ` Christoph Hellwig
@ 2020-04-24 15:41     ` Jens Axboe
  0 siblings, 0 replies; 81+ messages in thread
From: Jens Axboe @ 2020-04-24 15:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner

On 4/24/20 9:40 AM, Christoph Hellwig wrote:
> On Fri, Apr 24, 2020 at 09:23:45AM -0600, Jens Axboe wrote:
>>> Please comment & review, thanks!
>>
>> Applied for 5.8 - had to do it manually for the first two patches, as
>> they conflict with the dma drain removal from core from Christoph.
> 
> I'd really like to see the barrier mess sorted out before this is
> merged..

OK, we can hold off until that's sorted.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone
  2020-04-24 10:23 ` [PATCH V8 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone Ming Lei
  2020-04-24 10:32   ` Christoph Hellwig
  2020-04-24 12:43   ` Hannes Reinecke
@ 2020-04-24 16:11   ` Martin K. Petersen
  2 siblings, 0 replies; 81+ messages in thread
From: Martin K. Petersen @ 2020-04-24 16:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	Mike Snitzer, dm-devel


Ming,

> So far blk_rq_prep_clone() is only used to set up one underlying
> cloned request from a dm-rq request. Block integrity can be enabled for
> both dm-rq and the underlying queues, so it is reasonable to clone
> rq's nr_integrity_segments. Also write_hint comes from the bio, so it
> should have been cloned too.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 02/11] block: add helper for copying request
  2020-04-24 10:23   ` Ming Lei
                     ` (2 preceding siblings ...)
  (?)
@ 2020-04-24 16:12   ` Martin K. Petersen
  -1 siblings, 0 replies; 81+ messages in thread
From: Martin K. Petersen @ 2020-04-24 16:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	Mike Snitzer, dm-devel


Ming,

> Add one new helper, blk_rq_copy_request(), to copy a request; the
> helper will be used in this patchset for re-submitting requests, so make
> it a block-layer internal helper.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 03/11] blk-mq: mark blk_mq_get_driver_tag as static
  2020-04-24 10:23 ` [PATCH V8 03/11] blk-mq: mark blk_mq_get_driver_tag as static Ming Lei
  2020-04-24 12:44   ` Hannes Reinecke
@ 2020-04-24 16:13   ` Martin K. Petersen
  1 sibling, 0 replies; 81+ messages in thread
From: Martin K. Petersen @ 2020-04-24 16:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry


Ming,

> Now all callers of blk_mq_get_driver_tag are in blk-mq.c, so mark it
> as static.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag
  2020-04-24 13:02   ` Hannes Reinecke
@ 2020-04-25  2:54     ` Ming Lei
  2020-04-25 18:26       ` Hannes Reinecke
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-04-25  2:54 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

On Fri, Apr 24, 2020 at 03:02:36PM +0200, Hannes Reinecke wrote:
> On 4/24/20 12:23 PM, Ming Lei wrote:
> > Especially for the none elevator, rq->tag is assigned after the request is
> > allocated, so there isn't any way to figure out if one request is
> > being dispatched. Also the code path wrt. the driver tag becomes a bit
> > different between none and an io scheduler.
> > 
> > When one hctx becomes inactive, we have to prevent any request from
> > being dispatched to the LLD. And getting the driver tag provides one
> > perfect chance to do that. Meantime we can drain any such requests by
> > checking if rq->tag is assigned.
> > 
> 
> Sorry for being a bit dense, but I'm having a hard time following the
> description.
> Maybe this would be a bit clearer:
> 
> When one hctx becomes inactive, we do have to prevent any request from
> being dispatched to the LLD. If we intercept them in blk_mq_get_tag() we can
> also drain all those requests which have no rq->tag assigned.

No, actually what we need to drain are the requests with rq->tag assigned;
if the tag isn't assigned, we can simply prevent the request from being
queued to the LLD after the hctx becomes inactive.

Frankly speaking, the description in the commit log should be clearer and
more accurate.
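
A minimal pseudo-code sketch of the two cases (not from the patch; the real
logic lives in blk_mq_get_driver_tag() and the drain iterator):

	if (test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
		if (rq->tag >= 0) {
			/* driver tag already assigned: the rq may reach (or
			 * already sit in) the LLD, so count it and wait for
			 * its completion (drain) */
		} else {
			/* no driver tag yet: refuse the tag, the rq never
			 * reaches the LLD and gets re-submitted later */
		}
	}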

> 
> (With the nice side effect that if above paragraph is correct I've also got
> it right what the patch is trying to do :-)
> 
> > So only assign rq->tag until blk_mq_get_driver_tag() is called.
> > 
> > This way also simplifies code of dealing with driver tag a lot.
> > 
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: John Garry <john.garry@huawei.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-flush.c | 18 ++----------
> >   block/blk-mq.c    | 75 ++++++++++++++++++++++++-----------------------
> >   block/blk-mq.h    | 21 +++++++------
> >   block/blk.h       |  5 ----
> >   4 files changed, 51 insertions(+), 68 deletions(-)
> > 
> > diff --git a/block/blk-flush.c b/block/blk-flush.c
> > index c7f396e3d5e2..977edf95d711 100644
> > --- a/block/blk-flush.c
> > +++ b/block/blk-flush.c
> > @@ -236,13 +236,8 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
> >   		error = fq->rq_status;
> >   	hctx = flush_rq->mq_hctx;
> > -	if (!q->elevator) {
> > -		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
> > -		flush_rq->tag = -1;
> > -	} else {
> > -		blk_mq_put_driver_tag(flush_rq);
> > -		flush_rq->internal_tag = -1;
> > -	}
> > +	flush_rq->internal_tag = -1;
> > +	blk_mq_put_driver_tag(flush_rq);
> >   	running = &fq->flush_queue[fq->flush_running_idx];
> >   	BUG_ON(fq->flush_pending_idx == fq->flush_running_idx);
> > @@ -317,14 +312,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
> >   	flush_rq->mq_ctx = first_rq->mq_ctx;
> >   	flush_rq->mq_hctx = first_rq->mq_hctx;
> > -	if (!q->elevator) {
> > -		fq->orig_rq = first_rq;
> > -		flush_rq->tag = first_rq->tag;
> > -		blk_mq_tag_set_rq(flush_rq->mq_hctx, first_rq->tag, flush_rq);
> > -	} else {
> > -		flush_rq->internal_tag = first_rq->internal_tag;
> > -	}
> > -
> > +	flush_rq->internal_tag = first_rq->internal_tag;
> >   	flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
> >   	flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
> >   	flush_rq->rq_flags |= RQF_FLUSH_SEQ;
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 79267f2e8960..65f0aaed55ff 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -276,18 +276,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
> >   	struct request *rq = tags->static_rqs[tag];
> >   	req_flags_t rq_flags = 0;
> > -	if (data->flags & BLK_MQ_REQ_INTERNAL) {
> > -		rq->tag = -1;
> > -		rq->internal_tag = tag;
> > -	} else {
> > -		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
> > -			rq_flags = RQF_MQ_INFLIGHT;
> > -			atomic_inc(&data->hctx->nr_active);
> > -		}
> > -		rq->tag = tag;
> > -		rq->internal_tag = -1;
> > -		data->hctx->tags->rqs[rq->tag] = rq;
> > -	}
> > +	rq->internal_tag = tag;
> > +	rq->tag = -1;
> >   	/* csd/requeue_work/fifo_time is initialized before use */
> >   	rq->q = data->q;
> > @@ -472,14 +462,18 @@ static void __blk_mq_free_request(struct request *rq)
> >   	struct request_queue *q = rq->q;
> >   	struct blk_mq_ctx *ctx = rq->mq_ctx;
> >   	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > -	const int sched_tag = rq->internal_tag;
> >   	blk_pm_mark_last_busy(rq);
> >   	rq->mq_hctx = NULL;
> > -	if (rq->tag != -1)
> > -		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> > -	if (sched_tag != -1)
> > -		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
> > +
> > +	if (hctx->sched_tags) {
> > +		if (rq->tag >= 0)
> > +			blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> > +		blk_mq_put_tag(hctx->sched_tags, ctx, rq->internal_tag);
> > +	} else {
> > +		blk_mq_put_tag(hctx->tags, ctx, rq->internal_tag);
> > +        }
> > +
> >   	blk_mq_sched_restart(hctx);
> >   	blk_queue_exit(q);
> >   }
> > @@ -527,7 +521,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
> >   		blk_stat_add(rq, now);
> >   	}
> > -	if (rq->internal_tag != -1)
> > +	if (rq->q->elevator && rq->internal_tag != -1)
> >   		blk_mq_sched_completed_request(rq, now);
> >   	blk_account_io_done(rq, now);
> 
> One really does wonder: under which circumstances can 'internal_tag' be -1
> now?
> The hunk above seems to imply that 'internal_tag' is now always set; and
> this is also the impression I got from reading this patch.
> Care to elaborate?

rq->internal_tag should always be assigned; the only case where it can
become -1 is for the flush rq, and that is done in .end_io(), so we may be
able to avoid the above check on rq->internal_tag.

> 
> > @@ -1027,33 +1021,40 @@ static inline unsigned int queued_to_index(unsigned int queued)
> >   	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
> >   }
> > -static bool blk_mq_get_driver_tag(struct request *rq)
> > +static bool __blk_mq_get_driver_tag(struct request *rq)
> >   {
> >   	struct blk_mq_alloc_data data = {
> > -		.q = rq->q,
> > -		.hctx = rq->mq_hctx,
> > -		.flags = BLK_MQ_REQ_NOWAIT,
> > -		.cmd_flags = rq->cmd_flags,
> > +		.q		= rq->q,
> > +		.hctx		= rq->mq_hctx,
> > +		.flags		= BLK_MQ_REQ_NOWAIT,
> > +		.cmd_flags	= rq->cmd_flags,
> >   	};
> > -	bool shared;
> > -	if (rq->tag != -1)
> > -		return true;
> > +	if (data.hctx->sched_tags) {
> > +		if (blk_mq_tag_is_reserved(data.hctx->sched_tags,
> > +				rq->internal_tag))
> > +			data.flags |= BLK_MQ_REQ_RESERVED;
> > +		rq->tag = blk_mq_get_tag(&data);
> > +	} else {
> > +		rq->tag = rq->internal_tag;
> > +	}
> > -	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
> > -		data.flags |= BLK_MQ_REQ_RESERVED;
> > +	if (rq->tag == -1)
> > +		return false;
> > -	shared = blk_mq_tag_busy(data.hctx);
> > -	rq->tag = blk_mq_get_tag(&data);
> > -	if (rq->tag >= 0) {
> > -		if (shared) {
> > -			rq->rq_flags |= RQF_MQ_INFLIGHT;
> > -			atomic_inc(&data.hctx->nr_active);
> > -		}
> > -		data.hctx->tags->rqs[rq->tag] = rq;
> > +	if (blk_mq_tag_busy(data.hctx)) {
> > +		rq->rq_flags |= RQF_MQ_INFLIGHT;
> > +		atomic_inc(&data.hctx->nr_active);
> >   	}
> > +	data.hctx->tags->rqs[rq->tag] = rq;
> > +	return true;
> > +}
> > -	return rq->tag != -1;
> > +static bool blk_mq_get_driver_tag(struct request *rq)
> > +{
> > +	if (rq->tag != -1)
> > +		return true;
> > +	return __blk_mq_get_driver_tag(rq);
> >   }
> >   static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index e7d1da4b1f73..d0c72d7d07c8 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -196,26 +196,25 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
> >   	return true;
> >   }
> > -static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
> > -					   struct request *rq)
> > +static inline void blk_mq_put_driver_tag(struct request *rq)
> >   {
> > -	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
> > +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > +	int tag = rq->tag;
> > +
> > +	if (tag < 0)
> > +		return;
> > +
> >   	rq->tag = -1;
> >    > +	if (hctx->sched_tags)
> > +		blk_mq_put_tag(hctx->tags, rq->mq_ctx, tag);
> > +
> I wonder if you need the local variable 'tag' here; might it not be better
> to set 'rq->tag' to '-1' after the call to put_tag?

No, we can't touch the request after blk_mq_put_tag() returns.

I remember we have fixed this kind of UAF several times.
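
To spell out the ordering (an annotated sketch mirroring the hunk quoted
above, illustrative only):

	int tag = rq->tag;

	rq->tag = -1;
	/*
	 * Once the tag is returned it can be re-allocated and the request
	 * reused by another context, so rq must not be dereferenced after
	 * blk_mq_put_tag() -- hence the local copy of the tag.
	 */
	if (hctx->sched_tags)
		blk_mq_put_tag(hctx->tags, rq->mq_ctx, tag);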

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 05/11] blk-mq: support rq filter callback when iterating rqs
  2020-04-24 13:17   ` Hannes Reinecke
@ 2020-04-25  3:04     ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-25  3:04 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Fri, Apr 24, 2020 at 03:17:48PM +0200, Hannes Reinecke wrote:
> On 4/24/20 12:23 PM, Ming Lei wrote:
> > Now request is thought as in-flight only when its state is updated as
> > MQ_RQ_IN_FLIGHT, which is done by dirver via blk_mq_start_request().
> > 
> 
> driver
> 
> > Actually from blk-mq's view, one rq can be thought as in-flight
> > after its tag is >= 0.
> > 
> Well, and that is something we should clarify to avoid any misunderstanding.
> To my understanding, 'in-flight' requests are requests which have been
> submitted to the LLD. I.e. we'll have a lifetime rule like
> 
> internal_tag >= tag > in-flight
> 
> If the existence of a 'tag' would be equivalent to 'in-flight' we could
> do away with all the convoluted code managing the MQ_RQ_IN_FLIGHT state,
> wouldn't we?

Yeah, I have been thinking about that.

> 
> > Passing one rq filter callback so that we can iterating requests very
> > flexiable.
> > 
> 
> flexible
> 
> > Meantime blk_mq_all_tag_busy_iter is defined as public, which will be
> > called from blk-mq internally.
> > 
> Maybe:
> 
> Implement blk_mq_all_tag_busy_iter() which accepts a 'busy_fn' argument
> to filter over which commands to iterate, and make the existing
> blk_mq_tag_busy_iter() a wrapper for the new function.

Fine.

> 
> > Cc: John Garry <john.garry@huawei.com>
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq-tag.c | 39 +++++++++++++++++++++++++++------------
> >   block/blk-mq-tag.h |  4 ++++
> >   2 files changed, 31 insertions(+), 12 deletions(-)
> > 
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index 586c9d6e904a..2e43b827c96d 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -255,6 +255,7 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
> >   struct bt_tags_iter_data {
> >   	struct blk_mq_tags *tags;
> >   	busy_tag_iter_fn *fn;
> > +	busy_rq_iter_fn *busy_rq_fn;
> >   	void *data;
> >   	bool reserved;
> >   };
> > @@ -274,7 +275,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >   	 * test and set the bit before assining ->rqs[].
> >   	 */
> >   	rq = tags->rqs[bitnr];
> > -	if (rq && blk_mq_request_started(rq))
> > +	if (rq && iter_data->busy_rq_fn(rq, iter_data->data, reserved))
> >   		return iter_data->fn(rq, iter_data->data, reserved);
> >   	return true;
> > @@ -294,11 +295,13 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >    *		bitmap_tags member of struct blk_mq_tags.
> >    */
> >   static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
> > -			     busy_tag_iter_fn *fn, void *data, bool reserved)
> > +			     busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
> > +			     void *data, bool reserved)
> >   {
> >   	struct bt_tags_iter_data iter_data = {
> >   		.tags = tags,
> >   		.fn = fn,
> > +		.busy_rq_fn = busy_rq_fn,
> >   		.data = data,
> >   		.reserved = reserved,
> >   	};
> > @@ -310,19 +313,30 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
> >   /**
> >    * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
> >    * @tags:	Tag map to iterate over.
> > - * @fn:		Pointer to the function that will be called for each started
> > - *		request. @fn will be called as follows: @fn(rq, @priv,
> > - *		reserved) where rq is a pointer to a request. 'reserved'
> > - *		indicates whether or not @rq is a reserved request. Return
> > - *		true to continue iterating tags, false to stop.
> > + * @fn:		Pointer to the function that will be called for each request
> > + * 		when .busy_rq_fn(rq) returns true. @fn will be called as
> > + * 		follows: @fn(rq, @priv, reserved) where rq is a pointer to a
> > + * 		request. 'reserved' indicates whether or not @rq is a reserved
> > + * 		request. Return true to continue iterating tags, false to stop.
> > + * @busy_rq_fn: Pointer to the function that will be called for each request,
> > + * 		@busy_rq_fn's type is same with @fn. Only when @busy_rq_fn(rq,
> > + * 		@priv, reserved) returns true, @fn will be called on this rq.
> >    * @priv:	Will be passed as second argument to @fn.
> >    */
> > -static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> > -		busy_tag_iter_fn *fn, void *priv)
> > +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> > +		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
> > +		void *priv)
> >   {
> >   	if (tags->nr_reserved_tags)
> > -		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
> > -	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
> > +		bt_tags_for_each(tags, &tags->breserved_tags, fn, busy_rq_fn,
> > +				priv, true);
> > +	bt_tags_for_each(tags, &tags->bitmap_tags, fn, busy_rq_fn, priv, false);
> > +}
> > +
> > +static bool blk_mq_default_busy_rq(struct request *rq, void *data,
> > +		bool reserved)
> > +{
> > +	return blk_mq_request_started(rq);
> >   }
> >   /**
> > @@ -342,7 +356,8 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> >   	for (i = 0; i < tagset->nr_hw_queues; i++) {
> >   		if (tagset->tags && tagset->tags[i])
> > -			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
> > +			blk_mq_all_tag_busy_iter(tagset->tags[i], fn,
> > +					blk_mq_default_busy_rq, priv);
> >   	}
> >   }
> >   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
> > diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> > index 2b8321efb682..fdf095d513e5 100644
> > --- a/block/blk-mq-tag.h
> > +++ b/block/blk-mq-tag.h
> > @@ -21,6 +21,7 @@ struct blk_mq_tags {
> >   	struct list_head page_list;
> >   };
> > +typedef bool (busy_rq_iter_fn)(struct request *, void *, bool);
> >   extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
> >   extern void blk_mq_free_tags(struct blk_mq_tags *tags);
> > @@ -34,6 +35,9 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
> >   extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
> >   void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
> >   		void *priv);
> > +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> > +		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
> > +		void *priv);
> >   static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
> >   						 struct blk_mq_hw_ctx *hctx)
> > 
> I do worry about the performance impact of this new filter function.
> From my understanding, the _busy_iter() functions are supposed to be
> efficient, such that they can be used as an alternative to having a global

No, blk_mq_tagset_busy_iter() won't be called in the fast IO path; usually
it is run in the EH code path.

Also, I don't see how big the performance impact can be, given that all the
patch does is add blk_mq_default_busy_rq() to replace the
check of blk_mq_request_started().

> atomic counter.
> (cf the replacement of the global host_busy counter).
> 
> But if we're adding ever more functionality to the iterator itself, there's a 
> good chance we'll kill the performance, rendering this assumption invalid.
> 
> Have you measured the performance impact of this?

As I mentioned, we don't call such busy_iter() functions in the fast path. Or
do you see such usage in the fast path?


thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-24 10:38   ` Christoph Hellwig
@ 2020-04-25  3:17     ` Ming Lei
  2020-04-25  8:32       ` Christoph Hellwig
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-04-25  3:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner

On Fri, Apr 24, 2020 at 12:38:51PM +0200, Christoph Hellwig wrote:
> On Fri, Apr 24, 2020 at 06:23:47PM +0800, Ming Lei wrote:
> > Before one CPU becomes offline, check if it is the last online CPU of hctx.
> > If yes, mark this hctx as inactive, meantime wait for completion of all
> > in-flight IOs originated from this hctx. Meantime check if this hctx has
> > become inactive in blk_mq_get_driver_tag(), if yes, release the
> > allocated tag.
> > 
> > This guarantees that there isn't any in-flight IO before shutting down
> > the managed IRQ line when all CPUs of this IRQ line are offline.
> 
> Can you take a look at all my comments on the previous version here
> (splitting blk_mq_get_driver_tag for direct_issue vs not, what is

I am not sure it helps to add two helpers, given that only two
parameters are needed and the new parameter is just a constant.

> the point of barrier(), smp_mb__before_atomic and
> smp_mb__after_atomic), as none seems to be addressed and I also didn't
> see a reply.

I believe it has been documented:

+   /*
+    * Add one memory barrier in case that direct issue IO process is
+    * migrated to other CPU which may not belong to this hctx, so we can
+    * order driver tag assignment and checking BLK_MQ_S_INACTIVE.
+    * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
+    * and driver tag assignment are run on the same CPU in case that
+    * BLK_MQ_S_INACTIVE is set.
+    */

OK, I can add more:

In the non-direct-issue case, __blk_mq_delay_run_hw_queue() guarantees
that dispatch is done on the CPUs of this hctx.

In the direct-issue case, the issuing IO process may be migrated to
another CPU which doesn't belong to hctx->cpumask; the chance is quite
small, but it is still possible.

This patch marks the hctx as inactive on the last CPU of the hctx, so barrier()
is enough for the non-direct-issue case. Otherwise, one smp_mb() is added to
order tag assignment (including setting ->rqs[tag]) and checking S_INACTIVE in
blk_mq_get_driver_tag().
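
To make the pairing explicit (a simplified sketch of the two paths in this
patch, not a drop-in):

	/* CPU offline path (blk_mq_hctx_notify_offline) */
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	smp_mb__after_atomic();
	blk_mq_hctx_drain_inflight_rqs(hctx);	/* waits for rqs with tag >= 0 */

	/* issue path (blk_mq_get_driver_tag) */
	__blk_mq_get_driver_tag(rq);		/* assigns rq->tag, sets ->rqs[tag] */
	smp_mb();	/* barrier() is enough when running on a CPU of this hctx */
	if (test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))
		blk_mq_put_driver_tag(rq);	/* back out, rq is re-submitted later */

Either the offline path observes the assigned tag and drains it, or the
issue path observes S_INACTIVE and releases the tag; the pairing makes sure
neither side misses the other.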



Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-04-24 13:23   ` Hannes Reinecke
@ 2020-04-25  3:24     ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-25  3:24 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Fri, Apr 24, 2020 at 03:23:08PM +0200, Hannes Reinecke wrote:
> On 4/24/20 12:23 PM, Ming Lei wrote:
> > Most blk-mq drivers depend on managed IRQs' auto-affinity to set
> > up the queue mapping. Thomas mentioned the following point[1]:
> > 
> > "
> >   That was the constraint of managed interrupts from the very beginning:
> > 
> >    The driver/subsystem has to quiesce the interrupt line and the associated
> >    queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> >    until it's restarted by the core when the CPU is plugged in again.
> > "
> > 
> > However, the current blk-mq implementation doesn't quiesce the hw queue before
> > the last CPU in the hctx is shut down. Even worse, CPUHP_BLK_MQ_DEAD is
> > one cpuhp state handled after the CPU is down, so there isn't any chance
> > to quiesce hctx for blk-mq wrt. CPU hotplug.
> > 
> > Add new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE for blk-mq to stop queues
> > and wait for completion of in-flight requests.
> > 
> > We will stop the hw queue and wait for completion of in-flight requests
> > when one hctx becomes dead in the following patch. This may
> > cause deadlocks for some stacking blk-mq drivers, such as dm-rq and
> > loop.
> > 
> > Add a blk-mq flag, BLK_MQ_F_NO_MANAGED_IRQ, and set it for dm-rq and
> > loop, so we needn't wait for completion of in-flight requests from
> > dm-rq & loop; then the potential deadlock can be avoided.
> > 
> > [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> > 
> > Cc: John Garry <john.garry@huawei.com>
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq-debugfs.c     |  1 +
> >   block/blk-mq.c             | 19 +++++++++++++++++++
> >   drivers/block/loop.c       |  2 +-
> >   drivers/md/dm-rq.c         |  2 +-
> >   include/linux/blk-mq.h     |  3 +++
> >   include/linux/cpuhotplug.h |  1 +
> >   6 files changed, 26 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index b3f2ba483992..8e745826eb86 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -239,6 +239,7 @@ static const char *const hctx_flag_name[] = {
> >   	HCTX_FLAG_NAME(TAG_SHARED),
> >   	HCTX_FLAG_NAME(BLOCKING),
> >   	HCTX_FLAG_NAME(NO_SCHED),
> > +	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
> >   };
> >   #undef HCTX_FLAG_NAME
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 65f0aaed55ff..d432cc74ef78 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2261,6 +2261,16 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> >   	return -ENOMEM;
> >   }
> > +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> > +{
> > +	return 0;
> > +}
> > +
> > +static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
> > +{
> > +	return 0;
> > +}
> > +
> >   /*
> >    * 'cpu' is going away. splice any existing rq_list entries from this
> >    * software queue to the hw queue dispatch list, and ensure that it
> > @@ -2297,6 +2307,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> >   static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
> >   {
> > +	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
> > +		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> > +						    &hctx->cpuhp_online);
> >   	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
> >   					    &hctx->cpuhp_dead);
> >   }
> > @@ -2356,6 +2369,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
> >   {
> >   	hctx->queue_num = hctx_idx;
> > +	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
> > +		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> > +				&hctx->cpuhp_online);
> >   	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
> >   	hctx->tags = set->tags[hctx_idx];
> > @@ -3610,6 +3626,9 @@ static int __init blk_mq_init(void)
> >   {
> >   	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
> >   				blk_mq_hctx_notify_dead);
> > +	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
> > +				blk_mq_hctx_notify_online,
> > +				blk_mq_hctx_notify_offline);
> >   	return 0;
> >   }
> >   subsys_initcall(blk_mq_init);
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index da693e6a834e..784f2e038b55 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -2037,7 +2037,7 @@ static int loop_add(struct loop_device **l, int i)
> >   	lo->tag_set.queue_depth = 128;
> >   	lo->tag_set.numa_node = NUMA_NO_NODE;
> >   	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
> > -	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> > +	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
> >   	lo->tag_set.driver_data = lo;
> >   	err = blk_mq_alloc_tag_set(&lo->tag_set);
> > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > index 3f8577e2c13b..5f1ff70ac029 100644
> > --- a/drivers/md/dm-rq.c
> > +++ b/drivers/md/dm-rq.c
> > @@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
> >   	md->tag_set->ops = &dm_mq_ops;
> >   	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
> >   	md->tag_set->numa_node = md->numa_node_id;
> > -	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
> > +	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
> >   	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
> >   	md->tag_set->driver_data = md;
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index b45148ba3291..f550b5274b8b 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
> >   	 */
> >   	atomic_t		nr_active;
> > +	/** @cpuhp_online: List to store request if CPU is going to die */
> > +	struct hlist_node	cpuhp_online;
> >   	/** @cpuhp_dead: List to store request if some CPU die. */
> >   	struct hlist_node	cpuhp_dead;
> >   	/** @kobj: Kernel object for sysfs. */
> > @@ -391,6 +393,7 @@ struct blk_mq_ops {
> >   enum {
> >   	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
> >   	BLK_MQ_F_TAG_SHARED	= 1 << 1,
> > +	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
> >   	BLK_MQ_F_BLOCKING	= 1 << 5,
> >   	BLK_MQ_F_NO_SCHED	= 1 << 6,
> >   	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> > diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> > index 77d70b633531..24b3a77810b6 100644
> > --- a/include/linux/cpuhotplug.h
> > +++ b/include/linux/cpuhotplug.h
> > @@ -152,6 +152,7 @@ enum cpuhp_state {
> >   	CPUHP_AP_SMPBOOT_THREADS,
> >   	CPUHP_AP_X86_VDSO_VMA_ONLINE,
> >   	CPUHP_AP_IRQ_AFFINITY_ONLINE,
> > +	CPUHP_AP_BLK_MQ_ONLINE,
> >   	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
> >   	CPUHP_AP_X86_INTEL_EPB_ONLINE,
> >   	CPUHP_AP_PERF_ONLINE,
> > 
> Ho-hum.
> 
> I do agree for the loop and the CPUHP part (not that I'm qualified to judge
> the latter, but anyway).
> For the dm side I'm less certain.
> Thing is, we rarely get hardware interrupts directly to the device-mapper
> device, but rather to the underlying hardware LLD.
> I'm not even quite sure what exactly the implications of managed interrupts
> are for dm; after all, we're using softirqs here, don't we?
> 
> So for DM I'd rather wait for the I/O on the underlying devices' hctx to
> quiesce, and not kill them ourselves.
> Not sure if the device-mapper framework _can_ do this right now, though.
> Mike?

The problem the patchset tries to address is drivers that use
managed interrupts. When all CPUs of one managed interrupt line are
offline, the IO interrupt may never trigger, so an IO timeout may
fire, or IO may hang if no timeout handler is provided.

So any driver which doesn't use managed interrupts can be marked with
BLK_MQ_F_NO_MANAGED_IRQ.

For dm-rq, request completion is always triggered by the underlying request,
so once the underlying request is guaranteed to complete, dm-rq's
request can be completed too.
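
For example, a hypothetical stacking driver that never takes a managed IRQ
itself could opt out exactly like the loop and dm-rq hunks above (sketch
only):

	/* no managed IRQ behind this queue: skip the CPU-offline drain */
	set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;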


Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-24 13:27   ` Hannes Reinecke
@ 2020-04-25  3:30     ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-25  3:30 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Fri, Apr 24, 2020 at 03:27:45PM +0200, Hannes Reinecke wrote:
> On 4/24/20 12:23 PM, Ming Lei wrote:
> > Before one CPU becomes offline, check if it is the last online CPU of hctx.
> > If yes, mark this hctx as inactive, meantime wait for completion of all
> > in-flight IOs originated from this hctx. Meantime check if this hctx has
> > become inactive in blk_mq_get_driver_tag(), if yes, release the
> > allocated tag.
> > 
> > This guarantees that there isn't any in-flight IO before shutting down
> > the managed IRQ line when all CPUs of this IRQ line are offline.
> > 
> > Cc: John Garry <john.garry@huawei.com>
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq-debugfs.c |   1 +
> >   block/blk-mq.c         | 124 +++++++++++++++++++++++++++++++++++++----
> >   include/linux/blk-mq.h |   3 +
> >   3 files changed, 117 insertions(+), 11 deletions(-)
> > 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index 8e745826eb86..b62390918ca5 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
> >   	HCTX_STATE_NAME(STOPPED),
> >   	HCTX_STATE_NAME(TAG_ACTIVE),
> >   	HCTX_STATE_NAME(SCHED_RESTART),
> > +	HCTX_STATE_NAME(INACTIVE),
> >   };
> >   #undef HCTX_STATE_NAME
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index d432cc74ef78..4d0c271d9f6f 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1050,11 +1050,31 @@ static bool __blk_mq_get_driver_tag(struct request *rq)
> >   	return true;
> >   }
> > -static bool blk_mq_get_driver_tag(struct request *rq)
> > +static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
> >   {
> >   	if (rq->tag != -1)
> >   		return true;
> > -	return __blk_mq_get_driver_tag(rq);
> > +
> > +	if (!__blk_mq_get_driver_tag(rq))
> > +		return false;
> > +	/*
> > +	 * Add one memory barrier in case that direct issue IO process is
> > +	 * migrated to other CPU which may not belong to this hctx, so we can
> > +	 * order driver tag assignment and checking BLK_MQ_S_INACTIVE.
> > +	 * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
> > +	 * and driver tag assignment are run on the same CPU in case that
> > +	 * BLK_MQ_S_INACTIVE is set.
> > +	 */
> > +	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
> > +		smp_mb();
> > +	else
> > +		barrier();
> > +
> > +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
> > +		blk_mq_put_driver_tag(rq);
> > +		return false;
> > +	}
> > +	return true;
> >   }
> >   static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
> > @@ -1103,7 +1123,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
> >   		 * Don't clear RESTART here, someone else could have set it.
> >   		 * At most this will cost an extra queue run.
> >   		 */
> > -		return blk_mq_get_driver_tag(rq);
> > +		return blk_mq_get_driver_tag(rq, false);
> >   	}
> >   	wait = &hctx->dispatch_wait;
> > @@ -1129,7 +1149,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
> >   	 * allocation failure and adding the hardware queue to the wait
> >   	 * queue.
> >   	 */
> > -	ret = blk_mq_get_driver_tag(rq);
> > +	ret = blk_mq_get_driver_tag(rq, false);
> >   	if (!ret) {
> >   		spin_unlock(&hctx->dispatch_wait_lock);
> >   		spin_unlock_irq(&wq->lock);
> > @@ -1228,7 +1248,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> >   			break;
> >   		}
> > -		if (!blk_mq_get_driver_tag(rq)) {
> > +		if (!blk_mq_get_driver_tag(rq, false)) {
> >   			/*
> >   			 * The initial allocation attempt failed, so we need to
> >   			 * rerun the hardware queue when a tag is freed. The
> > @@ -1260,7 +1280,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> >   			bd.last = true;
> >   		else {
> >   			nxt = list_first_entry(list, struct request, queuelist);
> > -			bd.last = !blk_mq_get_driver_tag(nxt);
> > +			bd.last = !blk_mq_get_driver_tag(nxt, false);
> >   		}
> >   		ret = q->mq_ops->queue_rq(hctx, &bd);
> > @@ -1853,7 +1873,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >   	if (!blk_mq_get_dispatch_budget(hctx))
> >   		goto insert;
> > -	if (!blk_mq_get_driver_tag(rq)) {
> > +	if (!blk_mq_get_driver_tag(rq, true)) {
> >   		blk_mq_put_dispatch_budget(hctx);
> >   		goto insert;
> >   	}
> > @@ -2261,13 +2281,92 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> >   	return -ENOMEM;
> >   }
> > -static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> > +struct count_inflight_data {
> > +	unsigned count;
> > +	struct blk_mq_hw_ctx *hctx;
> > +};
> > +
> > +static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
> > +				     bool reserved)
> >   {
> > -	return 0;
> > +	struct count_inflight_data *count_data = data;
> > +
> > +	/*
> > +	 * Can't check rq's state because it is updated to MQ_RQ_IN_FLIGHT
> > +	 * in blk_mq_start_request(), at that time we can't prevent this rq
> > +	 * from being issued.
> > +	 *
> > +	 * So check if driver tag is assigned, if yes, count this rq as
> > +	 * inflight.
> > +	 */
> > +	if (rq->tag >= 0 && rq->mq_hctx == count_data->hctx)
> > +		count_data->count++;
> > +
> > +	return true;
> > +}
> > +
> > +static bool blk_mq_inflight_rq(struct request *rq, void *data,
> > +			       bool reserved)
> > +{
> > +	return rq->tag >= 0;
> > +}
> > +
> > +static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	struct count_inflight_data count_data = {
> > +		.count	= 0,
> > +		.hctx	= hctx,
> > +	};
> > +
> > +	blk_mq_all_tag_busy_iter(hctx->tags, blk_mq_count_inflight_rq,
> > +			blk_mq_inflight_rq, &count_data);
> > +
> > +	return count_data.count;
> > +}
> > +
> 
> Remind me again: Why do we need the 'filter' function here?
> Can't we just move the filter function into the main iterator and
> stay with the original implementation?

This is a good question, because it took John and me days to figure out
the reason for one IO hang.

The main iterator only iterates over requests for which
blk_mq_request_started() returns true, see bt_tags_iter().

However, we prevent rq from being allocated a driver tag in
blk_mq_get_driver_tag(). There is still some distance between assigning
the driver tag and calling into blk_mq_start_request(), which is called
from .queue_rq().

More importantly, there is nothing we can check in blk_mq_start_request().
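
For reference, the relevant part of bt_tags_iter() is roughly the
following (simplified and from memory, so just a sketch rather than the
exact code): requests which already hold a driver tag but haven't reached
blk_mq_start_request() yet are skipped here.

	rq = tags->rqs[bitnr];
	if (rq && blk_mq_request_started(rq))
		return iter_data->fn(rq, iter_data->data, reserved);
	return true;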


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-24 13:42   ` John Garry
@ 2020-04-25  3:41     ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-25  3:41 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On Fri, Apr 24, 2020 at 02:42:06PM +0100, John Garry wrote:
> On 24/04/2020 11:23, Ming Lei wrote:
> >   static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
> >   {
> > +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> > +			struct blk_mq_hw_ctx, cpuhp_online);
> > +
> > +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> > +		return 0;
> > +
> > +	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu) ||
> > +	    (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids))
> > +		return 0;
> 
> 
> nit: personally I prefer what we had previously, as it was easier to read,
> even if it did cause the code to be indented:
> 
> 	if ((cpumask_next_and(-1, cpumask, online_mask) == cpu) &&
> 	    (cpumask_next_and(cpu, cpumask, online_mask) >= nr_cpu_ids)) {
> 		// do deactivate
> 	}

I will document what the check does, given it can save us one level of
indentation.

> 
> 	return 0	
> 
> and it could avoid the cpumask_test_cpu() test, unless you want that as an
> optimisation. If so, a comment could help.

cpumask_test_cpu() is more readable, and it should be the pattern for doing
this kind of thing in a cpuhp handler, because the handler is called for
any CPU being taken offline.
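
To be concrete, here is roughly the comment I plan to add on top of the
existing check (just a sketch, wording to be polished; the check itself is
unchanged from the patch):

	/*
	 * @cpu is going offline; only drain this hctx when @cpu is mapped
	 * to it and is the last online CPU in hctx->cpumask, otherwise
	 * another online CPU can still service this hw queue.
	 */
	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return 0;

	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu) ||
	    (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids))
		return 0;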


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 08/11] block: add blk_end_flush_machinery
  2020-04-24 10:41   ` Christoph Hellwig
@ 2020-04-25  3:44     ` Ming Lei
  2020-04-25  8:11       ` Christoph Hellwig
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-04-25  3:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner

On Fri, Apr 24, 2020 at 12:41:36PM +0200, Christoph Hellwig wrote:
> On Fri, Apr 24, 2020 at 06:23:48PM +0800, Ming Lei wrote:
> > +/* complete requests which just requires one flush command */
> > +static void blk_complete_flush_requests(struct blk_flush_queue *fq,
> > +		struct list_head *flush_list)
> > +{
> > +	struct block_device *bdev;
> > +	struct request *rq;
> > +	int error = -ENXIO;
> > +
> > +	if (list_empty(flush_list))
> > +		return;
> > +
> > +	rq = list_first_entry(flush_list, struct request, queuelist);
> > +
> > +	/* Send flush via one active hctx so we can move on */
> > +	bdev = bdget_disk(rq->rq_disk, 0);
> > +	if (bdev) {
> > +		error = blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
> > +		bdput(bdev);
> > +	}
> 
> FYI, we don't really need the bdev to send a bio anymore, this could just
> do:
> 
> 	bio = bio_alloc(GFP_KERNEL, 0); // XXX: shouldn't this be GFP_NOIO??

Error handling.

> 	bio->bi_disk = rq->rq_disk;
> 	bio->bi_partno = 0;
> 	bio_associate_blkg(bio); // XXX: actually, shouldn't this come from rq?
> 	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
> 	error = submit_bio_wait(bio);
> 	bio_put(bio);

Yeah, that is another way, however I prefer blkdev_issue_flush() because
it takes less code and we already have the blkdev_issue_flush() API.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 08/11] block: add blk_end_flush_machinery
  2020-04-24 13:47   ` Hannes Reinecke
@ 2020-04-25  3:47     ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-25  3:47 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Fri, Apr 24, 2020 at 03:47:28PM +0200, Hannes Reinecke wrote:
> On 4/24/20 12:23 PM, Ming Lei wrote:
> > Flush requests aren't the same as normal FS requests:
> > 
> > 1) one dedicated per-hctx flush rq is pre-allocated for sending flush request
> > 
> > 2) flush request is issued to hardware via one machinery so that flush merge
> > can be applied
> > 
> > We can't simply re-submit flush rqs via blk_steal_bios(), so add
> > blk_end_flush_machinery to collect flush requests which need to
> > be resubmitted:
> > 
> > - if one flush command without DATA is enough, send one flush, complete this
> > kind of requests
> > 
> > - otherwise, add the request into a list and let caller re-submit it.
> > 
> > Cc: John Garry <john.garry@huawei.com>
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-flush.c | 123 +++++++++++++++++++++++++++++++++++++++++++---
> >   block/blk.h       |   4 ++
> >   2 files changed, 120 insertions(+), 7 deletions(-)
> > 
> > diff --git a/block/blk-flush.c b/block/blk-flush.c
> > index 977edf95d711..745d878697ed 100644
> > --- a/block/blk-flush.c
> > +++ b/block/blk-flush.c
> > @@ -170,10 +170,11 @@ static void blk_flush_complete_seq(struct request *rq,
> >   	unsigned int cmd_flags;
> >   	BUG_ON(rq->flush.seq & seq);
> > -	rq->flush.seq |= seq;
> > +	if (!error)
> > +		rq->flush.seq |= seq;
> >   	cmd_flags = rq->cmd_flags;
> > -	if (likely(!error))
> > +	if (likely(!error && !fq->flush_queue_terminating))
> >   		seq = blk_flush_cur_seq(rq);
> >   	else
> >   		seq = REQ_FSEQ_DONE;
> > @@ -200,9 +201,15 @@ static void blk_flush_complete_seq(struct request *rq,
> >   		 * normal completion and end it.
> >   		 */
> >   		BUG_ON(!list_empty(&rq->queuelist));
> > -		list_del_init(&rq->flush.list);
> > -		blk_flush_restore_request(rq);
> > -		blk_mq_end_request(rq, error);
> > +
> > +		/* Terminating code will end the request from flush queue */
> > +		if (likely(!fq->flush_queue_terminating)) {
> > +			list_del_init(&rq->flush.list);
> > +			blk_flush_restore_request(rq);
> > +			blk_mq_end_request(rq, error);
> > +		} else {
> > +			list_move_tail(&rq->flush.list, pending);
> > +		}
> >   		break;
> >   	default:
> > @@ -279,7 +286,8 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
> >   	struct request *flush_rq = fq->flush_rq;
> >   	/* C1 described at the top of this file */
> > -	if (fq->flush_pending_idx != fq->flush_running_idx || list_empty(pending))
> > +	if (fq->flush_pending_idx != fq->flush_running_idx ||
> > +			list_empty(pending) || fq->flush_queue_terminating)
> >   		return;
> >   	/* C2 and C3
> > @@ -331,7 +339,7 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
> >   	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
> >   	if (q->elevator) {
> > -		WARN_ON(rq->tag < 0);
> > +		WARN_ON(rq->tag < 0 && !fq->flush_queue_terminating);
> >   		blk_mq_put_driver_tag(rq);
> >   	}
> > @@ -503,3 +511,104 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
> >   	kfree(fq->flush_rq);
> >   	kfree(fq);
> >   }
> > +
> > +static void __blk_end_queued_flush(struct blk_flush_queue *fq,
> > +		unsigned int queue_idx, struct list_head *resubmit_list,
> > +		struct list_head *flush_list)
> > +{
> > +	struct list_head *queue = &fq->flush_queue[queue_idx];
> > +	struct request *rq, *nxt;
> > +
> > +	list_for_each_entry_safe(rq, nxt, queue, flush.list) {
> > +		unsigned int seq = blk_flush_cur_seq(rq);
> > +
> > +		list_del_init(&rq->flush.list);
> > +		blk_flush_restore_request(rq);
> > +		if (!blk_rq_sectors(rq) || seq == REQ_FSEQ_POSTFLUSH )
> > +			list_add_tail(&rq->queuelist, flush_list);
> > +		else
> > +			list_add_tail(&rq->queuelist, resubmit_list);
> > +	}
> > +}
> > +
> > +static void blk_end_queued_flush(struct blk_flush_queue *fq,
> > +		struct list_head *resubmit_list, struct list_head *flush_list)
> > +{
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&fq->mq_flush_lock, flags);
> > +	__blk_end_queued_flush(fq, 0, resubmit_list, flush_list);
> > +	__blk_end_queued_flush(fq, 1, resubmit_list, flush_list);
> > +	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
> > +}
> > +
> > +/* complete requests which just requires one flush command */
> > +static void blk_complete_flush_requests(struct blk_flush_queue *fq,
> > +		struct list_head *flush_list)
> > +{
> > +	struct block_device *bdev;
> > +	struct request *rq;
> > +	int error = -ENXIO;
> > +
> > +	if (list_empty(flush_list))
> > +		return;
> > +
> > +	rq = list_first_entry(flush_list, struct request, queuelist);
> > +
> > +	/* Send flush via one active hctx so we can move on */
> > +	bdev = bdget_disk(rq->rq_disk, 0);
> > +	if (bdev) {
> > +		error = blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
> > +		bdput(bdev);
> > +	}
> > +
> > +	while (!list_empty(flush_list)) {
> > +		rq = list_first_entry(flush_list, struct request, queuelist);
> > +		list_del_init(&rq->queuelist);
> > +		blk_mq_end_request(rq, error);
> > +	}
> > +}
> > +
> I must admit I'm getting nervous if one mixes direct request manipulations
> with high-level 'blkdev_XXX' calls.
> Can't we just requeue everything including the flush itself, and let the
> requeue algorithm pick a new hctx?

No, the biggest problem is that rq->mq_hctx has been bound to the hctx
(which is now inactive) since request allocation, so we can't simply
requeue them.

What we need here is just one flush command, and blkdev_issue_flush() does
exactly what we want.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead
  2020-04-24 10:42   ` Christoph Hellwig
@ 2020-04-25  3:48     ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-25  3:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner

On Fri, Apr 24, 2020 at 12:42:38PM +0200, Christoph Hellwig wrote:
> > +static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> > +{
> > +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> > +			struct blk_mq_hw_ctx, cpuhp_dead);
> > +
> > +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> > +		return 0;
> > +
> > +	blk_mq_hctx_handle_dead_cpu(hctx, cpu);
> >  	return 0;
> 
> As commented last time:
> 
> why not simply:
> 
> 	if (cpumask_test_cpu(cpu, hctx->cpumask))
> 		blk_mq_hctx_handle_dead_cpu(hctx, cpu);
> 	return 0;

Hmm, it was just done in the other handler; looks like this one was missed.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 10/11] blk-mq: re-submit IO in case that hctx is inactive
  2020-04-24 10:44   ` Christoph Hellwig
@ 2020-04-25  3:52     ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-25  3:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner

On Fri, Apr 24, 2020 at 12:44:58PM +0200, Christoph Hellwig wrote:
> > +	/* avoid allocation failure by clearing NOWAIT */
> > +	nrq = blk_get_request(rq->q, rq->cmd_flags & ~REQ_NOWAIT, flags);
> > +	if (!nrq)
> > +		return;
> > +
> > +	blk_rq_copy_request(nrq, rq);
> > +
> > +	nrq->timeout = rq->timeout;
> 
> Shouldn't the timeout also go into blk_rq_copy_request?

I'd suggest not doing it, because dm-rq clones requests between
two different queues, and different queues may have different default
timeout values. I guess that is why the dm-rq code doesn't do it either.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 10/11] blk-mq: re-submit IO in case that hctx is inactive
  2020-04-24 13:55   ` Hannes Reinecke
@ 2020-04-25  3:59     ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-25  3:59 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Fri, Apr 24, 2020 at 03:55:35PM +0200, Hannes Reinecke wrote:
> On 4/24/20 12:23 PM, Ming Lei wrote:
> > When all CPUs in one hctx are offline and this hctx becomes inactive, we
> > shouldn't run this hw queue for completing request any more.
> > 
> > So allocate request from one live hctx, and clone & resubmit the request,
> > either it is from sw queue or scheduler queue.
> > 
> > Cc: John Garry <john.garry@huawei.com>
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++--
> >   1 file changed, 98 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 0759e0d606b3..a4a26bb23533 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2370,6 +2370,98 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> >   	return 0;
> >   }
> > +static void blk_mq_resubmit_end_rq(struct request *rq, blk_status_t error)
> > +{
> > +	struct request *orig_rq = rq->end_io_data;
> > +
> > +	blk_mq_cleanup_rq(orig_rq);
> > +	blk_mq_end_request(orig_rq, error);
> > +
> > +	blk_put_request(rq);
> > +}
> > +
> > +static void blk_mq_resubmit_rq(struct request *rq)
> > +{
> > +	struct request *nrq;
> > +	unsigned int flags = 0;
> > +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > +	struct blk_mq_tags *tags = rq->q->elevator ? hctx->sched_tags :
> > +		hctx->tags;
> > +	bool reserved = blk_mq_tag_is_reserved(tags, rq->internal_tag);
> > +
> > +	if (rq->rq_flags & RQF_PREEMPT)
> > +		flags |= BLK_MQ_REQ_PREEMPT;
> > +	if (reserved)
> > +		flags |= BLK_MQ_REQ_RESERVED;
> > +
> > +	/* avoid allocation failure by clearing NOWAIT */
> > +	nrq = blk_get_request(rq->q, rq->cmd_flags & ~REQ_NOWAIT, flags);
> > +	if (!nrq)
> > +		return;
> > +
> 
> Ah-ha. So what happens if we don't get a request here?

So far allocation failure isn't possible once NOWAIT is cleared, because
the two requests belong to different hctxs.

> 
> > +	blk_rq_copy_request(nrq, rq);
> > +
> > +	nrq->timeout = rq->timeout;
> > +	nrq->rq_disk = rq->rq_disk;
> > +	nrq->part = rq->part;
> > +
> > +	memcpy(blk_mq_rq_to_pdu(nrq), blk_mq_rq_to_pdu(rq),
> > +			rq->q->tag_set->cmd_size);
> > +
> > +	nrq->end_io = blk_mq_resubmit_end_rq;
> > +	nrq->end_io_data = rq;
> > +	nrq->bio = rq->bio;
> > +	nrq->biotail = rq->biotail;
> > +
> > +	if (blk_insert_cloned_request(nrq->q, nrq) != BLK_STS_OK)
> > +		blk_mq_request_bypass_insert(nrq, false, true);
> > +}
> > +
> 
> Not sure if that is a good idea.
> With the above code we would have to allocate an additional
> tag per request; if we're running full throttle with all tags active where
> should they be coming from?

The two requests are from different hctxs, and blk-mq doesn't throttle per
request queue. SCSI does, but no requests from this inactive hctx exist in
the LLD.

So there is no such throttling to worry about.

> 
> And all the while we _have_ perfectly valid tags; the tag of the original
> request _is_ perfectly valid, and we have made sure that it's not inflight

No, we are talking about requests in the sw queue and scheduler queue,
which haven't been assigned a driver tag yet.

> (because if it were we would have to wait for to be completed by the
> hardware anyway).
> 
> So why can't we re-use the existing tag here?

No, the tag doesn't exist.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 08/11] block: add blk_end_flush_machinery
  2020-04-25  3:44     ` Ming Lei
@ 2020-04-25  8:11       ` Christoph Hellwig
  2020-04-25  9:51         ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-25  8:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, linux-block, John Garry,
	Bart Van Assche, Hannes Reinecke, Thomas Gleixner

On Sat, Apr 25, 2020 at 11:44:05AM +0800, Ming Lei wrote:
> > FYI, we don't really need the bdev to send a bio anymore, this could just
> > do:
> > 
> > 	bio = bio_alloc(GFP_KERNEL, 0); // XXX: shouldn't this be GFP_NOIO??
> 
> Error handling.

bio_alloc does not fail for allocations that can sleep.

> 
> > 	bio->bi_disk = rq->rq_disk;
> > 	bio->bi_partno = 0;
> > 	bio_associate_blkg(bio); // XXX: actually, shouldn't this come from rq?
> > 	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
> > 	error = submit_bio_wait(bio);
> > 	bio_put(bio);
> 
> Yeah, that is another way, however I prefer to blkdev_issue_flush
> because it takes less code and we do have the API of blkdev_issue_flush.

It is about the same amount of code, and as commented above I think the
current code uses the wrong blkg assignment.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-25  3:17     ` Ming Lei
@ 2020-04-25  8:32       ` Christoph Hellwig
  2020-04-25  9:34         ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-25  8:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, linux-block, John Garry,
	Bart Van Assche, Hannes Reinecke, Thomas Gleixner, will, peterz,
	paulmck

On Sat, Apr 25, 2020 at 11:17:23AM +0800, Ming Lei wrote:
> I am not sure if it helps by adding two helper, given only two
> parameters are needed, and the new parameter is just a constant.
> 
> > the point of barrier(), smp_mb__before_atomic and
> > smp_mb__after_atomic), as none seems to be addressed and I also didn't
> > see a reply.
> 
> I believe it has been documented:
> 
> +   /*
> +    * Add one memory barrier in case that direct issue IO process is
> +    * migrated to other CPU which may not belong to this hctx, so we can
> +    * order driver tag assignment and checking BLK_MQ_S_INACTIVE.
> +    * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
> +    * and driver tag assignment are run on the same CPU in case that
> +    * BLK_MQ_S_INACTIVE is set.
> +    */
> 
> OK, I can add more:
> 
> In case of not direct issue, __blk_mq_delay_run_hw_queue() guarantees
> that dispatch is done on CPUs of this hctx.
> 
> In case of direct issue, the direct issue IO process may be migrated to
> other CPU which doesn't belong to hctx->cpumask even though the chance
> is quite small, but still possible.
> 
> This patch sets hctx as inactive in the last CPU of hctx, so barrier()
> is enough for not direct issue. Otherwise, one smp_mb() is added for
> ordering tag assignment(include setting rq) and checking S_INACTIVE in
> blk_mq_get_driver_tag().

How do you prevent a cpu migration between the call to raw_smp_processor_id
and barrier? 

Also, as far as I can tell, Documentation/core-api/atomic_ops.rst asks
you to use smp_mb__before_atomic and smp_mb__after_atomic for any
ordering with non-updating bitops.  Quote:

--------------------------------- snip ---------------------------------
If explicit memory barriers are required around {set,clear}_bit() (which do
not return a value, and thus does not need to provide memory barrier
semantics), two interfaces are provided::

        void smp_mb__before_atomic(void);
	void smp_mb__after_atomic(void);
--------------------------------- snip ---------------------------------

I really want someone who knows the memory model to look over this scheme,
as it looks dangerous.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-25  8:32       ` Christoph Hellwig
@ 2020-04-25  9:34         ` Ming Lei
  2020-04-25  9:53           ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-04-25  9:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner, will, peterz, paulmck

On Sat, Apr 25, 2020 at 10:32:24AM +0200, Christoph Hellwig wrote:
> On Sat, Apr 25, 2020 at 11:17:23AM +0800, Ming Lei wrote:
> > I am not sure if it helps by adding two helper, given only two
> > parameters are needed, and the new parameter is just a constant.
> > 
> > > the point of barrier(), smp_mb__before_atomic and
> > > smp_mb__after_atomic), as none seems to be addressed and I also didn't
> > > see a reply.
> > 
> > I believe it has been documented:
> > 
> > +   /*
> > +    * Add one memory barrier in case that direct issue IO process is
> > +    * migrated to other CPU which may not belong to this hctx, so we can
> > +    * order driver tag assignment and checking BLK_MQ_S_INACTIVE.
> > +    * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
> > +    * and driver tag assignment are run on the same CPU in case that
> > +    * BLK_MQ_S_INACTIVE is set.
> > +    */
> > 
> > OK, I can add more:
> > 
> > In case of not direct issue, __blk_mq_delay_run_hw_queue() guarantees
> > that dispatch is done on CPUs of this hctx.
> > 
> > In case of direct issue, the direct issue IO process may be migrated to
> > other CPU which doesn't belong to hctx->cpumask even though the chance
> > is quite small, but still possible.
> > 
> > This patch sets hctx as inactive in the last CPU of hctx, so barrier()
> > is enough for not direct issue. Otherwise, one smp_mb() is added for
> > ordering tag assignment(include setting rq) and checking S_INACTIVE in
> > blk_mq_get_driver_tag().
> 
> How do you prevent a cpu migration between the call to raw_smp_processor_id
> and barrier? 

Fine, we may have to use get_cpu()/put_cpu() for direct issue to cover
the check. For non-direct issue, either preemption is disabled or the work
is run on the specified CPU.
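
Something like this is what I mean (only a sketch, and it keeps the
'direct_issue' flag from the patch):

	if (direct_issue) {
		/*
		 * get_cpu() disables preemption, so there is no migration
		 * window between reading the CPU id and the barrier.
		 */
		if (rq->mq_ctx->cpu != get_cpu())
			smp_mb();
		put_cpu();
	} else {
		barrier();
	}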

> 
> Also as far as I can tell Documentation/core-api/atomic_ops.rst ask
> you to use smp_mb__before_atomic and smp_mb__after_atomic for any
> ordering with non-updating bitops.  Quote:
> 
> --------------------------------- snip ---------------------------------
> If explicit memory barriers are required around {set,clear}_bit() (which do
> not return a value, and thus does not need to provide memory barrier
> semantics), two interfaces are provided::
> 
>         void smp_mb__before_atomic(void);
> 	void smp_mb__after_atomic(void);
> --------------------------------- snip ---------------------------------
> 
> I really want someone who knows the memory model to look over this scheme,
> as it looks dangerous.

smp_mb() is enough; the _[before|after]_atomic versions might be a
little lightweight on some ARCHs, I guess. Given that smp_mb() or its
variant is only needed in the very unlikely case of slow path, it is
fine to just use smp_mb(), IMO.
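
To make the pairing explicit (sketch only, using the names from this
series):

	/* cpuhp offline handler, runs on the last online CPU of this hctx */
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	smp_mb__after_atomic();		/* order set_bit vs. reading rq->tag */
	while (blk_mq_tags_inflight_rqs(hctx))
		msleep(5);

	/* dispatch side, in blk_mq_get_driver_tag() */
	rq->tag = blk_mq_get_tag(&data);	/* assign driver tag */
	smp_mb();			/* order tag store vs. test_bit below */
	if (test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)) {
		blk_mq_put_driver_tag(rq);
		return false;
	}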


Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 08/11] block: add blk_end_flush_machinery
  2020-04-25  8:11       ` Christoph Hellwig
@ 2020-04-25  9:51         ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-25  9:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner

On Sat, Apr 25, 2020 at 10:11:25AM +0200, Christoph Hellwig wrote:
> On Sat, Apr 25, 2020 at 11:44:05AM +0800, Ming Lei wrote:
> > > FYI, we don't really need the bdev to send a bio anymore, this could just
> > > do:
> > > 
> > > 	bio = bio_alloc(GFP_KERNEL, 0); // XXX: shouldn't this be GFP_NOIO??
> > 
> > Error handling.
> 
> bio_alloc does not fail for allocations that can sleep.
> 
> > 
> > > 	bio->bi_disk = rq->rq_disk;
> > > 	bio->bi_partno = 0;
> > > 	bio_associate_blkg(bio); // XXX: actually, shouldn't this come from rq?
> > > 	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
> > > 	error = submit_bio_wait(bio);
> > > 	bio_put(bio);
> > 
> > Yeah, that is another way, however I prefer to blkdev_issue_flush
> > because it takes less code and we do have the API of blkdev_issue_flush.
> 
> It is about the same amount of code, and as commented above I think the
> current code uses the wrong blkg assignment.

There can be lots of flush requests which rely on this single flush
emulation, and then there can be multiple associated blkgs, so I am not
sure we can select a correct one. But this single flush emulation is only
triggered in a very unlikely case, so does it really matter to select one
'correct' blkcg for this single flush request without DATA?

BTW, block/blk-flush.c does aggressive flush merging via the built-in
per-hctx flush request and the double queues, so blkcg can't account
actual flush requests accurately anyway.

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-25  9:34         ` Ming Lei
@ 2020-04-25  9:53           ` Ming Lei
  2020-04-25 15:48             ` Christoph Hellwig
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-04-25  9:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner, will, peterz, paulmck

On Sat, Apr 25, 2020 at 05:34:37PM +0800, Ming Lei wrote:
> On Sat, Apr 25, 2020 at 10:32:24AM +0200, Christoph Hellwig wrote:
> > On Sat, Apr 25, 2020 at 11:17:23AM +0800, Ming Lei wrote:
> > > I am not sure if it helps by adding two helper, given only two
> > > parameters are needed, and the new parameter is just a constant.
> > > 
> > > > the point of barrier(), smp_mb__before_atomic and
> > > > smp_mb__after_atomic), as none seems to be addressed and I also didn't
> > > > see a reply.
> > > 
> > > I believe it has been documented:
> > > 
> > > +   /*
> > > +    * Add one memory barrier in case that direct issue IO process is
> > > +    * migrated to other CPU which may not belong to this hctx, so we can
> > > +    * order driver tag assignment and checking BLK_MQ_S_INACTIVE.
> > > +    * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
> > > +    * and driver tag assignment are run on the same CPU in case that
> > > +    * BLK_MQ_S_INACTIVE is set.
> > > +    */
> > > 
> > > OK, I can add more:
> > > 
> > > In case of not direct issue, __blk_mq_delay_run_hw_queue() guarantees
> > > that dispatch is done on CPUs of this hctx.
> > > 
> > > In case of direct issue, the direct issue IO process may be migrated to
> > > other CPU which doesn't belong to hctx->cpumask even though the chance
> > > is quite small, but still possible.
> > > 
> > > This patch sets hctx as inactive in the last CPU of hctx, so barrier()
> > > is enough for not direct issue. Otherwise, one smp_mb() is added for
> > > ordering tag assignment(include setting rq) and checking S_INACTIVE in
> > > blk_mq_get_driver_tag().
> > 
> > How do you prevent a cpu migration between the call to raw_smp_processor_id
> > and barrier? 
> 
> Fine, we may have to use get_cpu()/put_cpu() for direct issue to cover
> the check. For non-direct issue, either preempt is disabled or the work is
> run on specified CPU.
> 
> > 
> > Also as far as I can tell Documentation/core-api/atomic_ops.rst ask
> > you to use smp_mb__before_atomic and smp_mb__after_atomic for any
> > ordering with non-updating bitops.  Quote:
> > 
> > --------------------------------- snip ---------------------------------
> > If explicit memory barriers are required around {set,clear}_bit() (which do
> > not return a value, and thus does not need to provide memory barrier
> > semantics), two interfaces are provided::
> > 
> >         void smp_mb__before_atomic(void);
> > 	void smp_mb__after_atomic(void);
> > --------------------------------- snip ---------------------------------
> > 
> > I really want someone who knows the memory model to look over this scheme,
> > as it looks dangerous.
> 
> smp_mb() is enough, the version of _[before|after]_atomic might be
> a little lightweight in some ARCH I guess. Given smp_mb() or its variant
> is only needed in very unlikely case of slow path, it is fine to just

s/of/or


Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-25  9:53           ` Ming Lei
@ 2020-04-25 15:48             ` Christoph Hellwig
  2020-04-26  2:06               ` Ming Lei
                                 ` (2 more replies)
  0 siblings, 3 replies; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-25 15:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, linux-block, John Garry,
	Bart Van Assche, Hannes Reinecke, Thomas Gleixner, will, peterz,
	paulmck

FYI, here is what I think we should be doing (but the memory model
experts please correct me):

 - just drop the direct_issue flag and check for the CPU, which is
   cheap enough
 - replace the raw_smp_processor_id with a get_cpu to make sure we
   don't hit the tiny migration windows
 - a bunch of random cleanups to make the code easier to read, mostly
   by being more self-documenting or improving the comments.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bfa4020256ae9..da749865f6eed 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1049,28 +1049,16 @@ static bool __blk_mq_get_driver_tag(struct request *rq)
 		atomic_inc(&data.hctx->nr_active);
 	}
 	data.hctx->tags->rqs[rq->tag] = rq;
-	return true;
-}
-
-static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
-{
-	if (rq->tag != -1)
-		return true;
 
-	if (!__blk_mq_get_driver_tag(rq))
-		return false;
 	/*
-	 * Add one memory barrier in case that direct issue IO process is
-	 * migrated to other CPU which may not belong to this hctx, so we can
-	 * order driver tag assignment and checking BLK_MQ_S_INACTIVE.
-	 * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
-	 * and driver tag assignment are run on the same CPU in case that
-	 * BLK_MQ_S_INACTIVE is set.
+	 * Ensure updates to rq->tag and tags->rqs[] are seen by
+	 * blk_mq_tags_inflight_rqs.  This pairs with the smp_mb__after_atomic
+	 * in blk_mq_hctx_notify_offline.  This only matters in case a process
+	 * gets migrated to another CPU that is not mapped to this hctx.
 	 */
-	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
+	if (rq->mq_ctx->cpu != get_cpu())
 		smp_mb();
-	else
-		barrier();
+	put_cpu();
 
 	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
 		blk_mq_put_driver_tag(rq);
@@ -1079,6 +1067,13 @@ static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
 	return true;
 }
 
+static bool blk_mq_get_driver_tag(struct request *rq)
+{
+	if (rq->tag != -1)
+		return true;
+	return __blk_mq_get_driver_tag(rq);
+}
+
 static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
 				int flags, void *key)
 {
@@ -1125,7 +1120,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 		 * Don't clear RESTART here, someone else could have set it.
 		 * At most this will cost an extra queue run.
 		 */
-		return blk_mq_get_driver_tag(rq, false);
+		return blk_mq_get_driver_tag(rq);
 	}
 
 	wait = &hctx->dispatch_wait;
@@ -1151,7 +1146,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 	 * allocation failure and adding the hardware queue to the wait
 	 * queue.
 	 */
-	ret = blk_mq_get_driver_tag(rq, false);
+	ret = blk_mq_get_driver_tag(rq);
 	if (!ret) {
 		spin_unlock(&hctx->dispatch_wait_lock);
 		spin_unlock_irq(&wq->lock);
@@ -1252,7 +1247,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			break;
 		}
 
-		if (!blk_mq_get_driver_tag(rq, false)) {
+		if (!blk_mq_get_driver_tag(rq)) {
 			/*
 			 * The initial allocation attempt failed, so we need to
 			 * rerun the hardware queue when a tag is freed. The
@@ -1284,7 +1279,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			bd.last = true;
 		else {
 			nxt = list_first_entry(list, struct request, queuelist);
-			bd.last = !blk_mq_get_driver_tag(nxt, false);
+			bd.last = !blk_mq_get_driver_tag(nxt);
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
@@ -1886,7 +1881,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (!blk_mq_get_dispatch_budget(hctx))
 		goto insert;
 
-	if (!blk_mq_get_driver_tag(rq, true)) {
+	if (!blk_mq_get_driver_tag(rq)) {
 		blk_mq_put_dispatch_budget(hctx);
 		goto insert;
 	}
@@ -2327,23 +2322,24 @@ static bool blk_mq_inflight_rq(struct request *rq, void *data,
 static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_hw_ctx *hctx)
 {
 	struct count_inflight_data count_data = {
-		.count	= 0,
 		.hctx	= hctx,
 	};
 
 	blk_mq_all_tag_busy_iter(hctx->tags, blk_mq_count_inflight_rq,
 			blk_mq_inflight_rq, &count_data);
-
 	return count_data.count;
 }
 
-static void blk_mq_hctx_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
+		struct blk_mq_hw_ctx *hctx)
 {
-	while (1) {
-		if (!blk_mq_tags_inflight_rqs(hctx))
-			break;
-		msleep(5);
-	}
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return false;
+	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
+		return false;
+	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
+		return false;
+	return true;
 }
 
 static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
@@ -2351,25 +2347,19 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
 			struct blk_mq_hw_ctx, cpuhp_online);
 
-	if (!cpumask_test_cpu(cpu, hctx->cpumask))
-		return 0;
-
-	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu) ||
-	    (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids))
+	if (!blk_mq_last_cpu_in_hctx(cpu, hctx))
 		return 0;
 
 	/*
-	 * The current CPU is the last one in this hctx, S_INACTIVE
-	 * can be observed in dispatch path without any barrier needed,
-	 * cause both are run on one same CPU.
+	 * Order setting BLK_MQ_S_INACTIVE versus checking rq->tag and rqs[tag],
+	 * in blk_mq_tags_inflight_rqs.  It pairs with the smp_mb() in
+	 * blk_mq_get_driver_tag.
 	 */
 	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
-	/*
-	 * Order setting BLK_MQ_S_INACTIVE and checking rq->tag & rqs[tag],
-	 * and its pair is the smp_mb() in blk_mq_get_driver_tag
-	 */
 	smp_mb__after_atomic();
-	blk_mq_hctx_drain_inflight_rqs(hctx);
+
+	while (blk_mq_tags_inflight_rqs(hctx))
+		msleep(5);
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag
  2020-04-25  2:54     ` Ming Lei
@ 2020-04-25 18:26       ` Hannes Reinecke
  0 siblings, 0 replies; 81+ messages in thread
From: Hannes Reinecke @ 2020-04-25 18:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

On 4/25/20 4:54 AM, Ming Lei wrote:
> On Fri, Apr 24, 2020 at 03:02:36PM +0200, Hannes Reinecke wrote:
>> On 4/24/20 12:23 PM, Ming Lei wrote:
>>> Especially for none elevator, rq->tag is assigned after the request is
>>> allocated, so there isn't any way to figure out if one request is in
>>> being dispatched. Also the code path wrt. driver tag becomes a bit
>>> difference between none and io scheduler.
>>>
>>> When one hctx becomes inactive, we have to prevent any request from
>>> being dispatched to LLD. And get driver tag provides one perfect chance
>>> to do that. Meantime we can drain any such requests by checking if
>>> rq->tag is assigned.
>>>
>>
>> Sorry for being a bit dense, but I'm having a hard time following the
>> description.
>> Maybe this would be a bit clearer:
>>
>> When one hctx becomes inactive, we do have to prevent any request from
>> being dispatched to the LLD. If we intercept them in blk_mq_get_tag() we can
>> also drain all those requests which have no rq->tag assigned.
> 
> No, actually what we need to drain is requests with rq->tag assigned, and
> if tag isn't assigned, we can simply prevent the request from being
> queued to LLD after the hctx becomes inactive.
> 
> Frankly speaking, the description in commit log should be more clear,
> and correct.
> 
Ah. Thanks for the explanation.


>>
>> (With the nice side effect that if above paragraph is correct I've also got
>> it right what the patch is trying to do :-)
>>
>>> So only assign rq->tag until blk_mq_get_driver_tag() is called.
>>>
>>> This way also simplifies code of dealing with driver tag a lot.
>>>
>>> Cc: Bart Van Assche <bvanassche@acm.org>
>>> Cc: Hannes Reinecke <hare@suse.com>
>>> Cc: Christoph Hellwig <hch@lst.de>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: John Garry <john.garry@huawei.com>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>    block/blk-flush.c | 18 ++----------
>>>    block/blk-mq.c    | 75 ++++++++++++++++++++++++-----------------------
>>>    block/blk-mq.h    | 21 +++++++------
>>>    block/blk.h       |  5 ----
>>>    4 files changed, 51 insertions(+), 68 deletions(-)
>>>
>>> diff --git a/block/blk-flush.c b/block/blk-flush.c
>>> index c7f396e3d5e2..977edf95d711 100644
>>> --- a/block/blk-flush.c
>>> +++ b/block/blk-flush.c
>>> @@ -236,13 +236,8 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
>>>    		error = fq->rq_status;
>>>    	hctx = flush_rq->mq_hctx;
>>> -	if (!q->elevator) {
>>> -		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
>>> -		flush_rq->tag = -1;
>>> -	} else {
>>> -		blk_mq_put_driver_tag(flush_rq);
>>> -		flush_rq->internal_tag = -1;
>>> -	}
>>> +	flush_rq->internal_tag = -1;
>>> +	blk_mq_put_driver_tag(flush_rq);
>>>    	running = &fq->flush_queue[fq->flush_running_idx];
>>>    	BUG_ON(fq->flush_pending_idx == fq->flush_running_idx);
>>> @@ -317,14 +312,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
>>>    	flush_rq->mq_ctx = first_rq->mq_ctx;
>>>    	flush_rq->mq_hctx = first_rq->mq_hctx;
>>> -	if (!q->elevator) {
>>> -		fq->orig_rq = first_rq;
>>> -		flush_rq->tag = first_rq->tag;
>>> -		blk_mq_tag_set_rq(flush_rq->mq_hctx, first_rq->tag, flush_rq);
>>> -	} else {
>>> -		flush_rq->internal_tag = first_rq->internal_tag;
>>> -	}
>>> -
>>> +	flush_rq->internal_tag = first_rq->internal_tag;
>>>    	flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
>>>    	flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
>>>    	flush_rq->rq_flags |= RQF_FLUSH_SEQ;
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 79267f2e8960..65f0aaed55ff 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -276,18 +276,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>>>    	struct request *rq = tags->static_rqs[tag];
>>>    	req_flags_t rq_flags = 0;
>>> -	if (data->flags & BLK_MQ_REQ_INTERNAL) {
>>> -		rq->tag = -1;
>>> -		rq->internal_tag = tag;
>>> -	} else {
>>> -		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
>>> -			rq_flags = RQF_MQ_INFLIGHT;
>>> -			atomic_inc(&data->hctx->nr_active);
>>> -		}
>>> -		rq->tag = tag;
>>> -		rq->internal_tag = -1;
>>> -		data->hctx->tags->rqs[rq->tag] = rq;
>>> -	}
>>> +	rq->internal_tag = tag;
>>> +	rq->tag = -1;
>>>    	/* csd/requeue_work/fifo_time is initialized before use */
>>>    	rq->q = data->q;
>>> @@ -472,14 +462,18 @@ static void __blk_mq_free_request(struct request *rq)
>>>    	struct request_queue *q = rq->q;
>>>    	struct blk_mq_ctx *ctx = rq->mq_ctx;
>>>    	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>>> -	const int sched_tag = rq->internal_tag;
>>>    	blk_pm_mark_last_busy(rq);
>>>    	rq->mq_hctx = NULL;
>>> -	if (rq->tag != -1)
>>> -		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
>>> -	if (sched_tag != -1)
>>> -		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
>>> +
>>> +	if (hctx->sched_tags) {
>>> +		if (rq->tag >= 0)
>>> +			blk_mq_put_tag(hctx->tags, ctx, rq->tag);
>>> +		blk_mq_put_tag(hctx->sched_tags, ctx, rq->internal_tag);
>>> +	} else {
>>> +		blk_mq_put_tag(hctx->tags, ctx, rq->internal_tag);
>>> +        }
>>> +
>>>    	blk_mq_sched_restart(hctx);
>>>    	blk_queue_exit(q);
>>>    }
>>> @@ -527,7 +521,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
>>>    		blk_stat_add(rq, now);
>>>    	}
>>> -	if (rq->internal_tag != -1)
>>> +	if (rq->q->elevator && rq->internal_tag != -1)
>>>    		blk_mq_sched_completed_request(rq, now);
>>>    	blk_account_io_done(rq, now);
>>
>> One really does wonder: under which circumstances can 'internal_tag' be -1
>> now ?
>> The hunk above seems to imply that 'internal_tag' is now always be set; and
>> this is also the impression I got from reading this patch.
>> Care to elaborate?
> 
> rq->internal_tag should always be assigned, and the only case is that it can
> become -1 for flush rq, however it is done in .end_io(), so we may avoid
> the above check on rq->internal_tag.
> 
Thought so.

>>
>>> @@ -1027,33 +1021,40 @@ static inline unsigned int queued_to_index(unsigned int queued)
>>>    	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>>>    }
>>> -static bool blk_mq_get_driver_tag(struct request *rq)
>>> +static bool __blk_mq_get_driver_tag(struct request *rq)
>>>    {
>>>    	struct blk_mq_alloc_data data = {
>>> -		.q = rq->q,
>>> -		.hctx = rq->mq_hctx,
>>> -		.flags = BLK_MQ_REQ_NOWAIT,
>>> -		.cmd_flags = rq->cmd_flags,
>>> +		.q		= rq->q,
>>> +		.hctx		= rq->mq_hctx,
>>> +		.flags		= BLK_MQ_REQ_NOWAIT,
>>> +		.cmd_flags	= rq->cmd_flags,
>>>    	};
>>> -	bool shared;
>>> -	if (rq->tag != -1)
>>> -		return true;
>>> +	if (data.hctx->sched_tags) {
>>> +		if (blk_mq_tag_is_reserved(data.hctx->sched_tags,
>>> +				rq->internal_tag))
>>> +			data.flags |= BLK_MQ_REQ_RESERVED;
>>> +		rq->tag = blk_mq_get_tag(&data);
>>> +	} else {
>>> +		rq->tag = rq->internal_tag;
>>> +	}
>>> -	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
>>> -		data.flags |= BLK_MQ_REQ_RESERVED;
>>> +	if (rq->tag == -1)
>>> +		return false;
>>> -	shared = blk_mq_tag_busy(data.hctx);
>>> -	rq->tag = blk_mq_get_tag(&data);
>>> -	if (rq->tag >= 0) {
>>> -		if (shared) {
>>> -			rq->rq_flags |= RQF_MQ_INFLIGHT;
>>> -			atomic_inc(&data.hctx->nr_active);
>>> -		}
>>> -		data.hctx->tags->rqs[rq->tag] = rq;
>>> +	if (blk_mq_tag_busy(data.hctx)) {
>>> +		rq->rq_flags |= RQF_MQ_INFLIGHT;
>>> +		atomic_inc(&data.hctx->nr_active);
>>>    	}
>>> +	data.hctx->tags->rqs[rq->tag] = rq;
>>> +	return true;
>>> +}
>>> -	return rq->tag != -1;
>>> +static bool blk_mq_get_driver_tag(struct request *rq)
>>> +{
>>> +	if (rq->tag != -1)
>>> +		return true;
>>> +	return __blk_mq_get_driver_tag(rq);
>>>    }
>>>    static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
>>> diff --git a/block/blk-mq.h b/block/blk-mq.h
>>> index e7d1da4b1f73..d0c72d7d07c8 100644
>>> --- a/block/blk-mq.h
>>> +++ b/block/blk-mq.h
>>> @@ -196,26 +196,25 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
>>>    	return true;
>>>    }
>>> -static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>>> -					   struct request *rq)
>>> +static inline void blk_mq_put_driver_tag(struct request *rq)
>>>    {
>>> -	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
>>> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>>> +	int tag = rq->tag;
>>> +
>>> +	if (tag < 0)
>>> +		return;
>>> +
>>>    	rq->tag = -1;
>>>     > +	if (hctx->sched_tags)
>>> +		blk_mq_put_tag(hctx->tags, rq->mq_ctx, tag);
>>> +
>> I wonder if you need the local variable 'tag' here; might it not be better
>> to set 'rq->tag' to '-1' after the call to put_tag?
> 
> No, we can't touch the request after blk_mq_put_tag() returns.
> 
> I remember we have fixed such kind of UAF several times.
> 
Ah, right, of course.

FWIW:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-25 15:48             ` Christoph Hellwig
@ 2020-04-26  2:06               ` Ming Lei
  2020-04-26  8:19                 ` John Garry
  2020-04-27 15:36                 ` Christoph Hellwig
  2020-04-27 19:03               ` Paul E. McKenney
  2020-04-28 15:58               ` Peter Zijlstra
  2 siblings, 2 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-26  2:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner, will, peterz, paulmck

On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> FYI, here is what I think we should be doing (but the memory model
> experts please correct me):
> 
>  - just drop the direct_issue flag and check for the CPU, which is
>    cheap enough

That isn't correct, because the CPU running the async queue work may not
be the same as rq->mq_ctx->cpu: hctx->cpumask may include several CPUs and
we run the queue in round-robin style, so this is really a normal case.

So I'd rather keep the 'direct_issue' flag, given it is just a constant
argument and gcc is smart enough to optimize this case, unless you have a
better idea.
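
Just to illustrate the point about the constant argument (call sites as
in this patch):

	/*
	 * Every caller passes a compile-time constant, so once the static
	 * function is inlined the compiler can drop the dead branch and
	 * the non-direct-issue paths keep the plain barrier().
	 */
	blk_mq_get_driver_tag(rq, false);	/* dispatch / requeue paths */
	blk_mq_get_driver_tag(rq, true);	/* __blk_mq_try_issue_directly() */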

>  - replace the raw_smp_processor_id with a get_cpu to make sure we
>    don't hit the tiny migration windows

Will do that.

>  - a bunch of random cleanups to make the code easier to read, mostly
>    by being more self-documenting or improving the comments.

The cleanup code looks fine.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-26  2:06               ` Ming Lei
@ 2020-04-26  8:19                 ` John Garry
  2020-04-27 15:36                 ` Christoph Hellwig
  1 sibling, 0 replies; 81+ messages in thread
From: John Garry @ 2020-04-26  8:19 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, will, peterz, paulmck

On 26/04/2020 03:06, Ming Lei wrote:
> On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
>> FYI, here is what I think we should be doing (but the memory model
>> experts please correct me):

Hi Ming,

>>
>>   - just drop the direct_issue flag and check for the CPU, which is
>>     cheap enough
> 
> That isn't correct because the CPU for running async queue may not be
> same with rq->mq_ctx->cpu since hctx->cpumask may include several CPUs
> and we run queue in RR style and it is really a normal case.
> 
> So I'd rather to keep 'direct_issue' flag given it is just a constant

How about changing the function argument name from 'direct_issue' to
something which better expresses the intent, like 'may_migrate' or
similar? (I know you mention the reason for this in the comment also).

> argument and gcc is smart enough to optimize this case if you don't have
> better idea.
> 
>>   - replace the raw_smp_processor_id with a get_cpu to make sure we
>>     don't hit the tiny migration windows
> 

I do wonder if always doing the barrier may be cheaper than this.

> Will do that.
> 
>>   - a bunch of random cleanups to make the code easier to read, mostly
>>     by being more self-documenting or improving the comments.
> 
> The cleanup code looks fine.
> 
> 

Cheers,
John


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-26  2:06               ` Ming Lei
  2020-04-26  8:19                 ` John Garry
@ 2020-04-27 15:36                 ` Christoph Hellwig
  2020-04-28  1:10                   ` Ming Lei
  1 sibling, 1 reply; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-27 15:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, linux-block, John Garry,
	Bart Van Assche, Hannes Reinecke, Thomas Gleixner, will, peterz,
	paulmck

On Sun, Apr 26, 2020 at 10:06:21AM +0800, Ming Lei wrote:
> On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> > FYI, here is what I think we should be doing (but the memory model
> > experts please correct me):
> > 
> >  - just drop the direct_issue flag and check for the CPU, which is
> >    cheap enough
> 
> That isn't correct because the CPU for running async queue may not be
> same with rq->mq_ctx->cpu since hctx->cpumask may include several CPUs
> and we run queue in RR style and it is really a normal case.

But in that case the memory barrier really doesn't matter anyway.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-25 15:48             ` Christoph Hellwig
  2020-04-26  2:06               ` Ming Lei
@ 2020-04-27 19:03               ` Paul E. McKenney
  2020-04-28  6:54                 ` Christoph Hellwig
  2020-04-28 15:58               ` Peter Zijlstra
  2 siblings, 1 reply; 81+ messages in thread
From: Paul E. McKenney @ 2020-04-27 19:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner, will, peterz

On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> FYI, here is what I think we should be doing (but the memory model
> experts please correct me):

I would be happy to, but could you please tell me what to apply this
against?  I made several wrong guesses, and am not familiar enough with
this code to evaluate this patch in isolation.

							Thanx, Paul

>  - just drop the direct_issue flag and check for the CPU, which is
>    cheap enough
>  - replace the raw_smp_processor_id with a get_cpu to make sure we
>    don't hit the tiny migration windows
>  - a bunch of random cleanups to make the code easier to read, mostly
>    by being more self-documenting or improving the comments.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index bfa4020256ae9..da749865f6eed 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1049,28 +1049,16 @@ static bool __blk_mq_get_driver_tag(struct request *rq)
>  		atomic_inc(&data.hctx->nr_active);
>  	}
>  	data.hctx->tags->rqs[rq->tag] = rq;
> -	return true;
> -}
> -
> -static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
> -{
> -	if (rq->tag != -1)
> -		return true;
>  
> -	if (!__blk_mq_get_driver_tag(rq))
> -		return false;
>  	/*
> -	 * Add one memory barrier in case that direct issue IO process is
> -	 * migrated to other CPU which may not belong to this hctx, so we can
> -	 * order driver tag assignment and checking BLK_MQ_S_INACTIVE.
> -	 * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
> -	 * and driver tag assignment are run on the same CPU in case that
> -	 * BLK_MQ_S_INACTIVE is set.
> +	 * Ensure updates to rq->tag and tags->rqs[] are seen by
> +	 * blk_mq_tags_inflight_rqs.  This pairs with the smp_mb__after_atomic
> +	 * in blk_mq_hctx_notify_offline.  This only matters in case a process
> +	 * gets migrated to another CPU that is not mapped to this hctx.
>  	 */
> -	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
> +	if (rq->mq_ctx->cpu != get_cpu())
>  		smp_mb();
> -	else
> -		barrier();
> +	put_cpu();
>  
>  	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
>  		blk_mq_put_driver_tag(rq);
> @@ -1079,6 +1067,13 @@ static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
>  	return true;
>  }
>  
> +static bool blk_mq_get_driver_tag(struct request *rq)
> +{
> +	if (rq->tag != -1)
> +		return true;
> +	return __blk_mq_get_driver_tag(rq);
> +}
> +
>  static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
>  				int flags, void *key)
>  {
> @@ -1125,7 +1120,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
>  		 * Don't clear RESTART here, someone else could have set it.
>  		 * At most this will cost an extra queue run.
>  		 */
> -		return blk_mq_get_driver_tag(rq, false);
> +		return blk_mq_get_driver_tag(rq);
>  	}
>  
>  	wait = &hctx->dispatch_wait;
> @@ -1151,7 +1146,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
>  	 * allocation failure and adding the hardware queue to the wait
>  	 * queue.
>  	 */
> -	ret = blk_mq_get_driver_tag(rq, false);
> +	ret = blk_mq_get_driver_tag(rq);
>  	if (!ret) {
>  		spin_unlock(&hctx->dispatch_wait_lock);
>  		spin_unlock_irq(&wq->lock);
> @@ -1252,7 +1247,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>  			break;
>  		}
>  
> -		if (!blk_mq_get_driver_tag(rq, false)) {
> +		if (!blk_mq_get_driver_tag(rq)) {
>  			/*
>  			 * The initial allocation attempt failed, so we need to
>  			 * rerun the hardware queue when a tag is freed. The
> @@ -1284,7 +1279,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>  			bd.last = true;
>  		else {
>  			nxt = list_first_entry(list, struct request, queuelist);
> -			bd.last = !blk_mq_get_driver_tag(nxt, false);
> +			bd.last = !blk_mq_get_driver_tag(nxt);
>  		}
>  
>  		ret = q->mq_ops->queue_rq(hctx, &bd);
> @@ -1886,7 +1881,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  	if (!blk_mq_get_dispatch_budget(hctx))
>  		goto insert;
>  
> -	if (!blk_mq_get_driver_tag(rq, true)) {
> +	if (!blk_mq_get_driver_tag(rq)) {
>  		blk_mq_put_dispatch_budget(hctx);
>  		goto insert;
>  	}
> @@ -2327,23 +2322,24 @@ static bool blk_mq_inflight_rq(struct request *rq, void *data,
>  static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct count_inflight_data count_data = {
> -		.count	= 0,
>  		.hctx	= hctx,
>  	};
>  
>  	blk_mq_all_tag_busy_iter(hctx->tags, blk_mq_count_inflight_rq,
>  			blk_mq_inflight_rq, &count_data);
> -
>  	return count_data.count;
>  }
>  
> -static void blk_mq_hctx_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
> +static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
> +		struct blk_mq_hw_ctx *hctx)
>  {
> -	while (1) {
> -		if (!blk_mq_tags_inflight_rqs(hctx))
> -			break;
> -		msleep(5);
> -	}
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return false;
> +	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
> +		return false;
> +	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
> +		return false;
> +	return true;
>  }
>  
>  static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
> @@ -2351,25 +2347,19 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>  	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
>  			struct blk_mq_hw_ctx, cpuhp_online);
>  
> -	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> -		return 0;
> -
> -	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu) ||
> -	    (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids))
> +	if (!blk_mq_last_cpu_in_hctx(cpu, hctx))
>  		return 0;
>  
>  	/*
> -	 * The current CPU is the last one in this hctx, S_INACTIVE
> -	 * can be observed in dispatch path without any barrier needed,
> -	 * cause both are run on one same CPU.
> +	 * Order setting BLK_MQ_S_INACTIVE versus checking rq->tag and rqs[tag],
> +	 * in blk_mq_tags_inflight_rqs.  It pairs with the smp_mb() in
> +	 * blk_mq_get_driver_tag.
>  	 */
>  	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> -	/*
> -	 * Order setting BLK_MQ_S_INACTIVE and checking rq->tag & rqs[tag],
> -	 * and its pair is the smp_mb() in blk_mq_get_driver_tag
> -	 */
>  	smp_mb__after_atomic();
> -	blk_mq_hctx_drain_inflight_rqs(hctx);
> +
> +	while (blk_mq_tags_inflight_rqs(hctx))
> +		msleep(5);
>  	return 0;
>  }
>  

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-27 15:36                 ` Christoph Hellwig
@ 2020-04-28  1:10                   ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-28  1:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner, will, peterz, paulmck

On Mon, Apr 27, 2020 at 05:36:01PM +0200, Christoph Hellwig wrote:
> On Sun, Apr 26, 2020 at 10:06:21AM +0800, Ming Lei wrote:
> > On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> > > FYI, here is what I think we should be doing (but the memory model
> > > experts please correct me):
> > > 
> > >  - just drop the direct_issue flag and check for the CPU, which is
> > >    cheap enough
> > 
> > That isn't correct because the CPU for running async queue may not be
> > same with rq->mq_ctx->cpu since hctx->cpumask may include several CPUs
> > and we run queue in RR style and it is really a normal case.
> 
> But in that case the memory barrier really doesn't matter anyway.

That might be true, but we can avoid the smp_mb() cost essentially for free,
so why not do it? It also has documentation value.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-27 19:03               ` Paul E. McKenney
@ 2020-04-28  6:54                 ` Christoph Hellwig
  0 siblings, 0 replies; 81+ messages in thread
From: Christoph Hellwig @ 2020-04-28  6:54 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christoph Hellwig, Ming Lei, Jens Axboe, linux-block, John Garry,
	Bart Van Assche, Hannes Reinecke, Thomas Gleixner, will, peterz

On Mon, Apr 27, 2020 at 12:03:37PM -0700, Paul E. McKenney wrote:
> On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> > FYI, here is what I think we should be doing (but the memory model
> > experts please correct me):
> 
> I would be happy to, but could you please tell me what to apply this
> against?  I made several wrong guesses, and am not familiar enough with
> this code to evaluate this patch in isolation.

Here is a link to the whole series on lore:

https://lore.kernel.org/linux-block/1eba973b-25cf-88bd-7284-d86730b4ddf2@kernel.dk/T/#t

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-25 15:48             ` Christoph Hellwig
  2020-04-26  2:06               ` Ming Lei
  2020-04-27 19:03               ` Paul E. McKenney
@ 2020-04-28 15:58               ` Peter Zijlstra
  2020-04-29  2:16                 ` Ming Lei
  2 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2020-04-28 15:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner, will, paulmck

On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
>  		atomic_inc(&data.hctx->nr_active);
>  	}
>  	data.hctx->tags->rqs[rq->tag] = rq;
>  
>  	/*
> +	 * Ensure updates to rq->tag and tags->rqs[] are seen by
> +	 * blk_mq_tags_inflight_rqs.  This pairs with the smp_mb__after_atomic
> +	 * in blk_mq_hctx_notify_offline.  This only matters in case a process
> +	 * gets migrated to another CPU that is not mapped to this hctx.
>  	 */
> +	if (rq->mq_ctx->cpu != get_cpu())
>  		smp_mb();
> +	put_cpu();

This looks exceedingly weird; how do you think you can get to another
CPU and not have an smp_mb() implied in the migration itself? Also, what
stops the migration from happening right after the put_cpu() ?


>  	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
>  		blk_mq_put_driver_tag(rq);


> +static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
> +		struct blk_mq_hw_ctx *hctx)
>  {
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return false;
> +	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
> +		return false;
> +	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
> +		return false;
> +	return true;
>  }

Does this want something like:

	lockdep_assert_held(*set->tag_list_lock);

to make sure hctx->cpumask is stable? Those mask ops are not stable vs
concurrent set/clear at all.
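
Concretely, that would amount to something like the sketch below (illustrative
only; it assumes tag_list_lock is what serializes hctx->cpumask updates, which
is exactly the point under discussion):

	static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
			struct blk_mq_hw_ctx *hctx)
	{
		/* sketch: assert the mask cannot change underneath us */
		lockdep_assert_held(&hctx->queue->tag_set->tag_list_lock);

		if (!cpumask_test_cpu(cpu, hctx->cpumask))
			return false;
		if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
			return false;
		if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
			return false;
		return true;
	}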


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-28 15:58               ` Peter Zijlstra
@ 2020-04-29  2:16                 ` Ming Lei
  2020-04-29  8:07                   ` Will Deacon
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-04-29  2:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Hellwig, Jens Axboe, linux-block, John Garry,
	Bart Van Assche, Hannes Reinecke, Thomas Gleixner, will, paulmck

Hi Peter,

On Tue, Apr 28, 2020 at 05:58:37PM +0200, Peter Zijlstra wrote:
> On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> >  		atomic_inc(&data.hctx->nr_active);
> >  	}
> >  	data.hctx->tags->rqs[rq->tag] = rq;
> >  
> >  	/*
> > +	 * Ensure updates to rq->tag and tags->rqs[] are seen by
> > +	 * blk_mq_tags_inflight_rqs.  This pairs with the smp_mb__after_atomic
> > +	 * in blk_mq_hctx_notify_offline.  This only matters in case a process
> > +	 * gets migrated to another CPU that is not mapped to this hctx.
> >  	 */
> > +	if (rq->mq_ctx->cpu != get_cpu())
> >  		smp_mb();
> > +	put_cpu();
> 
> This looks exceedingly weird; how do you think you can get to another
> CPU and not have an smp_mb() implied in the migration itself? Also, what

What we need is one smp_mb() between the following two OPs:

1) 
   rq->tag = rq->internal_tag;
   data.hctx->tags->rqs[rq->tag] = rq;

2) 
	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)))

And the pair of the above barrier is in blk_mq_hctx_notify_offline().

So if this process is migrated before 1), the implied smp_mb() is useless.

> stops the migration from happening right after the put_cpu() ?

If the migration happens after put_cpu(), the above two operations are still
ordered by the implied smp_mb(), so that does not look like a problem.

> 
> 
> >  	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
> >  		blk_mq_put_driver_tag(rq);
> 
> 
> > +static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
> > +		struct blk_mq_hw_ctx *hctx)
> >  {
> > +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> > +		return false;
> > +	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
> > +		return false;
> > +	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
> > +		return false;
> > +	return true;
> >  }
> 
> Does this want something like:
> 
> 	lockdep_assert_held(*set->tag_list_lock);
> 
> to make sure hctx->cpumask is stable? Those mask ops are not stable vs
> concurrenct set/clear at all.

hctx->cpumask is only updated in __blk_mq_update_nr_hw_queues(), where all
request queues in the tagset have already been frozen and there is no
in-flight IO, so we don't need to worry about that case.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-29  2:16                 ` Ming Lei
@ 2020-04-29  8:07                   ` Will Deacon
  2020-04-29  9:46                     ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Will Deacon @ 2020-04-29  8:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Zijlstra, Christoph Hellwig, Jens Axboe, linux-block,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck

On Wed, Apr 29, 2020 at 10:16:12AM +0800, Ming Lei wrote:
> On Tue, Apr 28, 2020 at 05:58:37PM +0200, Peter Zijlstra wrote:
> > On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> > >  		atomic_inc(&data.hctx->nr_active);
> > >  	}
> > >  	data.hctx->tags->rqs[rq->tag] = rq;
> > >  
> > >  	/*
> > > +	 * Ensure updates to rq->tag and tags->rqs[] are seen by
> > > +	 * blk_mq_tags_inflight_rqs.  This pairs with the smp_mb__after_atomic
> > > +	 * in blk_mq_hctx_notify_offline.  This only matters in case a process
> > > +	 * gets migrated to another CPU that is not mapped to this hctx.
> > >  	 */
> > > +	if (rq->mq_ctx->cpu != get_cpu())
> > >  		smp_mb();
> > > +	put_cpu();
> > 
> > This looks exceedingly weird; how do you think you can get to another
> > CPU and not have an smp_mb() implied in the migration itself? Also, what
> 
> What we need is one smp_mb() between the following two OPs:
> 
> 1) 
>    rq->tag = rq->internal_tag;
>    data.hctx->tags->rqs[rq->tag] = rq;
> 
> 2) 
> 	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)))
> 
> And the pair of the above barrier is in blk_mq_hctx_notify_offline().

I'm struggling with this, so let me explain why. My understanding of the
original patch [1] and your explanation above is that you want *either* of
the following behaviours

  - __blk_mq_get_driver_tag() (i.e. (1) above) and test_bit(BLK_MQ_S_INACTIVE, ...)
    run on the same CPU with barrier() between them, or

  - There is a migration and therefore an implied smp_mb() between them

However, given that most CPUs can speculate loads (and therefore the
test_bit() operation), I don't understand how the "everything runs on the
same CPU" is safe if a barrier() is required.  In other words, if the
barrier() is needed to prevent the compiler hoisting the load, then the CPU
can still cause problems.

Thanks,

Will

[1] https://lore.kernel.org/linux-block/20200424102351.475641-8-ming.lei@redhat.com/

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-29  8:07                   ` Will Deacon
@ 2020-04-29  9:46                     ` Ming Lei
  2020-04-29 12:27                       ` Will Deacon
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-04-29  9:46 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Christoph Hellwig, Jens Axboe, linux-block,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck

On Wed, Apr 29, 2020 at 09:07:29AM +0100, Will Deacon wrote:
> On Wed, Apr 29, 2020 at 10:16:12AM +0800, Ming Lei wrote:
> > On Tue, Apr 28, 2020 at 05:58:37PM +0200, Peter Zijlstra wrote:
> > > On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> > > >  		atomic_inc(&data.hctx->nr_active);
> > > >  	}
> > > >  	data.hctx->tags->rqs[rq->tag] = rq;
> > > >  
> > > >  	/*
> > > > +	 * Ensure updates to rq->tag and tags->rqs[] are seen by
> > > > +	 * blk_mq_tags_inflight_rqs.  This pairs with the smp_mb__after_atomic
> > > > +	 * in blk_mq_hctx_notify_offline.  This only matters in case a process
> > > > +	 * gets migrated to another CPU that is not mapped to this hctx.
> > > >  	 */
> > > > +	if (rq->mq_ctx->cpu != get_cpu())
> > > >  		smp_mb();
> > > > +	put_cpu();
> > > 
> > > This looks exceedingly weird; how do you think you can get to another
> > > CPU and not have an smp_mb() implied in the migration itself? Also, what
> > 
> > What we need is one smp_mb() between the following two OPs:
> > 
> > 1) 
> >    rq->tag = rq->internal_tag;
> >    data.hctx->tags->rqs[rq->tag] = rq;
> > 
> > 2) 
> > 	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)))
> > 
> > And the pair of the above barrier is in blk_mq_hctx_notify_offline().
> 
> I'm struggling with this, so let me explain why. My understanding of the
> original patch [1] and your explanation above is that you want *either* of
> the following behaviours
> 
>   - __blk_mq_get_driver_tag() (i.e. (1) above) and test_bit(BLK_MQ_S_INACTIVE, ...)
>     run on the same CPU with barrier() between them, or
> 
>   - There is a migration and therefore an implied smp_mb() between them
> 
> However, given that most CPUs can speculate loads (and therefore the
> test_bit() operation), I don't understand how the "everything runs on the
> same CPU" is safe if a barrier() is required.  In other words, if the
> barrier() is needed to prevent the compiler hoisting the load, then the CPU
> can still cause problems.

Do you think the speculative loads may return a wrong value of the
BLK_MQ_S_INACTIVE bit in the single-CPU case? BTW, the bit is written on the
same CPU. If they can, that machine would not obey cache consistency, IMO.

Also smp_mb() is just barrier() in the non-SMP case, and non-SMP code still
works fine without any extra barrier even in the presence of speculative
loads.

Thanks,
Ming

> 
> Thanks,
> 
> Will
> 
> [1] https://lore.kernel.org/linux-block/20200424102351.475641-8-ming.lei@redhat.com/
> 

-- 
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-29  9:46                     ` Ming Lei
@ 2020-04-29 12:27                       ` Will Deacon
  2020-04-29 13:43                         ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Will Deacon @ 2020-04-29 12:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Zijlstra, Christoph Hellwig, Jens Axboe, linux-block,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck

On Wed, Apr 29, 2020 at 05:46:16PM +0800, Ming Lei wrote:
> On Wed, Apr 29, 2020 at 09:07:29AM +0100, Will Deacon wrote:
> > On Wed, Apr 29, 2020 at 10:16:12AM +0800, Ming Lei wrote:
> > > On Tue, Apr 28, 2020 at 05:58:37PM +0200, Peter Zijlstra wrote:
> > > > On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> > > > >  		atomic_inc(&data.hctx->nr_active);
> > > > >  	}
> > > > >  	data.hctx->tags->rqs[rq->tag] = rq;
> > > > >  
> > > > >  	/*
> > > > > +	 * Ensure updates to rq->tag and tags->rqs[] are seen by
> > > > > +	 * blk_mq_tags_inflight_rqs.  This pairs with the smp_mb__after_atomic
> > > > > +	 * in blk_mq_hctx_notify_offline.  This only matters in case a process
> > > > > +	 * gets migrated to another CPU that is not mapped to this hctx.
> > > > >  	 */
> > > > > +	if (rq->mq_ctx->cpu != get_cpu())
> > > > >  		smp_mb();
> > > > > +	put_cpu();
> > > > 
> > > > This looks exceedingly weird; how do you think you can get to another
> > > > CPU and not have an smp_mb() implied in the migration itself? Also, what
> > > 
> > > What we need is one smp_mb() between the following two OPs:
> > > 
> > > 1) 
> > >    rq->tag = rq->internal_tag;
> > >    data.hctx->tags->rqs[rq->tag] = rq;
> > > 
> > > 2) 
> > > 	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)))
> > > 
> > > And the pair of the above barrier is in blk_mq_hctx_notify_offline().
> > 
> > I'm struggling with this, so let me explain why. My understanding of the
> > original patch [1] and your explanation above is that you want *either* of
> > the following behaviours
> > 
> >   - __blk_mq_get_driver_tag() (i.e. (1) above) and test_bit(BLK_MQ_S_INACTIVE, ...)
> >     run on the same CPU with barrier() between them, or
> > 
> >   - There is a migration and therefore an implied smp_mb() between them
> > 
> > However, given that most CPUs can speculate loads (and therefore the
> > test_bit() operation), I don't understand how the "everything runs on the
> > same CPU" is safe if a barrier() is required.  In other words, if the
> > barrier() is needed to prevent the compiler hoisting the load, then the CPU
> > can still cause problems.
> 
> Do you think the speculate loads may return wrong value of
> BLK_MQ_S_INACTIVE bit in case of single CPU? BTW, writing the bit is
> done on the same CPU. If yes, this machine may not obey cache consistency,
> IMO.

If the write is on the same CPU, then the read will of course return the
value written by that write, otherwise we'd have much bigger problems!

But then I'm confused, because you're saying that the write is done on the
same CPU, but previously you were saying that migration occurring before (1)
was problematic. Can you explain a bit more about that case, please? What
is running before (1) that is relevant here?

Thanks,

Will

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-29 12:27                       ` Will Deacon
@ 2020-04-29 13:43                         ` Ming Lei
  2020-04-29 17:34                           ` Will Deacon
  2020-04-29 17:46                           ` Paul E. McKenney
  0 siblings, 2 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-29 13:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Christoph Hellwig, Jens Axboe, linux-block,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck

On Wed, Apr 29, 2020 at 01:27:57PM +0100, Will Deacon wrote:
> On Wed, Apr 29, 2020 at 05:46:16PM +0800, Ming Lei wrote:
> > On Wed, Apr 29, 2020 at 09:07:29AM +0100, Will Deacon wrote:
> > > On Wed, Apr 29, 2020 at 10:16:12AM +0800, Ming Lei wrote:
> > > > On Tue, Apr 28, 2020 at 05:58:37PM +0200, Peter Zijlstra wrote:
> > > > > On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> > > > > >  		atomic_inc(&data.hctx->nr_active);
> > > > > >  	}
> > > > > >  	data.hctx->tags->rqs[rq->tag] = rq;
> > > > > >  
> > > > > >  	/*
> > > > > > +	 * Ensure updates to rq->tag and tags->rqs[] are seen by
> > > > > > +	 * blk_mq_tags_inflight_rqs.  This pairs with the smp_mb__after_atomic
> > > > > > +	 * in blk_mq_hctx_notify_offline.  This only matters in case a process
> > > > > > +	 * gets migrated to another CPU that is not mapped to this hctx.
> > > > > >  	 */
> > > > > > +	if (rq->mq_ctx->cpu != get_cpu())
> > > > > >  		smp_mb();
> > > > > > +	put_cpu();
> > > > > 
> > > > > This looks exceedingly weird; how do you think you can get to another
> > > > > CPU and not have an smp_mb() implied in the migration itself? Also, what
> > > > 
> > > > What we need is one smp_mb() between the following two OPs:
> > > > 
> > > > 1) 
> > > >    rq->tag = rq->internal_tag;
> > > >    data.hctx->tags->rqs[rq->tag] = rq;
> > > > 
> > > > 2) 
> > > > 	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)))
> > > > 
> > > > And the pair of the above barrier is in blk_mq_hctx_notify_offline().
> > > 
> > > I'm struggling with this, so let me explain why. My understanding of the
> > > original patch [1] and your explanation above is that you want *either* of
> > > the following behaviours
> > > 
> > >   - __blk_mq_get_driver_tag() (i.e. (1) above) and test_bit(BLK_MQ_S_INACTIVE, ...)
> > >     run on the same CPU with barrier() between them, or
> > > 
> > >   - There is a migration and therefore an implied smp_mb() between them
> > > 
> > > However, given that most CPUs can speculate loads (and therefore the
> > > test_bit() operation), I don't understand how the "everything runs on the
> > > same CPU" is safe if a barrier() is required.  In other words, if the
> > > barrier() is needed to prevent the compiler hoisting the load, then the CPU
> > > can still cause problems.
> > 
> > Do you think the speculate loads may return wrong value of
> > BLK_MQ_S_INACTIVE bit in case of single CPU? BTW, writing the bit is
> > done on the same CPU. If yes, this machine may not obey cache consistency,
> > IMO.
> 
> If the write is on the same CPU, then the read will of course return the
> value written by that write, otherwise we'd have much bigger problems!

OK, then it has nothing to do with speculative loads.

> 
> But then I'm confused, because you're saying that the write is done on the
> same CPU, but previously you were saying that migration occuring before (1)
> was problematic. Can you explain a bit more about that case, please? What
> is running before (1) that is relevant here?

Please see the following two code paths:

[1] code path1:
blk_mq_hctx_notify_offline():
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);

	smp_mb() or smp_mb_after_atomic()

	blk_mq_hctx_drain_inflight_rqs():
		blk_mq_tags_inflight_rqs()
			rq = hctx->tags->rqs[index]
			and
			READ rq->tag

[2] code path2:
	blk_mq_get_driver_tag():

		the process might be migrated to another CPU here (the chance is small),
		in which case the following code runs on a CPU different from code path1

		rq->tag = rq->internal_tag;
		hctx->tags->rqs[rq->tag] = rq;

		barrier() in case code path2 runs on the same CPU as code path1
		OR
		smp_mb() in case code path2 runs on a different CPU from code path1 because
		of process migration
		
		test_bit(BLK_MQ_S_INACTIVE, &data.hctx->state)



Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-29 13:43                         ` Ming Lei
@ 2020-04-29 17:34                           ` Will Deacon
  2020-04-30  0:39                             ` Ming Lei
  2020-04-29 17:46                           ` Paul E. McKenney
  1 sibling, 1 reply; 81+ messages in thread
From: Will Deacon @ 2020-04-29 17:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Zijlstra, Christoph Hellwig, Jens Axboe, linux-block,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck

On Wed, Apr 29, 2020 at 09:43:27PM +0800, Ming Lei wrote:
> On Wed, Apr 29, 2020 at 01:27:57PM +0100, Will Deacon wrote:
> > On Wed, Apr 29, 2020 at 05:46:16PM +0800, Ming Lei wrote:
> > > On Wed, Apr 29, 2020 at 09:07:29AM +0100, Will Deacon wrote:
> > > > On Wed, Apr 29, 2020 at 10:16:12AM +0800, Ming Lei wrote:
> > > > > On Tue, Apr 28, 2020 at 05:58:37PM +0200, Peter Zijlstra wrote:
> > > > > > On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> > > > > > >  		atomic_inc(&data.hctx->nr_active);
> > > > > > >  	}
> > > > > > >  	data.hctx->tags->rqs[rq->tag] = rq;
> > > > > > >  
> > > > > > >  	/*
> > > > > > > +	 * Ensure updates to rq->tag and tags->rqs[] are seen by
> > > > > > > +	 * blk_mq_tags_inflight_rqs.  This pairs with the smp_mb__after_atomic
> > > > > > > +	 * in blk_mq_hctx_notify_offline.  This only matters in case a process
> > > > > > > +	 * gets migrated to another CPU that is not mapped to this hctx.
> > > > > > >  	 */
> > > > > > > +	if (rq->mq_ctx->cpu != get_cpu())
> > > > > > >  		smp_mb();
> > > > > > > +	put_cpu();
> > > > > > 
> > > > > > This looks exceedingly weird; how do you think you can get to another
> > > > > > CPU and not have an smp_mb() implied in the migration itself? Also, what
> > > > > 
> > > > > What we need is one smp_mb() between the following two OPs:
> > > > > 
> > > > > 1) 
> > > > >    rq->tag = rq->internal_tag;
> > > > >    data.hctx->tags->rqs[rq->tag] = rq;
> > > > > 
> > > > > 2) 
> > > > > 	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)))
> > > > > 
> > > > > And the pair of the above barrier is in blk_mq_hctx_notify_offline().
> > > > 
> > > > I'm struggling with this, so let me explain why. My understanding of the
> > > > original patch [1] and your explanation above is that you want *either* of
> > > > the following behaviours
> > > > 
> > > >   - __blk_mq_get_driver_tag() (i.e. (1) above) and test_bit(BLK_MQ_S_INACTIVE, ...)
> > > >     run on the same CPU with barrier() between them, or
> > > > 
> > > >   - There is a migration and therefore an implied smp_mb() between them
> > > > 
> > > > However, given that most CPUs can speculate loads (and therefore the
> > > > test_bit() operation), I don't understand how the "everything runs on the
> > > > same CPU" is safe if a barrier() is required.  In other words, if the
> > > > barrier() is needed to prevent the compiler hoisting the load, then the CPU
> > > > can still cause problems.
> > > 
> > > Do you think the speculate loads may return wrong value of
> > > BLK_MQ_S_INACTIVE bit in case of single CPU? BTW, writing the bit is
> > > done on the same CPU. If yes, this machine may not obey cache consistency,
> > > IMO.
> > 
> > If the write is on the same CPU, then the read will of course return the
> > value written by that write, otherwise we'd have much bigger problems!
> 
> OK, then it is nothing to with speculate loads.
> 
> > 
> > But then I'm confused, because you're saying that the write is done on the
> > same CPU, but previously you were saying that migration occuring before (1)
> > was problematic. Can you explain a bit more about that case, please? What
> > is running before (1) that is relevant here?
> 
> Please see the following two code paths:
> 
> [1] code path1:
> blk_mq_hctx_notify_offline():
> 	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> 
> 	smp_mb() or smp_mb_after_atomic()
> 
> 	blk_mq_hctx_drain_inflight_rqs():
> 		blk_mq_tags_inflight_rqs()
> 			rq = hctx->tags->rqs[index]
> 			and
> 			READ rq->tag
> 
> [2] code path2:
> 	blk_mq_get_driver_tag():
> 
> 		process might be migrated to other CPU here and chance is small,
> 		then the follow code will be run on CPU different with code path1
> 
> 		rq->tag = rq->internal_tag;
> 		hctx->tags->rqs[rq->tag] = rq;

I /think/ this can be distilled to the SB litmus test:

	// blk_mq_hctx_notify_offline()		blk_mq_get_driver_tag();
	Wstate = INACTIVE			Wtag
	smp_mb()				smp_mb()
	Rtag					Rstate

and you want to make sure that either blk_mq_get_driver_tag() sees the
state as INACTIVE and does the cleanup, or it doesn't and
blk_mq_hctx_notify_offline() sees the newly written tag and waits for the
request to complete (I don't get how that happens, but hey).

Is that right?
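
Written out as a herd7 litmus test against the kernel memory model, that
pattern is just the classic store-buffering (SB) shape with full fences; the
variables x and y below are only illustrative stand-ins for the tag/rqs[]
publication and the INACTIVE bit:

	C SB+fencembonceonces

	{}

	P0(int *x, int *y)
	{
		int r0;

		WRITE_ONCE(*x, 1);
		smp_mb();
		r0 = READ_ONCE(*y);
	}

	P1(int *x, int *y)
	{
		int r1;

		WRITE_ONCE(*y, 1);
		smp_mb();
		r1 = READ_ONCE(*x);
	}

	exists (0:r0=0 /\ 1:r1=0)

herd7 should report that exists clause (both sides missing the other's write)
as never satisfied, which is the guarantee being relied on here.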

> 		barrier() in case that code path2 is run on same CPU with code path1
> 		OR
> 		smp_mb() in case that code path2 is run on different CPU with code path1 because
> 		of process migration
> 		
> 		test_bit(BLK_MQ_S_INACTIVE, &data.hctx->state)

Couldn't you just check this at the start of blk_mq_get_driver_tag() as
well, and then make the smp_mb() unconditional?

Will

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-29 13:43                         ` Ming Lei
  2020-04-29 17:34                           ` Will Deacon
@ 2020-04-29 17:46                           ` Paul E. McKenney
  2020-04-30  0:43                             ` Ming Lei
  1 sibling, 1 reply; 81+ messages in thread
From: Paul E. McKenney @ 2020-04-29 17:46 UTC (permalink / raw)
  To: Ming Lei
  Cc: Will Deacon, Peter Zijlstra, Christoph Hellwig, Jens Axboe,
	linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

On Wed, Apr 29, 2020 at 09:43:27PM +0800, Ming Lei wrote:
> On Wed, Apr 29, 2020 at 01:27:57PM +0100, Will Deacon wrote:
> > On Wed, Apr 29, 2020 at 05:46:16PM +0800, Ming Lei wrote:
> > > On Wed, Apr 29, 2020 at 09:07:29AM +0100, Will Deacon wrote:
> > > > On Wed, Apr 29, 2020 at 10:16:12AM +0800, Ming Lei wrote:
> > > > > On Tue, Apr 28, 2020 at 05:58:37PM +0200, Peter Zijlstra wrote:
> > > > > > On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> > > > > > >  		atomic_inc(&data.hctx->nr_active);
> > > > > > >  	}
> > > > > > >  	data.hctx->tags->rqs[rq->tag] = rq;
> > > > > > >  
> > > > > > >  	/*
> > > > > > > +	 * Ensure updates to rq->tag and tags->rqs[] are seen by
> > > > > > > +	 * blk_mq_tags_inflight_rqs.  This pairs with the smp_mb__after_atomic
> > > > > > > +	 * in blk_mq_hctx_notify_offline.  This only matters in case a process
> > > > > > > +	 * gets migrated to another CPU that is not mapped to this hctx.
> > > > > > >  	 */
> > > > > > > +	if (rq->mq_ctx->cpu != get_cpu())
> > > > > > >  		smp_mb();
> > > > > > > +	put_cpu();
> > > > > > 
> > > > > > This looks exceedingly weird; how do you think you can get to another
> > > > > > CPU and not have an smp_mb() implied in the migration itself? Also, what
> > > > > 
> > > > > What we need is one smp_mb() between the following two OPs:
> > > > > 
> > > > > 1) 
> > > > >    rq->tag = rq->internal_tag;
> > > > >    data.hctx->tags->rqs[rq->tag] = rq;
> > > > > 
> > > > > 2) 
> > > > > 	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)))
> > > > > 
> > > > > And the pair of the above barrier is in blk_mq_hctx_notify_offline().
> > > > 
> > > > I'm struggling with this, so let me explain why. My understanding of the
> > > > original patch [1] and your explanation above is that you want *either* of
> > > > the following behaviours
> > > > 
> > > >   - __blk_mq_get_driver_tag() (i.e. (1) above) and test_bit(BLK_MQ_S_INACTIVE, ...)
> > > >     run on the same CPU with barrier() between them, or
> > > > 
> > > >   - There is a migration and therefore an implied smp_mb() between them
> > > > 
> > > > However, given that most CPUs can speculate loads (and therefore the
> > > > test_bit() operation), I don't understand how the "everything runs on the
> > > > same CPU" is safe if a barrier() is required.  In other words, if the
> > > > barrier() is needed to prevent the compiler hoisting the load, then the CPU
> > > > can still cause problems.
> > > 
> > > Do you think the speculate loads may return wrong value of
> > > BLK_MQ_S_INACTIVE bit in case of single CPU? BTW, writing the bit is
> > > done on the same CPU. If yes, this machine may not obey cache consistency,
> > > IMO.
> > 
> > If the write is on the same CPU, then the read will of course return the
> > value written by that write, otherwise we'd have much bigger problems!
> 
> OK, then it is nothing to with speculate loads.
> 
> > 
> > But then I'm confused, because you're saying that the write is done on the
> > same CPU, but previously you were saying that migration occuring before (1)
> > was problematic. Can you explain a bit more about that case, please? What
> > is running before (1) that is relevant here?
> 
> Please see the following two code paths:
> 
> [1] code path1:
> blk_mq_hctx_notify_offline():
> 	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> 
> 	smp_mb() or smp_mb_after_atomic()
> 
> 	blk_mq_hctx_drain_inflight_rqs():
> 		blk_mq_tags_inflight_rqs()
> 			rq = hctx->tags->rqs[index]
> 			and
> 			READ rq->tag
> 
> [2] code path2:
> 	blk_mq_get_driver_tag():
> 
> 		process might be migrated to other CPU here and chance is small,
> 		then the follow code will be run on CPU different with code path1

If the process is migrated from one CPU to another, each CPU will execute
full barriers (smp_mb() or equivalent) as part of the migration.  Do those
barriers help prevent the undesired outcome?

								Thanx, Paul

> 		rq->tag = rq->internal_tag;
> 		hctx->tags->rqs[rq->tag] = rq;
> 
> 		barrier() in case that code path2 is run on same CPU with code path1
> 		OR
> 		smp_mb() in case that code path2 is run on different CPU with code path1 because
> 		of process migration
> 		
> 		test_bit(BLK_MQ_S_INACTIVE, &data.hctx->state)
> 
> 
> 
> Thanks,
> Ming
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-29 17:34                           ` Will Deacon
@ 2020-04-30  0:39                             ` Ming Lei
  2020-04-30 11:04                               ` Will Deacon
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-04-30  0:39 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Christoph Hellwig, Jens Axboe, linux-block,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck

On Wed, Apr 29, 2020 at 06:34:01PM +0100, Will Deacon wrote:
> On Wed, Apr 29, 2020 at 09:43:27PM +0800, Ming Lei wrote:
> > On Wed, Apr 29, 2020 at 01:27:57PM +0100, Will Deacon wrote:
> > > On Wed, Apr 29, 2020 at 05:46:16PM +0800, Ming Lei wrote:
> > > > On Wed, Apr 29, 2020 at 09:07:29AM +0100, Will Deacon wrote:
> > > > > On Wed, Apr 29, 2020 at 10:16:12AM +0800, Ming Lei wrote:
> > > > > > On Tue, Apr 28, 2020 at 05:58:37PM +0200, Peter Zijlstra wrote:
> > > > > > > On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> > > > > > > >  		atomic_inc(&data.hctx->nr_active);
> > > > > > > >  	}
> > > > > > > >  	data.hctx->tags->rqs[rq->tag] = rq;
> > > > > > > >  
> > > > > > > >  	/*
> > > > > > > > +	 * Ensure updates to rq->tag and tags->rqs[] are seen by
> > > > > > > > +	 * blk_mq_tags_inflight_rqs.  This pairs with the smp_mb__after_atomic
> > > > > > > > +	 * in blk_mq_hctx_notify_offline.  This only matters in case a process
> > > > > > > > +	 * gets migrated to another CPU that is not mapped to this hctx.
> > > > > > > >  	 */
> > > > > > > > +	if (rq->mq_ctx->cpu != get_cpu())
> > > > > > > >  		smp_mb();
> > > > > > > > +	put_cpu();
> > > > > > > 
> > > > > > > This looks exceedingly weird; how do you think you can get to another
> > > > > > > CPU and not have an smp_mb() implied in the migration itself? Also, what
> > > > > > 
> > > > > > What we need is one smp_mb() between the following two OPs:
> > > > > > 
> > > > > > 1) 
> > > > > >    rq->tag = rq->internal_tag;
> > > > > >    data.hctx->tags->rqs[rq->tag] = rq;
> > > > > > 
> > > > > > 2) 
> > > > > > 	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)))
> > > > > > 
> > > > > > And the pair of the above barrier is in blk_mq_hctx_notify_offline().
> > > > > 
> > > > > I'm struggling with this, so let me explain why. My understanding of the
> > > > > original patch [1] and your explanation above is that you want *either* of
> > > > > the following behaviours
> > > > > 
> > > > >   - __blk_mq_get_driver_tag() (i.e. (1) above) and test_bit(BLK_MQ_S_INACTIVE, ...)
> > > > >     run on the same CPU with barrier() between them, or
> > > > > 
> > > > >   - There is a migration and therefore an implied smp_mb() between them
> > > > > 
> > > > > However, given that most CPUs can speculate loads (and therefore the
> > > > > test_bit() operation), I don't understand how the "everything runs on the
> > > > > same CPU" is safe if a barrier() is required.  In other words, if the
> > > > > barrier() is needed to prevent the compiler hoisting the load, then the CPU
> > > > > can still cause problems.
> > > > 
> > > > Do you think the speculate loads may return wrong value of
> > > > BLK_MQ_S_INACTIVE bit in case of single CPU? BTW, writing the bit is
> > > > done on the same CPU. If yes, this machine may not obey cache consistency,
> > > > IMO.
> > > 
> > > If the write is on the same CPU, then the read will of course return the
> > > value written by that write, otherwise we'd have much bigger problems!
> > 
> > OK, then it is nothing to with speculate loads.
> > 
> > > 
> > > But then I'm confused, because you're saying that the write is done on the
> > > same CPU, but previously you were saying that migration occuring before (1)
> > > was problematic. Can you explain a bit more about that case, please? What
> > > is running before (1) that is relevant here?
> > 
> > Please see the following two code paths:
> > 
> > [1] code path1:
> > blk_mq_hctx_notify_offline():
> > 	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> > 
> > 	smp_mb() or smp_mb_after_atomic()
> > 
> > 	blk_mq_hctx_drain_inflight_rqs():
> > 		blk_mq_tags_inflight_rqs()
> > 			rq = hctx->tags->rqs[index]
> > 			and
> > 			READ rq->tag
> > 
> > [2] code path2:
> > 	blk_mq_get_driver_tag():
> > 
> > 		process might be migrated to other CPU here and chance is small,
> > 		then the follow code will be run on CPU different with code path1
> > 
> > 		rq->tag = rq->internal_tag;
> > 		hctx->tags->rqs[rq->tag] = rq;
> 
> I /think/ this can be distilled to the SB litmus test:
> 
> 	// blk_mq_hctx_notify_offline()		blk_mq_get_driver_tag();
> 	Wstate = INACTIVE			Wtag
> 	smp_mb()				smp_mb()
> 	Rtag					Rstate
> 
> and you want to make sure that either blk_mq_get_driver_tag() sees the
> state as INACTIVE and does the cleanup, or it doesn't and
> blk_mq_hctx_notify_offline() sees the newly written tag and waits for the
> request to complete (I don't get how that happens, but hey).
> 
> Is that right?

Yeah, exactly.

> 
> > 		barrier() in case that code path2 is run on same CPU with code path1
> > 		OR
> > 		smp_mb() in case that code path2 is run on different CPU with code path1 because
> > 		of process migration
> > 		
> > 		test_bit(BLK_MQ_S_INACTIVE, &data.hctx->state)
> 
> Couldn't you just check this at the start of blk_mq_get_driver_tag() as
> well, and then make the smp_mb() unconditional?

As I mentioned, the chance that the current process (the one calling
blk_mq_get_driver_tag()) gets migrated is very small, and we do want to
avoid the extra smp_mb() in the fast path.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-29 17:46                           ` Paul E. McKenney
@ 2020-04-30  0:43                             ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-04-30  0:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Will Deacon, Peter Zijlstra, Christoph Hellwig, Jens Axboe,
	linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

On Wed, Apr 29, 2020 at 10:46:36AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 29, 2020 at 09:43:27PM +0800, Ming Lei wrote:
> > On Wed, Apr 29, 2020 at 01:27:57PM +0100, Will Deacon wrote:
> > > On Wed, Apr 29, 2020 at 05:46:16PM +0800, Ming Lei wrote:
> > > > On Wed, Apr 29, 2020 at 09:07:29AM +0100, Will Deacon wrote:
> > > > > On Wed, Apr 29, 2020 at 10:16:12AM +0800, Ming Lei wrote:
> > > > > > On Tue, Apr 28, 2020 at 05:58:37PM +0200, Peter Zijlstra wrote:
> > > > > > > On Sat, Apr 25, 2020 at 05:48:32PM +0200, Christoph Hellwig wrote:
> > > > > > > >  		atomic_inc(&data.hctx->nr_active);
> > > > > > > >  	}
> > > > > > > >  	data.hctx->tags->rqs[rq->tag] = rq;
> > > > > > > >  
> > > > > > > >  	/*
> > > > > > > > +	 * Ensure updates to rq->tag and tags->rqs[] are seen by
> > > > > > > > +	 * blk_mq_tags_inflight_rqs.  This pairs with the smp_mb__after_atomic
> > > > > > > > +	 * in blk_mq_hctx_notify_offline.  This only matters in case a process
> > > > > > > > +	 * gets migrated to another CPU that is not mapped to this hctx.
> > > > > > > >  	 */
> > > > > > > > +	if (rq->mq_ctx->cpu != get_cpu())
> > > > > > > >  		smp_mb();
> > > > > > > > +	put_cpu();
> > > > > > > 
> > > > > > > This looks exceedingly weird; how do you think you can get to another
> > > > > > > CPU and not have an smp_mb() implied in the migration itself? Also, what
> > > > > > 
> > > > > > What we need is one smp_mb() between the following two OPs:
> > > > > > 
> > > > > > 1) 
> > > > > >    rq->tag = rq->internal_tag;
> > > > > >    data.hctx->tags->rqs[rq->tag] = rq;
> > > > > > 
> > > > > > 2) 
> > > > > > 	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)))
> > > > > > 
> > > > > > And the pair of the above barrier is in blk_mq_hctx_notify_offline().
> > > > > 
> > > > > I'm struggling with this, so let me explain why. My understanding of the
> > > > > original patch [1] and your explanation above is that you want *either* of
> > > > > the following behaviours
> > > > > 
> > > > >   - __blk_mq_get_driver_tag() (i.e. (1) above) and test_bit(BLK_MQ_S_INACTIVE, ...)
> > > > >     run on the same CPU with barrier() between them, or
> > > > > 
> > > > >   - There is a migration and therefore an implied smp_mb() between them
> > > > > 
> > > > > However, given that most CPUs can speculate loads (and therefore the
> > > > > test_bit() operation), I don't understand how the "everything runs on the
> > > > > same CPU" is safe if a barrier() is required.  In other words, if the
> > > > > barrier() is needed to prevent the compiler hoisting the load, then the CPU
> > > > > can still cause problems.
> > > > 
> > > > Do you think the speculate loads may return wrong value of
> > > > BLK_MQ_S_INACTIVE bit in case of single CPU? BTW, writing the bit is
> > > > done on the same CPU. If yes, this machine may not obey cache consistency,
> > > > IMO.
> > > 
> > > If the write is on the same CPU, then the read will of course return the
> > > value written by that write, otherwise we'd have much bigger problems!
> > 
> > OK, then it is nothing to with speculate loads.
> > 
> > > 
> > > But then I'm confused, because you're saying that the write is done on the
> > > same CPU, but previously you were saying that migration occuring before (1)
> > > was problematic. Can you explain a bit more about that case, please? What
> > > is running before (1) that is relevant here?
> > 
> > Please see the following two code paths:
> > 
> > [1] code path1:
> > blk_mq_hctx_notify_offline():
> > 	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> > 
> > 	smp_mb() or smp_mb_after_atomic()
> > 
> > 	blk_mq_hctx_drain_inflight_rqs():
> > 		blk_mq_tags_inflight_rqs()
> > 			rq = hctx->tags->rqs[index]
> > 			and
> > 			READ rq->tag
> > 
> > [2] code path2:
> > 	blk_mq_get_driver_tag():
> > 
> > 		process might be migrated to other CPU here and chance is small,
> > 		then the follow code will be run on CPU different with code path1
> 
> If the process is migrated from one CPU to another, each CPU will execute
> full barriers (smp_mb() or equivalent) as part of the migration.  Do those
> barriers help prevent the undesired outcome?

No, the WRITE to rq->tag will then run on the new CPU, which is different
from the CPU on which the READ of rq->tag is done.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-30  0:39                             ` Ming Lei
@ 2020-04-30 11:04                               ` Will Deacon
  2020-04-30 14:02                                 ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Will Deacon @ 2020-04-30 11:04 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Zijlstra, Christoph Hellwig, Jens Axboe, linux-block,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck

On Thu, Apr 30, 2020 at 08:39:45AM +0800, Ming Lei wrote:
> On Wed, Apr 29, 2020 at 06:34:01PM +0100, Will Deacon wrote:
> > On Wed, Apr 29, 2020 at 09:43:27PM +0800, Ming Lei wrote:
> > > Please see the following two code paths:
> > > 
> > > [1] code path1:
> > > blk_mq_hctx_notify_offline():
> > > 	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> > > 
> > > 	smp_mb() or smp_mb_after_atomic()
> > > 
> > > 	blk_mq_hctx_drain_inflight_rqs():
> > > 		blk_mq_tags_inflight_rqs()
> > > 			rq = hctx->tags->rqs[index]
> > > 			and
> > > 			READ rq->tag
> > > 
> > > [2] code path2:
> > > 	blk_mq_get_driver_tag():
> > > 
> > > 		process might be migrated to other CPU here and chance is small,
> > > 		then the follow code will be run on CPU different with code path1
> > > 
> > > 		rq->tag = rq->internal_tag;
> > > 		hctx->tags->rqs[rq->tag] = rq;
> > 
> > I /think/ this can be distilled to the SB litmus test:
> > 
> > 	// blk_mq_hctx_notify_offline()		blk_mq_get_driver_tag();
> > 	Wstate = INACTIVE			Wtag
> > 	smp_mb()				smp_mb()
> > 	Rtag					Rstate
> > 
> > and you want to make sure that either blk_mq_get_driver_tag() sees the
> > state as INACTIVE and does the cleanup, or it doesn't and
> > blk_mq_hctx_notify_offline() sees the newly written tag and waits for the
> > request to complete (I don't get how that happens, but hey).
> > 
> > Is that right?
> 
> Yeah, exactly.
> 
> > 
> > > 		barrier() in case that code path2 is run on same CPU with code path1
> > > 		OR
> > > 		smp_mb() in case that code path2 is run on different CPU with code path1 because
> > > 		of process migration
> > > 		
> > > 		test_bit(BLK_MQ_S_INACTIVE, &data.hctx->state)
> > 
> > Couldn't you just check this at the start of blk_mq_get_driver_tag() as
> > well, and then make the smp_mb() unconditional?
> 
> As I mentioned, the chance for the current process(calling
> blk_mq_get_driver_tag()) migration is very small, we do want to
> avoid the extra smp_mb() in the fast path.

Hmm, but your suggestion of checking 'rq->mq_ctx->cpu' only works if that
is the same CPU on which blk_mq_hctx_notify_offline() executes. What
provides that guarantee?

If there's any chance of this thing being concurrent, then you need the
barrier there just in case. So I'd say you either need to prevent the race,
or live with the barrier. Do you have numbers to show how expensive it is?

Will

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-30 11:04                               ` Will Deacon
@ 2020-04-30 14:02                                 ` Ming Lei
  2020-05-05 15:46                                   ` Christoph Hellwig
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-04-30 14:02 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Christoph Hellwig, Jens Axboe, linux-block,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck

On Thu, Apr 30, 2020 at 12:04:29PM +0100, Will Deacon wrote:
> On Thu, Apr 30, 2020 at 08:39:45AM +0800, Ming Lei wrote:
> > On Wed, Apr 29, 2020 at 06:34:01PM +0100, Will Deacon wrote:
> > > On Wed, Apr 29, 2020 at 09:43:27PM +0800, Ming Lei wrote:
> > > > Please see the following two code paths:
> > > > 
> > > > [1] code path1:
> > > > blk_mq_hctx_notify_offline():
> > > > 	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> > > > 
> > > > 	smp_mb() or smp_mb_after_atomic()
> > > > 
> > > > 	blk_mq_hctx_drain_inflight_rqs():
> > > > 		blk_mq_tags_inflight_rqs()
> > > > 			rq = hctx->tags->rqs[index]
> > > > 			and
> > > > 			READ rq->tag
> > > > 
> > > > [2] code path2:
> > > > 	blk_mq_get_driver_tag():
> > > > 
> > > > 		process might be migrated to other CPU here and chance is small,
> > > > 		then the follow code will be run on CPU different with code path1
> > > > 
> > > > 		rq->tag = rq->internal_tag;
> > > > 		hctx->tags->rqs[rq->tag] = rq;
> > > 
> > > I /think/ this can be distilled to the SB litmus test:
> > > 
> > > 	// blk_mq_hctx_notify_offline()		blk_mq_get_driver_tag();
> > > 	Wstate = INACTIVE			Wtag
> > > 	smp_mb()				smp_mb()
> > > 	Rtag					Rstate
> > > 
> > > and you want to make sure that either blk_mq_get_driver_tag() sees the
> > > state as INACTIVE and does the cleanup, or it doesn't and
> > > blk_mq_hctx_notify_offline() sees the newly written tag and waits for the
> > > request to complete (I don't get how that happens, but hey).
> > > 
> > > Is that right?
> > 
> > Yeah, exactly.
> > 
> > > 
> > > > 		barrier() in case that code path2 is run on same CPU with code path1
> > > > 		OR
> > > > 		smp_mb() in case that code path2 is run on different CPU with code path1 because
> > > > 		of process migration
> > > > 		
> > > > 		test_bit(BLK_MQ_S_INACTIVE, &data.hctx->state)
> > > 
> > > Couldn't you just check this at the start of blk_mq_get_driver_tag() as
> > > well, and then make the smp_mb() unconditional?
> > 
> > As I mentioned, the chance for the current process(calling
> > blk_mq_get_driver_tag()) migration is very small, we do want to
> > avoid the extra smp_mb() in the fast path.
> 
> Hmm, but your suggestion of checking 'rq->mq_ctx->cpu' only works if that
> is the same CPU on which blk_mq_hctx_notify_offline() executes. What
> provides that guarantee?

BLK_MQ_S_INACTIVE is only set when the last CPU of this hctx is going
offline, and blk_mq_hctx_notify_offline() is called from the CPU hotplug
handler. So if any request of this hctx is submitted from somewhere, it has
to be submitted from this last CPU; that is guaranteed by blk-mq's queue
mapping.

In case of direct issue, blk_mq_get_driver_tag() basically runs right after
the request is allocated, which is why I mentioned that the chance of
migration is very small.

> 
> If there's any chance of this thing being concurrent, then you need the

The only possibility is that the process running blk_mq_get_driver_tag() is
migrated to another CPU, in the direct-issue case. And we do add smp_mb() for
exactly that case.

> barrier there just in case. So I'd say you either need to prevent the race,
> or live with the barrier. Do you have numbers to show how expensive it is?

Not yet, but we can easily avoid it in the very fast path, so why not do it?
Most of the time no preemption happens there at all.

Also, this patch itself is correct, and the preempt disable via get_cpu()
suggested by Christoph isn't needed either, because migration implies
smp_mb(). I will document this point in the next version.
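
For reference, the conditional form being argued for here is the one in the
original V8 posting (quoted near the top of this thread), roughly:

	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
		smp_mb();
	else
		barrier();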

Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-30 14:02                                 ` Ming Lei
@ 2020-05-05 15:46                                   ` Christoph Hellwig
  2020-05-06  1:24                                     ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Christoph Hellwig @ 2020-05-05 15:46 UTC (permalink / raw)
  To: Ming Lei
  Cc: Will Deacon, Peter Zijlstra, Christoph Hellwig, Jens Axboe,
	linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, paulmck

On Thu, Apr 30, 2020 at 10:02:54PM +0800, Ming Lei wrote:
> BLK_MQ_S_INACTIVE is only set when the last cpu of this hctx is becoming
> offline, and blk_mq_hctx_notify_offline() is called from cpu hotplug
> handler. So if there is any request of this hctx submitted from somewhere,
> it has to this last cpu. That is done by blk-mq's queue mapping.
> 
> In case of direct issue, basically blk_mq_get_driver_tag() is run after
> the request is allocated, that is why I mentioned the chance of
> migration is very small.

"very small" does not cut it, it has to be zero.  And it seems the
new version still has this hack.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-05 15:46                                   ` Christoph Hellwig
@ 2020-05-06  1:24                                     ` Ming Lei
  2020-05-06  7:28                                       ` Will Deacon
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-05-06  1:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Will Deacon, Peter Zijlstra, Jens Axboe, linux-block, John Garry,
	Bart Van Assche, Hannes Reinecke, Thomas Gleixner, paulmck

On Tue, May 05, 2020 at 05:46:18PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 30, 2020 at 10:02:54PM +0800, Ming Lei wrote:
> > BLK_MQ_S_INACTIVE is only set when the last cpu of this hctx is becoming
> > offline, and blk_mq_hctx_notify_offline() is called from cpu hotplug
> > handler. So if there is any request of this hctx submitted from somewhere,
> > it has to this last cpu. That is done by blk-mq's queue mapping.
> > 
> > In case of direct issue, basically blk_mq_get_driver_tag() is run after
> > the request is allocated, that is why I mentioned the chance of
> > migration is very small.
> 
> "very small" does not cut it, it has to be zero.  And it seems the
> new version still has this hack.

But smp_mb() is used for ordering the WRITE and READ, so it is correct.

barrier() is enough when process migration doesn't happen.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-06  1:24                                     ` Ming Lei
@ 2020-05-06  7:28                                       ` Will Deacon
  2020-05-06  8:07                                         ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Will Deacon @ 2020-05-06  7:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Peter Zijlstra, Jens Axboe, linux-block,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck

On Wed, May 06, 2020 at 09:24:25AM +0800, Ming Lei wrote:
> On Tue, May 05, 2020 at 05:46:18PM +0200, Christoph Hellwig wrote:
> > On Thu, Apr 30, 2020 at 10:02:54PM +0800, Ming Lei wrote:
> > > BLK_MQ_S_INACTIVE is only set when the last cpu of this hctx is becoming
> > > offline, and blk_mq_hctx_notify_offline() is called from cpu hotplug
> > > handler. So if there is any request of this hctx submitted from somewhere,
> > > it has to this last cpu. That is done by blk-mq's queue mapping.
> > > 
> > > In case of direct issue, basically blk_mq_get_driver_tag() is run after
> > > the request is allocated, that is why I mentioned the chance of
> > > migration is very small.
> > 
> > "very small" does not cut it, it has to be zero.  And it seems the
> > new version still has this hack.
> 
> But smp_mb() is used for ordering the WRITE and READ, so it is correct.
> 
> barrier() is enough when process migration doesn't happen.

Without numbers I would just make the smp_mb() unconditional. Your
questionable optimisation trades that for a load of the CPU ID and a
conditional branch, which isn't obviously faster to me. It's also very
difficult to explain to people and relies on a bunch of implicit behaviour
(e.g. racing only with CPU-affine hotplug notifier).

If it turns out that the smp_mb() is worthwhile,  then I'd suggest improving
the comment, perhaps to include the litmus test I cooked previously.

Will

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-06  7:28                                       ` Will Deacon
@ 2020-05-06  8:07                                         ` Ming Lei
  2020-05-06  9:56                                           ` Will Deacon
  0 siblings, 1 reply; 81+ messages in thread
From: Ming Lei @ 2020-05-06  8:07 UTC (permalink / raw)
  To: Will Deacon
  Cc: Christoph Hellwig, Peter Zijlstra, Jens Axboe, linux-block,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck

On Wed, May 06, 2020 at 08:28:03AM +0100, Will Deacon wrote:
> On Wed, May 06, 2020 at 09:24:25AM +0800, Ming Lei wrote:
> > On Tue, May 05, 2020 at 05:46:18PM +0200, Christoph Hellwig wrote:
> > > On Thu, Apr 30, 2020 at 10:02:54PM +0800, Ming Lei wrote:
> > > > BLK_MQ_S_INACTIVE is only set when the last cpu of this hctx is becoming
> > > > offline, and blk_mq_hctx_notify_offline() is called from cpu hotplug
> > > > handler. So if there is any request of this hctx submitted from somewhere,
> > > > it has to this last cpu. That is done by blk-mq's queue mapping.
> > > > 
> > > > In case of direct issue, basically blk_mq_get_driver_tag() is run after
> > > > the request is allocated, that is why I mentioned the chance of
> > > > migration is very small.
> > > 
> > > "very small" does not cut it, it has to be zero.  And it seems the
> > > new version still has this hack.
> > 
> > But smp_mb() is used for ordering the WRITE and READ, so it is correct.
> > 
> > barrier() is enough when process migration doesn't happen.
> 
> Without numbers I would just make the smp_mb() unconditional. Your
> questionable optimisation trades that for a load of the CPU ID and a
> conditional branch, which isn't obviously faster to me. It's also very

Reading the CPU ID is just a percpu read, and unlikely() is used to
optimize the conditional branch. smp_mb(), on the other hand, can stall
the CPU, I guess, so it should be much slower than reading the CPU ID.

See the attached microbenchmark[1]; the result shows that smp_mb() is
10+ times slower than smp_processor_id() plus one conditional branch.

[    1.239951] test_foo: smp_mb 738701907 smp_id 62904315 result 0 overflow 5120

The microbenchmark is run on a simple 8-core KVM guest, and the cpu is
'Model name:          Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz'.

The result is pretty stable across my 5 runs of VM boot.

> difficult to explain to people and relies on a bunch of implicit behaviour
> (e.g. racing only with CPU-affine hotplug notifier).

It can be documented easily.

> 
> If it turns out that the smp_mb() is worthwhile,  then I'd suggest improving
> the comment, perhaps to include the litmus test I cooked previously.

I have already added a big comment on this usage in V10.



[1] microbench

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 956106b01810..548eec11f922 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3836,8 +3836,47 @@ unsigned int blk_mq_rq_cpu(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_rq_cpu);
 
+static unsigned long test_smp_mb(unsigned long cnt)
+{
+	unsigned long start = local_clock();
+
+	while (cnt--)
+		smp_mb();
+
+	return local_clock() - start;
+}
+
+static unsigned long test_smp_id(unsigned long cnt, short *result, int *overflow)
+{
+	unsigned long start = local_clock();
+
+	while (cnt--) {
+		short cpu = smp_processor_id();
+		*result += cpu;
+		if (unlikely(*result == 0))
+			(*overflow)++;
+	}
+	return local_clock() - start;
+}
+
+static void test_foo(void)
+{
+	const unsigned long cnt = 10 << 24;
+	short result = 0;
+	int overflow = 0;
+	unsigned long v1, v2;
+
+	v1 = test_smp_mb(cnt);
+	v2 = test_smp_id(cnt, &result, &overflow);
+
+	printk("%s: smp_mb %lu smp_id %lu result %d overflow %d\n",
+			__func__, v1, v2, (int)result, overflow);
+}
+
 static int __init blk_mq_init(void)
 {
+	test_foo();
+
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
 	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
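
For anyone who wants to repeat the comparison without building a kernel,
here is a rough userspace analogue of the same measurement.
atomic_thread_fence() and sched_getcpu() are only stand-ins for smp_mb()
and smp_processor_id(), so the absolute numbers will differ, but the
shape of the comparison is the same.

#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

static unsigned long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
}

int main(void)
{
	const unsigned long cnt = 10UL << 24;
	unsigned long i, start, t_fence, t_cpuid, branches = 0;
	int sum = 0;

	/* Back-to-back full fences, standing in for smp_mb(). */
	start = now_ns();
	for (i = 0; i < cnt; i++)
		atomic_thread_fence(memory_order_seq_cst);
	t_fence = now_ns() - start;

	/* CPU id read plus one conditional branch, standing in for
	 * smp_processor_id() and the unlikely() check. */
	start = now_ns();
	for (i = 0; i < cnt; i++) {
		sum += sched_getcpu();
		if ((sum & 0xffff) == 0)
			branches++;
	}
	t_cpuid = now_ns() - start;

	printf("fence %lu ns, cpuid+branch %lu ns (sum %d, branches %lu)\n",
	       t_fence, t_cpuid, sum, branches);
	return 0;
}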


Thanks,
Ming


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-06  8:07                                         ` Ming Lei
@ 2020-05-06  9:56                                           ` Will Deacon
  2020-05-06 10:22                                             ` Ming Lei
  0 siblings, 1 reply; 81+ messages in thread
From: Will Deacon @ 2020-05-06  9:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Peter Zijlstra, Jens Axboe, linux-block,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck

On Wed, May 06, 2020 at 04:07:27PM +0800, Ming Lei wrote:
> On Wed, May 06, 2020 at 08:28:03AM +0100, Will Deacon wrote:
> > On Wed, May 06, 2020 at 09:24:25AM +0800, Ming Lei wrote:
> > > On Tue, May 05, 2020 at 05:46:18PM +0200, Christoph Hellwig wrote:
> > > > On Thu, Apr 30, 2020 at 10:02:54PM +0800, Ming Lei wrote:
> > > > > BLK_MQ_S_INACTIVE is only set when the last cpu of this hctx is becoming
> > > > > offline, and blk_mq_hctx_notify_offline() is called from cpu hotplug
> > > > > handler. So if there is any request of this hctx submitted from somewhere,
> > > > > it has to be from this last cpu. That is ensured by blk-mq's queue mapping.
> > > > > 
> > > > > In case of direct issue, basically blk_mq_get_driver_tag() is run after
> > > > > the request is allocated, that is why I mentioned the chance of
> > > > > migration is very small.
> > > > 
> > > > "very small" does not cut it, it has to be zero.  And it seems the
> > > > new version still has this hack.
> > > 
> > > But smp_mb() is used for ordering the WRITE and READ, so it is correct.
> > > 
> > > barrier() is enough when process migration doesn't happen.
> > 
> > Without numbers I would just make the smp_mb() unconditional. Your
> > questionable optimisation trades that for a load of the CPU ID and a
> > conditional branch, which isn't obviously faster to me. It's also very
> 
> The CPU ID is just percpu READ, and unlikely() has been used for
> optimizing the conditional branch. And smp_mb() could cause CPU stall, I
> guess, so it should be much slower than reading CPU ID.

Percpu accesses aren't uniformly cheap across architectures.

> Let's see the attached microbench[1], the result shows that smp_mb() is
> 10+ times slower than smp_processor_id() with one conditional branch.

Nobody said anything about smp_mb() in a tight loop, so this is hardly
surprising. Throughput of barrier instructions will hit a ceiling fairly
quickly, but they don't have to cause stalls in general use. I would expect
the numbers to converge if you added some back-off to the loops (e.g.
ndelay() or something). But I was really hoping for some numbers from the
block layer itself, since that's what we actually care about.
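
As a sketch of that suggestion (on top of the microbenchmark posted
earlier in the thread; the 100ns figure is arbitrary and ndelay() needs
<linux/delay.h>), the barrier loop could interleave a back-off between
iterations:

static unsigned long test_smp_mb_backoff(unsigned long cnt)
{
	unsigned long start = local_clock();

	while (cnt--) {
		/* spread the fences out instead of issuing them back to back */
		smp_mb();
		ndelay(100);
	}
	return local_clock() - start;
}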

> [    1.239951] test_foo: smp_mb 738701907 smp_id 62904315 result 0 overflow 5120
> 
> The micronbench is run on simple 8cores KVM guest, and cpu is
> 'Model name:          Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz'.
> 
> Result is pretty stable in my 5 runs of VM boot.

Honestly, I get the impression that you're not particularly happy with me
putting in the effort to review your patches, so I'll leave it up to
Christoph as to whether he wants to predicate the concurrency design on
a hokey microbenchmark.

FWIW: I agree that the code should work as you have it in v10, I just think
it's unnecessarily complicated and fragile.

/me goes to review other things

Will

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-06  9:56                                           ` Will Deacon
@ 2020-05-06 10:22                                             ` Ming Lei
  0 siblings, 0 replies; 81+ messages in thread
From: Ming Lei @ 2020-05-06 10:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: Christoph Hellwig, Peter Zijlstra, Jens Axboe, linux-block,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck

On Wed, May 06, 2020 at 10:56:10AM +0100, Will Deacon wrote:
> On Wed, May 06, 2020 at 04:07:27PM +0800, Ming Lei wrote:
> > On Wed, May 06, 2020 at 08:28:03AM +0100, Will Deacon wrote:
> > > On Wed, May 06, 2020 at 09:24:25AM +0800, Ming Lei wrote:
> > > > On Tue, May 05, 2020 at 05:46:18PM +0200, Christoph Hellwig wrote:
> > > > > On Thu, Apr 30, 2020 at 10:02:54PM +0800, Ming Lei wrote:
> > > > > > BLK_MQ_S_INACTIVE is only set when the last cpu of this hctx is becoming
> > > > > > offline, and blk_mq_hctx_notify_offline() is called from cpu hotplug
> > > > > > handler. So if there is any request of this hctx submitted from somewhere,
> > > > > > it has to be from this last cpu. That is ensured by blk-mq's queue mapping.
> > > > > > 
> > > > > > In case of direct issue, basically blk_mq_get_driver_tag() is run after
> > > > > > the request is allocated, that is why I mentioned the chance of
> > > > > > migration is very small.
> > > > > 
> > > > > "very small" does not cut it, it has to be zero.  And it seems the
> > > > > new version still has this hack.
> > > > 
> > > > But smp_mb() is used for ordering the WRITE and READ, so it is correct.
> > > > 
> > > > barrier() is enough when process migration doesn't happen.
> > > 
> > > Without numbers I would just make the smp_mb() unconditional. Your
> > > questionable optimisation trades that for a load of the CPU ID and a
> > > conditional branch, which isn't obviously faster to me. It's also very
> > 
> > The CPU ID is just percpu READ, and unlikely() has been used for
> > optimizing the conditional branch. And smp_mb() could cause CPU stall, I
> > guess, so it should be much slower than reading CPU ID.
> 
> Percpu accesses aren't uniformly cheap across architectures.

I believe percpu access is cheaper than smp_mb() on almost every SMP
arch; otherwise nobody would use percpu variables on that arch.

> 
> > Let's see the attached microbench[1], the result shows that smp_mb() is
> > 10+ times slower than smp_processor_id() with one conditional branch.
> 
> Nobody said anything about smp_mb() in a tight loop, so this is hardly
> surprising. Throughput of barrier instructions will hit a ceiling fairly
> quickly, but they don't have to cause stalls in general use. I would expect
> the numbers to converge if you added some back-off to the loops (e.g.
> ndelay() or something). But I was really hoping for some numbers from the
> block layer itself, since that's what we actually care about.

I believe that the microbenchmark is enough to show that smp_mb() is
much heavier and slower than smp_processor_id() plus a conditional
branch.

Some aio or io_uring workloads just burn CPU submitting IO without any
delay, and we don't want to spend extra CPU unnecessarily on the IO
submission side. IOPS may reach the level of millions or tens of
millions, and storage people have been working very hard to optimize
the whole IO path.

> 
> > [    1.239951] test_foo: smp_mb 738701907 smp_id 62904315 result 0 overflow 5120
> > 
> > The micronbench is run on simple 8cores KVM guest, and cpu is
> > 'Model name:          Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz'.
> > 
> > Result is pretty stable in my 5 runs of VM boot.
> 
> Honestly, I get the impression that you're not particularly happy with me
> putting in the effort to review your patches, so I'll leave it up to
> Christoph as to whether he wants to predicate the concurrency design on
> a hokey microbenchmark.
> 
> FWIW: I agree that the code should work as you have it in v10, I just think
> it's unnecessarily complicated and fragile.

Yeah, it works and it is correct, and we can document the usage. Another
point is that CPU hotplug doesn't happen frequently, so we shouldn't
introduce extra cost in the fast IO path just for handling CPU hotplug;
meanwhile, the cost of smp_mb() is not something that can be ignored,
especially on big machines with lots of CPU cores.
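
To make the shape of that trade-off concrete, here is a sketch of the
conditional-barrier idea (illustrative only, not the actual v10 code,
and hctx_may_issue() is a made-up name): the full barrier is only paid
when the request may have migrated away from its allocation CPU, which
is the only case that can race with the CPU-affine hotplug notifier.

static bool hctx_may_issue(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
	if (unlikely(blk_mq_rq_cpu(rq) != raw_smp_processor_id()))
		smp_mb();	/* pairs with the smp_mb() on the hotplug side */
	else
		barrier();	/* no migration: a compiler barrier is enough */

	return !test_bit(BLK_MQ_S_INACTIVE, &hctx->state);
}

In the common case the cost is one percpu read and a predicted-not-taken
branch instead of a full barrier on every driver tag allocation.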

Thanks,
Ming


^ permalink raw reply	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2020-05-06 10:23 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
2020-04-24 10:23 ` [PATCH V8 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone Ming Lei
2020-04-24 10:32   ` Christoph Hellwig
2020-04-24 12:43   ` Hannes Reinecke
2020-04-24 16:11   ` Martin K. Petersen
2020-04-24 10:23 ` [PATCH V8 02/11] block: add helper for copying request Ming Lei
2020-04-24 10:23   ` Ming Lei
2020-04-24 10:35   ` Christoph Hellwig
2020-04-24 12:43   ` Hannes Reinecke
2020-04-24 16:12   ` Martin K. Petersen
2020-04-24 10:23 ` [PATCH V8 03/11] blk-mq: mark blk_mq_get_driver_tag as static Ming Lei
2020-04-24 12:44   ` Hannes Reinecke
2020-04-24 16:13   ` Martin K. Petersen
2020-04-24 10:23 ` [PATCH V8 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
2020-04-24 10:35   ` Christoph Hellwig
2020-04-24 13:02   ` Hannes Reinecke
2020-04-25  2:54     ` Ming Lei
2020-04-25 18:26       ` Hannes Reinecke
2020-04-24 10:23 ` [PATCH V8 05/11] blk-mq: support rq filter callback when iterating rqs Ming Lei
2020-04-24 13:17   ` Hannes Reinecke
2020-04-25  3:04     ` Ming Lei
2020-04-24 10:23 ` [PATCH V8 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
2020-04-24 13:23   ` Hannes Reinecke
2020-04-25  3:24     ` Ming Lei
2020-04-24 10:23 ` [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
2020-04-24 10:38   ` Christoph Hellwig
2020-04-25  3:17     ` Ming Lei
2020-04-25  8:32       ` Christoph Hellwig
2020-04-25  9:34         ` Ming Lei
2020-04-25  9:53           ` Ming Lei
2020-04-25 15:48             ` Christoph Hellwig
2020-04-26  2:06               ` Ming Lei
2020-04-26  8:19                 ` John Garry
2020-04-27 15:36                 ` Christoph Hellwig
2020-04-28  1:10                   ` Ming Lei
2020-04-27 19:03               ` Paul E. McKenney
2020-04-28  6:54                 ` Christoph Hellwig
2020-04-28 15:58               ` Peter Zijlstra
2020-04-29  2:16                 ` Ming Lei
2020-04-29  8:07                   ` Will Deacon
2020-04-29  9:46                     ` Ming Lei
2020-04-29 12:27                       ` Will Deacon
2020-04-29 13:43                         ` Ming Lei
2020-04-29 17:34                           ` Will Deacon
2020-04-30  0:39                             ` Ming Lei
2020-04-30 11:04                               ` Will Deacon
2020-04-30 14:02                                 ` Ming Lei
2020-05-05 15:46                                   ` Christoph Hellwig
2020-05-06  1:24                                     ` Ming Lei
2020-05-06  7:28                                       ` Will Deacon
2020-05-06  8:07                                         ` Ming Lei
2020-05-06  9:56                                           ` Will Deacon
2020-05-06 10:22                                             ` Ming Lei
2020-04-29 17:46                           ` Paul E. McKenney
2020-04-30  0:43                             ` Ming Lei
2020-04-24 13:27   ` Hannes Reinecke
2020-04-25  3:30     ` Ming Lei
2020-04-24 13:42   ` John Garry
2020-04-25  3:41     ` Ming Lei
2020-04-24 10:23 ` [PATCH V8 08/11] block: add blk_end_flush_machinery Ming Lei
2020-04-24 10:41   ` Christoph Hellwig
2020-04-25  3:44     ` Ming Lei
2020-04-25  8:11       ` Christoph Hellwig
2020-04-25  9:51         ` Ming Lei
2020-04-24 13:47   ` Hannes Reinecke
2020-04-25  3:47     ` Ming Lei
2020-04-24 10:23 ` [PATCH V8 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead Ming Lei
2020-04-24 10:42   ` Christoph Hellwig
2020-04-25  3:48     ` Ming Lei
2020-04-24 13:48   ` Hannes Reinecke
2020-04-24 10:23 ` [PATCH V8 10/11] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
2020-04-24 10:44   ` Christoph Hellwig
2020-04-25  3:52     ` Ming Lei
2020-04-24 13:55   ` Hannes Reinecke
2020-04-25  3:59     ` Ming Lei
2020-04-24 10:23 ` [PATCH V8 11/11] block: deactivate hctx when the hctx is actually inactive Ming Lei
2020-04-24 10:43   ` Christoph Hellwig
2020-04-24 13:56   ` Hannes Reinecke
2020-04-24 15:23 ` [PATCH V8 00/11] blk-mq: improvement CPU hotplug Jens Axboe
2020-04-24 15:40   ` Christoph Hellwig
2020-04-24 15:41     ` Jens Axboe
