linux-block.vger.kernel.org archive mirror
* [PATCH V2 0/8]: blk-mq: use static_rqs to iterate busy tags
@ 2019-03-25  5:38 Jianchao Wang
  2019-03-25  5:38 ` [PATCH V2 1/8] blk-mq: get rid of the synchronize_rcu in __blk_mq_update_nr_hw_queues Jianchao Wang
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Jianchao Wang @ 2019-03-25  5:38 UTC (permalink / raw)
  To: axboe
  Cc: hch, jthumshirn, hare, josef, bvanassche, sagi, keith.busch,
	jsmart2021, linux-block, linux-nvme, linux-kernel

As we know, there is a risk of accessing stale requests when iterating
in-flight requests via tags->rqs[]. This has been discussed in the
following threads:
[1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
[2] https://marc.info/?l=linux-block&m=154526189023236&w=2

A typical scenario could be:
blk_mq_get_request         blk_mq_queue_tag_busy_iter
  -> blk_mq_get_tag
                             -> bt_for_each
                               -> bt_iter
                                 -> rq = tags->rqs[]
                                 -> rq->q
  -> blk_mq_rq_ctx_init
    -> data->hctx->tags->rqs[rq->tag] = rq;

The root cause is that there is a window between setting the bit in the
tag sbitmap and setting tags->rqs[].

This patchset fixes the issue by iterating requests with tags->static_rqs[]
instead of tags->rqs[], which changes dynamically. Moreover, we try to get
a non-zero q_usage_counter before accessing hctxs and tags, and thus avoid
racing with updating nr_hw_queues, switching the io scheduler and even
queue cleanup, all of which happen under a frozen and drained queue.

The 1st patch gets rid of the now-useless synchronize_rcu in __blk_mq_update_nr_hw_queues.

The 2nd patch modifies blk_mq_queue_tag_busy_iter to use tags->static_rqs[]
instead of tags->rqs[] to iterate the busy tags.

The 3rd ~ 7th patches change callers of blk_mq_tagset_busy_iter over to
blk_mq_queue_tag_inflight_iter, which is safer.

The 8th patch gets rid of blk_mq_tagset_busy_iter.

Change log

V1 -> V2:
  - Add a wrapper to hide the "inflight" parameter from users, based on Sagi's suggestion.
  - Other misc changes to comments.

Jianchao Wang (8):
 blk-mq: get rid of the synchronize_rcu in __blk_mq_update_nr_hw_queues
 blk-mq: use static_rqs instead of rqs to iterate tags
 blk-mq: use blk_mq_queue_tag_inflight_iter in debugfs
 mtip32xx: use blk_mq_queue_tag_inflight_iter
 nbd: use blk_mq_queue_tag_inflight_iter
 skd: use blk_mq_queue_tag_inflight_iter
 nvme: use blk_mq_queue_tag_inflight_iter
 blk-mq: remove blk_mq_tagset_busy_iter

diff stat

 block/blk-mq-debugfs.c            |   2 +-
 block/blk-mq-tag.c                | 193 ++++++++++++++------------------------
 block/blk-mq-tag.h                |   4 +-
 block/blk-mq.c                    |  31 ++----
 drivers/block/mtip32xx/mtip32xx.c |   6 +-
 drivers/block/nbd.c               |   2 +-
 drivers/block/skd_main.c          |   4 +-
 drivers/nvme/host/core.c          |  12 +++
 drivers/nvme/host/fc.c            |  10 +-
 drivers/nvme/host/nvme.h          |   2 +
 drivers/nvme/host/pci.c           |   5 +-
 drivers/nvme/host/rdma.c          |   4 +-
 drivers/nvme/host/tcp.c           |   5 +-
 drivers/nvme/target/loop.c        |   4 +-
 include/linux/blk-mq.h            |   7 +-
 15 files changed, 119 insertions(+), 172 deletions(-)

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V2 1/8] blk-mq: get rid of the synchronize_rcu in __blk_mq_update_nr_hw_queues
  2019-03-25  5:38 [PATCH V2 0/8]: blk-mq: use static_rqs to iterate busy tags Jianchao Wang
@ 2019-03-25  5:38 ` Jianchao Wang
  2019-03-25  5:38 ` [PATCH V2 2/8] blk-mq: use static_rqs instead of rqs to iterate tags Jianchao Wang
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2019-03-25  5:38 UTC (permalink / raw)
  To: axboe
  Cc: hch, jthumshirn, hare, josef, bvanassche, sagi, keith.busch,
	jsmart2021, linux-block, linux-nvme, linux-kernel

In commit 530ca2c ("blk-mq: Allow blocking queue tag iter callbacks"),
we try to get a non-zero q_usage_counter to avoid accessing hctxs that
are being modified. So the synchronize_rcu is useless and should be
removed.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 block/blk-mq-tag.c | 4 +---
 block/blk-mq.c     | 4 ----
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a4931fc..5f28002 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -384,9 +384,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	/*
 	 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
 	 * while the queue is frozen. So we can use q_usage_counter to avoid
-	 * racing with it. __blk_mq_update_nr_hw_queues() uses
-	 * synchronize_rcu() to ensure this function left the critical section
-	 * below.
+	 * racing with it.
 	 */
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a9c1816..2102d91 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3211,10 +3211,6 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_freeze_queue(q);
 	/*
-	 * Sync with blk_mq_queue_tag_busy_iter.
-	 */
-	synchronize_rcu();
-	/*
 	 * Switch IO scheduler to 'none', cleaning up the data associated
 	 * with the previous scheduler. We will switch back once we are done
 	 * updating the new sw to hw queue mappings.
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 2/8] blk-mq: use static_rqs instead of rqs to iterate tags
  2019-03-25  5:38 [PATCH V2 0/8]: blk-mq: use static_rqs to iterate busy tags Jianchao Wang
  2019-03-25  5:38 ` [PATCH V2 1/8] blk-mq: get rid of the synchronize_rcu in __blk_mq_update_nr_hw_queues Jianchao Wang
@ 2019-03-25  5:38 ` Jianchao Wang
  2019-03-25  7:12   ` Dongli Zhang
  2019-03-25  5:38 ` [PATCH V2 3/8] blk-mq: use blk_mq_queue_tag_inflight_iter in debugfs Jianchao Wang
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Jianchao Wang @ 2019-03-25  5:38 UTC (permalink / raw)
  To: axboe
  Cc: hch, jthumshirn, hare, josef, bvanassche, sagi, keith.busch,
	jsmart2021, linux-block, linux-nvme, linux-kernel

tags->rqs[] is not cleared when a driver tag is freed, to avoid an
extra store on a shared area in the per-io path. But there is a
window between getting a driver tag and writing tags->rqs[], so we
may see a stale rq in tags->rqs[] which may have been freed, as in
the following case:
blk_mq_get_request         blk_mq_queue_tag_busy_iter
  -> blk_mq_get_tag
                             -> bt_for_each
                               -> bt_iter
                                 -> rq = tags->rqs[]
                                 -> rq->q
  -> blk_mq_rq_ctx_init
    -> data->hctx->tags->rqs[rq->tag] = rq;

In addition, tags->rqs[] only contains the requests that have a
driver tag. It is not accurate for the io-scheduler case when
accounting busy tags in part_in_flight.

To fix both, this patch changes blk_mq_queue_tag_busy_iter to use
tags->static_rqs[] instead of tags->rqs[]. We have to check whether an
io scheduler is attached to decide between hctx->tags and
hctx->sched_tags. And we try to get a non-zero q_usage_counter before
that, to avoid racing with updating nr_hw_queues, switching the io
scheduler and even queue cleanup.

Add an 'inflight' parameter to determine whether to iterate in-flight
requests or all busy tags, add a new helper interface
blk_mq_queue_tag_inflight_iter to iterate all of the in-flight
tags, and export this interface for drivers.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 block/blk-mq-tag.c     | 96 ++++++++++++++++++++++++++++++++++++--------------
 block/blk-mq-tag.h     |  4 +--
 block/blk-mq.c         | 27 +++++---------
 include/linux/blk-mq.h |  5 +--
 4 files changed, 83 insertions(+), 49 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 5f28002..bf7b235 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -216,26 +216,38 @@ struct bt_iter_data {
 	busy_iter_fn *fn;
 	void *data;
 	bool reserved;
+	bool inflight;
 };
 
 static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 {
 	struct bt_iter_data *iter_data = data;
 	struct blk_mq_hw_ctx *hctx = iter_data->hctx;
-	struct blk_mq_tags *tags = hctx->tags;
 	bool reserved = iter_data->reserved;
+	struct blk_mq_tags *tags;
 	struct request *rq;
 
+	tags = hctx->sched_tags ? hctx->sched_tags : hctx->tags;
+
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	rq = tags->rqs[bitnr];
+	/*
+	 * Because tags->rqs[] is not cleared when a driver tag is freed,
+	 * and there is a window between getting a driver tag and writing
+	 * tags->rqs[], we may see a stale rq in tags->rqs[] which may
+	 * have been freed. Using static_rqs[] is safer.
+	 */
+	rq = tags->static_rqs[bitnr];
 
 	/*
-	 * We can hit rq == NULL here, because the tagging functions
-	 * test and set the bit before assigning ->rqs[].
+	 * There is a small window between getting a tag and
+	 * blk_mq_rq_ctx_init, so rq->q and rq->mq_hctx may be different.
 	 */
-	if (rq && rq->q == hctx->queue)
-		return iter_data->fn(hctx, rq, iter_data->data, reserved);
+	if (rq && rq->q == hctx->queue &&
+	    rq->mq_hctx == hctx &&
+	    (!iter_data->inflight ||
+	     blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
+		return iter_data->fn(rq, iter_data->data, reserved);
 	return true;
 }
 
@@ -246,7 +258,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
  *		or the bitmap_tags member of struct blk_mq_tags.
  * @fn:		Pointer to the function that will be called for each request
  *		associated with @hctx that has been assigned a driver tag.
- *		@fn will be called as follows: @fn(@hctx, rq, @data, @reserved)
+ *		@fn will be called as follows: @fn(rq, @data, @reserved)
  *		where rq is a pointer to a request. Return true to continue
  *		iterating tags, false to stop.
  * @data:	Will be passed as third argument to @fn.
@@ -254,13 +266,14 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
  *		bitmap_tags member of struct blk_mq_tags.
  */
 static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
-			busy_iter_fn *fn, void *data, bool reserved)
+			busy_iter_fn *fn, void *data, bool reserved, bool inflight)
 {
 	struct bt_iter_data iter_data = {
 		.hctx = hctx,
 		.fn = fn,
 		.data = data,
 		.reserved = reserved,
+		.inflight = inflight,
 	};
 
 	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
@@ -362,36 +375,33 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
 /**
- * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag
+ * __blk_mq_queue_tag_busy_iter - iterate over all busy or inflight tags
  * @q:		Request queue to examine.
- * @fn:		Pointer to the function that will be called for each request
- *		on @q. @fn will be called as follows: @fn(hctx, rq, @priv,
- *		reserved) where rq is a pointer to a request and hctx points
- *		to the hardware queue associated with the request. 'reserved'
- *		indicates whether or not @rq is a reserved request.
+ * @fn:		Pointer to the function that will be called for each
+ * 		in-flight request issued by @q. @fn will be called as
+ * 		follows:
+ * 		@fn(rq, @priv, reserved)
+ * 		rq is a pointer to a request. 'reserved' indicates whether or
+ * 		not @rq is a reserved request.
  * @priv:	Will be passed as third argument to @fn.
- *
- * Note: if @q->tag_set is shared with other request queues then @fn will be
- * called for all requests on all queues that share that tag set and not only
- * for requests associated with @q.
  */
-void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
-		void *priv)
+static void __blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
+		void *priv, bool inflight)
 {
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
 	/*
-	 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
-	 * while the queue is frozen. So we can use q_usage_counter to avoid
-	 * racing with it.
+	 * Get a reference on the queue unless its refcount has already
+	 * dropped to zero. We use this to avoid racing with code that
+	 * modifies the hctxs after the queue has been frozen and drained,
+	 * including updating nr_hw_queues, io scheduler switching and
+	 * queue cleanup.
 	 */
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
-		struct blk_mq_tags *tags = hctx->tags;
-
+		struct blk_mq_tags *tags;
 		/*
 		 * If no software queues are currently mapped to this
 		 * hardware queue, there's nothing to check
@@ -399,13 +409,45 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		if (!blk_mq_hw_queue_mapped(hctx))
 			continue;
 
+		tags = hctx->sched_tags ? hctx->sched_tags : hctx->tags;
+
 		if (tags->nr_reserved_tags)
-			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
-		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+			bt_for_each(hctx, &tags->breserved_tags,
+				    fn, priv, true, inflight);
+		bt_for_each(hctx, &tags->bitmap_tags,
+			    fn, priv, false, inflight);
+		/*
+		 * flush_rq represents the rq with REQ_PREFLUSH and REQ_FUA
+		 * (if FUA is not supported by the device) to be issued to
+		 * the device. We need to consider it when iterating in-flight
+		 * rqs, but need not count it when iterating busy tags.
+		 */
+		if (inflight &&
+		    blk_mq_rq_state(hctx->fq->flush_rq) == MQ_RQ_IN_FLIGHT)
+			fn(hctx->fq->flush_rq, priv, false);
 	}
 	blk_queue_exit(q);
 }
 
+/*
+ * Iterate all the busy tags including pending and in-flight ones.
+ */
+void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
+				void *priv)
+{
+	__blk_mq_queue_tag_busy_iter(q, fn, priv, false);
+}
+
+/*
+ * Iterate all the inflight tags.
+ */
+void blk_mq_queue_tag_inflight_iter(struct request_queue *q,
+				    busy_iter_fn *fn, void *priv)
+{
+	__blk_mq_queue_tag_busy_iter(q, fn, priv, true);
+}
+EXPORT_SYMBOL(blk_mq_queue_tag_inflight_iter);
+
 static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
 		    bool round_robin, int node)
 {
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0..bcbb699 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -32,9 +32,9 @@ extern bool blk_mq_has_free_tags(struct blk_mq_tags *tags);
 extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_tags **tags,
 					unsigned int depth, bool can_grow);
-extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
-void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
+extern void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2102d91..531522e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -93,8 +93,7 @@ struct mq_inflight {
 	unsigned int *inflight;
 };
 
-static bool blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
-				  struct request *rq, void *priv,
+static bool blk_mq_check_inflight(struct request *rq, void *priv,
 				  bool reserved)
 {
 	struct mq_inflight *mi = priv;
@@ -119,8 +118,7 @@ unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part)
 	return inflight[0];
 }
 
-static bool blk_mq_check_inflight_rw(struct blk_mq_hw_ctx *hctx,
-				     struct request *rq, void *priv,
+static bool blk_mq_check_inflight_rw(struct request *rq, void *priv,
 				     bool reserved)
 {
 	struct mq_inflight *mi = priv;
@@ -809,28 +807,22 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 }
 EXPORT_SYMBOL(blk_mq_tag_to_rq);
 
-static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq,
-			       void *priv, bool reserved)
+static bool blk_mq_rq_inflight(struct request *rq, void *priv, bool reserved)
 {
+	bool *busy = priv;
 	/*
 	 * If we find a request that is inflight and the queue matches,
 	 * we know the queue is busy. Return false to stop the iteration.
 	 */
-	if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) {
-		bool *busy = priv;
-
-		*busy = true;
-		return false;
-	}
-
-	return true;
+	*busy = true;
+	return false;
 }
 
 bool blk_mq_queue_inflight(struct request_queue *q)
 {
 	bool busy = false;
 
-	blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight, &busy);
+	blk_mq_queue_tag_inflight_iter(q, blk_mq_rq_inflight, &busy);
 	return busy;
 }
 EXPORT_SYMBOL_GPL(blk_mq_queue_inflight);
@@ -870,8 +862,7 @@ static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
 	return false;
 }
 
-static bool blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
-		struct request *rq, void *priv, bool reserved)
+static bool blk_mq_check_expired(struct request *rq, void *priv, bool reserved)
 {
 	unsigned long *next = priv;
 
@@ -932,7 +923,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
-	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+	blk_mq_queue_tag_inflight_iter(q, blk_mq_check_expired, &next);
 
 	if (next != 0) {
 		mod_timer(&q->timeout, next);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b0c814b..dff9bb6 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -132,8 +132,7 @@ typedef int (init_request_fn)(struct blk_mq_tag_set *set, struct request *,
 typedef void (exit_request_fn)(struct blk_mq_tag_set *set, struct request *,
 		unsigned int);
 
-typedef bool (busy_iter_fn)(struct blk_mq_hw_ctx *, struct request *, void *,
-		bool);
+typedef bool (busy_iter_fn)(struct request *, void *, bool);
 typedef bool (busy_tag_iter_fn)(struct request *, void *, bool);
 typedef int (poll_fn)(struct blk_mq_hw_ctx *);
 typedef int (map_queues_fn)(struct blk_mq_tag_set *set);
@@ -321,6 +320,8 @@ bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
+void blk_mq_queue_tag_inflight_iter(struct request_queue *q,
+		busy_iter_fn *fn, void *priv);
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 3/8] blk-mq: use blk_mq_queue_tag_inflight_iter in debugfs
  2019-03-25  5:38 [PATCH V2 0/8]: blk-mq: use static_rqs to iterate busy tags Jianchao Wang
  2019-03-25  5:38 ` [PATCH V2 1/8] blk-mq: get rid of the synchronize_rcu in __blk_mq_update_nr_hw_queues Jianchao Wang
  2019-03-25  5:38 ` [PATCH V2 2/8] blk-mq: use static_rqs instead of rqs to iterate tags Jianchao Wang
@ 2019-03-25  5:38 ` Jianchao Wang
  2019-03-25  5:38 ` [PATCH V2 4/8] mtip32xx: use blk_mq_queue_tag_inflight_iter Jianchao Wang
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2019-03-25  5:38 UTC (permalink / raw)
  To: axboe
  Cc: hch, jthumshirn, hare, josef, bvanassche, sagi, keith.busch,
	jsmart2021, linux-block, linux-nvme, linux-kernel

blk_mq_tagset_busy_iter is not safe in that it could return a stale
request from tags->rqs[]. Use blk_mq_queue_tag_inflight_iter here instead.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 block/blk-mq-debugfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index ec1d18c..b7f2538 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -438,7 +438,7 @@ static int hctx_busy_show(void *data, struct seq_file *m)
 	struct blk_mq_hw_ctx *hctx = data;
 	struct show_busy_params params = { .m = m, .hctx = hctx };
 
-	blk_mq_tagset_busy_iter(hctx->queue->tag_set, hctx_show_busy_rq,
+	blk_mq_queue_tag_inflight_iter(hctx->queue, hctx_show_busy_rq,
 				&params);
 
 	return 0;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 4/8] mtip32xx: use blk_mq_queue_tag_inflight_iter
  2019-03-25  5:38 [PATCH V2 0/8]: blk-mq: use static_rqs to iterate busy tags Jianchao Wang
                   ` (2 preceding siblings ...)
  2019-03-25  5:38 ` [PATCH V2 3/8] blk-mq: use blk_mq_queue_tag_inflight_iter in debugfs Jianchao Wang
@ 2019-03-25  5:38 ` Jianchao Wang
  2019-03-25  5:38 ` [PATCH V2 5/8] nbd: " Jianchao Wang
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2019-03-25  5:38 UTC (permalink / raw)
  To: axboe
  Cc: hch, jthumshirn, hare, josef, bvanassche, sagi, keith.busch,
	jsmart2021, linux-block, linux-nvme, linux-kernel

blk_mq_tagset_busy_iter is not safe in that it could return a stale
request from tags->rqs[]. Use blk_mq_queue_tag_inflight_iter here instead.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/block/mtip32xx/mtip32xx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 83302ec..103e691 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -2771,12 +2771,12 @@ static int mtip_service_thread(void *data)
 
 			blk_mq_quiesce_queue(dd->queue);
 
-			blk_mq_tagset_busy_iter(&dd->tags, mtip_queue_cmd, dd);
+			blk_mq_queue_tag_inflight_iter(dd->queue, mtip_queue_cmd, dd);
 
 			set_bit(MTIP_PF_ISSUE_CMDS_BIT, &dd->port->flags);
 
 			if (mtip_device_reset(dd))
-				blk_mq_tagset_busy_iter(&dd->tags,
+				blk_mq_queue_tag_inflight_iter(dd->queue,
 							mtip_abort_cmd, dd);
 
 			clear_bit(MTIP_PF_TO_ACTIVE_BIT, &dd->port->flags);
@@ -3903,7 +3903,7 @@ static int mtip_block_remove(struct driver_data *dd)
 
 	blk_freeze_queue_start(dd->queue);
 	blk_mq_quiesce_queue(dd->queue);
-	blk_mq_tagset_busy_iter(&dd->tags, mtip_no_dev_cleanup, dd);
+	blk_mq_queue_tag_inflight_iter(dd->queue, mtip_no_dev_cleanup, dd);
 	blk_mq_unquiesce_queue(dd->queue);
 
 	/*
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 5/8] nbd: use blk_mq_queue_tag_inflight_iter
  2019-03-25  5:38 [PATCH V2 0/8]: blk-mq: use static_rqs to iterate busy tags Jianchao Wang
                   ` (3 preceding siblings ...)
  2019-03-25  5:38 ` [PATCH V2 4/8] mtip32xx: use blk_mq_queue_tag_inflight_iter Jianchao Wang
@ 2019-03-25  5:38 ` Jianchao Wang
  2019-03-25  5:38 ` [PATCH V2 6/8] skd: " Jianchao Wang
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2019-03-25  5:38 UTC (permalink / raw)
  To: axboe
  Cc: hch, jthumshirn, hare, josef, bvanassche, sagi, keith.busch,
	jsmart2021, linux-block, linux-nvme, linux-kernel

blk_mq_tagset_busy_iter is not safe in that it could return a stale
request from tags->rqs[]. Use blk_mq_queue_tag_inflight_iter here instead.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/block/nbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 90ba9f4..c48984b 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -747,7 +747,7 @@ static bool nbd_clear_req(struct request *req, void *data, bool reserved)
 static void nbd_clear_que(struct nbd_device *nbd)
 {
 	blk_mq_quiesce_queue(nbd->disk->queue);
-	blk_mq_tagset_busy_iter(&nbd->tag_set, nbd_clear_req, NULL);
+	blk_mq_queue_tag_inflight_iter(nbd->disk->queue, nbd_clear_req, NULL);
 	blk_mq_unquiesce_queue(nbd->disk->queue);
 	dev_dbg(disk_to_dev(nbd->disk), "queue cleared\n");
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 6/8] skd: use blk_mq_queue_tag_inflight_iter
  2019-03-25  5:38 [PATCH V2 0/8]: blk-mq: use static_rqs to iterate busy tags Jianchao Wang
                   ` (4 preceding siblings ...)
  2019-03-25  5:38 ` [PATCH V2 5/8] nbd: " Jianchao Wang
@ 2019-03-25  5:38 ` Jianchao Wang
  2019-03-25  5:38 ` [PATCH V2 7/8] nvme: " Jianchao Wang
  2019-03-25  5:38 ` [PATCH V2 8/8] blk-mq: remove blk_mq_tagset_busy_iter Jianchao Wang
  7 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2019-03-25  5:38 UTC (permalink / raw)
  To: axboe
  Cc: hch, jthumshirn, hare, josef, bvanassche, sagi, keith.busch,
	jsmart2021, linux-block, linux-nvme, linux-kernel

blk_mq_tagset_busy_iter is not safe in that it could return a stale
request from tags->rqs[]. Use blk_mq_queue_tag_inflight_iter here instead.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/block/skd_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index 7d3ad6c..0213b19 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -395,7 +395,7 @@ static int skd_in_flight(struct skd_device *skdev)
 {
 	int count = 0;
 
-	blk_mq_tagset_busy_iter(&skdev->tag_set, skd_inc_in_flight, &count);
+	blk_mq_queue_tag_inflight_iter(skdev->queue, skd_inc_in_flight, &count);
 
 	return count;
 }
@@ -1916,7 +1916,7 @@ static bool skd_recover_request(struct request *req, void *data, bool reserved)
 
 static void skd_recover_requests(struct skd_device *skdev)
 {
-	blk_mq_tagset_busy_iter(&skdev->tag_set, skd_recover_request, skdev);
+	blk_mq_queue_tag_inflight_iter(skdev->queue, skd_recover_request, skdev);
 }
 
 static void skd_isr_msg_from_dev(struct skd_device *skdev)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-25  5:38 [PATCH V2 0/8]: blk-mq: use static_rqs to iterate busy tags Jianchao Wang
                   ` (5 preceding siblings ...)
  2019-03-25  5:38 ` [PATCH V2 6/8] skd: " Jianchao Wang
@ 2019-03-25  5:38 ` Jianchao Wang
  2019-03-25 13:49   ` Keith Busch
  2019-03-25  5:38 ` [PATCH V2 8/8] blk-mq: remove blk_mq_tagset_busy_iter Jianchao Wang
  7 siblings, 1 reply; 28+ messages in thread
From: Jianchao Wang @ 2019-03-25  5:38 UTC (permalink / raw)
  To: axboe
  Cc: hch, jthumshirn, hare, josef, bvanassche, sagi, keith.busch,
	jsmart2021, linux-block, linux-nvme, linux-kernel

blk_mq_tagset_busy_iter is not safe in that it could return a stale
request from tags->rqs[]. Use blk_mq_queue_tag_inflight_iter here instead.
A new helper interface, nvme_iterate_inflight_rqs, is introduced to
iterate all of the namespaces under a ctrl.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/core.c   | 12 ++++++++++++
 drivers/nvme/host/fc.c     | 10 +++++-----
 drivers/nvme/host/nvme.h   |  2 ++
 drivers/nvme/host/pci.c    |  5 +++--
 drivers/nvme/host/rdma.c   |  4 ++--
 drivers/nvme/host/tcp.c    |  5 +++--
 drivers/nvme/target/loop.c |  4 ++--
 7 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4706019..d6c53fe 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3874,6 +3874,18 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
+void nvme_iterate_inflight_rqs(struct nvme_ctrl *ctrl,
+		busy_iter_fn *fn, void *data)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_mq_queue_tag_inflight_iter(ns->queue, fn, data);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_iterate_inflight_rqs);
+
 int __init nvme_core_init(void)
 {
 	int result = -ENOMEM;
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index f3b9d91..667da72 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2367,7 +2367,7 @@ nvme_fc_complete_rq(struct request *rq)
 /*
  * This routine is used by the transport when it needs to find active
  * io on a queue that is to be terminated. The transport uses
- * blk_mq_tagset_busy_itr() to find the busy requests, which then invoke
+ * blk_mq_queue_tag_inflight_iter() to find the busy requests, which then invoke
  * this routine to kill them on a 1 by 1 basis.
  *
  * As FC allocates FC exchange for each io, the transport must contact
@@ -2740,7 +2740,7 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
 	 * If io queues are present, stop them and terminate all outstanding
 	 * ios on them. As FC allocates FC exchange for each io, the
 	 * transport must contact the LLDD to terminate the exchange,
-	 * thus releasing the FC exchange. We use blk_mq_tagset_busy_itr()
+	 * thus releasing the FC exchange. We use blk_mq_queue_tag_inflight_iter
 	 * to tell us what io's are busy and invoke a transport routine
 	 * to kill them with the LLDD.  After terminating the exchange
 	 * the LLDD will call the transport's normal io done path, but it
@@ -2750,7 +2750,7 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
 	 */
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
-		blk_mq_tagset_busy_iter(&ctrl->tag_set,
+		nvme_iterate_inflight_rqs(&ctrl->ctrl,
 				nvme_fc_terminate_exchange, &ctrl->ctrl);
 	}
 
@@ -2768,11 +2768,11 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
 
 	/*
 	 * clean up the admin queue. Same thing as above.
-	 * use blk_mq_tagset_busy_itr() and the transport routine to
+	 * use blk_mq_queue_tag_inflight_iter() and the transport routine to
 	 * terminate the exchanges.
 	 */
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
-	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
+	blk_mq_queue_tag_inflight_iter(ctrl->ctrl.admin_q,
 				nvme_fc_terminate_exchange, &ctrl->ctrl);
 
 	/* kill the aens as they are a separate path */
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 527d645..4c6bc803 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -445,6 +445,8 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
 void nvme_start_freeze(struct nvme_ctrl *ctrl);
+void nvme_iterate_inflight_rqs(struct nvme_ctrl *ctrl,
+		busy_iter_fn *fn, void *data);
 
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a90cf5d..96faa36 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2430,8 +2430,9 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	nvme_suspend_queue(&dev->queues[0]);
 	nvme_pci_disable(dev);
 
-	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
-	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
+	nvme_iterate_inflight_rqs(&dev->ctrl, nvme_cancel_request, &dev->ctrl);
+	blk_mq_queue_tag_inflight_iter(dev->ctrl.admin_q,
+			nvme_cancel_request, &dev->ctrl);
 
 	/*
 	 * The driver will not be starting up queues again if shutting down so
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 11a5eca..5660200 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -914,7 +914,7 @@ static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
 {
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
 	nvme_rdma_stop_queue(&ctrl->queues[0]);
-	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set, nvme_cancel_request,
+	blk_mq_queue_tag_inflight_iter(ctrl->ctrl.admin_q, nvme_cancel_request,
 			&ctrl->ctrl);
 	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
 	nvme_rdma_destroy_admin_queue(ctrl, remove);
@@ -926,7 +926,7 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
 		nvme_rdma_stop_io_queues(ctrl);
-		blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request,
+		nvme_iterate_inflight_rqs(&ctrl->ctrl, nvme_cancel_request,
 				&ctrl->ctrl);
 		if (remove)
 			nvme_start_queues(&ctrl->ctrl);
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index e7e0888..4c825dc 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1710,7 +1710,8 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
 {
 	blk_mq_quiesce_queue(ctrl->admin_q);
 	nvme_tcp_stop_queue(ctrl, 0);
-	blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_cancel_request, ctrl);
+	blk_mq_queue_tag_inflight_iter(ctrl->admin_q,
+			nvme_cancel_request, ctrl);
 	blk_mq_unquiesce_queue(ctrl->admin_q);
 	nvme_tcp_destroy_admin_queue(ctrl, remove);
 }
@@ -1722,7 +1723,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 		return;
 	nvme_stop_queues(ctrl);
 	nvme_tcp_stop_io_queues(ctrl);
-	blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl);
+	nvme_iterate_inflight_rqs(ctrl, nvme_cancel_request, ctrl);
 	if (remove)
 		nvme_start_queues(ctrl);
 	nvme_tcp_destroy_io_queues(ctrl, remove);
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index b9f623a..50d7288 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -421,7 +421,7 @@ static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
 {
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
-		blk_mq_tagset_busy_iter(&ctrl->tag_set,
+		nvme_iterate_inflight_rqs(&ctrl->ctrl,
 					nvme_cancel_request, &ctrl->ctrl);
 		nvme_loop_destroy_io_queues(ctrl);
 	}
@@ -430,7 +430,7 @@ static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
 		nvme_shutdown_ctrl(&ctrl->ctrl);
 
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
-	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
+	blk_mq_queue_tag_inflight_iter(ctrl->ctrl.admin_q,
 				nvme_cancel_request, &ctrl->ctrl);
 	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
 	nvme_loop_destroy_admin_queue(ctrl);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 8/8] blk-mq: remove blk_mq_tagset_busy_iter
  2019-03-25  5:38 [PATCH V2 0/8]: blk-mq: use static_rqs to iterate busy tags Jianchao Wang
                   ` (6 preceding siblings ...)
  2019-03-25  5:38 ` [PATCH V2 7/8] nvme: " Jianchao Wang
@ 2019-03-25  5:38 ` Jianchao Wang
  2019-03-25  7:18   ` Hannes Reinecke
  7 siblings, 1 reply; 28+ messages in thread
From: Jianchao Wang @ 2019-03-25  5:38 UTC (permalink / raw)
  To: axboe
  Cc: hch, jthumshirn, hare, josef, bvanassche, sagi, keith.busch,
	jsmart2021, linux-block, linux-nvme, linux-kernel

As nobody uses blk_mq_tagset_busy_iter, remove it.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 block/blk-mq-tag.c     | 95 --------------------------------------------------
 include/linux/blk-mq.h |  2 --
 2 files changed, 97 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index bf7b235..a6a28dd 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -279,101 +279,6 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
 	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
 }
 
-struct bt_tags_iter_data {
-	struct blk_mq_tags *tags;
-	busy_tag_iter_fn *fn;
-	void *data;
-	bool reserved;
-};
-
-static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
-{
-	struct bt_tags_iter_data *iter_data = data;
-	struct blk_mq_tags *tags = iter_data->tags;
-	bool reserved = iter_data->reserved;
-	struct request *rq;
-
-	if (!reserved)
-		bitnr += tags->nr_reserved_tags;
-
-	/*
-	 * We can hit rq == NULL here, because the tagging functions
-	 * test and set the bit before assining ->rqs[].
-	 */
-	rq = tags->rqs[bitnr];
-	if (rq && blk_mq_request_started(rq))
-		return iter_data->fn(rq, iter_data->data, reserved);
-
-	return true;
-}
-
-/**
- * bt_tags_for_each - iterate over the requests in a tag map
- * @tags:	Tag map to iterate over.
- * @bt:		sbitmap to examine. This is either the breserved_tags member
- *		or the bitmap_tags member of struct blk_mq_tags.
- * @fn:		Pointer to the function that will be called for each started
- *		request. @fn will be called as follows: @fn(rq, @data,
- *		@reserved) where rq is a pointer to a request. Return true
- *		to continue iterating tags, false to stop.
- * @data:	Will be passed as second argument to @fn.
- * @reserved:	Indicates whether @bt is the breserved_tags member or the
- *		bitmap_tags member of struct blk_mq_tags.
- */
-static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
-			     busy_tag_iter_fn *fn, void *data, bool reserved)
-{
-	struct bt_tags_iter_data iter_data = {
-		.tags = tags,
-		.fn = fn,
-		.data = data,
-		.reserved = reserved,
-	};
-
-	if (tags->rqs)
-		sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
-}
-
-/**
- * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
- * @tags:	Tag map to iterate over.
- * @fn:		Pointer to the function that will be called for each started
- *		request. @fn will be called as follows: @fn(rq, @priv,
- *		reserved) where rq is a pointer to a request. 'reserved'
- *		indicates whether or not @rq is a reserved request. Return
- *		true to continue iterating tags, false to stop.
- * @priv:	Will be passed as second argument to @fn.
- */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
-		busy_tag_iter_fn *fn, void *priv)
-{
-	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
-}
-
-/**
- * blk_mq_tagset_busy_iter - iterate over all started requests in a tag set
- * @tagset:	Tag set to iterate over.
- * @fn:		Pointer to the function that will be called for each started
- *		request. @fn will be called as follows: @fn(rq, @priv,
- *		reserved) where rq is a pointer to a request. 'reserved'
- *		indicates whether or not @rq is a reserved request. Return
- *		true to continue iterating tags, false to stop.
- * @priv:	Will be passed as second argument to @fn.
- */
-void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
-		busy_tag_iter_fn *fn, void *priv)
-{
-	int i;
-
-	for (i = 0; i < tagset->nr_hw_queues; i++) {
-		if (tagset->tags && tagset->tags[i])
-			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
-	}
-}
-EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
-
 /**
  * __blk_mq_queue_tag_busy_iter - iterate over all busy or inflight tags
  * @q:		Request queue to examine.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index dff9bb6..3ba8001 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -318,8 +318,6 @@ void blk_mq_unquiesce_queue(struct request_queue *q);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
-void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
-		busy_tag_iter_fn *fn, void *priv);
 void blk_mq_queue_tag_inflight_iter(struct request_queue *q,
 		busy_iter_fn *fn, void *priv);
 void blk_mq_freeze_queue(struct request_queue *q);
-- 
2.7.4



* Re: [PATCH V2 2/8] blk-mq: use static_rqs instead of rqs to iterate tags
  2019-03-25  5:38 ` [PATCH V2 2/8] blk-mq: use static_rqs instead of rqs to iterate tags Jianchao Wang
@ 2019-03-25  7:12   ` Dongli Zhang
  2019-03-25  7:14     ` jianchao.wang
  0 siblings, 1 reply; 28+ messages in thread
From: Dongli Zhang @ 2019-03-25  7:12 UTC (permalink / raw)
  To: Jianchao Wang
  Cc: axboe, linux-block, jsmart2021, sagi, josef, linux-nvme,
	linux-kernel, keith.busch, hare, jthumshirn, hch, bvanassche

Hi Jianchao,

On 3/25/19 1:38 PM, Jianchao Wang wrote:
> tags->rqs[] will not been cleaned when free driver tag to avoid
> an extra store on a shared area in the per io path. But there
> is a window between get driver tag and write tags->rqs[], so we
> may see stale rq in tags->rqs[] which may have been freed, as
> following case,
> blk_mq_get_request         blk_mq_queue_tag_busy_iter
>   -> blk_mq_get_tag

Sorry I forgot to mention below when I was studying the v1 patchset.

If there is a further review iteration, I would suggest clarifying here that
blk_mq_get_tag() sets the bit in the sbitmap (e.g., via __sbitmap_get_word())
that bt_for_each() later iterates. Otherwise, please ignore my message.

E.g.,

blk_mq_get_request
  -> blk_mq_get_tag
      -> taint sbitmap tag bit
         without setting tags->rqs[tag]

                                          -> bt_for_each
                                              -> iterate all dirty bit...

This would be much friendlier to people not working on the block layer when
they are looking for a potential fix for a bug by simply searching over the
commit messages. They would be able to quickly understand that there is a
window between setting the bit and setting the data.


>                              -> bt_for_each
>                                -> bt_iter
>                                  -> rq = taags->rqs[]

%s/taags/tags/g

>                                  -> rq->q
>   -> blk_mq_rq_ctx_init
>     -> data->hctx->tags->rqs[rq->tag] = rq;
> 

Thank you very much!

Dongli Zhang


* Re: [PATCH V2 2/8] blk-mq: use static_rqs instead of rqs to iterate tags
  2019-03-25  7:12   ` Dongli Zhang
@ 2019-03-25  7:14     ` jianchao.wang
  0 siblings, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2019-03-25  7:14 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: axboe, linux-block, jsmart2021, sagi, josef, linux-nvme,
	linux-kernel, keith.busch, hare, jthumshirn, hch, bvanassche

Hi Dongli

Thanks for your kindly response.

On 3/25/19 3:12 PM, Dongli Zhang wrote:
> Hi Jianchao,
> 
> On 3/25/19 1:38 PM, Jianchao Wang wrote:
>> tags->rqs[] will not been cleaned when free driver tag to avoid
>> an extra store on a shared area in the per io path. But there
>> is a window between get driver tag and write tags->rqs[], so we
>> may see stale rq in tags->rqs[] which may have been freed, as
>> following case,
>> blk_mq_get_request         blk_mq_queue_tag_busy_iter
>>   -> blk_mq_get_tag
> 
> Sorry I forgot to mention below when I was studying the v1 patchset.
> 
> If there is further review iteration, I would suggest clarify here that
> blk_mq_get_tag() would set the bit in sbitmap (e.g., via __sbitmap_get_word())
> that bt_for_each() would iterate. Otherwise, please ignore my message.
> 
> E.g.,
> 
> blk_mq_get_request
>   -> blk_mq_get_tag
>       -> taint sbitmap tag bit
>          without setting tags->rqs[tag]
> 
>                                           -> bt_for_each
>                                               -> iterate all dirty bit...
> 
> This would be much more friendly for people not working on block layer and when
> they are looking for the potential fix for a bug by simply searching over the
> commit messages. People not working on this would be able to quickly understand
> there is a window between setting bit and setting data.

Yes, I will do that.

> 
> 
>>                              -> bt_for_each
>>                                -> bt_iter
>>                                  -> rq = taags->rqs[]
> 
> %s/taags/tags/g

Yes.


Thanks
Jianchao

> 
>>                                  -> rq->q
>>   -> blk_mq_rq_ctx_init
>>     -> data->hctx->tags->rqs[rq->tag] = rq;
>>
> 
> Thank you very much!
> 
> Dongli Zhang
> 


* Re: [PATCH V2 8/8] blk-mq: remove blk_mq_tagset_busy_iter
  2019-03-25  5:38 ` [PATCH V2 8/8] blk-mq: remove blk_mq_tagset_busy_iter Jianchao Wang
@ 2019-03-25  7:18   ` Hannes Reinecke
  2019-03-25  7:37     ` jianchao.wang
  0 siblings, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2019-03-25  7:18 UTC (permalink / raw)
  To: Jianchao Wang, axboe
  Cc: hch, jthumshirn, josef, bvanassche, sagi, keith.busch,
	jsmart2021, linux-block, linux-nvme, linux-kernel

On 3/25/19 6:38 AM, Jianchao Wang wrote:
> As nobody uses blk_mq_tagset_busy_iter, remove it.
> 
> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
> ---
>   block/blk-mq-tag.c     | 95 --------------------------------------------------
>   include/linux/blk-mq.h |  2 --
>   2 files changed, 97 deletions(-)
> 
Please, don't.

I'm currently implementing reserved commands for SCSI and reworking the 
SCSI error handling where I rely on this interface quite heavily.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


* Re: [PATCH V2 8/8] blk-mq: remove blk_mq_tagset_busy_iter
  2019-03-25  7:18   ` Hannes Reinecke
@ 2019-03-25  7:37     ` jianchao.wang
  2019-03-25  8:25       ` Hannes Reinecke
  0 siblings, 1 reply; 28+ messages in thread
From: jianchao.wang @ 2019-03-25  7:37 UTC (permalink / raw)
  To: Hannes Reinecke, axboe
  Cc: linux-block, jsmart2021, sagi, josef, linux-nvme, linux-kernel,
	keith.busch, jthumshirn, hch, bvanassche

Hi Hannes

On 3/25/19 3:18 PM, Hannes Reinecke wrote:
> On 3/25/19 6:38 AM, Jianchao Wang wrote:
>> As nobody uses blk_mq_tagset_busy_iter, remove it.
>>
>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>> ---
>>   block/blk-mq-tag.c     | 95 --------------------------------------------------
>>   include/linux/blk-mq.h |  2 --
>>   2 files changed, 97 deletions(-)
>>
> Please, don't.
> 
> I'm currently implementing reserved commands for SCSI and reworking the SCSI error handling where I rely on this interface quite heavily.


blk_mq_tagset_busy_iter could access stale requests which may have been freed
due to io scheduler switching or request_queue cleanup (shared tagset) while
someone is submitting io and getting a driver tag. When io scheduler attached,
even quiescing the request_queue won't work.

If this patchset is accepted, blk_mq_tagset_busy_iter could be replaced with
blk_mq_queue_tag_inflight_iter, which needs to be invoked for every
request_queue that shares the tagset.

Please go ahead with your work.

Thanks
Jianchao

> 
> Cheers,
> 
> Hannes


* Re: [PATCH V2 8/8] blk-mq: remove blk_mq_tagset_busy_iter
  2019-03-25  7:37     ` jianchao.wang
@ 2019-03-25  8:25       ` Hannes Reinecke
  2019-03-25  9:12         ` jianchao.wang
  2019-03-26 14:17         ` Jens Axboe
  0 siblings, 2 replies; 28+ messages in thread
From: Hannes Reinecke @ 2019-03-25  8:25 UTC (permalink / raw)
  To: jianchao.wang, axboe
  Cc: linux-block, jsmart2021, sagi, josef, linux-nvme, linux-kernel,
	keith.busch, jthumshirn, hch, bvanassche

On 3/25/19 8:37 AM, jianchao.wang wrote:
> Hi Hannes
> 
> On 3/25/19 3:18 PM, Hannes Reinecke wrote:
>> On 3/25/19 6:38 AM, Jianchao Wang wrote:
>>> As nobody uses blk_mq_tagset_busy_iter, remove it.
>>>
>>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>>> ---
>>>    block/blk-mq-tag.c     | 95 --------------------------------------------------
>>>    include/linux/blk-mq.h |  2 --
>>>    2 files changed, 97 deletions(-)
>>>
>> Please, don't.
>>
>> I'm currently implementing reserved commands for SCSI and reworking the SCSI error handling where I rely on
>> this interface quite heavily.
> 
> 
> blk_mq_tagset_busy_iter could access some stale requests which maybe freed due to io scheduler switching,
> request_queue cleanup (shared tagset)
> when there is someone submits io and gets driver tag. When io scheduler attached, even quiesce
> request_queue won't work.
> 
> If this patchset is accepted, blk_mq_tagset_busy_iter could be replaced with blk_mq_queue_inflight_tag_iter
> which needs to be invoked by every request_queue that shares the tagset.
> 
The point is, at that time I do _not_ have a request queue to work with.

Most SCSI drivers have a host-wide shared tagset, which is used by all 
request queues on that host.
Iterating over the shared tagset is far more efficient than to traverse 
over all devices and the attached request queues.

If I had to traverse all request queues I would need to add additional 
locking to ensure this traversal is race-free, making it a really 
cumbersome interface to use.

Plus the tagset iter is understood to be used only in cases where I/O is 
stopped from the upper layers (ie no new I/O will be submitted).
So here we only need to protect against I/O being completed, which is 
not what this patchset is about.

So my objection still stands: Please, don't.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


* Re: [PATCH V2 8/8] blk-mq: remove blk_mq_tagset_busy_iter
  2019-03-25  8:25       ` Hannes Reinecke
@ 2019-03-25  9:12         ` jianchao.wang
  2019-03-26 14:17         ` Jens Axboe
  1 sibling, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2019-03-25  9:12 UTC (permalink / raw)
  To: Hannes Reinecke, axboe
  Cc: keith.busch, jsmart2021, sagi, linux-kernel, linux-nvme, josef,
	linux-block, jthumshirn, hch, bvanassche



On 3/25/19 4:25 PM, Hannes Reinecke wrote:
> On 3/25/19 8:37 AM, jianchao.wang wrote:
>> Hi Hannes
>>
>> On 3/25/19 3:18 PM, Hannes Reinecke wrote:
>>> On 3/25/19 6:38 AM, Jianchao Wang wrote:
>>>> As nobody uses blk_mq_tagset_busy_iter, remove it.
>>>>
>>>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>>>> ---
>>>>    block/blk-mq-tag.c     | 95 --------------------------------------------------
>>>>    include/linux/blk-mq.h |  2 --
>>>>    2 files changed, 97 deletions(-)
>>>>
>>> Please, don't.
>>>
>>> I'm currently implementing reserved commands for SCSI and reworking the SCSI error handling where I rely on
>>> this interface quite heavily.
>>
>>
>> blk_mq_tagset_busy_iter could access some stale requests which maybe freed due to io scheduler switching,
>> request_queue cleanup (shared tagset)
>> when there is someone submits io and gets driver tag. When io scheduler attached, even quiesce
                                                         ^^^^
                                                      s/When/Without
>> request_queue won't work.
>>
>> If this patchset is accepted, blk_mq_tagset_busy_iter could be replaced with blk_mq_queue_inflight_tag_iter
>> which needs to be invoked by every request_queue that shares the tagset.
>>
> The point is, at that time I do _not_ have a request queue to work with.
> 
> Most SCSI drivers have a host-wide shared tagset, which is used by all request queues on that host.
> Iterating over the shared tagset is far more efficient than to traverse over all devices and the attached request queues.
> 
> If I had to traverse all request queues I would need to add additional locking to ensure this traversal is race-free, making it a really cumbersome interface to use.

Yes, the new interface in this patchset is indeed not convenient to use in the shared tagset case.
Perhaps we could introduce an interface to iterate all of the request_queues that share the same tagset, such as,

mutex_lock(&set->tag_list_lock)
list_for_each_entry(q, &set->tag_list, tag_set_list)
...
mutex_unlock(&set->tag_list_lock)
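Fleshed out, such a helper might look like the sketch below (hypothetical
name, untested kernel-style pseudocode built only from the fields named
above — not a proposed implementation):

```c
/* Hypothetical sketch only -- not a tested implementation. */
void blk_mq_tagset_inflight_iter(struct blk_mq_tag_set *set,
				 busy_iter_fn *fn, void *priv)
{
	struct request_queue *q;

	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_queue_tag_inflight_iter(q, fn, priv);
	mutex_unlock(&set->tag_list_lock);
}
```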

> 
> Plus the tagset iter is understood to be used only in cases where I/O is stopped from the upper layers (ie no new I/O will be submitted).

The window is between getting a driver tag and storing the rq into tags->rqs[].
w/ io-scheduler attached, we need to quiesce the request_queue to stop the
driver tag allocation attempts.
w/o io-scheduler attached, quiescing the request_queue cannot work; we need to
freeze the queue and drain all tasks that enter blk_mq_make_request to rule out
all attempts at allocating driver tags. Unfortunately, we cannot distinguish
between a task entering .make_request and an allocated request, both of which
occupy q_usage_counter.

So to stop all attempts at allocating driver tags, we have to freeze & drain
as well as quiesce the request_queue.


Thanks
Jianchao

> So here we only need to protect against I/O being completed, which is not what this patchset is about.
> 
> So my objection still stands: Please, don't.
> 
> Cheers,
> 
> Hannes


* Re: [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-25  5:38 ` [PATCH V2 7/8] nvme: " Jianchao Wang
@ 2019-03-25 13:49   ` Keith Busch
  2019-03-26  1:17     ` jianchao.wang
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2019-03-25 13:49 UTC (permalink / raw)
  To: Jianchao Wang
  Cc: axboe, hch, jthumshirn, hare, josef, bvanassche, sagi,
	keith.busch, jsmart2021, linux-block, linux-nvme, linux-kernel

On Mon, Mar 25, 2019 at 01:38:37PM +0800, Jianchao Wang wrote:
> blk_mq_tagset_inflight_iter is not safe that it could get stale request
> in tags->rqs[]. Use blk_mq_queue_tag_inflight_iter here. A new helper
> interface nvme_iterate_inflight_rqs is introduced to iterate
> all of the ns under a ctrl.

Nak, NVMe only iterates tags when new requests can't enter, allocated
requests can't dispatch, and dispatched commands can't complete. So
it is perfectly safe to iterate if the driver takes reasonable steps
beforehand. Further, for M tags and N namespaces, we complete teardown
in O(M) time, but this makes it O(M*N) without gaining anything.


* Re: [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-25 13:49   ` Keith Busch
@ 2019-03-26  1:17     ` jianchao.wang
  2019-03-26  2:41       ` Ming Lei
  0 siblings, 1 reply; 28+ messages in thread
From: jianchao.wang @ 2019-03-26  1:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: axboe, linux-block, jsmart2021, bvanassche, josef, linux-nvme,
	linux-kernel, keith.busch, hare, jthumshirn, hch, sagi

Hi Keith

On 3/25/19 9:49 PM, Keith Busch wrote:
> On Mon, Mar 25, 2019 at 01:38:37PM +0800, Jianchao Wang wrote:
>> blk_mq_tagset_inflight_iter is not safe that it could get stale request
>> in tags->rqs[]. Use blk_mq_queue_tag_inflight_iter here. A new helper
>> interface nvme_iterate_inflight_rqs is introduced to iterate
>> all of the ns under a ctrl.
> 
> Nak, NVMe only iterates tags when new requests can't enter, allocated
> requests can't dispatch, and dispatched commands can't complete. So
> it is perfectly safe to iterate if the driver takes reasonable steps
> beforehand.

nvme_dev_disable just quiesces and starts to freeze the request_queue, but does
not drain the enters. So there could still be someone who escapes the queue
freeze check and tries to allocate a request.

> Further, for M tags and N namespaces, we complete teardown
> in O(M) time, but this makes in O(M*N) without gaining anything.
> 

Yes, it is indeed inefficient.

Thanks
Jianchao


* Re: [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-26  1:17     ` jianchao.wang
@ 2019-03-26  2:41       ` Ming Lei
  2019-03-26  3:05         ` jianchao.wang
  0 siblings, 1 reply; 28+ messages in thread
From: Ming Lei @ 2019-03-26  2:41 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Keith Busch, Jens Axboe, Keith Busch, James Smart,
	Bart Van Assche, Josef Bacik, linux-nvme,
	Linux Kernel Mailing List, linux-block, Hannes Reinecke,
	Johannes Thumshirn, Christoph Hellwig, Sagi Grimberg

On Tue, Mar 26, 2019 at 9:18 AM jianchao.wang
<jianchao.w.wang@oracle.com> wrote:
>
> Hi Keith
>
> On 3/25/19 9:49 PM, Keith Busch wrote:
> > On Mon, Mar 25, 2019 at 01:38:37PM +0800, Jianchao Wang wrote:
> >> blk_mq_tagset_inflight_iter is not safe that it could get stale request
> >> in tags->rqs[]. Use blk_mq_queue_tag_inflight_iter here. A new helper
> >> interface nvme_iterate_inflight_rqs is introduced to iterate
> >> all of the ns under a ctrl.
> >
> > Nak, NVMe only iterates tags when new requests can't enter, allocated
> > requests can't dispatch, and dispatched commands can't complete. So
> > it is perfectly safe to iterate if the driver takes reasonable steps
> > beforehand.
>
> nvme_dev_disable just quiesce and freeze the request_queue, but not drain the enters.
> So there still could be someone escapes the queue freeze checking and tries to allocate
> request.

The rq->state is just IDLE for these allocated requests, so there shouldn't
be an issue in NVMe's case.

Thanks,
Ming


* Re: [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-26  2:41       ` Ming Lei
@ 2019-03-26  3:05         ` jianchao.wang
  2019-03-26 23:57           ` Keith Busch
  0 siblings, 1 reply; 28+ messages in thread
From: jianchao.wang @ 2019-03-26  3:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, Keith Busch, James Smart,
	Bart Van Assche, Josef Bacik, linux-nvme,
	Linux Kernel Mailing List, linux-block, Hannes Reinecke,
	Johannes Thumshirn, Christoph Hellwig, Sagi Grimberg



On 3/26/19 10:41 AM, Ming Lei wrote:
> On Tue, Mar 26, 2019 at 9:18 AM jianchao.wang
> <jianchao.w.wang@oracle.com> wrote:
>>
>> Hi Keith
>>
>> On 3/25/19 9:49 PM, Keith Busch wrote:
>>> On Mon, Mar 25, 2019 at 01:38:37PM +0800, Jianchao Wang wrote:
>>>> blk_mq_tagset_inflight_iter is not safe that it could get stale request
>>>> in tags->rqs[]. Use blk_mq_queue_tag_inflight_iter here. A new helper
>>>> interface nvme_iterate_inflight_rqs is introduced to iterate
>>>> all of the ns under a ctrl.
>>>
>>> Nak, NVMe only iterates tags when new requests can't enter, allocated
>>> requests can't dispatch, and dispatched commands can't complete. So
>>> it is perfectly safe to iterate if the driver takes reasonable steps
>>> beforehand.
>>
>> nvme_dev_disable just quiesce and freeze the request_queue, but not drain the enters.
>> So there still could be someone escapes the queue freeze checking and tries to allocate
>> request.
> 
> The rq->state is just IDLE for these allocated request, so there
> shouldn't be issue
> in NVMe's case.

What if there used to be an io scheduler that left some stale requests in the
sched tags? Or nr_hw_queues was decreased, leaving the hctx->fq->flush_rq?

The stale request could be something freed and reused by others, and the state
field could happen to be overwritten to a non-zero value...

Thanks
Jianchao


* Re: [PATCH V2 8/8] blk-mq: remove blk_mq_tagset_busy_iter
  2019-03-25  8:25       ` Hannes Reinecke
  2019-03-25  9:12         ` jianchao.wang
@ 2019-03-26 14:17         ` Jens Axboe
  1 sibling, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2019-03-26 14:17 UTC (permalink / raw)
  To: Hannes Reinecke, jianchao.wang
  Cc: linux-block, jsmart2021, sagi, josef, linux-nvme, linux-kernel,
	keith.busch, jthumshirn, hch, bvanassche

On 3/25/19 2:25 AM, Hannes Reinecke wrote:
> On 3/25/19 8:37 AM, jianchao.wang wrote:
>> Hi Hannes
>>
>> On 3/25/19 3:18 PM, Hannes Reinecke wrote:
>>> On 3/25/19 6:38 AM, Jianchao Wang wrote:
>>>> As nobody uses blk_mq_tagset_busy_iter, remove it.
>>>>
>>>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>>>> ---
>>>>    block/blk-mq-tag.c     | 95 --------------------------------------------------
>>>>    include/linux/blk-mq.h |  2 --
>>>>    2 files changed, 97 deletions(-)
>>>>
>>> Please, don't.
>>>
>>> I'm currently implementing reserved commands for SCSI and reworking
>>> the SCSI error handling where I rely on this interface quite
>>> heavily.
>>
>>
>> blk_mq_tagset_busy_iter could access some stale requests which maybe
>> freed due to io scheduler switching, request_queue cleanup (shared
>> tagset) when there is someone submits io and gets driver tag. When io
>> scheduler attached, even quiesce request_queue won't work.
>>
>> If this patchset is accepted, blk_mq_tagset_busy_iter could be
>> replaced with blk_mq_queue_inflight_tag_iter which needs to be
>> invoked by every request_queue that shares the tagset.
>>
> The point is, at that time I do _not_ have a request queue to work
> with.
> 
> Most SCSI drivers have a host-wide shared tagset, which is used by all
> request queues on that host.  Iterating over the shared tagset is far
> more efficient than to traverse over all devices and the attached
> request queues.
> 
> If I had to traverse all request queues I would need to add additional
> locking to ensure this traversal is race-free, making it a really
> cumbersome interface to use.
> 
> Plus the tagset iter is understood to be used only in cases where I/O
> is stopped from the upper layers (ie no new I/O will be submitted).
> So here we only need to protect against I/O being completed, which is
> not what this patchset is about.
> 
> So my objection still stands: Please, don't.

We can't just keep an interface that's hard to use correctly just
because you have something pending for that. Jianchao has good
suggestions for you on how to proceed.

-- 
Jens Axboe



* Re: [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-26  3:05         ` jianchao.wang
@ 2019-03-26 23:57           ` Keith Busch
  2019-03-27  2:03             ` jianchao.wang
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2019-03-26 23:57 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, Busch, Keith, James Smart, Bart Van Assche,
	Josef Bacik, linux-nvme, Linux Kernel Mailing List, linux-block,
	Hannes Reinecke, Johannes Thumshirn, Christoph Hellwig,
	Sagi Grimberg

On Mon, Mar 25, 2019 at 08:05:53PM -0700, jianchao.wang wrote:
> What if there used to be a io scheduler and leave some stale requests of sched tags ?
> Or the nr_hw_queues was decreased and leave the hctx->fq->flush_rq ?

Requests internally queued in scheduler or block layer are not eligible
for the nvme driver's iterator callback. We only use it to reclaim
dispatched requests that the target can't return, which only applies to
requests that must have a valid rq->tag value from hctx->tags.
 
> The stable request could be some tings freed and used
> by others and the state field happen to be overwritten to non-zero...

I am not sure I follow what this means. At least for nvme, every queue
sharing the same tagset is quiesced and frozen, there should be no
request state in flux at the time we iterate.


* Re: [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-26 23:57           ` Keith Busch
@ 2019-03-27  2:03             ` jianchao.wang
  2019-03-27  2:15               ` Keith Busch
  0 siblings, 1 reply; 28+ messages in thread
From: jianchao.wang @ 2019-03-27  2:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche, Ming Lei,
	Josef Bacik, linux-nvme, Linux Kernel Mailing List, Busch, Keith,
	Hannes Reinecke, Johannes Thumshirn, Christoph Hellwig,
	Sagi Grimberg

Hi Keith

On 3/27/19 7:57 AM, Keith Busch wrote:
> On Mon, Mar 25, 2019 at 08:05:53PM -0700, jianchao.wang wrote:
>> What if there used to be a io scheduler and leave some stale requests of sched tags ?
>> Or the nr_hw_queues was decreased and leave the hctx->fq->flush_rq ?
> 
> Requests internally queued in scheduler or block layer are not eligible
> for the nvme driver's iterator callback. We only use it to reclaim
> dispatched requests that the target can't return, which only applies to
> requests that must have a valid rq->tag value from hctx->tags.
>  
>> The stable request could be some tings freed and used
>> by others and the state field happen to be overwritten to non-zero...
> 
> I am not sure I follow what this means. At least for nvme, every queue
> sharing the same tagset is quiesced and frozen, there should be no
> request state in flux at the time we iterate.
> 

In nvme_dev_disable, when we try to reclaim the in-flight requests with blk_mq_tagset_busy_iter,
the request_queues are quiesced but only start-frozen.
We only try to _drain_ the in-flight requests in the _shutdown_ case, when the controller is not dead.
In the reset case, someone could still escape the queue-freezing check, enter
blk_mq_make_request and try to allocate a tag, and then we may get:

generic_make_request        nvme_dev_disable
 -> blk_queue_enter
                              -> nvme_start_freeze (just start freeze, no drain)
                              -> nvme_stop_queues
 -> blk_mq_make_request
  -> blk_mq_get_request      -> blk_mq_tagset_busy_iter
     -> blk_mq_get_tag
                                -> bt_tags_for_each
                                   -> bt_tags_iter
                                       -> rq = tags->rqs[] ---> [1]
     -> blk_mq_rq_ctx_init
       -> data->hctx->tags->rqs[rq->tag] = rq;

The rq read at position [1] could be a stale request that has been freed, because it was
1. the hctx->fq->flush_rq of a dead request_queue that shares the same tagset, or
2. a removed io scheduler's sched request.

This stale request may since have been reused by others, its request->state changed to a non-zero
value; it then passes the blk_mq_request_started check and will be handled by nvme_cancel_request.
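The window can be modeled with a toy sketch (plain userspace C, not the real blk-mq API; all mini_* names are invented for illustration): the tag bit is set before rqs[] is updated, so an iterator that trusts rqs[] for every set bit can read a leftover pointer, while an iterator using static_rqs[] always sees the request owned by the tagset itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the race; invented names, not real blk-mq code. */
struct mini_rq { int state; };

struct mini_tags {
	bool bitmap[4];                /* stands in for the tag sbitmap       */
	struct mini_rq *rqs[4];        /* only written after the bit is set   */
	struct mini_rq *static_rqs[4]; /* fixed when the tagset is allocated  */
};

/* allocator side: the bit is set first (as in blk_mq_get_tag); rqs[]
 * is filled in later (as in blk_mq_rq_ctx_init), opening the window */
static int mini_get_tag(struct mini_tags *t)
{
	for (int i = 0; i < 4; i++) {
		if (!t->bitmap[i]) {
			t->bitmap[i] = true; /* window opens here */
			return i;
		}
	}
	return -1;
}

/* iterator trusting rqs[], like bt_iter()/bt_tags_iter() do today */
static struct mini_rq *iter_rqs(struct mini_tags *t, int tag)
{
	return t->bitmap[tag] ? t->rqs[tag] : NULL;
}

/* iterator using static_rqs[], as this series proposes */
static struct mini_rq *iter_static(struct mini_tags *t, int tag)
{
	return t->bitmap[tag] ? t->static_rqs[tag] : NULL;
}
```

If a leftover pointer sits in rqs[0] and a new allocator sets bit 0 before updating rqs[0], iter_rqs() hands the iterator the stale entry, while iter_static() still returns the tagset-owned request.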

Thanks
Jianchao



* Re: [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-27  2:03             ` jianchao.wang
@ 2019-03-27  2:15               ` Keith Busch
  2019-03-27  2:27                 ` jianchao.wang
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2019-03-27  2:15 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche, Ming Lei,
	Josef Bacik, linux-nvme, Linux Kernel Mailing List, Busch, Keith,
	Hannes Reinecke, Johannes Thumshirn, Christoph Hellwig,
	Sagi Grimberg

On Wed, Mar 27, 2019 at 10:03:26AM +0800, jianchao.wang wrote:
> Hi Keith
> 
> On 3/27/19 7:57 AM, Keith Busch wrote:
> > On Mon, Mar 25, 2019 at 08:05:53PM -0700, jianchao.wang wrote:
> >> What if there used to be an io scheduler that left some stale requests in the sched tags?
> >> Or nr_hw_queues was decreased, leaving a stale hctx->fq->flush_rq?
> > 
> > Requests internally queued in scheduler or block layer are not eligible
> > for the nvme driver's iterator callback. We only use it to reclaim
> > dispatched requests that the target can't return, which only applies to
> > requests that must have a valid rq->tag value from hctx->tags.
> >  
> >> The stale request could be something that was freed and reused
> >> by others, and its state field happens to be overwritten to a non-zero value...
> > 
> > I am not sure I follow what this means. At least for nvme, every queue
> > sharing the same tagset is quiesced and frozen, there should be no
> > request state in flux at the time we iterate.
> > 
> 
> In nvme_dev_disable, when we try to reclaim the in-flight requests with blk_mq_tagset_busy_iter,
> the request_queues are quiesced but only start-frozen.
> We only try to _drain_ the in-flight requests in the _shutdown_ case, when the controller is not dead.
> In the reset case, someone could still escape the queue-freezing check, enter
> blk_mq_make_request and try to allocate a tag, and then we may get:
> 
> generic_make_request        nvme_dev_disable
>  -> blk_queue_enter              
>                               -> nvme_start_freeze (just start freeze, no drain)
>                               -> nvme_stop_queues
>  -> blk_mq_make_request
>   -> blk_mq_get_request      -> blk_mq_tagset_busy_iter
>      -> blk_mq_get_tag
>                                 -> bt_tags_for_each
>                                    -> bt_tags_iter
>                                        -> rq = tags->rqs[] ---> [1]
>      -> blk_mq_rq_ctx_init
>        -> data->hctx->tags->rqs[rq->tag] = rq;
> 
> The rq read at position [1] could be a stale request that has been freed, because it was
> 1. the hctx->fq->flush_rq of a dead request_queue that shares the same tagset, or
> 2. a removed io scheduler's sched request.
> 
> This stale request may since have been reused by others, its request->state changed to a non-zero
> value; it then passes the blk_mq_request_started check and will be handled by nvme_cancel_request.

How is that request state going to be anything other than IDLE? A freed
request state is IDLE, and continues to be IDLE until dispatched. But
dispatch is blocked for the entire tagset, so request states can't be
started during an nvme reset.


* Re: [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-27  2:15               ` Keith Busch
@ 2019-03-27  2:27                 ` jianchao.wang
  2019-03-27  2:33                   ` Keith Busch
  0 siblings, 1 reply; 28+ messages in thread
From: jianchao.wang @ 2019-03-27  2:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Busch, Keith, James Smart, Bart Van Assche, Ming Lei,
	Josef Bacik, linux-nvme, Linux Kernel Mailing List, linux-block,
	Hannes Reinecke, Johannes Thumshirn, Christoph Hellwig,
	Sagi Grimberg



On 3/27/19 10:15 AM, Keith Busch wrote:
> On Wed, Mar 27, 2019 at 10:03:26AM +0800, jianchao.wang wrote:
>> Hi Keith
>>
>> On 3/27/19 7:57 AM, Keith Busch wrote:
>>> On Mon, Mar 25, 2019 at 08:05:53PM -0700, jianchao.wang wrote:
>>>> What if there used to be an io scheduler that left some stale requests in the sched tags?
>>>> Or nr_hw_queues was decreased, leaving a stale hctx->fq->flush_rq?
>>>
>>> Requests internally queued in scheduler or block layer are not eligible
>>> for the nvme driver's iterator callback. We only use it to reclaim
>>> dispatched requests that the target can't return, which only applies to
>>> requests that must have a valid rq->tag value from hctx->tags.
>>>  
>>>> The stale request could be something that was freed and reused
>>>> by others, and its state field happens to be overwritten to a non-zero value...
>>>
>>> I am not sure I follow what this means. At least for nvme, every queue
>>> sharing the same tagset is quiesced and frozen, there should be no
>>> request state in flux at the time we iterate.
>>>
>>
>> In nvme_dev_disable, when we try to reclaim the in-flight requests with blk_mq_tagset_busy_iter,
>> the request_queues are quiesced but only start-frozen.
>> We only try to _drain_ the in-flight requests in the _shutdown_ case, when the controller is not dead.
>> In the reset case, someone could still escape the queue-freezing check, enter
>> blk_mq_make_request and try to allocate a tag, and then we may get:
>>
>> generic_make_request        nvme_dev_disable
>>  -> blk_queue_enter              
>>                               -> nvme_start_freeze (just start freeze, no drain)
>>                               -> nvme_stop_queues
>>  -> blk_mq_make_request
>>   -> blk_mq_get_request      -> blk_mq_tagset_busy_iter
>>      -> blk_mq_get_tag
>>                                 -> bt_tags_for_each
>>                                    -> bt_tags_iter
>>                                        -> rq = tags->rqs[] ---> [1]
>>      -> blk_mq_rq_ctx_init
>>        -> data->hctx->tags->rqs[rq->tag] = rq;
>>
>> The rq read at position [1] could be a stale request that has been freed, because it was
>> 1. the hctx->fq->flush_rq of a dead request_queue that shares the same tagset, or
>> 2. a removed io scheduler's sched request.
>>
>> This stale request may since have been reused by others, its request->state changed to a non-zero
>> value; it then passes the blk_mq_request_started check and will be handled by nvme_cancel_request.
> 
> How is that request state going to be anything other than IDLE? A freed
> request state is IDLE, and continues to be IDLE until dispatched. But
> dispatch is blocked for the entire tagset, so request states can't be
> started during an nvme reset.
> 

As in the comment above, the stale request may be something that has been freed in one of the following cases:
1. the hctx->fq->flush_rq of a dead request_queue that shares the same tagset
2. a removed io scheduler's sched request
and this freed request could then be reallocated by others, which may change the request->state field.

Thanks
Jianchao


* Re: [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-27  2:27                 ` jianchao.wang
@ 2019-03-27  2:33                   ` Keith Busch
  2019-03-27  2:45                     ` jianchao.wang
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2019-03-27  2:33 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Jens Axboe, Busch, Keith, James Smart, Bart Van Assche, Ming Lei,
	Josef Bacik, linux-nvme, Linux Kernel Mailing List, linux-block,
	Hannes Reinecke, Johannes Thumshirn, Christoph Hellwig,
	Sagi Grimberg

On Wed, Mar 27, 2019 at 10:27:57AM +0800, jianchao.wang wrote:
> As in the comment above, the stale request may be something that has been freed in one of the following cases:
> 1. the hctx->fq->flush_rq of a dead request_queue that shares the same tagset
> 2. a removed io scheduler's sched request
> and this freed request could then be reallocated by others, which may change the request->state field.

You're not explaining how that request->state is changed. I understand the
request can be reallocated, but what is changing its state?


* Re: [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-27  2:33                   ` Keith Busch
@ 2019-03-27  2:45                     ` jianchao.wang
  2019-03-27  6:51                       ` Keith Busch
  0 siblings, 1 reply; 28+ messages in thread
From: jianchao.wang @ 2019-03-27  2:45 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche, Ming Lei,
	Josef Bacik, linux-nvme, Linux Kernel Mailing List, Busch, Keith,
	Hannes Reinecke, Johannes Thumshirn, Christoph Hellwig,
	Sagi Grimberg

Hi Keith

On 3/27/19 10:33 AM, Keith Busch wrote:
> On Wed, Mar 27, 2019 at 10:27:57AM +0800, jianchao.wang wrote:
>> As in the comment above, the stale request may be something that has been freed in one of the following cases:
>> 1. the hctx->fq->flush_rq of a dead request_queue that shares the same tagset
>> 2. a removed io scheduler's sched request
>> and this freed request could then be reallocated by others, which may change the request->state field.
> 
> You're not explaining how that request->state is changed. I understand the
> request can be reallocated, but what is changing its state?
> 

Sorry for my poor description, which led to the misunderstanding.
The _freed_ below means:

1. the hctx->fq->flush_rq of a dead request_queue that shares the same tagset
   The whole request_queue is cleaned up and freed, so hctx->fq and its flush_rq are freed back to a slab.

2. a removed io scheduler's sched request
   The io scheduler is detached and all of its structures are freed, including the pages where the sched
   requests are located.

So the pointers in tags->rqs[] may point to memory that is no longer used as a block layer request.
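As a toy illustration of why such a reused entry can look "started" (plain userspace C; rq2_started() is an invented stand-in for blk_mq_request_started(), which in this sketch treats any non-zero state as started):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Invented stand-in types: state 0 means IDLE in this sketch. */
struct rq2 { int state; };

/* stand-in for blk_mq_request_started(): non-zero state == started */
static bool rq2_started(const struct rq2 *rq)
{
	return rq->state != 0;
}
```

Once the memory behind a stale rqs[] pointer is handed out to some unrelated allocation, the bytes at the old state offset can be anything, so a started-check performed through the stale pointer can spuriously pass and the bogus "request" gets cancelled.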


Thanks
Jianchao

  


* Re: [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-27  2:45                     ` jianchao.wang
@ 2019-03-27  6:51                       ` Keith Busch
  2019-03-27  7:18                         ` jianchao.wang
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2019-03-27  6:51 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche, Ming Lei,
	Josef Bacik, linux-nvme, Linux Kernel Mailing List, Busch, Keith,
	Hannes Reinecke, Johannes Thumshirn, Christoph Hellwig,
	Sagi Grimberg

On Wed, Mar 27, 2019 at 10:45:33AM +0800, jianchao.wang wrote:
> 1. the hctx->fq->flush_rq of a dead request_queue that shares the same tagset
>    The whole request_queue is cleaned up and freed, so hctx->fq and its flush_rq are freed back to a slab.
>
> 2. a removed io scheduler's sched request
>    The io scheduler is detached and all of its structures are freed, including the pages where the sched
>    requests are located.
> 
> So the pointers in tags->rqs[] may point to memory that is no longer used as a block layer request.

Oh, free as in kfree'd, not blk_mq_free_request. So it's a read-after-
free that you're concerned about, not that anyone explicitly changed a
request->state.

We at least can't free the flush_queue until the queue is frozen. If the
queue is frozen, we've completed the special fq->flush_rq where its end_io
replaces tags->rqs[tag] back to the fq->orig_rq from the static_rqs,
so nvme's iterator couldn't see the fq->flush_rq address if it's invalid.

The sched_tags concern, though, appears theoretically possible.


* Re: [PATCH V2 7/8] nvme: use blk_mq_queue_tag_inflight_iter
  2019-03-27  6:51                       ` Keith Busch
@ 2019-03-27  7:18                         ` jianchao.wang
  0 siblings, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2019-03-27  7:18 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche, Ming Lei,
	Josef Bacik, linux-nvme, Linux Kernel Mailing List, Busch, Keith,
	Hannes Reinecke, Johannes Thumshirn, Christoph Hellwig,
	Sagi Grimberg

Hi Keith

On 3/27/19 2:51 PM, Keith Busch wrote:
> On Wed, Mar 27, 2019 at 10:45:33AM +0800, jianchao.wang wrote:
>> 1. the hctx->fq->flush_rq of a dead request_queue that shares the same tagset
>>    The whole request_queue is cleaned up and freed, so hctx->fq and its flush_rq are freed back to a slab.
>>
>> 2. a removed io scheduler's sched request
>>    The io scheduler is detached and all of its structures are freed, including the pages where the sched
>>    requests are located.
>>
>> So the pointers in tags->rqs[] may point to memory that is no longer used as a block layer request.
> 
> Oh, free as in kfree'd, not blk_mq_free_request. So it's a read-after-
> free that you're concerned about, not that anyone explicitly changed a
> request->state.

Yes ;)

> 
> We at least can't free the flush_queue until the queue is frozen. If the
> queue is frozen, we've completed the special fq->flush_rq where its end_io
> replaces tags->rqs[tag] back to the fq->orig_rq from the static_rqs,
> so nvme's iterator couldn't see the fq->flush_rq address if it's invalid.
> 

This is true for the non-io-scheduler case, in which the flush_rq steals the driver tag.
But in the io-scheduler case, the flush_rq acquires a driver tag itself.


> The sched_tags concern, though, appears theoretically possible.
> 

Thanks
Jianchao


end of thread, other threads:[~2019-03-27  7:19 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-25  5:38 [PATCH V2 0/8]: blk-mq: use static_rqs to iterate busy tags Jianchao Wang
2019-03-25  5:38 ` [PATCH V2 1/8] blk-mq: get rid of the synchronize_rcu in __blk_mq_update_nr_hw_queues Jianchao Wang
2019-03-25  5:38 ` [PATCH V2 2/8] blk-mq: use static_rqs instead of rqs to iterate tags Jianchao Wang
2019-03-25  7:12   ` Dongli Zhang
2019-03-25  7:14     ` jianchao.wang
2019-03-25  5:38 ` [PATCH V2 3/8] blk-mq: use blk_mq_queue_tag_inflight_iter in debugfs Jianchao Wang
2019-03-25  5:38 ` [PATCH V2 4/8] mtip32xx: use blk_mq_queue_tag_inflight_iter Jianchao Wang
2019-03-25  5:38 ` [PATCH V2 5/8] nbd: " Jianchao Wang
2019-03-25  5:38 ` [PATCH V2 6/8] skd: " Jianchao Wang
2019-03-25  5:38 ` [PATCH V2 7/8] nvme: " Jianchao Wang
2019-03-25 13:49   ` Keith Busch
2019-03-26  1:17     ` jianchao.wang
2019-03-26  2:41       ` Ming Lei
2019-03-26  3:05         ` jianchao.wang
2019-03-26 23:57           ` Keith Busch
2019-03-27  2:03             ` jianchao.wang
2019-03-27  2:15               ` Keith Busch
2019-03-27  2:27                 ` jianchao.wang
2019-03-27  2:33                   ` Keith Busch
2019-03-27  2:45                     ` jianchao.wang
2019-03-27  6:51                       ` Keith Busch
2019-03-27  7:18                         ` jianchao.wang
2019-03-25  5:38 ` [PATCH V2 8/8] blk-mq: remove blk_mq_tagset_busy_iter Jianchao Wang
2019-03-25  7:18   ` Hannes Reinecke
2019-03-25  7:37     ` jianchao.wang
2019-03-25  8:25       ` Hannes Reinecke
2019-03-25  9:12         ` jianchao.wang
2019-03-26 14:17         ` Jens Axboe
