* [PATCH V4 0/7] blk-mq: fix races related with freeing queue
@ 2019-04-04  8:43 Ming Lei
  2019-04-04  8:43 ` [PATCH V4 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Ming Lei @ 2019-04-04  8:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Hi,

Since commit 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), running a
queue isn't allowed during queue cleanup even though the queue's
refcount is held.

This change has caused many kernel oopses in the run-queue path, and
it turns out they aren't easy to fix one by one.

So move freeing of hw queue resources into the hctx's release handler,
which fixes the above issue. Meanwhile, this approach is safe because
freeing hw queue resources doesn't require tags.
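
A condensed sketch of the approach (illustrative; patches 4/7 and 5/7
contain the actual changes):

	static void blk_mq_hw_sysfs_release(struct kobject *kobj)
	{
		struct blk_mq_hw_ctx *hctx =
			container_of(kobj, struct blk_mq_hw_ctx, kobj);

		/* Runs only after the last kobject_put(), and freeing
		 * these resources doesn't need tags.
		 */
		cancel_delayed_work_sync(&hctx->run_work);
		blk_free_flush_queue(hctx->fq);
		sbitmap_free(&hctx->ctx_map);
		free_cpumask_var(hctx->cpumask);
		kfree(hctx->ctxs);
		kfree(hctx);
	}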

V3 covers more races.

V4:
	- add patch for fixing potential use-after-free in blk_mq_update_nr_hw_queues
	- fix comment in the last patch

V3:
	- cancel q->requeue_work in queue's release handler
	- cancel hctx->run_work in hctx's release handler
	- add patch 1 for fixing race in plug code path
	- the last patch is added to avoid grabbing SCSI's device refcount
	in the IO path

V2:
	- move freeing of hw queue resources into hctx's release handler


Ming Lei (7):
  blk-mq: grab .q_usage_counter when queuing request from plug code path
  blk-mq: move cancel of requeue_work into blk_mq_release
  blk-mq: quiesce queue before updating nr_hw_queues
  blk-mq: free hw queue's resource in hctx's release handler
  blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  block: don't drain in-progress dispatch in blk_cleanup_queue()
  SCSI: don't hold device refcount in IO path

 block/blk-core.c        | 23 +----------------------
 block/blk-mq-sysfs.c    |  8 ++++++++
 block/blk-mq.c          | 25 +++++++++++++++++--------
 block/blk-mq.h          |  2 +-
 drivers/scsi/scsi_lib.c | 28 ++--------------------------
 5 files changed, 29 insertions(+), 57 deletions(-)

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
-- 
2.9.5



* [PATCH V4 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-04  8:43 [PATCH V4 0/7] blk-mq: fix races related with freeing queue Ming Lei
@ 2019-04-04  8:43 ` Ming Lei
  2019-04-04 15:58   ` Bart Van Assche
                     ` (2 more replies)
  2019-04-04  8:43 ` [PATCH V4 2/7] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 18+ messages in thread
From: Ming Lei @ 2019-04-04  8:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Just like aio/io_uring, we need to grab two refcounts for queuing one
request: one for submission and another for completion.

If the request isn't queued from the plug code path, the refcount
grabbed in generic_make_request() serves for submission. In theory,
this refcount should have been released after the submission (async
run queue) is done. blk_freeze_queue() works together with
blk_sync_queue() to avoid a race between queue cleanup and IO
submission: async run queue activities are canceled because
hctx->run_work is scheduled with the refcount held, so it is fine not
to hold the refcount when running the run queue work function to
dispatch IO.

However, if a request is staged into the plug list and finally queued
from the plug code path, the refcount on the submission side is
actually missing. We may then start to run the queue after the queue
is removed, because the queue's kobject refcount isn't guaranteed to
be held in the plug-list-flushing context, and a kernel oops is
triggered; see the following race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
                blk_mq_run_hw_queue

Because of a concurrent run queue, all requests inserted above may be
completed before the above blk_mq_run_hw_queue() is called, so the
queue can be freed during that blk_mq_run_hw_queue().
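
Spelled out as a two-context interleaving (illustrative):

        CPU0: flush plug list             CPU1: completion / cleanup
        ---------------------             --------------------------
        blk_mq_sched_insert_requests()
          insert rq into sw/sched queue
                                          all inserted requests complete
                                          blk_cleanup_queue() proceeds,
                                          hctx resources are freed
        blk_mq_run_hw_queue()
          touches the freed hctx -> oops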

Fix the issue by grabbing .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is
safe because the queue is definitely alive before inserting a
request.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ff3d7b49969..5b586affee09 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1728,9 +1728,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 		if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
 			if (this_hctx) {
 				trace_block_unplug(this_q, depth, !from_schedule);
+
+				percpu_ref_get(&this_q->q_usage_counter);
 				blk_mq_sched_insert_requests(this_hctx, this_ctx,
 								&rq_list,
 								from_schedule);
+				percpu_ref_put(&this_q->q_usage_counter);
 			}
 
 			this_q = rq->q;
@@ -1749,8 +1752,11 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	 */
 	if (this_hctx) {
 		trace_block_unplug(this_q, depth, !from_schedule);
+
+		percpu_ref_get(&this_q->q_usage_counter);
 		blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list,
 						from_schedule);
+		percpu_ref_put(&this_q->q_usage_counter);
 	}
 }
 
-- 
2.9.5



* [PATCH V4 2/7] blk-mq: move cancel of requeue_work into blk_mq_release
  2019-04-04  8:43 [PATCH V4 0/7] blk-mq: fix races related with freeing queue Ming Lei
  2019-04-04  8:43 ` [PATCH V4 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
@ 2019-04-04  8:43 ` Ming Lei
  2019-04-04  8:43 ` [PATCH V4 3/7] blk-mq: quiesce queue before updating nr_hw_queues Ming Lei
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-04-04  8:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

While holding the queue's kobject refcount, it is safe for the driver
to schedule a requeue. However, blk_mq_kick_requeue_list() may be
called after blk_sync_queue() is done because of concurrent requeue
activities, so the requeue work may not be completed when the queue is
freed, and a kernel oops is triggered.

So move the cancel of requeue_work into blk_mq_release() to avoid the
race between requeue and freeing the queue.
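
Spelled out as an interleaving (illustrative):

        CPU0: queue teardown              CPU1: driver requeue path
        --------------------              --------------------------
        blk_sync_queue(q)
          cancel_delayed_work_sync(
                &q->requeue_work)
                                          blk_mq_kick_requeue_list(q)
                                            requeue_work queued again
        queue is freed
                                          requeue_work runs -> oops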

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 1 -
 block/blk-mq.c   | 2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4673ebe42255..6583d67f3e34 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -237,7 +237,6 @@ void blk_sync_queue(struct request_queue *q)
 		struct blk_mq_hw_ctx *hctx;
 		int i;
 
-		cancel_delayed_work_sync(&q->requeue_work);
 		queue_for_each_hw_ctx(q, hctx, i)
 			cancel_delayed_work_sync(&hctx->run_work);
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5b586affee09..b512ba0cb359 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2626,6 +2626,8 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
+	cancel_delayed_work_sync(&q->requeue_work);
+
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
-- 
2.9.5



* [PATCH V4 3/7] blk-mq: quiesce queue before updating nr_hw_queues
  2019-04-04  8:43 [PATCH V4 0/7] blk-mq: fix races related with freeing queue Ming Lei
  2019-04-04  8:43 ` [PATCH V4 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
  2019-04-04  8:43 ` [PATCH V4 2/7] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
@ 2019-04-04  8:43 ` Ming Lei
  2019-04-04 15:57   ` Bart Van Assche
  2019-04-08  3:16   ` jianchao.wang
  2019-04-04  8:43 ` [PATCH V4 4/7] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 18+ messages in thread
From: Ming Lei @ 2019-04-04  8:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Inside __blk_mq_update_nr_hw_queues(), request queues are only frozen
before updating nr_hw_queues.

However, even after blk_mq_freeze_queue() has returned, there might
still be run queue activity that hasn't completed, so a use-after-free
may be triggered on an hctx and its fields.

Fix this issue by really quiescing the queue via blk_mq_quiesce_queue()
and blk_sync_queue(), making sure no run queue activity is pending
before releasing an hctx.
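
Roughly, the three calls give the following guarantees (a summary of
their behavior, not exact code):

	/*
	 * blk_mq_freeze_queue(q)  - blocks new blk_queue_enter() callers
	 *                           and waits for q_usage_counter to
	 *                           drain, i.e. submitted requests finish.
	 * blk_mq_quiesce_queue(q) - marks the queue quiesced and does an
	 *                           (S)RCU sync, so in-progress dispatch
	 *                           sections are guaranteed to have exited.
	 * blk_sync_queue(q)       - flushes the pending timeout/run work.
	 */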

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b512ba0cb359..41c12d9008b7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3224,8 +3224,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
 		return;
 
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_freeze_queue(q);
+		blk_mq_quiesce_queue(q);
+		blk_sync_queue(q);
+	}
 	/*
 	 * Sync with blk_mq_queue_tag_busy_iter.
 	 */
@@ -3269,8 +3272,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_elv_switch_back(&head, q);
 
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		blk_mq_unquiesce_queue(q);
 		blk_mq_unfreeze_queue(q);
+	}
 }
 
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
-- 
2.9.5



* [PATCH V4 4/7] blk-mq: free hw queue's resource in hctx's release handler
  2019-04-04  8:43 [PATCH V4 0/7] blk-mq: fix races related with freeing queue Ming Lei
                   ` (2 preceding siblings ...)
  2019-04-04  8:43 ` [PATCH V4 3/7] blk-mq: quiesce queue before updating nr_hw_queues Ming Lei
@ 2019-04-04  8:43 ` Ming Lei
  2019-04-04  8:43 ` [PATCH V4 5/7] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-04-04  8:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang,
	stable

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes exactly this issue.

However, that commit introduces another issue. Before 45a9c9d909b2,
we were allowed to run a queue during queue cleanup if the queue's
kobj refcount was held. After that commit, a queue can't be run during
queue cleanup, otherwise an oops is easily triggered because some
fields of the hctx are freed by blk_mq_free_queue() in
blk_cleanup_queue().

We have invented ways to address this kind of issue before, such as:

	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But they still can't cover all cases; recently James reported another
issue of this kind:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue is quite hard to address with the previous approaches,
given that scsi_run_queue() may run requeues for other LUNs.

Fix the above issue by freeing the hctx's resources in its release
handler; this is safe because tags aren't needed for freeing these
hctx resources.

This approach follows the typical design pattern for kobject release
handlers.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c     | 2 +-
 block/blk-mq-sysfs.c | 6 ++++++
 block/blk-mq.c       | 8 ++------
 block/blk-mq.h       | 2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6583d67f3e34..20298aa5a77c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	blk_exit_queue(q);
 
 	if (queue_is_mq(q))
-		blk_mq_free_queue(q);
+		blk_mq_exit_queue(q);
 
 	percpu_ref_exit(&q->q_usage_counter);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3f9c3f4ac44c..4040e62c3737 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -10,6 +10,7 @@
 #include <linux/smp.h>
 
 #include <linux/blk-mq.h>
+#include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
@@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 {
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
+
+	if (hctx->flags & BLK_MQ_F_BLOCKING)
+		cleanup_srcu_struct(hctx->srcu);
+	blk_free_flush_queue(hctx->fq);
+	sbitmap_free(&hctx->ctx_map);
 	free_cpumask_var(hctx->cpumask);
 	kfree(hctx->ctxs);
 	kfree(hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 41c12d9008b7..3fc1e30f8084 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2259,12 +2259,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		cleanup_srcu_struct(hctx->srcu);
-
 	blk_mq_remove_cpuhp(hctx);
-	blk_free_flush_queue(hctx->fq);
-	sbitmap_free(&hctx->ctx_map);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2899,7 +2894,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 }
 EXPORT_SYMBOL(blk_mq_init_allocated_queue);
 
-void blk_mq_free_queue(struct request_queue *q)
+/* tags can _not_ be used after returning from blk_mq_exit_queue */
+void blk_mq_exit_queue(struct request_queue *q)
 {
 	struct blk_mq_tag_set	*set = q->tag_set;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d704fc7766f4..c421e3a16e36 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,7 +37,7 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-void blk_mq_free_queue(struct request_queue *q);
+void blk_mq_exit_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
-- 
2.9.5



* [PATCH V4 5/7] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  2019-04-04  8:43 [PATCH V4 0/7] blk-mq: fix races related with freeing queue Ming Lei
                   ` (3 preceding siblings ...)
  2019-04-04  8:43 ` [PATCH V4 4/7] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
@ 2019-04-04  8:43 ` Ming Lei
  2019-04-04  8:43 ` [PATCH V4 6/7] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
  2019-04-04  8:43 ` [PATCH V4 7/7] SCSI: don't hold device refcount in IO path Ming Lei
  6 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-04-04  8:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

The hctx is always released after the requeue work is canceled.

While holding the queue's kobject refcount, it is safe for the driver
to run the queue, so one run queue instance might still be scheduled
after blk_sync_queue() is done.

So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
to avoid running a released queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c     | 8 --------
 block/blk-mq-sysfs.c | 2 ++
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 20298aa5a77c..ad17e999f79e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -232,14 +232,6 @@ void blk_sync_queue(struct request_queue *q)
 {
 	del_timer_sync(&q->timeout);
 	cancel_work_sync(&q->timeout_work);
-
-	if (queue_is_mq(q)) {
-		struct blk_mq_hw_ctx *hctx;
-		int i;
-
-		queue_for_each_hw_ctx(q, hctx, i)
-			cancel_delayed_work_sync(&hctx->run_work);
-	}
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 4040e62c3737..25c0d0a6a556 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -35,6 +35,8 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
 
+	cancel_delayed_work_sync(&hctx->run_work);
+
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		cleanup_srcu_struct(hctx->srcu);
 	blk_free_flush_queue(hctx->fq);
-- 
2.9.5



* [PATCH V4 6/7] block: don't drain in-progress dispatch in blk_cleanup_queue()
  2019-04-04  8:43 [PATCH V4 0/7] blk-mq: fix races related with freeing queue Ming Lei
                   ` (4 preceding siblings ...)
  2019-04-04  8:43 ` [PATCH V4 5/7] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
@ 2019-04-04  8:43 ` Ming Lei
  2019-04-04  8:43 ` [PATCH V4 7/7] SCSI: don't hold device refcount in IO path Ming Lei
  6 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-04-04  8:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Now that freeing hw queue resources has been moved to the hctx's
release handler, we don't need to worry about the race between
blk_cleanup_queue() and run queue any more.

So don't drain in-progress dispatch in blk_cleanup_queue().

This is basically a revert of c2856ae2f315 ("blk-mq: quiesce queue
before freeing queue").

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ad17e999f79e..82b630da7e2c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -338,18 +338,6 @@ void blk_cleanup_queue(struct request_queue *q)
 
 	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
 
-	/*
-	 * make sure all in-progress dispatch are completed because
-	 * blk_freeze_queue() can only complete all requests, and
-	 * dispatch may still be in-progress since we dispatch requests
-	 * from more than one contexts.
-	 *
-	 * We rely on driver to deal with the race in case that queue
-	 * initialization isn't done.
-	 */
-	if (queue_is_mq(q) && blk_queue_init_done(q))
-		blk_mq_quiesce_queue(q);
-
 	/* for synchronous bio-based driver finish in-flight integrity i/o */
 	blk_flush_integrity();
 
-- 
2.9.5



* [PATCH V4 7/7] SCSI: don't hold device refcount in IO path
  2019-04-04  8:43 [PATCH V4 0/7] blk-mq: fix races related with freeing queue Ming Lei
                   ` (5 preceding siblings ...)
  2019-04-04  8:43 ` [PATCH V4 6/7] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
@ 2019-04-04  8:43 ` Ming Lei
  2019-04-04 17:14   ` Bart Van Assche
  6 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2019-04-04  8:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

The scsi_device's refcount is always grabbed in the IO path.

It turns out this isn't necessary, because blk_cleanup_queue() drains
any in-flight IOs and then cancels the timeout/requeue work, and
SCSI's requeue_work is canceled too in __scsi_remove_device().

Also, the scsi_device won't go away until blk_cleanup_queue() is done.

So don't hold the refcount in the IO path; in particular, the refcount
hasn't been required in the IO path since blk_queue_enter() /
blk_queue_exit() were introduced in the legacy block layer.
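
The lifetime chain, spelled out (a sketch of the reasoning under the
statements above, not new code):

	/*
	 * Why the per-command get_device()/put_device() is redundant:
	 *
	 *   - every dispatched request pins q->q_usage_counter
	 *     (blk_queue_enter());
	 *   - blk_cleanup_queue() waits until all such references drop;
	 *   - __scsi_remove_device() calls blk_cleanup_queue() and cancels
	 *     requeue_work before the scsi_device can be freed.
	 *
	 * So a command still in flight implies the scsi_device is alive.
	 */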

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 28 ++--------------------------
 1 file changed, 2 insertions(+), 26 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 601b9f1de267..3369d66911eb 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -141,8 +141,6 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
 
 static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
 {
-	struct scsi_device *sdev = cmd->device;
-
 	if (cmd->request->rq_flags & RQF_DONTPREP) {
 		cmd->request->rq_flags &= ~RQF_DONTPREP;
 		scsi_mq_uninit_cmd(cmd);
@@ -150,7 +148,6 @@ static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
 		WARN_ON_ONCE(true);
 	}
 	blk_mq_requeue_request(cmd->request, true);
-	put_device(&sdev->sdev_gendev);
 }
 
 /**
@@ -189,19 +186,7 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, bool unbusy)
 	 */
 	cmd->result = 0;
 
-	/*
-	 * Before a SCSI command is dispatched,
-	 * get_device(&sdev->sdev_gendev) is called and the host,
-	 * target and device busy counters are increased. Since
-	 * requeuing a request causes these actions to be repeated and
-	 * since scsi_device_unbusy() has already been called,
-	 * put_device(&device->sdev_gendev) must still be called. Call
-	 * put_device() after blk_mq_requeue_request() to avoid that
-	 * removal of the SCSI device can start before requeueing has
-	 * happened.
-	 */
 	blk_mq_requeue_request(cmd->request, true);
-	put_device(&device->sdev_gendev);
 }
 
 /*
@@ -619,7 +604,6 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
 		blk_mq_run_hw_queues(q, true);
 
 	percpu_ref_put(&q->q_usage_counter);
-	put_device(&sdev->sdev_gendev);
 	return false;
 }
 
@@ -1613,7 +1597,6 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
 	struct scsi_device *sdev = q->queuedata;
 
 	atomic_dec(&sdev->device_busy);
-	put_device(&sdev->sdev_gendev);
 }
 
 static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
@@ -1621,16 +1604,9 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct scsi_device *sdev = q->queuedata;
 
-	if (!get_device(&sdev->sdev_gendev))
-		goto out;
-	if (!scsi_dev_queue_ready(q, sdev))
-		goto out_put_device;
-
-	return true;
+	if (scsi_dev_queue_ready(q, sdev))
+		return true;
 
-out_put_device:
-	put_device(&sdev->sdev_gendev);
-out:
 	if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
 		blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
 	return false;
-- 
2.9.5



* Re: [PATCH V4 3/7] blk-mq: quiesce queue before updating nr_hw_queues
  2019-04-04  8:43 ` [PATCH V4 3/7] blk-mq: quiesce queue before updating nr_hw_queues Ming Lei
@ 2019-04-04 15:57   ` Bart Van Assche
  2019-04-04 20:55     ` Ming Lei
  2019-04-08  3:16   ` jianchao.wang
  1 sibling, 1 reply; 18+ messages in thread
From: Bart Van Assche @ 2019-04-04 15:57 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Thu, 2019-04-04 at 16:43 +0800, Ming Lei wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b512ba0cb359..41c12d9008b7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3224,8 +3224,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>         if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
>                 return;
>  
> -       list_for_each_entry(q, &set->tag_list, tag_set_list)
> +       list_for_each_entry(q, &set->tag_list, tag_set_list) {
>                 blk_mq_freeze_queue(q);
> +               blk_mq_quiesce_queue(q);
> +               blk_sync_queue(q);
> +       }
>         /*
>          * Sync with blk_mq_queue_tag_busy_iter.
>          */
> @@ -3269,8 +3272,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>         list_for_each_entry(q, &set->tag_list, tag_set_list)
>                 blk_mq_elv_switch_back(&head, q);
>  
> -       list_for_each_entry(q, &set->tag_list, tag_set_list)
> +       list_for_each_entry(q, &set->tag_list, tag_set_list) {
> +               blk_mq_unquiesce_queue(q);
>                 blk_mq_unfreeze_queue(q);
> +       }
>  }
>  
>  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)

Are you sure this patch is sufficient? What prevents blk_mq_run_hw_queues()
from being called after the blk_mq_quiesce_queue() and blk_sync_queue() calls
have finished and before the queue is unfrozen?

Bart.


* Re: [PATCH V4 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-04  8:43 ` [PATCH V4 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
@ 2019-04-04 15:58   ` Bart Van Assche
  2019-04-04 20:45     ` Ming Lei
  2019-04-04 21:30   ` Bart Van Assche
  2019-04-05  9:26   ` Dongli Zhang
  2 siblings, 1 reply; 18+ messages in thread
From: Bart Van Assche @ 2019-04-04 15:58 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Thu, 2019-04-04 at 16:43 +0800, Ming Lei wrote:
> Just like aio/io_uring, we need to grab two refcounts for queuing one
> request: one for submission and another for completion.
> 
> If the request isn't queued from the plug code path, the refcount
> grabbed in generic_make_request() serves for submission. In theory,
> this refcount should have been released after the submission (async
> run queue) is done. blk_freeze_queue() works together with
> blk_sync_queue() to avoid a race between queue cleanup and IO
> submission: async run queue activities are canceled because
> hctx->run_work is scheduled with the refcount held, so it is fine not
> to hold the refcount when running the run queue work function to
> dispatch IO.
> 
> However, if a request is staged into the plug list and finally queued
> from the plug code path, the refcount on the submission side is
> actually missing. We may then start to run the queue after the queue
> is removed, because the queue's kobject refcount isn't guaranteed to
> be held in the plug-list-flushing context, and a kernel oops is
> triggered; see the following race:
> 
> blk_mq_flush_plug_list():
>         blk_mq_sched_insert_requests()
>                 insert requests to sw queue or scheduler queue
>                 blk_mq_run_hw_queue
> 
> Because of a concurrent run queue, all requests inserted above may be
> completed before the above blk_mq_run_hw_queue() is called, so the
> queue can be freed during that blk_mq_run_hw_queue().
> 
> Fix the issue by grabbing .q_usage_counter before calling
> blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is
> safe because the queue is definitely alive before inserting a
> request.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org
> Cc: Martin K . Petersen <martin.petersen@oracle.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b49969..5b586affee09 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1728,9 +1728,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>                 if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
>                         if (this_hctx) {
>                                 trace_block_unplug(this_q, depth, !from_schedule);
> +
> +                               percpu_ref_get(&this_q->q_usage_counter);
>                                 blk_mq_sched_insert_requests(this_hctx, this_ctx,
>                                                                 &rq_list,
>                                                                 from_schedule);
> +                               percpu_ref_put(&this_q->q_usage_counter);
>                         }
>  
>                         this_q = rq->q;
> @@ -1749,8 +1752,11 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>          */
>         if (this_hctx) {
>                 trace_block_unplug(this_q, depth, !from_schedule);
> +
> +               percpu_ref_get(&this_q->q_usage_counter);
>                 blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list,
>                                                 from_schedule);
> +               percpu_ref_put(&this_q->q_usage_counter);
>         }
>  }

Although this patch looks fine to me: have you considered inserting one
percpu_ref_get() call at the start of blk_mq_flush_plug_list() and one
percpu_ref_put() call at the end of the same function?

Thanks,

Bart.


* Re: [PATCH V4 7/7] SCSI: don't hold device refcount in IO path
  2019-04-04  8:43 ` [PATCH V4 7/7] SCSI: don't hold device refcount in IO path Ming Lei
@ 2019-04-04 17:14   ` Bart Van Assche
  0 siblings, 0 replies; 18+ messages in thread
From: Bart Van Assche @ 2019-04-04 17:14 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Thu, 2019-04-04 at 16:43 +0800, Ming Lei wrote:
> The scsi_device's refcount is always grabbed in the IO path.
> 
> It turns out this isn't necessary, because blk_cleanup_queue() drains
> any in-flight IOs and then cancels the timeout/requeue work, and
> SCSI's requeue_work is canceled too in __scsi_remove_device().
> 
> Also, the scsi_device won't go away until blk_cleanup_queue() is done.
> 
> So don't hold the refcount in the IO path; in particular, the refcount
> hasn't been required in the IO path since blk_queue_enter() /
> blk_queue_exit() were introduced in the legacy block layer.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org
> Cc: Martin K . Petersen <martin.petersen@oracle.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/scsi_lib.c | 28 ++--------------------------
>  1 file changed, 2 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 601b9f1de267..3369d66911eb 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -141,8 +141,6 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
>  
>  static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
>  {
> -       struct scsi_device *sdev = cmd->device;
> -
>         if (cmd->request->rq_flags & RQF_DONTPREP) {
>                 cmd->request->rq_flags &= ~RQF_DONTPREP;
>                 scsi_mq_uninit_cmd(cmd);
> @@ -150,7 +148,6 @@ static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
>                 WARN_ON_ONCE(true);
>         }
>         blk_mq_requeue_request(cmd->request, true);
> -       put_device(&sdev->sdev_gendev);
>  }
>  
>  /**
> @@ -189,19 +186,7 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, bool unbusy)
>          */
>         cmd->result = 0;
>  
> -       /*
> -        * Before a SCSI command is dispatched,
> -        * get_device(&sdev->sdev_gendev) is called and the host,
> -        * target and device busy counters are increased. Since
> -        * requeuing a request causes these actions to be repeated and
> -        * since scsi_device_unbusy() has already been called,
> -        * put_device(&device->sdev_gendev) must still be called. Call
> -        * put_device() after blk_mq_requeue_request() to avoid that
> -        * removal of the SCSI device can start before requeueing has
> -        * happened.
> -        */
>         blk_mq_requeue_request(cmd->request, true);
> -       put_device(&device->sdev_gendev);
>  }
>  
>  /*
> @@ -619,7 +604,6 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
>                 blk_mq_run_hw_queues(q, true);
>  
>         percpu_ref_put(&q->q_usage_counter);
> -       put_device(&sdev->sdev_gendev);
>         return false;
>  }
>  
> @@ -1613,7 +1597,6 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
>         struct scsi_device *sdev = q->queuedata;
>  
>         atomic_dec(&sdev->device_busy);
> -       put_device(&sdev->sdev_gendev);
>  }
>  
>  static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
> @@ -1621,16 +1604,9 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
>         struct request_queue *q = hctx->queue;
>         struct scsi_device *sdev = q->queuedata;
>  
> -       if (!get_device(&sdev->sdev_gendev))
> -               goto out;
> -       if (!scsi_dev_queue_ready(q, sdev))
> -               goto out_put_device;
> -
> -       return true;
> +       if (scsi_dev_queue_ready(q, sdev))
> +               return true;
>  
> -out_put_device:
> -       put_device(&sdev->sdev_gendev);
> -out:
>         if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
>                 blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
>         return false;

Reviewed-by: Bart Van Assche <bvanassche@acm.org>





* Re: [PATCH V4 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-04 15:58   ` Bart Van Assche
@ 2019-04-04 20:45     ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-04-04 20:45 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ming Lei, Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, Linux SCSI List, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Thu, Apr 4, 2019 at 11:58 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On Thu, 2019-04-04 at 16:43 +0800, Ming Lei wrote:
> > Just like aio/io_uring, we need to grab two refcounts for queuing one
> > request: one for submission and another for completion.
> >
> > If the request isn't queued from the plug code path, the refcount
> > grabbed in generic_make_request() serves for submission. In theory,
> > this refcount should have been released after the submission (async
> > run queue) is done. blk_freeze_queue() works together with
> > blk_sync_queue() to avoid a race between queue cleanup and IO
> > submission: async run queue activities are canceled because
> > hctx->run_work is scheduled with the refcount held, so it is fine not
> > to hold the refcount when running the run queue work function to
> > dispatch IO.
> >
> > However, if a request is staged into the plug list and finally queued
> > from the plug code path, the refcount on the submission side is
> > actually missing. We may then start to run the queue after the queue
> > is removed, because the queue's kobject refcount isn't guaranteed to
> > be held in the plug-list-flushing context, and a kernel oops is
> > triggered; see the following race:
> >
> > blk_mq_flush_plug_list():
> >         blk_mq_sched_insert_requests()
> >                 insert requests to sw queue or scheduler queue
> >                 blk_mq_run_hw_queue
> >
> > Because of a concurrent run queue, all requests inserted above may be
> > completed before the above blk_mq_run_hw_queue() is called, so the
> > queue can be freed during that blk_mq_run_hw_queue().
> >
> > Fix the issue by grabbing .q_usage_counter before calling
> > blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is
> > safe because the queue is definitely alive before inserting a
> > request.
> >
> > Cc: Dongli Zhang <dongli.zhang@oracle.com>
> > Cc: James Smart <james.smart@broadcom.com>
> > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > Cc: linux-scsi@vger.kernel.org
> > Cc: Martin K . Petersen <martin.petersen@oracle.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
> > Cc: jianchao wang <jianchao.w.wang@oracle.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 3ff3d7b49969..5b586affee09 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1728,9 +1728,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> >                 if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
> >                         if (this_hctx) {
> >                                 trace_block_unplug(this_q, depth, !from_schedule);
> > +
> > +                               percpu_ref_get(&this_q->q_usage_counter);
> >                                 blk_mq_sched_insert_requests(this_hctx, this_ctx,
> >                                                                 &rq_list,
> >                                                                 from_schedule);
> > +                               percpu_ref_put(&this_q->q_usage_counter);
> >                         }
> >
> >                         this_q = rq->q;
> > @@ -1749,8 +1752,11 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> >          */
> >         if (this_hctx) {
> >                 trace_block_unplug(this_q, depth, !from_schedule);
> > +
> > +               percpu_ref_get(&this_q->q_usage_counter);
> >                 blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list,
> >                                                 from_schedule);
> > +               percpu_ref_put(&this_q->q_usage_counter);
> >         }
> >  }
>
> Although this patch looks fine to me: have you considered inserting one
> percpu_ref_get() call at the start of blk_mq_flush_plug_list() and one
> percpu_ref_put() call at the end of the same function?

Requests from different request queues can be added to the same per-task
plug list, so we can't simply do it that way.

Thanks,
Ming Lei


* Re: [PATCH V4 3/7] blk-mq: quiesce queue before updating nr_hw_queues
  2019-04-04 15:57   ` Bart Van Assche
@ 2019-04-04 20:55     ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-04-04 20:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ming Lei, Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, Linux SCSI List, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Thu, Apr 4, 2019 at 11:57 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On Thu, 2019-04-04 at 16:43 +0800, Ming Lei wrote:
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index b512ba0cb359..41c12d9008b7 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -3224,8 +3224,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >         if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
> >                 return;
> >
> > -       list_for_each_entry(q, &set->tag_list, tag_set_list)
> > +       list_for_each_entry(q, &set->tag_list, tag_set_list) {
> >                 blk_mq_freeze_queue(q);
> > +               blk_mq_quiesce_queue(q);
> > +               blk_sync_queue(q);
> > +       }
> >         /*
> >          * Sync with blk_mq_queue_tag_busy_iter.
> >          */
> > @@ -3269,8 +3272,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >         list_for_each_entry(q, &set->tag_list, tag_set_list)
> >                 blk_mq_elv_switch_back(&head, q);
> >
> > -       list_for_each_entry(q, &set->tag_list, tag_set_list)
> > +       list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > +               blk_mq_unquiesce_queue(q);
> >                 blk_mq_unfreeze_queue(q);
> > +       }
> >  }
> >
> >  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
>
> Are you sure this patch is sufficient? What prevents blk_mq_run_hw_queues()
> from being called after the blk_mq_quiesce_queue() and blk_sync_queue() calls
> have finished and before the queue is unfrozen?

blk_queue_quiesced(hctx->queue) is supposed to prevent
blk_mq_run_hw_queues() from being called after the queue is quiesced.

But there is another bug here: this check shouldn't depend on
hctx_lock(hctx, &srcu_idx), so I will fix blk_mq_quiesce_queue() in V5.

Thanks,
Ming Lei


* Re: [PATCH V4 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-04  8:43 ` [PATCH V4 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
  2019-04-04 15:58   ` Bart Van Assche
@ 2019-04-04 21:30   ` Bart Van Assche
  2019-04-05  9:26   ` Dongli Zhang
  2 siblings, 0 replies; 18+ messages in thread
From: Bart Van Assche @ 2019-04-04 21:30 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Thu, 2019-04-04 at 16:43 +0800, Ming Lei wrote:
> Just like aio/io_uring, we need to grab two refcounts for queuing one
> request: one for submission and another for completion.
> 
> If the request isn't queued from the plug code path, the refcount
> grabbed in generic_make_request() serves for submission. In theory,
> this refcount should have been released after the submission (async
> run queue) is done. blk_freeze_queue() works together with
> blk_sync_queue() to avoid a race between queue cleanup and IO
> submission: async run queue activities are canceled because
> hctx->run_work is scheduled with the refcount held, so it is fine not
> to hold the refcount when running the run queue work function to
> dispatch IO.
> 
> However, if a request is staged into the plug list and finally queued
> from the plug code path, the refcount on the submission side is
> actually missing. We may then start to run the queue after the queue
> is removed, because the queue's kobject refcount isn't guaranteed to
> be held in the plug-list-flushing context, and a kernel oops is
> triggered; see the following race:
> 
> blk_mq_flush_plug_list():
>         blk_mq_sched_insert_requests()
>                 insert requests to sw queue or scheduler queue
>                 blk_mq_run_hw_queue
> 
> Because of a concurrent run queue, all requests inserted above may be
> completed before the above blk_mq_run_hw_queue() is called, so the
> queue can be freed during that blk_mq_run_hw_queue().
> 
> Fix the issue by grabbing .q_usage_counter before calling
> blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is
> safe because the queue is definitely alive before inserting a
> request.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


* Re: [PATCH V4 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-04  8:43 ` [PATCH V4 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
  2019-04-04 15:58   ` Bart Van Assche
  2019-04-04 21:30   ` Bart Van Assche
@ 2019-04-05  9:26   ` Dongli Zhang
  2019-04-05 13:52     ` Ming Lei
  2 siblings, 1 reply; 18+ messages in thread
From: Dongli Zhang @ 2019-04-05  9:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

Hi Ming,

On 04/04/2019 04:43 PM, Ming Lei wrote:
> Just like aio/io_uring, we need to grab two refcounts for queuing one
> request: one for submission and another for completion.
> 
> If the request isn't queued from the plug code path, the refcount
> grabbed in generic_make_request() serves for submission. In theory,
> this refcount should have been released after the submission (async
> run queue) is done. blk_freeze_queue() works together with
> blk_sync_queue() to avoid a race between queue cleanup and IO
> submission: async run queue activities are canceled because
> hctx->run_work is scheduled with the refcount held, so it is fine not
> to hold the refcount when running the run queue work function to
> dispatch IO.
> 
> However, if a request is staged into the plug list and finally queued
> from the plug code path, the refcount on the submission side is
> actually missing. We may then start to run the queue after the queue
> is removed, because the queue's kobject refcount isn't guaranteed to
> be held in the plug-list-flushing context, and a kernel oops is
> triggered; see the following race:
> 
> blk_mq_flush_plug_list():
>         blk_mq_sched_insert_requests()
>                 insert requests to sw queue or scheduler queue
>                 blk_mq_run_hw_queue
> 
> Because of a concurrent run queue, all requests inserted above may be
> completed before the above blk_mq_run_hw_queue() is called, so the
> queue can be freed during that blk_mq_run_hw_queue().
> 
> Fix the issue by grabbing .q_usage_counter before calling
> blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is
> safe because the queue is definitely alive before inserting a
> request.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org
> Cc: Martin K . Petersen <martin.petersen@oracle.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b49969..5b586affee09 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1728,9 +1728,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  		if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
>  			if (this_hctx) {
>  				trace_block_unplug(this_q, depth, !from_schedule);
> +
> +				percpu_ref_get(&this_q->q_usage_counter);

Sorry to bother you, but I would just like to double-confirm the reason
for using "percpu_ref_get()" here, which does not check whether the
queue has been frozen.

Is it because there is an assumption that any direct/indirect caller of
blk_mq_flush_plug_list() must have already grabbed the q_usage_counter,
similar to blk_queue_enter_live()?

Thank you very much!

Dongli Zhang


* Re: [PATCH V4 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-05  9:26   ` Dongli Zhang
@ 2019-04-05 13:52     ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-04-05 13:52 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Fri, Apr 05, 2019 at 05:26:24PM +0800, Dongli Zhang wrote:
> Hi Ming,
> 
> On 04/04/2019 04:43 PM, Ming Lei wrote:
> > Just like aio/io_uring, we need to grab two refcounts for queuing one
> > request: one for submission and another for completion.
> > 
> > If the request isn't queued from the plug code path, the refcount
> > grabbed in generic_make_request() serves for submission. In theory,
> > this refcount should have been released after the submission (async
> > run queue) is done. blk_freeze_queue() works together with
> > blk_sync_queue() to avoid a race between queue cleanup and IO
> > submission: async run queue activities are canceled because
> > hctx->run_work is scheduled with the refcount held, so it is fine not
> > to hold the refcount when running the run queue work function to
> > dispatch IO.
> > 
> > However, if a request is staged into the plug list and finally queued
> > from the plug code path, the refcount on the submission side is
> > actually missing. We may then start to run the queue after the queue
> > is removed, because the queue's kobject refcount isn't guaranteed to
> > be held in the plug-list-flushing context, and a kernel oops is
> > triggered; see the following race:
> > 
> > blk_mq_flush_plug_list():
> >         blk_mq_sched_insert_requests()
> >                 insert requests to sw queue or scheduler queue
> >                 blk_mq_run_hw_queue
> > 
> > Because of a concurrent run queue, all requests inserted above may be
> > completed before the above blk_mq_run_hw_queue() is called, so the
> > queue can be freed during that blk_mq_run_hw_queue().
> > 
> > Fix the issue by grabbing .q_usage_counter before calling
> > blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is
> > safe because the queue is definitely alive before inserting a
> > request.
> > 
> > Cc: Dongli Zhang <dongli.zhang@oracle.com>
> > Cc: James Smart <james.smart@broadcom.com>
> > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > Cc: linux-scsi@vger.kernel.org
> > Cc: Martin K . Petersen <martin.petersen@oracle.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
> > Cc: jianchao wang <jianchao.w.wang@oracle.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 3ff3d7b49969..5b586affee09 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1728,9 +1728,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> >  		if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
> >  			if (this_hctx) {
> >  				trace_block_unplug(this_q, depth, !from_schedule);
> > +
> > +				percpu_ref_get(&this_q->q_usage_counter);
> 
> Sorry to bother you, but I would just like to double-confirm the reason
> for using "percpu_ref_get()" here, which does not check whether the
> queue has been frozen.
> 
> Is it because there is an assumption that any direct/indirect caller of
> blk_mq_flush_plug_list() must have already grabbed the q_usage_counter,
> similar to blk_queue_enter_live()?

Because there is a request in the plug list to be queued, and that
request already holds a q_usage_counter reference grabbed at its
submission.
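
For illustration, the difference from the tryget-style entry used by
blk_queue_enter() (a sketch):

	/*
	 * Plug flush path: each plugged request already holds a usage
	 * reference taken at submission, so the counter is known to be
	 * non-zero and a plain get can't race with freezing.
	 */
	percpu_ref_get(&q->q_usage_counter);

	/*
	 * Fresh submission path: no reference is held yet, so
	 * blk_queue_enter() has to use the tryget variant and back off
	 * when the queue is frozen or dying.
	 */
	if (!percpu_ref_tryget_live(&q->q_usage_counter))
		return -EBUSY;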

Thanks,
Ming


* Re: [PATCH V4 3/7] blk-mq: quiesce queue before updating nr_hw_queues
  2019-04-04  8:43 ` [PATCH V4 3/7] blk-mq: quiesce queue before updating nr_hw_queues Ming Lei
  2019-04-04 15:57   ` Bart Van Assche
@ 2019-04-08  3:16   ` jianchao.wang
  2019-04-08  9:01     ` Ming Lei
  1 sibling, 1 reply; 18+ messages in thread
From: jianchao.wang @ 2019-04-08  3:16 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley

Hi Ming

Why not add a percpu_ref_tryget/put pair into the run queue and requeue
work, or before queuing the work?

Then freezing the queue would really freeze it, and there would be no
queue activity at all after the freeze, including run queue and requeue
work.
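
For concreteness, a hypothetical sketch of this suggestion (not
proposed code):

	/* Guard each run-queue instance with a tryget/put pair. */
	static void run_hw_queue_guarded(struct blk_mq_hw_ctx *hctx)
	{
		struct request_queue *q = hctx->queue;

		if (!percpu_ref_tryget(&q->q_usage_counter))
			return;		/* queue is frozen or going away */
		blk_mq_run_hw_queue(hctx, true);
		percpu_ref_put(&q->q_usage_counter);
	}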

Thanks
Jianchao

On 4/4/19 4:43 PM, Ming Lei wrote:
> Inside __blk_mq_update_nr_hw_queues(), request queues are only frozen
> before updating nr_hw_queues.
> 
> However, even after blk_mq_freeze_queue() has returned, there might
> still be run queue activity that hasn't completed, so a use-after-free
> may be triggered on an hctx and its fields.
> 
> Fix this issue by really quiescing the queue via blk_mq_quiesce_queue()
> and blk_sync_queue(), making sure no run queue activity is pending
> before releasing an hctx.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org
> Cc: Martin K . Petersen <martin.petersen@oracle.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Reported-by: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b512ba0cb359..41c12d9008b7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3224,8 +3224,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  	if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
>  		return;
>  
> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>  		blk_mq_freeze_queue(q);
> +		blk_mq_quiesce_queue(q);
> +		blk_sync_queue(q);
> +	}
>  	/*
>  	 * Sync with blk_mq_queue_tag_busy_iter.
>  	 */
> @@ -3269,8 +3272,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  	list_for_each_entry(q, &set->tag_list, tag_set_list)
>  		blk_mq_elv_switch_back(&head, q);
>  
> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> +		blk_mq_unquiesce_queue(q);
>  		blk_mq_unfreeze_queue(q);
> +	}
>  }
>  
>  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
> 


* Re: [PATCH V4 3/7] blk-mq: quiesce queue before updating nr_hw_queues
  2019-04-08  3:16   ` jianchao.wang
@ 2019-04-08  9:01     ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-04-08  9:01 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, Linux SCSI List, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Mon, Apr 8, 2019 at 11:17 AM jianchao.wang
<jianchao.w.wang@oracle.com> wrote:
>
> Hi Ming
>
> Why not add a percpu_ref_tryget/put pair into the run queue and requeue
> work, or before queuing the work?

If we follow this direction, most block layer APIs might need the pair.

Also, Jens has complained in another thread that the pair may introduce
a 1.2% performance loss, so we should avoid it in the fast path, such as
blk_mq_run_hw_queue().

Given that the request queue is required to be alive, from the kobject
point of view, before calling almost every block layer API, this
lifetime issue can be addressed easily by moving the hctx freeing into
the queue's release handler.

>
> Then freezing the queue would really freeze it, and there would be no
> queue activity at all after the freeze, including run queue and requeue
> work.

The queue activity is just a block layer internal matter, not related
to the driver, so it is not a big deal to handle it by freeing hctx
resources in the release handler.

Thanks,
Ming Lei

