* [PATCH V3 0/6] blk-mq: fix races related with freeing queue
@ 2019-04-03 10:26 Ming Lei
  2019-04-03 10:26 ` [PATCH V3 1/6] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Ming Lei @ 2019-04-03 10:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Ming Lei

Hi,

Since commit 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), running the
queue isn't allowed during queue cleanup even though the queue's refcount
is held.

This change has caused many kernel oopses in the run-queue path, and it
turns out they aren't easy to fix one by one.

So move freeing of the hw queue resources into the hctx's release handler,
which fixes the above issue. This is also safe because freeing the hw
queue resources doesn't require tags.
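
For context, this follows the usual kobject pattern: resources are freed
only in the kobject's ->release() callback, so any code path that still
holds a reference can keep using them safely. Below is a minimal,
illustrative sketch of that pattern (the names foo/foo_release/foo_create
are made up for illustration and are not code from this series):

#include <linux/kobject.h>
#include <linux/slab.h>

struct foo {
	struct kobject kobj;
	void *resource;			/* e.g. flush queue, ctx map, ... */
};

/* Runs only after the last kobject reference has been dropped. */
static void foo_release(struct kobject *kobj)
{
	struct foo *f = container_of(kobj, struct foo, kobj);

	kfree(f->resource);		/* safe: nobody can hold a ref anymore */
	kfree(f);
}

static struct kobj_type foo_ktype = {
	.release	= foo_release,
};

static struct foo *foo_create(void)
{
	struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return NULL;
	kobject_init(&f->kobj, &foo_ktype);
	return f;
}

/* Teardown only drops its reference; the actual freeing may happen later. */
static void foo_teardown(struct foo *f)
{
	kobject_put(&f->kobj);
}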

V3 covers more races.

V3:
	- cancel q->requeue_work in queue's release handler
	- cancel hctx->run_work in hctx's release handler
	- add patch 1 for fixing race in plug code path
	- the last patch is added to avoid grabbing the SCSI device refcount
	in the IO path

V2:
	- moving freeing hw queue resources into hctx's release handler


Ming Lei (6):
  blk-mq: grab .q_usage_counter when queuing request from plug code path
  blk-mq: move cancel of requeue_work into blk_mq_release
  blk-mq: free hw queue's resource in hctx's release handler
  blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  block: don't drain in-progress dispatch in blk_cleanup_queue()
  SCSI: don't hold device refcount in IO path

 block/blk-core.c        | 23 +----------------------
 block/blk-mq-sysfs.c    |  8 ++++++++
 block/blk-mq.c          | 16 ++++++++++------
 block/blk-mq.h          |  2 +-
 drivers/scsi/scsi_lib.c | 30 ++++++------------------------
 5 files changed, 26 insertions(+), 53 deletions(-)

-- 
2.9.5



* [PATCH V3 1/6] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-03 10:26 [PATCH V3 0/6] blk-mq: fix races related with freeing queue Ming Lei
@ 2019-04-03 10:26 ` Ming Lei
  2019-04-03 15:21   ` Bart Van Assche
  2019-04-03 10:26 ` [PATCH V3 2/6] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2019-04-03 10:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Just like aio/io_uring, we need to grab two refcounts for queuing one
request: one for submission and one for completion.

If the request isn't queued from the plug code path, the refcount grabbed
in generic_make_request() covers the submission side. In theory, this
refcount should be released once the submission (the async run queue) is
done. But blk_freeze_queue() works together with blk_sync_queue() to
cancel async run-queue work, provided hctx->run_work was scheduled while
the refcount was held.

However, if the request is staged onto the plug list and finally queued
from the plug code path, the submission-side refcount is actually missing.
Then we may start to run the queue after the queue has been removed, and a
kernel oops is triggered.

Fix the issue by grabbing .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is safe
because the queue is certainly alive before the requests are inserted.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ff3d7b49969..5b586affee09 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1728,9 +1728,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 		if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
 			if (this_hctx) {
 				trace_block_unplug(this_q, depth, !from_schedule);
+
+				percpu_ref_get(&this_q->q_usage_counter);
 				blk_mq_sched_insert_requests(this_hctx, this_ctx,
 								&rq_list,
 								from_schedule);
+				percpu_ref_put(&this_q->q_usage_counter);
 			}
 
 			this_q = rq->q;
@@ -1749,8 +1752,11 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	 */
 	if (this_hctx) {
 		trace_block_unplug(this_q, depth, !from_schedule);
+
+		percpu_ref_get(&this_q->q_usage_counter);
 		blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list,
 						from_schedule);
+		percpu_ref_put(&this_q->q_usage_counter);
 	}
 }
 
-- 
2.9.5



* [PATCH V3 2/6] blk-mq: move cancel of requeue_work into blk_mq_release
  2019-04-03 10:26 [PATCH V3 0/6] blk-mq: fix races related with freeing queue Ming Lei
  2019-04-03 10:26 ` [PATCH V3 1/6] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
@ 2019-04-03 10:26 ` Ming Lei
  2019-04-03 16:19   ` Bart Van Assche
  2019-04-03 10:26 ` [PATCH V3 3/6] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2019-04-03 10:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

While holding the queue's kobject refcount, it is safe for the driver
to schedule a requeue. However, blk_mq_kick_requeue_list() may
be called after blk_sync_queue() is done because of concurrent
requeue activity, so the requeue work may not be completed when
the queue is freed, and a kernel oops is triggered.

So move the cancel of requeue_work into blk_mq_release() to
avoid the race between requeueing and freeing the queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 1 -
 block/blk-mq.c   | 2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4673ebe42255..6583d67f3e34 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -237,7 +237,6 @@ void blk_sync_queue(struct request_queue *q)
 		struct blk_mq_hw_ctx *hctx;
 		int i;
 
-		cancel_delayed_work_sync(&q->requeue_work);
 		queue_for_each_hw_ctx(q, hctx, i)
 			cancel_delayed_work_sync(&hctx->run_work);
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5b586affee09..b512ba0cb359 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2626,6 +2626,8 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
+	cancel_delayed_work_sync(&q->requeue_work);
+
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
-- 
2.9.5



* [PATCH V3 3/6] blk-mq: free hw queue's resource in hctx's release handler
  2019-04-03 10:26 [PATCH V3 0/6] blk-mq: fix races related with freeing queue Ming Lei
  2019-04-03 10:26 ` [PATCH V3 1/6] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
  2019-04-03 10:26 ` [PATCH V3 2/6] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
@ 2019-04-03 10:26 ` Ming Lei
  2019-04-03 16:26   ` Bart Van Assche
  2019-04-03 10:26 ` [PATCH V3 4/6] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2019-04-03 10:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang,
	stable

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.

However, that commit introduced another issue. Before 45a9c9d909b2,
we were allowed to run the queue during queue cleanup as long as the
queue's kobject refcount was held. After that commit, the queue can't be
run during queue cleanup, otherwise an oops is easily triggered because
some fields of the hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have come up with ways to address this kind of issue before, such as:

	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But they still can't cover all cases. Recently James reported another
issue of this kind:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue is quite hard to address with the previous approaches, given
that scsi_run_queue() may run requeues for other LUNs.

Fix the above issue by freeing the hctx's resources in its release handler.
This is safe because tags aren't needed for freeing such hctx resources.

This approach follows the typical design pattern for a kobject's release handler.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c     | 2 +-
 block/blk-mq-sysfs.c | 6 ++++++
 block/blk-mq.c       | 8 ++------
 block/blk-mq.h       | 2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6583d67f3e34..20298aa5a77c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	blk_exit_queue(q);
 
 	if (queue_is_mq(q))
-		blk_mq_free_queue(q);
+		blk_mq_exit_queue(q);
 
 	percpu_ref_exit(&q->q_usage_counter);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3f9c3f4ac44c..4040e62c3737 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -10,6 +10,7 @@
 #include <linux/smp.h>
 
 #include <linux/blk-mq.h>
+#include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
@@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 {
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
+
+	if (hctx->flags & BLK_MQ_F_BLOCKING)
+		cleanup_srcu_struct(hctx->srcu);
+	blk_free_flush_queue(hctx->fq);
+	sbitmap_free(&hctx->ctx_map);
 	free_cpumask_var(hctx->cpumask);
 	kfree(hctx->ctxs);
 	kfree(hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b512ba0cb359..afc9912e2e42 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2259,12 +2259,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		cleanup_srcu_struct(hctx->srcu);
-
 	blk_mq_remove_cpuhp(hctx);
-	blk_free_flush_queue(hctx->fq);
-	sbitmap_free(&hctx->ctx_map);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2899,7 +2894,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 }
 EXPORT_SYMBOL(blk_mq_init_allocated_queue);
 
-void blk_mq_free_queue(struct request_queue *q)
+/* tags can _not_ be used after returning from blk_mq_exit_queue */
+void blk_mq_exit_queue(struct request_queue *q)
 {
 	struct blk_mq_tag_set	*set = q->tag_set;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d704fc7766f4..c421e3a16e36 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,7 +37,7 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-void blk_mq_free_queue(struct request_queue *q);
+void blk_mq_exit_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
-- 
2.9.5



* [PATCH V3 4/6] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  2019-04-03 10:26 [PATCH V3 0/6] blk-mq: fix races related with freeing queue Ming Lei
                   ` (2 preceding siblings ...)
  2019-04-03 10:26 ` [PATCH V3 3/6] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
@ 2019-04-03 10:26 ` Ming Lei
  2019-04-03 16:21   ` Bart Van Assche
  2019-04-03 10:26 ` [PATCH V3 5/6] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
  2019-04-03 10:26 ` [PATCH V3 6/6] SCSI: don't hold device refcount in IO path Ming Lei
  5 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2019-04-03 10:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

The hctx is always released after the queue's requeue work has been canceled.

With the queue's kobject refcount held, it is safe for the driver to run
the queue, so one run-queue work item might be scheduled after
blk_sync_queue() is done.

So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
to avoid running a released queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c     | 8 --------
 block/blk-mq-sysfs.c | 2 ++
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 20298aa5a77c..ad17e999f79e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -232,14 +232,6 @@ void blk_sync_queue(struct request_queue *q)
 {
 	del_timer_sync(&q->timeout);
 	cancel_work_sync(&q->timeout_work);
-
-	if (queue_is_mq(q)) {
-		struct blk_mq_hw_ctx *hctx;
-		int i;
-
-		queue_for_each_hw_ctx(q, hctx, i)
-			cancel_delayed_work_sync(&hctx->run_work);
-	}
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 4040e62c3737..25c0d0a6a556 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -35,6 +35,8 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
 
+	cancel_delayed_work_sync(&hctx->run_work);
+
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		cleanup_srcu_struct(hctx->srcu);
 	blk_free_flush_queue(hctx->fq);
-- 
2.9.5



* [PATCH V3 5/6] block: don't drain in-progress dispatch in blk_cleanup_queue()
  2019-04-03 10:26 [PATCH V3 0/6] blk-mq: fix races related with freeing queue Ming Lei
                   ` (3 preceding siblings ...)
  2019-04-03 10:26 ` [PATCH V3 4/6] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
@ 2019-04-03 10:26 ` Ming Lei
  2019-04-03 15:25   ` Bart Van Assche
  2019-04-03 10:26 ` [PATCH V3 6/6] SCSI: don't hold device refcount in IO path Ming Lei
  5 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2019-04-03 10:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Now that freeing the hw queue resources has been moved to the hctx's
release handler, we no longer need to worry about the race between
blk_cleanup_queue() and running the queue.

So don't drain in-progress dispatch in blk_cleanup_queue().

This is basically a revert of c2856ae2f315 ("blk-mq: quiesce queue before
freeing queue").

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ad17e999f79e..82b630da7e2c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -338,18 +338,6 @@ void blk_cleanup_queue(struct request_queue *q)
 
 	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
 
-	/*
-	 * make sure all in-progress dispatch are completed because
-	 * blk_freeze_queue() can only complete all requests, and
-	 * dispatch may still be in-progress since we dispatch requests
-	 * from more than one contexts.
-	 *
-	 * We rely on driver to deal with the race in case that queue
-	 * initialization isn't done.
-	 */
-	if (queue_is_mq(q) && blk_queue_init_done(q))
-		blk_mq_quiesce_queue(q);
-
 	/* for synchronous bio-based driver finish in-flight integrity i/o */
 	blk_flush_integrity();
 
-- 
2.9.5



* [PATCH V3 6/6] SCSI: don't hold device refcount in IO path
  2019-04-03 10:26 [PATCH V3 0/6] blk-mq: fix races related with freeing queue Ming Lei
                   ` (4 preceding siblings ...)
  2019-04-03 10:26 ` [PATCH V3 5/6] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
@ 2019-04-03 10:26 ` Ming Lei
  2019-04-03 16:24   ` Bart Van Assche
  5 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2019-04-03 10:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

The scsi_device's refcount is always grabbed in the IO path.

It turns out this isn't necessary, because blk_cleanup_queue() drains
any in-flight IO and then cancels the timeout/requeue work, and SCSI's
requeue_work is canceled too in __scsi_remove_device().

Also, the scsi_device won't go away until blk_cleanup_queue() is done.

So don't hold the refcount in the IO path; in particular, the refcount
hasn't been required in the IO path since blk_queue_enter() /
blk_queue_exit() were introduced in the legacy block layer.
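
As a reminder of what already protects the IO path, here is a rough,
illustrative sketch of the blk_queue_enter()/blk_queue_exit() bracketing
(heavily simplified; submit_one_bio_sketch is a made-up name and not the
real submission path):

#include <linux/blkdev.h>

static blk_qc_t submit_one_bio_sketch(struct request_queue *q)
{
	/* Fails once the queue is dying, so no new IO can enter. */
	if (blk_queue_enter(q, 0))
		return BLK_QC_T_NONE;

	/*
	 * While the reference taken above is held, blk_cleanup_queue()
	 * cannot complete, so the scsi_device backing this queue cannot
	 * be released either.
	 */

	/* ... allocate a request and dispatch it to the LLD ... */

	blk_queue_exit(q);
	return BLK_QC_T_NONE;
}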

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 30 ++++++------------------------
 1 file changed, 6 insertions(+), 24 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 601b9f1de267..6772d2ae1357 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -141,8 +141,6 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
 
 static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
 {
-	struct scsi_device *sdev = cmd->device;
-
 	if (cmd->request->rq_flags & RQF_DONTPREP) {
 		cmd->request->rq_flags &= ~RQF_DONTPREP;
 		scsi_mq_uninit_cmd(cmd);
@@ -150,7 +148,6 @@ static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
 		WARN_ON_ONCE(true);
 	}
 	blk_mq_requeue_request(cmd->request, true);
-	put_device(&sdev->sdev_gendev);
 }
 
 /**
@@ -190,18 +187,12 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, bool unbusy)
 	cmd->result = 0;
 
 	/*
-	 * Before a SCSI command is dispatched,
-	 * get_device(&sdev->sdev_gendev) is called and the host,
-	 * target and device busy counters are increased. Since
-	 * requeuing a request causes these actions to be repeated and
-	 * since scsi_device_unbusy() has already been called,
-	 * put_device(&device->sdev_gendev) must still be called. Call
-	 * put_device() after blk_mq_requeue_request() to avoid that
-	 * removal of the SCSI device can start before requeueing has
-	 * happened.
+	 * Before a SCSI command is dispatched, the host, target and
+	 * device busy counters are increased. Since requeuing a request
+	 * causes these actions to be repeated and since scsi_device_unbusy()
+	 * has already been called.
 	 */
 	blk_mq_requeue_request(cmd->request, true);
-	put_device(&device->sdev_gendev);
 }
 
 /*
@@ -619,7 +610,6 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
 		blk_mq_run_hw_queues(q, true);
 
 	percpu_ref_put(&q->q_usage_counter);
-	put_device(&sdev->sdev_gendev);
 	return false;
 }
 
@@ -1613,7 +1603,6 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
 	struct scsi_device *sdev = q->queuedata;
 
 	atomic_dec(&sdev->device_busy);
-	put_device(&sdev->sdev_gendev);
 }
 
 static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
@@ -1621,16 +1610,9 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct scsi_device *sdev = q->queuedata;
 
-	if (!get_device(&sdev->sdev_gendev))
-		goto out;
-	if (!scsi_dev_queue_ready(q, sdev))
-		goto out_put_device;
-
-	return true;
+	if (scsi_dev_queue_ready(q, sdev))
+		return true;
 
-out_put_device:
-	put_device(&sdev->sdev_gendev);
-out:
 	if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
 		blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
 	return false;
-- 
2.9.5



* Re: [PATCH V3 1/6] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-03 10:26 ` [PATCH V3 1/6] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
@ 2019-04-03 15:21   ` Bart Van Assche
  2019-04-03 16:27     ` Ming Lei
  0 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2019-04-03 15:21 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Wed, 2019-04-03 at 18:26 +0800, Ming Lei wrote:
> However, if the request is staged onto the plug list and finally queued
> from the plug code path, the submission-side refcount is actually missing.
> Then we may start to run the queue after the queue has been removed, and a
> kernel oops is triggered.

I don't think that this patch is necessary. blk_mq_get_request() increases
q_usage_counter. In other words, as long as at least one request has been
allocated that has not finished, it is guaranteed that q_usage_counter > 0.
So there is no need for additional protection in blk_mq_flush_plug_list().

Bart.


* Re: [PATCH V3 5/6] block: don't drain in-progress dispatch in blk_cleanup_queue()
  2019-04-03 10:26 ` [PATCH V3 5/6] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
@ 2019-04-03 15:25   ` Bart Van Assche
  0 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2019-04-03 15:25 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Wed, 2019-04-03 at 18:26 +0800, Ming Lei wrote:
> Now that freeing the hw queue resources has been moved to the hctx's
> release handler, we no longer need to worry about the race between
> blk_cleanup_queue() and running the queue.
>
> So don't drain in-progress dispatch in blk_cleanup_queue().
>
> This is basically a revert of c2856ae2f315 ("blk-mq: quiesce queue before
> freeing queue").

Reviewed-by: Bart Van Assche <bvanassche@acm.org>




* Re: [PATCH V3 2/6] blk-mq: move cancel of requeue_work into blk_mq_release
  2019-04-03 10:26 ` [PATCH V3 2/6] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
@ 2019-04-03 16:19   ` Bart Van Assche
  0 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2019-04-03 16:19 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Wed, 2019-04-03 at 18:26 +0800, Ming Lei wrote:
> While holding the queue's kobject refcount, it is safe for the driver
> to schedule a requeue. However, blk_mq_kick_requeue_list() may
> be called after blk_sync_queue() is done because of concurrent
> requeue activity, so the requeue work may not be completed when
> the queue is freed, and a kernel oops is triggered.
>
> So move the cancel of requeue_work into blk_mq_release() to
> avoid the race between requeueing and freeing the queue.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>
>  block/blk-core.c | 1 -
>  block/blk-mq.c   | 2 ++
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4673ebe42255..6583d67f3e34 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -237,7 +237,6 @@ void blk_sync_queue(struct request_queue *q)
>                 struct blk_mq_hw_ctx *hctx;
>                 int i;
>  
> -               cancel_delayed_work_sync(&q->requeue_work);
>                 queue_for_each_hw_ctx(q, hctx, i)
>                         cancel_delayed_work_sync(&hctx->run_work);
>         }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 5b586affee09..b512ba0cb359 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2626,6 +2626,8 @@ void blk_mq_release(struct request_queue *q)
>         struct blk_mq_hw_ctx *hctx;
>         unsigned int i;
>  
> +       cancel_delayed_work_sync(&q->requeue_work);
> +
>         /* hctx kobj stays in hctx */
>         queue_for_each_hw_ctx(q, hctx, i) {
>                 if (!hctx)

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


* Re: [PATCH V3 4/6] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  2019-04-03 10:26 ` [PATCH V3 4/6] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
@ 2019-04-03 16:21   ` Bart Van Assche
  0 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2019-04-03 16:21 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Wed, 2019-04-03 at 18:26 +0800, Ming Lei wrote:
> The hctx is always released after the queue's requeue work has been
> canceled.
>
> With the queue's kobject refcount held, it is safe for the driver to run
> the queue, so one run-queue work item might be scheduled after
> blk_sync_queue() is done.
>
> So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
> to avoid running a released queue.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>




* Re: [PATCH V3 6/6] SCSI: don't hold device refcount in IO path
  2019-04-03 10:26 ` [PATCH V3 6/6] SCSI: don't hold device refcount in IO path Ming Lei
@ 2019-04-03 16:24   ` Bart Van Assche
  2019-04-04  6:47     ` Ming Lei
  0 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2019-04-03 16:24 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Wed, 2019-04-03 at 18:26 +0800, Ming Lei wrote:
>         /*
> -        * Before a SCSI command is dispatched,
> -        * get_device(&sdev->sdev_gendev) is called and the host,
> -        * target and device busy counters are increased. Since
> -        * requeuing a request causes these actions to be repeated and
> -        * since scsi_device_unbusy() has already been called,
> -        * put_device(&device->sdev_gendev) must still be called. Call
> -        * put_device() after blk_mq_requeue_request() to avoid that
> -        * removal of the SCSI device can start before requeueing has
> -        * happened.
> +        * Before a SCSI command is dispatched, the host, target and
> +        * device busy counters are increased. Since requeuing a request
> +        * causes these actions to be repeated and since scsi_device_unbusy()
> +        * has already been called.
>          */

The last sentence of the new comment looks incomplete to me ("Since ...").

Thanks,

Bart.


* Re: [PATCH V3 3/6] blk-mq: free hw queue's resource in hctx's release handler
  2019-04-03 10:26 ` [PATCH V3 3/6] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
@ 2019-04-03 16:26   ` Bart Van Assche
  2019-04-04  6:40     ` Ming Lei
  2019-04-04  7:42     ` Ming Lei
  0 siblings, 2 replies; 19+ messages in thread
From: Bart Van Assche @ 2019-04-03 16:26 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang, stable

On Wed, 2019-04-03 at 18:26 +0800, Ming Lei wrote:
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 6583d67f3e34..20298aa5a77c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q)
>         blk_exit_queue(q);
>  
>         if (queue_is_mq(q))
> -               blk_mq_free_queue(q);
> +               blk_mq_exit_queue(q);
>  
>         percpu_ref_exit(&q->q_usage_counter);
>  
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 3f9c3f4ac44c..4040e62c3737 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -10,6 +10,7 @@
>  #include <linux/smp.h>
>  
>  #include <linux/blk-mq.h>
> +#include "blk.h"
>  #include "blk-mq.h"
>  #include "blk-mq-tag.h"
>  
> @@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
>  {
>         struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
>                                                   kobj);
> +
> +       if (hctx->flags & BLK_MQ_F_BLOCKING)
> +               cleanup_srcu_struct(hctx->srcu);
> +       blk_free_flush_queue(hctx->fq);
> +       sbitmap_free(&hctx->ctx_map);
>         free_cpumask_var(hctx->cpumask);
>         kfree(hctx->ctxs);
>         kfree(hctx);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b512ba0cb359..afc9912e2e42 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2259,12 +2259,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>         if (set->ops->exit_hctx)
>                 set->ops->exit_hctx(hctx, hctx_idx);
>  
> -       if (hctx->flags & BLK_MQ_F_BLOCKING)
> -               cleanup_srcu_struct(hctx->srcu);
> -
>         blk_mq_remove_cpuhp(hctx);
> -       blk_free_flush_queue(hctx->fq);
> -       sbitmap_free(&hctx->ctx_map);
>  }
>  
>  static void blk_mq_exit_hw_queues(struct request_queue *q,
> @@ -2899,7 +2894,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>  }
>  EXPORT_SYMBOL(blk_mq_init_allocated_queue);
>  
> -void blk_mq_free_queue(struct request_queue *q)
> +/* tags can _not_ be used after returning from blk_mq_exit_queue */
> +void blk_mq_exit_queue(struct request_queue *q)
>  {
>         struct blk_mq_tag_set   *set = q->tag_set;
>  
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index d704fc7766f4..c421e3a16e36 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -37,7 +37,7 @@ struct blk_mq_ctx {
>         struct kobject          kobj;
>  } ____cacheline_aligned_in_smp;
>  
> -void blk_mq_free_queue(struct request_queue *q);
> +void blk_mq_exit_queue(struct request_queue *q);
>  int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
>  void blk_mq_wake_waiters(struct request_queue *q);
>  bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);

Isn't this an incomplete solution? The above patch fixes the race between
cleaning up a queue and running a queue but does not address the race between
running a queue and changing the number of hardware queues.

Thanks,

Bart.


* Re: [PATCH V3 1/6] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-03 15:21   ` Bart Van Assche
@ 2019-04-03 16:27     ` Ming Lei
  2019-04-03 21:22       ` Bart Van Assche
  0 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2019-04-03 16:27 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Wed, Apr 03, 2019 at 08:21:11AM -0700, Bart Van Assche wrote:
> On Wed, 2019-04-03 at 18:26 +0800, Ming Lei wrote:
> > However, if the request is staged onto the plug list and finally queued
> > from the plug code path, the submission-side refcount is actually missing.
> > Then we may start to run the queue after the queue has been removed, and a
> > kernel oops is triggered.
> 
> I don't think that this patch is necessary. blk_mq_get_request() increases
> q_usage_counter. In other words, as long as at least one request has been
> allocated that has not finished it is guaranteed that q_usage_counter > 0.
> So there is no need for additional protection in blk_mq_flush_plug_list().

blk_mq_flush_plug_list():
	blk_mq_sched_insert_requests()
		insert requests to sw queue or scheduler queue 
		blk_mq_run_hw_queue

Because of concurrent run-queue activity, all of the requests inserted
above may have completed before the above blk_mq_run_hw_queue() is called.
The queue can then be freed while that blk_mq_run_hw_queue() is running.
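
To make the window concrete, here is a rough sketch of one batch of the
pre-patch flow (illustrative only and heavily simplified; the helper name
flush_one_batch_sketch is made up and is not the real
blk_mq_flush_plug_list()):

#include <linux/blk-mq.h>
#include "blk-mq-sched.h"	/* sketch assumes it is built inside block/ */

static void flush_one_batch_sketch(struct blk_mq_hw_ctx *hctx,
				   struct blk_mq_ctx *ctx,
				   struct list_head *rq_list)
{
	/*
	 * blk_mq_sched_insert_requests() first inserts the requests and
	 * then runs the hw queue.  A concurrent run queue from another
	 * context may dispatch and complete every inserted request in
	 * between, dropping the last queue reference, so the final
	 * run-queue step can touch an already-freed queue.  Nothing here
	 * pins q_usage_counter, which is what the patch adds around this
	 * call.
	 */
	blk_mq_sched_insert_requests(hctx, ctx, rq_list, false);
}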

Thanks,
Ming


* Re: [PATCH V3 1/6] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-03 16:27     ` Ming Lei
@ 2019-04-03 21:22       ` Bart Van Assche
  2019-04-04  6:57         ` Ming Lei
  0 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2019-04-03 21:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Thu, 2019-04-04 at 00:27 +0800, Ming Lei wrote:
> blk_mq_flush_plug_list():
> 	blk_mq_sched_insert_requests()
> 		insert requests to sw queue or scheduler queue 
> 		blk_mq_run_hw_queue
> 
> > Because of concurrent run-queue activity, all of the requests inserted
> > above may have completed before the above blk_mq_run_hw_queue() is called.
> > The queue can then be freed while that blk_mq_run_hw_queue() is running.

Thanks for the clarification. That makes sense to me.

Bart.


* Re: [PATCH V3 3/6] blk-mq: free hw queue's resource in hctx's release handler
  2019-04-03 16:26   ` Bart Van Assche
@ 2019-04-04  6:40     ` Ming Lei
  2019-04-04  7:42     ` Ming Lei
  1 sibling, 0 replies; 19+ messages in thread
From: Ming Lei @ 2019-04-04  6:40 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang,
	stable

On Wed, Apr 03, 2019 at 09:26:32AM -0700, Bart Van Assche wrote:
> On Wed, 2019-04-03 at 18:26 +0800, Ming Lei wrote:
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 6583d67f3e34..20298aa5a77c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q)
> >         blk_exit_queue(q);
> >  
> >         if (queue_is_mq(q))
> > -               blk_mq_free_queue(q);
> > +               blk_mq_exit_queue(q);
> >  
> >         percpu_ref_exit(&q->q_usage_counter);
> >  
> > diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> > index 3f9c3f4ac44c..4040e62c3737 100644
> > --- a/block/blk-mq-sysfs.c
> > +++ b/block/blk-mq-sysfs.c
> > @@ -10,6 +10,7 @@
> >  #include <linux/smp.h>
> >  
> >  #include <linux/blk-mq.h>
> > +#include "blk.h"
> >  #include "blk-mq.h"
> >  #include "blk-mq-tag.h"
> >  
> > @@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
> >  {
> >         struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
> >                                                   kobj);
> > +
> > +       if (hctx->flags & BLK_MQ_F_BLOCKING)
> > +               cleanup_srcu_struct(hctx->srcu);
> > +       blk_free_flush_queue(hctx->fq);
> > +       sbitmap_free(&hctx->ctx_map);
> >         free_cpumask_var(hctx->cpumask);
> >         kfree(hctx->ctxs);
> >         kfree(hctx);
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index b512ba0cb359..afc9912e2e42 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2259,12 +2259,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
> >         if (set->ops->exit_hctx)
> >                 set->ops->exit_hctx(hctx, hctx_idx);
> >  
> > -       if (hctx->flags & BLK_MQ_F_BLOCKING)
> > -               cleanup_srcu_struct(hctx->srcu);
> > -
> >         blk_mq_remove_cpuhp(hctx);
> > -       blk_free_flush_queue(hctx->fq);
> > -       sbitmap_free(&hctx->ctx_map);
> >  }
> >  
> >  static void blk_mq_exit_hw_queues(struct request_queue *q,
> > @@ -2899,7 +2894,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> >  }
> >  EXPORT_SYMBOL(blk_mq_init_allocated_queue);
> >  
> > -void blk_mq_free_queue(struct request_queue *q)
> > +/* tags can _not_ be used after returning from blk_mq_exit_queue */
> > +void blk_mq_exit_queue(struct request_queue *q)
> >  {
> >         struct blk_mq_tag_set   *set = q->tag_set;
> >  
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index d704fc7766f4..c421e3a16e36 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -37,7 +37,7 @@ struct blk_mq_ctx {
> >         struct kobject          kobj;
> >  } ____cacheline_aligned_in_smp;
> >  
> > -void blk_mq_free_queue(struct request_queue *q);
> > +void blk_mq_exit_queue(struct request_queue *q);
> >  int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
> >  void blk_mq_wake_waiters(struct request_queue *q);
> >  bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
> 
> Isn't this an incomplete solution? The above patch fixes the race between
> cleaning up a queue and running a queue but does not address the race between
> running a queue and changing the number of hardware queues.

In blk_mq_realloc_hw_ctxs(), we still cover the hctx's lifetime via its kobject.

When a new hctx is created, there can't be any run-queue activity on it yet.

For an hctx to be removed, the following code is run:

        for (; j < end; j++) {
                struct blk_mq_hw_ctx *hctx = hctxs[j];

                if (hctx) {
                        if (hctx->tags)
                                blk_mq_free_map_and_requests(set, j);
                        blk_mq_exit_hctx(q, set, hctx, j);
                        kobject_put(&hctx->kobj);
                        hctxs[j] = NULL;

                }
        }

So the not-yet-completed run-queue activity can still run until the
kobject refcount is released, the same approach as in blk_cleanup_queue().

So could you explain what the race between run queue and
blk_mq_update_nr_hw_queues() is?


Thanks, 
Ming


* Re: [PATCH V3 6/6] SCSI: don't hold device refcount in IO path
  2019-04-03 16:24   ` Bart Van Assche
@ 2019-04-04  6:47     ` Ming Lei
  0 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2019-04-04  6:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Wed, Apr 03, 2019 at 09:24:54AM -0700, Bart Van Assche wrote:
> On Wed, 2019-04-03 at 18:26 +0800, Ming Lei wrote:
> >         /*
> > -        * Before a SCSI command is dispatched,
> > -        * get_device(&sdev->sdev_gendev) is called and the host,
> > -        * target and device busy counters are increased. Since
> > -        * requeuing a request causes these actions to be repeated and
> > -        * since scsi_device_unbusy() has already been called,
> > -        * put_device(&device->sdev_gendev) must still be called. Call
> > -        * put_device() after blk_mq_requeue_request() to avoid that
> > -        * removal of the SCSI device can start before requeueing has
> > -        * happened.
> > +        * Before a SCSI command is dispatched, the host, target and
> > +        * device busy counters are increased. Since requeuing a request
> > +        * causes these actions to be repeated and since scsi_device_unbusy()
> > +        * has already been called.
> >          */
> 
> The last sentence of the new comment looks incomplete to me ("Since ...").

Given that most of the previous comment is about get/put of the device,
I'd suggest removing the whole block, since we are clear that getting/putting
the device isn't required in the IO path.

Will do that in V4 if there are no objections.

Thanks,
Ming


* Re: [PATCH V3 1/6] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-03 21:22       ` Bart Van Assche
@ 2019-04-04  6:57         ` Ming Lei
  0 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2019-04-04  6:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Wed, Apr 03, 2019 at 02:22:54PM -0700, Bart Van Assche wrote:
> On Thu, 2019-04-04 at 00:27 +0800, Ming Lei wrote:
> > blk_mq_flush_plug_list():
> > 	blk_mq_sched_insert_requests()
> > 		insert requests to sw queue or scheduler queue 
> > 		blk_mq_run_hw_queue
> > 
> > Because of concurrent run-queue activity, all of the requests inserted
> > above may have completed before the above blk_mq_run_hw_queue() is called.
> > The queue can then be freed while that blk_mq_run_hw_queue() is running.
> 
> Thanks for the clarification. That makes sense to me.

I guess it would be helpful for review to add the above description to the
commit log; I will do that in V4.

Thanks,
Ming


* Re: [PATCH V3 3/6] blk-mq: free hw queue's resource in hctx's release handler
  2019-04-03 16:26   ` Bart Van Assche
  2019-04-04  6:40     ` Ming Lei
@ 2019-04-04  7:42     ` Ming Lei
  1 sibling, 0 replies; 19+ messages in thread
From: Ming Lei @ 2019-04-04  7:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang,
	stable

On Wed, Apr 03, 2019 at 09:26:32AM -0700, Bart Van Assche wrote:
> On Wed, 2019-04-03 at 18:26 +0800, Ming Lei wrote:
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 6583d67f3e34..20298aa5a77c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q)
> >         blk_exit_queue(q);
> >  
> >         if (queue_is_mq(q))
> > -               blk_mq_free_queue(q);
> > +               blk_mq_exit_queue(q);
> >  
> >         percpu_ref_exit(&q->q_usage_counter);
> >  
> > diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> > index 3f9c3f4ac44c..4040e62c3737 100644
> > --- a/block/blk-mq-sysfs.c
> > +++ b/block/blk-mq-sysfs.c
> > @@ -10,6 +10,7 @@
> >  #include <linux/smp.h>
> >  
> >  #include <linux/blk-mq.h>
> > +#include "blk.h"
> >  #include "blk-mq.h"
> >  #include "blk-mq-tag.h"
> >  
> > @@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
> >  {
> >         struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
> >                                                   kobj);
> > +
> > +       if (hctx->flags & BLK_MQ_F_BLOCKING)
> > +               cleanup_srcu_struct(hctx->srcu);
> > +       blk_free_flush_queue(hctx->fq);
> > +       sbitmap_free(&hctx->ctx_map);
> >         free_cpumask_var(hctx->cpumask);
> >         kfree(hctx->ctxs);
> >         kfree(hctx);
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index b512ba0cb359..afc9912e2e42 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2259,12 +2259,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
> >         if (set->ops->exit_hctx)
> >                 set->ops->exit_hctx(hctx, hctx_idx);
> >  
> > -       if (hctx->flags & BLK_MQ_F_BLOCKING)
> > -               cleanup_srcu_struct(hctx->srcu);
> > -
> >         blk_mq_remove_cpuhp(hctx);
> > -       blk_free_flush_queue(hctx->fq);
> > -       sbitmap_free(&hctx->ctx_map);
> >  }
> >  
> >  static void blk_mq_exit_hw_queues(struct request_queue *q,
> > @@ -2899,7 +2894,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> >  }
> >  EXPORT_SYMBOL(blk_mq_init_allocated_queue);
> >  
> > -void blk_mq_free_queue(struct request_queue *q)
> > +/* tags can _not_ be used after returning from blk_mq_exit_queue */
> > +void blk_mq_exit_queue(struct request_queue *q)
> >  {
> >         struct blk_mq_tag_set   *set = q->tag_set;
> >  
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index d704fc7766f4..c421e3a16e36 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -37,7 +37,7 @@ struct blk_mq_ctx {
> >         struct kobject          kobj;
> >  } ____cacheline_aligned_in_smp;
> >  
> > -void blk_mq_free_queue(struct request_queue *q);
> > +void blk_mq_exit_queue(struct request_queue *q);
> >  int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
> >  void blk_mq_wake_waiters(struct request_queue *q);
> >  bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
> 
> Isn't this an incomplete solution? The above patch fixes the race between
> cleaning up a queue and running a queue but does not address the race between
> running a queue and changing the number of hardware queues.

Now I understand the race with updating nr_hw_queues, thanks!

We may have to quiesce the request queues in __blk_mq_update_nr_hw_queues()
to avoid this race.

thanks,
Ming

