* [PATCH V5 0/9] blk-mq: fix races related with freeing queue
@ 2019-04-12  3:30 Ming Lei
  2019-04-12  3:30 ` [PATCH V5 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
                   ` (8 more replies)
  0 siblings, 9 replies; 26+ messages in thread
From: Ming Lei @ 2019-04-12  3:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Hi,

Since 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), running the queue
isn't allowed during queue cleanup even though the queue's refcount is held.

This change has caused many kernel oopses in the run-queue path, and it
turns out they aren't easy to fix one by one.

So move the freeing of hw queue resources into the hctx's release handler,
which fixes the above issue. Meanwhile, this approach is safe, given that
freeing the hw queue resources doesn't require tags.

V3 covers more races.

V5:
	- refactor blk_mq_alloc_and_init_hctx()
	- fix a race related to updating nr_hw_queues by always freeing hctx
	  after the request queue is released

V4:
	- add patch for fixing potential use-after-free in blk_mq_update_nr_hw_queues
	- fix comment in the last patch

V3:
	- cancel q->requeue_work in queue's release handler
	- cancel hctx->run_work in hctx's release handler
	- add patch 1 for fixing race in plug code path
	- add the last patch to avoid grabbing SCSI's refcount
	in the IO path

V2:
	- move freeing of hw queue resources into hctx's release handler



Ming Lei (9):
  blk-mq: grab .q_usage_counter when queuing request from plug code path
  blk-mq: move cancel of requeue_work into blk_mq_release
  blk-mq: free hw queue's resource in hctx's release handler
  blk-mq: move all hctx allocation & initialization into
    __blk_mq_alloc_and_init_hctx
  blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  blk-mq: always free hctx after request queue is freed
  blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  block: don't drain in-progress dispatch in blk_cleanup_queue()
  SCSI: don't hold device refcount in IO path

 block/blk-core.c        |  23 +-----
 block/blk-mq-sysfs.c    |   8 +++
 block/blk-mq.c          | 188 ++++++++++++++++++++++++++++--------------------
 block/blk-mq.h          |   2 +-
 drivers/scsi/scsi_lib.c |  28 +-------
 include/linux/blk-mq.h  |   2 +
 include/linux/blkdev.h  |   7 ++
 7 files changed, 132 insertions(+), 126 deletions(-)

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>

-- 
2.9.5


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH V5 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-12  3:30 [PATCH V5 0/9] blk-mq: fix races related with freeing queue Ming Lei
@ 2019-04-12  3:30 ` Ming Lei
  2019-04-12  8:20   ` Johannes Thumshirn
  2019-04-12 10:55   ` Hannes Reinecke
  2019-04-12  3:30 ` [PATCH V5 2/9] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 26+ messages in thread
From: Ming Lei @ 2019-04-12  3:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Just like aio/io_uring, we need to grab two refcounts for queuing one
request: one for submission and another for completion.

If the request isn't queued from the plug code path, the refcount grabbed
in generic_make_request() serves the submission side. In theory, this
refcount should be released after the submission (async queue run) is
done. blk_freeze_queue() works together with blk_sync_queue() to avoid
races between queue cleanup and IO submission: async queue-run activity
is canceled because hctx->run_work is scheduled with the refcount held,
so it is fine not to hold the refcount while the queue-run work function
dispatches IO.

However, if a request is staged in the plug list and finally queued from
the plug code path, the submission-side refcount is actually missing,
and we may start to run the queue after the queue has been removed,
because the queue's kobject refcount isn't guaranteed to be held in the
plug-flushing context; a kernel oops is then triggered. See the
following race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
                blk_mq_run_hw_queue

Because of concurrent queue runs, all requests inserted above may be
completed before the above blk_mq_run_hw_queue() is called, so the queue
can be freed during that blk_mq_run_hw_queue() call.

Fix the issue by grabbing .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is safe
because the queue is definitely alive before inserting the request.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ff3d7b49969..5b586affee09 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1728,9 +1728,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 		if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
 			if (this_hctx) {
 				trace_block_unplug(this_q, depth, !from_schedule);
+
+				percpu_ref_get(&this_q->q_usage_counter);
 				blk_mq_sched_insert_requests(this_hctx, this_ctx,
 								&rq_list,
 								from_schedule);
+				percpu_ref_put(&this_q->q_usage_counter);
 			}
 
 			this_q = rq->q;
@@ -1749,8 +1752,11 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	 */
 	if (this_hctx) {
 		trace_block_unplug(this_q, depth, !from_schedule);
+
+		percpu_ref_get(&this_q->q_usage_counter);
 		blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list,
 						from_schedule);
+		percpu_ref_put(&this_q->q_usage_counter);
 	}
 }
 
-- 
2.9.5



* [PATCH V5 2/9] blk-mq: move cancel of requeue_work into blk_mq_release
  2019-04-12  3:30 [PATCH V5 0/9] blk-mq: fix races related with freeing queue Ming Lei
  2019-04-12  3:30 ` [PATCH V5 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
@ 2019-04-12  3:30 ` Ming Lei
  2019-04-12  8:23   ` Johannes Thumshirn
  2019-04-12  3:30 ` [PATCH V5 3/9] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2019-04-12  3:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

While the queue's kobject refcount is held, it is safe for the driver
to schedule a requeue. However, blk_mq_kick_requeue_list() may be
called after blk_sync_queue() has finished, because of concurrent
requeue activity, so the requeue work may not be completed when the
queue is freed, and a kernel oops is triggered.

So move the cancel of requeue_work into blk_mq_release() to avoid the
race between requeue and freeing the queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 1 -
 block/blk-mq.c   | 2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4673ebe42255..6583d67f3e34 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -237,7 +237,6 @@ void blk_sync_queue(struct request_queue *q)
 		struct blk_mq_hw_ctx *hctx;
 		int i;
 
-		cancel_delayed_work_sync(&q->requeue_work);
 		queue_for_each_hw_ctx(q, hctx, i)
 			cancel_delayed_work_sync(&hctx->run_work);
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5b586affee09..b512ba0cb359 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2626,6 +2626,8 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
+	cancel_delayed_work_sync(&q->requeue_work);
+
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
-- 
2.9.5



* [PATCH V5 3/9] blk-mq: free hw queue's resource in hctx's release handler
  2019-04-12  3:30 [PATCH V5 0/9] blk-mq: fix races related with freeing queue Ming Lei
  2019-04-12  3:30 ` [PATCH V5 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
  2019-04-12  3:30 ` [PATCH V5 2/9] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
@ 2019-04-12  3:30 ` Ming Lei
  2019-04-12 11:03   ` Hannes Reinecke
  2019-04-12  3:30 ` [PATCH V5 4/9] blk-mq: move all hctx allocation & initialization into __blk_mq_alloc_and_init_hctx Ming Lei
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2019-04-12  3:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang,
	stable

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes exactly this issue.

However, that commit introduces another issue. Before 45a9c9d909b2,
we were allowed to run the queue during queue cleanup if the queue's
kobj refcount was held. After that commit, the queue can't be run
during queue cleanup, otherwise an oops is easily triggered because
some fields of the hctx are freed by blk_mq_free_queue() in
blk_cleanup_queue().

We have devised ways to address this kind of issue before, such as:

	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But they still can't cover all cases; recently James reported another
issue of this kind:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue is quite hard to address with the previous approaches, given
that scsi_run_queue() may run requeues for other LUNs.

Fix the above issue by freeing the hctx's resources in its release
handler. This is safe because tags aren't needed for freeing such hctx
resources.

This approach follows the typical design pattern for a kobject's release
handler.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c     | 2 +-
 block/blk-mq-sysfs.c | 6 ++++++
 block/blk-mq.c       | 8 ++------
 block/blk-mq.h       | 2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6583d67f3e34..20298aa5a77c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	blk_exit_queue(q);
 
 	if (queue_is_mq(q))
-		blk_mq_free_queue(q);
+		blk_mq_exit_queue(q);
 
 	percpu_ref_exit(&q->q_usage_counter);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3f9c3f4ac44c..4040e62c3737 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -10,6 +10,7 @@
 #include <linux/smp.h>
 
 #include <linux/blk-mq.h>
+#include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
@@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 {
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
+
+	if (hctx->flags & BLK_MQ_F_BLOCKING)
+		cleanup_srcu_struct(hctx->srcu);
+	blk_free_flush_queue(hctx->fq);
+	sbitmap_free(&hctx->ctx_map);
 	free_cpumask_var(hctx->cpumask);
 	kfree(hctx->ctxs);
 	kfree(hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b512ba0cb359..afc9912e2e42 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2259,12 +2259,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		cleanup_srcu_struct(hctx->srcu);
-
 	blk_mq_remove_cpuhp(hctx);
-	blk_free_flush_queue(hctx->fq);
-	sbitmap_free(&hctx->ctx_map);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2899,7 +2894,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 }
 EXPORT_SYMBOL(blk_mq_init_allocated_queue);
 
-void blk_mq_free_queue(struct request_queue *q)
+/* tags can _not_ be used after returning from blk_mq_exit_queue */
+void blk_mq_exit_queue(struct request_queue *q)
 {
 	struct blk_mq_tag_set	*set = q->tag_set;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d704fc7766f4..c421e3a16e36 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,7 +37,7 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-void blk_mq_free_queue(struct request_queue *q);
+void blk_mq_exit_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
-- 
2.9.5



* [PATCH V5 4/9] blk-mq: move all hctx allocation & initialization into __blk_mq_alloc_and_init_hctx
  2019-04-12  3:30 [PATCH V5 0/9] blk-mq: fix races related with freeing queue Ming Lei
                   ` (2 preceding siblings ...)
  2019-04-12  3:30 ` [PATCH V5 3/9] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
@ 2019-04-12  3:30 ` Ming Lei
  2019-04-12 11:04   ` Hannes Reinecke
  2019-04-12  3:30 ` [PATCH V5 5/9] blk-mq: split blk_mq_alloc_and_init_hctx into two parts Ming Lei
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2019-04-12  3:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Move all hctx-related allocation and initialization into
__blk_mq_alloc_and_init_hctx(), preparing to split
blk_mq_alloc_and_init_hctx() into two real functions: one for allocating
everything, and another for initializing everything.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 93 +++++++++++++++++++++++++++-------------------------------
 1 file changed, 44 insertions(+), 49 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index afc9912e2e42..5ad58169ad6a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2276,15 +2276,45 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	}
 }
 
-static int blk_mq_init_hctx(struct request_queue *q,
-		struct blk_mq_tag_set *set,
-		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
+static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
 {
-	int node;
+	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
+
+	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
+			   __alignof__(struct blk_mq_hw_ctx)) !=
+		     sizeof(struct blk_mq_hw_ctx));
+
+	if (tag_set->flags & BLK_MQ_F_BLOCKING)
+		hw_ctx_size += sizeof(struct srcu_struct);
+
+	return hw_ctx_size;
+}
+
+static struct blk_mq_hw_ctx *
+__blk_mq_alloc_and_init_hctx(struct request_queue *q,
+			     struct blk_mq_tag_set *set,
+			     unsigned hctx_idx, int node)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
+			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+			node);
+	if (!hctx)
+		goto fail_alloc_hctx;
+
+	if (!zalloc_cpumask_var_node(&hctx->cpumask,
+				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+				node))
+		goto free_hctx;
+
+	atomic_set(&hctx->nr_active, 0);
+	hctx->numa_node = node;
+	hctx->queue_num = hctx_idx;
 
-	node = hctx->numa_node;
 	if (node == NUMA_NO_NODE)
-		node = hctx->numa_node = set->numa_node;
+		hctx->numa_node = set->numa_node;
+	node = hctx->numa_node;
 
 	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
 	spin_lock_init(&hctx->lock);
@@ -2329,8 +2359,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		init_srcu_struct(hctx->srcu);
+	blk_mq_hctx_kobj_init(hctx);
 
-	return 0;
+	return hctx;
 
  free_fq:
 	kfree(hctx->fq);
@@ -2343,7 +2374,11 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	kfree(hctx->ctxs);
  unregister_cpu_notifier:
 	blk_mq_remove_cpuhp(hctx);
-	return -1;
+	free_cpumask_var(hctx->cpumask);
+ free_hctx:
+	kfree(hctx);
+ fail_alloc_hctx:
+	return NULL;
 }
 
 static void blk_mq_init_cpu_queues(struct request_queue *q,
@@ -2689,51 +2724,11 @@ struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
 }
 EXPORT_SYMBOL(blk_mq_init_sq_queue);
 
-static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
-{
-	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
-
-	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
-			   __alignof__(struct blk_mq_hw_ctx)) !=
-		     sizeof(struct blk_mq_hw_ctx));
-
-	if (tag_set->flags & BLK_MQ_F_BLOCKING)
-		hw_ctx_size += sizeof(struct srcu_struct);
-
-	return hw_ctx_size;
-}
-
 static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	struct blk_mq_hw_ctx *hctx;
-
-	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
-			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
-			node);
-	if (!hctx)
-		return NULL;
-
-	if (!zalloc_cpumask_var_node(&hctx->cpumask,
-				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
-				node)) {
-		kfree(hctx);
-		return NULL;
-	}
-
-	atomic_set(&hctx->nr_active, 0);
-	hctx->numa_node = node;
-	hctx->queue_num = hctx_idx;
-
-	if (blk_mq_init_hctx(q, set, hctx, hctx_idx)) {
-		free_cpumask_var(hctx->cpumask);
-		kfree(hctx);
-		return NULL;
-	}
-	blk_mq_hctx_kobj_init(hctx);
-
-	return hctx;
+	return __blk_mq_alloc_and_init_hctx(q, set, hctx_idx, node);
 }
 
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
-- 
2.9.5



* [PATCH V5 5/9] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  2019-04-12  3:30 [PATCH V5 0/9] blk-mq: fix races related with freeing queue Ming Lei
                   ` (3 preceding siblings ...)
  2019-04-12  3:30 ` [PATCH V5 4/9] blk-mq: move all hctx allocation & initialization into __blk_mq_alloc_and_init_hctx Ming Lei
@ 2019-04-12  3:30 ` Ming Lei
  2019-04-12 11:04   ` Hannes Reinecke
  2019-04-12  3:30 ` [PATCH V5 6/9] blk-mq: always free hctx after request queue is freed Ming Lei
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2019-04-12  3:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Split blk_mq_alloc_and_init_hctx() into two parts: blk_mq_alloc_hctx(),
which allocates all hctx resources, and blk_mq_init_hctx(), which
initializes the hctx and serves as the counterpart of blk_mq_exit_hctx().

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 69 ++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 45 insertions(+), 24 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5ad58169ad6a..71996fe494eb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2290,10 +2290,36 @@ static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
 	return hw_ctx_size;
 }
 
+static int blk_mq_init_hctx(struct request_queue *q,
+			    struct blk_mq_tag_set *set,
+			    struct blk_mq_hw_ctx *hctx,
+			    unsigned hctx_idx)
+{
+	hctx->queue_num = hctx_idx;
+
+	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
+
+	hctx->tags = set->tags[hctx_idx];
+
+	if (set->ops->init_hctx &&
+	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
+		goto fail;
+
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
+				hctx->numa_node))
+		goto exit_hctx;
+	return 0;
+ exit_hctx:
+	if (set->ops->exit_hctx)
+		set->ops->exit_hctx(hctx, hctx_idx);
+ fail:
+	return -1;
+}
+
 static struct blk_mq_hw_ctx *
-__blk_mq_alloc_and_init_hctx(struct request_queue *q,
-			     struct blk_mq_tag_set *set,
-			     unsigned hctx_idx, int node)
+blk_mq_alloc_hctx(struct request_queue *q,
+		  struct blk_mq_tag_set *set,
+		  unsigned hctx_idx, int node)
 {
 	struct blk_mq_hw_ctx *hctx;
 
@@ -2310,8 +2336,6 @@ __blk_mq_alloc_and_init_hctx(struct request_queue *q,
 
 	atomic_set(&hctx->nr_active, 0);
 	hctx->numa_node = node;
-	hctx->queue_num = hctx_idx;
-
 	if (node == NUMA_NO_NODE)
 		hctx->numa_node = set->numa_node;
 	node = hctx->numa_node;
@@ -2322,10 +2346,6 @@ __blk_mq_alloc_and_init_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
-	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
-
-	hctx->tags = set->tags[hctx_idx];
-
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
@@ -2338,24 +2358,16 @@ __blk_mq_alloc_and_init_hctx(struct request_queue *q,
 	if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8),
 				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node))
 		goto free_ctxs;
-
 	hctx->nr_ctx = 0;
 
 	spin_lock_init(&hctx->dispatch_wait_lock);
 	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
 	INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
 
-	if (set->ops->init_hctx &&
-	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
-		goto free_bitmap;
-
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size,
 			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
 	if (!hctx->fq)
-		goto exit_hctx;
-
-	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
-		goto free_fq;
+		goto free_bitmap;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		init_srcu_struct(hctx->srcu);
@@ -2363,11 +2375,6 @@ __blk_mq_alloc_and_init_hctx(struct request_queue *q,
 
 	return hctx;
 
- free_fq:
-	kfree(hctx->fq);
- exit_hctx:
-	if (set->ops->exit_hctx)
-		set->ops->exit_hctx(hctx, hctx_idx);
  free_bitmap:
 	sbitmap_free(&hctx->ctx_map);
  free_ctxs:
@@ -2728,7 +2735,21 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	return __blk_mq_alloc_and_init_hctx(q, set, hctx_idx, node);
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
+	if (!hctx)
+		goto fail;
+
+	if (blk_mq_init_hctx(q, set, hctx, hctx_idx))
+		goto free_hctx;
+
+	return hctx;
+
+ free_hctx:
+	kobject_put(&hctx->kobj);
+ fail:
+	return NULL;
 }
 
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
-- 
2.9.5



* [PATCH V5 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-12  3:30 [PATCH V5 0/9] blk-mq: fix races related with freeing queue Ming Lei
                   ` (4 preceding siblings ...)
  2019-04-12  3:30 ` [PATCH V5 5/9] blk-mq: split blk_mq_alloc_and_init_hctx into two parts Ming Lei
@ 2019-04-12  3:30 ` Ming Lei
  2019-04-12 11:06   ` Hannes Reinecke
  2019-04-12  3:30 ` [PATCH V5 7/9] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2019-04-12  3:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

In the normal queue cleanup path, the hctx is released after the request
queue is freed; see blk_mq_release().

However, in __blk_mq_update_nr_hw_queues(), an hctx may be freed because
the hw queues are shrinking. This can easily cause a use-after-free:
one implicit rule is that it is safe to call almost all block layer APIs
while the request queue is alive, so an hctx may be retrieved by one API
and then freed by blk_mq_update_nr_hw_queues(), finally triggering the
use-after-free.

Fix this issue by always freeing the hctx after the request queue is
released. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
introduce a per-queue list to hold them, then try to reuse these hctxs
if the NUMA node matches.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 40 +++++++++++++++++++++++++++-------------
 include/linux/blk-mq.h |  2 ++
 include/linux/blkdev.h |  7 +++++++
 3 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 71996fe494eb..886fbb678617 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2260,6 +2260,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 		set->ops->exit_hctx(hctx, hctx_idx);
 
 	blk_mq_remove_cpuhp(hctx);
+
+	spin_lock(&q->dead_hctx_lock);
+	list_add(&hctx->hctx_list, &q->dead_hctx_list);
+	spin_unlock(&q->dead_hctx_lock);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2660,15 +2664,13 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
  */
 void blk_mq_release(struct request_queue *q)
 {
-	struct blk_mq_hw_ctx *hctx;
-	unsigned int i;
+	struct blk_mq_hw_ctx *hctx, *next;
 
 	cancel_delayed_work_sync(&q->requeue_work);
 
-	/* hctx kobj stays in hctx */
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (!hctx)
-			continue;
+	/* all hctx are in .dead_hctx_list now */
+	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
+		list_del_init(&hctx->hctx_list);
 		kobject_put(&hctx->kobj);
 	}
 
@@ -2735,9 +2737,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
+
+	/* reuse dead hctx first */
+	spin_lock(&q->dead_hctx_lock);
+	list_for_each_entry(tmp, &q->dead_hctx_list, hctx_list) {
+		if (tmp->numa_node == node) {
+			hctx = tmp;
+			break;
+		}
+	}
+	if (hctx)
+		list_del_init(&hctx->hctx_list);
+	spin_unlock(&q->dead_hctx_lock);
 
-	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
+	if (!hctx)
+		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
 	if (!hctx)
 		goto fail;
 
@@ -2775,10 +2790,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 
 		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
 		if (hctx) {
-			if (hctxs[i]) {
+			if (hctxs[i])
 				blk_mq_exit_hctx(q, set, hctxs[i], i);
-				kobject_put(&hctxs[i]->kobj);
-			}
 			hctxs[i] = hctx;
 		} else {
 			if (hctxs[i])
@@ -2809,9 +2822,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 			if (hctx->tags)
 				blk_mq_free_map_and_requests(set, j);
 			blk_mq_exit_hctx(q, set, hctx, j);
-			kobject_put(&hctx->kobj);
 			hctxs[j] = NULL;
-
 		}
 	}
 	mutex_unlock(&q->sysfs_lock);
@@ -2854,6 +2865,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_sys_init;
 
+	INIT_LIST_HEAD(&q->dead_hctx_list);
+	spin_lock_init(&q->dead_hctx_lock);
+
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index cb2aa7ecafff..a44c3f95dcc1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -70,6 +70,8 @@ struct blk_mq_hw_ctx {
 	struct dentry		*sched_debugfs_dir;
 #endif
 
+	struct list_head	hctx_list;
+
 	/* Must be the last member - see also blk_mq_hw_ctx_size(). */
 	struct srcu_struct	srcu[0];
 };
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4b85dc066264..1325f941f0be 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -535,6 +535,13 @@ struct request_queue {
 
 	struct mutex		sysfs_lock;
 
+	/*
+	 * for reusing dead hctx instance in case of updating
+	 * nr_hw_queues
+	 */
+	struct list_head	dead_hctx_list;
+	spinlock_t		dead_hctx_lock;
+
 	atomic_t		mq_freeze_depth;
 
 #if defined(CONFIG_BLK_DEV_BSG)
-- 
2.9.5



* [PATCH V5 7/9] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  2019-04-12  3:30 [PATCH V5 0/9] blk-mq: fix races related with freeing queue Ming Lei
                   ` (5 preceding siblings ...)
  2019-04-12  3:30 ` [PATCH V5 6/9] blk-mq: always free hctx after request queue is freed Ming Lei
@ 2019-04-12  3:30 ` Ming Lei
  2019-04-12 11:08   ` Hannes Reinecke
  2019-04-12  3:30 ` [PATCH V5 8/9] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
  2019-04-12  3:30 ` [PATCH V5 9/9] SCSI: don't hold device refcount in IO path Ming Lei
  8 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2019-04-12  3:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

The hctx is always released after the request queue is freed.

While the queue's kobject refcount is held, it is safe for the driver
to run the queue, so a queue run might be scheduled after
blk_sync_queue() is done.

So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
to avoid running a released queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c     | 8 --------
 block/blk-mq-sysfs.c | 2 ++
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 20298aa5a77c..ad17e999f79e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -232,14 +232,6 @@ void blk_sync_queue(struct request_queue *q)
 {
 	del_timer_sync(&q->timeout);
 	cancel_work_sync(&q->timeout_work);
-
-	if (queue_is_mq(q)) {
-		struct blk_mq_hw_ctx *hctx;
-		int i;
-
-		queue_for_each_hw_ctx(q, hctx, i)
-			cancel_delayed_work_sync(&hctx->run_work);
-	}
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 4040e62c3737..25c0d0a6a556 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -35,6 +35,8 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
 
+	cancel_delayed_work_sync(&hctx->run_work);
+
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		cleanup_srcu_struct(hctx->srcu);
 	blk_free_flush_queue(hctx->fq);
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread
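The patch above moves work cancellation into the kobject release handler, so cancellation happens only once the last reference is dropped and no holder can schedule the work again. The idea can be modeled with a minimal userspace sketch in plain C (all names here are hypothetical; this is not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model: cancel deferred work in the object's release handler,
 * not in an earlier sync step that can race with a later queue run. */
struct toy_hctx {
	int refcount;
	bool work_pending;   /* stands in for hctx->run_work */
	bool work_cancelled; /* set by the release handler   */
	bool freed;          /* flag instead of free(), so it stays observable */
};

static struct toy_hctx *toy_hctx_alloc(void)
{
	struct toy_hctx *h = calloc(1, sizeof(*h));

	h->refcount = 1;
	return h;
}

/* Anyone still holding a reference may legitimately schedule work... */
static void toy_schedule_run_work(struct toy_hctx *h)
{
	assert(h->refcount > 0);
	h->work_pending = true;
}

/* ...so the only safe place to cancel it is the final-put release. */
static void toy_hctx_release(struct toy_hctx *h)
{
	h->work_pending = false;
	h->work_cancelled = true;
	h->freed = true; /* resources are torn down only after cancellation */
}

static void toy_hctx_put(struct toy_hctx *h)
{
	if (--h->refcount == 0)
		toy_hctx_release(h);
}
```

In the real code, the final kobject_put() plays the role of toy_hctx_put(), and cancel_delayed_work_sync() plays the role of the cancellation step.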

* [PATCH V5 8/9] block: don't drain in-progress dispatch in blk_cleanup_queue()
  2019-04-12  3:30 [PATCH V5 0/9] blk-mq: fix races related with freeing queue Ming Lei
                   ` (6 preceding siblings ...)
  2019-04-12  3:30 ` [PATCH V5 7/9] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
@ 2019-04-12  3:30 ` Ming Lei
  2019-04-12 11:09   ` Hannes Reinecke
  2019-04-12  3:30 ` [PATCH V5 9/9] SCSI: don't hold device refcount in IO path Ming Lei
  8 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2019-04-12  3:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Now that freeing hw queue resources has been moved into the hctx's
release handler, we no longer need to worry about the race between
blk_cleanup_queue() and queue runs.

So don't drain in-progress dispatch in blk_cleanup_queue().

This is basically a revert of c2856ae2f315 ("blk-mq: quiesce queue before
freeing queue").

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ad17e999f79e..82b630da7e2c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -338,18 +338,6 @@ void blk_cleanup_queue(struct request_queue *q)
 
 	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
 
-	/*
-	 * make sure all in-progress dispatch are completed because
-	 * blk_freeze_queue() can only complete all requests, and
-	 * dispatch may still be in-progress since we dispatch requests
-	 * from more than one contexts.
-	 *
-	 * We rely on driver to deal with the race in case that queue
-	 * initialization isn't done.
-	 */
-	if (queue_is_mq(q) && blk_queue_init_done(q))
-		blk_mq_quiesce_queue(q);
-
 	/* for synchronous bio-based driver finish in-flight integrity i/o */
 	blk_flush_integrity();
 
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH V5 9/9] SCSI: don't hold device refcount in IO path
  2019-04-12  3:30 [PATCH V5 0/9] blk-mq: fix races related with freeing queue Ming Lei
                   ` (7 preceding siblings ...)
  2019-04-12  3:30 ` [PATCH V5 8/9] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
@ 2019-04-12  3:30 ` Ming Lei
  2019-04-12 11:09   ` Hannes Reinecke
  2019-04-13  0:04   ` Martin K. Petersen
  8 siblings, 2 replies; 26+ messages in thread
From: Ming Lei @ 2019-04-12  3:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

The scsi_device's refcount is always grabbed in the IO path.

It turns out this isn't necessary, because blk_cleanup_queue() will
drain any in-flight IOs, then cancel the timeout/requeue work, and
SCSI's requeue_work is canceled in __scsi_remove_device() as well.

Also, the scsi_device won't go away until blk_cleanup_queue() is done.

So don't hold the refcount in the IO path; the refcount hasn't been
required there since blk_queue_enter() / blk_queue_exit() were
introduced in the legacy block layer.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 28 ++--------------------------
 1 file changed, 2 insertions(+), 26 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 601b9f1de267..3369d66911eb 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -141,8 +141,6 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
 
 static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
 {
-	struct scsi_device *sdev = cmd->device;
-
 	if (cmd->request->rq_flags & RQF_DONTPREP) {
 		cmd->request->rq_flags &= ~RQF_DONTPREP;
 		scsi_mq_uninit_cmd(cmd);
@@ -150,7 +148,6 @@ static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
 		WARN_ON_ONCE(true);
 	}
 	blk_mq_requeue_request(cmd->request, true);
-	put_device(&sdev->sdev_gendev);
 }
 
 /**
@@ -189,19 +186,7 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, bool unbusy)
 	 */
 	cmd->result = 0;
 
-	/*
-	 * Before a SCSI command is dispatched,
-	 * get_device(&sdev->sdev_gendev) is called and the host,
-	 * target and device busy counters are increased. Since
-	 * requeuing a request causes these actions to be repeated and
-	 * since scsi_device_unbusy() has already been called,
-	 * put_device(&device->sdev_gendev) must still be called. Call
-	 * put_device() after blk_mq_requeue_request() to avoid that
-	 * removal of the SCSI device can start before requeueing has
-	 * happened.
-	 */
 	blk_mq_requeue_request(cmd->request, true);
-	put_device(&device->sdev_gendev);
 }
 
 /*
@@ -619,7 +604,6 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
 		blk_mq_run_hw_queues(q, true);
 
 	percpu_ref_put(&q->q_usage_counter);
-	put_device(&sdev->sdev_gendev);
 	return false;
 }
 
@@ -1613,7 +1597,6 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
 	struct scsi_device *sdev = q->queuedata;
 
 	atomic_dec(&sdev->device_busy);
-	put_device(&sdev->sdev_gendev);
 }
 
 static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
@@ -1621,16 +1604,9 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct scsi_device *sdev = q->queuedata;
 
-	if (!get_device(&sdev->sdev_gendev))
-		goto out;
-	if (!scsi_dev_queue_ready(q, sdev))
-		goto out_put_device;
-
-	return true;
+	if (scsi_dev_queue_ready(q, sdev))
+		return true;
 
-out_put_device:
-	put_device(&sdev->sdev_gendev);
-out:
 	if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
 		blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
 	return false;
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread
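The lifetime argument in patch 9 above — that blk_queue_enter()/blk_queue_exit() already pin the queue (and thus the scsi_device) for the duration of each request, making a per-command device refcount redundant — can be modeled as a toy gate in plain C (illustrative names only; this is not the real percpu_ref API):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the q_usage_counter gate: IO enters through the gate,
 * and cleanup may only proceed once the gate is marked dying and has
 * drained, so the device cannot vanish under an in-flight command. */
struct toy_queue_gate {
	int in_flight; /* models q->q_usage_counter      */
	bool dying;    /* set once cleanup has started   */
};

static bool toy_queue_enter(struct toy_queue_gate *g)
{
	if (g->dying)
		return false; /* no new IO once cleanup begins */
	g->in_flight++;
	return true;
}

static void toy_queue_exit(struct toy_queue_gate *g)
{
	assert(g->in_flight > 0);
	g->in_flight--;
}

/* Cleanup marks the gate dying, and may free only once it has drained. */
static bool toy_queue_cleanup_may_free(struct toy_queue_gate *g)
{
	g->dying = true;
	return g->in_flight == 0;
}
```

Because every command runs strictly inside an enter/exit pair, an extra get_device()/put_device() per command adds nothing to the lifetime guarantee, which is what lets the patch delete it.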

* Re: [PATCH V5 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-12  3:30 ` [PATCH V5 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
@ 2019-04-12  8:20   ` Johannes Thumshirn
  2019-04-12 10:55   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Johannes Thumshirn @ 2019-04-12  8:20 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 2/9] blk-mq: move cancel of requeue_work into blk_mq_release
  2019-04-12  3:30 ` [PATCH V5 2/9] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
@ 2019-04-12  8:23   ` Johannes Thumshirn
  0 siblings, 0 replies; 26+ messages in thread
From: Johannes Thumshirn @ 2019-04-12  8:23 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-12  3:30 ` [PATCH V5 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
  2019-04-12  8:20   ` Johannes Thumshirn
@ 2019-04-12 10:55   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2019-04-12 10:55 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On 4/12/19 5:30 AM, Ming Lei wrote:
> Just like aio/io_uring, we need to grab two refcounts for queuing one
> request: one for submission, and another for completion.
> 
> If the request isn't queued from the plug code path, the refcount grabbed
> in generic_make_request() serves for submission. In theory, this
> refcount should be released after the submission (async run queue) is
> done. blk_freeze_queue() works together with blk_sync_queue() to avoid
> the race between cleaning up the queue and IO submission, given that
> async run-queue activities are canceled; and because hctx->run_work is
> scheduled with the refcount held, it is fine not to hold the refcount
> while running the run-queue work function to dispatch IO.
> 
> However, if a request is staged onto the plug list and finally queued
> from the plug code path, the submission-side refcount is actually
> missing. We may then start to run the queue after the queue has been
> removed, because the queue's kobject refcount isn't guaranteed to be
> held in the plug-flushing context, and a kernel oops is triggered; see
> the following race:
> 
> blk_mq_flush_plug_list():
>          blk_mq_sched_insert_requests()
>                  insert requests to sw queue or scheduler queue
>                  blk_mq_run_hw_queue
> 
> Because of a concurrent queue run, all requests inserted above may
> complete before the call to blk_mq_run_hw_queue() above, so the queue
> can be freed during that blk_mq_run_hw_queue().
> 
> Fix the issue by grabbing .q_usage_counter before calling
> blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is
> safe because the queue is definitely alive before a request is inserted.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b49969..5b586affee09 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1728,9 +1728,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>   		if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
>   			if (this_hctx) {
>   				trace_block_unplug(this_q, depth, !from_schedule);
> +
> +				percpu_ref_get(&this_q->q_usage_counter);
>   				blk_mq_sched_insert_requests(this_hctx, this_ctx,
>   								&rq_list,
>   								from_schedule);
> +				percpu_ref_put(&this_q->q_usage_counter);
>   			}
>   
>   			this_q = rq->q;
> @@ -1749,8 +1752,11 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>   	 */
>   	if (this_hctx) {
>   		trace_block_unplug(this_q, depth, !from_schedule);
> +
> +		percpu_ref_get(&this_q->q_usage_counter);
>   		blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list,
>   						from_schedule);
> +		percpu_ref_put(&this_q->q_usage_counter);
>   	}
>   }
>   
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread
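The two-reference reasoning quoted above can be modeled with a small userspace sketch: without a submission-side reference of its own, the flush path could observe a freed queue once the inserted requests complete and drop theirs. All names below are hypothetical; this is not the blk-mq API:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of patch 1: the plug-flush path takes its own usage
 * reference around insert+run, because the requests it inserts may
 * complete (dropping their references) before the run step finishes. */
struct toy_q {
	int usage; /* models q->q_usage_counter */
	bool freed;
};

static void toy_q_get(struct toy_q *q) { q->usage++; }

static void toy_q_put(struct toy_q *q)
{
	if (--q->usage == 0)
		q->freed = true; /* queue release fires at the final put */
}

/* The run step touches the queue, so it must not observe q->freed. */
static bool toy_run_queue(struct toy_q *q) { return !q->freed; }

/* Flush with the fix applied: hold our own submission-side reference. */
static bool toy_flush_plug(struct toy_q *q, int nr_reqs)
{
	bool ok;

	toy_q_get(q);          /* the reference the patch adds          */
	while (nr_reqs--)
		toy_q_put(q);  /* inserted requests all complete early  */
	ok = toy_run_queue(q); /* queue is still alive at the run step  */
	toy_q_put(q);
	return ok;
}
```

Without the toy_q_get()/final toy_q_put() pair, a queue whose only remaining references belonged to the inserted requests would be freed before toy_run_queue() ran, which mirrors the oops described in the commit message.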

* Re: [PATCH V5 3/9] blk-mq: free hw queue's resource in hctx's release handler
  2019-04-12  3:30 ` [PATCH V5 3/9] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
@ 2019-04-12 11:03   ` Hannes Reinecke
  2019-04-13  7:18     ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2019-04-12 11:03 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang, stable

On 4/12/19 5:30 AM, Ming Lei wrote:
> Once blk_cleanup_queue() returns, tags shouldn't be used any more,
> because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
> ("blk-mq: Fix a use-after-free") fixes this issue exactly.
> 
> However, that commit introduces another issue. Before 45a9c9d909b2,
> we were allowed to run the queue during queue cleanup as long as the
> queue's kobj refcount was held. After that commit, the queue can't be
> run during cleanup, otherwise an oops is easily triggered because some
> fields of the hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().
> 
> We have invented ways for addressing this kind of issue before, such as:
> 
> 	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
> 	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")
> 
> But these still can't cover all cases; recently James reported another
> such issue:
> 
> 	https://marc.info/?l=linux-scsi&m=155389088124782&w=2
> 
> This issue is quite hard to address in the previous way, given
> scsi_run_queue() may run requeues for other LUNs.
> 
> Fix the above issue by freeing the hctx's resources in its release
> handler; this is safe because the tags aren't needed for freeing such
> hctx resources.
> 
> This approach follows the typical design pattern for a kobject's
> release handler.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Reported-by: James Smart <james.smart@broadcom.com>
> Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c     | 2 +-
>   block/blk-mq-sysfs.c | 6 ++++++
>   block/blk-mq.c       | 8 ++------
>   block/blk-mq.h       | 2 +-
>   4 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 6583d67f3e34..20298aa5a77c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q)
>   	blk_exit_queue(q);
>   
>   	if (queue_is_mq(q))
> -		blk_mq_free_queue(q);
> +		blk_mq_exit_queue(q);
>   
>   	percpu_ref_exit(&q->q_usage_counter);
>   
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 3f9c3f4ac44c..4040e62c3737 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -10,6 +10,7 @@
>   #include <linux/smp.h>
>   
>   #include <linux/blk-mq.h>
> +#include "blk.h"
>   #include "blk-mq.h"
>   #include "blk-mq-tag.h"
>   
> @@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
>   {
>   	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
>   						  kobj);
> +
> +	if (hctx->flags & BLK_MQ_F_BLOCKING)
> +		cleanup_srcu_struct(hctx->srcu);
> +	blk_free_flush_queue(hctx->fq);
> +	sbitmap_free(&hctx->ctx_map);
>   	free_cpumask_var(hctx->cpumask);
>   	kfree(hctx->ctxs);
>   	kfree(hctx);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b512ba0cb359..afc9912e2e42 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2259,12 +2259,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>   	if (set->ops->exit_hctx)
>   		set->ops->exit_hctx(hctx, hctx_idx);
>   
> -	if (hctx->flags & BLK_MQ_F_BLOCKING)
> -		cleanup_srcu_struct(hctx->srcu);
> -
>   	blk_mq_remove_cpuhp(hctx);
> -	blk_free_flush_queue(hctx->fq);
> -	sbitmap_free(&hctx->ctx_map);
>   }
>   
>   static void blk_mq_exit_hw_queues(struct request_queue *q,
> @@ -2899,7 +2894,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>   }
>   EXPORT_SYMBOL(blk_mq_init_allocated_queue);
>   
> -void blk_mq_free_queue(struct request_queue *q)
> +/* tags can _not_ be used after returning from blk_mq_exit_queue */
> +void blk_mq_exit_queue(struct request_queue *q)
>   {
>   	struct blk_mq_tag_set	*set = q->tag_set;
>   
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index d704fc7766f4..c421e3a16e36 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -37,7 +37,7 @@ struct blk_mq_ctx {
>   	struct kobject		kobj;
>   } ____cacheline_aligned_in_smp;
>   
> -void blk_mq_free_queue(struct request_queue *q);
> +void blk_mq_exit_queue(struct request_queue *q);
>   int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
>   void blk_mq_wake_waiters(struct request_queue *q);
>   bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
> 
Hmm. While this is a good point, I'm somehow not convinced that it fixes 
the mentioned problem.

For the sbitmap_any_set() case, the problem appears to be the racy 
interaction between nvme_scan_queues() and nvme_reset_ctrl().
Both can (and do) run in parallel, and as nvme_scan_queue() is not able 
to update the nvme namespace list race-free (nvme_ns_remove() will kill 
the namespace before removing it from the queue) we'll hit issues where 
nvme_start_queues() is called on namespaces which are halfway through 
nvme_ns_remove(), causing all sorts of issues.

See "[PATCHv2 RFC] nvme: use nvme_set_queue_dying() during namespace 
rescanning" for my attempt to fix this.
That patch is actually not complete; there is another one under testing 
currently to not remove namespaces at all during nvme_reset_ctrl(), but 
results are not in yet.

Otherwise:

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 4/9] blk-mq: move all hctx alloction & initialization into __blk_mq_alloc_and_init_hctx
  2019-04-12  3:30 ` [PATCH V5 4/9] blk-mq: move all hctx alloction & initialization into __blk_mq_alloc_and_init_hctx Ming Lei
@ 2019-04-12 11:04   ` Hannes Reinecke
  0 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2019-04-12 11:04 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On 4/12/19 5:30 AM, Ming Lei wrote:
> Move all hctx-related allocation and initialization into
> __blk_mq_alloc_and_init_hctx(), and prepare for splitting
> blk_mq_alloc_and_init_hctx() into two real functions: one to allocate
> everything, and another to initialize everything.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 93 +++++++++++++++++++++++++++-------------------------------
>   1 file changed, 44 insertions(+), 49 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 5/9] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  2019-04-12  3:30 ` [PATCH V5 5/9] blk-mq: split blk_mq_alloc_and_init_hctx into two parts Ming Lei
@ 2019-04-12 11:04   ` Hannes Reinecke
  0 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2019-04-12 11:04 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On 4/12/19 5:30 AM, Ming Lei wrote:
> Split blk_mq_alloc_and_init_hctx() into two parts: blk_mq_alloc_hctx(),
> which allocates all hctx resources, and blk_mq_init_hctx(), which
> initializes the hctx and serves as the counterpart of blk_mq_exit_hctx().
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 69 ++++++++++++++++++++++++++++++++++++++--------------------
>   1 file changed, 45 insertions(+), 24 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-12  3:30 ` [PATCH V5 6/9] blk-mq: always free hctx after request queue is freed Ming Lei
@ 2019-04-12 11:06   ` Hannes Reinecke
  2019-04-13  7:27     ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2019-04-12 11:06 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On 4/12/19 5:30 AM, Ming Lei wrote:
> In normal queue cleanup path, hctx is released after request queue
> is freed, see blk_mq_release().
> 
> However, in __blk_mq_update_nr_hw_queues(), an hctx may be freed because
> the number of hw queues is shrinking. This can easily cause a
> use-after-free: one implicit rule is that it is safe to call almost any
> block layer API while the request queue is alive, so an hctx may be
> retrieved by one API and then freed by blk_mq_update_nr_hw_queues(),
> triggering the use-after-free.
> 
> Fix this issue by always freeing the hctx after the request queue is
> released. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
> park them on a per-queue list, then try to reuse them when the numa
> node matches.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c         | 40 +++++++++++++++++++++++++++-------------
>   include/linux/blk-mq.h |  2 ++
>   include/linux/blkdev.h |  7 +++++++
>   3 files changed, 36 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 71996fe494eb..886fbb678617 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2260,6 +2260,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>   		set->ops->exit_hctx(hctx, hctx_idx);
>   
>   	blk_mq_remove_cpuhp(hctx);
> +
> +	spin_lock(&q->dead_hctx_lock);
> +	list_add(&hctx->hctx_list, &q->dead_hctx_list);
> +	spin_unlock(&q->dead_hctx_lock);
>   }
>   
>   static void blk_mq_exit_hw_queues(struct request_queue *q,
> @@ -2660,15 +2664,13 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
>    */
>   void blk_mq_release(struct request_queue *q)
>   {
> -	struct blk_mq_hw_ctx *hctx;
> -	unsigned int i;
> +	struct blk_mq_hw_ctx *hctx, *next;
>   
>   	cancel_delayed_work_sync(&q->requeue_work);
>   
> -	/* hctx kobj stays in hctx */
> -	queue_for_each_hw_ctx(q, hctx, i) {
> -		if (!hctx)
> -			continue;
> +	/* all hctx are in .dead_hctx_list now */
> +	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
> +		list_del_init(&hctx->hctx_list);
>   		kobject_put(&hctx->kobj);
>   	}
>   
> @@ -2735,9 +2737,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
>   		struct blk_mq_tag_set *set, struct request_queue *q,
>   		int hctx_idx, int node)
>   {
> -	struct blk_mq_hw_ctx *hctx;
> +	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
> +
> +	/* reuse dead hctx first */
> +	spin_lock(&q->dead_hctx_lock);
> +	list_for_each_entry(tmp, &q->dead_hctx_list, hctx_list) {
> +		if (tmp->numa_node == node) {
> +			hctx = tmp;
> +			break;
> +		}
> +	}
> +	if (hctx)
> +		list_del_init(&hctx->hctx_list);
> +	spin_unlock(&q->dead_hctx_lock);
>   
> -	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
> +	if (!hctx)
> +		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
>   	if (!hctx)
>   		goto fail;
>   
> @@ -2775,10 +2790,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>   
>   		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
>   		if (hctx) {
> -			if (hctxs[i]) {
> +			if (hctxs[i])
>   				blk_mq_exit_hctx(q, set, hctxs[i], i);
> -				kobject_put(&hctxs[i]->kobj);
> -			}
>   			hctxs[i] = hctx;
>   		} else {
>   			if (hctxs[i])
> @@ -2809,9 +2822,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>   			if (hctx->tags)
>   				blk_mq_free_map_and_requests(set, j);
>   			blk_mq_exit_hctx(q, set, hctx, j);
> -			kobject_put(&hctx->kobj);
>   			hctxs[j] = NULL;
> -
>   		}
>   	}
>   	mutex_unlock(&q->sysfs_lock);
> @@ -2854,6 +2865,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>   	if (!q->queue_hw_ctx)
>   		goto err_sys_init;
>   
> +	INIT_LIST_HEAD(&q->dead_hctx_list);
> +	spin_lock_init(&q->dead_hctx_lock);
> +
>   	blk_mq_realloc_hw_ctxs(set, q);
>   	if (!q->nr_hw_queues)
>   		goto err_hctxs;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index cb2aa7ecafff..a44c3f95dcc1 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -70,6 +70,8 @@ struct blk_mq_hw_ctx {
>   	struct dentry		*sched_debugfs_dir;
>   #endif
>   
> +	struct list_head	hctx_list;
> +
>   	/* Must be the last member - see also blk_mq_hw_ctx_size(). */
>   	struct srcu_struct	srcu[0];
>   };
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 4b85dc066264..1325f941f0be 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -535,6 +535,13 @@ struct request_queue {
>   
>   	struct mutex		sysfs_lock;
>   
> +	/*
> +	 * for reusing dead hctx instance in case of updating
> +	 * nr_hw_queues
> +	 */
> +	struct list_head	dead_hctx_list;
> +	spinlock_t		dead_hctx_lock;
> +
>   	atomic_t		mq_freeze_depth;
>   
>   #if defined(CONFIG_BLK_DEV_BSG)
> 
I actually had been looking into this, too, but couldn't convince myself 
that the code really is a problem.
Did you see this happening in real life?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread
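The dead-hctx reuse scheme discussed above amounts to a node-keyed free list: retired hctxs are parked instead of freed, and a later allocation on a matching NUMA node hands one back. A minimal single-threaded sketch in plain C follows (the real code additionally guards the list with dead_hctx_lock; all names are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model of patch 6's dead-hctx list: park on retire, reuse on a
 * NUMA-node match, allocate fresh otherwise. */
struct toy_hctx {
	int numa_node;
	struct toy_hctx *next;
};

/* Retire: push onto the dead list instead of freeing. */
static void toy_retire_hctx(struct toy_hctx **dead_list, struct toy_hctx *h)
{
	h->next = *dead_list;
	*dead_list = h;
}

/* Allocate: prefer a parked instance from the same NUMA node. */
static struct toy_hctx *toy_get_hctx(struct toy_hctx **dead_list, int node)
{
	struct toy_hctx **pp;
	struct toy_hctx *h;

	for (pp = dead_list; *pp; pp = &(*pp)->next) {
		if ((*pp)->numa_node == node) {
			h = *pp;
			*pp = h->next; /* unlink the reused entry */
			h->next = NULL;
			return h;
		}
	}
	h = calloc(1, sizeof(*h)); /* no match: fresh allocation */
	h->numa_node = node;
	return h;
}
```

Parking instead of freeing is what keeps any hctx pointer obtained through a block layer API valid for as long as the request queue itself is alive, which is the invariant the commit message relies on.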

* Re: [PATCH V5 7/9] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  2019-04-12  3:30 ` [PATCH V5 7/9] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
@ 2019-04-12 11:08   ` Hannes Reinecke
  0 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2019-04-12 11:08 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On 4/12/19 5:30 AM, Ming Lei wrote:
> The hctx is always released after the request queue is freed.
> 
> A driver is allowed to run the queue while holding the queue's kobject
> refcount, so a queue run might still be scheduled after blk_sync_queue()
> is done.
> 
> So move the cancellation of hctx->run_work into blk_mq_hw_sysfs_release()
> to avoid running a released queue.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c     | 8 --------
>   block/blk-mq-sysfs.c | 2 ++
>   2 files changed, 2 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

* Re: [PATCH V5 8/9] block: don't drain in-progress dispatch in blk_cleanup_queue()
  2019-04-12  3:30 ` [PATCH V5 8/9] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
@ 2019-04-12 11:09   ` Hannes Reinecke
  0 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2019-04-12 11:09 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On 4/12/19 5:30 AM, Ming Lei wrote:
> Now that freeing the hw queue resource has been moved to hctx's release
> handler, we don't need to worry about the race between blk_cleanup_queue
> and run queue any more.
> 
> So don't drain in-progress dispatch in blk_cleanup_queue().
> 
> This is basically a revert of c2856ae2f315 ("blk-mq: quiesce queue before
> freeing queue").
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c | 12 ------------
>   1 file changed, 12 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

* Re: [PATCH V5 9/9] SCSI: don't hold device refcount in IO path
  2019-04-12  3:30 ` [PATCH V5 9/9] SCSI: don't hold device refcount in IO path Ming Lei
@ 2019-04-12 11:09   ` Hannes Reinecke
  2019-04-13  0:04   ` Martin K. Petersen
  1 sibling, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2019-04-12 11:09 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On 4/12/19 5:30 AM, Ming Lei wrote:
> scsi_device's refcount is always grabbed in the IO path.
> 
> Turns out it isn't necessary, because blk_cleanup_queue() will
> drain any in-flight IOs, then cancel the timeout/requeue work, and
> SCSI's requeue_work is canceled too in __scsi_remove_device().
> 
> Also the scsi_device won't go away until blk_cleanup_queue() is done.
> 
> So don't hold the refcount in the IO path; in particular, the refcount
> hasn't been required in the IO path since blk_queue_enter() /
> blk_queue_exit() were introduced in the legacy block layer.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   drivers/scsi/scsi_lib.c | 28 ++--------------------------
>   1 file changed, 2 insertions(+), 26 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

* Re: [PATCH V5 9/9] SCSI: don't hold device refcount in IO path
  2019-04-12  3:30 ` [PATCH V5 9/9] SCSI: don't hold device refcount in IO path Ming Lei
  2019-04-12 11:09   ` Hannes Reinecke
@ 2019-04-13  0:04   ` Martin K. Petersen
  2019-04-13  6:56     ` Ming Lei
  1 sibling, 1 reply; 26+ messages in thread
From: Martin K. Petersen @ 2019-04-13  0:04 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang


Ming,

> scsi_device's refcount is always grabbed in the IO path.
>
> Turns out it isn't necessary, because blk_cleanup_queue() will
> drain any in-flight IOs, then cancel the timeout/requeue work, and
> SCSI's requeue_work is canceled too in __scsi_remove_device().
>
> Also the scsi_device won't go away until blk_cleanup_queue() is done.
>
> So don't hold the refcount in the IO path; in particular, the refcount
> hasn't been required in the IO path since blk_queue_enter() /
> blk_queue_exit() were introduced in the legacy block layer.

This all looks good to me. I assume it will go through Jens' tree?

Acked-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [PATCH V5 9/9] SCSI: don't hold device refcount in IO path
  2019-04-13  0:04   ` Martin K. Petersen
@ 2019-04-13  6:56     ` Ming Lei
  2019-04-13  9:23       ` Ming Lei
  2019-04-16  2:12       ` Martin K. Petersen
  0 siblings, 2 replies; 26+ messages in thread
From: Ming Lei @ 2019-04-13  6:56 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Fri, Apr 12, 2019 at 08:04:40PM -0400, Martin K. Petersen wrote:
> 
> Ming,
> 
> > scsi_device's refcount is always grabbed in the IO path.
> >
> > Turns out it isn't necessary, because blk_cleanup_queue() will
> > drain any in-flight IOs, then cancel the timeout/requeue work, and
> > SCSI's requeue_work is canceled too in __scsi_remove_device().
> >
> > Also the scsi_device won't go away until blk_cleanup_queue() is done.
> >
> > So don't hold the refcount in the IO path; in particular, the refcount
> > hasn't been required in the IO path since blk_queue_enter() /
> > blk_queue_exit() were introduced in the legacy block layer.
> 
> This all looks good to me. I assume it will go through Jens' tree?
> 
> Acked-by: Martin K. Petersen <martin.petersen@oracle.com>

Hi Nartin,

This patch doesn't depend on patches 1-8, so please take it via the SCSI
tree if you are fine with that.


Thanks,
Ming

* Re: [PATCH V5 3/9] blk-mq: free hw queue's resource in hctx's release handler
  2019-04-12 11:03   ` Hannes Reinecke
@ 2019-04-13  7:18     ` Ming Lei
  0 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2019-04-13  7:18 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang,
	stable

On Fri, Apr 12, 2019 at 01:03:35PM +0200, Hannes Reinecke wrote:
> On 4/12/19 5:30 AM, Ming Lei wrote:
> > Once blk_cleanup_queue() returns, tags shouldn't be used any more,
> > because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
> > ("blk-mq: Fix a use-after-free") fixes this issue exactly.
> > 
> > However, that commit introduces another issue. Before 45a9c9d909b2,
> > we were allowed to run the queue during queue cleanup as long as the
> > queue's kobj refcount was held. After that commit, the queue can't be
> > run during queue cleanup; otherwise an oops can be triggered easily,
> > because some fields of the hctx are freed by blk_mq_free_queue() in
> > blk_cleanup_queue().
> > 
> > We have invented ways for addressing this kind of issue before, such as:
> > 
> > 	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
> > 	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")
> > 
> > But they still can't cover all cases; recently James reported another
> > issue of this kind:
> > 
> > 	https://marc.info/?l=linux-scsi&m=155389088124782&w=2
> > 
> > This issue is quite hard to address with the previous approach, given
> > that scsi_run_queue() may run requeues for other LUNs.
> > 
> > Fix the above issue by freeing the hctx's resources in its release
> > handler. This is safe because tags aren't needed for freeing such hctx
> > resources.
> > 
> > This approach follows the typical design pattern wrt. a kobject's
> > release handler.
> > 
> > Cc: Dongli Zhang <dongli.zhang@oracle.com>
> > Cc: James Smart <james.smart@broadcom.com>
> > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > Cc: linux-scsi@vger.kernel.org,
> > Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> > Cc: Christoph Hellwig <hch@lst.de>,
> > Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> > Cc: jianchao wang <jianchao.w.wang@oracle.com>
> > Reported-by: James Smart <james.smart@broadcom.com>
> > Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-core.c     | 2 +-
> >   block/blk-mq-sysfs.c | 6 ++++++
> >   block/blk-mq.c       | 8 ++------
> >   block/blk-mq.h       | 2 +-
> >   4 files changed, 10 insertions(+), 8 deletions(-)
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 6583d67f3e34..20298aa5a77c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q)
> >   	blk_exit_queue(q);
> >   	if (queue_is_mq(q))
> > -		blk_mq_free_queue(q);
> > +		blk_mq_exit_queue(q);
> >   	percpu_ref_exit(&q->q_usage_counter);
> > diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> > index 3f9c3f4ac44c..4040e62c3737 100644
> > --- a/block/blk-mq-sysfs.c
> > +++ b/block/blk-mq-sysfs.c
> > @@ -10,6 +10,7 @@
> >   #include <linux/smp.h>
> >   #include <linux/blk-mq.h>
> > +#include "blk.h"
> >   #include "blk-mq.h"
> >   #include "blk-mq-tag.h"
> > @@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
> >   {
> >   	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
> >   						  kobj);
> > +
> > +	if (hctx->flags & BLK_MQ_F_BLOCKING)
> > +		cleanup_srcu_struct(hctx->srcu);
> > +	blk_free_flush_queue(hctx->fq);
> > +	sbitmap_free(&hctx->ctx_map);
> >   	free_cpumask_var(hctx->cpumask);
> >   	kfree(hctx->ctxs);
> >   	kfree(hctx);
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index b512ba0cb359..afc9912e2e42 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2259,12 +2259,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
> >   	if (set->ops->exit_hctx)
> >   		set->ops->exit_hctx(hctx, hctx_idx);
> > -	if (hctx->flags & BLK_MQ_F_BLOCKING)
> > -		cleanup_srcu_struct(hctx->srcu);
> > -
> >   	blk_mq_remove_cpuhp(hctx);
> > -	blk_free_flush_queue(hctx->fq);
> > -	sbitmap_free(&hctx->ctx_map);
> >   }
> >   static void blk_mq_exit_hw_queues(struct request_queue *q,
> > @@ -2899,7 +2894,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> >   }
> >   EXPORT_SYMBOL(blk_mq_init_allocated_queue);
> > -void blk_mq_free_queue(struct request_queue *q)
> > +/* tags can _not_ be used after returning from blk_mq_exit_queue */
> > +void blk_mq_exit_queue(struct request_queue *q)
> >   {
> >   	struct blk_mq_tag_set	*set = q->tag_set;
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index d704fc7766f4..c421e3a16e36 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -37,7 +37,7 @@ struct blk_mq_ctx {
> >   	struct kobject		kobj;
> >   } ____cacheline_aligned_in_smp;
> > -void blk_mq_free_queue(struct request_queue *q);
> > +void blk_mq_exit_queue(struct request_queue *q);
> >   int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
> >   void blk_mq_wake_waiters(struct request_queue *q);
> >   bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
> > 
> Hmm. While this is a good point, I'm somehow not convinced that it fixes the
> mentioned problem.

This patch fixes the following generic issue in the block layer:

1) usually, any block layer API may be called safely while the request
queue's refcount is held

2) while blk_cleanup_queue() is running, no use-after-free oops can be
triggered on the code path in 1)

> 
> For the sbitmap_any_set() case, the problem appears to be the racy
> interaction between nvme_scan_queues() and nvme_reset_ctrl().
> Both can (and do) run in parallel, and as nvme_scan_queue() is not able to
> update the nvme namespace list race-free (nvme_ns_remove() will kill the
> namespace before removing it from the queue) we'll hit issues where
> nvme_start_queues() is called on namespaces which are halfway through
> nvme_ns_remove(), causing all sorts of issues.

This is a race between NVMe reset and scan; in particular, the ns
queue's refcount isn't held before calling nvme_stop_queues()
and nvme_start_queues().

The above situation is an NVMe-specific issue and is outside what this
patch tries to address.

> 
> See "[PATCHv2 RFC] nvme: use nvme_set_queue_dying() during namespace
> rescanning" for my attempt to fix this.
> That patch is actually not complete; there is another one under testing
> currently to not remove namespaces at all during nvme_reset_ctrl(), but
> results are not in yet.
> 
> Otherwise:

It looks like the above 'Otherwise' shouldn't be needed, as I explained.

> 
> Reviewed-by: Hannes Reinecke <hare@suse.com>

Thanks!

-- 
Ming

* Re: [PATCH V5 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-12 11:06   ` Hannes Reinecke
@ 2019-04-13  7:27     ` Ming Lei
  0 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2019-04-13  7:27 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Fri, Apr 12, 2019 at 01:06:07PM +0200, Hannes Reinecke wrote:
> On 4/12/19 5:30 AM, Ming Lei wrote:
> > In normal queue cleanup path, hctx is released after request queue
> > is freed, see blk_mq_release().
> > 
> > However, in __blk_mq_update_nr_hw_queues(), an hctx may be freed
> > because the number of hw queues is shrinking. This can easily cause a
> > use-after-free, because one implicit rule is that it is safe to call
> > almost all block layer APIs while the request queue is alive; so an
> > hctx may be retrieved by one API, then freed by
> > blk_mq_update_nr_hw_queues(), and finally a use-after-free is
> > triggered.
> > 
> > Fix this issue by always freeing the hctx after the request queue is
> > released. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
> > hold them on a newly introduced per-queue list, then try to reuse
> > these hctxs if the NUMA node matches.
> > 
> > Cc: Dongli Zhang <dongli.zhang@oracle.com>
> > Cc: James Smart <james.smart@broadcom.com>
> > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > Cc: linux-scsi@vger.kernel.org,
> > Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> > Cc: Christoph Hellwig <hch@lst.de>,
> > Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> > Cc: jianchao wang <jianchao.w.wang@oracle.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq.c         | 40 +++++++++++++++++++++++++++-------------
> >   include/linux/blk-mq.h |  2 ++
> >   include/linux/blkdev.h |  7 +++++++
> >   3 files changed, 36 insertions(+), 13 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 71996fe494eb..886fbb678617 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2260,6 +2260,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
> >   		set->ops->exit_hctx(hctx, hctx_idx);
> >   	blk_mq_remove_cpuhp(hctx);
> > +
> > +	spin_lock(&q->dead_hctx_lock);
> > +	list_add(&hctx->hctx_list, &q->dead_hctx_list);
> > +	spin_unlock(&q->dead_hctx_lock);
> >   }
> >   static void blk_mq_exit_hw_queues(struct request_queue *q,
> > @@ -2660,15 +2664,13 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
> >    */
> >   void blk_mq_release(struct request_queue *q)
> >   {
> > -	struct blk_mq_hw_ctx *hctx;
> > -	unsigned int i;
> > +	struct blk_mq_hw_ctx *hctx, *next;
> >   	cancel_delayed_work_sync(&q->requeue_work);
> > -	/* hctx kobj stays in hctx */
> > -	queue_for_each_hw_ctx(q, hctx, i) {
> > -		if (!hctx)
> > -			continue;
> > +	/* all hctx are in .dead_hctx_list now */
> > +	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
> > +		list_del_init(&hctx->hctx_list);
> >   		kobject_put(&hctx->kobj);
> >   	}
> > @@ -2735,9 +2737,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
> >   		struct blk_mq_tag_set *set, struct request_queue *q,
> >   		int hctx_idx, int node)
> >   {
> > -	struct blk_mq_hw_ctx *hctx;
> > +	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
> > +
> > +	/* reuse dead hctx first */
> > +	spin_lock(&q->dead_hctx_lock);
> > +	list_for_each_entry(tmp, &q->dead_hctx_list, hctx_list) {
> > +		if (tmp->numa_node == node) {
> > +			hctx = tmp;
> > +			break;
> > +		}
> > +	}
> > +	if (hctx)
> > +		list_del_init(&hctx->hctx_list);
> > +	spin_unlock(&q->dead_hctx_lock);
> > -	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
> > +	if (!hctx)
> > +		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
> >   	if (!hctx)
> >   		goto fail;
> > @@ -2775,10 +2790,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> >   		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
> >   		if (hctx) {
> > -			if (hctxs[i]) {
> > +			if (hctxs[i])
> >   				blk_mq_exit_hctx(q, set, hctxs[i], i);
> > -				kobject_put(&hctxs[i]->kobj);
> > -			}
> >   			hctxs[i] = hctx;
> >   		} else {
> >   			if (hctxs[i])
> > @@ -2809,9 +2822,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> >   			if (hctx->tags)
> >   				blk_mq_free_map_and_requests(set, j);
> >   			blk_mq_exit_hctx(q, set, hctx, j);
> > -			kobject_put(&hctx->kobj);
> >   			hctxs[j] = NULL;
> > -
> >   		}
> >   	}
> >   	mutex_unlock(&q->sysfs_lock);
> > @@ -2854,6 +2865,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> >   	if (!q->queue_hw_ctx)
> >   		goto err_sys_init;
> > +	INIT_LIST_HEAD(&q->dead_hctx_list);
> > +	spin_lock_init(&q->dead_hctx_lock);
> > +
> >   	blk_mq_realloc_hw_ctxs(set, q);
> >   	if (!q->nr_hw_queues)
> >   		goto err_hctxs;
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index cb2aa7ecafff..a44c3f95dcc1 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -70,6 +70,8 @@ struct blk_mq_hw_ctx {
> >   	struct dentry		*sched_debugfs_dir;
> >   #endif
> > +	struct list_head	hctx_list;
> > +
> >   	/* Must be the last member - see also blk_mq_hw_ctx_size(). */
> >   	struct srcu_struct	srcu[0];
> >   };
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 4b85dc066264..1325f941f0be 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -535,6 +535,13 @@ struct request_queue {
> >   	struct mutex		sysfs_lock;
> > +	/*
> > +	 * for reusing dead hctx instance in case of updating
> > +	 * nr_hw_queues
> > +	 */
> > +	struct list_head	dead_hctx_list;
> > +	spinlock_t		dead_hctx_lock;
> > +
> >   	atomic_t		mq_freeze_depth;
> >   #if defined(CONFIG_BLK_DEV_BSG)
> > 
> I actually had been looking into this, too, but couldn't convince myself
> that the code really is a problem.
> Did you see this happening in real life?

A use-after-free on an hctx can be triggered if the hctx is freed
in blk_mq_update_nr_hw_queues().

This patch aligns with the normal queue cleanup model, and will avoid
this kind of issue.

Updating nr_hw_queues is only triggered in very specific situations, so
in practice this isn't easy to hit.

However, a CPU hotplug & timeout test on nvme-loop should have made it
easy to trigger, but nvme-loop's reset/timeout handler is simply broken.
The same test can be done on nvme-pci too; it just needs real hardware
and an environment, and takes time...


Thanks,
Ming

* Re: [PATCH V5 9/9] SCSI: don't hold device refcount in IO path
  2019-04-13  6:56     ` Ming Lei
@ 2019-04-13  9:23       ` Ming Lei
  2019-04-16  2:12       ` Martin K. Petersen
  1 sibling, 0 replies; 26+ messages in thread
From: Ming Lei @ 2019-04-13  9:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: Martin K. Petersen, Jens Axboe, linux-block, Dongli Zhang,
	James Smart, Bart Van Assche, Linux SCSI List, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Sat, Apr 13, 2019 at 2:57 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Fri, Apr 12, 2019 at 08:04:40PM -0400, Martin K. Petersen wrote:
> >
> > Ming,
> >
> > > scsi_device's refcount is always grabbed in the IO path.
> > >
> > > Turns out it isn't necessary, because blk_cleanup_queue() will
> > > drain any in-flight IOs, then cancel the timeout/requeue work, and
> > > SCSI's requeue_work is canceled too in __scsi_remove_device().
> > >
> > > Also the scsi_device won't go away until blk_cleanup_queue() is done.
> > >
> > > So don't hold the refcount in the IO path; in particular, the refcount
> > > hasn't been required in the IO path since blk_queue_enter() /
> > > blk_queue_exit() were introduced in the legacy block layer.
> >
> > This all looks good to me. I assume it will go through Jens' tree?
> >
> > Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
>
> Hi Nartin,

Martin, sorry for the typo, :-(


Thanks,
Ming Lei

* Re: [PATCH V5 9/9] SCSI: don't hold device refcount in IO path
  2019-04-13  6:56     ` Ming Lei
  2019-04-13  9:23       ` Ming Lei
@ 2019-04-16  2:12       ` Martin K. Petersen
  1 sibling, 0 replies; 26+ messages in thread
From: Martin K. Petersen @ 2019-04-16  2:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Martin K. Petersen, Jens Axboe, linux-block, Dongli Zhang,
	James Smart, Bart Van Assche, linux-scsi, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang


Ming,

> This patch doesn't depend on patches 1-8, so please take it via the
> SCSI tree if you are fine with that.

Applied, thanks!

-- 
Martin K. Petersen	Oracle Linux Engineering

end of thread, other threads:[~2019-04-16  2:12 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-12  3:30 [PATCH V5 0/9] blk-mq: fix races related with freeing queue Ming Lei
2019-04-12  3:30 ` [PATCH V5 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
2019-04-12  8:20   ` Johannes Thumshirn
2019-04-12 10:55   ` Hannes Reinecke
2019-04-12  3:30 ` [PATCH V5 2/9] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
2019-04-12  8:23   ` Johannes Thumshirn
2019-04-12  3:30 ` [PATCH V5 3/9] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
2019-04-12 11:03   ` Hannes Reinecke
2019-04-13  7:18     ` Ming Lei
2019-04-12  3:30 ` [PATCH V5 4/9] blk-mq: move all hctx alloction & initialization into __blk_mq_alloc_and_init_hctx Ming Lei
2019-04-12 11:04   ` Hannes Reinecke
2019-04-12  3:30 ` [PATCH V5 5/9] blk-mq: split blk_mq_alloc_and_init_hctx into two parts Ming Lei
2019-04-12 11:04   ` Hannes Reinecke
2019-04-12  3:30 ` [PATCH V5 6/9] blk-mq: always free hctx after request queue is freed Ming Lei
2019-04-12 11:06   ` Hannes Reinecke
2019-04-13  7:27     ` Ming Lei
2019-04-12  3:30 ` [PATCH V5 7/9] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
2019-04-12 11:08   ` Hannes Reinecke
2019-04-12  3:30 ` [PATCH V5 8/9] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
2019-04-12 11:09   ` Hannes Reinecke
2019-04-12  3:30 ` [PATCH V5 9/9] SCSI: don't hold device refcount in IO path Ming Lei
2019-04-12 11:09   ` Hannes Reinecke
2019-04-13  0:04   ` Martin K. Petersen
2019-04-13  6:56     ` Ming Lei
2019-04-13  9:23       ` Ming Lei
2019-04-16  2:12       ` Martin K. Petersen
