* [PATCH V8 0/7] blk-mq: fix races related with freeing queue
@ 2019-04-28  8:14 Ming Lei
  2019-04-28  8:14 ` [PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Ming Lei @ 2019-04-28  8:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Hi,

Since 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), running the queue
isn't allowed during queue cleanup even though the queue's refcount is
held.

This change has caused lots of kernel oopses in the run-queue path, and
it turns out they aren't easy to fix one by one.

So move freeing of hw queue resources into the hctx's release handler,
which fixes the above issue. Meanwhile, this approach is safe because
freeing hw queue resources doesn't require tags.

V3 covers more races.

V8:
	- merge the 4th and 5th of V7 into one patch, as suggested by Christoph
	- drop the 9th patch

V7:
	- add reviewed-by and tested-by tag
	- rename "dead_hctx" as "unused_hctx"
	- check if there are live hctx in queue's release handler
	- only patch 6 is modified

V6:
	- remove previous SCSI patch which will be routed via SCSI tree
	- add reviewed-by tag
	- fix one related NVMe scan vs reset race

V5:
	- refactor blk_mq_alloc_and_init_hctx()
	- fix race related updating nr_hw_queues by always freeing hctx
	  after request queue is released

V4:
	- add patch for fixing potential use-after-free in blk_mq_update_nr_hw_queues
	- fix comment in the last patch

V3:
	- cancel q->requeue_work in queue's release handler
	- cancel hctx->run_work in hctx's release handler
	- add patch 1 for fixing race in plug code path
	- the last patch is added to avoid grabbing SCSI's refcount
	  in the IO path

V2:
	- move freeing of hw queue resources into hctx's release handler


Ming Lei (7):
  blk-mq: grab .q_usage_counter when queuing request from plug code path
  blk-mq: move cancel of requeue_work into blk_mq_release
  blk-mq: free hw queue's resource in hctx's release handler
  blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  blk-mq: always free hctx after request queue is freed
  blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  block: don't drain in-progress dispatch in blk_cleanup_queue()

 block/blk-core.c       |  23 +-----
 block/blk-mq-sched.c   |   9 +++
 block/blk-mq-sysfs.c   |   8 +++
 block/blk-mq.c         | 188 +++++++++++++++++++++++++++++--------------------
 block/blk-mq.h         |   2 +-
 include/linux/blk-mq.h |   2 +
 include/linux/blkdev.h |   7 ++
 7 files changed, 139 insertions(+), 100 deletions(-)

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
-- 
2.9.5


* [PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-28  8:14 [PATCH V8 0/7] blk-mq: fix races related with freeing queue Ming Lei
@ 2019-04-28  8:14 ` Ming Lei
  2019-04-28 12:10   ` Christoph Hellwig
  2019-04-29 18:09   ` Bart Van Assche
  2019-04-28  8:14 ` [PATCH V8 2/7] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 16+ messages in thread
From: Ming Lei @ 2019-04-28  8:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Just like aio/io_uring, we need to grab two refcounts for queuing one
request: one for submission, and another for completion.

If the request isn't queued from the plug code path, the refcount grabbed
in generic_make_request() serves the submission side. In theory, this
refcount should have been released after the submission (async run queue)
is done. blk_freeze_queue() works together with blk_sync_queue() to avoid
races between queue cleanup and IO submission: async run queue activities
are canceled because hctx->run_work is scheduled with the refcount held,
so it is fine not to hold the refcount when running the run-queue work
function to dispatch IO.

However, if a request is staged in the plug list and finally queued from
the plug code path, the refcount on the submission side is actually
missing. Then we may start to run the queue after the queue is removed,
because the queue's kobject refcount isn't guaranteed to be grabbed in
the flush-plug-list context, and a kernel oops is triggered. See the
following race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
                blk_mq_run_hw_queue

Because of a concurrent run queue, all requests inserted above may be
completed before the above blk_mq_run_hw_queue() is called. Then the
queue can be freed during that blk_mq_run_hw_queue().
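
To make the window concrete, the interleaving looks roughly like below
(a simplified sketch; the CPU split is only for illustration):

	CPU0: flush plug list                 CPU1: completion / cleanup
	blk_mq_flush_plug_list()
	  blk_mq_sched_insert_requests()
	    insert requests to sw or
	    scheduler queue
	                                      all inserted requests complete
	                                      blk_cleanup_queue() finishes and
	                                      frees hw queue resources
	    blk_mq_run_hw_queue()  <-- touches the freed hctx, oops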

Fix the issue by grabbing .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is safe
because the queue is certainly alive before inserting the request.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index aa6bc5c02643..dfe83e7935d6 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -414,6 +414,13 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
 {
 	struct elevator_queue *e;
 
+	/*
+	 * blk_mq_sched_insert_requests() is called from flush plug
+	 * context only, and hold one usage counter to prevent queue
+	 * from being released.
+	 */
+	percpu_ref_get(&hctx->queue->q_usage_counter);
+
 	e = hctx->queue->elevator;
 	if (e && e->type->ops.insert_requests)
 		e->type->ops.insert_requests(hctx, list, false);
@@ -432,6 +439,8 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
 	}
 
 	blk_mq_run_hw_queue(hctx, run_queue_async);
+
+	percpu_ref_put(&hctx->queue->q_usage_counter);
 }
 
 static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
-- 
2.9.5


* [PATCH V8 2/7] blk-mq: move cancel of requeue_work into blk_mq_release
  2019-04-28  8:14 [PATCH V8 0/7] blk-mq: fix races related with freeing queue Ming Lei
  2019-04-28  8:14 ` [PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
@ 2019-04-28  8:14 ` Ming Lei
  2019-04-28  8:14 ` [PATCH V8 3/7] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-04-28  8:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

While the queue's kobject refcount is held, it is safe for the driver
to schedule a requeue. However, blk_mq_kick_requeue_list() may be
called after blk_sync_queue() is done because of concurrent requeue
activity, so the requeue work may not be completed when the queue is
freed, and a kernel oops is triggered.

So move the cancel of requeue_work into blk_mq_release() to avoid the
race between requeue and freeing the queue.
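
Roughly, the race looks like below (a simplified sketch, not the exact
call chain):

	blk_cleanup_queue()                   concurrent requeue activity
	  blk_sync_queue()
	    cancel_delayed_work_sync(&q->requeue_work)
	                                      blk_mq_kick_requeue_list()
	                                        schedules q->requeue_work again
	  queue is freed
	                                      requeue_work runs -> use-after-free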

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 1 -
 block/blk-mq.c   | 2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a55389ba8779..93dc588fabe2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -237,7 +237,6 @@ void blk_sync_queue(struct request_queue *q)
 		struct blk_mq_hw_ctx *hctx;
 		int i;
 
-		cancel_delayed_work_sync(&q->requeue_work);
 		queue_for_each_hw_ctx(q, hctx, i)
 			cancel_delayed_work_sync(&hctx->run_work);
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index fc60ed7e940e..89781309a108 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2634,6 +2634,8 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
+	cancel_delayed_work_sync(&q->requeue_work);
+
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
-- 
2.9.5


* [PATCH V8 3/7] blk-mq: free hw queue's resource in hctx's release handler
  2019-04-28  8:14 [PATCH V8 0/7] blk-mq: fix races related with freeing queue Ming Lei
  2019-04-28  8:14 ` [PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
  2019-04-28  8:14 ` [PATCH V8 2/7] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
@ 2019-04-28  8:14 ` Ming Lei
  2019-04-28  8:14 ` [PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts Ming Lei
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-04-28  8:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, stable

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.

However, that commit introduces another issue. Before 45a9c9d909b2, we
were allowed to run the queue during queue cleanup if the queue's kobj
refcount was held. After that commit, the queue can't be run during
queue cleanup, otherwise an oops can be triggered easily because some
fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have devised ways of addressing this kind of issue before, such as:

	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But these still can't cover all cases; recently James reported another
such issue:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue is quite hard to address with the previous approaches, given
scsi_run_queue() may run requeues for other LUNs.

Fix the above issue by freeing the hctx's resources in its release
handler. This is safe because tags aren't needed for freeing these hctx
resources.

This approach follows the typical design pattern wrt. a kobject's release handler.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c     | 2 +-
 block/blk-mq-sysfs.c | 6 ++++++
 block/blk-mq.c       | 8 ++------
 block/blk-mq.h       | 2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 93dc588fabe2..2dd94b3e9ece 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	blk_exit_queue(q);
 
 	if (queue_is_mq(q))
-		blk_mq_free_queue(q);
+		blk_mq_exit_queue(q);
 
 	percpu_ref_exit(&q->q_usage_counter);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3f9c3f4ac44c..4040e62c3737 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -10,6 +10,7 @@
 #include <linux/smp.h>
 
 #include <linux/blk-mq.h>
+#include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
@@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 {
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
+
+	if (hctx->flags & BLK_MQ_F_BLOCKING)
+		cleanup_srcu_struct(hctx->srcu);
+	blk_free_flush_queue(hctx->fq);
+	sbitmap_free(&hctx->ctx_map);
 	free_cpumask_var(hctx->cpumask);
 	kfree(hctx->ctxs);
 	kfree(hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 89781309a108..d98cb9614dfa 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2267,12 +2267,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		cleanup_srcu_struct(hctx->srcu);
-
 	blk_mq_remove_cpuhp(hctx);
-	blk_free_flush_queue(hctx->fq);
-	sbitmap_free(&hctx->ctx_map);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2907,7 +2902,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 }
 EXPORT_SYMBOL(blk_mq_init_allocated_queue);
 
-void blk_mq_free_queue(struct request_queue *q)
+/* tags can _not_ be used after returning from blk_mq_exit_queue */
+void blk_mq_exit_queue(struct request_queue *q)
 {
 	struct blk_mq_tag_set	*set = q->tag_set;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 423ea88ab6fb..633a5a77ee8b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,7 +37,7 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-void blk_mq_free_queue(struct request_queue *q);
+void blk_mq_exit_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
-- 
2.9.5


* [PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  2019-04-28  8:14 [PATCH V8 0/7] blk-mq: fix races related with freeing queue Ming Lei
                   ` (2 preceding siblings ...)
  2019-04-28  8:14 ` [PATCH V8 3/7] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
@ 2019-04-28  8:14 ` Ming Lei
  2019-04-28 12:12   ` Christoph Hellwig
  2019-04-29  6:05   ` Hannes Reinecke
  2019-04-28  8:14 ` [PATCH V8 5/7] blk-mq: always free hctx after request queue is freed Ming Lei
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 16+ messages in thread
From: Ming Lei @ 2019-04-28  8:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Split blk_mq_alloc_and_init_hctx() into two parts: blk_mq_alloc_hctx()
for allocating all hctx resources, and blk_mq_init_hctx() for
initializing the hctx, which serves as the counterpart of
blk_mq_exit_hctx().

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 138 ++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 77 insertions(+), 61 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d98cb9614dfa..44ecca6b0cac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2284,15 +2284,70 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	}
 }
 
+static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
+{
+	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
+
+	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
+			   __alignof__(struct blk_mq_hw_ctx)) !=
+		     sizeof(struct blk_mq_hw_ctx));
+
+	if (tag_set->flags & BLK_MQ_F_BLOCKING)
+		hw_ctx_size += sizeof(struct srcu_struct);
+
+	return hw_ctx_size;
+}
+
 static int blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
 {
-	int node;
+	hctx->queue_num = hctx_idx;
 
-	node = hctx->numa_node;
+	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
+
+	hctx->tags = set->tags[hctx_idx];
+
+	if (set->ops->init_hctx &&
+	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
+		goto unregister_cpu_notifier;
+
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
+				hctx->numa_node))
+		goto exit_hctx;
+	return 0;
+
+ exit_hctx:
+	if (set->ops->exit_hctx)
+		set->ops->exit_hctx(hctx, hctx_idx);
+ unregister_cpu_notifier:
+	blk_mq_remove_cpuhp(hctx);
+	return -1;
+}
+
+static struct blk_mq_hw_ctx *
+blk_mq_alloc_hctx(struct request_queue *q,
+		struct blk_mq_tag_set *set,
+		unsigned hctx_idx, int node)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
+			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+			node);
+	if (!hctx)
+		goto fail_alloc_hctx;
+
+	if (!zalloc_cpumask_var_node(&hctx->cpumask,
+				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+				node))
+		goto free_hctx;
+
+	atomic_set(&hctx->nr_active, 0);
+	hctx->numa_node = node;
 	if (node == NUMA_NO_NODE)
-		node = hctx->numa_node = set->numa_node;
+		hctx->numa_node = set->numa_node;
+	node = hctx->numa_node;
 
 	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
 	spin_lock_init(&hctx->lock);
@@ -2300,10 +2355,6 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
-	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
-
-	hctx->tags = set->tags[hctx_idx];
-
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
@@ -2311,47 +2362,38 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	hctx->ctxs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
 			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node);
 	if (!hctx->ctxs)
-		goto unregister_cpu_notifier;
+		goto free_cpumask;
 
 	if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8),
 				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node))
 		goto free_ctxs;
-
 	hctx->nr_ctx = 0;
 
 	spin_lock_init(&hctx->dispatch_wait_lock);
 	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
 	INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
 
-	if (set->ops->init_hctx &&
-	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
-		goto free_bitmap;
-
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size,
 			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
 	if (!hctx->fq)
-		goto exit_hctx;
-
-	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
-		goto free_fq;
+		goto free_bitmap;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		init_srcu_struct(hctx->srcu);
+	blk_mq_hctx_kobj_init(hctx);
 
-	return 0;
+	return hctx;
 
- free_fq:
-	blk_free_flush_queue(hctx->fq);
- exit_hctx:
-	if (set->ops->exit_hctx)
-		set->ops->exit_hctx(hctx, hctx_idx);
  free_bitmap:
 	sbitmap_free(&hctx->ctx_map);
  free_ctxs:
 	kfree(hctx->ctxs);
- unregister_cpu_notifier:
-	blk_mq_remove_cpuhp(hctx);
-	return -1;
+ free_cpumask:
+	free_cpumask_var(hctx->cpumask);
+ free_hctx:
+	kfree(hctx);
+ fail_alloc_hctx:
+	return NULL;
 }
 
 static void blk_mq_init_cpu_queues(struct request_queue *q,
@@ -2697,51 +2739,25 @@ struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
 }
 EXPORT_SYMBOL(blk_mq_init_sq_queue);
 
-static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
-{
-	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
-
-	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
-			   __alignof__(struct blk_mq_hw_ctx)) !=
-		     sizeof(struct blk_mq_hw_ctx));
-
-	if (tag_set->flags & BLK_MQ_F_BLOCKING)
-		hw_ctx_size += sizeof(struct srcu_struct);
-
-	return hw_ctx_size;
-}
-
 static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
 	struct blk_mq_hw_ctx *hctx;
 
-	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
-			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
-			node);
+	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
 	if (!hctx)
-		return NULL;
-
-	if (!zalloc_cpumask_var_node(&hctx->cpumask,
-				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
-				node)) {
-		kfree(hctx);
-		return NULL;
-	}
+		goto fail;
 
-	atomic_set(&hctx->nr_active, 0);
-	hctx->numa_node = node;
-	hctx->queue_num = hctx_idx;
-
-	if (blk_mq_init_hctx(q, set, hctx, hctx_idx)) {
-		free_cpumask_var(hctx->cpumask);
-		kfree(hctx);
-		return NULL;
-	}
-	blk_mq_hctx_kobj_init(hctx);
+	if (blk_mq_init_hctx(q, set, hctx, hctx_idx))
+		goto free_hctx;
 
 	return hctx;
+
+ free_hctx:
+	kobject_put(&hctx->kobj);
+ fail:
+	return NULL;
 }
 
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
-- 
2.9.5


* [PATCH V8 5/7] blk-mq: always free hctx after request queue is freed
  2019-04-28  8:14 [PATCH V8 0/7] blk-mq: fix races related with freeing queue Ming Lei
                   ` (3 preceding siblings ...)
  2019-04-28  8:14 ` [PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts Ming Lei
@ 2019-04-28  8:14 ` Ming Lei
  2019-04-28 12:14   ` Christoph Hellwig
  2019-04-28  8:14 ` [PATCH V8 6/7] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
  2019-04-28  8:14 ` [PATCH V8 7/7] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
  6 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2019-04-28  8:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

In the normal queue cleanup path, the hctx is released after the request
queue is freed, see blk_mq_release().

However, in __blk_mq_update_nr_hw_queues(), the hctx may be freed because
of hw queue shrinking. This can easily cause a use-after-free: one
implicit rule is that it is safe to call almost all block layer APIs
while the request queue is alive, so one hctx may be retrieved by one
API and then freed by blk_mq_update_nr_hw_queues(), and finally a
use-after-free is triggered.

Fix this issue by always freeing the hctx after the request queue is
released. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
hold them on a newly introduced per-queue list, then try to reuse these
hctxs if the NUMA node matches.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 46 +++++++++++++++++++++++++++++++++-------------
 include/linux/blk-mq.h |  2 ++
 include/linux/blkdev.h |  7 +++++++
 3 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 44ecca6b0cac..a37090038c96 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2268,6 +2268,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 		set->ops->exit_hctx(hctx, hctx_idx);
 
 	blk_mq_remove_cpuhp(hctx);
+
+	spin_lock(&q->unused_hctx_lock);
+	list_add(&hctx->hctx_list, &q->unused_hctx_list);
+	spin_unlock(&q->unused_hctx_lock);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2355,6 +2359,8 @@ blk_mq_alloc_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
+	INIT_LIST_HEAD(&hctx->hctx_list);
+
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
@@ -2668,15 +2674,17 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
  */
 void blk_mq_release(struct request_queue *q)
 {
-	struct blk_mq_hw_ctx *hctx;
-	unsigned int i;
+	struct blk_mq_hw_ctx *hctx, *next;
+	int i;
 
 	cancel_delayed_work_sync(&q->requeue_work);
 
-	/* hctx kobj stays in hctx */
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (!hctx)
-			continue;
+	queue_for_each_hw_ctx(q, hctx, i)
+		WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
+
+	/* all hctx are in .unused_hctx_list now */
+	list_for_each_entry_safe(hctx, next, &q->unused_hctx_list, hctx_list) {
+		list_del_init(&hctx->hctx_list);
 		kobject_put(&hctx->kobj);
 	}
 
@@ -2743,9 +2751,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
 
-	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
+	/* reuse dead hctx first */
+	spin_lock(&q->unused_hctx_lock);
+	list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
+		if (tmp->numa_node == node) {
+			hctx = tmp;
+			break;
+		}
+	}
+	if (hctx)
+		list_del_init(&hctx->hctx_list);
+	spin_unlock(&q->unused_hctx_lock);
+
+	if (!hctx)
+		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
 	if (!hctx)
 		goto fail;
 
@@ -2783,10 +2804,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 
 		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
 		if (hctx) {
-			if (hctxs[i]) {
+			if (hctxs[i])
 				blk_mq_exit_hctx(q, set, hctxs[i], i);
-				kobject_put(&hctxs[i]->kobj);
-			}
 			hctxs[i] = hctx;
 		} else {
 			if (hctxs[i])
@@ -2817,9 +2836,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 			if (hctx->tags)
 				blk_mq_free_map_and_requests(set, j);
 			blk_mq_exit_hctx(q, set, hctx, j);
-			kobject_put(&hctx->kobj);
 			hctxs[j] = NULL;
-
 		}
 	}
 	mutex_unlock(&q->sysfs_lock);
@@ -2862,6 +2879,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_sys_init;
 
+	INIT_LIST_HEAD(&q->unused_hctx_list);
+	spin_lock_init(&q->unused_hctx_lock);
+
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index db29928de467..15d1aa53d96c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -70,6 +70,8 @@ struct blk_mq_hw_ctx {
 	struct dentry		*sched_debugfs_dir;
 #endif
 
+	struct list_head	hctx_list;
+
 	/* Must be the last member - see also blk_mq_hw_ctx_size(). */
 	struct srcu_struct	srcu[0];
 };
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 99aa98f60b9e..d7bad4ae8bc8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -535,6 +535,13 @@ struct request_queue {
 
 	struct mutex		sysfs_lock;
 
+	/*
+	 * for reusing dead hctx instance in case of updating
+	 * nr_hw_queues
+	 */
+	struct list_head	unused_hctx_list;
+	spinlock_t		unused_hctx_lock;
+
 	atomic_t		mq_freeze_depth;
 
 #if defined(CONFIG_BLK_DEV_BSG)
-- 
2.9.5


* [PATCH V8 6/7] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  2019-04-28  8:14 [PATCH V8 0/7] blk-mq: fix races related with freeing queue Ming Lei
                   ` (4 preceding siblings ...)
  2019-04-28  8:14 ` [PATCH V8 5/7] blk-mq: always free hctx after request queue is freed Ming Lei
@ 2019-04-28  8:14 ` Ming Lei
  2019-04-28  8:14 ` [PATCH V8 7/7] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
  6 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-04-28  8:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

hctx is always released after the request queue is freed.

While the queue's kobject refcount is held, it is safe for the driver to
run the queue, so a run-queue activity might be scheduled after
blk_sync_queue() is done.

So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release() to
avoid running a released queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c     | 8 --------
 block/blk-mq-sysfs.c | 2 ++
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2dd94b3e9ece..f5b5f21ae4fd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -232,14 +232,6 @@ void blk_sync_queue(struct request_queue *q)
 {
 	del_timer_sync(&q->timeout);
 	cancel_work_sync(&q->timeout_work);
-
-	if (queue_is_mq(q)) {
-		struct blk_mq_hw_ctx *hctx;
-		int i;
-
-		queue_for_each_hw_ctx(q, hctx, i)
-			cancel_delayed_work_sync(&hctx->run_work);
-	}
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 4040e62c3737..25c0d0a6a556 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -35,6 +35,8 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
 
+	cancel_delayed_work_sync(&hctx->run_work);
+
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		cleanup_srcu_struct(hctx->srcu);
 	blk_free_flush_queue(hctx->fq);
-- 
2.9.5


* [PATCH V8 7/7] block: don't drain in-progress dispatch in blk_cleanup_queue()
  2019-04-28  8:14 [PATCH V8 0/7] blk-mq: fix races related with freeing queue Ming Lei
                   ` (5 preceding siblings ...)
  2019-04-28  8:14 ` [PATCH V8 6/7] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
@ 2019-04-28  8:14 ` Ming Lei
  6 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-04-28  8:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Now that freeing hw queue resources has been moved to the hctx's release
handler, we no longer need to worry about the race between
blk_cleanup_queue() and running the queue.

So don't drain in-progress dispatch in blk_cleanup_queue().

This is basically a revert of c2856ae2f315 ("blk-mq: quiesce queue
before freeing queue").

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f5b5f21ae4fd..e24cfcefdc19 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -338,18 +338,6 @@ void blk_cleanup_queue(struct request_queue *q)
 
 	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
 
-	/*
-	 * make sure all in-progress dispatch are completed because
-	 * blk_freeze_queue() can only complete all requests, and
-	 * dispatch may still be in-progress since we dispatch requests
-	 * from more than one contexts.
-	 *
-	 * We rely on driver to deal with the race in case that queue
-	 * initialization isn't done.
-	 */
-	if (queue_is_mq(q) && blk_queue_init_done(q))
-		blk_mq_quiesce_queue(q);
-
 	/* for synchronous bio-based driver finish in-flight integrity i/o */
 	blk_flush_integrity();
 
-- 
2.9.5


* Re: [PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-28  8:14 ` [PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
@ 2019-04-28 12:10   ` Christoph Hellwig
  2019-04-29 18:09   ` Bart Van Assche
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2019-04-28 12:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  2019-04-28  8:14 ` [PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts Ming Lei
@ 2019-04-28 12:12   ` Christoph Hellwig
  2019-04-29  6:05   ` Hannes Reinecke
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2019-04-28 12:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

> +static struct blk_mq_hw_ctx *
> +blk_mq_alloc_hctx(struct request_queue *q,
> +		struct blk_mq_tag_set *set,

Nit: The second parameter would easily fit on the first line.

> +		unsigned hctx_idx, int node)
> +{
> +	struct blk_mq_hw_ctx *hctx;
> +
> +	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
> +			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
> +			node);
> +	if (!hctx)
> +		goto fail_alloc_hctx;
> +
> +	if (!zalloc_cpumask_var_node(&hctx->cpumask,
> +				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
> +				node))

Nit: I still think a local variable for the gfp_t would be very useful
here.
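
Something like the below, for example (just a sketch of the idea):

	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;

	hctx = kzalloc_node(blk_mq_hw_ctx_size(set), gfp, node);
	if (!hctx)
		goto fail_alloc_hctx;

	if (!zalloc_cpumask_var_node(&hctx->cpumask, gfp, node))
		goto free_hctx;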

> +	atomic_set(&hctx->nr_active, 0);
> +	hctx->numa_node = node;
>  	if (node == NUMA_NO_NODE)
> -		node = hctx->numa_node = set->numa_node;
> +		hctx->numa_node = set->numa_node;
> +	node = hctx->numa_node;

Why not:

	if (node == NUMA_NO_NODE)
		node = set->numa_node;
	hctx->numa_node = node;

?

Otherwise looks fine:


Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH V8 5/7] blk-mq: always free hctx after request queue is freed
  2019-04-28  8:14 ` [PATCH V8 5/7] blk-mq: always free hctx after request queue is freed Ming Lei
@ 2019-04-28 12:14   ` Christoph Hellwig
  2019-04-28 13:15     ` Ming Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2019-04-28 12:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Sun, Apr 28, 2019 at 04:14:06PM +0800, Ming Lei wrote:
> In normal queue cleanup path, hctx is released after request queue
> is freed, see blk_mq_release().
> 
> However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
> of hw queues shrinking. This way is easy to cause use-after-free,
> because: one implicit rule is that it is safe to call almost all block
> layer APIs if the request queue is alive; and one hctx may be retrieved
> by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
> finally use-after-free is triggered.
> 
> Fixes this issue by always freeing hctx after releasing request queue.
> If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
> a per-queue list to hold them, then try to resuse these hctxs if numa
> node is matched.

This seems a little odd.  Wouldn't it be much simpler to just keep
the hctx where it is, that is leave the queue_hw_ctx[] pointer intact,
but have a flag marking it dead?

* Re: [PATCH V8 5/7] blk-mq: always free hctx after request queue is freed
  2019-04-28 12:14   ` Christoph Hellwig
@ 2019-04-28 13:15     ` Ming Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-04-28 13:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	James E . J . Bottomley

On Sun, Apr 28, 2019 at 02:14:26PM +0200, Christoph Hellwig wrote:
> On Sun, Apr 28, 2019 at 04:14:06PM +0800, Ming Lei wrote:
> > In normal queue cleanup path, hctx is released after request queue
> > is freed, see blk_mq_release().
> > 
> > However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
> > of hw queues shrinking. This way is easy to cause use-after-free,
> > because: one implicit rule is that it is safe to call almost all block
> > layer APIs if the request queue is alive; and one hctx may be retrieved
> > by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
> > finally use-after-free is triggered.
> > 
> > Fixes this issue by always freeing hctx after releasing request queue.
> > If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
> > a per-queue list to hold them, then try to resuse these hctxs if numa
> > node is matched.
> 
> This seems a little odd.  Wouldn't it be much simpler to just keep
> the hctx where it is, that is leave the queue_hw_ctx[] pointer in tact,
> but have a flag marking it dead?

There are several issues with that solution:

1) q->nr_hw_queues no longer matches the number of really active hw queues

2) if a hctx is only marked as dead and never freed, then after updating
nr_hw_queues several times, more and more hctx instances are wasted.

3) q->queue_hw_ctx[] has to be re-allocated when nr_hw_queues is increased.

So I think this patch is the simpler approach, in both concept and implementation.


Thanks,
Ming

* Re: [PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  2019-04-28  8:14 ` [PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts Ming Lei
  2019-04-28 12:12   ` Christoph Hellwig
@ 2019-04-29  6:05   ` Hannes Reinecke
  2019-04-30  0:50     ` Ming Lei
  1 sibling, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2019-04-29  6:05 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley

On 4/28/19 10:14 AM, Ming Lei wrote:
> Split blk_mq_alloc_and_init_hctx into two parts, and one is
> blk_mq_alloc_hctx() for allocating all hctx resources, another
> is blk_mq_init_hctx() for initializing hctx, which serves as
> counter-part of blk_mq_exit_hctx().
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Tested-by: James Smart <james.smart@broadcom.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 138 ++++++++++++++++++++++++++++++++-------------------------
>   1 file changed, 77 insertions(+), 61 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d98cb9614dfa..44ecca6b0cac 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2284,15 +2284,70 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
>   	}
>   }
>   
> +static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
> +{
> +	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
> +
> +	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
> +			   __alignof__(struct blk_mq_hw_ctx)) !=
> +		     sizeof(struct blk_mq_hw_ctx));
> +
> +	if (tag_set->flags & BLK_MQ_F_BLOCKING)
> +		hw_ctx_size += sizeof(struct srcu_struct);
> +
> +	return hw_ctx_size;
> +}
> +
>   static int blk_mq_init_hctx(struct request_queue *q,
>   		struct blk_mq_tag_set *set,
>   		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
>   {
> -	int node;
> +	hctx->queue_num = hctx_idx;
>   
> -	node = hctx->numa_node;
> +	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
> +
> +	hctx->tags = set->tags[hctx_idx];
> +
> +	if (set->ops->init_hctx &&
> +	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
> +		goto unregister_cpu_notifier;
> +
> +	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
> +				hctx->numa_node))
> +		goto exit_hctx;
> +	return 0;
> +
> + exit_hctx:
> +	if (set->ops->exit_hctx)
> +		set->ops->exit_hctx(hctx, hctx_idx);
> + unregister_cpu_notifier:
> +	blk_mq_remove_cpuhp(hctx);
> +	return -1;
> +}
> +
> +static struct blk_mq_hw_ctx *
> +blk_mq_alloc_hctx(struct request_queue *q,
> +		struct blk_mq_tag_set *set,
> +		unsigned hctx_idx, int node)
> +{
> +	struct blk_mq_hw_ctx *hctx;
> +
> +	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
> +			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
> +			node);
> +	if (!hctx)
> +		goto fail_alloc_hctx;
> +
> +	if (!zalloc_cpumask_var_node(&hctx->cpumask,
> +				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
> +				node))
> +		goto free_hctx;
> +
> +	atomic_set(&hctx->nr_active, 0);
> +	hctx->numa_node = node;
>   	if (node == NUMA_NO_NODE)
> -		node = hctx->numa_node = set->numa_node;
> +		hctx->numa_node = set->numa_node;
> +	node = hctx->numa_node;
>   
>   	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
>   	spin_lock_init(&hctx->lock);
The 'hctx_idx' argument is now unused, and should be removed from the 
function definition.
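
i.e. the prototype would then become something like (sketch only):

	static struct blk_mq_hw_ctx *
	blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
			  int node)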

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

* Re: [PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-28  8:14 ` [PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
  2019-04-28 12:10   ` Christoph Hellwig
@ 2019-04-29 18:09   ` Bart Van Assche
  2019-04-30  0:48     ` Ming Lei
  1 sibling, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2019-04-29 18:09 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley

On Sun, 2019-04-28 at 16:14 +0800, Ming Lei wrote:
> Just like aio/io_uring, we need to grab 2 refcount for queuing one
> request, one is for submission, another is for completion.
> 
> If the request isn't queued from plug code path, the refcount grabbed
> in generic_make_request() serves for submission. In theroy, this
> refcount should have been released after the sumission(async run queue)
> is done. blk_freeze_queue() works with blk_sync_queue() together
> for avoiding race between cleanup queue and IO submission, given async
> run queue activities are canceled because hctx->run_work is scheduled with
> the refcount held, so it is fine to not hold the refcount when
> running the run queue work function for dispatch IO.
> 
> However, if request is staggered into plug list, and finally queued
> from plug code path, the refcount in submission side is actually missed.
> And we may start to run queue after queue is removed because the queue's
> kobject refcount isn't guaranteed to be grabbed in flushing plug list
> context, then kernel oops is triggered, see the following race:
> 
> blk_mq_flush_plug_list():
>         blk_mq_sched_insert_requests()
>                 insert requests to sw queue or scheduler queue
>                 blk_mq_run_hw_queue
> 
> Because of concurrent run queue, all requests inserted above may be
> completed before calling the above blk_mq_run_hw_queue. Then queue can
> be freed during the above blk_mq_run_hw_queue().
> 
> Fixes the issue by grab .q_usage_counter before calling
> blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is
> safe because the queue is absolutely alive before inserting request.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Tested-by: James Smart <james.smart@broadcom.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

I added my "Reviewed-by" to a previous version of this patch but not
to this version. Several "Reviewed-by" tags probably should be
removed.

> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index aa6bc5c02643..dfe83e7935d6 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -414,6 +414,13 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
>  {
>         struct elevator_queue *e;
>  
> +       /*
> +        * blk_mq_sched_insert_requests() is called from flush plug
> +        * context only, and hold one usage counter to prevent queue
> +        * from being released.
> +        */
> +       percpu_ref_get(&hctx->queue->q_usage_counter);
> +
>         e = hctx->queue->elevator;
>         if (e && e->type->ops.insert_requests)
>                 e->type->ops.insert_requests(hctx, list, false);
> @@ -432,6 +439,8 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
>         }
>  
>         blk_mq_run_hw_queue(hctx, run_queue_async);
> +
> +       percpu_ref_put(&hctx->queue->q_usage_counter);
>  }

I think that 'hctx' can disappear if all requests queued by this function
finish just before blk_mq_run_hw_queue() returns and if the number of hardware
queues is changed from another thread. Shouldn't the request queue pointer be
stored in a local variable instead of reading hctx->queue twice?

Bart.

* Re: [PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-29 18:09   ` Bart Van Assche
@ 2019-04-30  0:48     ` Ming Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-04-30  0:48 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Mon, Apr 29, 2019 at 11:09:39AM -0700, Bart Van Assche wrote:
> On Sun, 2019-04-28 at 16:14 +0800, Ming Lei wrote:
> > Just like aio/io_uring, we need to grab 2 refcount for queuing one
> > request, one is for submission, another is for completion.
> > 
> > If the request isn't queued from plug code path, the refcount grabbed
> > in generic_make_request() serves for submission. In theroy, this
> > refcount should have been released after the sumission(async run queue)
> > is done. blk_freeze_queue() works with blk_sync_queue() together
> > for avoiding race between cleanup queue and IO submission, given async
> > run queue activities are canceled because hctx->run_work is scheduled with
> > the refcount held, so it is fine to not hold the refcount when
> > running the run queue work function for dispatch IO.
> > 
> > However, if request is staggered into plug list, and finally queued
> > from plug code path, the refcount in submission side is actually missed.
> > And we may start to run queue after queue is removed because the queue's
> > kobject refcount isn't guaranteed to be grabbed in flushing plug list
> > context, then kernel oops is triggered, see the following race:
> > 
> > blk_mq_flush_plug_list():
> >         blk_mq_sched_insert_requests()
> >                 insert requests to sw queue or scheduler queue
> >                 blk_mq_run_hw_queue
> > 
> > Because of concurrent run queue, all requests inserted above may be
> > completed before calling the above blk_mq_run_hw_queue. Then queue can
> > be freed during the above blk_mq_run_hw_queue().
> > 
> > Fixes the issue by grab .q_usage_counter before calling
> > blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is
> > safe because the queue is absolutely alive before inserting request.
> > 
> > Cc: Dongli Zhang <dongli.zhang@oracle.com>
> > Cc: James Smart <james.smart@broadcom.com>
> > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > Cc: linux-scsi@vger.kernel.org,
> > Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> > Cc: Christoph Hellwig <hch@lst.de>,
> > Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> > Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> > Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> > Reviewed-by: Hannes Reinecke <hare@suse.com>
> > Tested-by: James Smart <james.smart@broadcom.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> 
> I added my "Reviewed-by" to a previous version of this patch but not
> to this version of this patch. Several "Reviewed-by" tags probably
> should be removed.

Fine.

> 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index aa6bc5c02643..dfe83e7935d6 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -414,6 +414,13 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
> >  {
> >         struct elevator_queue *e;
> >  
> > +       /*
> > +        * blk_mq_sched_insert_requests() is called from flush plug
> > +        * context only, and hold one usage counter to prevent queue
> > +        * from being released.
> > +        */
> > +       percpu_ref_get(&hctx->queue->q_usage_counter);
> > +
> >         e = hctx->queue->elevator;
> >         if (e && e->type->ops.insert_requests)
> >                 e->type->ops.insert_requests(hctx, list, false);
> > @@ -432,6 +439,8 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
> >         }
> >  
> >         blk_mq_run_hw_queue(hctx, run_queue_async);
> > +
> > +       percpu_ref_put(&hctx->queue->q_usage_counter);
> >  }
> 
> I think that 'hctx' can disappear if all requests queued by this function
> finish just before blk_mq_run_hw_queue() returns and if the number of hardware
> queues is changed from another thread.

Updating nr_hw_queues needs to freeze all queues in the same tagset
first, and the added percpu_ref_get() prevents that freeze from
completing.
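
For reference, the freeze path is roughly the following (a simplified
sketch; see blk_mq_freeze_queue() for the real code):

	blk_mq_freeze_queue()
	  blk_freeze_queue_start()     -> kills q->q_usage_counter
	  blk_mq_freeze_queue_wait()   -> waits until the counter drops to zero

So as long as the counter is held here, __blk_mq_update_nr_hw_queues()
can't finish the freeze and won't free any hctx.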

> Shouldn't the request queue pointer be
> stored in a local variable instead of reading hctx->queue twice?

Yeah, it is much cleaner, will do it in V9.
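
Roughly like the below (a sketch of the planned V9 change, not the
final patch):

	struct request_queue *q = hctx->queue;

	/*
	 * blk_mq_sched_insert_requests() is called from flush plug
	 * context only, and holds one usage counter to prevent the
	 * queue from being released.
	 */
	percpu_ref_get(&q->q_usage_counter);

	/* insert requests and run the queue as before */

	percpu_ref_put(&q->q_usage_counter);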

BTW, one big problem in this patch (V8) is that the usage counter
isn't put in the early-return branch.

thanks,
Ming

* Re: [PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  2019-04-29  6:05   ` Hannes Reinecke
@ 2019-04-30  0:50     ` Ming Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-04-30  0:50 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Mon, Apr 29, 2019 at 08:05:30AM +0200, Hannes Reinecke wrote:
> On 4/28/19 10:14 AM, Ming Lei wrote:
> > Split blk_mq_alloc_and_init_hctx into two parts, and one is
> > blk_mq_alloc_hctx() for allocating all hctx resources, another
> > is blk_mq_init_hctx() for initializing hctx, which serves as
> > counter-part of blk_mq_exit_hctx().
> > 
> > Cc: Dongli Zhang <dongli.zhang@oracle.com>
> > Cc: James Smart <james.smart@broadcom.com>
> > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > Cc: linux-scsi@vger.kernel.org,
> > Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> > Cc: Christoph Hellwig <hch@lst.de>,
> > Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> > Reviewed-by: Hannes Reinecke <hare@suse.com>
> > Tested-by: James Smart <james.smart@broadcom.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq.c | 138 ++++++++++++++++++++++++++++++++-------------------------
> >   1 file changed, 77 insertions(+), 61 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index d98cb9614dfa..44ecca6b0cac 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2284,15 +2284,70 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
> >   	}
> >   }
> > +static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
> > +{
> > +	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
> > +
> > +	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
> > +			   __alignof__(struct blk_mq_hw_ctx)) !=
> > +		     sizeof(struct blk_mq_hw_ctx));
> > +
> > +	if (tag_set->flags & BLK_MQ_F_BLOCKING)
> > +		hw_ctx_size += sizeof(struct srcu_struct);
> > +
> > +	return hw_ctx_size;
> > +}
> > +
> >   static int blk_mq_init_hctx(struct request_queue *q,
> >   		struct blk_mq_tag_set *set,
> >   		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
> >   {
> > -	int node;
> > +	hctx->queue_num = hctx_idx;
> > -	node = hctx->numa_node;
> > +	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
> > +
> > +	hctx->tags = set->tags[hctx_idx];
> > +
> > +	if (set->ops->init_hctx &&
> > +	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
> > +		goto unregister_cpu_notifier;
> > +
> > +	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
> > +				hctx->numa_node))
> > +		goto exit_hctx;
> > +	return 0;
> > +
> > + exit_hctx:
> > +	if (set->ops->exit_hctx)
> > +		set->ops->exit_hctx(hctx, hctx_idx);
> > + unregister_cpu_notifier:
> > +	blk_mq_remove_cpuhp(hctx);
> > +	return -1;
> > +}
> > +
> > +static struct blk_mq_hw_ctx *
> > +blk_mq_alloc_hctx(struct request_queue *q,
> > +		struct blk_mq_tag_set *set,
> > +		unsigned hctx_idx, int node)
> > +{
> > +	struct blk_mq_hw_ctx *hctx;
> > +
> > +	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
> > +			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
> > +			node);
> > +	if (!hctx)
> > +		goto fail_alloc_hctx;
> > +
> > +	if (!zalloc_cpumask_var_node(&hctx->cpumask,
> > +				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
> > +				node))
> > +		goto free_hctx;
> > +
> > +	atomic_set(&hctx->nr_active, 0);
> > +	hctx->numa_node = node;
> >   	if (node == NUMA_NO_NODE)
> > -		node = hctx->numa_node = set->numa_node;
> > +		hctx->numa_node = set->numa_node;
> > +	node = hctx->numa_node;
> >   	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
> >   	spin_lock_init(&hctx->lock);
> The 'hctx_idx' argument is now unused, and should be removed from the
> function definition.

OK, will do it in V9.

Thanks,
Ming
