* [PATCH V6 0/9] blk-mq: fix races related to freeing queue
From: Ming Lei @ 2019-04-17  3:44 UTC
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Hi,

Since commit 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), running the
queue isn't allowed during queue cleanup, even though the queue's refcount
is held.

This change has caused lots of kernel oopses triggered in the run-queue
path, and it turns out they aren't easy to fix one by one.

So move the freeing of hw queue resources into the hctx's release handler,
which fixes the above issue. Meanwhile, this approach is safe because
freeing hw queue resources doesn't require tags.
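
In code terms, the end state of the series (patches 3 and 7 below) is that
the hw queue's resources are torn down from the hctx kobject's release
callback instead of from blk_cleanup_queue(). Condensed from those patches:

static void blk_mq_hw_sysfs_release(struct kobject *kobj)
{
	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
						  kobj);

	cancel_delayed_work_sync(&hctx->run_work);

	if (hctx->flags & BLK_MQ_F_BLOCKING)
		cleanup_srcu_struct(hctx->srcu);
	blk_free_flush_queue(hctx->fq);
	sbitmap_free(&hctx->ctx_map);
	free_cpumask_var(hctx->cpumask);
	kfree(hctx->ctxs);
	kfree(hctx);
}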

V3 covers more races.

V6:
	- remove previous SCSI patch which will be routed via SCSI tree
	- add reviewed-by tag
	- fix one related NVMe scan vs reset race

V5:
	- refactor blk_mq_alloc_and_init_hctx()
	- fix race related to updating nr_hw_queues by always freeing hctx
	  after the request queue is released

V4:
	- add patch for fixing potential use-after-free in blk_mq_update_nr_hw_queues
	- fix comment in the last patch

V3:
	- cancel q->requeue_work in queue's release handler
	- cancel hctx->run_work in hctx's release handler
	- add patch 1 for fixing race in plug code path
	- the last patch is added to avoid grabbing SCSI's refcount
	  in the IO path

V2:
	- move freeing of hw queue resources into hctx's release handler

Ming Lei (9):
  blk-mq: grab .q_usage_counter when queuing request from plug code path
  blk-mq: move cancel of requeue_work into blk_mq_release
  blk-mq: free hw queue's resource in hctx's release handler
  blk-mq: move all hctx allocation & initialization into
    __blk_mq_alloc_and_init_hctx
  blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  blk-mq: always free hctx after request queue is freed
  blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  block: don't drain in-progress dispatch in blk_cleanup_queue()
  nvme: hold request queue's refcount in ns's whole lifetime

 block/blk-core.c         |  23 +-----
 block/blk-mq-sysfs.c     |   8 ++
 block/blk-mq.c           | 195 ++++++++++++++++++++++++++++-------------------
 block/blk-mq.h           |   2 +-
 drivers/nvme/host/core.c |  10 ++-
 include/linux/blk-mq.h   |   2 +
 include/linux/blkdev.h   |   7 ++
 7 files changed, 143 insertions(+), 104 deletions(-)

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
-- 
2.9.5


* [PATCH V6 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path
From: Ming Lei @ 2019-04-17  3:44 UTC
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Just like aio/io_uring, we need to grab two refcounts for queuing one
request: one for submission and another for completion.

If the request isn't queued from the plug code path, the refcount grabbed
in generic_make_request() serves for submission. In theory, this refcount
should have been released after the submission (async run queue) is done.
blk_freeze_queue() works together with blk_sync_queue() to avoid the race
between queue cleanup and IO submission: async run-queue activities are
canceled because hctx->run_work is scheduled with the refcount held, so it
is fine not to hold the refcount while running the run-queue work function
to dispatch IO.

However, if a request is staged on the plug list and finally queued from
the plug code path, the submission-side refcount is actually missing. We
may then start to run the queue after the queue is removed, because the
queue's kobject refcount isn't guaranteed to be held in the plug-list
flushing context, and a kernel oops is triggered. See the following race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
                blk_mq_run_hw_queue

Because of concurrent run-queue activity, all requests inserted above may
be completed before the above blk_mq_run_hw_queue() is called, so the
queue can be freed during that blk_mq_run_hw_queue().

Fix the issue by grabbing .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is safe
because the queue is certainly alive before the requests are inserted.
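
To make the change easier to follow, here is an illustrative sketch
(plug_flush_one_batch() is a made-up helper; the actual patch below applies
the same get/put directly inside blk_mq_flush_plug_list()):

#include <linux/blk-mq.h>
#include "blk-mq-sched.h"	/* block-internal header */

static void plug_flush_one_batch(struct blk_mq_hw_ctx *hctx,
				 struct blk_mq_ctx *ctx,
				 struct list_head *rq_list,
				 bool from_schedule)
{
	struct request_queue *q = hctx->queue;

	/* submission-side reference: the queue can't be frozen/freed here */
	percpu_ref_get(&q->q_usage_counter);
	blk_mq_sched_insert_requests(hctx, ctx, rq_list, from_schedule);
	percpu_ref_put(&q->q_usage_counter);
}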

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9516304a38ee..ef5a16a2d6fb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1736,9 +1736,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 		if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
 			if (this_hctx) {
 				trace_block_unplug(this_q, depth, !from_schedule);
+
+				percpu_ref_get(&this_q->q_usage_counter);
 				blk_mq_sched_insert_requests(this_hctx, this_ctx,
 								&rq_list,
 								from_schedule);
+				percpu_ref_put(&this_q->q_usage_counter);
 			}
 
 			this_q = rq->q;
@@ -1757,8 +1760,11 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	 */
 	if (this_hctx) {
 		trace_block_unplug(this_q, depth, !from_schedule);
+
+		percpu_ref_get(&this_q->q_usage_counter);
 		blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list,
 						from_schedule);
+		percpu_ref_put(&this_q->q_usage_counter);
 	}
 }
 
-- 
2.9.5


* [PATCH V6 2/9] blk-mq: move cancel of requeue_work into blk_mq_release
From: Ming Lei @ 2019-04-17  3:44 UTC
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

While the queue's kobject refcount is held, it is safe for the driver
to schedule a requeue. However, blk_mq_kick_requeue_list() may be
called after blk_sync_queue() is done because of concurrent requeue
activity, so the requeue work may not be completed when the queue is
freed, and a kernel oops is triggered.

So move the cancel of requeue_work into blk_mq_release() to avoid the
race between requeue and freeing the queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 1 -
 block/blk-mq.c   | 2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a55389ba8779..93dc588fabe2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -237,7 +237,6 @@ void blk_sync_queue(struct request_queue *q)
 		struct blk_mq_hw_ctx *hctx;
 		int i;
 
-		cancel_delayed_work_sync(&q->requeue_work);
 		queue_for_each_hw_ctx(q, hctx, i)
 			cancel_delayed_work_sync(&hctx->run_work);
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ef5a16a2d6fb..55776a6e2586 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2640,6 +2640,8 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
+	cancel_delayed_work_sync(&q->requeue_work);
+
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
-- 
2.9.5


* [PATCH V6 3/9] blk-mq: free hw queue's resource in hctx's release handler
From: Ming Lei @ 2019-04-17  3:44 UTC
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang,
	stable

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes exactly this issue.

However, that commit introduces another issue. Before 45a9c9d909b2, we
were allowed to run the queue during queue cleanup as long as the queue's
kobj refcount was held. After that commit, the queue can't be run during
queue cleanup, otherwise an oops can easily be triggered, because some
fields of the hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways to address this kind of issue before, such as:

	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But they still can't cover all cases; recently James reported another
such issue:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue is quite hard to address with the previous approaches, given
that scsi_run_queue() may run requeues for other LUNs.

Fix the above issue by freeing the hctx's resources in its release
handler. This is safe because tags aren't needed for freeing these hctx
resources.

This approach follows the typical design pattern for a kobject's release
handler.
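
For reference, the kobject pattern being relied on looks roughly like the
sketch below (a hypothetical "foo" object, not the blk-mq code itself): the
release callback runs only after the last kobject_put(), so anything freed
there can no longer be in use by a reference holder.

#include <linux/kobject.h>
#include <linux/slab.h>

struct foo {
	struct kobject kobj;
	void *resource;
};

/* invoked only once the final reference to f->kobj is dropped */
static void foo_release(struct kobject *kobj)
{
	struct foo *f = container_of(kobj, struct foo, kobj);

	kfree(f->resource);
	kfree(f);
}

static struct kobj_type foo_ktype = {
	.release = foo_release,
};

Users then simply pair kobject_get()/kobject_put(); the actual freeing is
deferred to whoever drops the last reference.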

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c     | 2 +-
 block/blk-mq-sysfs.c | 6 ++++++
 block/blk-mq.c       | 8 ++------
 block/blk-mq.h       | 2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 93dc588fabe2..2dd94b3e9ece 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	blk_exit_queue(q);
 
 	if (queue_is_mq(q))
-		blk_mq_free_queue(q);
+		blk_mq_exit_queue(q);
 
 	percpu_ref_exit(&q->q_usage_counter);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3f9c3f4ac44c..4040e62c3737 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -10,6 +10,7 @@
 #include <linux/smp.h>
 
 #include <linux/blk-mq.h>
+#include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
@@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 {
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
+
+	if (hctx->flags & BLK_MQ_F_BLOCKING)
+		cleanup_srcu_struct(hctx->srcu);
+	blk_free_flush_queue(hctx->fq);
+	sbitmap_free(&hctx->ctx_map);
 	free_cpumask_var(hctx->cpumask);
 	kfree(hctx->ctxs);
 	kfree(hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 55776a6e2586..239b404b9e6c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2273,12 +2273,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		cleanup_srcu_struct(hctx->srcu);
-
 	blk_mq_remove_cpuhp(hctx);
-	blk_free_flush_queue(hctx->fq);
-	sbitmap_free(&hctx->ctx_map);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2913,7 +2908,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 }
 EXPORT_SYMBOL(blk_mq_init_allocated_queue);
 
-void blk_mq_free_queue(struct request_queue *q)
+/* tags can _not_ be used after returning from blk_mq_exit_queue */
+void blk_mq_exit_queue(struct request_queue *q)
 {
 	struct blk_mq_tag_set	*set = q->tag_set;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 423ea88ab6fb..633a5a77ee8b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,7 +37,7 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-void blk_mq_free_queue(struct request_queue *q);
+void blk_mq_exit_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
-- 
2.9.5


* [PATCH V6 4/9] blk-mq: move all hctx allocation & initialization into __blk_mq_alloc_and_init_hctx
From: Ming Lei @ 2019-04-17  3:44 UTC
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Move all hctx-related allocation and initialization into
__blk_mq_alloc_and_init_hctx(), and prepare for splitting
blk_mq_alloc_and_init_hctx() into two real functions: one that allocates
everything and another that initializes everything.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 93 +++++++++++++++++++++++++++-------------------------------
 1 file changed, 44 insertions(+), 49 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 239b404b9e6c..4c6fbbc6791b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2290,15 +2290,45 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	}
 }
 
-static int blk_mq_init_hctx(struct request_queue *q,
-		struct blk_mq_tag_set *set,
-		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
+static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
 {
-	int node;
+	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
+
+	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
+			   __alignof__(struct blk_mq_hw_ctx)) !=
+		     sizeof(struct blk_mq_hw_ctx));
+
+	if (tag_set->flags & BLK_MQ_F_BLOCKING)
+		hw_ctx_size += sizeof(struct srcu_struct);
+
+	return hw_ctx_size;
+}
+
+static struct blk_mq_hw_ctx *
+__blk_mq_alloc_and_init_hctx(struct request_queue *q,
+			     struct blk_mq_tag_set *set,
+			     unsigned hctx_idx, int node)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
+			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+			node);
+	if (!hctx)
+		goto fail_alloc_hctx;
+
+	if (!zalloc_cpumask_var_node(&hctx->cpumask,
+				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+				node))
+		goto free_hctx;
+
+	atomic_set(&hctx->nr_active, 0);
+	hctx->numa_node = node;
+	hctx->queue_num = hctx_idx;
 
-	node = hctx->numa_node;
 	if (node == NUMA_NO_NODE)
-		node = hctx->numa_node = set->numa_node;
+		hctx->numa_node = set->numa_node;
+	node = hctx->numa_node;
 
 	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
 	spin_lock_init(&hctx->lock);
@@ -2343,8 +2373,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		init_srcu_struct(hctx->srcu);
+	blk_mq_hctx_kobj_init(hctx);
 
-	return 0;
+	return hctx;
 
  free_fq:
 	blk_free_flush_queue(hctx->fq);
@@ -2357,7 +2388,11 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	kfree(hctx->ctxs);
  unregister_cpu_notifier:
 	blk_mq_remove_cpuhp(hctx);
-	return -1;
+	free_cpumask_var(hctx->cpumask);
+ free_hctx:
+	kfree(hctx);
+ fail_alloc_hctx:
+	return NULL;
 }
 
 static void blk_mq_init_cpu_queues(struct request_queue *q,
@@ -2703,51 +2738,11 @@ struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
 }
 EXPORT_SYMBOL(blk_mq_init_sq_queue);
 
-static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
-{
-	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
-
-	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
-			   __alignof__(struct blk_mq_hw_ctx)) !=
-		     sizeof(struct blk_mq_hw_ctx));
-
-	if (tag_set->flags & BLK_MQ_F_BLOCKING)
-		hw_ctx_size += sizeof(struct srcu_struct);
-
-	return hw_ctx_size;
-}
-
 static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	struct blk_mq_hw_ctx *hctx;
-
-	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
-			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
-			node);
-	if (!hctx)
-		return NULL;
-
-	if (!zalloc_cpumask_var_node(&hctx->cpumask,
-				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
-				node)) {
-		kfree(hctx);
-		return NULL;
-	}
-
-	atomic_set(&hctx->nr_active, 0);
-	hctx->numa_node = node;
-	hctx->queue_num = hctx_idx;
-
-	if (blk_mq_init_hctx(q, set, hctx, hctx_idx)) {
-		free_cpumask_var(hctx->cpumask);
-		kfree(hctx);
-		return NULL;
-	}
-	blk_mq_hctx_kobj_init(hctx);
-
-	return hctx;
+	return __blk_mq_alloc_and_init_hctx(q, set, hctx_idx, node);
 }
 
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
-- 
2.9.5


* [PATCH V6 5/9] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
From: Ming Lei @ 2019-04-17  3:44 UTC
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Split blk_mq_alloc_and_init_hctx() into two parts: blk_mq_alloc_hctx(),
which allocates all hctx resources, and blk_mq_init_hctx(), which
initializes the hctx and serves as the counterpart of blk_mq_exit_hctx().

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 76 +++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 49 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c6fbbc6791b..eeebba6ec0f7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2304,10 +2304,38 @@ static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
 	return hw_ctx_size;
 }
 
+static int blk_mq_init_hctx(struct request_queue *q,
+			    struct blk_mq_tag_set *set,
+			    struct blk_mq_hw_ctx *hctx,
+			    unsigned hctx_idx)
+{
+	hctx->queue_num = hctx_idx;
+
+	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
+
+	hctx->tags = set->tags[hctx_idx];
+
+	if (set->ops->init_hctx &&
+	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
+		goto unregister_cpu_notifier;
+
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
+				hctx->numa_node))
+		goto exit_hctx;
+	return 0;
+
+ exit_hctx:
+	if (set->ops->exit_hctx)
+		set->ops->exit_hctx(hctx, hctx_idx);
+ unregister_cpu_notifier:
+	blk_mq_remove_cpuhp(hctx);
+	return -1;
+}
+
 static struct blk_mq_hw_ctx *
-__blk_mq_alloc_and_init_hctx(struct request_queue *q,
-			     struct blk_mq_tag_set *set,
-			     unsigned hctx_idx, int node)
+blk_mq_alloc_hctx(struct request_queue *q,
+		  struct blk_mq_tag_set *set,
+		  unsigned hctx_idx, int node)
 {
 	struct blk_mq_hw_ctx *hctx;
 
@@ -2324,8 +2352,6 @@ __blk_mq_alloc_and_init_hctx(struct request_queue *q,
 
 	atomic_set(&hctx->nr_active, 0);
 	hctx->numa_node = node;
-	hctx->queue_num = hctx_idx;
-
 	if (node == NUMA_NO_NODE)
 		hctx->numa_node = set->numa_node;
 	node = hctx->numa_node;
@@ -2336,10 +2362,6 @@ __blk_mq_alloc_and_init_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
-	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
-
-	hctx->tags = set->tags[hctx_idx];
-
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
@@ -2347,29 +2369,21 @@ __blk_mq_alloc_and_init_hctx(struct request_queue *q,
 	hctx->ctxs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
 			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node);
 	if (!hctx->ctxs)
-		goto unregister_cpu_notifier;
+		goto free_cpumask;
 
 	if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8),
 				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node))
 		goto free_ctxs;
-
 	hctx->nr_ctx = 0;
 
 	spin_lock_init(&hctx->dispatch_wait_lock);
 	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
 	INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
 
-	if (set->ops->init_hctx &&
-	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
-		goto free_bitmap;
-
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size,
 			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
 	if (!hctx->fq)
-		goto exit_hctx;
-
-	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
-		goto free_fq;
+		goto free_bitmap;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		init_srcu_struct(hctx->srcu);
@@ -2377,17 +2391,11 @@ __blk_mq_alloc_and_init_hctx(struct request_queue *q,
 
 	return hctx;
 
- free_fq:
-	blk_free_flush_queue(hctx->fq);
- exit_hctx:
-	if (set->ops->exit_hctx)
-		set->ops->exit_hctx(hctx, hctx_idx);
  free_bitmap:
 	sbitmap_free(&hctx->ctx_map);
  free_ctxs:
 	kfree(hctx->ctxs);
- unregister_cpu_notifier:
-	blk_mq_remove_cpuhp(hctx);
+ free_cpumask:
 	free_cpumask_var(hctx->cpumask);
  free_hctx:
 	kfree(hctx);
@@ -2742,7 +2750,21 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	return __blk_mq_alloc_and_init_hctx(q, set, hctx_idx, node);
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
+	if (!hctx)
+		goto fail;
+
+	if (blk_mq_init_hctx(q, set, hctx, hctx_idx))
+		goto free_hctx;
+
+	return hctx;
+
+ free_hctx:
+	kobject_put(&hctx->kobj);
+ fail:
+	return NULL;
 }
 
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
-- 
2.9.5


* [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed
From: Ming Lei @ 2019-04-17  3:44 UTC
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

In the normal queue cleanup path, the hctx is released after the request
queue is freed; see blk_mq_release().

However, in __blk_mq_update_nr_hw_queues(), an hctx may be freed because
the number of hw queues is shrinking. This can easily cause a
use-after-free: one implicit rule is that it is safe to call almost all
block layer APIs as long as the request queue is alive, so an hctx may be
retrieved by one API and then freed by blk_mq_update_nr_hw_queues(),
triggering the use-after-free.

Fix this issue by always freeing the hctx after the request queue is
released. If some hctxs are removed in blk_mq_update_nr_hw_queues(), hold
them on a newly introduced per-queue list, and try to reuse these hctxs
when the NUMA node matches.
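
A condensed sketch of the parking/reuse lifecycle (the helper names are
made up; the real code lives in blk_mq_exit_hctx() and
blk_mq_alloc_and_init_hctx() in the diff below):

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/* on exit: park the hctx on the per-queue list instead of freeing it */
static void park_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx)
{
	spin_lock(&q->dead_hctx_lock);
	list_add(&hctx->hctx_list, &q->dead_hctx_list);
	spin_unlock(&q->dead_hctx_lock);
}

/* on (re)allocation: prefer a parked hctx from the same NUMA node */
static struct blk_mq_hw_ctx *pick_parked_hctx(struct request_queue *q, int node)
{
	struct blk_mq_hw_ctx *hctx = NULL, *tmp;

	spin_lock(&q->dead_hctx_lock);
	list_for_each_entry(tmp, &q->dead_hctx_list, hctx_list) {
		if (tmp->numa_node == node) {
			hctx = tmp;
			list_del_init(&hctx->hctx_list);
			break;
		}
	}
	spin_unlock(&q->dead_hctx_lock);

	return hctx;	/* NULL: the caller allocates a fresh hctx */
}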

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 40 +++++++++++++++++++++++++++-------------
 include/linux/blk-mq.h |  2 ++
 include/linux/blkdev.h |  7 +++++++
 3 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index eeebba6ec0f7..2ca4395f794d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2274,6 +2274,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 		set->ops->exit_hctx(hctx, hctx_idx);
 
 	blk_mq_remove_cpuhp(hctx);
+
+	spin_lock(&q->dead_hctx_lock);
+	list_add(&hctx->hctx_list, &q->dead_hctx_list);
+	spin_unlock(&q->dead_hctx_lock);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2675,15 +2679,13 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
  */
 void blk_mq_release(struct request_queue *q)
 {
-	struct blk_mq_hw_ctx *hctx;
-	unsigned int i;
+	struct blk_mq_hw_ctx *hctx, *next;
 
 	cancel_delayed_work_sync(&q->requeue_work);
 
-	/* hctx kobj stays in hctx */
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (!hctx)
-			continue;
+	/* all hctx are in .dead_hctx_list now */
+	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
+		list_del_init(&hctx->hctx_list);
 		kobject_put(&hctx->kobj);
 	}
 
@@ -2750,9 +2752,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
+
+	/* reuse dead hctx first */
+	spin_lock(&q->dead_hctx_lock);
+	list_for_each_entry(tmp, &q->dead_hctx_list, hctx_list) {
+		if (tmp->numa_node == node) {
+			hctx = tmp;
+			break;
+		}
+	}
+	if (hctx)
+		list_del_init(&hctx->hctx_list);
+	spin_unlock(&q->dead_hctx_lock);
 
-	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
+	if (!hctx)
+		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
 	if (!hctx)
 		goto fail;
 
@@ -2790,10 +2805,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 
 		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
 		if (hctx) {
-			if (hctxs[i]) {
+			if (hctxs[i])
 				blk_mq_exit_hctx(q, set, hctxs[i], i);
-				kobject_put(&hctxs[i]->kobj);
-			}
 			hctxs[i] = hctx;
 		} else {
 			if (hctxs[i])
@@ -2824,9 +2837,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 			if (hctx->tags)
 				blk_mq_free_map_and_requests(set, j);
 			blk_mq_exit_hctx(q, set, hctx, j);
-			kobject_put(&hctx->kobj);
 			hctxs[j] = NULL;
-
 		}
 	}
 	mutex_unlock(&q->sysfs_lock);
@@ -2869,6 +2880,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_sys_init;
 
+	INIT_LIST_HEAD(&q->dead_hctx_list);
+	spin_lock_init(&q->dead_hctx_lock);
+
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index db29928de467..15d1aa53d96c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -70,6 +70,8 @@ struct blk_mq_hw_ctx {
 	struct dentry		*sched_debugfs_dir;
 #endif
 
+	struct list_head	hctx_list;
+
 	/* Must be the last member - see also blk_mq_hw_ctx_size(). */
 	struct srcu_struct	srcu[0];
 };
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4b85dc066264..1325f941f0be 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -535,6 +535,13 @@ struct request_queue {
 
 	struct mutex		sysfs_lock;
 
+	/*
+	 * for reusing dead hctx instance in case of updating
+	 * nr_hw_queues
+	 */
+	struct list_head	dead_hctx_list;
+	spinlock_t		dead_hctx_lock;
+
 	atomic_t		mq_freeze_depth;
 
 #if defined(CONFIG_BLK_DEV_BSG)
-- 
2.9.5


* [PATCH V6 7/9] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  2019-04-17  3:44 ` Ming Lei
@ 2019-04-17  3:44   ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2019-04-17  3:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

hctx is always released after the request queue is freed.

While the queue's kobject refcount is held, it is safe for the driver to
run the queue, so a run-queue work item may still be scheduled after
blk_sync_queue() has completed.

So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release() to
avoid running a queue that has already been released.
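
For illustration only, a hedged sketch (not code from this series) of the
driver-side pattern that makes the old placement racy: a driver holding
the queue's kobject reference may legitimately schedule a run-queue work
item even while the queue is being cleaned up, so cancelling
hctx->run_work from blk_sync_queue() can happen too early:

	/* hypothetical driver path; q and hctx are assumed to be valid */
	if (blk_get_queue(q)) {
		/* may execute after blk_sync_queue() has already returned */
		blk_mq_delay_run_hw_queue(hctx, 10);
		blk_put_queue(q);
	}

Cancelling from blk_mq_hw_sysfs_release() instead guarantees the cancel
runs only once the last reference to the hctx is dropped.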

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c     | 8 --------
 block/blk-mq-sysfs.c | 2 ++
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2dd94b3e9ece..f5b5f21ae4fd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -232,14 +232,6 @@ void blk_sync_queue(struct request_queue *q)
 {
 	del_timer_sync(&q->timeout);
 	cancel_work_sync(&q->timeout_work);
-
-	if (queue_is_mq(q)) {
-		struct blk_mq_hw_ctx *hctx;
-		int i;
-
-		queue_for_each_hw_ctx(q, hctx, i)
-			cancel_delayed_work_sync(&hctx->run_work);
-	}
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 4040e62c3737..25c0d0a6a556 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -35,6 +35,8 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
 
+	cancel_delayed_work_sync(&hctx->run_work);
+
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		cleanup_srcu_struct(hctx->srcu);
 	blk_free_flush_queue(hctx->fq);
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V6 8/9] block: don't drain in-progress dispatch in blk_cleanup_queue()
  2019-04-17  3:44 ` Ming Lei
@ 2019-04-17  3:44   ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2019-04-17  3:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Now that freeing the hw queue resources has been moved into hctx's
release handler, we no longer need to worry about the race between
blk_cleanup_queue() and running the queue.

So don't drain in-progress dispatch in blk_cleanup_queue().

This is basically a revert of c2856ae2f315 ("blk-mq: quiesce queue before
freeing queue").

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f5b5f21ae4fd..e24cfcefdc19 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -338,18 +338,6 @@ void blk_cleanup_queue(struct request_queue *q)
 
 	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
 
-	/*
-	 * make sure all in-progress dispatch are completed because
-	 * blk_freeze_queue() can only complete all requests, and
-	 * dispatch may still be in-progress since we dispatch requests
-	 * from more than one contexts.
-	 *
-	 * We rely on driver to deal with the race in case that queue
-	 * initialization isn't done.
-	 */
-	if (queue_is_mq(q) && blk_queue_init_done(q))
-		blk_mq_quiesce_queue(q);
-
 	/* for synchronous bio-based driver finish in-flight integrity i/o */
 	blk_flush_integrity();
 
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH V6 9/9] nvme: hold request queue's refcount in ns's whole lifetime
  2019-04-17  3:44 ` Ming Lei
@ 2019-04-17  3:44   ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2019-04-17  3:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Ming Lei, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Hannes reported the following kernel oops:

    There is a race condition between namespace rescanning and
    controller reset; during controller reset all namespaces are
    quiesced via nvme_stop_ctrl(), and after reset all namespaces
    are unquiesced again.
    When namespace scanning was active at the time the controller reset
    was triggered, the rescan code will call nvme_ns_remove(), which
    then causes a kernel crash in nvme_start_ctrl() as it trips
    over uninitialized namespaces.

The patch "blk-mq: free hw queue's resource in hctx's release handler"
should make this issue quite difficult to trigger. However, it can't
eliminate the issue completely, because the precondition of that patch is
that the request queue's refcount is held before calling into block layer
APIs, and there is still a small window between blk_cleanup_queue() and
removing the ns from the controller's namespace list in nvme_ns_remove().

Hold the request queue's refcount until the ns is freed; then the above
race can be avoided completely. Given that 'namespaces_rwsem' is always
held while retrieving a ns to start/stop its request queue, this lock
prevents the namespaces from being freed.
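
For context, a simplified sketch of the start/stop path that the previous
paragraph refers to (it approximates the nvme core of this period and is
not code added by this patch); holding ns->queue's refcount for the whole
ns lifetime keeps the blk_mq_quiesce_queue() call below safe even if
nvme_ns_remove() runs concurrently:

	static void stop_all_ns_queues(struct nvme_ctrl *ctrl)
	{
		struct nvme_ns *ns;

		down_read(&ctrl->namespaces_rwsem);
		list_for_each_entry(ns, &ctrl->namespaces, list)
			blk_mq_quiesce_queue(ns->queue);
		up_read(&ctrl->namespaces_rwsem);
	}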

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reported-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 248ff3b48041..82cda6602ca7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -406,6 +406,7 @@ static void nvme_free_ns(struct kref *kref)
 		nvme_nvm_unregister(ns);
 
 	put_disk(ns->disk);
+	blk_put_queue(ns->queue);
 	nvme_put_ns_head(ns->head);
 	nvme_put_ctrl(ns->ctrl);
 	kfree(ns);
@@ -3229,6 +3230,11 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		goto out_free_ns;
 	}
 
+	if (!blk_get_queue(ns->queue)) {
+		ret = -ENXIO;
+		goto out_free_queue;
+	}
+
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
 	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
 		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
@@ -3245,7 +3251,7 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	id = nvme_identify_ns(ctrl, nsid);
 	if (!id) {
 		ret = -EIO;
-		goto out_free_queue;
+		goto out_put_queue;
 	}
 
 	if (id->ncap == 0) {
@@ -3304,6 +3310,8 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	nvme_put_ns_head(ns->head);
  out_free_id:
 	kfree(id);
+ out_put_queue:
+	blk_put_queue(ns->queue);
  out_free_queue:
 	blk_cleanup_queue(ns->queue);
  out_free_ns:
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 2/9] blk-mq: move cancel of requeue_work into blk_mq_release
  2019-04-17  3:44   ` Ming Lei
@ 2019-04-17 12:00     ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2019-04-17 12:00 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On 4/17/19 5:44 AM, Ming Lei wrote:
> With holding queue's kobject refcount, it is safe for driver
> to schedule requeue. However, blk_mq_kick_requeue_list() may
> be called after blk_sync_queue() is done because of concurrent
> requeue activities, then requeue work may not be completed when
> freeing queue, and kernel oops is triggered.
> 
> So moving the cancel of requeue_work into blk_mq_release() for
> avoiding race between requeue and freeing queue.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c | 1 -
>   block/blk-mq.c   | 2 ++
>   2 files changed, 2 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 3/9] blk-mq: free hw queue's resource in hctx's release handler
  2019-04-17  3:44   ` Ming Lei
@ 2019-04-17 12:02     ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2019-04-17 12:02 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang, stable

On 4/17/19 5:44 AM, Ming Lei wrote:
> Once blk_cleanup_queue() returns, tags shouldn't be used any more,
> because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
> ("blk-mq: Fix a use-after-free") fixes this issue exactly.
> 
> However, that commit introduces another issue. Before 45a9c9d909b2,
> we are allowed to run queue during cleaning up queue if the queue's
> kobj refcount is held. After that commit, queue can't be run during
> queue cleaning up, otherwise oops can be triggered easily because
> some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().
> 
> We have invented ways for addressing this kind of issue before, such as:
> 
> 	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
> 	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")
> 
> But still can't cover all cases, recently James reports another such
> kind of issue:
> 
> 	https://marc.info/?l=linux-scsi&m=155389088124782&w=2
> 
> This issue can be quite hard to address by previous way, given
> scsi_run_queue() may run requeues for other LUNs.
> 
> Fixes the above issue by freeing hctx's resources in its release handler, and this
> way is safe becasue tags isn't needed for freeing such hctx resource.
> 
> This approach follows typical design pattern wrt. kobject's release handler.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Reported-by: James Smart <james.smart@broadcom.com>
> Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c     | 2 +-
>   block/blk-mq-sysfs.c | 6 ++++++
>   block/blk-mq.c       | 8 ++------
>   block/blk-mq.h       | 2 +-
>   4 files changed, 10 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 4/9] blk-mq: move all hctx alloction & initialization into __blk_mq_alloc_and_init_hctx
  2019-04-17  3:44   ` Ming Lei
@ 2019-04-17 12:03     ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2019-04-17 12:03 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On 4/17/19 5:44 AM, Ming Lei wrote:
> Move all hctx related allocation and initialization into
> __blk_mq_alloc_and_init_hctx, and prepare for splitting blk_mq_alloc_and_init_hctx
> into real two functions, one is for allocate everything, and another is for
> initializing everyting.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 93 +++++++++++++++++++++++++++-------------------------------
>   1 file changed, 44 insertions(+), 49 deletions(-)
> Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-17  3:44   ` Ming Lei
@ 2019-04-17 12:08     ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2019-04-17 12:08 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On 4/17/19 5:44 AM, Ming Lei wrote:
> In normal queue cleanup path, hctx is released after request queue
> is freed, see blk_mq_release().
> 
> However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
> of hw queues shrinking. This way is easy to cause use-after-free,
> because: one implicit rule is that it is safe to call almost all block
> layer APIs if the request queue is alive; and one hctx may be retrieved
> by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
> finally use-after-free is triggered.
> 
> Fixes this issue by always freeing hctx after releasing request queue.
> If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
> a per-queue list to hold them, then try to resuse these hctxs if numa
> node is matched.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c         | 40 +++++++++++++++++++++++++++-------------
>   include/linux/blk-mq.h |  2 ++
>   include/linux/blkdev.h |  7 +++++++
>   3 files changed, 36 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index eeebba6ec0f7..2ca4395f794d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2274,6 +2274,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>   		set->ops->exit_hctx(hctx, hctx_idx);
>   
>   	blk_mq_remove_cpuhp(hctx);
> +
> +	spin_lock(&q->dead_hctx_lock);
> +	list_add(&hctx->hctx_list, &q->dead_hctx_list);
> +	spin_unlock(&q->dead_hctx_lock);
>   }
>   
>   static void blk_mq_exit_hw_queues(struct request_queue *q,
> @@ -2675,15 +2679,13 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
>    */
>   void blk_mq_release(struct request_queue *q)
>   {
> -	struct blk_mq_hw_ctx *hctx;
> -	unsigned int i;
> +	struct blk_mq_hw_ctx *hctx, *next;
>   
>   	cancel_delayed_work_sync(&q->requeue_work);
>   
> -	/* hctx kobj stays in hctx */
> -	queue_for_each_hw_ctx(q, hctx, i) {
> -		if (!hctx)
> -			continue;
> +	/* all hctx are in .dead_hctx_list now */
> +	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
> +		list_del_init(&hctx->hctx_list);
>   		kobject_put(&hctx->kobj);
>   	}
>   
> @@ -2750,9 +2752,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
>   		struct blk_mq_tag_set *set, struct request_queue *q,
>   		int hctx_idx, int node)
>   {
> -	struct blk_mq_hw_ctx *hctx;
> +	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
> +
> +	/* reuse dead hctx first */
> +	spin_lock(&q->dead_hctx_lock);
> +	list_for_each_entry(tmp, &q->dead_hctx_list, hctx_list) {
> +		if (tmp->numa_node == node) {
> +			hctx = tmp;
> +			break;
> +		}
> +	}
> +	if (hctx)
> +		list_del_init(&hctx->hctx_list);
> +	spin_unlock(&q->dead_hctx_lock);
>   
> -	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
> +	if (!hctx)
> +		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
>   	if (!hctx)
>   		goto fail;
>   
> @@ -2790,10 +2805,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>   
>   		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
>   		if (hctx) {
> -			if (hctxs[i]) {
> +			if (hctxs[i])
>   				blk_mq_exit_hctx(q, set, hctxs[i], i);
> -				kobject_put(&hctxs[i]->kobj);
> -			}
>   			hctxs[i] = hctx;
>   		} else {
>   			if (hctxs[i])
> @@ -2824,9 +2837,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>   			if (hctx->tags)
>   				blk_mq_free_map_and_requests(set, j);
>   			blk_mq_exit_hctx(q, set, hctx, j);
> -			kobject_put(&hctx->kobj);
>   			hctxs[j] = NULL;
> -
>   		}
>   	}
>   	mutex_unlock(&q->sysfs_lock);
> @@ -2869,6 +2880,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>   	if (!q->queue_hw_ctx)
>   		goto err_sys_init;
>   
> +	INIT_LIST_HEAD(&q->dead_hctx_list);
> +	spin_lock_init(&q->dead_hctx_lock);
> +
>   	blk_mq_realloc_hw_ctxs(set, q);
>   	if (!q->nr_hw_queues)
>   		goto err_hctxs;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index db29928de467..15d1aa53d96c 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -70,6 +70,8 @@ struct blk_mq_hw_ctx {
>   	struct dentry		*sched_debugfs_dir;
>   #endif
>   
> +	struct list_head	hctx_list;
> +
>   	/* Must be the last member - see also blk_mq_hw_ctx_size(). */
>   	struct srcu_struct	srcu[0];
>   };
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 4b85dc066264..1325f941f0be 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -535,6 +535,13 @@ struct request_queue {
>   
>   	struct mutex		sysfs_lock;
>   
> +	/*
> +	 * for reusing dead hctx instance in case of updating
> +	 * nr_hw_queues
> +	 */
> +	struct list_head	dead_hctx_list;
> +	spinlock_t		dead_hctx_lock;
> +
>   	atomic_t		mq_freeze_depth;
>   
>   #if defined(CONFIG_BLK_DEV_BSG)
> 
Hmm.
I don't particularly like this approach.
The much saner approach would be to avoid having I/O in flight in the
first place by setting the queue to something other than live, no?
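
A minimal sketch of what such state gating could look like (purely
illustrative; it is not something proposed in this thread): callers that
are about to touch an hctx would first check the queue state, e.g.

	if (blk_queue_dying(q) || blk_queue_dead(q))
		return;

but, as the reply further down notes, many existing hctx references are
taken without any such check.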

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 9/9] nvme: hold request queue's refcount in ns's whole lifetime
  2019-04-17  3:44   ` Ming Lei
@ 2019-04-17 12:10     ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2019-04-17 12:10 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Dongli Zhang, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On 4/17/19 5:44 AM, Ming Lei wrote:
> Hennes reported the following kernel oops:
> 
>      There is a race condition between namespace rescanning and
>      controller reset; during controller reset all namespaces are
>      quiesed vie nams_stop_ctrl(), and after reset all namespaces
>      are unquiesced again.
>      When namespace scanning was active by the time controller reset
>      was triggered the rescan code will call nvme_ns_remove(), which
>      then will cause a kernel crash in nvme_start_ctrl() as it'll trip
>      over uninitialized namespaces.
> 
> Patch "blk-mq: free hw queue's resource in hctx's release handler"
> should make this issue quite difficult to trigger. However it can't
> kill the issue completely becasue pre-condition of that patch is to
> hold request queue's refcount before calling block layer API, and
> there is still a small window between blk_cleanup_queue() and removing
> the ns from the controller namspace list in nvme_ns_remove().
> 
> Hold request queue's refcount until the ns is freed, then the above race
> can be avoided completely. Given the 'namespaces_rwsem' is always held
> to retrieve ns for starting/stopping request queue, this lock can prevent
> namespaces from being freed.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Reported-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   drivers/nvme/host/core.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 248ff3b48041..82cda6602ca7 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -406,6 +406,7 @@ static void nvme_free_ns(struct kref *kref)
>   		nvme_nvm_unregister(ns);
>   
>   	put_disk(ns->disk);
> +	blk_put_queue(ns->queue);
>   	nvme_put_ns_head(ns->head);
>   	nvme_put_ctrl(ns->ctrl);
>   	kfree(ns);
> @@ -3229,6 +3230,11 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>   		goto out_free_ns;
>   	}
>   
> +	if (!blk_get_queue(ns->queue)) {
> +		ret = -ENXIO;
> +		goto out_free_queue;
> +	}
> +
>   	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
>   	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
>   		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
> @@ -3245,7 +3251,7 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>   	id = nvme_identify_ns(ctrl, nsid);
>   	if (!id) {
>   		ret = -EIO;
> -		goto out_free_queue;
> +		goto out_put_queue;
>   	}
>   
>   	if (id->ncap == 0) {
> @@ -3304,6 +3310,8 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>   	nvme_put_ns_head(ns->head);
>    out_free_id:
>   	kfree(id);
> + out_put_queue:
> +	blk_put_queue(ns->queue);
>    out_free_queue:
>   	blk_cleanup_queue(ns->queue);
>    out_free_ns:
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-17 12:08     ` Hannes Reinecke
@ 2019-04-17 12:59       ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2019-04-17 12:59 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Wed, Apr 17, 2019 at 02:08:59PM +0200, Hannes Reinecke wrote:
> On 4/17/19 5:44 AM, Ming Lei wrote:
> > In normal queue cleanup path, hctx is released after request queue
> > is freed, see blk_mq_release().
> > 
> > However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
> > of hw queues shrinking. This way is easy to cause use-after-free,
> > because: one implicit rule is that it is safe to call almost all block
> > layer APIs if the request queue is alive; and one hctx may be retrieved
> > by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
> > finally use-after-free is triggered.
> > 
> > Fixes this issue by always freeing hctx after releasing request queue.
> > If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
> > a per-queue list to hold them, then try to resuse these hctxs if numa
> > node is matched.
> > 
> > Cc: Dongli Zhang <dongli.zhang@oracle.com>
> > Cc: James Smart <james.smart@broadcom.com>
> > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > Cc: linux-scsi@vger.kernel.org,
> > Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> > Cc: Christoph Hellwig <hch@lst.de>,
> > Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> > Cc: jianchao wang <jianchao.w.wang@oracle.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq.c         | 40 +++++++++++++++++++++++++++-------------
> >   include/linux/blk-mq.h |  2 ++
> >   include/linux/blkdev.h |  7 +++++++
> >   3 files changed, 36 insertions(+), 13 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index eeebba6ec0f7..2ca4395f794d 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2274,6 +2274,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
> >   		set->ops->exit_hctx(hctx, hctx_idx);
> >   	blk_mq_remove_cpuhp(hctx);
> > +
> > +	spin_lock(&q->dead_hctx_lock);
> > +	list_add(&hctx->hctx_list, &q->dead_hctx_list);
> > +	spin_unlock(&q->dead_hctx_lock);
> >   }
> >   static void blk_mq_exit_hw_queues(struct request_queue *q,
> > @@ -2675,15 +2679,13 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
> >    */
> >   void blk_mq_release(struct request_queue *q)
> >   {
> > -	struct blk_mq_hw_ctx *hctx;
> > -	unsigned int i;
> > +	struct blk_mq_hw_ctx *hctx, *next;
> >   	cancel_delayed_work_sync(&q->requeue_work);
> > -	/* hctx kobj stays in hctx */
> > -	queue_for_each_hw_ctx(q, hctx, i) {
> > -		if (!hctx)
> > -			continue;
> > +	/* all hctx are in .dead_hctx_list now */
> > +	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
> > +		list_del_init(&hctx->hctx_list);
> >   		kobject_put(&hctx->kobj);
> >   	}
> > @@ -2750,9 +2752,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
> >   		struct blk_mq_tag_set *set, struct request_queue *q,
> >   		int hctx_idx, int node)
> >   {
> > -	struct blk_mq_hw_ctx *hctx;
> > +	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
> > +
> > +	/* reuse dead hctx first */
> > +	spin_lock(&q->dead_hctx_lock);
> > +	list_for_each_entry(tmp, &q->dead_hctx_list, hctx_list) {
> > +		if (tmp->numa_node == node) {
> > +			hctx = tmp;
> > +			break;
> > +		}
> > +	}
> > +	if (hctx)
> > +		list_del_init(&hctx->hctx_list);
> > +	spin_unlock(&q->dead_hctx_lock);
> > -	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
> > +	if (!hctx)
> > +		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
> >   	if (!hctx)
> >   		goto fail;
> > @@ -2790,10 +2805,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> >   		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
> >   		if (hctx) {
> > -			if (hctxs[i]) {
> > +			if (hctxs[i])
> >   				blk_mq_exit_hctx(q, set, hctxs[i], i);
> > -				kobject_put(&hctxs[i]->kobj);
> > -			}
> >   			hctxs[i] = hctx;
> >   		} else {
> >   			if (hctxs[i])
> > @@ -2824,9 +2837,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> >   			if (hctx->tags)
> >   				blk_mq_free_map_and_requests(set, j);
> >   			blk_mq_exit_hctx(q, set, hctx, j);
> > -			kobject_put(&hctx->kobj);
> >   			hctxs[j] = NULL;
> > -
> >   		}
> >   	}
> >   	mutex_unlock(&q->sysfs_lock);
> > @@ -2869,6 +2880,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> >   	if (!q->queue_hw_ctx)
> >   		goto err_sys_init;
> > +	INIT_LIST_HEAD(&q->dead_hctx_list);
> > +	spin_lock_init(&q->dead_hctx_lock);
> > +
> >   	blk_mq_realloc_hw_ctxs(set, q);
> >   	if (!q->nr_hw_queues)
> >   		goto err_hctxs;
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index db29928de467..15d1aa53d96c 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -70,6 +70,8 @@ struct blk_mq_hw_ctx {
> >   	struct dentry		*sched_debugfs_dir;
> >   #endif
> > +	struct list_head	hctx_list;
> > +
> >   	/* Must be the last member - see also blk_mq_hw_ctx_size(). */
> >   	struct srcu_struct	srcu[0];
> >   };
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 4b85dc066264..1325f941f0be 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -535,6 +535,13 @@ struct request_queue {
> >   	struct mutex		sysfs_lock;
> > +	/*
> > +	 * for reusing dead hctx instance in case of updating
> > +	 * nr_hw_queues
> > +	 */
> > +	struct list_head	dead_hctx_list;
> > +	spinlock_t		dead_hctx_lock;
> > +
> >   	atomic_t		mq_freeze_depth;
> >   #if defined(CONFIG_BLK_DEV_BSG)
> > 
> Hmm.
> I don't particularly like this approach.
> The much saner approach would be to avoid having I/O in-flight in the first
> place by setting the queue to something else than live, no?

There isn't any in-flight IO, since blk_mq_freeze_queue() has already
returned inside __blk_mq_update_nr_hw_queues().

However, other code paths may still hold references to an hctx obtained
via arbitrary block layer APIs.

Quiescing can't avoid the use-after-free either, given that the RCU read
lock may not be held around many of these hctx references, for example in
queue_for_each_hw_ctx().
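
A hedged example of the kind of reference that is taken without any RCU
or queue-state protection (illustrative only, not a specific in-kernel
caller):

	struct blk_mq_hw_ctx *hctx;
	unsigned int i;

	queue_for_each_hw_ctx(q, hctx, i)
		pr_debug("hctx %u on node %u\n",
			 hctx->queue_num, hctx->numa_node);

If blk_mq_update_nr_hw_queues() freed a shrunk hctx while such a loop was
running, the loop would dereference freed memory; keeping each hctx around
until the request queue itself is released avoids that.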

So this patch might be the simplest approach to fix this issue, IMO.

However, I am open to any solution for this issue.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 9/9] nvme: hold request queue's refcount in ns's whole lifetime
  2019-04-17  3:44   ` Ming Lei
@ 2019-04-17 15:55     ` Keith Busch
  -1 siblings, 0 replies; 50+ messages in thread
From: Keith Busch @ 2019-04-17 15:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Busch, Keith,
	linux-nvme, Sagi Grimberg, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Tue, Apr 16, 2019 at 08:44:10PM -0700, Ming Lei wrote:
> Hannes reported the following kernel oops:
> 
>     There is a race condition between namespace rescanning and
>     controller reset; during controller reset all namespaces are
>     quiesced via nams_stop_ctrl(), and after reset all namespaces
>     are unquiesced again.
>     When namespace scanning was still active by the time controller reset
>     was triggered, the rescan code will call nvme_ns_remove(), which
>     then will cause a kernel crash in nvme_start_ctrl() as it'll trip
>     over uninitialized namespaces.
> 
> Patch "blk-mq: free hw queue's resource in hctx's release handler"
> should make this issue quite difficult to trigger. However it can't
> kill the issue completely, because the pre-condition of that patch is to
> hold the request queue's refcount before calling block layer APIs, and
> there is still a small window between blk_cleanup_queue() and removing
> the ns from the controller's namespace list in nvme_ns_remove().
> 
> Hold request queue's refcount until the ns is freed, then the above race
> can be avoided completely. Given the 'namespaces_rwsem' is always held
> to retrieve ns for starting/stopping request queue, this lock can prevent
> namespaces from being freed.
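
For reference, the shape of that change is roughly the following simplified
sketch -- not the exact hunk from patch 9/9; it assumes the nvme_ns
definition from drivers/nvme/host/nvme.h and uses made-up example_* names:

#include <linux/err.h>
#include <linux/slab.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/*
 * Simplified sketch of the approach described above, not the exact diff:
 * the namespace pins ns->queue for its whole lifetime, so stopping and
 * starting the queue during controller reset can never race with the
 * queue being freed.
 */
static int example_ns_setup_queue(struct nvme_ns *ns, struct blk_mq_tag_set *set)
{
	ns->queue = blk_mq_init_queue(set);
	if (IS_ERR(ns->queue))
		return PTR_ERR(ns->queue);

	/* extra reference, dropped in example_ns_free() below */
	if (!blk_get_queue(ns->queue)) {
		blk_cleanup_queue(ns->queue);
		return -ENXIO;
	}

	return 0;
}

static void example_ns_free(struct nvme_ns *ns)
{
	/* blk_cleanup_queue() may already have run; drop the last reference */
	blk_put_queue(ns->queue);
	kfree(ns);
}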

This looks good to me.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 0/9] blk-mq: fix races related with freeing queue
  2019-04-17  3:44 ` Ming Lei
@ 2019-04-17 17:22   ` James Smart
  -1 siblings, 0 replies; 50+ messages in thread
From: James Smart @ 2019-04-17 17:22 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, Dongli Zhang, Bart Van Assche, linux-scsi,
	Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley,
	jianchao wang

On 4/16/2019 8:44 PM, Ming Lei wrote:
> Hi,
>
> Since 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), run queue isn't
> allowed during cleanup queue even though queue refcount is held.
>
> This change has caused lots of kernel oops triggered in run queue path,
> turns out it isn't easy to fix them all.
>
> So move freeing of hw queue resources into hctx's release handler, then
> the above issue is fixed. Meantime, this way is safe given freeing hw
> queue resource doesn't require tags.
>
> V3 covers more races.
>
> V6:
> 	- remove previous SCSI patch which will be routed via SCSI tree
> 	- add reviewed-by tag
> 	- fix one related NVMe scan vs reset race
>
> V5:
> 	- refactor blk_mq_alloc_and_init_hctx()
> 	- fix race related updating nr_hw_queues by always freeing hctx
> 	  after request queue is released
>
> V4:
> 	- add patch for fixing potential use-after-free in blk_mq_update_nr_hw_queues
> 	- fix comment in the last patch
>
> V3:
> 	- cancel q->requeue_work in queue's release handler
> 	- cancel hctx->run_work in hctx's release handler
> 	- add patch 1 for fixing race in plug code path
> 	- the last patch is added for avoiding to grab SCSI's refcont
> 	in IO path
>
> V2:
> 	- moving freeing hw queue resources into hctx's release handler
>
> Ming Lei (9):
>    blk-mq: grab .q_usage_counter when queuing request from plug code path
>    blk-mq: move cancel of requeue_work into blk_mq_release
>    blk-mq: free hw queue's resource in hctx's release handler
>    blk-mq: move all hctx alloction & initialization into
>      __blk_mq_alloc_and_init_hctx
>    blk-mq: split blk_mq_alloc_and_init_hctx into two parts
>    blk-mq: always free hctx after request queue is freed
>    blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
>    block: don't drain in-progress dispatch in blk_cleanup_queue()
>    nvme: hold request queue's refcount in ns's whole lifetime
>
>   block/blk-core.c         |  23 +-----
>   block/blk-mq-sysfs.c     |   8 ++
>   block/blk-mq.c           | 195 ++++++++++++++++++++++++++++-------------------
>   block/blk-mq.h           |   2 +-
>   drivers/nvme/host/core.c |  10 ++-
>   include/linux/blk-mq.h   |   2 +
>   include/linux/blkdev.h   |   7 ++
>   7 files changed, 143 insertions(+), 104 deletions(-)
>
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>

We've been testing with the series and so far have been seeing success.

Tested-by:  James Smart   <james.smart@broadcom.com>



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-17 12:59       ` Ming Lei
@ 2019-04-22  3:30         ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2019-04-22  3:30 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Wed, Apr 17, 2019 at 08:59:43PM +0800, Ming Lei wrote:
> On Wed, Apr 17, 2019 at 02:08:59PM +0200, Hannes Reinecke wrote:
> > On 4/17/19 5:44 AM, Ming Lei wrote:
> > > In normal queue cleanup path, hctx is released after request queue
> > > is freed, see blk_mq_release().
> > > 
> > > However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
> > > of hw queues shrinking. This way is easy to cause use-after-free,
> > > because: one implicit rule is that it is safe to call almost all block
> > > layer APIs if the request queue is alive; and one hctx may be retrieved
> > > by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
> > > finally use-after-free is triggered.
> > > 
> > > Fixes this issue by always freeing hctx after releasing request queue.
> > > If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
> > > a per-queue list to hold them, then try to resuse these hctxs if numa
> > > node is matched.
> > > 
> > > Cc: Dongli Zhang <dongli.zhang@oracle.com>
> > > Cc: James Smart <james.smart@broadcom.com>
> > > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > > Cc: linux-scsi@vger.kernel.org,
> > > Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> > > Cc: Christoph Hellwig <hch@lst.de>,
> > > Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> > > Cc: jianchao wang <jianchao.w.wang@oracle.com>
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >   block/blk-mq.c         | 40 +++++++++++++++++++++++++++-------------
> > >   include/linux/blk-mq.h |  2 ++
> > >   include/linux/blkdev.h |  7 +++++++
> > >   3 files changed, 36 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index eeebba6ec0f7..2ca4395f794d 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -2274,6 +2274,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
> > >   		set->ops->exit_hctx(hctx, hctx_idx);
> > >   	blk_mq_remove_cpuhp(hctx);
> > > +
> > > +	spin_lock(&q->dead_hctx_lock);
> > > +	list_add(&hctx->hctx_list, &q->dead_hctx_list);
> > > +	spin_unlock(&q->dead_hctx_lock);
> > >   }
> > >   static void blk_mq_exit_hw_queues(struct request_queue *q,
> > > @@ -2675,15 +2679,13 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
> > >    */
> > >   void blk_mq_release(struct request_queue *q)
> > >   {
> > > -	struct blk_mq_hw_ctx *hctx;
> > > -	unsigned int i;
> > > +	struct blk_mq_hw_ctx *hctx, *next;
> > >   	cancel_delayed_work_sync(&q->requeue_work);
> > > -	/* hctx kobj stays in hctx */
> > > -	queue_for_each_hw_ctx(q, hctx, i) {
> > > -		if (!hctx)
> > > -			continue;
> > > +	/* all hctx are in .dead_hctx_list now */
> > > +	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
> > > +		list_del_init(&hctx->hctx_list);
> > >   		kobject_put(&hctx->kobj);
> > >   	}
> > > @@ -2750,9 +2752,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
> > >   		struct blk_mq_tag_set *set, struct request_queue *q,
> > >   		int hctx_idx, int node)
> > >   {
> > > -	struct blk_mq_hw_ctx *hctx;
> > > +	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
> > > +
> > > +	/* reuse dead hctx first */
> > > +	spin_lock(&q->dead_hctx_lock);
> > > +	list_for_each_entry(tmp, &q->dead_hctx_list, hctx_list) {
> > > +		if (tmp->numa_node == node) {
> > > +			hctx = tmp;
> > > +			break;
> > > +		}
> > > +	}
> > > +	if (hctx)
> > > +		list_del_init(&hctx->hctx_list);
> > > +	spin_unlock(&q->dead_hctx_lock);
> > > -	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
> > > +	if (!hctx)
> > > +		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
> > >   	if (!hctx)
> > >   		goto fail;
> > > @@ -2790,10 +2805,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> > >   		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
> > >   		if (hctx) {
> > > -			if (hctxs[i]) {
> > > +			if (hctxs[i])
> > >   				blk_mq_exit_hctx(q, set, hctxs[i], i);
> > > -				kobject_put(&hctxs[i]->kobj);
> > > -			}
> > >   			hctxs[i] = hctx;
> > >   		} else {
> > >   			if (hctxs[i])
> > > @@ -2824,9 +2837,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> > >   			if (hctx->tags)
> > >   				blk_mq_free_map_and_requests(set, j);
> > >   			blk_mq_exit_hctx(q, set, hctx, j);
> > > -			kobject_put(&hctx->kobj);
> > >   			hctxs[j] = NULL;
> > > -
> > >   		}
> > >   	}
> > >   	mutex_unlock(&q->sysfs_lock);
> > > @@ -2869,6 +2880,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> > >   	if (!q->queue_hw_ctx)
> > >   		goto err_sys_init;
> > > +	INIT_LIST_HEAD(&q->dead_hctx_list);
> > > +	spin_lock_init(&q->dead_hctx_lock);
> > > +
> > >   	blk_mq_realloc_hw_ctxs(set, q);
> > >   	if (!q->nr_hw_queues)
> > >   		goto err_hctxs;
> > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > > index db29928de467..15d1aa53d96c 100644
> > > --- a/include/linux/blk-mq.h
> > > +++ b/include/linux/blk-mq.h
> > > @@ -70,6 +70,8 @@ struct blk_mq_hw_ctx {
> > >   	struct dentry		*sched_debugfs_dir;
> > >   #endif
> > > +	struct list_head	hctx_list;
> > > +
> > >   	/* Must be the last member - see also blk_mq_hw_ctx_size(). */
> > >   	struct srcu_struct	srcu[0];
> > >   };
> > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > > index 4b85dc066264..1325f941f0be 100644
> > > --- a/include/linux/blkdev.h
> > > +++ b/include/linux/blkdev.h
> > > @@ -535,6 +535,13 @@ struct request_queue {
> > >   	struct mutex		sysfs_lock;
> > > +	/*
> > > +	 * for reusing dead hctx instance in case of updating
> > > +	 * nr_hw_queues
> > > +	 */
> > > +	struct list_head	dead_hctx_list;
> > > +	spinlock_t		dead_hctx_lock;
> > > +
> > >   	atomic_t		mq_freeze_depth;
> > >   #if defined(CONFIG_BLK_DEV_BSG)
> > > 
> > Hmm.
> > I don't particularly like this approach.
> > The much saner approach would be to avoid having I/O in-flight in the first
> > place by setting the queue to something else than live, no?
> 
> There isn't any in-flight IO since blk_mq_freeze_queue() is returned
> from __blk_mq_update_nr_hw_queues().
> 
> However, there can be reference to hctx in other code paths via
> arbitrary block layer APIs.
> 
> Quiesce can't avoid the use-after-free too given RCU read lock may
> not be held in lots of reference to hctx, such as queue_for_each_hw_ctx().
> 
> So this patch might be the simplest approach to fix this issue, IMO.
> 
> However, I am open to any solution for this issue.

Hi Hannes,

Could you please let us know if you have a better idea for this issue?
Otherwise, I think we need to move on, since it is a real issue and users
want it fixed.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-22  3:30         ` Ming Lei
@ 2019-04-23 11:19           ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2019-04-23 11:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On 4/22/19 5:30 AM, Ming Lei wrote:
> On Wed, Apr 17, 2019 at 08:59:43PM +0800, Ming Lei wrote:
>> On Wed, Apr 17, 2019 at 02:08:59PM +0200, Hannes Reinecke wrote:
>>> On 4/17/19 5:44 AM, Ming Lei wrote:
>>>> In normal queue cleanup path, hctx is released after request queue
>>>> is freed, see blk_mq_release().
>>>>
>>>> However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
>>>> of hw queues shrinking. This way is easy to cause use-after-free,
>>>> because: one implicit rule is that it is safe to call almost all block
>>>> layer APIs if the request queue is alive; and one hctx may be retrieved
>>>> by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
>>>> finally use-after-free is triggered.
>>>>
>>>> Fixes this issue by always freeing hctx after releasing request queue.
>>>> If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
>>>> a per-queue list to hold them, then try to resuse these hctxs if numa
>>>> node is matched.
>>>>
>>>> Cc: Dongli Zhang <dongli.zhang@oracle.com>
>>>> Cc: James Smart <james.smart@broadcom.com>
>>>> Cc: Bart Van Assche <bart.vanassche@wdc.com>
>>>> Cc: linux-scsi@vger.kernel.org,
>>>> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
>>>> Cc: Christoph Hellwig <hch@lst.de>,
>>>> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
>>>> Cc: jianchao wang <jianchao.w.wang@oracle.com>
>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>> ---
>>>>    block/blk-mq.c         | 40 +++++++++++++++++++++++++++-------------
>>>>    include/linux/blk-mq.h |  2 ++
>>>>    include/linux/blkdev.h |  7 +++++++
>>>>    3 files changed, 36 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index eeebba6ec0f7..2ca4395f794d 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -2274,6 +2274,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>>>>    		set->ops->exit_hctx(hctx, hctx_idx);
>>>>    	blk_mq_remove_cpuhp(hctx);
>>>> +
>>>> +	spin_lock(&q->dead_hctx_lock);
>>>> +	list_add(&hctx->hctx_list, &q->dead_hctx_list);
>>>> +	spin_unlock(&q->dead_hctx_lock);
>>>>    }
>>>>    static void blk_mq_exit_hw_queues(struct request_queue *q,
>>>> @@ -2675,15 +2679,13 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
>>>>     */
>>>>    void blk_mq_release(struct request_queue *q)
>>>>    {
>>>> -	struct blk_mq_hw_ctx *hctx;
>>>> -	unsigned int i;
>>>> +	struct blk_mq_hw_ctx *hctx, *next;
>>>>    	cancel_delayed_work_sync(&q->requeue_work);
>>>> -	/* hctx kobj stays in hctx */
>>>> -	queue_for_each_hw_ctx(q, hctx, i) {
>>>> -		if (!hctx)
>>>> -			continue;
>>>> +	/* all hctx are in .dead_hctx_list now */
>>>> +	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
>>>> +		list_del_init(&hctx->hctx_list);
>>>>    		kobject_put(&hctx->kobj);
>>>>    	}
>>>> @@ -2750,9 +2752,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
>>>>    		struct blk_mq_tag_set *set, struct request_queue *q,
>>>>    		int hctx_idx, int node)
>>>>    {
>>>> -	struct blk_mq_hw_ctx *hctx;
>>>> +	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
>>>> +
>>>> +	/* reuse dead hctx first */
>>>> +	spin_lock(&q->dead_hctx_lock);
>>>> +	list_for_each_entry(tmp, &q->dead_hctx_list, hctx_list) {
>>>> +		if (tmp->numa_node == node) {
>>>> +			hctx = tmp;
>>>> +			break;
>>>> +		}
>>>> +	}
>>>> +	if (hctx)
>>>> +		list_del_init(&hctx->hctx_list);
>>>> +	spin_unlock(&q->dead_hctx_lock);
>>>> -	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
>>>> +	if (!hctx)
>>>> +		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
>>>>    	if (!hctx)
>>>>    		goto fail;
>>>> @@ -2790,10 +2805,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>>>>    		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
>>>>    		if (hctx) {
>>>> -			if (hctxs[i]) {
>>>> +			if (hctxs[i])
>>>>    				blk_mq_exit_hctx(q, set, hctxs[i], i);
>>>> -				kobject_put(&hctxs[i]->kobj);
>>>> -			}
>>>>    			hctxs[i] = hctx;
>>>>    		} else {
>>>>    			if (hctxs[i])
>>>> @@ -2824,9 +2837,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>>>>    			if (hctx->tags)
>>>>    				blk_mq_free_map_and_requests(set, j);
>>>>    			blk_mq_exit_hctx(q, set, hctx, j);
>>>> -			kobject_put(&hctx->kobj);
>>>>    			hctxs[j] = NULL;
>>>> -
>>>>    		}
>>>>    	}
>>>>    	mutex_unlock(&q->sysfs_lock);
>>>> @@ -2869,6 +2880,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>>>>    	if (!q->queue_hw_ctx)
>>>>    		goto err_sys_init;
>>>> +	INIT_LIST_HEAD(&q->dead_hctx_list);
>>>> +	spin_lock_init(&q->dead_hctx_lock);
>>>> +
>>>>    	blk_mq_realloc_hw_ctxs(set, q);
>>>>    	if (!q->nr_hw_queues)
>>>>    		goto err_hctxs;
>>>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>>>> index db29928de467..15d1aa53d96c 100644
>>>> --- a/include/linux/blk-mq.h
>>>> +++ b/include/linux/blk-mq.h
>>>> @@ -70,6 +70,8 @@ struct blk_mq_hw_ctx {
>>>>    	struct dentry		*sched_debugfs_dir;
>>>>    #endif
>>>> +	struct list_head	hctx_list;
>>>> +
>>>>    	/* Must be the last member - see also blk_mq_hw_ctx_size(). */
>>>>    	struct srcu_struct	srcu[0];
>>>>    };
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index 4b85dc066264..1325f941f0be 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -535,6 +535,13 @@ struct request_queue {
>>>>    	struct mutex		sysfs_lock;
>>>> +	/*
>>>> +	 * for reusing dead hctx instance in case of updating
>>>> +	 * nr_hw_queues
>>>> +	 */
>>>> +	struct list_head	dead_hctx_list;
>>>> +	spinlock_t		dead_hctx_lock;
>>>> +
>>>>    	atomic_t		mq_freeze_depth;
>>>>    #if defined(CONFIG_BLK_DEV_BSG)
>>>>
>>> Hmm.
>>> I don't particularly like this approach.
>>> The much saner approach would be to avoid having I/O in-flight in the first
>>> place by setting the queue to something else than live, no?
>>
>> There isn't any in-flight IO since blk_mq_freeze_queue() is returned
>> from __blk_mq_update_nr_hw_queues().
>>
>> However, there can be reference to hctx in other code paths via
>> arbitrary block layer APIs.
>>
>> Quiesce can't avoid the use-after-free too given RCU read lock may
>> not be held in lots of reference to hctx, such as queue_for_each_hw_ctx().
>>
>> So this patch might be the simplest approach to fix this issue, IMO.
>>
>> However, I am open to any solution for this issue.
> 
> Hi Hannes,
> 
> Could you please let us know if you have better idea for this issue?
> Otherwise, I think we need to move on since it is real issue, and users
> want to fix that.
> 
Okay. Having looked over the problem and possible alternatives, it does
indeed look like a viable solution.
I do agree that it's a sensible design to have an additional holding
area for hardware context elements, given that they might be reassigned
during blk_mq_realloc_hw_ctxs().

However, I would rename the 'dead' elements to 'unused' (i.e.
unused_hctx_list etc.).

And I would deallocate q->queue_hw_ctx in blk_mq_exit_queue() to make
things more consistent.
The problem with the current patch is that in blk_mq_release() we iterate
the 'dead_hctx_list' and free up everything in there, but then blindly
call 'kfree(q->queue_hw_ctx)' without checking if there are any context
pointers left.
When moving the kfree() call to blk_mq_exit_queue() we could do a simple
WARN_ON(q->queue_hw_ctx) in blk_mq_release() to ensure that everything
got deallocated properly.
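
In rough form, something like the sketch below -- only an abbreviated
sketch that reuses the existing static helpers in block/blk-mq.c, not a
worked-out patch:

/* sketch only: hand q->queue_hw_ctx back during queue teardown ... */
void blk_mq_exit_queue(struct request_queue *q)
{
	struct blk_mq_tag_set *set = q->tag_set;

	blk_mq_del_queue_tag_set(q);
	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);

	kfree(q->queue_hw_ctx);
	q->queue_hw_ctx = NULL;
}

/* ... so that the release handler only checks nothing was left behind */
void blk_mq_release(struct request_queue *q)
{
	cancel_delayed_work_sync(&q->requeue_work);

	WARN_ON(q->queue_hw_ctx);
}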

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-23 11:19           ` Hannes Reinecke
@ 2019-04-23 13:30             ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2019-04-23 13:30 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Hi Hannes,

Thanks for your response.

On Tue, Apr 23, 2019 at 01:19:33PM +0200, Hannes Reinecke wrote:
> On 4/22/19 5:30 AM, Ming Lei wrote:
> > On Wed, Apr 17, 2019 at 08:59:43PM +0800, Ming Lei wrote:
> > > On Wed, Apr 17, 2019 at 02:08:59PM +0200, Hannes Reinecke wrote:
> > > > On 4/17/19 5:44 AM, Ming Lei wrote:
> > > > > In normal queue cleanup path, hctx is released after request queue
> > > > > is freed, see blk_mq_release().
> > > > > 
> > > > > However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
> > > > > of hw queues shrinking. This way is easy to cause use-after-free,
> > > > > because: one implicit rule is that it is safe to call almost all block
> > > > > layer APIs if the request queue is alive; and one hctx may be retrieved
> > > > > by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
> > > > > finally use-after-free is triggered.
> > > > > 
> > > > > Fixes this issue by always freeing hctx after releasing request queue.
> > > > > If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
> > > > > a per-queue list to hold them, then try to resuse these hctxs if numa
> > > > > node is matched.
> > > > > 
> > > > > Cc: Dongli Zhang <dongli.zhang@oracle.com>
> > > > > Cc: James Smart <james.smart@broadcom.com>
> > > > > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > > > > Cc: linux-scsi@vger.kernel.org,
> > > > > Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> > > > > Cc: Christoph Hellwig <hch@lst.de>,
> > > > > Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> > > > > Cc: jianchao wang <jianchao.w.wang@oracle.com>
> > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > ---
> > > > >    block/blk-mq.c         | 40 +++++++++++++++++++++++++++-------------
> > > > >    include/linux/blk-mq.h |  2 ++
> > > > >    include/linux/blkdev.h |  7 +++++++
> > > > >    3 files changed, 36 insertions(+), 13 deletions(-)
> > > > > 
> > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > index eeebba6ec0f7..2ca4395f794d 100644
> > > > > --- a/block/blk-mq.c
> > > > > +++ b/block/blk-mq.c
> > > > > @@ -2274,6 +2274,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
> > > > >    		set->ops->exit_hctx(hctx, hctx_idx);
> > > > >    	blk_mq_remove_cpuhp(hctx);
> > > > > +
> > > > > +	spin_lock(&q->dead_hctx_lock);
> > > > > +	list_add(&hctx->hctx_list, &q->dead_hctx_list);
> > > > > +	spin_unlock(&q->dead_hctx_lock);
> > > > >    }
> > > > >    static void blk_mq_exit_hw_queues(struct request_queue *q,
> > > > > @@ -2675,15 +2679,13 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
> > > > >     */
> > > > >    void blk_mq_release(struct request_queue *q)
> > > > >    {
> > > > > -	struct blk_mq_hw_ctx *hctx;
> > > > > -	unsigned int i;
> > > > > +	struct blk_mq_hw_ctx *hctx, *next;
> > > > >    	cancel_delayed_work_sync(&q->requeue_work);
> > > > > -	/* hctx kobj stays in hctx */
> > > > > -	queue_for_each_hw_ctx(q, hctx, i) {
> > > > > -		if (!hctx)
> > > > > -			continue;
> > > > > +	/* all hctx are in .dead_hctx_list now */
> > > > > +	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
> > > > > +		list_del_init(&hctx->hctx_list);
> > > > >    		kobject_put(&hctx->kobj);
> > > > >    	}
> > > > > @@ -2750,9 +2752,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
> > > > >    		struct blk_mq_tag_set *set, struct request_queue *q,
> > > > >    		int hctx_idx, int node)
> > > > >    {
> > > > > -	struct blk_mq_hw_ctx *hctx;
> > > > > +	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
> > > > > +
> > > > > +	/* reuse dead hctx first */
> > > > > +	spin_lock(&q->dead_hctx_lock);
> > > > > +	list_for_each_entry(tmp, &q->dead_hctx_list, hctx_list) {
> > > > > +		if (tmp->numa_node == node) {
> > > > > +			hctx = tmp;
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > > > +	if (hctx)
> > > > > +		list_del_init(&hctx->hctx_list);
> > > > > +	spin_unlock(&q->dead_hctx_lock);
> > > > > -	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
> > > > > +	if (!hctx)
> > > > > +		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
> > > > >    	if (!hctx)
> > > > >    		goto fail;
> > > > > @@ -2790,10 +2805,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> > > > >    		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
> > > > >    		if (hctx) {
> > > > > -			if (hctxs[i]) {
> > > > > +			if (hctxs[i])
> > > > >    				blk_mq_exit_hctx(q, set, hctxs[i], i);
> > > > > -				kobject_put(&hctxs[i]->kobj);
> > > > > -			}
> > > > >    			hctxs[i] = hctx;
> > > > >    		} else {
> > > > >    			if (hctxs[i])
> > > > > @@ -2824,9 +2837,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
> > > > >    			if (hctx->tags)
> > > > >    				blk_mq_free_map_and_requests(set, j);
> > > > >    			blk_mq_exit_hctx(q, set, hctx, j);
> > > > > -			kobject_put(&hctx->kobj);
> > > > >    			hctxs[j] = NULL;
> > > > > -
> > > > >    		}
> > > > >    	}
> > > > >    	mutex_unlock(&q->sysfs_lock);
> > > > > @@ -2869,6 +2880,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
> > > > >    	if (!q->queue_hw_ctx)
> > > > >    		goto err_sys_init;
> > > > > +	INIT_LIST_HEAD(&q->dead_hctx_list);
> > > > > +	spin_lock_init(&q->dead_hctx_lock);
> > > > > +
> > > > >    	blk_mq_realloc_hw_ctxs(set, q);
> > > > >    	if (!q->nr_hw_queues)
> > > > >    		goto err_hctxs;
> > > > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > > > > index db29928de467..15d1aa53d96c 100644
> > > > > --- a/include/linux/blk-mq.h
> > > > > +++ b/include/linux/blk-mq.h
> > > > > @@ -70,6 +70,8 @@ struct blk_mq_hw_ctx {
> > > > >    	struct dentry		*sched_debugfs_dir;
> > > > >    #endif
> > > > > +	struct list_head	hctx_list;
> > > > > +
> > > > >    	/* Must be the last member - see also blk_mq_hw_ctx_size(). */
> > > > >    	struct srcu_struct	srcu[0];
> > > > >    };
> > > > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > > > > index 4b85dc066264..1325f941f0be 100644
> > > > > --- a/include/linux/blkdev.h
> > > > > +++ b/include/linux/blkdev.h
> > > > > @@ -535,6 +535,13 @@ struct request_queue {
> > > > >    	struct mutex		sysfs_lock;
> > > > > +	/*
> > > > > +	 * for reusing dead hctx instance in case of updating
> > > > > +	 * nr_hw_queues
> > > > > +	 */
> > > > > +	struct list_head	dead_hctx_list;
> > > > > +	spinlock_t		dead_hctx_lock;
> > > > > +
> > > > >    	atomic_t		mq_freeze_depth;
> > > > >    #if defined(CONFIG_BLK_DEV_BSG)
> > > > > 
> > > > Hmm.
> > > > I don't particularly like this approach.
> > > > The much saner approach would be to avoid having I/O in-flight in the first
> > > > place by setting the queue to something else than live, no?
> > > 
> > > There isn't any in-flight IO since blk_mq_freeze_queue() is returned
> > > from __blk_mq_update_nr_hw_queues().
> > > 
> > > However, there can be reference to hctx in other code paths via
> > > arbitrary block layer APIs.
> > > 
> > > Quiesce can't avoid the use-after-free too given RCU read lock may
> > > not be held in lots of reference to hctx, such as queue_for_each_hw_ctx().
> > > 
> > > So this patch might be the simplest approach to fix this issue, IMO.
> > > 
> > > However, I am open to any solution for this issue.
> > 
> > Hi Hannes,
> > 
> > Could you please let us know if you have better idea for this issue?
> > Otherwise, I think we need to move on since it is real issue, and users
> > want to fix that.
> > 
> Okay. Having looked over the problem and possible alternatives, it looks
> indeed like a viable solution.
> I do agree that it's a sensible design to have an additional holding area
> for hardware context elements, given that they might be reassigned during
> blk_mq_realloc_hw_ctxs().
> 
> However, I would rename the 'dead' elements to 'unused' (ie unused_hctx_list
> etc).

OK, the name 'unused' does look better.

> 
> And I would deallocate q->queue_hw_ctx in blk_mq_exit_queue() to make things
> more consistent.

No, that is wrong.

The request queue's refcount is often still held while blk_cleanup_queue() is
running, and blk_mq_exit_queue() is called directly from blk_cleanup_queue().
One invariant is that most block layer APIs have to keep working as long as
the request queue is live from the kobject's point of view. If q->queue_hw_ctx
were freed in blk_mq_exit_queue(), it would be quite easy to cause a
use-after-free.
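
The pattern to worry about looks roughly like this hypothetical reference
holder -- poke_hw_queues() is made up, not a specific in-tree path:

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/*
 * Hypothetical reference holder: the queue kobject is still alive because
 * we own a reference, so a call like this must keep working even if
 * blk_cleanup_queue() has already completed on another path.  That only
 * works if q->queue_hw_ctx stays around until blk_mq_release().
 */
static void poke_hw_queues(struct request_queue *q)
{
	if (!blk_get_queue(q))		/* queue is already dying */
		return;

	blk_mq_run_hw_queues(q, false);	/* walks q->queue_hw_ctx internally */

	blk_put_queue(q);		/* may drop the final reference */
}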

> Problem with the current patch is that in blk_mq_release() we iterate
> the 'dead_hctx_list' and free up everything in there, but then blindly call
> 'kfree(q->queue_hw_ctx)' without checking if there are any context pointers
> left.

If the request queue is dead, it is safe to assume that there isn't any
reference left to the request queue or to q->queue_hw_ctx. Otherwise, it
must be a bug somewhere.

> When moving the call to blk_mq_exit_queue() we could to a simple
> WARN_ON(q->queue_hw_ctx) in blk_mq_release() to ensure that everything got
> deallocated properly.

At that time, a hctx instance might still be active, but that is fine given
each hctx is covered by its own kobject. What we need to do is make sure that
no references to q->queue_hw_ctx or to the request queue remain.
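
"Covered by its own kobject" refers to the hctx release handler that this
series builds up (patches 3/9 and 7/9) in block/blk-mq-sysfs.c; schematically
it ends up looking like the abbreviated sketch below, not the exact final
code:

/*
 * Abbreviated sketch of the hctx kobject release handler after the series:
 * the hctx memory and its hw queue resources are only freed once the last
 * kobject reference goes away, which may be after the request queue itself
 * has been released.
 */
static void blk_mq_hw_sysfs_release(struct kobject *kobj)
{
	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
						   kobj);

	/* patch 7/9: make sure a late run_work can't touch a freed hctx */
	cancel_delayed_work_sync(&hctx->run_work);

	/* patch 3/9: hw queue resources are freed here, not at exit time */
	free_cpumask_var(hctx->cpumask);
	kfree(hctx->ctxs);
	kfree(hctx);
}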

Thanks,
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-23 13:30             ` Ming Lei
@ 2019-04-23 14:07               ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2019-04-23 14:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On 4/23/19 3:30 PM, Ming Lei wrote:
> Hi Hannes,
> 
> Thanks for your response.
> 
> On Tue, Apr 23, 2019 at 01:19:33PM +0200, Hannes Reinecke wrote:
>> On 4/22/19 5:30 AM, Ming Lei wrote:
[ .. ]
>>>
>>> Hi Hannes,
>>>
>>> Could you please let us know if you have better idea for this issue?
>>> Otherwise, I think we need to move on since it is real issue, and users
>>> want to fix that.
>>>
>> Okay. Having looked over the problem and possible alternatives, it looks
>> indeed like a viable solution.
>> I do agree that it's a sensible design to have an additional holding area
>> for hardware context elements, given that they might be reassigned during
>> blk_mq_realloc_hw_ctxs().
>>
>> However, I would rename the 'dead' elements to 'unused' (ie unused_hctx_list
>> etc).
> 
> OK, looks the name of 'unused' is better.
> 
>>
>> And I would deallocate q->queue_hw_ctx in blk_mq_exit_queue() to make things
>> more consistent.
> 
> No, that is wrong.
> 
> The request queue's refcount is often held when blk_cleanup_queue() is running,
> and blk_mq_exit_queue() is called directly by blk_cleanup_queue(). One invariant
> is that we have to allow most APIs running well if the request queue is live
> from kobject view. If q->queue_hw_ctx is freed in blk_mq_exit_queue(),
> it is quite easy to cause use-after-free.
> 
Ah. Thought as much.
But then in most cases the ->queue_hw_ctx pointer is immaterial as we're 
accessing things via the hctx pointer, which remains valid.

>> Problem with the current patch is that in blk_mq_release() we iterate
>> the 'dead_hctx_list' and free up everything in there, but then blindly call
>> 'kfree(q->queue_hw_ctx)' without checking if there are any context pointers
>> left.
> 
> If request queue is dead, it is safe to assume that there isn't any
> reference to request queue and q->queue_hw_ctx. Otherwise, it must be
> a bug somewhere.
> 
Precisely.
What I'm trying to achieve with this is to protect against such issues, 
which are quite easy to introduce given the complexity of the code ...

>> When moving the call to blk_mq_exit_queue() we could to a simple
>> WARN_ON(q->queue_hw_ctx) in blk_mq_release() to ensure that everything got
>> deallocated properly.
> 
> At that time, hctx instance might be active, but that is fine given hctx
> is covered by its own kobject. What we need to do is to make sure that no
> any references to q->queue_hw_ctx and the request queue.
> 
My point being here:
void blk_mq_release(struct request_queue *q)
{
         struct blk_mq_hw_ctx *hctx, *next;

         cancel_delayed_work_sync(&q->requeue_work);

         /* all hctx are in .dead_hctx_list now */
         list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
                 list_del_init(&hctx->hctx_list);
                 kobject_put(&hctx->kobj);
         }

         kfree(q->queue_hw_ctx);

         /*
          * release .mq_kobj and sw queue's kobject now because
          * both share lifetime with request queue.
          */
         blk_mq_sysfs_deinit(q);
}

This assumes that _all_ hctx pointers are being removed from
q->queue_hw_ctx, and are moved to the 'dead' list.
If for some reason this is not the case, we'll be leaking hctx pointers here.
And as the reference counting _really_ is tricky, it would be a good idea
to have some safety checking here (i.e. that all hctx pointers from
->queue_hw_ctx are indeed NULL) before calling kfree().

This is what I was attempting by moving the kfree() to 
blk_mq_exit_queue(), as then we could validate that we've moved all hctx 
pointers to the dead list by simply checking if ->queue_hw_ctx is NULL.

But just blindly calling 'kfree()' here is dangerous IMO.
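
Roughly what is being proposed here, as a hypothetical sketch rather than a
concrete patch:

	/* in blk_mq_exit_queue(), hypothetically: */
	kfree(q->queue_hw_ctx);
	q->queue_hw_ctx = NULL;

	/* ... and later in blk_mq_release(): */
	WARN_ON(q->queue_hw_ctx);	/* everything should be gone by now */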

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-23 14:07               ` Hannes Reinecke
@ 2019-04-24  1:12                 ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2019-04-24  1:12 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Tue, Apr 23, 2019 at 04:07:49PM +0200, Hannes Reinecke wrote:
> On 4/23/19 3:30 PM, Ming Lei wrote:
> > Hi Hannes,
> > 
> > Thanks for your response.
> > 
> > On Tue, Apr 23, 2019 at 01:19:33PM +0200, Hannes Reinecke wrote:
> > > On 4/22/19 5:30 AM, Ming Lei wrote:
> [ .. ]
> > > > 
> > > > Hi Hannes,
> > > > 
> > > > Could you please let us know if you have better idea for this issue?
> > > > Otherwise, I think we need to move on since it is real issue, and users
> > > > want to fix that.
> > > > 
> > > Okay. Having looked over the problem and possible alternatives, it looks
> > > indeed like a viable solution.
> > > I do agree that it's a sensible design to have an additional holding area
> > > for hardware context elements, given that they might be reassigned during
> > > blk_mq_realloc_hw_ctxs().
> > > 
> > > However, I would rename the 'dead' elements to 'unused' (ie unused_hctx_list
> > > etc).
> > 
> > OK, looks the name of 'unused' is better.
> > 
> > > 
> > > And I would deallocate q->queue_hw_ctx in blk_mq_exit_queue() to make things
> > > more consistent.
> > 
> > No, that is wrong.
> > 
> > The request queue's refcount is often held when blk_cleanup_queue() is running,
> > and blk_mq_exit_queue() is called directly by blk_cleanup_queue(). One invariant
> > is that we have to allow most APIs running well if the request queue is live
> > from kobject view. If q->queue_hw_ctx is freed in blk_mq_exit_queue(),
> > it is quite easy to cause use-after-free.
> > 
> Ah. Thought as much.
> But then in most cases the ->queue_hw_ctx pointer is immaterial as we're
> accessing things via the hctx pointer, which remains valid.
> 
> > > Problem with the current patch is that in blk_mq_release() we iterate
> > > the 'dead_hctx_list' and free up everything in there, but then blindly call
> > > 'kfree(q->queue_hw_ctx)' without checking if there are any context pointers
> > > left.
> > 
> > If request queue is dead, it is safe to assume that there isn't any
> > reference to request queue and q->queue_hw_ctx. Otherwise, it must be
> > a bug somewhere.
> > 
> Precisely.
> What I'm trying to achieve with this is to protect against such issues,
> which are quite easy to introduce given the complexity of the code ...

But releasing q->queue_hw_ctx in blk_cleanup_queue() can easily cause a
use-after-free even though the request queue's refcount is held. We simply
can't do that.

If someone is still trying to use q->queue_hw_ctx[] after the request
queue is dead, the bug is in the caller of the block layer API, not in
the block layer.

What the patchset is trying to fix is the race inside the block layer, not
in users of the block layer or in drivers. So far, I haven't seen such a
driver issue.

Just think of q->queue_hw_ctx as one of the request queue's resources, and
you will see it is pretty reasonable to free q->queue_hw_ctx in the queue's
release handler.

> 
> > > When moving the call to blk_mq_exit_queue() we could to a simple
> > > WARN_ON(q->queue_hw_ctx) in blk_mq_release() to ensure that everything got
> > > deallocated properly.
> > 
> > At that time, hctx instance might be active, but that is fine given hctx
> > is covered by its own kobject. What we need to do is to make sure that no
> > any references to q->queue_hw_ctx and the request queue.
> > 
> My point being here:
> void blk_mq_release(struct request_queue *q)
> {
>         struct blk_mq_hw_ctx *hctx, *next;
> 
>         cancel_delayed_work_sync(&q->requeue_work);
> 
>         /* all hctx are in .dead_hctx_list now */
>         list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list)
> {
>                 list_del_init(&hctx->hctx_list);
>                 kobject_put(&hctx->kobj);
>         }
> 
>         kfree(q->queue_hw_ctx);
> 
>         /*
>          * release .mq_kobj and sw queue's kobject now because
>          * both share lifetime with request queue.
>          */
>         blk_mq_sysfs_deinit(q);
> }
> 
> This assumes that _all_ hctx pointers are being removed from
> q->queue_hw_ctx, and are moved to the 'dead' list.
> If for some reason this is not the case we'll be leaking hctx pointers here.

IMO, there are no such reasons. When blk_mq_release() is called,
every hctx of this request queue has been "exited" via blk_mq_exit_hctx(),
either from blk_mq_update_nr_hw_queues() or blk_cleanup_queue().

If there are hctxs that were not moved to the 'dead' (or 'unused') list here,
it is simply a bug in blk-mq. However, I don't see such a case now.

> And as the reference counting _really_ is tricky it would be a good idea to
> have some safety checking here (ie if all hctx pointers from ->queue_hw_ctx
> are indeed NULL) before calling kfree().
> 
> This is what I was attempting by moving the kfree() to blk_mq_exit_queue(),
> as then we could validate that we've moved all hctx pointers to the dead
> list by simply checking if ->queue_hw_ctx is NULL.

As I mentioned, it will break the current code to either NULL the hctx
pointers or kfree() them early, given we allow APIs to run as long as the
queue's refcount is held.

> 
> But just blindly calling 'kfree()' here is dangerous IMO.

How can that be dangerous? As I mentioned, if the queue's refcount isn't
held, it is simply a bug for users to try to use 'q' or 'q->queue_hw_ctx'.
That is the basic use pattern of refcounting, isn't it?
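
That pattern, as a minimal hypothetical example (the caller is made up, the
get/put helpers are the real ones):

	static void example_queue_user(struct request_queue *q)
	{
		if (!blk_get_queue(q))		/* fails once the queue is dying */
			return;

		/* q and q->queue_hw_ctx stay valid throughout this window */

		blk_put_queue(q);		/* the final put releases the queue */
	}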


Thanks,
Ming

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-24  1:12                 ` Ming Lei
@ 2019-04-24  1:45                   ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2019-04-24  1:45 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Wed, Apr 24, 2019 at 09:12:42AM +0800, Ming Lei wrote:
> On Tue, Apr 23, 2019 at 04:07:49PM +0200, Hannes Reinecke wrote:
> > On 4/23/19 3:30 PM, Ming Lei wrote:
> > > Hi Hannes,
> > > 
> > > Thanks for your response.
> > > 
> > > On Tue, Apr 23, 2019 at 01:19:33PM +0200, Hannes Reinecke wrote:
> > > > On 4/22/19 5:30 AM, Ming Lei wrote:
> > [ .. ]
> > > > > 
> > > > > Hi Hannes,
> > > > > 
> > > > > Could you please let us know if you have better idea for this issue?
> > > > > Otherwise, I think we need to move on since it is real issue, and users
> > > > > want to fix that.
> > > > > 
> > > > Okay. Having looked over the problem and possible alternatives, it looks
> > > > indeed like a viable solution.
> > > > I do agree that it's a sensible design to have an additional holding area
> > > > for hardware context elements, given that they might be reassigned during
> > > > blk_mq_realloc_hw_ctxs().
> > > > 
> > > > However, I would rename the 'dead' elements to 'unused' (ie unused_hctx_list
> > > > etc).
> > > 
> > > OK, looks the name of 'unused' is better.
> > > 
> > > > 
> > > > And I would deallocate q->queue_hw_ctx in blk_mq_exit_queue() to make things
> > > > more consistent.
> > > 
> > > No, that is wrong.
> > > 
> > > The request queue's refcount is often held when blk_cleanup_queue() is running,
> > > and blk_mq_exit_queue() is called directly by blk_cleanup_queue(). One invariant
> > > is that we have to allow most APIs running well if the request queue is live
> > > from kobject view. If q->queue_hw_ctx is freed in blk_mq_exit_queue(),
> > > it is quite easy to cause use-after-free.
> > > 
> > Ah. Thought as much.
> > But then in most cases the ->queue_hw_ctx pointer is immaterial as we're
> > accessing things via the hctx pointer, which remains valid.
> > 
> > > > Problem with the current patch is that in blk_mq_release() we iterate
> > > > the 'dead_hctx_list' and free up everything in there, but then blindly call
> > > > 'kfree(q->queue_hw_ctx)' without checking if there are any context pointers
> > > > left.
> > > 
> > > If request queue is dead, it is safe to assume that there isn't any
> > > reference to request queue and q->queue_hw_ctx. Otherwise, it must be
> > > a bug somewhere.
> > > 
> > Precisely.
> > What I'm trying to achieve with this is to protect against such issues,
> > which are quite easy to introduce given the complexity of the code ...
> 
> But releasing q->queue_hw_ctx in blk_cleanup_queue() is easy to cause
> use-after-free even though the request queue's refcount is held. We can't
> do that simply.
> 
> If someone is still trying to use q->queue_hw_ctx[] after the request
> queue is dead, the bug is in the caller of block layer API, not in
> block layer.
> 
> What the patchset is trying to fix is the race in block layer, not
> users of block layer, not drivers. So far, I don't see such driver
> issue.
> 
> Just thought q->queue_hw_ctx as the request queue's resource, you will
> see it is pretty reasonable to free q->queue_hw_ctx in the queue's
> release handler.
> 
> > 
> > > > When moving the call to blk_mq_exit_queue() we could to a simple
> > > > WARN_ON(q->queue_hw_ctx) in blk_mq_release() to ensure that everything got
> > > > deallocated properly.
> > > 
> > > At that time, hctx instance might be active, but that is fine given hctx
> > > is covered by its own kobject. What we need to do is to make sure that no
> > > any references to q->queue_hw_ctx and the request queue.
> > > 
> > My point being here:
> > void blk_mq_release(struct request_queue *q)
> > {
> >         struct blk_mq_hw_ctx *hctx, *next;
> > 
> >         cancel_delayed_work_sync(&q->requeue_work);
> > 
> >         /* all hctx are in .dead_hctx_list now */
> >         list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list)
> > {
> >                 list_del_init(&hctx->hctx_list);
> >                 kobject_put(&hctx->kobj);
> >         }
> > 
> >         kfree(q->queue_hw_ctx);
> > 
> >         /*
> >          * release .mq_kobj and sw queue's kobject now because
> >          * both share lifetime with request queue.
> >          */
> >         blk_mq_sysfs_deinit(q);
> > }
> > 
> > This assumes that _all_ hctx pointers are being removed from
> > q->queue_hw_ctx, and are moved to the 'dead' list.
> > If for some reason this is not the case we'll be leaking hctx pointers here.
> 
> IMO, there aren't such some reasons. When blk_mq_release() is called,
> every hctx of this request queue has been "exited" via blk_mq_exit_hctx(),
> either from blk_mq_update_nr_hw_queues() or blk_cleanup_queue().
> 
> If there are hctxs not moved to the 'dead'(or 'unused') list here, it is
> simply a bug in blk-mq. However, I don't see such case now.

Or we can add the following check in blk_mq_release() to capture such an
(otherwise impossible) blk-mq bug:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2ca4395f794d..f0d08087b8f6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2366,6 +2366,8 @@ blk_mq_alloc_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
+	INIT_LIST_HEAD(&hctx->hctx_list);
+
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
@@ -2680,9 +2682,14 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
 void blk_mq_release(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx, *next;
+	int i;
 
 	cancel_delayed_work_sync(&q->requeue_work);
 
+	/* warn if live hctx is found, that shouldn't happen */
+	queue_for_each_hw_ctx(q, hctx, i)
+		WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
+
 	/* all hctx are in .dead_hctx_list now */
 	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
 		list_del_init(&hctx->hctx_list);
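
For clarity: the list_empty() test works because blk_mq_alloc_hctx() now does
INIT_LIST_HEAD(&hctx->hctx_list), while blk_mq_exit_hctx() links the hctx into
q->dead_hctx_list. Any hctx still reachable through q->queue_hw_ctx with an
empty hctx_list at this point is therefore one that was never exited, which is
exactly the "impossible" case the WARN_ON_ONCE() is meant to catch.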

Thanks,
Ming

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-24  1:45                   ` Ming Lei
@ 2019-04-24  5:55                     ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2019-04-24  5:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, Dongli Zhang, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On 4/24/19 3:45 AM, Ming Lei wrote:
> On Wed, Apr 24, 2019 at 09:12:42AM +0800, Ming Lei wrote:
>> On Tue, Apr 23, 2019 at 04:07:49PM +0200, Hannes Reinecke wrote:
>>> On 4/23/19 3:30 PM, Ming Lei wrote:
>>>> Hi Hannes,
>>>>
>>>> Thanks for your response.
>>>>
>>>> On Tue, Apr 23, 2019 at 01:19:33PM +0200, Hannes Reinecke wrote:
>>>>> On 4/22/19 5:30 AM, Ming Lei wrote:
>>> [ .. ]
>>>>>>
>>>>>> Hi Hannes,
>>>>>>
>>>>>> Could you please let us know if you have better idea for this issue?
>>>>>> Otherwise, I think we need to move on since it is real issue, and users
>>>>>> want to fix that.
>>>>>>
>>>>> Okay. Having looked over the problem and possible alternatives, it looks
>>>>> indeed like a viable solution.
>>>>> I do agree that it's a sensible design to have an additional holding area
>>>>> for hardware context elements, given that they might be reassigned during
>>>>> blk_mq_realloc_hw_ctxs().
>>>>>
>>>>> However, I would rename the 'dead' elements to 'unused' (ie unused_hctx_list
>>>>> etc).
>>>>
>>>> OK, looks the name of 'unused' is better.
>>>>
>>>>>
>>>>> And I would deallocate q->queue_hw_ctx in blk_mq_exit_queue() to make things
>>>>> more consistent.
>>>>
>>>> No, that is wrong.
>>>>
>>>> The request queue's refcount is often held when blk_cleanup_queue() is running,
>>>> and blk_mq_exit_queue() is called directly by blk_cleanup_queue(). One invariant
>>>> is that we have to allow most APIs running well if the request queue is live
>>>> from kobject view. If q->queue_hw_ctx is freed in blk_mq_exit_queue(),
>>>> it is quite easy to cause use-after-free.
>>>>
>>> Ah. Thought as much.
>>> But then in most cases the ->queue_hw_ctx pointer is immaterial as we're
>>> accessing things via the hctx pointer, which remains valid.
>>>
>>>>> Problem with the current patch is that in blk_mq_release() we iterate
>>>>> the 'dead_hctx_list' and free up everything in there, but then blindly call
>>>>> 'kfree(q->queue_hw_ctx)' without checking if there are any context pointers
>>>>> left.
>>>>
>>>> If request queue is dead, it is safe to assume that there isn't any
>>>> reference to request queue and q->queue_hw_ctx. Otherwise, it must be
>>>> a bug somewhere.
>>>>
>>> Precisely.
>>> What I'm trying to achieve with this is to protect against such issues,
>>> which are quite easy to introduce given the complexity of the code ...
>>
>> But releasing q->queue_hw_ctx in blk_cleanup_queue() is easy to cause
>> use-after-free even though the request queue's refcount is held. We can't
>> do that simply.
>>
>> If someone is still trying to use q->queue_hw_ctx[] after the request
>> queue is dead, the bug is in the caller of block layer API, not in
>> block layer.
>>
>> What the patchset is trying to fix is the race in block layer, not
>> users of block layer, not drivers. So far, I don't see such driver
>> issue.
>>
>> Just thought q->queue_hw_ctx as the request queue's resource, you will
>> see it is pretty reasonable to free q->queue_hw_ctx in the queue's
>> release handler.
>>
>>>
>>>>> When moving the call to blk_mq_exit_queue() we could to a simple
>>>>> WARN_ON(q->queue_hw_ctx) in blk_mq_release() to ensure that everything got
>>>>> deallocated properly.
>>>>
>>>> At that time, hctx instance might be active, but that is fine given hctx
>>>> is covered by its own kobject. What we need to do is to make sure that no
>>>> any references to q->queue_hw_ctx and the request queue.
>>>>
>>> My point being here:
>>> void blk_mq_release(struct request_queue *q)
>>> {
>>>          struct blk_mq_hw_ctx *hctx, *next;
>>>
>>>          cancel_delayed_work_sync(&q->requeue_work);
>>>
>>>          /* all hctx are in .dead_hctx_list now */
>>>          list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list)
>>> {
>>>                  list_del_init(&hctx->hctx_list);
>>>                  kobject_put(&hctx->kobj);
>>>          }
>>>
>>>          kfree(q->queue_hw_ctx);
>>>
>>>          /*
>>>           * release .mq_kobj and sw queue's kobject now because
>>>           * both share lifetime with request queue.
>>>           */
>>>          blk_mq_sysfs_deinit(q);
>>> }
>>>
>>> This assumes that _all_ hctx pointers are being removed from
>>> q->queue_hw_ctx, and are moved to the 'dead' list.
>>> If for some reason this is not the case we'll be leaking hctx pointers here.
>>
>> IMO, there aren't such some reasons. When blk_mq_release() is called,
>> every hctx of this request queue has been "exited" via blk_mq_exit_hctx(),
>> either from blk_mq_update_nr_hw_queues() or blk_cleanup_queue().
>>
>> If there are hctxs not moved to the 'dead'(or 'unused') list here, it is
>> simply a bug in blk-mq. However, I don't see such case now.
> 
> Or we can add the following check in blk_mq_release() for capturing
> the impossible blk-mq bug:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2ca4395f794d..f0d08087b8f6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2366,6 +2366,8 @@ blk_mq_alloc_hctx(struct request_queue *q,
>   	hctx->queue = q;
>   	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
>   
> +	INIT_LIST_HEAD(&hctx->hctx_list);
> +
>   	/*
>   	 * Allocate space for all possible cpus to avoid allocation at
>   	 * runtime
> @@ -2680,9 +2682,14 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
>   void blk_mq_release(struct request_queue *q)
>   {
>   	struct blk_mq_hw_ctx *hctx, *next;
> +	int i;
>   
>   	cancel_delayed_work_sync(&q->requeue_work);
>   
> +	/* warn if live hctx is found, that shouldn't happen */
> +	queue_for_each_hw_ctx(q, hctx, i)
> +		WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
> +
>   	/* all hctx are in .dead_hctx_list now */
>   	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
>   		list_del_init(&hctx->hctx_list);
> 
Precisely, that would have been my other choice for fixing it, but I didn't
do it that way as I tried to avoid the loop. But if you agree then I'm fine
with it, too.

Please add this hunk to the patch.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2019-04-24  5:55 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-17  3:44 [PATCH V6 0/9] blk-mq: fix races related with freeing queue Ming Lei
2019-04-17  3:44 ` Ming Lei
2019-04-17  3:44 ` [PATCH V6 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
2019-04-17  3:44   ` Ming Lei
2019-04-17  3:44 ` [PATCH V6 2/9] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
2019-04-17  3:44   ` Ming Lei
2019-04-17 12:00   ` Hannes Reinecke
2019-04-17 12:00     ` Hannes Reinecke
2019-04-17  3:44 ` [PATCH V6 3/9] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
2019-04-17  3:44   ` Ming Lei
2019-04-17 12:02   ` Hannes Reinecke
2019-04-17 12:02     ` Hannes Reinecke
2019-04-17  3:44 ` [PATCH V6 4/9] blk-mq: move all hctx alloction & initialization into __blk_mq_alloc_and_init_hctx Ming Lei
2019-04-17  3:44   ` Ming Lei
2019-04-17 12:03   ` Hannes Reinecke
2019-04-17 12:03     ` Hannes Reinecke
2019-04-17  3:44 ` [PATCH V6 5/9] blk-mq: split blk_mq_alloc_and_init_hctx into two parts Ming Lei
2019-04-17  3:44   ` Ming Lei
2019-04-17  3:44 ` [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed Ming Lei
2019-04-17  3:44   ` Ming Lei
2019-04-17 12:08   ` Hannes Reinecke
2019-04-17 12:08     ` Hannes Reinecke
2019-04-17 12:59     ` Ming Lei
2019-04-17 12:59       ` Ming Lei
2019-04-22  3:30       ` Ming Lei
2019-04-22  3:30         ` Ming Lei
2019-04-23 11:19         ` Hannes Reinecke
2019-04-23 11:19           ` Hannes Reinecke
2019-04-23 13:30           ` Ming Lei
2019-04-23 13:30             ` Ming Lei
2019-04-23 14:07             ` Hannes Reinecke
2019-04-23 14:07               ` Hannes Reinecke
2019-04-24  1:12               ` Ming Lei
2019-04-24  1:12                 ` Ming Lei
2019-04-24  1:45                 ` Ming Lei
2019-04-24  1:45                   ` Ming Lei
2019-04-24  5:55                   ` Hannes Reinecke
2019-04-24  5:55                     ` Hannes Reinecke
2019-04-17  3:44 ` [PATCH V6 7/9] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
2019-04-17  3:44   ` Ming Lei
2019-04-17  3:44 ` [PATCH V6 8/9] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
2019-04-17  3:44   ` Ming Lei
2019-04-17  3:44 ` [PATCH V6 9/9] nvme: hold request queue's refcount in ns's whole lifetime Ming Lei
2019-04-17  3:44   ` Ming Lei
2019-04-17 12:10   ` Hannes Reinecke
2019-04-17 12:10     ` Hannes Reinecke
2019-04-17 15:55   ` Keith Busch
2019-04-17 15:55     ` Keith Busch
2019-04-17 17:22 ` [PATCH V6 0/9] blk-mq: fix races related with freeing queue James Smart
2019-04-17 17:22   ` James Smart
