* [PATCH V7 0/9] blk-mq: fix races related with freeing queue
@ 2019-04-24 11:02 ` Ming Lei
  0 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-24 11:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, James Smart, Ming Lei, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Hi,

Since 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), running the queue
isn't allowed during queue cleanup even though the queue's refcount is
held.

This change has caused lots of kernel oopses triggered in the run-queue
path, and it turns out they aren't easy to fix one by one.

So move the freeing of hw queue resources into the hctx's release
handler, which fixes the above issue. Meanwhile, this approach is safe
because freeing hw queue resources doesn't require tags.

V3 covers more races.

V7:
	- add reviewed-by and tested-by tags
	- rename "dead_hctx" to "unused_hctx"
	- check if there are live hctxs in the queue's release handler
	- only patch 6 is modified

V6:
	- remove previous SCSI patch which will be routed via SCSI tree
	- add reviewed-by tag
	- fix one related NVMe scan vs reset race

V5:
	- refactor blk_mq_alloc_and_init_hctx()
	- fix race related to updating nr_hw_queues by always freeing hctx
	  after request queue is released

V4:
	- add patch for fixing potential use-after-free in blk_mq_update_nr_hw_queues
	- fix comment in the last patch

V3:
	- cancel q->requeue_work in queue's release handler
	- cancel hctx->run_work in hctx's release handler
	- add patch 1 for fixing race in plug code path
	- add the last patch to avoid grabbing SCSI's refcount in the IO path

V2:
	- move freeing of hw queue resources into hctx's release handler


Ming Lei (9):
  blk-mq: grab .q_usage_counter when queuing request from plug code path
  blk-mq: move cancel of requeue_work into blk_mq_release
  blk-mq: free hw queue's resource in hctx's release handler
  blk-mq: move all hctx allocation & initialization into
    __blk_mq_alloc_and_init_hctx
  blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  blk-mq: always free hctx after request queue is freed
  blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  block: don't drain in-progress dispatch in blk_cleanup_queue()
  nvme: hold request queue's refcount in ns's whole lifetime

 block/blk-core.c         |  23 +-----
 block/blk-mq-sysfs.c     |   8 ++
 block/blk-mq.c           | 199 ++++++++++++++++++++++++++++-------------------
 block/blk-mq.h           |   2 +-
 drivers/nvme/host/core.c |  10 ++-
 include/linux/blk-mq.h   |   2 +
 include/linux/blkdev.h   |   7 ++
 7 files changed, 148 insertions(+), 103 deletions(-)

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,

-- 
2.9.5


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH V7 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-24 11:02 ` Ming Lei
@ 2019-04-24 11:02   ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-24 11:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, James Smart, Ming Lei, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Just like aio/io_uring, we need to grab two refcounts for queuing one
request: one for submission and one for completion.

If the request isn't queued from the plug code path, the refcount
grabbed in generic_make_request() serves the submission side. In theory,
this refcount should have been released once the submission (the async
queue run) is done. blk_freeze_queue() and blk_sync_queue() work
together to avoid a race between queue cleanup and IO submission: async
queue-run activity is canceled because hctx->run_work is only scheduled
while the refcount is held, so it is fine not to hold the refcount when
the run-queue work function runs to dispatch IO.

However, if the request is staged on the plug list and finally queued
from the plug code path, the refcount on the submission side is actually
missing. Then we may start to run the queue after the queue has been
removed, because the queue's kobject refcount isn't guaranteed to be
grabbed in the plug-list flushing context, and a kernel oops is
triggered. See the following race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
                blk_mq_run_hw_queue

Because of concurrent queue runs, all requests inserted above may be
completed before the above blk_mq_run_hw_queue() is called, so the queue
can be freed during the above blk_mq_run_hw_queue().

Fix the issue by grabbing .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is safe
because the queue is certainly alive before inserting requests.
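
As an illustration only, here is a minimal userspace model of the idea,
not kernel code: mock_queue, queue_get(), queue_put() and flush_plug()
are invented names standing in for the request queue, the get/put on
.q_usage_counter and blk_mq_flush_plug_list(). It shows why the flush
path needs its own reference: all submission-time references may be
dropped by completions before the hw queue is run.

/*
 * Minimal userspace model of the fix -- not kernel code.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct mock_queue {
	atomic_int q_usage_counter;	/* models q->q_usage_counter */
	int freed;			/* set once the queue is torn down */
};

static void queue_get(struct mock_queue *q)
{
	atomic_fetch_add(&q->q_usage_counter, 1);
}

static void queue_put(struct mock_queue *q)
{
	/* last reference gone: queue teardown may proceed */
	if (atomic_fetch_sub(&q->q_usage_counter, 1) == 1) {
		q->freed = 1;
		printf("queue torn down\n");
	}
}

static void run_hw_queue(struct mock_queue *q)
{
	/* touching q here is only safe while a reference is held */
	printf("run hw queue, freed=%d\n", q->freed);
}

/* models blk_mq_flush_plug_list(): insert requests, then run the queue */
static void flush_plug(struct mock_queue *q, int nr_reqs)
{
	queue_get(q);			/* the fix: pin the queue */

	/*
	 * Completions may drop every submission-time reference before
	 * the hw queue is run; without the extra get above, the counter
	 * could already be zero here.
	 */
	for (int i = 0; i < nr_reqs; i++)
		queue_put(q);

	run_hw_queue(q);		/* still safe: we hold a reference */
	queue_put(q);
}

int main(void)
{
	struct mock_queue *q = calloc(1, sizeof(*q));
	int nr_reqs = 4;

	/* one reference per request queued from the plug list */
	atomic_store(&q->q_usage_counter, nr_reqs);
	flush_plug(q, nr_reqs);
	free(q);
	return 0;
}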

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index fc60ed7e940e..9df43ef4b541 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1736,9 +1736,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 		if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
 			if (this_hctx) {
 				trace_block_unplug(this_q, depth, !from_schedule);
+
+				percpu_ref_get(&this_q->q_usage_counter);
 				blk_mq_sched_insert_requests(this_hctx, this_ctx,
 								&rq_list,
 								from_schedule);
+				percpu_ref_put(&this_q->q_usage_counter);
 			}
 
 			this_q = rq->q;
@@ -1757,8 +1760,11 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	 */
 	if (this_hctx) {
 		trace_block_unplug(this_q, depth, !from_schedule);
+
+		percpu_ref_get(&this_q->q_usage_counter);
 		blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list,
 						from_schedule);
+		percpu_ref_put(&this_q->q_usage_counter);
 	}
 }
 
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH V7 2/9] blk-mq: move cancel of requeue_work into blk_mq_release
  2019-04-24 11:02 ` Ming Lei
@ 2019-04-24 11:02   ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-24 11:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, James Smart, Ming Lei, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

While the queue's kobject refcount is held, it is safe for the driver
to schedule a requeue. However, blk_mq_kick_requeue_list() may be called
after blk_sync_queue() is done because of concurrent requeue activity,
so the requeue work may not have completed when the queue is freed, and
a kernel oops is triggered.

So move the cancel of requeue_work into blk_mq_release() to avoid the
race between requeue and freeing the queue.
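
Below is a rough userspace sketch of that ordering, not the kernel
implementation; mock_queue, kick_requeue(), cancel_requeue_sync() and
queue_release() are made-up stand-ins for the request queue,
blk_mq_kick_requeue_list(), cancel_delayed_work_sync() and
blk_mq_release(). The point is that only the final release, which runs
after the last reference is dropped, can guarantee no requeue work is
still pending when the memory goes away.

/*
 * Toy userspace model -- not the kernel implementation.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct mock_queue {
	pthread_t worker;
	bool work_scheduled;
	int *requeue_list;		/* memory the requeue work touches */
};

static void *requeue_work_fn(void *arg)
{
	struct mock_queue *q = arg;

	q->requeue_list[0]++;		/* must not run on freed memory */
	return NULL;
}

/* models blk_mq_kick_requeue_list(): legal while a reference is held */
static void kick_requeue(struct mock_queue *q)
{
	if (!q->work_scheduled) {
		pthread_create(&q->worker, NULL, requeue_work_fn, q);
		q->work_scheduled = true;
	}
}

/* models cancel_delayed_work_sync(): wait for any scheduled work */
static void cancel_requeue_sync(struct mock_queue *q)
{
	if (q->work_scheduled) {
		pthread_join(q->worker, NULL);
		q->work_scheduled = false;
	}
}

/* models blk_mq_release(): the last point before the memory goes away */
static void queue_release(struct mock_queue *q)
{
	cancel_requeue_sync(q);
	free(q->requeue_list);
	free(q);
	printf("queue released with no requeue work pending\n");
}

int main(void)
{
	struct mock_queue *q = calloc(1, sizeof(*q));

	q->requeue_list = calloc(1, sizeof(*q->requeue_list));

	/* blk_sync_queue() may already have run at this point ... */
	kick_requeue(q);	/* ... yet a ref-holding driver still kicks */
	queue_release(q);	/* release waits for the work, then frees */
	return 0;
}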

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 1 -
 block/blk-mq.c   | 2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a55389ba8779..93dc588fabe2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -237,7 +237,6 @@ void blk_sync_queue(struct request_queue *q)
 		struct blk_mq_hw_ctx *hctx;
 		int i;
 
-		cancel_delayed_work_sync(&q->requeue_work);
 		queue_for_each_hw_ctx(q, hctx, i)
 			cancel_delayed_work_sync(&hctx->run_work);
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9df43ef4b541..3e321048b259 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2640,6 +2640,8 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
+	cancel_delayed_work_sync(&q->requeue_work);
+
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH V7 3/9] blk-mq: free hw queue's resource in hctx's release handler
  2019-04-24 11:02 ` Ming Lei
@ 2019-04-24 11:02   ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-24 11:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, James Smart, Ming Lei, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, stable

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.

However, that commit introduces another issue. Before 45a9c9d909b2, we
were allowed to run the queue during queue cleanup as long as the
queue's kobj refcount was held. After that commit, the queue can't be
run during queue cleanup, otherwise an oops can be triggered easily,
because some fields of the hctx are freed by blk_mq_free_queue() in
blk_cleanup_queue().

We have invented ways for addressing this kind of issue before, such as:

	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But these still can't cover all cases; recently James reported another
issue of this kind:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue can be quite hard to address with the previous approaches,
given that scsi_run_queue() may run requeues for other LUNs.

Fix the above issue by freeing the hctx's resources in its release
handler. This is safe because tags aren't needed for freeing such hctx
resources.

This approach follows the typical design pattern for a kobject's
release handler.
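
For illustration, here is a minimal userspace sketch of that pattern
with invented names, not the kernel code: teardown that still needs
tags happens in an early exit step, while plain memory is only freed
from the refcount's release handler, so a late user that still holds a
reference never touches freed memory.

/*
 * Minimal userspace sketch of the release-handler pattern.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct mock_hctx {
	atomic_int ref;		/* models the hctx kobject refcount */
	int *ctx_map;		/* models hctx->ctx_map / hctx->fq memory */
};

static void hctx_release(struct mock_hctx *h)
{
	/* needs no tags, so it is safe long after queue cleanup */
	free(h->ctx_map);
	free(h);
	printf("hctx memory released\n");
}

static void hctx_get(struct mock_hctx *h)
{
	atomic_fetch_add(&h->ref, 1);
}

static void hctx_put(struct mock_hctx *h)
{
	if (atomic_fetch_sub(&h->ref, 1) == 1)
		hctx_release(h);
}

static void hctx_exit(struct mock_hctx *h)
{
	/* work that still needs tags goes here; memory is left alone */
	printf("hctx exited, memory still valid\n");
}

int main(void)
{
	struct mock_hctx *h = calloc(1, sizeof(*h));

	h->ctx_map = calloc(8, sizeof(*h->ctx_map));
	atomic_store(&h->ref, 1);

	hctx_get(h);		/* a late run-queue path pins the hctx */
	hctx_exit(h);		/* queue cleanup runs */
	hctx_put(h);		/* cleanup drops its reference ... */

	h->ctx_map[0] = 1;	/* ... the pinned hctx is still usable */
	hctx_put(h);		/* only the last put frees the memory */
	return 0;
}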

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c     | 2 +-
 block/blk-mq-sysfs.c | 6 ++++++
 block/blk-mq.c       | 8 ++------
 block/blk-mq.h       | 2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 93dc588fabe2..2dd94b3e9ece 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	blk_exit_queue(q);
 
 	if (queue_is_mq(q))
-		blk_mq_free_queue(q);
+		blk_mq_exit_queue(q);
 
 	percpu_ref_exit(&q->q_usage_counter);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3f9c3f4ac44c..4040e62c3737 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -10,6 +10,7 @@
 #include <linux/smp.h>
 
 #include <linux/blk-mq.h>
+#include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
@@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 {
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
+
+	if (hctx->flags & BLK_MQ_F_BLOCKING)
+		cleanup_srcu_struct(hctx->srcu);
+	blk_free_flush_queue(hctx->fq);
+	sbitmap_free(&hctx->ctx_map);
 	free_cpumask_var(hctx->cpumask);
 	kfree(hctx->ctxs);
 	kfree(hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3e321048b259..862eb41f24f8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2273,12 +2273,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		cleanup_srcu_struct(hctx->srcu);
-
 	blk_mq_remove_cpuhp(hctx);
-	blk_free_flush_queue(hctx->fq);
-	sbitmap_free(&hctx->ctx_map);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2913,7 +2908,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 }
 EXPORT_SYMBOL(blk_mq_init_allocated_queue);
 
-void blk_mq_free_queue(struct request_queue *q)
+/* tags can _not_ be used after returning from blk_mq_exit_queue */
+void blk_mq_exit_queue(struct request_queue *q)
 {
 	struct blk_mq_tag_set	*set = q->tag_set;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 423ea88ab6fb..633a5a77ee8b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,7 +37,7 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-void blk_mq_free_queue(struct request_queue *q);
+void blk_mq_exit_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH V7 4/9] blk-mq: move all hctx allocation & initialization into __blk_mq_alloc_and_init_hctx
  2019-04-24 11:02 ` Ming Lei
@ 2019-04-24 11:02   ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-24 11:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, James Smart, Ming Lei, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Move all hctx-related allocation and initialization into
__blk_mq_alloc_and_init_hctx, preparing to split
blk_mq_alloc_and_init_hctx into two real functions: one that allocates
everything and another that initializes everything.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 93 +++++++++++++++++++++++++++-------------------------------
 1 file changed, 44 insertions(+), 49 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 862eb41f24f8..fea25363488d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2290,15 +2290,45 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	}
 }
 
-static int blk_mq_init_hctx(struct request_queue *q,
-		struct blk_mq_tag_set *set,
-		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
+static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
 {
-	int node;
+	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
+
+	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
+			   __alignof__(struct blk_mq_hw_ctx)) !=
+		     sizeof(struct blk_mq_hw_ctx));
+
+	if (tag_set->flags & BLK_MQ_F_BLOCKING)
+		hw_ctx_size += sizeof(struct srcu_struct);
+
+	return hw_ctx_size;
+}
+
+static struct blk_mq_hw_ctx *
+__blk_mq_alloc_and_init_hctx(struct request_queue *q,
+			     struct blk_mq_tag_set *set,
+			     unsigned hctx_idx, int node)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
+			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+			node);
+	if (!hctx)
+		goto fail_alloc_hctx;
+
+	if (!zalloc_cpumask_var_node(&hctx->cpumask,
+				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+				node))
+		goto free_hctx;
+
+	atomic_set(&hctx->nr_active, 0);
+	hctx->numa_node = node;
+	hctx->queue_num = hctx_idx;
 
-	node = hctx->numa_node;
 	if (node == NUMA_NO_NODE)
-		node = hctx->numa_node = set->numa_node;
+		hctx->numa_node = set->numa_node;
+	node = hctx->numa_node;
 
 	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
 	spin_lock_init(&hctx->lock);
@@ -2343,8 +2373,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		init_srcu_struct(hctx->srcu);
+	blk_mq_hctx_kobj_init(hctx);
 
-	return 0;
+	return hctx;
 
  free_fq:
 	blk_free_flush_queue(hctx->fq);
@@ -2357,7 +2388,11 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	kfree(hctx->ctxs);
  unregister_cpu_notifier:
 	blk_mq_remove_cpuhp(hctx);
-	return -1;
+	free_cpumask_var(hctx->cpumask);
+ free_hctx:
+	kfree(hctx);
+ fail_alloc_hctx:
+	return NULL;
 }
 
 static void blk_mq_init_cpu_queues(struct request_queue *q,
@@ -2703,51 +2738,11 @@ struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
 }
 EXPORT_SYMBOL(blk_mq_init_sq_queue);
 
-static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
-{
-	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
-
-	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
-			   __alignof__(struct blk_mq_hw_ctx)) !=
-		     sizeof(struct blk_mq_hw_ctx));
-
-	if (tag_set->flags & BLK_MQ_F_BLOCKING)
-		hw_ctx_size += sizeof(struct srcu_struct);
-
-	return hw_ctx_size;
-}
-
 static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	struct blk_mq_hw_ctx *hctx;
-
-	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
-			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
-			node);
-	if (!hctx)
-		return NULL;
-
-	if (!zalloc_cpumask_var_node(&hctx->cpumask,
-				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
-				node)) {
-		kfree(hctx);
-		return NULL;
-	}
-
-	atomic_set(&hctx->nr_active, 0);
-	hctx->numa_node = node;
-	hctx->queue_num = hctx_idx;
-
-	if (blk_mq_init_hctx(q, set, hctx, hctx_idx)) {
-		free_cpumask_var(hctx->cpumask);
-		kfree(hctx);
-		return NULL;
-	}
-	blk_mq_hctx_kobj_init(hctx);
-
-	return hctx;
+	return __blk_mq_alloc_and_init_hctx(q, set, hctx_idx, node);
 }
 
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH V7 5/9] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  2019-04-24 11:02 ` Ming Lei
@ 2019-04-24 11:02   ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-24 11:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, James Smart, Ming Lei, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Split blk_mq_alloc_and_init_hctx into two parts: blk_mq_alloc_hctx(),
which allocates all hctx resources, and blk_mq_init_hctx(), which
initializes the hctx and serves as the counterpart of
blk_mq_exit_hctx().
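
A rough sketch of the resulting symmetry, with invented names rather
than the real blk-mq functions: alloc pairs with the release handler,
and init pairs with exit, so either half can be undone independently of
the other.

/*
 * Invented toy structure and helpers -- not blk-mq code.
 */
#include <stdio.h>
#include <stdlib.h>

struct foo {
	int *ctxs;		/* resources set up by alloc_foo() */
	int registered;		/* state set up by init_foo() */
};

/* counterpart of the release handler: frees what alloc_foo() created */
static void release_foo(struct foo *f)
{
	free(f->ctxs);
	free(f);
}

/* allocate everything, touch no external state */
static struct foo *alloc_foo(void)
{
	struct foo *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	f->ctxs = calloc(16, sizeof(*f->ctxs));
	if (!f->ctxs) {
		free(f);
		return NULL;
	}
	return f;
}

/* initialize external state; pairs with exit_foo() */
static int init_foo(struct foo *f)
{
	f->registered = 1;	/* e.g. cpuhp notifier, ->init_hctx() */
	return 0;
}

static void exit_foo(struct foo *f)
{
	f->registered = 0;	/* undo init_foo() only; keep the memory */
}

int main(void)
{
	struct foo *f = alloc_foo();

	if (!f || init_foo(f)) {
		if (f)
			release_foo(f);
		return 1;
	}
	printf("using foo\n");
	exit_foo(f);		/* may happen early, e.g. on queue resize */
	release_foo(f);		/* may happen much later, at the final put */
	return 0;
}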

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 76 +++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 49 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index fea25363488d..1eceeb26ae7d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2304,10 +2304,38 @@ static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
 	return hw_ctx_size;
 }
 
+static int blk_mq_init_hctx(struct request_queue *q,
+			    struct blk_mq_tag_set *set,
+			    struct blk_mq_hw_ctx *hctx,
+			    unsigned hctx_idx)
+{
+	hctx->queue_num = hctx_idx;
+
+	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
+
+	hctx->tags = set->tags[hctx_idx];
+
+	if (set->ops->init_hctx &&
+	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
+		goto unregister_cpu_notifier;
+
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
+				hctx->numa_node))
+		goto exit_hctx;
+	return 0;
+
+ exit_hctx:
+	if (set->ops->exit_hctx)
+		set->ops->exit_hctx(hctx, hctx_idx);
+ unregister_cpu_notifier:
+	blk_mq_remove_cpuhp(hctx);
+	return -1;
+}
+
 static struct blk_mq_hw_ctx *
-__blk_mq_alloc_and_init_hctx(struct request_queue *q,
-			     struct blk_mq_tag_set *set,
-			     unsigned hctx_idx, int node)
+blk_mq_alloc_hctx(struct request_queue *q,
+		  struct blk_mq_tag_set *set,
+		  unsigned hctx_idx, int node)
 {
 	struct blk_mq_hw_ctx *hctx;
 
@@ -2324,8 +2352,6 @@ __blk_mq_alloc_and_init_hctx(struct request_queue *q,
 
 	atomic_set(&hctx->nr_active, 0);
 	hctx->numa_node = node;
-	hctx->queue_num = hctx_idx;
-
 	if (node == NUMA_NO_NODE)
 		hctx->numa_node = set->numa_node;
 	node = hctx->numa_node;
@@ -2336,10 +2362,6 @@ __blk_mq_alloc_and_init_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
-	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
-
-	hctx->tags = set->tags[hctx_idx];
-
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
@@ -2347,29 +2369,21 @@ __blk_mq_alloc_and_init_hctx(struct request_queue *q,
 	hctx->ctxs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
 			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node);
 	if (!hctx->ctxs)
-		goto unregister_cpu_notifier;
+		goto free_cpumask;
 
 	if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8),
 				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node))
 		goto free_ctxs;
-
 	hctx->nr_ctx = 0;
 
 	spin_lock_init(&hctx->dispatch_wait_lock);
 	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
 	INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
 
-	if (set->ops->init_hctx &&
-	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
-		goto free_bitmap;
-
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size,
 			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
 	if (!hctx->fq)
-		goto exit_hctx;
-
-	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
-		goto free_fq;
+		goto free_bitmap;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		init_srcu_struct(hctx->srcu);
@@ -2377,17 +2391,11 @@ __blk_mq_alloc_and_init_hctx(struct request_queue *q,
 
 	return hctx;
 
- free_fq:
-	blk_free_flush_queue(hctx->fq);
- exit_hctx:
-	if (set->ops->exit_hctx)
-		set->ops->exit_hctx(hctx, hctx_idx);
  free_bitmap:
 	sbitmap_free(&hctx->ctx_map);
  free_ctxs:
 	kfree(hctx->ctxs);
- unregister_cpu_notifier:
-	blk_mq_remove_cpuhp(hctx);
+ free_cpumask:
 	free_cpumask_var(hctx->cpumask);
  free_hctx:
 	kfree(hctx);
@@ -2742,7 +2750,21 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	return __blk_mq_alloc_and_init_hctx(q, set, hctx_idx, node);
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
+	if (!hctx)
+		goto fail;
+
+	if (blk_mq_init_hctx(q, set, hctx, hctx_idx))
+		goto free_hctx;
+
+	return hctx;
+
+ free_hctx:
+	kobject_put(&hctx->kobj);
+ fail:
+	return NULL;
 }
 
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH V7 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-24 11:02 ` Ming Lei
@ 2019-04-24 11:02   ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-24 11:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, James Smart, Ming Lei, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

In the normal queue cleanup path, the hctx is released after the
request queue is freed; see blk_mq_release().

However, in __blk_mq_update_nr_hw_queues(), an hctx may be freed
because the number of hw queues shrinks. This can easily cause a
use-after-free: one implicit rule is that it is safe to call almost all
block layer APIs while the request queue is alive, so an hctx may be
retrieved by one API and then freed by blk_mq_update_nr_hw_queues(),
triggering a use-after-free.

Fix this issue by always freeing the hctx after the request queue is
released. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
introduce a per-queue list to hold them, then try to reuse these hctxs
if the NUMA node matches.
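
For illustration, a small userspace sketch of the reuse scheme with
invented names (park_hctx(), get_parked_hctx()), not the kernel
implementation: retired instances are parked on an "unused" list
instead of being freed, and a later resize first looks for a parked
instance on the requested NUMA node.

/*
 * Toy model of the unused-hctx list -- not blk-mq code.
 */
#include <stdio.h>
#include <stdlib.h>

struct mock_hctx {
	int numa_node;
	struct mock_hctx *next;
};

static struct mock_hctx *unused_list;	/* models q->unused_hctx_list */

static void park_hctx(struct mock_hctx *h)
{
	h->next = unused_list;
	unused_list = h;
}

static struct mock_hctx *get_parked_hctx(int node)
{
	struct mock_hctx **p;

	for (p = &unused_list; *p; p = &(*p)->next) {
		if ((*p)->numa_node == node) {
			struct mock_hctx *h = *p;

			*p = h->next;
			h->next = NULL;
			return h;
		}
	}
	return NULL;
}

static struct mock_hctx *alloc_or_reuse_hctx(int node)
{
	struct mock_hctx *h = get_parked_hctx(node);

	if (h)
		return h;		/* reuse a node-matching instance */
	h = calloc(1, sizeof(*h));
	if (h)
		h->numa_node = node;
	return h;			/* otherwise allocate a fresh one */
}

int main(void)
{
	struct mock_hctx *a = alloc_or_reuse_hctx(0);
	struct mock_hctx *b = alloc_or_reuse_hctx(1);

	/* nr_hw_queues shrinks: park instead of free */
	park_hctx(a);
	park_hctx(b);

	/* nr_hw_queues grows again: the node-1 instance comes back */
	printf("reused node-1 hctx: %s\n",
	       alloc_or_reuse_hctx(1) == b ? "yes" : "no");

	free(a);
	free(b);
	return 0;
}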

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 46 +++++++++++++++++++++++++++++++++-------------
 include/linux/blk-mq.h |  2 ++
 include/linux/blkdev.h |  7 +++++++
 3 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1eceeb26ae7d..b9d711d12cae 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2274,6 +2274,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 		set->ops->exit_hctx(hctx, hctx_idx);
 
 	blk_mq_remove_cpuhp(hctx);
+
+	spin_lock(&q->unused_hctx_lock);
+	list_add(&hctx->hctx_list, &q->unused_hctx_list);
+	spin_unlock(&q->unused_hctx_lock);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2362,6 +2366,8 @@ blk_mq_alloc_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
+	INIT_LIST_HEAD(&hctx->hctx_list);
+
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
@@ -2675,15 +2681,17 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
  */
 void blk_mq_release(struct request_queue *q)
 {
-	struct blk_mq_hw_ctx *hctx;
-	unsigned int i;
+	struct blk_mq_hw_ctx *hctx, *next;
+	int i;
 
 	cancel_delayed_work_sync(&q->requeue_work);
 
-	/* hctx kobj stays in hctx */
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (!hctx)
-			continue;
+	queue_for_each_hw_ctx(q, hctx, i)
+		WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
+
+	/* all hctx are in .unused_hctx_list now */
+	list_for_each_entry_safe(hctx, next, &q->unused_hctx_list, hctx_list) {
+		list_del_init(&hctx->hctx_list);
 		kobject_put(&hctx->kobj);
 	}
 
@@ -2750,9 +2758,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
 
-	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
+	/* reuse dead hctx first */
+	spin_lock(&q->unused_hctx_lock);
+	list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
+		if (tmp->numa_node == node) {
+			hctx = tmp;
+			break;
+		}
+	}
+	if (hctx)
+		list_del_init(&hctx->hctx_list);
+	spin_unlock(&q->unused_hctx_lock);
+
+	if (!hctx)
+		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
 	if (!hctx)
 		goto fail;
 
@@ -2790,10 +2811,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 
 		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
 		if (hctx) {
-			if (hctxs[i]) {
+			if (hctxs[i])
 				blk_mq_exit_hctx(q, set, hctxs[i], i);
-				kobject_put(&hctxs[i]->kobj);
-			}
 			hctxs[i] = hctx;
 		} else {
 			if (hctxs[i])
@@ -2824,9 +2843,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 			if (hctx->tags)
 				blk_mq_free_map_and_requests(set, j);
 			blk_mq_exit_hctx(q, set, hctx, j);
-			kobject_put(&hctx->kobj);
 			hctxs[j] = NULL;
-
 		}
 	}
 	mutex_unlock(&q->sysfs_lock);
@@ -2869,6 +2886,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_sys_init;
 
+	INIT_LIST_HEAD(&q->unused_hctx_list);
+	spin_lock_init(&q->unused_hctx_lock);
+
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index db29928de467..15d1aa53d96c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -70,6 +70,8 @@ struct blk_mq_hw_ctx {
 	struct dentry		*sched_debugfs_dir;
 #endif
 
+	struct list_head	hctx_list;
+
 	/* Must be the last member - see also blk_mq_hw_ctx_size(). */
 	struct srcu_struct	srcu[0];
 };
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 99aa98f60b9e..d7bad4ae8bc8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -535,6 +535,13 @@ struct request_queue {
 
 	struct mutex		sysfs_lock;
 
+	/*
+	 * for reusing dead hctx instance in case of updating
+	 * nr_hw_queues
+	 */
+	struct list_head	unused_hctx_list;
+	spinlock_t		unused_hctx_lock;
+
 	atomic_t		mq_freeze_depth;
 
 #if defined(CONFIG_BLK_DEV_BSG)
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH V7 7/9] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  2019-04-24 11:02 ` Ming Lei
@ 2019-04-24 11:02   ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-24 11:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, James Smart, Ming Lei, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

The hctx is always released after the request queue is freed.

Since a driver may run the queue while holding the queue's kobject
refcount, a run-queue work item might still be scheduled after
blk_sync_queue() is done.

So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release(),
to avoid running a released queue.
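
As an illustration only (not part of this patch), the window being
closed looks roughly like:

	/* driver path, queue kobject reference held */
	blk_mq_run_hw_queues(q, true);	/* may schedule hctx->run_work */

	/* cleanup path: blk_sync_queue() has already returned, so the
	 * scheduled run_work can still fire; cancelling it in the hctx
	 * release handler closes this window.
	 */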

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c     | 8 --------
 block/blk-mq-sysfs.c | 2 ++
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2dd94b3e9ece..f5b5f21ae4fd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -232,14 +232,6 @@ void blk_sync_queue(struct request_queue *q)
 {
 	del_timer_sync(&q->timeout);
 	cancel_work_sync(&q->timeout_work);
-
-	if (queue_is_mq(q)) {
-		struct blk_mq_hw_ctx *hctx;
-		int i;
-
-		queue_for_each_hw_ctx(q, hctx, i)
-			cancel_delayed_work_sync(&hctx->run_work);
-	}
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 4040e62c3737..25c0d0a6a556 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -35,6 +35,8 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
 
+	cancel_delayed_work_sync(&hctx->run_work);
+
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		cleanup_srcu_struct(hctx->srcu);
 	blk_free_flush_queue(hctx->fq);
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH V7 8/9] block: don't drain in-progress dispatch in blk_cleanup_queue()
  2019-04-24 11:02 ` Ming Lei
@ 2019-04-24 11:02   ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-24 11:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, James Smart, Ming Lei, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Now that freeing hw queue resources has been moved into hctx's release
handler, we no longer need to worry about the race between
blk_cleanup_queue() and run queue.

So don't drain in-progress dispatch in blk_cleanup_queue().

This is basically a revert of c2856ae2f315 ("blk-mq: quiesce queue
before freeing queue").

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f5b5f21ae4fd..e24cfcefdc19 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -338,18 +338,6 @@ void blk_cleanup_queue(struct request_queue *q)
 
 	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
 
-	/*
-	 * make sure all in-progress dispatch are completed because
-	 * blk_freeze_queue() can only complete all requests, and
-	 * dispatch may still be in-progress since we dispatch requests
-	 * from more than one contexts.
-	 *
-	 * We rely on driver to deal with the race in case that queue
-	 * initialization isn't done.
-	 */
-	if (queue_is_mq(q) && blk_queue_init_done(q))
-		blk_mq_quiesce_queue(q);
-
 	/* for synchronous bio-based driver finish in-flight integrity i/o */
 	blk_flush_integrity();
 
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH V7 9/9] nvme: hold request queue's refcount in ns's whole lifetime
  2019-04-24 11:02 ` Ming Lei
@ 2019-04-24 11:02   ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-24 11:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Hannes Reinecke, Keith Busch, linux-nvme,
	Sagi Grimberg, James Smart, Ming Lei, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Hennes reported the following kernel oops:

    There is a race condition between namespace rescanning and
    controller reset; during controller reset all namespaces are
    quiesced via nvme_stop_ctrl(), and after reset all namespaces
    are unquiesced again.
    When namespace scanning is still active at the time the controller
    reset is triggered, the rescan code will call nvme_ns_remove(),
    which then causes a kernel crash in nvme_start_ctrl() as it trips
    over uninitialized namespaces.

Patch "blk-mq: free hw queue's resource in hctx's release handler"
should make this issue quite difficult to trigger. However it can't
kill the issue completely becasue pre-condition of that patch is to
hold request queue's refcount before calling block layer API, and
there is still a small window between blk_cleanup_queue() and removing
the ns from the controller namspace list in nvme_ns_remove().

Hold the request queue's refcount until the ns is freed, then the above
race can be avoided completely. Given that 'namespaces_rwsem' is always
held while retrieving the ns for starting/stopping its request queue,
this lock prevents the namespaces from being freed.
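
For context, a sketch of the consumer this protects, simplified from
nvme_stop_queues() (details may differ slightly from the exact tree):

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_mq_quiesce_queue(ns->queue);	/* ns now pins ns->queue */
	up_read(&ctrl->namespaces_rwsem);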

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reported-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 248ff3b48041..82cda6602ca7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -406,6 +406,7 @@ static void nvme_free_ns(struct kref *kref)
 		nvme_nvm_unregister(ns);
 
 	put_disk(ns->disk);
+	blk_put_queue(ns->queue);
 	nvme_put_ns_head(ns->head);
 	nvme_put_ctrl(ns->ctrl);
 	kfree(ns);
@@ -3229,6 +3230,11 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		goto out_free_ns;
 	}
 
+	if (!blk_get_queue(ns->queue)) {
+		ret = -ENXIO;
+		goto out_free_queue;
+	}
+
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
 	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
 		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
@@ -3245,7 +3251,7 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	id = nvme_identify_ns(ctrl, nsid);
 	if (!id) {
 		ret = -EIO;
-		goto out_free_queue;
+		goto out_put_queue;
 	}
 
 	if (id->ncap == 0) {
@@ -3304,6 +3310,8 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	nvme_put_ns_head(ns->head);
  out_free_id:
 	kfree(id);
+ out_put_queue:
+	blk_put_queue(ns->queue);
  out_free_queue:
 	blk_cleanup_queue(ns->queue);
  out_free_ns:
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 6/9] blk-mq: always free hctx after request queue is freed
  2019-04-24 11:02   ` Ming Lei
@ 2019-04-24 11:15     ` Hannes Reinecke
  -1 siblings, 0 replies; 56+ messages in thread
From: Hannes Reinecke @ 2019-04-24 11:15 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Keith Busch, linux-nvme, Sagi Grimberg, James Smart,
	Dongli Zhang, Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On 4/24/19 1:02 PM, Ming Lei wrote:
> In normal queue cleanup path, hctx is released after request queue
> is freed, see blk_mq_release().
> 
> However, in __blk_mq_update_nr_hw_queues(), an hctx may be freed when
> the number of hw queues shrinks. This can easily lead to a
> use-after-free: the implicit rule is that it is safe to call almost any
> block layer API as long as the request queue is alive, so an hctx may
> be retrieved through one API and then freed by
> blk_mq_update_nr_hw_queues(), and the use-after-free is triggered.
> 
> Fix this issue by always freeing hctxs after the request queue is
> released. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
> put them on a per-queue list and try to reuse them when the NUMA node
> matches.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Tested-by: James Smart <james.smart@broadcom.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c         | 46 +++++++++++++++++++++++++++++++++-------------
>   include/linux/blk-mq.h |  2 ++
>   include/linux/blkdev.h |  7 +++++++
>   3 files changed, 42 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 1eceeb26ae7d..b9d711d12cae 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2274,6 +2274,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>   		set->ops->exit_hctx(hctx, hctx_idx);
>   
>   	blk_mq_remove_cpuhp(hctx);
> +
> +	spin_lock(&q->unused_hctx_lock);
> +	list_add(&hctx->hctx_list, &q->unused_hctx_list);
> +	spin_unlock(&q->unused_hctx_lock);
>   }
>   
>   static void blk_mq_exit_hw_queues(struct request_queue *q,
> @@ -2362,6 +2366,8 @@ blk_mq_alloc_hctx(struct request_queue *q,
>   	hctx->queue = q;
>   	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
>   
> +	INIT_LIST_HEAD(&hctx->hctx_list);
> +
>   	/*
>   	 * Allocate space for all possible cpus to avoid allocation at
>   	 * runtime
> @@ -2675,15 +2681,17 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
>    */
>   void blk_mq_release(struct request_queue *q)
>   {
> -	struct blk_mq_hw_ctx *hctx;
> -	unsigned int i;
> +	struct blk_mq_hw_ctx *hctx, *next;
> +	int i;
>   
>   	cancel_delayed_work_sync(&q->requeue_work);
>   
> -	/* hctx kobj stays in hctx */
> -	queue_for_each_hw_ctx(q, hctx, i) {
> -		if (!hctx)
> -			continue;
> +	queue_for_each_hw_ctx(q, hctx, i)
> +		WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
> +
> +	/* all hctx are in .unused_hctx_list now */
> +	list_for_each_entry_safe(hctx, next, &q->unused_hctx_list, hctx_list) {
> +		list_del_init(&hctx->hctx_list);
>   		kobject_put(&hctx->kobj);
>   	}
>   
I would consider it a bug if the hctx is still assigned to 
->queue_hw_ctx at this point.
But that's a minor issue.
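
A sketch of such a check (it would additionally require
blk_mq_exit_hctx() to clear the queue_hw_ctx slot, which this patch
does not do):

	queue_for_each_hw_ctx(q, hctx, i)
		WARN_ON_ONCE(hctx);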

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@suse.com			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-24 11:02   ` Ming Lei
@ 2019-04-24 16:18     ` Christoph Hellwig
  -1 siblings, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2019-04-24 16:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, James Smart, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Wed, Apr 24, 2019 at 07:02:13PM +0800, Ming Lei wrote:
>  		if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
>  			if (this_hctx) {
>  				trace_block_unplug(this_q, depth, !from_schedule);
> +
> +				percpu_ref_get(&this_q->q_usage_counter);
>  				blk_mq_sched_insert_requests(this_hctx, this_ctx,
>  								&rq_list,
>  								from_schedule);
> +				percpu_ref_put(&this_q->q_usage_counter);
>  			}
>  
>  			this_q = rq->q;
> @@ -1757,8 +1760,11 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  	 */
>  	if (this_hctx) {
>  		trace_block_unplug(this_q, depth, !from_schedule);
> +
> +		percpu_ref_get(&this_q->q_usage_counter);
>  		blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list,
>  						from_schedule);
> +		percpu_ref_put(&this_q->q_usage_counter);

Why don't we push this into blk_mq_sched_insert_requests?  Yes, it
would need a request_queue argument, but that still seems saner
than duplicating it in both callers.
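
Roughly like the following sketch (the extra request_queue argument is
hypothetical, not existing code):

	void blk_mq_sched_insert_requests(struct request_queue *q,
					  struct blk_mq_hw_ctx *hctx,
					  struct blk_mq_ctx *ctx,
					  struct list_head *list,
					  bool run_queue_async)
	{
		percpu_ref_get(&q->q_usage_counter);
		/* existing insert and run-queue logic */
		percpu_ref_put(&q->q_usage_counter);
	}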

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 2/9] blk-mq: move cancel of requeue_work into blk_mq_release
  2019-04-24 11:02   ` Ming Lei
@ 2019-04-24 16:18     ` Christoph Hellwig
  -1 siblings, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2019-04-24 16:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, James Smart, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Wed, Apr 24, 2019 at 07:02:14PM +0800, Ming Lei wrote:
> With holding queue's kobject refcount, it is safe for driver
> to schedule requeue. However, blk_mq_kick_requeue_list() may
> be called after blk_sync_queue() is done because of concurrent
> requeue activities, then requeue work may not be completed when
> freeing queue, and kernel oops is triggered.
> 
> So moving the cancel of requeue_work into blk_mq_release() for
> avoiding race between requeue and freeing queue.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 3/9] blk-mq: free hw queue's resource in hctx's release handler
  2019-04-24 11:02   ` Ming Lei
@ 2019-04-24 16:19     ` Christoph Hellwig
  -1 siblings, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2019-04-24 16:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, James Smart, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, stable

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 4/9] blk-mq: move all hctx alloction & initialization into __blk_mq_alloc_and_init_hctx
  2019-04-24 11:02   ` Ming Lei
@ 2019-04-24 16:22     ` Christoph Hellwig
  -1 siblings, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2019-04-24 16:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, James Smart, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Wed, Apr 24, 2019 at 07:02:16PM +0800, Ming Lei wrote:
> Move all hctx related allocation and initialization into
> __blk_mq_alloc_and_init_hctx, and prepare for splitting blk_mq_alloc_and_init_hctx
> into two real functions: one for allocating everything, and another for
> initializing everything.
> 
> Cc: Dongli Zhang <dongli.zhang@oracle.com>
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Tested-by: James Smart <james.smart@broadcom.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c | 93 +++++++++++++++++++++++++++-------------------------------
>  1 file changed, 44 insertions(+), 49 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 862eb41f24f8..fea25363488d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2290,15 +2290,45 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
>  	}
>  }
>  
> -static int blk_mq_init_hctx(struct request_queue *q,
> -		struct blk_mq_tag_set *set,
> -		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
> +static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
>  {
> -	int node;
> +	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
> +
> +	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
> +			   __alignof__(struct blk_mq_hw_ctx)) !=
> +		     sizeof(struct blk_mq_hw_ctx));
> +
> +	if (tag_set->flags & BLK_MQ_F_BLOCKING)
> +		hw_ctx_size += sizeof(struct srcu_struct);
> +
> +	return hw_ctx_size;
> +}
> +
> +static struct blk_mq_hw_ctx *
> +__blk_mq_alloc_and_init_hctx(struct request_queue *q,
> +			     struct blk_mq_tag_set *set,
> +			     unsigned hctx_idx, int node)
> +{
> +	struct blk_mq_hw_ctx *hctx;
> +
> +	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
> +			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
> +			node);
> +	if (!hctx)
> +		goto fail_alloc_hctx;
> +
> +	if (!zalloc_cpumask_var_node(&hctx->cpumask,
> +				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,

Maybe throw in a

	const gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;

as the first line to shorten these two a bit..
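
i.e. something like this sketch:

	const gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;

	hctx = kzalloc_node(blk_mq_hw_ctx_size(set), gfp, node);
	if (!hctx)
		goto fail_alloc_hctx;

	if (!zalloc_cpumask_var_node(&hctx->cpumask, gfp, node))
		...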

>  static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
>  		struct blk_mq_tag_set *set, struct request_queue *q,
>  		int hctx_idx, int node)
>  {
> +	return __blk_mq_alloc_and_init_hctx(q, set, hctx_idx, node);
>  }

I really don't see the point of this wrapper.  If we need it later
we should introduce it there.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 9/9] nvme: hold request queue's refcount in ns's whole lifetime
  2019-04-24 11:02   ` Ming Lei
@ 2019-04-24 16:27     ` Christoph Hellwig
  -1 siblings, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2019-04-24 16:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, James Smart, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Wed, Apr 24, 2019 at 07:02:21PM +0800, Ming Lei wrote:
> Hennes reported the following kernel oops:

Hannes?

> +	if (!blk_get_queue(ns->queue)) {
> +		ret = -ENXIO;
> +		goto out_free_queue;
> +	}

If we always need to hold a reference, shouldn't blk_mq_init_queue
return with that reference held (and yes, that means changes to
every driver, but it seems like we need to audit all of them anyway..)

It seems like the queue lifetimes are a bit of a mess, and I'm not sure
if this just papers over the problem.
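
To illustrate the suggestion (hypothetical behaviour, not how the
current API works): blk_mq_init_queue() would return with an extra
reference that the driver drops explicitly, e.g.

	ns->queue = blk_mq_init_queue(ctrl->tagset);
	...
	blk_cleanup_queue(ns->queue);
	blk_put_queue(ns->queue);	/* drop the reference taken at init */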

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-24 16:18     ` Christoph Hellwig
@ 2019-04-25  0:53       ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-25  0:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, James Smart, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	James E . J . Bottomley

On Wed, Apr 24, 2019 at 06:18:34PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 24, 2019 at 07:02:13PM +0800, Ming Lei wrote:
> >  		if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
> >  			if (this_hctx) {
> >  				trace_block_unplug(this_q, depth, !from_schedule);
> > +
> > +				percpu_ref_get(&this_q->q_usage_counter);
> >  				blk_mq_sched_insert_requests(this_hctx, this_ctx,
> >  								&rq_list,
> >  								from_schedule);
> > +				percpu_ref_put(&this_q->q_usage_counter);
> >  			}
> >  
> >  			this_q = rq->q;
> > @@ -1757,8 +1760,11 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> >  	 */
> >  	if (this_hctx) {
> >  		trace_block_unplug(this_q, depth, !from_schedule);
> > +
> > +		percpu_ref_get(&this_q->q_usage_counter);
> >  		blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list,
> >  						from_schedule);
> > +		percpu_ref_put(&this_q->q_usage_counter);
> 
> Why don't we push this into blk_mq_sched_insert_requests?  Yes, it
> would need a request_queue argument, but that still seems saner
> than duplicating it in both callers.

It isn't needed in other callers of blk_mq_sched_insert_requests(); it is
just needed in some corner cases like the flush plug context.

For other callers of blk_mq_sched_insert_requests(), it is guaranteed
that request queue's ref is held.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 4/9] blk-mq: move all hctx alloction & initialization into __blk_mq_alloc_and_init_hctx
  2019-04-24 16:22     ` Christoph Hellwig
@ 2019-04-25  0:55       ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-25  0:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, James Smart, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	James E . J . Bottomley

On Wed, Apr 24, 2019 at 06:22:23PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 24, 2019 at 07:02:16PM +0800, Ming Lei wrote:
> > Move all hctx related allocation and initialization into
> > __blk_mq_alloc_and_init_hctx, and prepare for splitting blk_mq_alloc_and_init_hctx
> > into two real functions: one for allocating everything, and another for
> > initializing everything.
> > 
> > Cc: Dongli Zhang <dongli.zhang@oracle.com>
> > Cc: James Smart <james.smart@broadcom.com>
> > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > Cc: linux-scsi@vger.kernel.org,
> > Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> > Cc: Christoph Hellwig <hch@lst.de>,
> > Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> > Reviewed-by: Hannes Reinecke <hare@suse.com>
> > Tested-by: James Smart <james.smart@broadcom.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq.c | 93 +++++++++++++++++++++++++++-------------------------------
> >  1 file changed, 44 insertions(+), 49 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 862eb41f24f8..fea25363488d 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2290,15 +2290,45 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
> >  	}
> >  }
> >  
> > -static int blk_mq_init_hctx(struct request_queue *q,
> > -		struct blk_mq_tag_set *set,
> > -		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
> > +static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
> >  {
> > -	int node;
> > +	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
> > +
> > +	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
> > +			   __alignof__(struct blk_mq_hw_ctx)) !=
> > +		     sizeof(struct blk_mq_hw_ctx));
> > +
> > +	if (tag_set->flags & BLK_MQ_F_BLOCKING)
> > +		hw_ctx_size += sizeof(struct srcu_struct);
> > +
> > +	return hw_ctx_size;
> > +}
> > +
> > +static struct blk_mq_hw_ctx *
> > +__blk_mq_alloc_and_init_hctx(struct request_queue *q,
> > +			     struct blk_mq_tag_set *set,
> > +			     unsigned hctx_idx, int node)
> > +{
> > +	struct blk_mq_hw_ctx *hctx;
> > +
> > +	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
> > +			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
> > +			node);
> > +	if (!hctx)
> > +		goto fail_alloc_hctx;
> > +
> > +	if (!zalloc_cpumask_var_node(&hctx->cpumask,
> > +				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
> 
> Maybe throw in a
> 
> 	const gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
> 
> as the first line to shorten these two a bit..
> 
> >  static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
> >  		struct blk_mq_tag_set *set, struct request_queue *q,
> >  		int hctx_idx, int node)
> >  {
> > +	return __blk_mq_alloc_and_init_hctx(q, set, hctx_idx, node);
> >  }
> 
> I really don't see the point of this wrapper.  If we need it later
> we should introduce it there.

It prepares for the following patch, otherwise the whole patch may
become difficult to review.

thanks,
Ming

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 9/9] nvme: hold request queue's refcount in ns's whole lifetime
  2019-04-24 16:27     ` Christoph Hellwig
@ 2019-04-25  1:00       ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-25  1:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, James Smart, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	James E . J . Bottomley

On Wed, Apr 24, 2019 at 06:27:46PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 24, 2019 at 07:02:21PM +0800, Ming Lei wrote:
> > Hennes reported the following kernel oops:
> 
> Hannes?
> 
> > +	if (!blk_get_queue(ns->queue)) {
> > +		ret = -ENXIO;
> > +		goto out_free_queue;
> > +	}
> 
> If we always need to hold a reference, shouldn't blk_mq_init_queue
> return with that reference held (and yes, that means changes to
> every driver, but it seems like we need to audit all of them anyway..)

The issue is driver (NVMe) specific; the race window is just between
blk_cleanup_queue() and removing the ns from the controller's namespace
list in nvme_ns_remove().

blk_mq_init_queue() does hold one refcount, and its counter-part is
blk_cleanup_queue().

It is simply ugly to ask blk_mq_init_queue() to grab an extra refcnt for
the driver; then who would be the counterpart for releasing that extra
refcount?

> 
> It seems like the queue lifetimes are a bit of a mess, and I'm not sure
> if this just papers over the problem.

Could you explain a bit what the mess is?


Thanks,
Ming

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 4/9] blk-mq: move all hctx alloction & initialization into __blk_mq_alloc_and_init_hctx
  2019-04-24 16:22     ` Christoph Hellwig
@ 2019-04-25  1:02       ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-25  1:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, James Smart, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	James E . J . Bottomley

On Wed, Apr 24, 2019 at 06:22:23PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 24, 2019 at 07:02:16PM +0800, Ming Lei wrote:
> > Move all hctx related allocation and initialization into
> > __blk_mq_alloc_and_init_hctx, and prepare for splitting blk_mq_alloc_and_init_hctx
> > into two real functions: one for allocating everything, and another for
> > initializing everything.
> > 
> > Cc: Dongli Zhang <dongli.zhang@oracle.com>
> > Cc: James Smart <james.smart@broadcom.com>
> > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > Cc: linux-scsi@vger.kernel.org,
> > Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> > Cc: Christoph Hellwig <hch@lst.de>,
> > Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> > Reviewed-by: Hannes Reinecke <hare@suse.com>
> > Tested-by: James Smart <james.smart@broadcom.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq.c | 93 +++++++++++++++++++++++++++-------------------------------
> >  1 file changed, 44 insertions(+), 49 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 862eb41f24f8..fea25363488d 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2290,15 +2290,45 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
> >  	}
> >  }
> >  
> > -static int blk_mq_init_hctx(struct request_queue *q,
> > -		struct blk_mq_tag_set *set,
> > -		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
> > +static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
> >  {
> > -	int node;
> > +	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
> > +
> > +	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
> > +			   __alignof__(struct blk_mq_hw_ctx)) !=
> > +		     sizeof(struct blk_mq_hw_ctx));
> > +
> > +	if (tag_set->flags & BLK_MQ_F_BLOCKING)
> > +		hw_ctx_size += sizeof(struct srcu_struct);
> > +
> > +	return hw_ctx_size;
> > +}
> > +
> > +static struct blk_mq_hw_ctx *
> > +__blk_mq_alloc_and_init_hctx(struct request_queue *q,
> > +			     struct blk_mq_tag_set *set,
> > +			     unsigned hctx_idx, int node)
> > +{
> > +	struct blk_mq_hw_ctx *hctx;
> > +
> > +	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
> > +			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
> > +			node);
> > +	if (!hctx)
> > +		goto fail_alloc_hctx;
> > +
> > +	if (!zalloc_cpumask_var_node(&hctx->cpumask,
> > +				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
> 
> Maybe throw in a
> 
> 	const gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
> 
> as the first line to shorten these two a bit..

That is fine, but could be a cleanup.

> 
> >  static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
> >  		struct blk_mq_tag_set *set, struct request_queue *q,
> >  		int hctx_idx, int node)
> >  {
> > +	return __blk_mq_alloc_and_init_hctx(q, set, hctx_idx, node);
> >  }
> 
> I really don't see the point of this wrapper.  If we need it later
> we should introduce it there.

This patch prepares for the conversion, especially for the next two.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-25  0:53       ` Ming Lei
@ 2019-04-25  5:32         ` Christoph Hellwig
  -1 siblings, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2019-04-25  5:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Hannes Reinecke,
	Keith Busch, linux-nvme, Sagi Grimberg, James Smart,
	Dongli Zhang, Bart Van Assche, linux-scsi, Martin K . Petersen,
	James E . J . Bottomley

On Thu, Apr 25, 2019 at 08:53:34AM +0800, Ming Lei wrote:
> It isn't in other callers of blk_mq_sched_insert_requests(), it is just
> needed in some corner case like flush plug context.
> 
> For other callers of blk_mq_sched_insert_requests(), it is guaranteed
> that request queue's ref is held.

In both Linus' tree and Jens' for-5.2 tree I only see these two
callers of blk_mq_sched_insert_requests.  What am I missing?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path
  2019-04-25  5:32         ` Christoph Hellwig
@ 2019-04-25  7:35           ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-25  7:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, James Smart, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	James E . J . Bottomley

On Thu, Apr 25, 2019 at 07:32:34AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 25, 2019 at 08:53:34AM +0800, Ming Lei wrote:
> > It isn't in other callers of blk_mq_sched_insert_requests(), it is just
> > needed in some corner case like flush plug context.
> > 
> > For other callers of blk_mq_sched_insert_requests(), it is guaranteed
> > that request queue's ref is held.
> 
> In both Linus' tree and Jens' for-5.2 tree I only see these two
> callers of blk_mq_sched_insert_requests.  What am I missing?

OK, what I meant is that the issue is flush plug context specific.
And I didn't notice that this function is only called by
blk_mq_flush_plug_list().

Then it is fine to move the percpu_ref_get()/percpu_ref_put() pair into
the function, preferably with a comment explaining this issue, roughly
as sketched below.
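
Something along these lines, as a sketch only (the body elides the
existing elevator/direct-insert logic):

	void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
					  struct blk_mq_ctx *ctx,
					  struct list_head *list, bool run_queue_async)
	{
		struct request_queue *q = hctx->queue;

		/*
		 * This is only called from the plug flush path, where nothing
		 * else guarantees that the queue stays alive, so pin
		 * ->q_usage_counter until the requests have been queued and
		 * the queue has been run.
		 */
		percpu_ref_get(&q->q_usage_counter);

		/* ... existing insert logic (elevator or direct insert) ... */

		blk_mq_run_hw_queue(hctx, run_queue_async);
		percpu_ref_put(&q->q_usage_counter);
	}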

Thanks, 
Ming

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 4/9] blk-mq: move all hctx allocation & initialization into __blk_mq_alloc_and_init_hctx
  2019-04-25  0:55       ` Ming Lei
@ 2019-04-26 15:09         ` Christoph Hellwig
  -1 siblings, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2019-04-26 15:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Hannes Reinecke,
	Keith Busch, linux-nvme, Sagi Grimberg, James Smart,
	Dongli Zhang, Bart Van Assche, linux-scsi, Martin K . Petersen,
	James E . J . Bottomley

On Thu, Apr 25, 2019 at 08:55:43AM +0800, Ming Lei wrote:
> It prepares for the following patch, otherwise the whole patch may
> become difficult to review.

Honestly, I'd find it much easier to review if this was folded into
the next patch.  That way we only move things around once instead
of twice.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 9/9] nvme: hold request queue's refcount in ns's whole lifetime
  2019-04-25  1:00       ` Ming Lei
@ 2019-04-26 15:11         ` Christoph Hellwig
  -1 siblings, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2019-04-26 15:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Hannes Reinecke,
	Keith Busch, linux-nvme, Sagi Grimberg, James Smart,
	Dongli Zhang, Bart Van Assche, linux-scsi, Martin K . Petersen,
	James E . J . Bottomley

On Thu, Apr 25, 2019 at 09:00:31AM +0800, Ming Lei wrote:
> The issue is driver (NVMe) specific; the race window is just between
> blk_cleanup_queue() and removing the ns from the controller namespace
> list in nvme_ns_remove().

And I wouldn't be surprised if others have the same issue.

> 
> blk_mq_init_queue() does hold one refcount, and its counter-part is
> blk_cleanup_queue().
> 
> It is simply ugly to ask blk_mq_init_queue() to grab a refcnt for driver,
> then who is the counter-part for releasing the extra refcount?

Well, the problem is exactly that blk_cleanup_queue drops the reference.
If we move the blk_put_queue() call from the end of it to the callers, the
callers can keep the reference as long as they need it, and we wouldn't
need an extra reference.
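
In other words, a driver's teardown path would look roughly like this
sketch (using an NVMe-style ns->queue purely as an example, not an
actual patch):

	/* tears the queue down, but no longer drops the final reference */
	blk_cleanup_queue(ns->queue);

	/* ... the driver may still safely touch ns->queue here ... */

	/* the caller drops the final reference itself, when it is done */
	blk_put_queue(ns->queue);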

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 9/9] nvme: hold request queue's refcount in ns's whole lifetime
  2019-04-26 15:11         ` Christoph Hellwig
@ 2019-04-26 17:04           ` Bart Van Assche
  -1 siblings, 0 replies; 56+ messages in thread
From: Bart Van Assche @ 2019-04-26 17:04 UTC (permalink / raw)
  To: Christoph Hellwig, Ming Lei
  Cc: Jens Axboe, Keith Busch, Hannes Reinecke,
	James E . J . Bottomley, Sagi Grimberg, linux-scsi, Dongli Zhang,
	James Smart, linux-nvme, linux-block, Martin K . Petersen,
	Bart Van Assche

On Fri, 2019-04-26 at 17:11 +0200, Christoph Hellwig wrote:
> On Thu, Apr 25, 2019 at 09:00:31AM +0800, Ming Lei wrote:
> > The issue is driver (NVMe) specific; the race window is just between
> > blk_cleanup_queue() and removing the ns from the controller namespace
> > list in nvme_ns_remove().
> 
> And I wouldn't be surprised if others have the same issue.
> 
> > 
> > blk_mq_init_queue() does hold one refcount, and its counter-part is
> > blk_cleanup_queue().
> > 
> > It is simply ugly to ask blk_mq_init_queue() to grab a refcnt for driver,
> > then who is the counter-part for releasing the extra refcount?
> 
> Well, the problem is exactly that blk_cleanup_queue drops the reference.
> If we move the blk_put_queue() call from the end of it to the callers, the
> callers can keep the reference as long as they need it, and we wouldn't
> need an extra reference.

Hi Christoph,

There are more than a hundred callers of blk_cleanup_queue(), so that change
would cause a lot of churn. Since blk_get_queue() and blk_put_queue() are
available, how about inserting a pair of calls to these functions where
necessary?
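
For the NVMe namespace that pairing could look roughly like the sketch
below (illustrative only, not the literal patch):

	ns->queue = blk_mq_init_queue(ctrl->tagset);
	if (IS_ERR(ns->queue))
		goto out_free_ns;

	/* extra reference held for the whole lifetime of the ns;
	 * the queue was just created, so it cannot be dying yet */
	blk_get_queue(ns->queue);

	/* ... and drop it only in nvme_free_ns(), once nothing can
	 * reach the ns any more ... */
	blk_put_queue(ns->queue);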

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 9/9] nvme: hold request queue's refcount in ns's whole lifetime
  2019-04-26 15:11         ` Christoph Hellwig
@ 2019-04-26 22:45           ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-26 22:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Hannes Reinecke, Keith Busch,
	linux-nvme, Sagi Grimberg, James Smart, Dongli Zhang,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	James E . J . Bottomley

On Fri, Apr 26, 2019 at 05:11:14PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 25, 2019 at 09:00:31AM +0800, Ming Lei wrote:
> > The issue is driver (NVMe) specific; the race window is just between
> > blk_cleanup_queue() and removing the ns from the controller namespace
> > list in nvme_ns_remove().
> 
> And I wouldn't be surprised if others have the same issue.

SCSI doesn't have this issue, and neither do loop/null/virtio-blk.

> 
> > 
> > blk_mq_init_queue() does hold one refcount, and its counter-part is
> > blk_cleanup_queue().
> > 
> > It is simply ugly to ask blk_mq_init_queue() to grab a refcnt for driver,
> > then who is the counter-part for releasing the extra refcount?
> 
> Well, the problem is exactly that blk_cleanup_queue drops the reference.
> If we move the blk_put_queue() call from the end of it to the callers, the
> callers can keep the reference as long as they need it, and we wouldn't
> need an extra reference.

It depends on the driver.

There is still no difference between your suggestion and the way in this
patch, given that a driver-specific change is a must. It is even cleaner to
hold the queue refcount in the driver explicitly, because we usually do the
get/put pair in one place, instead of taking the refcount in one subsystem
and dropping it in another.

I am going to drop this patch in V8, since my original plan is to fix the
block layer's races in this patchset.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 9/9] nvme: hold request queue's refcount in ns's whole lifetime
  2019-04-26 17:04           ` Bart Van Assche
@ 2019-04-26 22:49             ` Ming Lei
  -1 siblings, 0 replies; 56+ messages in thread
From: Ming Lei @ 2019-04-26 22:49 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Hannes Reinecke,
	James E . J . Bottomley, Sagi Grimberg, linux-scsi, Dongli Zhang,
	James Smart, linux-nvme, linux-block, Martin K . Petersen,
	Bart Van Assche

On Fri, Apr 26, 2019 at 10:04:23AM -0700, Bart Van Assche wrote:
> On Fri, 2019-04-26 at 17:11 +0200, Christoph Hellwig wrote:
> > On Thu, Apr 25, 2019 at 09:00:31AM +0800, Ming Lei wrote:
> > > The issue is driver (NVMe) specific; the race window is just between
> > > blk_cleanup_queue() and removing the ns from the controller namespace
> > > list in nvme_ns_remove().
> > 
> > And I wouldn't be surprised if others have the same issue.
> > 
> > > 
> > > blk_mq_init_queue() does hold one refcount, and its counter-part is
> > > blk_cleanup_queue().
> > > 
> > > It is simply ugly to ask blk_mq_init_queue() to grab a refcnt for driver,
> > > then who is the counter-part for releasing the extra refcount?
> > 
> > Well, the problem is exactly that blk_cleanup_queue drops the reference.
> > If we move the blk_put_queue() call from the end of it to the callers, the
> > callers can keep the reference as long as they need it, and we wouldn't
> > need an extra reference.
> 
> Hi Christoph,
> 
> There are more than a hundred callers of blk_cleanup_queue(), so that change
> would cause a lot of churn. Since blk_get_queue() and blk_put_queue() are
> available, how about inserting a pair of calls to these functions where
> necessary?

The problem is that some drivers may still use the queue after
blk_cleanup_queue() has returned.

The gendisk is removed before the queue is cleaned up, and the other
activities on the queue depend on the driver itself. There can't be a
universal way to deal with that. The NVMe case looks roughly like the
sketch below.
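
Simplified, the window in nvme_ns_remove() is (locking and sysfs
teardown omitted; a sketch, not the full function):

	static void nvme_ns_remove(struct nvme_ns *ns)
	{
		del_gendisk(ns->disk);
		blk_cleanup_queue(ns->queue);	/* queue reference dropped here */

		/*
		 * The ns is only unlinked from ctrl->namespaces below, so a
		 * concurrent scan or reset can still find it and dereference
		 * ns->queue in between.
		 */
		list_del_init(&ns->list);

		nvme_put_ns(ns);
	}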

Thanks,
Ming

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH V7 9/9] nvme: hold request queue's refcount in ns's whole lifetime
  2019-04-26 22:45           ` Ming Lei
@ 2019-04-27  5:54             ` Christoph Hellwig
  -1 siblings, 0 replies; 56+ messages in thread
From: Christoph Hellwig @ 2019-04-27  5:54 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Hannes Reinecke,
	Keith Busch, linux-nvme, Sagi Grimberg, James Smart,
	Dongli Zhang, Bart Van Assche, linux-scsi, Martin K . Petersen,
	James E . J . Bottomley

On Sat, Apr 27, 2019 at 06:45:43AM +0800, Ming Lei wrote:
> There is still no difference between your suggestion and the way in this
> patch, given that a driver-specific change is a must. It is even cleaner to
> hold the queue refcount in the driver explicitly, because we usually do the
> get/put pair in one place, instead of taking the refcount in one subsystem
> and dropping it in another.
> 
> I am going to drop this patch in V8, since my original plan is to fix the
> block layer's races in this patchset.

Doing the unregistration and the final reference drop together is just a
bad pattern, leading to possible bugs.  I'm fine with you just dropping it
for now; I guess I'll just send Jens a scripted conversion moving the put
to the drivers after -rc1 to sort it out, probably including a rename of
blk_cleanup_queue so that everyone notices the difference.

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2019-04-27  5:55 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-24 11:02 [PATCH V7 0/9] blk-mq: fix races related with freeing queue Ming Lei
2019-04-24 11:02 ` Ming Lei
2019-04-24 11:02 ` [PATCH V7 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
2019-04-24 11:02   ` Ming Lei
2019-04-24 16:18   ` Christoph Hellwig
2019-04-24 16:18     ` Christoph Hellwig
2019-04-25  0:53     ` Ming Lei
2019-04-25  0:53       ` Ming Lei
2019-04-25  5:32       ` Christoph Hellwig
2019-04-25  5:32         ` Christoph Hellwig
2019-04-25  7:35         ` Ming Lei
2019-04-25  7:35           ` Ming Lei
2019-04-24 11:02 ` [PATCH V7 2/9] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
2019-04-24 11:02   ` Ming Lei
2019-04-24 16:18   ` Christoph Hellwig
2019-04-24 16:18     ` Christoph Hellwig
2019-04-24 11:02 ` [PATCH V7 3/9] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
2019-04-24 11:02   ` Ming Lei
2019-04-24 16:19   ` Christoph Hellwig
2019-04-24 16:19     ` Christoph Hellwig
2019-04-24 11:02 ` [PATCH V7 4/9] blk-mq: move all hctx allocation & initialization into __blk_mq_alloc_and_init_hctx Ming Lei
2019-04-24 11:02   ` Ming Lei
2019-04-24 16:22   ` Christoph Hellwig
2019-04-24 16:22     ` Christoph Hellwig
2019-04-25  0:55     ` Ming Lei
2019-04-25  0:55       ` Ming Lei
2019-04-26 15:09       ` Christoph Hellwig
2019-04-26 15:09         ` Christoph Hellwig
2019-04-25  1:02     ` Ming Lei
2019-04-25  1:02       ` Ming Lei
2019-04-24 11:02 ` [PATCH V7 5/9] blk-mq: split blk_mq_alloc_and_init_hctx into two parts Ming Lei
2019-04-24 11:02   ` Ming Lei
2019-04-24 11:02 ` [PATCH V7 6/9] blk-mq: always free hctx after request queue is freed Ming Lei
2019-04-24 11:02   ` Ming Lei
2019-04-24 11:15   ` Hannes Reinecke
2019-04-24 11:15     ` Hannes Reinecke
2019-04-24 11:02 ` [PATCH V7 7/9] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
2019-04-24 11:02   ` Ming Lei
2019-04-24 11:02 ` [PATCH V7 8/9] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
2019-04-24 11:02   ` Ming Lei
2019-04-24 11:02 ` [PATCH V7 9/9] nvme: hold request queue's refcount in ns's whole lifetime Ming Lei
2019-04-24 11:02   ` Ming Lei
2019-04-24 16:27   ` Christoph Hellwig
2019-04-24 16:27     ` Christoph Hellwig
2019-04-25  1:00     ` Ming Lei
2019-04-25  1:00       ` Ming Lei
2019-04-26 15:11       ` Christoph Hellwig
2019-04-26 15:11         ` Christoph Hellwig
2019-04-26 17:04         ` Bart Van Assche
2019-04-26 17:04           ` Bart Van Assche
2019-04-26 22:49           ` Ming Lei
2019-04-26 22:49             ` Ming Lei
2019-04-26 22:45         ` Ming Lei
2019-04-26 22:45           ` Ming Lei
2019-04-27  5:54           ` Christoph Hellwig
2019-04-27  5:54             ` Christoph Hellwig
