* [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
@ 2019-03-31  3:09 Ming Lei
  2019-03-31  3:09 ` [PATCH 1/5] blk-mq: re-organize blk_mq_exit_hctx() into two parts Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 39+ messages in thread
From: Ming Lei @ 2019-03-31  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, James Smart, Bart Van Assche, linux-scsi,
	Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley,
	jianchao wang

Hi,

Since 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), running the queue
isn't allowed during queue cleanup even though a queue refcount is held.

This change has caused lots of kernel oopses triggered in the run-queue
path, and it turns out they aren't easy to fix one by one.

So move the freeing of hw queue resources into the queue's release
handler; that fixes the above issue. Meanwhile, this approach is safe
because freeing the hw queue resources doesn't require using tags.

Thanks,

Ming Lei (5):
  blk-mq: re-organize blk_mq_exit_hctx() into two parts
  blk-mq: re-organize blk_mq_exit_hw_queues() into two parts
  blk-mq: free hw queues in queue's release handler
  block: don't drain in-progress dispatch in blk_cleanup_queue()
  SCSI: don't grab queue usage counter before run queue

 block/blk-core.c        | 14 +-------------
 block/blk-mq.c          | 32 +++++++++++++++++++++++++-------
 block/blk-mq.h          |  3 ++-
 drivers/scsi/scsi_lib.c |  7 -------
 4 files changed, 28 insertions(+), 28 deletions(-)

Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>

-- 
2.9.5


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/5] blk-mq: re-organize blk_mq_exit_hctx() into two parts
  2019-03-31  3:09 [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held Ming Lei
@ 2019-03-31  3:09 ` Ming Lei
  2019-04-01  1:40   ` Dongli Zhang
  2019-03-31  3:09 ` [PATCH 2/5] blk-mq: re-organize blk_mq_exit_hw_queues() " Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2019-03-31  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, James Smart, Bart Van Assche, linux-scsi,
	Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley,
	jianchao wang

This patch re-organizes blk_mq_exit_hctx() into two parts: one part
exits the hctx, and the other frees the hw queue.

No functional change; this just prepares for fixing an hctx lifetime
issue.

Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 70b210a308c4..53265ce45238 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2229,6 +2229,16 @@ static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 					    &hctx->cpuhp_dead);
 }
 
+static void blk_mq_free_hctx(struct request_queue *q,
+		struct blk_mq_hw_ctx *hctx)
+{
+	if (hctx->flags & BLK_MQ_F_BLOCKING)
+		cleanup_srcu_struct(hctx->srcu);
+
+	blk_free_flush_queue(hctx->fq);
+	sbitmap_free(&hctx->ctx_map);
+}
+
 /* hctx->ctxs will be freed in queue's release handler */
 static void blk_mq_exit_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
@@ -2243,12 +2253,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		cleanup_srcu_struct(hctx->srcu);
-
 	blk_mq_remove_cpuhp(hctx);
-	blk_free_flush_queue(hctx->fq);
-	sbitmap_free(&hctx->ctx_map);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2262,6 +2267,7 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 			break;
 		blk_mq_debugfs_unregister_hctx(hctx);
 		blk_mq_exit_hctx(q, set, hctx, i);
+		blk_mq_free_hctx(q, hctx);
 	}
 }
 
-- 
2.9.5



* [PATCH 2/5] blk-mq: re-organize blk_mq_exit_hw_queues() into two parts
  2019-03-31  3:09 [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held Ming Lei
  2019-03-31  3:09 ` [PATCH 1/5] blk-mq: re-organize blk_mq_exit_hctx() into two parts Ming Lei
@ 2019-03-31  3:09 ` Ming Lei
  2019-03-31  3:09 ` [PATCH 3/5] blk-mq: free hw queues in queue's release handler Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Ming Lei @ 2019-03-31  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, James Smart, Bart Van Assche, linux-scsi,
	Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley,
	jianchao wang

This patch re-organizes blk_mq_exit_hw_queues() into two parts: one
part really exits the hw queues, and the other frees them.

No functional change; this just prepares for fixing an hctx lifetime
issue.

Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 53265ce45238..a264d1967396 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2265,8 +2265,17 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (i == nr_queue)
 			break;
-		blk_mq_debugfs_unregister_hctx(hctx);
 		blk_mq_exit_hctx(q, set, hctx, i);
+	}
+}
+
+static void blk_mq_free_hw_queues(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		blk_mq_debugfs_unregister_hctx(hctx);
 		blk_mq_free_hctx(q, hctx);
 	}
 }
@@ -2893,6 +2902,7 @@ void blk_mq_free_queue(struct request_queue *q)
 
 	blk_mq_del_queue_tag_set(q);
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
+	blk_mq_free_hw_queues(q);
 }
 
 static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
-- 
2.9.5



* [PATCH 3/5] blk-mq: free hw queues in queue's release handler
  2019-03-31  3:09 [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held Ming Lei
  2019-03-31  3:09 ` [PATCH 1/5] blk-mq: re-organize blk_mq_exit_hctx() into two parts Ming Lei
  2019-03-31  3:09 ` [PATCH 2/5] blk-mq: re-organize blk_mq_exit_hw_queues() " Ming Lei
@ 2019-03-31  3:09 ` Ming Lei
  2019-03-31  3:09 ` [PATCH 4/5] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Ming Lei @ 2019-03-31  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, James Smart, Bart Van Assche, linux-scsi,
	Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley,
	jianchao wang, stable

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes exactly this issue.

However, that commit introduces another issue. Before 45a9c9d909b2,
we were allowed to run the queue during queue cleanup if the queue
refcount was held. After that commit, the queue can't be run during
cleanup; otherwise an oops can be triggered easily, because some fields
of the hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways of addressing this kind of issue before, such as:

	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But these still can't cover all cases; recently James reported another
issue of this kind:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue is quite hard to address the previous way, given that
scsi_run_queue() may requeue requests for other LUNs.

Fix the above issue by splitting blk_mq_free_queue() into two parts:
one part exits the hw queues, and the other frees them. The latter is
moved to the queue's release handler, which is safe because tags aren't
needed for freeing the hw queues.

Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 2 +-
 block/blk-mq.c   | 8 +++++---
 block/blk-mq.h   | 3 ++-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4673ebe42255..b3bbf8a5110d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -375,7 +375,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	blk_exit_queue(q);
 
 	if (queue_is_mq(q))
-		blk_mq_free_queue(q);
+		blk_mq_exit_queue(q);
 
 	percpu_ref_exit(&q->q_usage_counter);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a264d1967396..14a8db13ba73 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2269,7 +2269,7 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	}
 }
 
-static void blk_mq_free_hw_queues(struct request_queue *q)
+void blk_mq_free_hw_queues(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
@@ -2625,6 +2625,8 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
+	blk_mq_free_hw_queues(q);
+
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
@@ -2896,13 +2898,13 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 }
 EXPORT_SYMBOL(blk_mq_init_allocated_queue);
 
-void blk_mq_free_queue(struct request_queue *q)
+/* tags can _not_ be used after returning from blk_mq_exit_queue */
+void blk_mq_exit_queue(struct request_queue *q)
 {
 	struct blk_mq_tag_set	*set = q->tag_set;
 
 	blk_mq_del_queue_tag_set(q);
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
-	blk_mq_free_hw_queues(q);
 }
 
 static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 0ed8e5a8729f..46f8b14c8d1d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,7 +37,8 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-void blk_mq_free_queue(struct request_queue *q);
+void blk_mq_exit_queue(struct request_queue *q);
+void blk_mq_free_hw_queues(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
-- 
2.9.5



* [PATCH 4/5] block: don't drain in-progress dispatch in blk_cleanup_queue()
  2019-03-31  3:09 [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held Ming Lei
                   ` (2 preceding siblings ...)
  2019-03-31  3:09 ` [PATCH 3/5] blk-mq: free hw queues in queue's release handler Ming Lei
@ 2019-03-31  3:09 ` Ming Lei
  2019-04-01  1:50   ` Dongli Zhang
  2019-03-31  3:09 ` [PATCH 5/5] SCSI: don't grab queue usage counter before run queue Ming Lei
  2019-03-31 15:27 ` [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held Bart Van Assche
  5 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2019-03-31  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, James Smart, Bart Van Assche, linux-scsi,
	Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley,
	jianchao wang

Now that freeing dispatch resources has been moved to the queue's
release handler, we no longer need to worry about the race between
blk_cleanup_queue() and running the queue.

So don't drain in-progress dispatch in blk_cleanup_queue().

Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b3bbf8a5110d..491dc0295778 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -347,18 +347,6 @@ void blk_cleanup_queue(struct request_queue *q)
 
 	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
 
-	/*
-	 * make sure all in-progress dispatch are completed because
-	 * blk_freeze_queue() can only complete all requests, and
-	 * dispatch may still be in-progress since we dispatch requests
-	 * from more than one contexts.
-	 *
-	 * We rely on driver to deal with the race in case that queue
-	 * initialization isn't done.
-	 */
-	if (queue_is_mq(q) && blk_queue_init_done(q))
-		blk_mq_quiesce_queue(q);
-
 	/* for synchronous bio-based driver finish in-flight integrity i/o */
 	blk_flush_integrity();
 
-- 
2.9.5



* [PATCH 5/5] SCSI: don't grab queue usage counter before run queue
  2019-03-31  3:09 [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held Ming Lei
                   ` (3 preceding siblings ...)
  2019-03-31  3:09 ` [PATCH 4/5] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
@ 2019-03-31  3:09 ` Ming Lei
  2019-04-01  1:53   ` Dongli Zhang
  2019-03-31 15:27 ` [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held Bart Van Assche
  5 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2019-03-31  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, James Smart, Bart Van Assche, linux-scsi,
	Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley,
	jianchao wang

Now that the resources used for dispatch are freed in the queue's
release handler, we no longer need to worry about a possible race
between blk_cleanup_queue() and running the queue.

So don't grab the queue usage counter before running the queue in
scsi_end_request().

Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 601b9f1de267..18bf341d1236 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -604,12 +604,6 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
 	 */
 	scsi_mq_uninit_cmd(cmd);
 
-	/*
-	 * queue is still alive, so grab the ref for preventing it
-	 * from being cleaned up during running queue.
-	 */
-	percpu_ref_get(&q->q_usage_counter);
-
 	__blk_mq_end_request(req, error);
 
 	if (scsi_target(sdev)->single_lun ||
@@ -618,7 +612,6 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
 	else
 		blk_mq_run_hw_queues(q, true);
 
-	percpu_ref_put(&q->q_usage_counter);
 	put_device(&sdev->sdev_gendev);
 	return false;
 }
-- 
2.9.5



* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-03-31  3:09 [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held Ming Lei
                   ` (4 preceding siblings ...)
  2019-03-31  3:09 ` [PATCH 5/5] SCSI: don't grab queue usage counter before run queue Ming Lei
@ 2019-03-31 15:27 ` Bart Van Assche
  2019-04-01  2:00   ` Ming Lei
  5 siblings, 1 reply; 39+ messages in thread
From: Bart Van Assche @ 2019-03-31 15:27 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, James Smart, Bart Van Assche, linux-scsi,
	Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley,
	jianchao wang

On 3/30/19 8:09 PM, Ming Lei wrote:
> Since 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), run queue isn't
> allowed during cleanup queue even though queue refcount is held.
> 
> This change has caused lots of kernel oops triggered in run queue path,
> turns out it isn't easy to fix them all.
> 
> So move freeing of hw queue resources into queue's release handler, then
> the above issue is fixed. Meantime, this way is safe given freeing hw
> queue resource doesn't require to use tags.

I'm not sure the approach of this patch series is really the direction 
we should pursue. There are many block drivers that free resources 
immediately after blk_cleanup_queue() returns. An example from the brd 
driver:

static void brd_free(struct brd_device *brd)
{
	put_disk(brd->brd_disk);
	blk_cleanup_queue(brd->brd_queue);
	brd_free_pages(brd);
	kfree(brd);
}

I'd like to avoid having to modify all block drivers that free resources 
immediately after blk_cleanup_queue() has returned. Have you considered 
immediately after blk_cleanup_queue() has returned. Have you considered 
to modify blk_mq_run_hw_queues() such that it becomes safe to call that 
function while blk_cleanup_queue() is in progress, e.g. by inserting a 
percpu_ref_tryget_live(&q->q_usage_counter) / 
percpu_ref_put(&q->q_usage_counter) pair?

Thanks,

Bart.


* Re: [PATCH 1/5] blk-mq: re-organize blk_mq_exit_hctx() into two parts
  2019-03-31  3:09 ` [PATCH 1/5] blk-mq: re-organize blk_mq_exit_hctx() into two parts Ming Lei
@ 2019-04-01  1:40   ` Dongli Zhang
  2019-04-01  2:06     ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Dongli Zhang @ 2019-04-01  1:40 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang



On 3/31/19 11:09 AM, Ming Lei wrote:
> This patch re-organizes blk_mq_exit_hctx() into two parts, and
> one part is for exit hctx, and another part is for free hw queue.
> 
> No function change, just prepare for fixing hctx lifetime issue.
> 
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 70b210a308c4..53265ce45238 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2229,6 +2229,16 @@ static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
>  					    &hctx->cpuhp_dead);
>  }
>  
> +static void blk_mq_free_hctx(struct request_queue *q,
> +		struct blk_mq_hw_ctx *hctx)
> +{
> +	if (hctx->flags & BLK_MQ_F_BLOCKING)
> +		cleanup_srcu_struct(hctx->srcu);
> +
> +	blk_free_flush_queue(hctx->fq);
> +	sbitmap_free(&hctx->ctx_map);
> +}
> +
>  /* hctx->ctxs will be freed in queue's release handler */
>  static void blk_mq_exit_hctx(struct request_queue *q,
>  		struct blk_mq_tag_set *set,
> @@ -2243,12 +2253,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>  	if (set->ops->exit_hctx)
>  		set->ops->exit_hctx(hctx, hctx_idx);
>  
> -	if (hctx->flags & BLK_MQ_F_BLOCKING)
> -		cleanup_srcu_struct(hctx->srcu);
> -
>  	blk_mq_remove_cpuhp(hctx);
> -	blk_free_flush_queue(hctx->fq);
> -	sbitmap_free(&hctx->ctx_map);
>  }
>  
>  static void blk_mq_exit_hw_queues(struct request_queue *q,
> @@ -2262,6 +2267,7 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
>  			break;
>  		blk_mq_debugfs_unregister_hctx(hctx);
>  		blk_mq_exit_hctx(q, set, hctx, i);
> +		blk_mq_free_hctx(q, hctx);

Assuming this direction is still followed, what about
blk_mq_realloc_hw_ctxs(), which is another caller of blk_mq_exit_hctx(),
now that part of blk_mq_exit_hctx() has been extracted into
blk_mq_free_hctx()?

Is an additional blk_mq_free_hctx() call missing there?

Dongli Zhang

>  	}
>  }
>  
> 


* Re: [PATCH 4/5] block: don't drain in-progress dispatch in blk_cleanup_queue()
  2019-03-31  3:09 ` [PATCH 4/5] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
@ 2019-04-01  1:50   ` Dongli Zhang
  2019-04-01  2:08     ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Dongli Zhang @ 2019-04-01  1:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang



On 3/31/19 11:09 AM, Ming Lei wrote:
> Now freeing dispatch resource is moved to queue's release handler,
> we don't need to worry about the race between blk_cleanup_queue and
> run queue any more.
> 
> So don't drain in-progress dispatch in blk_cleanup_queue().

Unless this direction is abandoned, how about mentioning that this is a
revert of two prior fixes (which are no longer required), so that people
working on backports can track the issue more easily:

c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")
1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")

Thank you very much!

Dongli Zhang

> 
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c | 12 ------------
>  1 file changed, 12 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b3bbf8a5110d..491dc0295778 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -347,18 +347,6 @@ void blk_cleanup_queue(struct request_queue *q)
>  
>  	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
>  
> -	/*
> -	 * make sure all in-progress dispatch are completed because
> -	 * blk_freeze_queue() can only complete all requests, and
> -	 * dispatch may still be in-progress since we dispatch requests
> -	 * from more than one contexts.
> -	 *
> -	 * We rely on driver to deal with the race in case that queue
> -	 * initialization isn't done.
> -	 */
> -	if (queue_is_mq(q) && blk_queue_init_done(q))
> -		blk_mq_quiesce_queue(q);
> -
>  	/* for synchronous bio-based driver finish in-flight integrity i/o */
>  	blk_flush_integrity();
>  
> 


* Re: [PATCH 5/5] SCSI: don't grab queue usage counter before run queue
  2019-03-31  3:09 ` [PATCH 5/5] SCSI: don't grab queue usage counter before run queue Ming Lei
@ 2019-04-01  1:53   ` Dongli Zhang
  0 siblings, 0 replies; 39+ messages in thread
From: Dongli Zhang @ 2019-04-01  1:53 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang



On 3/31/19 11:09 AM, Ming Lei wrote:
> Now resources for dispatch is freed in queue's release handler,
> we don't need to worry about the possible race between blk_cleanup_queue
> and run queue.
> 
> So don't grab the queue usage counter before run queue in
> scsi_end_request().

Similar to the prior one.

If something similar is done in the future, can we mention "this is a
revert of 8dc765d438f1 ("SCSI: fix queue cleanup race before queue
initialization is done")"......

Thank you very much!

Dongli Zhang

> 
> Cc: James Smart <james.smart@broadcom.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> Cc: jianchao wang <jianchao.w.wang@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/scsi_lib.c | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 601b9f1de267..18bf341d1236 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -604,12 +604,6 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
>  	 */
>  	scsi_mq_uninit_cmd(cmd);
>  
> -	/*
> -	 * queue is still alive, so grab the ref for preventing it
> -	 * from being cleaned up during running queue.
> -	 */
> -	percpu_ref_get(&q->q_usage_counter);
> -
>  	__blk_mq_end_request(req, error);
>  
>  	if (scsi_target(sdev)->single_lun ||
> @@ -618,7 +612,6 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
>  	else
>  		blk_mq_run_hw_queues(q, true);
>  
> -	percpu_ref_put(&q->q_usage_counter);
>  	put_device(&sdev->sdev_gendev);
>  	return false;
>  }
> 


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-03-31 15:27 ` [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held Bart Van Assche
@ 2019-04-01  2:00   ` Ming Lei
  2019-04-01  2:39     ` Bart Van Assche
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2019-04-01  2:00 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
> On 3/30/19 8:09 PM, Ming Lei wrote:
> > Since 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), run queue isn't
> > allowed during cleanup queue even though queue refcount is held.
> > 
> > This change has caused lots of kernel oops triggered in run queue path,
> > turns out it isn't easy to fix them all.
> > 
> > So move freeing of hw queue resources into queue's release handler, then
> > the above issue is fixed. Meantime, this way is safe given freeing hw
> > queue resource doesn't require to use tags.
> 
> I'm not sure the approach of this patch series is really the direction we
> should pursue. There are many block driver that free resources immediately

Please see scsi_run_queue(): the queue refcount is always held before
running the queue. In other words, this approach has been working well
for a long time. However, 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
breaks this behaviour and causes kernel oopses.

> after blk_cleanup_queue() returns. An example from the brd driver:
> 
> static void brd_free(struct brd_device *brd)
> {
> 	put_disk(brd->brd_disk);
> 	blk_cleanup_queue(brd->brd_queue);
> 	brd_free_pages(brd);
> 	kfree(brd);
> }

Once blk_cleanup_queue() has returned, and especially once the queue is
frozen, no request can be queued to the driver any more, so freeing
driver resources isn't a problem: we won't call into the driver again
in this situation.

This patchset fixes the race between cleaning up the queue and running
the queue:

1) a run queue may start before the queue is frozen, and
blk_mq_run_hw_queues() may not have finished when blk_freeze_queue()
returns.

2) a run queue may happen after blk_freeze_queue() returns.

> 
> I'd like to avoid having to modify all block drivers that free resources
> immediately after blk_cleanup_queue() has returned. Have you considered to
> modify blk_mq_run_hw_queues() such that it becomes safe to call that
> function while blk_cleanup_queue() is in progress, e.g. by inserting a
> percpu_ref_tryget_live(&q->q_usage_counter) /
> percpu_ref_put(&q->q_usage_counter) pair?

That can't work, because blk_mq_run_hw_queues() may be called after
percpu_ref_exit() has been done.

However, if we moved percpu_ref_exit() into the queue's release handler,
we wouldn't need to grab q->q_usage_counter in blk_mq_run_hw_queues()
at all, and we would still have to free hw queue resources in the
queue's release handler, which is exactly what this patchset does.

In short, taking q->q_usage_counter doesn't make a difference for this
issue.

Thanks,
Ming


* Re: [PATCH 1/5] blk-mq: re-organize blk_mq_exit_hctx() into two parts
  2019-04-01  1:40   ` Dongli Zhang
@ 2019-04-01  2:06     ` Ming Lei
  0 siblings, 0 replies; 39+ messages in thread
From: Ming Lei @ 2019-04-01  2:06 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Mon, Apr 01, 2019 at 09:40:32AM +0800, Dongli Zhang wrote:
> 
> 
> On 3/31/19 11:09 AM, Ming Lei wrote:
> > This patch re-organizes blk_mq_exit_hctx() into two parts, and
> > one part is for exit hctx, and another part is for free hw queue.
> > 
> > No function change, just prepare for fixing hctx lifetime issue.
> > 
> > Cc: James Smart <james.smart@broadcom.com>
> > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > Cc: linux-scsi@vger.kernel.org,
> > Cc: Martin K . Petersen <martin.petersen@oracle.com>,
> > Cc: Christoph Hellwig <hch@lst.de>,
> > Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
> > Cc: jianchao wang <jianchao.w.wang@oracle.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq.c | 16 +++++++++++-----
> >  1 file changed, 11 insertions(+), 5 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 70b210a308c4..53265ce45238 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2229,6 +2229,16 @@ static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
> >  					    &hctx->cpuhp_dead);
> >  }
> >  
> > +static void blk_mq_free_hctx(struct request_queue *q,
> > +		struct blk_mq_hw_ctx *hctx)
> > +{
> > +	if (hctx->flags & BLK_MQ_F_BLOCKING)
> > +		cleanup_srcu_struct(hctx->srcu);
> > +
> > +	blk_free_flush_queue(hctx->fq);
> > +	sbitmap_free(&hctx->ctx_map);
> > +}
> > +
> >  /* hctx->ctxs will be freed in queue's release handler */
> >  static void blk_mq_exit_hctx(struct request_queue *q,
> >  		struct blk_mq_tag_set *set,
> > @@ -2243,12 +2253,7 @@ static void blk_mq_exit_hctx(struct request_queue *q,
> >  	if (set->ops->exit_hctx)
> >  		set->ops->exit_hctx(hctx, hctx_idx);
> >  
> > -	if (hctx->flags & BLK_MQ_F_BLOCKING)
> > -		cleanup_srcu_struct(hctx->srcu);
> > -
> >  	blk_mq_remove_cpuhp(hctx);
> > -	blk_free_flush_queue(hctx->fq);
> > -	sbitmap_free(&hctx->ctx_map);
> >  }
> >  
> >  static void blk_mq_exit_hw_queues(struct request_queue *q,
> > @@ -2262,6 +2267,7 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
> >  			break;
> >  		blk_mq_debugfs_unregister_hctx(hctx);
> >  		blk_mq_exit_hctx(q, set, hctx, i);
> > +		blk_mq_free_hctx(q, hctx);
> 
> Assuming if this direction is still followed, how about blk_mq_realloc_hw_ctxs()
> which is another caller of blk_mq_exit_hctx(), while part of blk_mq_exit_hctx()
> is extracted and encapsulated into blk_mq_free_hctx()?
> 
> Additional blk_mq_free_hctx() is missing?

Good catch!

The other two blk_mq_exit_hctx() calls happen just before kobject_put(&hctx->kobj),
so we can simply move blk_mq_free_hctx() into the hctx's release handler
(blk_mq_hw_sysfs_release), which simplifies things as well.
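A rough sketch of that direction, assuming the existing blk_mq_hw_sysfs_release()
handler in block/blk-mq-sysfs.c (illustrative only, not compile-tested against
any particular tree):

```c
/* Sketch: fold the blk_mq_free_hctx() work into the hctx kobject's release
 * handler, so the resources are freed exactly once, when the last reference
 * to the hctx is dropped, regardless of which blk_mq_exit_hctx() caller ran. */
static void blk_mq_hw_sysfs_release(struct kobject *kobj)
{
	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
						  kobj);

	if (hctx->flags & BLK_MQ_F_BLOCKING)
		cleanup_srcu_struct(hctx->srcu);
	blk_free_flush_queue(hctx->fq);
	sbitmap_free(&hctx->ctx_map);
	free_cpumask_var(hctx->cpumask);
	kfree(hctx->ctxs);
	kfree(hctx);
}
```

With this, the separate blk_mq_free_hctx() call sites become unnecessary.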

Thanks,
Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] block: don't drain in-progress dispatch in blk_cleanup_queue()
  2019-04-01  1:50   ` Dongli Zhang
@ 2019-04-01  2:08     ` Ming Lei
  0 siblings, 0 replies; 39+ messages in thread
From: Ming Lei @ 2019-04-01  2:08 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Mon, Apr 01, 2019 at 09:50:12AM +0800, Dongli Zhang wrote:
> 
> 
> On 3/31/19 11:09 AM, Ming Lei wrote:
> > Now freeing dispatch resource is moved to queue's release handler,
> > we don't need to worry about the race between blk_cleanup_queue and
> > run queue any more.
> > 
> > So don't drain in-progress dispatch in blk_cleanup_queue().
> 
> If this direction is followed, how about mentioning that this is a revert of
> two prior fixes (no longer required), so that people working on backports can
> track the issue more easily.
> 
> c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")
> 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")

OK, will mention that in V2.

Thanks,
Ming


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  2:00   ` Ming Lei
@ 2019-04-01  2:39     ` Bart Van Assche
  2019-04-01  2:44       ` jianchao.wang
                         ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Bart Van Assche @ 2019-04-01  2:39 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On 3/31/19 7:00 PM, Ming Lei wrote:
> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
>> I'm not sure the approach of this patch series is really the direction we
>> should pursue. There are many block driver that free resources immediately
> 
> Please see scsi_run_queue(), and the queue refcount is always held
> before run queue.

That's not correct. There is no guarantee that q->q_usage_counter > 0 
when scsi_run_queue() is called from inside scsi_requeue_run_queue().

>> I'd like to avoid having to modify all block drivers that free resources
>> immediately after blk_cleanup_queue() has returned. Have you considered to
>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
>> percpu_ref_tryget_live(&q->q_usage_counter) /
>> percpu_ref_put(&q->q_usage_counter) pair?
> 
> It can't work because blk_mq_run_hw_queues may happen after
> percpu_ref_exit() is done.
> 
> However, if we move percpu_ref_exit() into queue's release handler, we
> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
> and we still have to free hw queue resources in queue's release handler,
> that is exactly what this patchset is doing.
> 
> In short, getting q->q_usage_counter doesn't make a difference on this
> issue.

percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" 
state. percpu_ref_kill() changes the state of a per-cpu counter to the 
"dead" state. blk_freeze_queue_start() calls percpu_ref_kill(). 
blk_cleanup_queue() already calls blk_set_queue_dying() and that last 
function calls blk_freeze_queue_start(). So I think that what you wrote 
is not correct and that inserting a 
percpu_ref_tryget_live()/percpu_ref_put() pair in blk_mq_run_hw_queues() 
or blk_mq_run_hw_queue() would make a difference and also that moving 
the percpu_ref_exit() call into blk_release_queue() makes sense.

Bart.


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  2:39     ` Bart Van Assche
@ 2019-04-01  2:44       ` jianchao.wang
  2019-04-02 18:07         ` Bart Van Assche
  2019-04-01  2:52       ` Ming Lei
  2019-04-01  3:27       ` Ming Lei
  2 siblings, 1 reply; 39+ messages in thread
From: jianchao.wang @ 2019-04-01  2:44 UTC (permalink / raw)
  To: Bart Van Assche, Ming Lei
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley



On 4/1/19 10:39 AM, Bart Van Assche wrote:
> On 3/31/19 7:00 PM, Ming Lei wrote:
>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
>>> I'm not sure the approach of this patch series is really the direction we
>>> should pursue. There are many block driver that free resources immediately
>>
>> Please see scsi_run_queue(), and the queue refcount is always held
>> before run queue.
> 
> That's not correct. There is no guarantee that q->q_usage_counter > 0 when scsi_run_queue() is called from inside scsi_requeue_run_queue().
> 
>>> I'd like to avoid having to modify all block drivers that free resources
>>> immediately after blk_cleanup_queue() has returned. Have you considered to
>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
>>> percpu_ref_tryget_live(&q->q_usage_counter) /
>>> percpu_ref_put(&q->q_usage_counter) pair?
>>
>> It can't work because blk_mq_run_hw_queues may happen after
>> percpu_ref_exit() is done.
>>
>> However, if we move percpu_ref_exit() into queue's release handler, we
>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
>> and we still have to free hw queue resources in queue's release handler,
>> that is exactly what this patchset is doing.
>>
>> In short, getting q->q_usage_counter doesn't make a difference on this
>> issue.
> 
> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state. percpu_ref_kill() changes the state of a per-cpu counter to the "dead" state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() already calls blk_set_queue_dying() and that last function calls blk_freeze_queue_start(). So I think that what you wrote is not correct and that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and also that moving the percpu_ref_exit() call into blk_release_queue() makes sense.
> 

percpu_ref_tryget() would be better here, so that pending requests can still be issued while the queue is frozen.

Thanks
Jianchao


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  2:39     ` Bart Van Assche
  2019-04-01  2:44       ` jianchao.wang
@ 2019-04-01  2:52       ` Ming Lei
  2019-04-01  3:25         ` jianchao.wang
  2019-04-01  5:05         ` Dongli Zhang
  2019-04-01  3:27       ` Ming Lei
  2 siblings, 2 replies; 39+ messages in thread
From: Ming Lei @ 2019-04-01  2:52 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
> On 3/31/19 7:00 PM, Ming Lei wrote:
> > On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
> > > I'm not sure the approach of this patch series is really the direction we
> > > should pursue. There are many block driver that free resources immediately
> > 
> > Please see scsi_run_queue(), and the queue refcount is always held
> > before run queue.
> 
> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
> scsi_run_queue() is called from inside scsi_requeue_run_queue().

We don't need the guarantee of 'q->q_usage_counter > 0', I mean the
queue's kobj reference counter.

What we need is to allow run queue to work correctly after queue is frozen
or cleaned up.

> 
> > > I'd like to avoid having to modify all block drivers that free resources
> > > immediately after blk_cleanup_queue() has returned. Have you considered to
> > > modify blk_mq_run_hw_queues() such that it becomes safe to call that
> > > function while blk_cleanup_queue() is in progress, e.g. by inserting a
> > > percpu_ref_tryget_live(&q->q_usage_counter) /
> > > percpu_ref_put(&q->q_usage_counter) pair?
> > 
> > It can't work because blk_mq_run_hw_queues may happen after
> > percpu_ref_exit() is done.
> > 
> > However, if we move percpu_ref_exit() into queue's release handler, we
> > don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
> > and we still have to free hw queue resources in queue's release handler,
> > that is exactly what this patchset is doing.
> > 
> > In short, getting q->q_usage_counter doesn't make a difference on this
> > issue.
> 
> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> already calls blk_set_queue_dying() and that last function calls
> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> sense.

If percpu_ref_exit() is moved to blk_release_queue(), we still need to
move freeing of hw queue's resource into blk_release_queue() like what
the patchset is doing.

Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
do we?


Thanks,
Ming


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  2:52       ` Ming Lei
@ 2019-04-01  3:25         ` jianchao.wang
  2019-04-01  3:28           ` Ming Lei
  2019-04-01  5:05         ` Dongli Zhang
  1 sibling, 1 reply; 39+ messages in thread
From: jianchao.wang @ 2019-04-01  3:25 UTC (permalink / raw)
  To: Ming Lei, Bart Van Assche
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley

Hi Ming

On 4/1/19 10:52 AM, Ming Lei wrote:
>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>> already calls blk_set_queue_dying() and that last function calls
>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>> sense.
> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> move freeing of hw queue's resource into blk_release_queue() like what
> the patchset is doing.
> 
> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
> do we?

IMO, if we could find a way to prevent any attempt to run the queue, it would
be better and clearer.

Thanks
Jianchao


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  2:39     ` Bart Van Assche
  2019-04-01  2:44       ` jianchao.wang
  2019-04-01  2:52       ` Ming Lei
@ 2019-04-01  3:27       ` Ming Lei
  2019-04-01  3:32         ` jianchao.wang
  2 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2019-04-01  3:27 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley, jianchao wang

On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
> On 3/31/19 7:00 PM, Ming Lei wrote:
> > On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
> > > I'm not sure the approach of this patch series is really the direction we
> > > should pursue. There are many block driver that free resources immediately
> > 
> > Please see scsi_run_queue(), and the queue refcount is always held
> > before run queue.
> 
> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
> scsi_run_queue() is called from inside scsi_requeue_run_queue().
> 
> > > I'd like to avoid having to modify all block drivers that free resources
> > > immediately after blk_cleanup_queue() has returned. Have you considered to
> > > modify blk_mq_run_hw_queues() such that it becomes safe to call that
> > > function while blk_cleanup_queue() is in progress, e.g. by inserting a
> > > percpu_ref_tryget_live(&q->q_usage_counter) /
> > > percpu_ref_put(&q->q_usage_counter) pair?
> > 
> > It can't work because blk_mq_run_hw_queues may happen after
> > percpu_ref_exit() is done.
> > 
> > However, if we move percpu_ref_exit() into queue's release handler, we
> > don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
> > and we still have to free hw queue resources in queue's release handler,
> > that is exactly what this patchset is doing.
> > 
> > In short, getting q->q_usage_counter doesn't make a difference on this
> > issue.
> 
> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> already calls blk_set_queue_dying() and that last function calls
> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> sense.

This approach can easily cause a deadlock!

If percpu_ref_tryget_live() is called at the entry of blk_mq_run_hw_queues()
while blk_freeze_queue_start() is called at the same time, percpu_ref_tryget_live()
will fail, the run-queue path can't make progress, and blk_mq_freeze_queue_wait()
will wait forever.

Thanks,
Ming


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  3:25         ` jianchao.wang
@ 2019-04-01  3:28           ` Ming Lei
  2019-04-01  9:19             ` jianchao.wang
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2019-04-01  3:28 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Mon, Apr 01, 2019 at 11:25:50AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 4/1/19 10:52 AM, Ming Lei wrote:
> >> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> >> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> >> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> >> already calls blk_set_queue_dying() and that last function calls
> >> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> >> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> >> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> >> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> >> sense.
> > If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> > move freeing of hw queue's resource into blk_release_queue() like what
> > the patchset is doing.
> > 
> > Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
> > do we?
> 
> IMO, if we could find a way to prevent any attempt to run the queue, it would
> be better and clearer.

It is hard to do it that way, and it isn't necessary.

I will post V2 soon for review.

Thanks,
Ming


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  3:27       ` Ming Lei
@ 2019-04-01  3:32         ` jianchao.wang
  0 siblings, 0 replies; 39+ messages in thread
From: jianchao.wang @ 2019-04-01  3:32 UTC (permalink / raw)
  To: Ming Lei, Bart Van Assche
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley



On 4/1/19 11:27 AM, Ming Lei wrote:
> On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
>> On 3/31/19 7:00 PM, Ming Lei wrote:
>>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
>>>> I'm not sure the approach of this patch series is really the direction we
>>>> should pursue. There are many block driver that free resources immediately
>>>
>>> Please see scsi_run_queue(), and the queue refcount is always held
>>> before run queue.
>>
>> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
>> scsi_run_queue() is called from inside scsi_requeue_run_queue().
>>
>>>> I'd like to avoid having to modify all block drivers that free resources
>>>> immediately after blk_cleanup_queue() has returned. Have you considered to
>>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
>>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
>>>> percpu_ref_tryget_live(&q->q_usage_counter) /
>>>> percpu_ref_put(&q->q_usage_counter) pair?
>>>
>>> It can't work because blk_mq_run_hw_queues may happen after
>>> percpu_ref_exit() is done.
>>>
>>> However, if we move percpu_ref_exit() into queue's release handler, we
>>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
>>> and we still have to free hw queue resources in queue's release handler,
>>> that is exactly what this patchset is doing.
>>>
>>> In short, getting q->q_usage_counter doesn't make a difference on this
>>> issue.
>>
>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>> already calls blk_set_queue_dying() and that last function calls
>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>> sense.
> 
> This way is easy to cause deadlock!!!
> 
> If percpu_ref_tryget_live() is called in the entry of blk_mq_run_hw_queues(), 
> at the same time, blk_freeze_queue_start() is called, then percpu_ref_tryget_live()
> will fail, and run queue can't move on, then blk_mq_freeze_queue_wait() will wait
> forever.
> 

percpu_ref_tryget() would be better here, so that pending requests can still be issued while the queue is frozen.

Thanks
Jianchao


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  2:52       ` Ming Lei
  2019-04-01  3:25         ` jianchao.wang
@ 2019-04-01  5:05         ` Dongli Zhang
  2019-04-01  5:16           ` Ming Lei
  1 sibling, 1 reply; 39+ messages in thread
From: Dongli Zhang @ 2019-04-01  5:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang



On 4/1/19 10:52 AM, Ming Lei wrote:
> On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
>> On 3/31/19 7:00 PM, Ming Lei wrote:
>>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
>>>> I'm not sure the approach of this patch series is really the direction we
>>>> should pursue. There are many block driver that free resources immediately
>>>
>>> Please see scsi_run_queue(), and the queue refcount is always held
>>> before run queue.
>>
>> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
>> scsi_run_queue() is called from inside scsi_requeue_run_queue().
> 
> We don't need the guarantee of 'q->q_usage_counter > 0', I mean the
> queue's kobj reference counter.
> 
> What we need is to allow run queue to work correctly after queue is frozen
> or cleaned up.
> 
>>
>>>> I'd like to avoid having to modify all block drivers that free resources
>>>> immediately after blk_cleanup_queue() has returned. Have you considered to
>>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
>>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
>>>> percpu_ref_tryget_live(&q->q_usage_counter) /
>>>> percpu_ref_put(&q->q_usage_counter) pair?
>>>
>>> It can't work because blk_mq_run_hw_queues may happen after
>>> percpu_ref_exit() is done.
>>>
>>> However, if we move percpu_ref_exit() into queue's release handler, we
>>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
>>> and we still have to free hw queue resources in queue's release handler,
>>> that is exactly what this patchset is doing.
>>>
>>> In short, getting q->q_usage_counter doesn't make a difference on this
>>> issue.
>>
>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>> already calls blk_set_queue_dying() and that last function calls
>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>> sense.
> 
> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> move freeing of hw queue's resource into blk_release_queue() like what
> the patchset is doing.

Hi Ming,

Would you mind helping to explain why we still need to move freeing of the hw
queue's resources into blk_release_queue(), as this patchset does?

Let's assume there is no deadlock when percpu_ref_tryget_live() is used:
blk_mq_run_hw_queues() would not be able to move forward, since
__PERCPU_REF_DEAD is already set. Why do we still need to move freeing of the
hw queue's resources into blk_release_queue()?

Thank you very much!

Dongli Zhang

> 
> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
> do we?
> 
> 
> Thanks,
> Ming
> 


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  5:05         ` Dongli Zhang
@ 2019-04-01  5:16           ` Ming Lei
  2019-04-01  5:30             ` Dongli Zhang
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2019-04-01  5:16 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

Hi Dongli,

On Mon, Apr 01, 2019 at 01:05:46PM +0800, Dongli Zhang wrote:
> 
> 
> On 4/1/19 10:52 AM, Ming Lei wrote:
> > On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
> >> On 3/31/19 7:00 PM, Ming Lei wrote:
> >>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
> >>>> I'm not sure the approach of this patch series is really the direction we
> >>>> should pursue. There are many block driver that free resources immediately
> >>>
> >>> Please see scsi_run_queue(), and the queue refcount is always held
> >>> before run queue.
> >>
> >> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
> >> scsi_run_queue() is called from inside scsi_requeue_run_queue().
> > 
> > We don't need the guarantee of 'q->q_usage_counter > 0', I mean the
> > queue's kobj reference counter.
> > 
> > What we need is to allow run queue to work correctly after queue is frozen
> > or cleaned up.
> > 
> >>
> >>>> I'd like to avoid having to modify all block drivers that free resources
> >>>> immediately after blk_cleanup_queue() has returned. Have you considered to
> >>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
> >>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
> >>>> percpu_ref_tryget_live(&q->q_usage_counter) /
> >>>> percpu_ref_put(&q->q_usage_counter) pair?
> >>>
> >>> It can't work because blk_mq_run_hw_queues may happen after
> >>> percpu_ref_exit() is done.
> >>>
> >>> However, if we move percpu_ref_exit() into queue's release handler, we
> >>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
> >>> and we still have to free hw queue resources in queue's release handler,
> >>> that is exactly what this patchset is doing.
> >>>
> >>> In short, getting q->q_usage_counter doesn't make a difference on this
> >>> issue.
> >>
> >> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> >> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> >> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> >> already calls blk_set_queue_dying() and that last function calls
> >> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> >> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> >> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> >> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> >> sense.
> > 
> > If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> > move freeing of hw queue's resource into blk_release_queue() like what
> > the patchset is doing.
> 
> Hi Ming,
> 
> Would you mind helping to explain why we still need to move freeing of the hw
> queue's resources into blk_release_queue(), as this patchset does?
> 
> Let's assume there is no deadlock when percpu_ref_tryget_live() is used:

Could you explain why that assumption holds?

We have to run the queue after starting to freeze it, in order to drain the
allocated requests and make forward progress. Inside blk_freeze_queue_start(),
percpu_ref_kill() marks this ref as DEAD, so percpu_ref_tryget_live() returns
false and the queue won't be run.


Thanks, 
Ming


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  5:16           ` Ming Lei
@ 2019-04-01  5:30             ` Dongli Zhang
  2019-04-01  7:15               ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Dongli Zhang @ 2019-04-01  5:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang



On 4/1/19 1:16 PM, Ming Lei wrote:
> Hi Dongli,
> 
> On Mon, Apr 01, 2019 at 01:05:46PM +0800, Dongli Zhang wrote:
>>
>>
>> On 4/1/19 10:52 AM, Ming Lei wrote:
>>> On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
>>>> On 3/31/19 7:00 PM, Ming Lei wrote:
>>>>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
>>>>>> I'm not sure the approach of this patch series is really the direction we
>>>>>> should pursue. There are many block driver that free resources immediately
>>>>>
>>>>> Please see scsi_run_queue(), and the queue refcount is always held
>>>>> before run queue.
>>>>
>>>> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
>>>> scsi_run_queue() is called from inside scsi_requeue_run_queue().
>>>
>>> We don't need the guarantee of 'q->q_usage_counter > 0', I mean the
>>> queue's kobj reference counter.
>>>
>>> What we need is to allow run queue to work correctly after queue is frozen
>>> or cleaned up.
>>>
>>>>
>>>>>> I'd like to avoid having to modify all block drivers that free resources
>>>>>> immediately after blk_cleanup_queue() has returned. Have you considered to
>>>>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
>>>>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
>>>>>> percpu_ref_tryget_live(&q->q_usage_counter) /
>>>>>> percpu_ref_put(&q->q_usage_counter) pair?
>>>>>
>>>>> It can't work because blk_mq_run_hw_queues may happen after
>>>>> percpu_ref_exit() is done.
>>>>>
>>>>> However, if we move percpu_ref_exit() into queue's release handler, we
>>>>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
>>>>> and we still have to free hw queue resources in queue's release handler,
>>>>> that is exactly what this patchset is doing.
>>>>>
>>>>> In short, getting q->q_usage_counter doesn't make a difference on this
>>>>> issue.
>>>>
>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>>>> already calls blk_set_queue_dying() and that last function calls
>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>>>> sense.
>>>
>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
>>> move freeing of hw queue's resource into blk_release_queue() like what
>>> the patchset is doing.
>>
>> Hi Ming,
>>
>> Would you mind helping to explain why we still need to move freeing of the
>> hw queue's resources into blk_release_queue(), as this patchset does?
>>
>> Let's assume there is no deadlock when percpu_ref_tryget_live() is used:
> 
> Could you explain why that assumption holds?
> 
> We have to run the queue after starting to freeze it, in order to drain the
> allocated requests and make forward progress. Inside blk_freeze_queue_start(),
> percpu_ref_kill() marks this ref as DEAD, so percpu_ref_tryget_live() returns
> false and the queue won't be run.

Hi Ming,

I understand the assumption is invalid and that there is an issue when using
percpu_ref_tryget_live(). I also understand we have to run the queue after
starting to freeze it, to drain the allocated requests and make forward
progress.


I am just wondering specifically why "If percpu_ref_exit() is moved to
blk_release_queue(), we still need to move freeing of hw queue's resource into
blk_release_queue() like what the patchset is doing", given Bart's statement
below:

"percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
percpu_ref_kill() changes the state of a per-cpu counter to the "dead" state.
blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() already
calls blk_set_queue_dying() and that last function calls
blk_freeze_queue_start(). So I think that what you wrote is not correct and that
inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and also
that moving the percpu_ref_exit() call into blk_release_queue() makes sense."

That is, what is the penalty if we do not move freeing of the hw queue's
resources into blk_release_queue(), as the patchset does, in the situation above?

I ask this question just because I would like to understand the source code
better. Does "hw queue's resource" refer to the code below?

+        if (hctx->flags & BLK_MQ_F_BLOCKING)
+                cleanup_srcu_struct(hctx->srcu);
+        blk_free_flush_queue(hctx->fq);
+        sbitmap_free(&hctx->ctx_map);

Thank you very much!

Dongli Zhang


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  5:30             ` Dongli Zhang
@ 2019-04-01  7:15               ` Ming Lei
  2019-04-02  2:10                 ` Dongli Zhang
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2019-04-01  7:15 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Ming Lei, Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, Linux SCSI List, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Mon, Apr 1, 2019 at 1:27 PM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>
>
>
> On 4/1/19 1:16 PM, Ming Lei wrote:
> > Hi Dongli,
> >
> > On Mon, Apr 01, 2019 at 01:05:46PM +0800, Dongli Zhang wrote:
> >>
> >>
> >> On 4/1/19 10:52 AM, Ming Lei wrote:
> >>> On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
> >>>> On 3/31/19 7:00 PM, Ming Lei wrote:
> >>>>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
> >>>>>> I'm not sure the approach of this patch series is really the direction we
> >>>>>> should pursue. There are many block driver that free resources immediately
> >>>>>
> >>>>> Please see scsi_run_queue(), and the queue refcount is always held
> >>>>> before run queue.
> >>>>
> >>>> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
> >>>> scsi_run_queue() is called from inside scsi_requeue_run_queue().
> >>>
> >>> We don't need the guarantee of 'q->q_usage_counter > 0', I mean the
> >>> queue's kobj reference counter.
> >>>
> >>> What we need is to allow run queue to work correctly after queue is frozen
> >>> or cleaned up.
> >>>
> >>>>
> >>>>>> I'd like to avoid having to modify all block drivers that free resources
> >>>>>> immediately after blk_cleanup_queue() has returned. Have you considered to
> >>>>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
> >>>>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
> >>>>>> percpu_ref_tryget_live(&q->q_usage_counter) /
> >>>>>> percpu_ref_put(&q->q_usage_counter) pair?
> >>>>>
> >>>>> It can't work because blk_mq_run_hw_queues may happen after
> >>>>> percpu_ref_exit() is done.
> >>>>>
> >>>>> However, if we move percpu_ref_exit() into queue's release handler, we
> >>>>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
> >>>>> and we still have to free hw queue resources in queue's release handler,
> >>>>> that is exactly what this patchset is doing.
> >>>>>
> >>>>> In short, getting q->q_usage_counter doesn't make a difference on this
> >>>>> issue.
> >>>>
> >>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> >>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> >>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> >>>> already calls blk_set_queue_dying() and that last function calls
> >>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> >>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> >>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> >>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> >>>> sense.
> >>>
> >>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> >>> move freeing of hw queue's resource into blk_release_queue() like what
> >>> the patchset is doing.
> >>
> >> Hi Ming,
> >>
> >> Would you mind help explain why we still need to move freeing of hw queue's
> >> resource into blk_release_queue() like what the patchset is doing?
> >>
> >> Let's assume there is no deadlock when percpu_ref_tryget_live() is used,
> >
> > Could you explain why the assumption is true?
> >
> > We have to run queue after starting to freeze queue for draining
> > allocated requests and making forward progress. Inside blk_freeze_queue_start(),
> > percpu_ref_kill() marks this ref as DEAD, then percpu_ref_tryget_live() returns
> > false, then queue won't be run.
>
> Hi Ming,
>
> I understand the assumption is invalid and there is issue when using
> percpu_ref_tryget_live. And I also understand we have to run queue after
> starting to freeze queue for draining allocated requests and making forward
> progress.

OK.

>
>
> I am just wondering specifically on why "If percpu_ref_exit() is moved to
> blk_release_queue(), we still need to move freeing of hw queue's resource into
> blk_release_queue() like what the patchset is doing." based on below Bart's
> statement:
>
> "percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> percpu_ref_kill() changes the state of a per-cpu counter to the "dead" state.
> blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() already
> calls blk_set_queue_dying() and that last function call
> blk_freeze_queue_start(). So I think that what you wrote is not correct and that
> inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and also
> that moving the percpu_ref_exit() call into blk_release_queue() makes sense."

As you mentioned, percpu_ref_tryget_live() can't be used to prevent running the
queue during cleanup, so a run-queue attempt can still arrive while the queue is
being cleaned up.

>
> That's is, what is penalty if we do not  move freeing of hw queue's resource
> into blk_release_queue() like what the patchset is doing in above situation?

A kernel oops, as reported by James: some fields of the hctx are still used by
run queue, and they can already have been freed by blk_mq_free_queue() in
blk_cleanup_queue().

>
> I ask this question just because I would like to better understand the source
> code. Does "hw queue's resource" indicate the below?
>
> +        if (hctx->flags & BLK_MQ_F_BLOCKING)
> +                cleanup_srcu_struct(hctx->srcu);
> +        blk_free_flush_queue(hctx->fq);
> +        sbitmap_free(&hctx->ctx_map);

Right.

Thanks,
Ming Lei


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  3:28           ` Ming Lei
@ 2019-04-01  9:19             ` jianchao.wang
  2019-04-01 10:03               ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: jianchao.wang @ 2019-04-01  9:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Hi Ming

On 4/1/19 11:28 AM, Ming Lei wrote:
> On Mon, Apr 01, 2019 at 11:25:50AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 4/1/19 10:52 AM, Ming Lei wrote:
>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>>>> already calls blk_set_queue_dying() and that last function calls
>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>>>> sense.
>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
>>> move freeing of hw queue's resource into blk_release_queue() like what
>>> the patchset is doing.
>>>
>>> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
>>> do we?
>>
>> IMO, if we could get a way to prevent any attempt to run queue, it would be
>> better and clearer.
> 
> It is hard to do that way, and not necessary.
> 
> I will post V2 soon for review.
> 

Putting a percpu_ref_tryget()/percpu_ref_put() pair into blk_mq_run_hw_queues() could
stop run queue after the request_queue is frozen and drained (running the queue is also
unnecessary at that point, because no requests have entered). In addition,
percpu_ref_tryget() could avoid the IO hang issue you mentioned. We have a similar one
in blk_mq_timeout_work().

Freezing and draining the queue stops new attempts to run it, blk_sync_queue() syncs and
stops the already-started ones, and then hctx->run_work is quiesced completely.

IMO, it would be better to have a checkpoint after which there are no in-flight
asynchronous activities on the request_queue (hctx->run_work, q->requeue_work, q->timeout_work)
and any attempt to start them will fail.

Perhaps this would be a good chance to do this ;)

Thanks
Jianchao


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  9:19             ` jianchao.wang
@ 2019-04-01 10:03               ` Ming Lei
  2019-04-02  2:02                 ` jianchao.wang
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2019-04-01 10:03 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Mon, Apr 01, 2019 at 05:19:01PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 4/1/19 11:28 AM, Ming Lei wrote:
> > [...]
> 
> Put percpu_ref_tryget/put pair into blk_mq_run_hw_queues could stop run queue after
> requet_queue is frozen and drained (run queue is also unnecessary because there is no
> entered requests). And also percpu_ref_tryget could avoid the io hung issue you mentioned.
> We have similar one in blk_mq_timeout_work.

If percpu_ref_tryget() is used, percpu_ref_exit() has to be moved into
queue's release handler.

Then we still have to move freeing of the hctx's resources into the hctx's or queue's
release handler, which is exactly what this patchset is doing. Then
percpu_ref_tryget() becomes unnecessary again, right?

> 
> freeze and drain queue to stop new attempt to run queue, blk_sync_queue syncs and stops
> the started ones, then hctx->run_queue is cleaned totally.
> 
> IMO, it would be better to have a checkpoint after which there will be no any in-flight
> asynchronous activities of the request_queue (hctx->run_work, q->requeue_work, q-> timeout_work)
> and any attempt to start them will fail.

All of them are canceled in blk_cleanup_queue(), but that is not enough, given that the
queue can be run in sync mode (such as via plug or direct issue) or via a driver's
requeue, such as SCSI's. SCSI's requeue may run another LUN's queue while holding
only the queue's kobject refcount.

> 
> Perhaps, this will be a good change to do this ;)

However, I don't see that it is necessary if we simply move freeing of the hctx's
resources into its release handler, just like V2 does.


Thanks,
Ming


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01 10:03               ` Ming Lei
@ 2019-04-02  2:02                 ` jianchao.wang
  2019-04-02  2:55                   ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: jianchao.wang @ 2019-04-02  2:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Hi Ming

On 4/1/19 6:03 PM, Ming Lei wrote:
> On Mon, Apr 01, 2019 at 05:19:01PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 4/1/19 11:28 AM, Ming Lei wrote:
> > [...]
>>
>> Put percpu_ref_tryget/put pair into blk_mq_run_hw_queues could stop run queue after
>> requet_queue is frozen and drained (run queue is also unnecessary because there is no
>> entered requests). And also percpu_ref_tryget could avoid the io hung issue you mentioned.
>> We have similar one in blk_mq_timeout_work.
> 
> If percpu_ref_tryget() is used, percpu_ref_exit() has to be moved into
> queue's release handler.
> 
> Then we still have to move freeing hctx's resource into hctx or queue's
> release handler, that is exactly what this patch is doing. Then
> percpu_ref_tryget() becomes unnecessary again, right?

I'm not sure about percpu_ref_exit(). Perhaps I have some misunderstanding about it.

From its code, it frees the percpu count and sets ref->percpu_count_ptr to __PERCPU_REF_ATOMIC_DEAD.
The comment says 'the caller is responsible for ensuring that @ref is no longer in active use'.
But if we use the ref after kill, does that count as an active use?
Based on the code, __ref_is_percpu() is always false during this window, so percpu_ref_tryget()
will not touch the freed percpu counter but only the atomic ref->count.

It looks safe.


> 
>>
>> freeze and drain queue to stop new attempt to run queue, blk_sync_queue syncs and stops
>> the started ones, then hctx->run_queue is cleaned totally.
>>
>> IMO, it would be better to have a checkpoint after which there will be no any in-flight
>> asynchronous activities of the request_queue (hctx->run_work, q->requeue_work, q-> timeout_work)
>> and any attempt to start them will fail.
> 
> All are canceled in blk_cleanup_queue(), but not enough, given queue can
> be run in sync mode(such as via plug, direct issue, ...), or driver's
> requeue, such as SCSI's requeue. SCSI's requeue may run other LUN's queue
> just by holding queue's kobject refcount.

Yes, so we need a checkpoint here to ensure that the request_queue enters a certain
state, with the guarantee that all of the activities are stopped after this checkpoint.
That would make the subsequent work convenient, for example releasing the
request_queue's resources.

Thanks
Jianchao

> 
>>
>> Perhaps, this will be a good change to do this ;)
> 
> However, I don't see it is necessary if we simply move freeing hctx's
> resource into its release handler, just like V2.
> 
> 
> Thanks,
> Ming
> 


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  7:15               ` Ming Lei
@ 2019-04-02  2:10                 ` Dongli Zhang
  2019-04-02  2:20                   ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Dongli Zhang @ 2019-04-02  2:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, Linux SCSI List, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang



On 4/1/19 3:15 PM, Ming Lei wrote:
> On Mon, Apr 1, 2019 at 1:27 PM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>> [...]
>>>> Hi Ming,
>>>>
>>>> Would you mind help explain why we still need to move freeing of hw queue's
>>>> resource into blk_release_queue() like what the patchset is doing?
>>>>
>>>> Let's assume there is no deadlock when percpu_ref_tryget_live() is used,
>>>
>>> Could you explain why the assumption is true?
>>>
>>> We have to run queue after starting to freeze queue for draining
>>> allocated requests and making forward progress. Inside blk_freeze_queue_start(),
>>> percpu_ref_kill() marks this ref as DEAD, then percpu_ref_tryget_live() returns
>>> false, then queue won't be run.
>>
>> Hi Ming,
>>
>> I understand the assumption is invalid and there is issue when using
>> percpu_ref_tryget_live. And I also understand we have to run queue after
>> starting to freeze queue for draining allocated requests and making forward
>> progress.
> 
> OK.
> 
>>
>>
>> I am just wondering specifically on why "If percpu_ref_exit() is moved to
>> blk_release_queue(), we still need to move freeing of hw queue's resource into
>> blk_release_queue() like what the patchset is doing." based on below Bart's
>> statement:
>>
>> "percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead" state.
>> blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() already
>> calls blk_set_queue_dying() and that last function call
>> blk_freeze_queue_start(). So I think that what you wrote is not correct and that
>> inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and also
>> that moving the percpu_ref_exit() call into blk_release_queue() makes sense."
> 
> As you mentioned, percpu_ref_tryget_live() can't be used to avoid run queue
> during cleanup, then run queue can come when queue is cleaned up.
> 
>>
>> That's is, what is penalty if we do not  move freeing of hw queue's resource
>> into blk_release_queue() like what the patchset is doing in above situation?
> 
> kernel oops as reported by James, because some fields of hctx will be used
> by run queue, and they can be freed by blk_mq_free_queue() in
> blk_cleanup_queue().
> 
>>
>> I ask this question just because I would like to better understand the source
>> code. Does "hw queue's resource" indicate the below?
>>
>> +        if (hctx->flags & BLK_MQ_F_BLOCKING)
>> +                cleanup_srcu_struct(hctx->srcu);
>> +        blk_free_flush_queue(hctx->fq);
>> +        sbitmap_free(&hctx->ctx_map);
> 
> Right.

Hi Ming,

Thank you very much for the detailed explanation.

I think I may have misunderstood your message.

In the other direction posted by Bart below, regardless of which direction
is better, that implementation does not move freeing of the hw queue's resources into
blk_release_queue(), although percpu_ref_exit() is moved to blk_release_queue().
That's why I wanted to confirm.

https://lore.kernel.org/linux-block/20190401212014.192753-1-bvanassche@acm.org/

In that direction, the friendlier percpu_ref_tryget(), suggested by
Jianchao, is used. I just wanted to confirm that there is no need to move
freeing of the hw queue's resources into blk_release_queue() when the get/put
method is friendly and fair enough.

I think I should not have focused on only part of your message.

Sorry for spamming you.

Dongli Zhang


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-02  2:10                 ` Dongli Zhang
@ 2019-04-02  2:20                   ` Ming Lei
  0 siblings, 0 replies; 39+ messages in thread
From: Ming Lei @ 2019-04-02  2:20 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Ming Lei, Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, Linux SCSI List, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley, jianchao wang

On Tue, Apr 2, 2019 at 10:06 AM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>
>
>
> On 4/1/19 3:15 PM, Ming Lei wrote:
> > [...]
>
> Hi Ming,
>
> Thank you very much for the detailed explanation.
>
> I think maybe I misunderstood your message in the email.
>
> In another direction posted by Bart as below, regardless about which direction
> is better, that implementation does not move freeing of hw queue's resource into
> blk_release_queue(), although percpu_ref_exit() is moved to blk_release_queue().
> That's why I would like to confirm.
>
> https://lore.kernel.org/linux-block/20190401212014.192753-1-bvanassche@acm.org/
>
> In that direction, the more friendly percpu_ref_tryget(), which is suggested by
> Jianchao, is used. I would like just to confirm that there is no need to move
> freeing of hw queue's resource into blk_release_queue() when the get/put method
> is friendly and fair enough.

I don't think it is friendly to add such unnecessary stuff in the very
fast path.



Thanks,
Ming Lei


* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-02  2:02                 ` jianchao.wang
@ 2019-04-02  2:55                   ` Ming Lei
  2019-04-02  8:07                     ` jianchao.wang
  2019-04-02 18:11                     ` Bart Van Assche
  0 siblings, 2 replies; 39+ messages in thread
From: Ming Lei @ 2019-04-02  2:55 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Tue, Apr 02, 2019 at 10:02:43AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 4/1/19 6:03 PM, Ming Lei wrote:
> > [...]
> >> Put percpu_ref_tryget/put pair into blk_mq_run_hw_queues could stop run queue after
> >> request_queue is frozen and drained (run queue is also unnecessary because there is no
> >> entered requests). And also percpu_ref_tryget could avoid the io hung issue you mentioned.
> >> We have similar one in blk_mq_timeout_work.
> > 
> > If percpu_ref_tryget() is used, percpu_ref_exit() has to be moved into
> > queue's release handler.
> > 
> > Then we still have to move freeing hctx's resource into hctx or queue's
> > release handler, that is exactly what this patch is doing. Then
> > percpu_ref_tryget() becomes unnecessary again, right?
> 
> I'm not sure about the percpu_ref_exit. Perhaps I have some misunderstanding about it.
> 
> From the code of it, it frees the percpu_count and set ref->percpu_count_ptr to __PERCPU_REF_ATOMIC_DEAD.
> The comment says 'the caller is responsible for ensuring that @ref is no longer in active use'
> But if we use it after kill, does it count a active use ?
> Based on the code, the __ref_is_percpu is always false during this, and percpu_ref_tryget will not
> touch the freed percpu counter but just the atomic ref->count.
> 
> It looks safe.

OK, you are right.

However, I still think it isn't necessary to hold the percpu_ref in the
very fast IO path.
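The behavior described above can be sketched in a userspace model (illustrative only; mini_ref and its helpers are simplified stand-ins, not the kernel's percpu_ref implementation). It shows why a tryget issued after kill/exit only touches the atomic count and never the freed per-cpu memory:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define MINI_REF_DEAD 1UL

/* simplified two-mode refcount: percpu mode until kill(), atomic after */
struct mini_ref {
	atomic_ulong count;          /* atomic-mode count, valid forever */
	unsigned long *percpu_count; /* fast-path counters, freed by exit() */
	atomic_ulong flags;
};

static void mini_ref_init(struct mini_ref *ref, int nr_cpus)
{
	atomic_init(&ref->count, 1);
	atomic_init(&ref->flags, 0);
	ref->percpu_count = calloc(nr_cpus, sizeof(*ref->percpu_count));
}

static void mini_ref_kill(struct mini_ref *ref)
{
	/* switch to atomic mode: percpu counters are never touched again */
	atomic_fetch_or(&ref->flags, MINI_REF_DEAD);
}

static void mini_ref_exit(struct mini_ref *ref)
{
	/* frees only the percpu side; the atomic count stays usable */
	free(ref->percpu_count);
	ref->percpu_count = NULL;
}

static bool mini_ref_tryget(struct mini_ref *ref)
{
	if (atomic_load(&ref->flags) & MINI_REF_DEAD) {
		/* dead: operate on the atomic count only, so calling this
		 * after exit() never dereferences freed percpu memory */
		unsigned long c = atomic_load(&ref->count);
		while (c != 0) {
			if (atomic_compare_exchange_weak(&ref->count, &c, c + 1))
				return true;
		}
		return false;
	}
	ref->percpu_count[0]++; /* live percpu fast path (cpu 0 for brevity) */
	return true;
}
```

In this model, once kill() has set the dead flag, exit() can free the per-cpu array while tryget keeps working against the atomic count, which matches Jianchao's reading of the code.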

> 
> 
> > 
> >>
> >> freeze and drain queue to stop new attempt to run queue, blk_sync_queue syncs and stops
> >> the started ones, then hctx->run_queue is cleaned totally.
> >>
> >> IMO, it would be better to have a checkpoint after which there will be no any in-flight
> >> asynchronous activities of the request_queue (hctx->run_work, q->requeue_work, q-> timeout_work)
> >> and any attempt to start them will fail.
> > 
> > All are canceled in blk_cleanup_queue(), but not enough, given queue can
> > be run in sync mode(such as via plug, direct issue, ...), or driver's
> > requeue, such as SCSI's requeue. SCSI's requeue may run other LUN's queue
> > just by holding queue's kobject refcount.
> 
> Yes, so we need a checkpoint here to ensure the request_queue to enter into a certain state.
> We provide a guarantee that all of the activities are stopped after this checkpoint.
> It will be convenient for us to do other things following, for example release request_queue's
> resource.

We have such a checkpoint already:

	blk_freeze_queue() together with blk_sync_queue()

Once the two are done, there shouldn't be any driver activities at all.
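That checkpoint can be modeled with a minimal single-threaded sketch (mini_queue and its helpers are hypothetical stand-ins for q_usage_counter, blk_freeze_queue() and blk_sync_queue(), not kernel code):

```c
#include <stdbool.h>

/* toy model of the cleanup checkpoint: freeze rejects new entries and
 * waits out entered requests; sync cancels/flushes queued work */
struct mini_queue {
	bool dying;
	int usage;         /* q_usage_counter analogue */
	bool work_pending; /* run_work/requeue_work/timeout_work analogue */
};

static bool mini_queue_enter(struct mini_queue *q)
{
	if (q->dying)
		return false; /* like percpu_ref_tryget_live() after kill */
	q->usage++;
	return true;
}

static void mini_queue_exit(struct mini_queue *q)
{
	q->usage--;
}

static void mini_freeze_queue(struct mini_queue *q)
{
	q->dying = true; /* new enters fail; a real freeze waits for usage == 0 */
}

static void mini_sync_queue(struct mini_queue *q)
{
	q->work_pending = false; /* cancel_work_sync()/del_timer_sync() analogue */
}

/* after freeze + sync there must be no driver activity left */
static bool mini_checkpoint_reached(const struct mini_queue *q)
{
	return q->dying && q->usage == 0 && !q->work_pending;
}
```

The point being made here is that the checkpoint already exists; the bug is only that hctx resources are freed before the kobject refcount says it is safe.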

The current issue is related to the blk-mq internal implementation, in which
it should have been safe to complete the run queue activity during queue
cleanup if the request queue's kobject refcount isn't released.

However, 45a9c9d909b2 ("blk-mq: Fix a use-after-free") frees hctx
resource too early, and causes the kernel oops.

Also, isn't it the typical practice to release kobject related resources in
its release handler?


Thanks,
Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-02  2:55                   ` Ming Lei
@ 2019-04-02  8:07                     ` jianchao.wang
  2019-04-02 11:05                       ` Ming Lei
  2019-04-02 18:11                     ` Bart Van Assche
  1 sibling, 1 reply; 39+ messages in thread
From: jianchao.wang @ 2019-04-02  8:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

Hi Ming

On 4/2/19 10:55 AM, Ming Lei wrote:
> On Tue, Apr 02, 2019 at 10:02:43AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 4/1/19 6:03 PM, Ming Lei wrote:
>>> On Mon, Apr 01, 2019 at 05:19:01PM +0800, jianchao.wang wrote:
>>>> Hi Ming
>>>>
>>>> On 4/1/19 11:28 AM, Ming Lei wrote:
>>>>> On Mon, Apr 01, 2019 at 11:25:50AM +0800, jianchao.wang wrote:
>>>>>> Hi Ming
>>>>>>
>>>>>> On 4/1/19 10:52 AM, Ming Lei wrote:
>>>>>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>>>>>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>>>>>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>>>>>>>> already calls blk_set_queue_dying() and that last function calls
>>>>>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>>>>>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>>>>>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>>>>>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>>>>>>>> sense.
>>>>>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
>>>>>>> move freeing of hw queue's resource into blk_release_queue() like what
>>>>>>> the patchset is doing.
>>>>>>>
>>>>>>> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
>>>>>>> do we?
>>>>>>
>>>>>> IMO, if we could get a way to prevent any attempt to run queue, it would be
>>>>>> better and clearer.
>>>>>
>>>>> It is hard to do that way, and not necessary.
>>>>>
>>>>> I will post V2 soon for review.
>>>>>
>>>>
>>>> Put percpu_ref_tryget/put pair into blk_mq_run_hw_queues could stop run queue after
>>>> request_queue is frozen and drained (run queue is also unnecessary because there is no
>>>> entered requests). And also percpu_ref_tryget could avoid the io hung issue you mentioned.
>>>> We have similar one in blk_mq_timeout_work.
>>>
>>> If percpu_ref_tryget() is used, percpu_ref_exit() has to be moved into
>>> queue's release handler.
>>>
>>> Then we still have to move freeing hctx's resource into hctx or queue's
>>> release handler, that is exactly what this patch is doing. Then
>>> percpu_ref_tryget() becomes unnecessary again, right?
>>
>> I'm not sure about the percpu_ref_exit. Perhaps I have some misunderstanding about it.
>>
>> From the code of it, it frees the percpu_count and set ref->percpu_count_ptr to __PERCPU_REF_ATOMIC_DEAD.
>> The comment says 'the caller is responsible for ensuring that @ref is no longer in active use'
>> But if we use it after kill, does it count a active use ?
>> Based on the code, the __ref_is_percpu is always false during this, and percpu_ref_tryget will not
>> touch the freed percpu counter but just the atomic ref->count.
>>
>> It looks safe.
> 
> OK, you are right.
> 
> However, I still think it isn't necessary to hold the percpu_ref in the
> very fast io path.

percpu_ref is born for the fast path.
Some drivers use it in the completion path, such as SCSI; does it really
matter for this kind of device? If yes, I guess we should remove
blk_mq_run_hw_queues(), which is the really bulky part, and depend on the
hctx restart mechanism.

> 
>>
>>
>>>
>>>>
>>>> freeze and drain queue to stop new attempt to run queue, blk_sync_queue syncs and stops
>>>> the started ones, then hctx->run_queue is cleaned totally.
>>>>
>>>> IMO, it would be better to have a checkpoint after which there will be no any in-flight
>>>> asynchronous activities of the request_queue (hctx->run_work, q->requeue_work, q-> timeout_work)
>>>> and any attempt to start them will fail.
>>>
>>> All are canceled in blk_cleanup_queue(), but not enough, given queue can
>>> be run in sync mode(such as via plug, direct issue, ...), or driver's
>>> requeue, such as SCSI's requeue. SCSI's requeue may run other LUN's queue
>>> just by holding queue's kobject refcount.
>>
>> Yes, so we need a checkpoint here to ensure the request_queue to enter into a certain state.
>> We provide a guarantee that all of the activities are stopped after this checkpoint.
>> It will be convenient for us to do other things following, for example release request_queue's
>> resource.
> 
> We have such checkpoint already:
> 
> 	blk_freeze_queue() together with blk_sync_queue()
> 
> Once the two are done, there shouldn't be any driver activities at all.
> 
> The current issue is related with blk-mq internal implementation, in which
> it should have been safe to complete the run queue activity during queue
> cleanup if the request queue's kobject refcount isn't released.
> 
> However, 45a9c9d909b2 ("blk-mq: Fix a use-after-free") frees hctx
> resource too early, and causes the kernel oops.
> 
> Also, isn't it the typical practice to release kobject related resources in
> its release handler?

I agree with this.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-02  8:07                     ` jianchao.wang
@ 2019-04-02 11:05                       ` Ming Lei
  2019-04-02 17:53                         ` Bart Van Assche
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2019-04-02 11:05 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Bart Van Assche, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 4/2/19 10:55 AM, Ming Lei wrote:
> > On Tue, Apr 02, 2019 at 10:02:43AM +0800, jianchao.wang wrote:
> >> Hi Ming
> >>
> >> On 4/1/19 6:03 PM, Ming Lei wrote:
> >>> On Mon, Apr 01, 2019 at 05:19:01PM +0800, jianchao.wang wrote:
> >>>> Hi Ming
> >>>>
> >>>> On 4/1/19 11:28 AM, Ming Lei wrote:
> >>>>> On Mon, Apr 01, 2019 at 11:25:50AM +0800, jianchao.wang wrote:
> >>>>>> Hi Ming
> >>>>>>
> >>>>>> On 4/1/19 10:52 AM, Ming Lei wrote:
> >>>>>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> >>>>>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> >>>>>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> >>>>>>>> already calls blk_set_queue_dying() and that last function calls
> >>>>>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> >>>>>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> >>>>>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> >>>>>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> >>>>>>>> sense.
> >>>>>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> >>>>>>> move freeing of hw queue's resource into blk_release_queue() like what
> >>>>>>> the patchset is doing.
> >>>>>>>
> >>>>>>> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
> >>>>>>> do we?
> >>>>>>
> >>>>>> IMO, if we could get a way to prevent any attempt to run queue, it would be
> >>>>>> better and clearer.
> >>>>>
> >>>>> It is hard to do that way, and not necessary.
> >>>>>
> >>>>> I will post V2 soon for review.
> >>>>>
> >>>>
> >>>> Put percpu_ref_tryget/put pair into blk_mq_run_hw_queues could stop run queue after
> >>>> request_queue is frozen and drained (run queue is also unnecessary because there is no
> >>>> entered requests). And also percpu_ref_tryget could avoid the io hung issue you mentioned.
> >>>> We have similar one in blk_mq_timeout_work.
> >>>
> >>> If percpu_ref_tryget() is used, percpu_ref_exit() has to be moved into
> >>> queue's release handler.
> >>>
> >>> Then we still have to move freeing hctx's resource into hctx or queue's
> >>> release handler, that is exactly what this patch is doing. Then
> >>> percpu_ref_tryget() becomes unnecessary again, right?
> >>
> >> I'm not sure about the percpu_ref_exit. Perhaps I have some misunderstanding about it.
> >>
> >> From the code of it, it frees the percpu_count and set ref->percpu_count_ptr to __PERCPU_REF_ATOMIC_DEAD.
> >> The comment says 'the caller is responsible for ensuring that @ref is no longer in active use'
> >> But if we use it after kill, does it count a active use ?
> >> Based on the code, the __ref_is_percpu is always false during this, and percpu_ref_tryget will not
> >> touch the freed percpu counter but just the atomic ref->count.
> >>
> >> It looks safe.
> > 
> > OK, you are right.
> > 
> > However, I still think it isn't necessary to hold the percpu_ref in the
> > very fast io path.
> 
> percpu_ref is born for fast path.
> There are some drivers use it in completion path, such as scsi, does it really
> matter for this kind of device ? If yes, I guess we should remove blk_mq_run_hw_queues
> which is the really bulk and depend on hctx restart mechanism.

Yes, it is designed for the fast path, but that doesn't mean percpu_ref
has no cost. blk_mq_run_hw_queues() is called for all blk-mq devices,
including fast NVMe ones.

Also:

It may not be enough to just grab the percpu_ref for blk_mq_run_hw_queues
only, given the idea is to use the percpu_ref to protect hctx's resources.

There are lots of uses of 'hctx', such as other exported blk-mq APIs.
If this approach were chosen, we may have to audit the other blk-mq APIs,
because they might be called after the queue is frozen too.

So this usage would probably be a misuse of percpu_ref.

> 
> > 
> >>
> >>
> >>>
> >>>>
> >>>> freeze and drain queue to stop new attempt to run queue, blk_sync_queue syncs and stops
> >>>> the started ones, then hctx->run_queue is cleaned totally.
> >>>>
> >>>> IMO, it would be better to have a checkpoint after which there will be no any in-flight
> >>>> asynchronous activities of the request_queue (hctx->run_work, q->requeue_work, q-> timeout_work)
> >>>> and any attempt to start them will fail.
> >>>
> >>> All are canceled in blk_cleanup_queue(), but not enough, given queue can
> >>> be run in sync mode(such as via plug, direct issue, ...), or driver's
> >>> requeue, such as SCSI's requeue. SCSI's requeue may run other LUN's queue
> >>> just by holding queue's kobject refcount.
> >>
> >> Yes, so we need a checkpoint here to ensure the request_queue to enter into a certain state.
> >> We provide a guarantee that all of the activities are stopped after this checkpoint.
> >> It will be convenient for us to do other things following, for example release request_queue's
> >> resource.
> > 
> > We have such checkpoint already:
> > 
> > 	blk_freeze_queue() together with blk_sync_queue()
> > 
> > Once the two are done, there shouldn't be any driver activities at all.
> > 
> > The current issue is related with blk-mq internal implementation, in which
> > it should have been safe to complete the run queue activity during queue
> > cleanup if the request queue's kobject refcount isn't released.
> > 
> > However, 45a9c9d909b2 ("blk-mq: Fix a use-after-free") frees hctx
> > resource too early, and causes the kernel oops.
> > 
> > Also, isn't it the typical practice to release kobject related resources in
> > its release handler?
> 
> I agree with this.

OK.

Another point in favor of freeing hctx resources in its release handler is that
things become much simpler: if the queue's kobject refcount is held,
almost all blk-mq APIs can be called safely. This approach has worked
perfectly on the legacy IO path for ages.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-02 11:05                       ` Ming Lei
@ 2019-04-02 17:53                         ` Bart Van Assche
  2019-04-03  3:20                           ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Bart Van Assche @ 2019-04-02 17:53 UTC (permalink / raw)
  To: Ming Lei, jianchao.wang
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley

On Tue, 2019-04-02 at 19:05 +0800, Ming Lei wrote:
> On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote:
> > percpu_ref is born for fast path.
> > There are some drivers use it in completion path, such as scsi, does it really
> > matter for this kind of device ? If yes, I guess we should remove blk_mq_run_hw_queues
> > which is the really bulk and depend on hctx restart mechanism.
> 
> Yes, it is designed for fast path, but it doesn't mean percpu_ref
> hasn't any cost. blk_mq_run_hw_queues() is called for all blk-mq devices,
> includes the fast NVMe.

I think the overhead of adding a percpu_ref_get/put pair is acceptable for
SCSI drivers. The NVMe driver doesn't call blk_mq_run_hw_queues() directly.
Additionally, I don't think that any of the blk_mq_run_hw_queues() calls from
the block layer matter for the fast path code in the NVMe driver. In other
words, adding a percpu_ref_get/put pair in blk_mq_run_hw_queues() shouldn't
affect the performance of the NVMe driver.

> Also:
> 
> It may not be enough to just grab the percpu_ref for blk_mq_run_hw_queues
> only, given the idea is to use the percpu_ref to protect hctx's resources.
> 
> There are lots of uses on 'hctx', such as other exported blk-mq APIs.
> If this approach were chosen, we may have to audit other blk-mq APIs,
> cause they might be called after queue is frozen too.

The only blk_mq_hw_ctx user I have found so far that needs additional
protection is the q->mq_ops->poll() call in blk_poll(). However, that is not
a new issue. Functions like nvme_poll() access data structures (NVMe
completion queue) that shouldn't be accessed while blk_cleanup_queue() is in
progress. If blk_poll() is modified such that it becomes safe to call that
function while blk_cleanup_queue() is in progress then blk_poll() won't
access any hardware queue that it shouldn't access.

Bart.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-01  2:44       ` jianchao.wang
@ 2019-04-02 18:07         ` Bart Van Assche
  0 siblings, 0 replies; 39+ messages in thread
From: Bart Van Assche @ 2019-04-02 18:07 UTC (permalink / raw)
  To: jianchao.wang, Ming Lei
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley

On Mon, 2019-04-01 at 10:44 +0800, jianchao.wang wrote:
> percpu_ref_tryget would be better to get pending requests to be issued when queue is frozen.

Agreed.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-02  2:55                   ` Ming Lei
  2019-04-02  8:07                     ` jianchao.wang
@ 2019-04-02 18:11                     ` Bart Van Assche
  2019-04-03  3:24                       ` Ming Lei
  1 sibling, 1 reply; 39+ messages in thread
From: Bart Van Assche @ 2019-04-02 18:11 UTC (permalink / raw)
  To: Ming Lei, jianchao.wang
  Cc: Jens Axboe, linux-block, James Smart, Bart Van Assche,
	linux-scsi, Martin K . Petersen, Christoph Hellwig,
	James E . J . Bottomley

On Tue, 2019-04-02 at 10:55 +0800, Ming Lei wrote:
> Also, isn't it the typical practice to release kobject related resources in
> its release handler?

A typical approach is to call kobject_del() before starting to clean up a
resource and to defer the final kfree() call to the release handler. I think
that's how it works today in the block layer core. That is easy to see in the
blk_mq_hw_sysfs_release() implementation:

static void blk_mq_hw_sysfs_release(struct kobject *kobj)
{
	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
						  kobj);
	free_cpumask_var(hctx->cpumask);
	kfree(hctx->ctxs);
	kfree(hctx);
}
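The pattern can be sketched in userspace (mini_kobj and its helpers are illustrative names, not the kernel kobject API): unpublish early, but free only when the last reference is dropped:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct mini_kobj {
	atomic_int refcount;
	bool published; /* visible to lookups (sysfs entry, etc.) */
	bool *released; /* set by the release handler, for observation */
};

static void mini_kobj_release(struct mini_kobj *k)
{
	*k->released = true; /* where kfree(hctx->ctxs), kfree(hctx) would go */
	free(k);
}

static struct mini_kobj *mini_kobj_create(bool *released_flag)
{
	struct mini_kobj *k = malloc(sizeof(*k));
	atomic_init(&k->refcount, 1);
	k->published = true;
	k->released = released_flag;
	*released_flag = false;
	return k;
}

static void mini_kobj_get(struct mini_kobj *k)
{
	atomic_fetch_add(&k->refcount, 1);
}

static void mini_kobj_put(struct mini_kobj *k)
{
	/* freeing happens only here, when the last reference is dropped */
	if (atomic_fetch_sub(&k->refcount, 1) == 1)
		mini_kobj_release(k);
}

static void mini_kobj_del(struct mini_kobj *k)
{
	/* unpublish early (kobject_del analogue): no new lookups succeed,
	 * but existing reference holders can keep using the object */
	k->published = false;
}
```

In this model an in-flight run-queue holding a reference keeps the object alive past kobj_del(), which is exactly the guarantee the patchset wants for hctx resources.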

Bart.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-02 17:53                         ` Bart Van Assche
@ 2019-04-03  3:20                           ` Ming Lei
  2019-04-03  8:29                             ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2019-04-03  3:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jianchao.wang, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Tue, Apr 02, 2019 at 10:53:00AM -0700, Bart Van Assche wrote:
> On Tue, 2019-04-02 at 19:05 +0800, Ming Lei wrote:
> > On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote:
> > > percpu_ref is born for fast path.
> > > There are some drivers use it in completion path, such as scsi, does it really
> > > matter for this kind of device ? If yes, I guess we should remove blk_mq_run_hw_queues
> > > which is the really bulk and depend on hctx restart mechanism.
> > 
> > Yes, it is designed for fast path, but it doesn't mean percpu_ref
> > hasn't any cost. blk_mq_run_hw_queues() is called for all blk-mq devices,
> > includes the fast NVMe.
> 
> I think the overhead of adding a percpu_ref_get/put pair is acceptable for
> SCSI drivers. The NVMe driver doesn't call blk_mq_run_hw_queues() directly.
> Additionally, I don't think that any of the blk_mq_run_hw_queues() calls from
> the block layer matter for the fast path code in the NVMe driver. In other
> words, adding a percpu_ref_get/put pair in blk_mq_run_hw_queues() shouldn't
> affect the performance of the NVMe driver.

But it can be avoided easily and cleanly, so why abuse it for protecting hctx?

> 
> > Also:
> > 
> > It may not be enough to just grab the percpu_ref for blk_mq_run_hw_queues
> > only, given the idea is to use the percpu_ref to protect hctx's resources.
> > 
> > There are lots of uses on 'hctx', such as other exported blk-mq APIs.
> > If this approach were chosen, we may have to audit other blk-mq APIs,
> > cause they might be called after queue is frozen too.
> 
> The only blk_mq_hw_ctx user I have found so far that needs additional
> protection is the q->mq_ops->poll() call in blk_poll(). However, that is not
> a new issue. Functions like nvme_poll() access data structures (NVMe
> completion queue) that shouldn't be accessed while blk_cleanup_queue() is in
> progress. If blk_poll() is modified such that it becomes safe to call that
> function while blk_cleanup_queue() is in progress then blk_poll() won't
> access any hardware queue that it shouldn't access.

There can be lots of such cases:

1) blk_mq_run_hw_queue() from blk_mq_flush_plug_list()
- requests can be completed just after being added to the ctx queue or scheduler
queue, because there can be a concurrent run queue; then queue freezing may be done

- then the following blk_mq_run_hw_queue() in blk_mq_sched_insert_requests()
may see freed hctx fields

2) blk_mq_delay_run_hw_queue
- what if it is called after blk_sync_queue() is done in
  blk_cleanup_queue()
- but the caller follows the old rule by holding request queue's
  refcount

3) blk_mq_quiesce_queue
- called after blk_mq_free_queue() is done, then use-after-free on hctx->srcu
- but the caller follows the old rule by holding request queue's
  refcount

...

Thanks,
Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-02 18:11                     ` Bart Van Assche
@ 2019-04-03  3:24                       ` Ming Lei
  0 siblings, 0 replies; 39+ messages in thread
From: Ming Lei @ 2019-04-03  3:24 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jianchao.wang, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Tue, Apr 02, 2019 at 11:11:08AM -0700, Bart Van Assche wrote:
> On Tue, 2019-04-02 at 10:55 +0800, Ming Lei wrote:
> > Also, isn't it the typical practice to release kobject related resources in
> > its release handler?
> 
> A typical approach is to call kobject_del() before starting to clean up a

Yes.

> resource and to defer the final kfree() call to the release handler. I think
> that's how it works today in the block layer core. That is easy to see in the
> blk_mq_hw_sysfs_release() implementation:
> 
> static void blk_mq_hw_sysfs_release(struct kobject *kobj)
> {
> 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
> 						  kobj);
> 	free_cpumask_var(hctx->cpumask);
> 	kfree(hctx->ctxs);
> 	kfree(hctx);
> }

Right, then we should free the other hctx fields here too; it is safe
and reliable.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-03  3:20                           ` Ming Lei
@ 2019-04-03  8:29                             ` Ming Lei
  2019-04-03  8:43                               ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2019-04-03  8:29 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jianchao.wang, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Wed, Apr 03, 2019 at 11:20:53AM +0800, Ming Lei wrote:
> On Tue, Apr 02, 2019 at 10:53:00AM -0700, Bart Van Assche wrote:
> > On Tue, 2019-04-02 at 19:05 +0800, Ming Lei wrote:
> > > On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote:
> > > > percpu_ref is born for fast path.
> > > > There are some drivers use it in completion path, such as scsi, does it really
> > > > matter for this kind of device ? If yes, I guess we should remove blk_mq_run_hw_queues
> > > > which is the really bulk and depend on hctx restart mechanism.
> > > 
> > > Yes, it is designed for fast path, but it doesn't mean percpu_ref
> > > hasn't any cost. blk_mq_run_hw_queues() is called for all blk-mq devices,
> > > includes the fast NVMe.
> > 
> > I think the overhead of adding a percpu_ref_get/put pair is acceptable for
> > SCSI drivers. The NVMe driver doesn't call blk_mq_run_hw_queues() directly.
> > Additionally, I don't think that any of the blk_mq_run_hw_queues() calls from
> > the block layer matter for the fast path code in the NVMe driver. In other
> > words, adding a percpu_ref_get/put pair in blk_mq_run_hw_queues() shouldn't
> > affect the performance of the NVMe driver.
> 
> But it can be avoided easily and cleanly, why abuse it for protecting hctx?
> 
> > 
> > > Also:
> > > 
> > > It may not be enough to just grab the percpu_ref for blk_mq_run_hw_queues
> > > only, given the idea is to use the percpu_ref to protect hctx's resources.
> > > 
> > > There are lots of uses on 'hctx', such as other exported blk-mq APIs.
> > > If this approach were chosen, we may have to audit other blk-mq APIs,
> > > cause they might be called after queue is frozen too.
> > 
> > The only blk_mq_hw_ctx user I have found so far that needs additional
> > protection is the q->mq_ops->poll() call in blk_poll(). However, that is not
> > a new issue. Functions like nvme_poll() access data structures (NVMe
> > completion queue) that shouldn't be accessed while blk_cleanup_queue() is in
> > progress. If blk_poll() is modified such that it becomes safe to call that
> > function while blk_cleanup_queue() is in progress then blk_poll() won't
> > access any hardware queue that it shouldn't access.
> 
> There can be lots of such case:
> 
> 1) blk_mq_run_hw_queue() from blk_mq_flush_plug_list()
> - requests can be completed just after added to ctx queue or scheduler queue
> because there can be concurrent run queue, then queue freezing may be done
> 
> - then the following blk_mq_run_hw_queue() in blk_mq_sched_insert_requests()
> may see freed hctx fields

Actually this one is a blk-mq internal race, and the queue's refcount isn't
guaranteed to be held when blk_mq_run_hw_queue() is called.

We might have to address this one by grabbing .q_usage_count in
blk_mq_sched_insert_requests just like commit 8dc765d438f1 ("SCSI: fix queue cleanup
race before queue initialization is done"), but I do want to avoid it.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held
  2019-04-03  8:29                             ` Ming Lei
@ 2019-04-03  8:43                               ` Ming Lei
  0 siblings, 0 replies; 39+ messages in thread
From: Ming Lei @ 2019-04-03  8:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jianchao.wang, Jens Axboe, linux-block, James Smart,
	Bart Van Assche, linux-scsi, Martin K . Petersen,
	Christoph Hellwig, James E . J . Bottomley

On Wed, Apr 03, 2019 at 04:29:41PM +0800, Ming Lei wrote:
> On Wed, Apr 03, 2019 at 11:20:53AM +0800, Ming Lei wrote:
> > On Tue, Apr 02, 2019 at 10:53:00AM -0700, Bart Van Assche wrote:
> > > On Tue, 2019-04-02 at 19:05 +0800, Ming Lei wrote:
> > > > On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote:
> > > > > percpu_ref is born for fast path.
> > > > > There are some drivers use it in completion path, such as scsi, does it really
> > > > > matter for this kind of device ? If yes, I guess we should remove blk_mq_run_hw_queues
> > > > > which is the really bulk and depend on hctx restart mechanism.
> > > > 
> > > > Yes, it is designed for fast path, but it doesn't mean percpu_ref
> > > > hasn't any cost. blk_mq_run_hw_queues() is called for all blk-mq devices,
> > > > includes the fast NVMe.
> > > 
> > > I think the overhead of adding a percpu_ref_get/put pair is acceptable for
> > > SCSI drivers. The NVMe driver doesn't call blk_mq_run_hw_queues() directly.
> > > Additionally, I don't think that any of the blk_mq_run_hw_queues() calls from
> > > the block layer matter for the fast path code in the NVMe driver. In other
> > > words, adding a percpu_ref_get/put pair in blk_mq_run_hw_queues() shouldn't
> > > affect the performance of the NVMe driver.
> > 
> > But it can be avoided easily and cleanly, why abuse it for protecting hctx?
> > 
> > > 
> > > > Also:
> > > > 
> > > > It may not be enough to just grab the percpu_ref for blk_mq_run_hw_queues
> > > > only, given the idea is to use the percpu_ref to protect hctx's resources.
> > > > 
> > > > There are lots of uses on 'hctx', such as other exported blk-mq APIs.
> > > > If this approach were chosen, we may have to audit other blk-mq APIs,
> > > > cause they might be called after queue is frozen too.
> > > 
> > > The only blk_mq_hw_ctx user I have found so far that needs additional
> > > protection is the q->mq_ops->poll() call in blk_poll(). However, that is not
> > > a new issue. Functions like nvme_poll() access data structures (NVMe
> > > completion queue) that shouldn't be accessed while blk_cleanup_queue() is in
> > > progress. If blk_poll() is modified such that it becomes safe to call that
> > > function while blk_cleanup_queue() is in progress then blk_poll() won't
> > > access any hardware queue that it shouldn't access.
> > 
> > There can be lots of such case:
> > 
> > 1) blk_mq_run_hw_queue() from blk_mq_flush_plug_list()
> > - requests can be completed just after added to ctx queue or scheduler queue
> > because there can be concurrent run queue, then queue freezing may be done
> > 
> > - then the following blk_mq_run_hw_queue() in blk_mq_sched_insert_requests()
> > may see freed hctx fields
> 
> Actually this one is blk-mq internal race, and queue's refcount isn't
> guaranteed to be held when blk_mq_run_hw_queue is called.
> 
> We might have to address this one by grabbing .q_usage_count in
> blk_mq_sched_insert_requests just like commit 8dc765d438f1 ("SCSI: fix queue cleanup
> race before queue initialization is done"), but I do want to avoid it.

The issue is very similar to the kiocb refcount in aio/io_uring: we might
need to grab two refs for handling one request, one for the submission side
and another for the completion side. The problem now is that we don't grab
the ref on the submission side.

Thanks,
Ming


end of thread, other threads:[~2019-04-03  8:43 UTC | newest]

Thread overview: 39+ messages
2019-03-31  3:09 [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held Ming Lei
2019-03-31  3:09 ` [PATCH 1/5] blk-mq: re-organize blk_mq_exit_hctx() into two parts Ming Lei
2019-04-01  1:40   ` Dongli Zhang
2019-04-01  2:06     ` Ming Lei
2019-03-31  3:09 ` [PATCH 2/5] blk-mq: re-organize blk_mq_exit_hw_queues() " Ming Lei
2019-03-31  3:09 ` [PATCH 3/5] blk-mq: free hw queues in queue's release handler Ming Lei
2019-03-31  3:09 ` [PATCH 4/5] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
2019-04-01  1:50   ` Dongli Zhang
2019-04-01  2:08     ` Ming Lei
2019-03-31  3:09 ` [PATCH 5/5] SCSI: don't grab queue usage counter before run queue Ming Lei
2019-04-01  1:53   ` Dongli Zhang
2019-03-31 15:27 ` [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held Bart Van Assche
2019-04-01  2:00   ` Ming Lei
2019-04-01  2:39     ` Bart Van Assche
2019-04-01  2:44       ` jianchao.wang
2019-04-02 18:07         ` Bart Van Assche
2019-04-01  2:52       ` Ming Lei
2019-04-01  3:25         ` jianchao.wang
2019-04-01  3:28           ` Ming Lei
2019-04-01  9:19             ` jianchao.wang
2019-04-01 10:03               ` Ming Lei
2019-04-02  2:02                 ` jianchao.wang
2019-04-02  2:55                   ` Ming Lei
2019-04-02  8:07                     ` jianchao.wang
2019-04-02 11:05                       ` Ming Lei
2019-04-02 17:53                         ` Bart Van Assche
2019-04-03  3:20                           ` Ming Lei
2019-04-03  8:29                             ` Ming Lei
2019-04-03  8:43                               ` Ming Lei
2019-04-02 18:11                     ` Bart Van Assche
2019-04-03  3:24                       ` Ming Lei
2019-04-01  5:05         ` Dongli Zhang
2019-04-01  5:16           ` Ming Lei
2019-04-01  5:30             ` Dongli Zhang
2019-04-01  7:15               ` Ming Lei
2019-04-02  2:10                 ` Dongli Zhang
2019-04-02  2:20                   ` Ming Lei
2019-04-01  3:27       ` Ming Lei
2019-04-01  3:32         ` jianchao.wang
