linux-block.vger.kernel.org archive mirror
* [PATCH V4 0/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
@ 2020-09-09 10:41 Ming Lei
  2020-09-09 10:41 ` [PATCH V4 1/4] block: use test_and_{set|clear}_bit to set/clear QUEUE_FLAG_QUIESCED Ming Lei
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Ming Lei @ 2020-09-09 10:41 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig, Keith Busch
  Cc: Ming Lei, Hannes Reinecke, Sagi Grimberg, Bart Van Assche,
	Johannes Thumshirn, Chao Leng

Hi Jens,

The 1st patch switches QUEUE_FLAG_QUIESCED updates to
test_and_{set|clear}_bit so that the actual quiesce/unquiesce action runs
only once, and prepares for replacing srcu with percpu_ref.

The 2nd patch replaces srcu with percpu_ref.

The 3rd patch adds tagset quiesce interface.

The 4th patch applies tagset quiesce interface for NVMe subsystem.

V4:
	- remove .mq_quiesce_mutex, and switch to test_and_{set|clear}_bit
	for avoiding duplicated quiesce actions
	- pass blktests(block, nvme)

V3:
	- add tagset quiesce interface
	- apply tagset quiesce interface for NVMe
	- pass blktests(block, nvme)

V2:
	- add .mq_quiesce_lock
	- add comment on patch 2 wrt. handling hctx_lock() failure
	- trivial patch style change


Ming Lei (3):
  block: use test_and_{set|clear}_bit to set/clear QUEUE_FLAG_QUIESCED
  blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  blk-mq: add tagset quiesce interface

Sagi Grimberg (1):
  nvme: use blk_mq_[un]quiesce_tagset

 block/blk-core.c         |  13 +++
 block/blk-mq-sysfs.c     |   2 -
 block/blk-mq.c           | 173 +++++++++++++++++++++++++--------------
 block/blk-sysfs.c        |   6 +-
 block/blk.h              |   2 +
 drivers/nvme/host/core.c |  19 ++---
 include/linux/blk-mq.h   |  10 +--
 include/linux/blkdev.h   |   4 +
 8 files changed, 146 insertions(+), 83 deletions(-)

Cc: Hannes Reinecke <hare@suse.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: Chao Leng <lengchao@huawei.com>
-- 
2.25.2



* [PATCH V4 1/4] block: use test_and_{set|clear}_bit to set/clear QUEUE_FLAG_QUIESCED
  2020-09-09 10:41 [PATCH V4 0/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
@ 2020-09-09 10:41 ` Ming Lei
  2020-09-09 10:41 ` [PATCH V4 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Ming Lei @ 2020-09-09 10:41 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig, Keith Busch
  Cc: Ming Lei, Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng

Prepare for replacing srcu with percpu-refcount for implementing queue
quiesce.

The following patch needs to avoid duplicated quiesce actions for
BLK_MQ_F_BLOCKING, so use test_and_{set|clear}_bit to set/clear
QUEUE_FLAG_QUIESCED.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: Chao Leng <lengchao@huawei.com>
---
 block/blk-core.c | 13 +++++++++++++
 block/blk-mq.c   | 11 ++++++++---
 block/blk.h      |  2 ++
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 093649bd252e..de440733609c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -107,6 +107,19 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);
 
+/**
+ * blk_queue_flag_test_and_clear - atomically test and clear a queue flag
+ * @flag: flag to be cleared
+ * @q: request queue
+ *
+ * Returns the previous value of @flag - 0 if the flag was not set and 1 if
+ * the flag was set.
+ */
+bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q)
+{
+	return test_and_clear_bit(flag, &q->queue_flags);
+}
+
 void blk_rq_init(struct request_queue *q, struct request *rq)
 {
 	memset(rq, 0, sizeof(*rq));
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4abb71459f94..efd17a80fcdf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -199,13 +199,18 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
+static bool __blk_mq_quiesce_queue_nowait(struct request_queue *q)
+{
+	return blk_queue_flag_test_and_set(QUEUE_FLAG_QUIESCED, q);
+}
+
 /*
  * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
  * mpt3sas driver such that this function can be removed.
  */
 void blk_mq_quiesce_queue_nowait(struct request_queue *q)
 {
-	blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q);
+	__blk_mq_quiesce_queue_nowait(q);
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
 
@@ -224,7 +229,7 @@ void blk_mq_quiesce_queue(struct request_queue *q)
 	unsigned int i;
 	bool rcu = false;
 
-	blk_mq_quiesce_queue_nowait(q);
+	__blk_mq_quiesce_queue_nowait(q);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (hctx->flags & BLK_MQ_F_BLOCKING)
@@ -246,7 +251,7 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
  */
 void blk_mq_unquiesce_queue(struct request_queue *q)
 {
-	blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
+	blk_queue_flag_test_and_clear(QUEUE_FLAG_QUIESCED, q);
 
 	/* dispatch requests which are inserted during quiescing */
 	blk_mq_run_hw_queues(q, true);
diff --git a/block/blk.h b/block/blk.h
index c08762e10b04..312a060ea2a2 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -448,4 +448,6 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
 
+bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q);
+
 #endif /* BLK_INTERNAL_H */
-- 
2.25.2



* [PATCH V4 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  2020-09-09 10:41 [PATCH V4 0/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
  2020-09-09 10:41 ` [PATCH V4 1/4] block: use test_and_{set|clear}_bit to set/clear QUEUE_FLAG_QUIESCED Ming Lei
@ 2020-09-09 10:41 ` Ming Lei
  2020-09-09 16:04   ` Keith Busch
  2020-09-09 10:41 ` [PATCH V4 3/4] blk-mq: add tagset quiesce interface Ming Lei
  2020-09-09 10:41 ` [PATCH V4 4/4] nvme: use blk_mq_[un]quiesce_tagset Ming Lei
  3 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2020-09-09 10:41 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig, Keith Busch
  Cc: Ming Lei, Sagi Grimberg, Bart Van Assche, Johannes Thumshirn,
	Chao Leng, Hannes Reinecke

In case of BLK_MQ_F_BLOCKING, blk-mq uses SRCU to mark the read-side
critical section during request dispatch, and request queue quiesce is
built on top of SRCU. What we want is to add as little cost as possible
to the fast path.

percpu-ref is cleaner and simpler, and is sufficient for implementing
queue quiesce. The main requirement is to make sure all read sections
observe QUEUE_FLAG_QUIESCED once blk_mq_quiesce_queue() returns.

It also becomes much easier to add an async queue quiesce interface.

Meanwhile, the memory footprint is reduced by replacing the per-hctx
srcu_struct with a single per-request-queue percpu-ref.

From an implementation viewpoint, the percpu_ref fast path should not be
slower than SRCU's: tree srcu (the default option in most distributions)
could even be slower, since a memory barrier is required in both lock
and unlock, and rcu_read_lock()/rcu_read_unlock() should be much cheaper
than smp_mb().

1) percpu_ref just holds rcu_read_lock, then runs a check and an
   increase/decrease on the percpu variable:

   rcu_read_lock()
   if (__ref_is_percpu(ref, &percpu_count))
	this_cpu_inc(*percpu_count);
   rcu_read_unlock()

2) srcu tree:
        idx = READ_ONCE(ssp->srcu_idx) & 0x1;
        this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
        smp_mb(); /* B */  /* Avoid leaking the critical section. */
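
Condensed, the read and write sides added by this patch pair up as
below (a sketch only; the non-blocking RCU path and error handling are
omitted):

	/* reader: dispatch fast path, see hctx_lock()/hctx_unlock() */
	if (percpu_ref_tryget_live(&q->dispatch_counter)) {
		blk_mq_sched_dispatch_requests(hctx);
		percpu_ref_put(&q->dispatch_counter);
	}

	/* writer: quiesce, see blk_mq_quiesce_queue() */
	if (!blk_queue_flag_test_and_set(QUEUE_FLAG_QUIESCED, q))
		percpu_ref_kill(&q->dispatch_counter);
	wait_event(q->mq_quiesce_wq,
		   percpu_ref_is_zero(&q->dispatch_counter));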

Also, from my test on null_blk (blocking), percpu-ref was not observed
to perform worse than srcu; see the following test:

1) test steps:

rmmod null_blk > /dev/null 2>&1
modprobe null_blk nr_devices=1 submit_queues=1 blocking=1
fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 --ioengine=libaio \
	--iodepth=64 --runtime=60 --group_reporting=1  --name=nullb0 \
	--filename=/dev/nullb0 --numjobs=32

test machine: HP DL380, 16 cpu cores, 2 threads per core, dual
sockets/numa, Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz

2) test result:
- srcu quiesce: 6063K IOPS
- percpu-ref quiesce: 6113K IOPS

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: Chao Leng <lengchao@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/blk-mq-sysfs.c   |   2 -
 block/blk-mq.c         | 121 +++++++++++++++++++++--------------------
 block/blk-sysfs.c      |   6 +-
 include/linux/blk-mq.h |   8 ---
 include/linux/blkdev.h |   4 ++
 5 files changed, 72 insertions(+), 69 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 062229395a50..799db7937105 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -38,8 +38,6 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 
 	cancel_delayed_work_sync(&hctx->run_work);
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		cleanup_srcu_struct(hctx->srcu);
 	blk_free_flush_queue(hctx->fq);
 	sbitmap_free(&hctx->ctx_map);
 	free_cpumask_var(hctx->cpumask);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index efd17a80fcdf..37263ed48a8a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -225,19 +225,16 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
  */
 void blk_mq_quiesce_queue(struct request_queue *q)
 {
-	struct blk_mq_hw_ctx *hctx;
-	unsigned int i;
-	bool rcu = false;
+	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
+	bool was_quiesced = __blk_mq_quiesce_queue_nowait(q);
 
-	__blk_mq_quiesce_queue_nowait(q);
+	if (!was_quiesced && blocking)
+		percpu_ref_kill(&q->dispatch_counter);
 
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (hctx->flags & BLK_MQ_F_BLOCKING)
-			synchronize_srcu(hctx->srcu);
-		else
-			rcu = true;
-	}
-	if (rcu)
+	if (blocking)
+		wait_event(q->mq_quiesce_wq,
+				percpu_ref_is_zero(&q->dispatch_counter));
+	else
 		synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
@@ -251,7 +248,10 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
  */
 void blk_mq_unquiesce_queue(struct request_queue *q)
 {
-	blk_queue_flag_test_and_clear(QUEUE_FLAG_QUIESCED, q);
+	if (blk_queue_flag_test_and_clear(QUEUE_FLAG_QUIESCED, q)) {
+		if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
+			percpu_ref_resurrect(&q->dispatch_counter);
+	}
 
 	/* dispatch requests which are inserted during quiescing */
 	blk_mq_run_hw_queues(q, true);
@@ -704,24 +704,21 @@ void blk_mq_complete_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
-static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
-	__releases(hctx->srcu)
+static void hctx_unlock(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-		rcu_read_unlock();
+	if (hctx->flags & BLK_MQ_F_BLOCKING)
+		percpu_ref_put(&hctx->queue->dispatch_counter);
 	else
-		srcu_read_unlock(hctx->srcu, srcu_idx);
+		rcu_read_unlock();
 }
 
-static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
-	__acquires(hctx->srcu)
+/* Returning false means that the queue is being quiesced */
+static inline bool hctx_lock(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		/* shut up gcc false positive */
-		*srcu_idx = 0;
-		rcu_read_lock();
-	} else
-		*srcu_idx = srcu_read_lock(hctx->srcu);
+	if (hctx->flags & BLK_MQ_F_BLOCKING)
+		return percpu_ref_tryget_live(&hctx->queue->dispatch_counter);
+	rcu_read_lock();
+	return true;
 }
 
 /**
@@ -1500,8 +1497,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
  */
 static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 {
-	int srcu_idx;
-
 	/*
 	 * We should be running this queue from one of the CPUs that
 	 * are mapped to it.
@@ -1535,9 +1530,10 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 
 	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-	hctx_lock(hctx, &srcu_idx);
-	blk_mq_sched_dispatch_requests(hctx);
-	hctx_unlock(hctx, srcu_idx);
+	if (hctx_lock(hctx)) {
+		blk_mq_sched_dispatch_requests(hctx);
+		hctx_unlock(hctx);
+	}
 }
 
 static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
@@ -1649,7 +1645,6 @@ EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
  */
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
-	int srcu_idx;
 	bool need_run;
 
 	/*
@@ -1660,10 +1655,12 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
 	 * quiesced.
 	 */
-	hctx_lock(hctx, &srcu_idx);
+	if (!hctx_lock(hctx))
+		return;
+
 	need_run = !blk_queue_quiesced(hctx->queue) &&
 		blk_mq_hctx_has_pending(hctx);
-	hctx_unlock(hctx, srcu_idx);
+	hctx_unlock(hctx);
 
 	if (need_run)
 		__blk_mq_delay_run_hw_queue(hctx, async, 0);
@@ -2003,7 +2000,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	bool run_queue = true;
 
 	/*
-	 * RCU or SRCU read lock is needed before checking quiesced flag.
+	 * hctx_lock() is needed before checking quiesced flag.
 	 *
 	 * When queue is stopped or quiesced, ignore 'bypass_insert' from
 	 * blk_mq_request_issue_directly(), and return BLK_STS_OK to caller,
@@ -2051,11 +2048,14 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, blk_qc_t *cookie)
 {
 	blk_status_t ret;
-	int srcu_idx;
 
 	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-	hctx_lock(hctx, &srcu_idx);
+	/* The queue is being quiesced, so insert the request instead */
+	if (!hctx_lock(hctx)) {
+		blk_mq_sched_insert_request(rq, false, false, false);
+		return;
+	}
 
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false, true);
 	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
@@ -2063,19 +2063,22 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
 
-	hctx_unlock(hctx, srcu_idx);
+	hctx_unlock(hctx);
 }
 
 blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
 {
 	blk_status_t ret;
-	int srcu_idx;
 	blk_qc_t unused_cookie;
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
-	hctx_lock(hctx, &srcu_idx);
+	/* The queue is being quiesced, so insert the request instead */
+	if (!hctx_lock(hctx)) {
+		blk_mq_sched_insert_request(rq, false, false, false);
+		return BLK_STS_OK;
+	}
 	ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true, last);
-	hctx_unlock(hctx, srcu_idx);
+	hctx_unlock(hctx);
 
 	return ret;
 }
@@ -2606,20 +2609,6 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	}
 }
 
-static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
-{
-	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
-
-	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
-			   __alignof__(struct blk_mq_hw_ctx)) !=
-		     sizeof(struct blk_mq_hw_ctx));
-
-	if (tag_set->flags & BLK_MQ_F_BLOCKING)
-		hw_ctx_size += sizeof(struct srcu_struct);
-
-	return hw_ctx_size;
-}
-
 static int blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
@@ -2657,7 +2646,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	struct blk_mq_hw_ctx *hctx;
 	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
 
-	hctx = kzalloc_node(blk_mq_hw_ctx_size(set), gfp, node);
+	hctx = kzalloc_node(sizeof(struct blk_mq_hw_ctx), gfp, node);
 	if (!hctx)
 		goto fail_alloc_hctx;
 
@@ -2700,8 +2689,6 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	if (!hctx->fq)
 		goto free_bitmap;
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		init_srcu_struct(hctx->srcu);
 	blk_mq_hctx_kobj_init(hctx);
 
 	return hctx;
@@ -3181,6 +3168,13 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	mutex_unlock(&q->sysfs_lock);
 }
 
+static void blk_mq_dispatch_counter_release(struct percpu_ref *ref)
+{
+	struct request_queue *q = container_of(ref, struct request_queue,
+				dispatch_counter);
+	wake_up_all(&q->mq_quiesce_wq);
+}
+
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q,
 						  bool elevator_init)
@@ -3197,6 +3191,14 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (blk_mq_alloc_ctxs(q))
 		goto err_poll;
 
+	if (set->flags & BLK_MQ_F_BLOCKING) {
+		init_waitqueue_head(&q->mq_quiesce_wq);
+		if (percpu_ref_init(&q->dispatch_counter,
+					blk_mq_dispatch_counter_release,
+					PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
+			goto err_hctxs;
+	}
+
 	/* init q->mq_kobj and sw queues' kobjects */
 	blk_mq_sysfs_init(q);
 
@@ -3205,7 +3207,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
-		goto err_hctxs;
+		goto err_dispatch_counter;
 
 	INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
@@ -3239,6 +3241,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 	return q;
 
+err_dispatch_counter:
+	if (set->flags & BLK_MQ_F_BLOCKING)
+		percpu_ref_exit(&q->dispatch_counter);
 err_hctxs:
 	kfree(q->queue_hw_ctx);
 	q->nr_hw_queues = 0;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7dda709f3ccb..56b6c045e30c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -941,9 +941,13 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_queue_free_zone_bitmaps(q);
 
-	if (queue_is_mq(q))
+	if (queue_is_mq(q)) {
 		blk_mq_release(q);
 
+		if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
+			percpu_ref_exit(&q->dispatch_counter);
+	}
+
 	blk_trace_shutdown(q);
 	mutex_lock(&q->debugfs_mutex);
 	debugfs_remove_recursive(q->debugfs_dir);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b23eeca4d677..df642055f02c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -4,7 +4,6 @@
 
 #include <linux/blkdev.h>
 #include <linux/sbitmap.h>
-#include <linux/srcu.h>
 
 struct blk_mq_tags;
 struct blk_flush_queue;
@@ -173,13 +172,6 @@ struct blk_mq_hw_ctx {
 	 * q->unused_hctx_list.
 	 */
 	struct list_head	hctx_list;
-
-	/**
-	 * @srcu: Sleepable RCU. Use as lock when type of the hardware queue is
-	 * blocking (BLK_MQ_F_BLOCKING). Must be the last member - see also
-	 * blk_mq_hw_ctx_size().
-	 */
-	struct srcu_struct	srcu[];
 };
 
 /**
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7b1e53084799..de8b51bd402b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -572,6 +572,10 @@ struct request_queue {
 	 */
 	struct mutex		mq_freeze_lock;
 
+	/* only used for BLK_MQ_F_BLOCKING */
+	struct percpu_ref	dispatch_counter;
+	wait_queue_head_t	mq_quiesce_wq;
+
 	struct blk_mq_tag_set	*tag_set;
 	struct list_head	tag_set_list;
 	struct bio_set		bio_split;
-- 
2.25.2



* [PATCH V4 3/4] blk-mq: add tagset quiesce interface
  2020-09-09 10:41 [PATCH V4 0/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
  2020-09-09 10:41 ` [PATCH V4 1/4] block: use test_and_{set|clear}_bit to set/clear QUEUE_FLAG_QUIESCED Ming Lei
  2020-09-09 10:41 ` [PATCH V4 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
@ 2020-09-09 10:41 ` Ming Lei
  2020-09-09 10:41 ` [PATCH V4 4/4] nvme: use blk_mq_[un]quiesce_tagset Ming Lei
  3 siblings, 0 replies; 10+ messages in thread
From: Ming Lei @ 2020-09-09 10:41 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig, Keith Busch
  Cc: Ming Lei, Sagi Grimberg, Bart Van Assche, Johannes Thumshirn,
	Chao Leng, Hannes Reinecke

Drivers with shared tagsets may need to quiesce a potentially large
number of request queues that all share a single tagset (e.g. nvme). Add
an interface to quiesce all the queues on a given tagset. This interface
is useful because it can speed up the quiesce by doing it in parallel.

For tagsets that have BLK_MQ_F_BLOCKING set, we kill each request
queue's dispatch percpu-refcount first, then wait for all of the
counters to reach zero. For tagsets that don't have BLK_MQ_F_BLOCKING
set, a single synchronize_rcu() is sufficient.
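
With that, a driver owning a shared tagset can replace its per-queue
quiesce loop with one pair of calls (illustrative usage only; 'set'
stands for the driver's struct blk_mq_tag_set):

	blk_mq_quiesce_tagset(set);	/* quiesce all queues in parallel */
	/* ... reset or reconfigure the controller ... */
	blk_mq_unquiesce_tagset(set);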

This patch is against Sagi's original post.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: Chao Leng <lengchao@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/blk-mq.c         | 59 +++++++++++++++++++++++++++++++++++-------
 include/linux/blk-mq.h |  2 ++
 2 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 37263ed48a8a..7669fe815cf9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -214,16 +214,7 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
 
-/**
- * blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
- * @q: request queue.
- *
- * Note: this function does not prevent that the struct request end_io()
- * callback function is invoked. Once this function is returned, we make
- * sure no dispatch can happen until the queue is unquiesced via
- * blk_mq_unquiesce_queue().
- */
-void blk_mq_quiesce_queue(struct request_queue *q)
+static void __blk_mq_quiesce_queue(struct request_queue *q, bool wait)
 {
 	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
 	bool was_quiesced = __blk_mq_quiesce_queue_nowait(q);
@@ -231,12 +222,29 @@ void blk_mq_quiesce_queue(struct request_queue *q)
 	if (!was_quiesced && blocking)
 		percpu_ref_kill(&q->dispatch_counter);
 
+	if (!wait)
+		return;
+
 	if (blocking)
 		wait_event(q->mq_quiesce_wq,
 				percpu_ref_is_zero(&q->dispatch_counter));
 	else
 		synchronize_rcu();
 }
+
+/**
+ * blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
+ * @q: request queue.
+ *
+ * Note: this function does not prevent that the struct request end_io()
+ * callback function is invoked. Once this function is returned, we make
+ * sure no dispatch can happen until the queue is unquiesced via
+ * blk_mq_unquiesce_queue().
+ */
+void blk_mq_quiesce_queue(struct request_queue *q)
+{
+	__blk_mq_quiesce_queue(q, true);
+}
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
 
 /*
@@ -258,6 +266,37 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
 
+void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	mutex_lock(&set->tag_list_lock);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		__blk_mq_quiesce_queue(q, false);
+
+	/* wait until all queues' quiesce is done */
+	if (set->flags & BLK_MQ_F_BLOCKING) {
+		list_for_each_entry(q, &set->tag_list, tag_set_list)
+			wait_event(q->mq_quiesce_wq,
+				   percpu_ref_is_zero(&q->dispatch_counter));
+	} else {
+		synchronize_rcu();
+	}
+	mutex_unlock(&set->tag_list_lock);
+}
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_tagset);
+
+void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	mutex_lock(&set->tag_list_lock);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		blk_mq_unquiesce_queue(q);
+	mutex_unlock(&set->tag_list_lock);
+}
+EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
+
 void blk_mq_wake_waiters(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index df642055f02c..90da3582b91d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -519,6 +519,8 @@ int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
+void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set);
+void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set);
 
 unsigned int blk_mq_rq_cpu(struct request *rq);
 
-- 
2.25.2



* [PATCH V4 4/4] nvme: use blk_mq_[un]quiesce_tagset
  2020-09-09 10:41 [PATCH V4 0/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
                   ` (2 preceding siblings ...)
  2020-09-09 10:41 ` [PATCH V4 3/4] blk-mq: add tagset quiesce interface Ming Lei
@ 2020-09-09 10:41 ` Ming Lei
  3 siblings, 0 replies; 10+ messages in thread
From: Ming Lei @ 2020-09-09 10:41 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig, Keith Busch
  Cc: Sagi Grimberg, Hannes Reinecke, Bart Van Assche,
	Johannes Thumshirn, Chao Leng, Ming Lei

From: Sagi Grimberg <sagi@grimberg.me>

All controller namespaces share the same tagset, so we can use this
interface, which performs the optimal operation for parallel quiesce
based on the tagset type (blocking vs. non-blocking tagsets).

Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: Chao Leng <lengchao@huawei.com>

Add code to unquiesce ctrl->connect_q in nvme_stop_queues(); meanwhile,
avoid calling blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset() if the
tagset isn't initialized.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ea1fa41fbba8..a6af8978a3ba 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4623,23 +4623,22 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ns *ns;
+	if (list_empty_careful(&ctrl->namespaces))
+		return;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_quiesce_queue(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+	blk_mq_quiesce_tagset(ctrl->tagset);
+
+	if (ctrl->connect_q)
+		blk_mq_unquiesce_queue(ctrl->connect_q);
 }
 EXPORT_SYMBOL_GPL(nvme_stop_queues);
 
 void nvme_start_queues(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ns *ns;
+	if (list_empty_careful(&ctrl->namespaces))
+		return;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_unquiesce_queue(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+	blk_mq_unquiesce_tagset(ctrl->tagset);
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
-- 
2.25.2



* Re: [PATCH V4 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  2020-09-09 10:41 ` [PATCH V4 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
@ 2020-09-09 16:04   ` Keith Busch
  2020-09-09 20:53     ` Sagi Grimberg
  2020-09-10  1:53     ` Ming Lei
  0 siblings, 2 replies; 10+ messages in thread
From: Keith Busch @ 2020-09-09 16:04 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng,
	Hannes Reinecke

On Wed, Sep 09, 2020 at 06:41:14PM +0800, Ming Lei wrote:
>  void blk_mq_quiesce_queue(struct request_queue *q)
>  {
> -	struct blk_mq_hw_ctx *hctx;
> -	unsigned int i;
> -	bool rcu = false;
> +	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
> +	bool was_quiesced = __blk_mq_quiesce_queue_nowait(q);
>  
> -	__blk_mq_quiesce_queue_nowait(q);
> +	if (!was_quiesced && blocking)
> +		percpu_ref_kill(&q->dispatch_counter);
>  
> -	queue_for_each_hw_ctx(q, hctx, i) {
> -		if (hctx->flags & BLK_MQ_F_BLOCKING)
> -			synchronize_srcu(hctx->srcu);
> -		else
> -			rcu = true;
> -	}
> -	if (rcu)
> +	if (blocking)
> +		wait_event(q->mq_quiesce_wq,
> +				percpu_ref_is_zero(&q->dispatch_counter));
> +	else
>  		synchronize_rcu();
>  }

In the previous version, you had ensured no thread can unquiesce a queue
while another is waiting for quiescence. Now that the locking is gone,
a thread could unquiesce the queue before percpu_ref reaches zero, so
the wait_event() may never complete on the resurrected percpu_ref.

I don't think any drivers do such a thing today, but it just looks like
the implementation leaves open the possibility.
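
For example (an illustrative interleaving, not taken from any existing
driver):

	CPU0: blk_mq_quiesce_queue()      CPU1: blk_mq_unquiesce_queue()
	  percpu_ref_kill(&q->dispatch_counter)
	                                    percpu_ref_resurrect(&q->dispatch_counter)
	  wait_event(q->mq_quiesce_wq,
		     percpu_ref_is_zero(...))  /* ref is live again, may never reach zero */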


* Re: [PATCH V4 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  2020-09-09 16:04   ` Keith Busch
@ 2020-09-09 20:53     ` Sagi Grimberg
  2020-09-10  8:03       ` Ming Lei
  2020-09-10  1:53     ` Ming Lei
  1 sibling, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2020-09-09 20:53 UTC (permalink / raw)
  To: Keith Busch, Ming Lei
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Bart Van Assche, Johannes Thumshirn, Chao Leng, Hannes Reinecke


>>   void blk_mq_quiesce_queue(struct request_queue *q)
>>   {
>> -	struct blk_mq_hw_ctx *hctx;
>> -	unsigned int i;
>> -	bool rcu = false;
>> +	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
>> +	bool was_quiesced = __blk_mq_quiesce_queue_nowait(q);
>>   
>> -	__blk_mq_quiesce_queue_nowait(q);
>> +	if (!was_quiesced && blocking)
>> +		percpu_ref_kill(&q->dispatch_counter);
>>   
>> -	queue_for_each_hw_ctx(q, hctx, i) {
>> -		if (hctx->flags & BLK_MQ_F_BLOCKING)
>> -			synchronize_srcu(hctx->srcu);
>> -		else
>> -			rcu = true;
>> -	}
>> -	if (rcu)
>> +	if (blocking)
>> +		wait_event(q->mq_quiesce_wq,
>> +				percpu_ref_is_zero(&q->dispatch_counter));
>> +	else
>>   		synchronize_rcu();
>>   }
> 
> In the previous version, you had ensured no thread can unquiesce a queue
> while another is waiting for quiescence. Now that the locking is gone,
> a thread could unquiesce the queue before percpu_ref reaches zero, so
> the wait_event() may never complete on the resurrected percpu_ref.

Yea, where did that go?


* Re: [PATCH V4 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  2020-09-09 16:04   ` Keith Busch
  2020-09-09 20:53     ` Sagi Grimberg
@ 2020-09-10  1:53     ` Ming Lei
  2020-09-10 14:55       ` Keith Busch
  1 sibling, 1 reply; 10+ messages in thread
From: Ming Lei @ 2020-09-10  1:53 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng,
	Hannes Reinecke

On Wed, Sep 09, 2020 at 09:04:09AM -0700, Keith Busch wrote:
> On Wed, Sep 09, 2020 at 06:41:14PM +0800, Ming Lei wrote:
> >  void blk_mq_quiesce_queue(struct request_queue *q)
> >  {
> > -	struct blk_mq_hw_ctx *hctx;
> > -	unsigned int i;
> > -	bool rcu = false;
> > +	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
> > +	bool was_quiesced = __blk_mq_quiesce_queue_nowait(q);
> >  
> > -	__blk_mq_quiesce_queue_nowait(q);
> > +	if (!was_quiesced && blocking)
> > +		percpu_ref_kill(&q->dispatch_counter);
> >  
> > -	queue_for_each_hw_ctx(q, hctx, i) {
> > -		if (hctx->flags & BLK_MQ_F_BLOCKING)
> > -			synchronize_srcu(hctx->srcu);
> > -		else
> > -			rcu = true;
> > -	}
> > -	if (rcu)
> > +	if (blocking)
> > +		wait_event(q->mq_quiesce_wq,
> > +				percpu_ref_is_zero(&q->dispatch_counter));
> > +	else
> >  		synchronize_rcu();
> >  }
> 
> In the previous version, you had ensured no thread can unquiesce a queue
> while another is waiting for quiescence. Now that the locking is gone,
> a thread could unquiesce the queue before percpu_ref reaches zero, so
> the wait_event() may never complete on the resurrected percpu_ref.
> 
> I don't think any drivers do such a thing today, but it just looks like
> the implementation leaves open the possibility.

Such a driver can cause bigger trouble if it unquiesces a queue whose
quiesce is still in progress.

However, we can avoid that by the following delta change:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7669fe815cf9..5632727d71fa 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -225,9 +225,16 @@ static void __blk_mq_quiesce_queue(struct request_queue *q, bool wait)
 	if (!wait)
 		return;
 
+	/*
+	 * In case of BLK_MQ_F_BLOCKING, if the driver unquiesces its queue
+	 * while a quiesce is still in progress, it can cause bigger trouble;
+	 * simply warn once and return to avoid hanging here.
+	 */
 	if (blocking)
 		wait_event(q->mq_quiesce_wq,
-				percpu_ref_is_zero(&q->dispatch_counter));
+				percpu_ref_is_zero(&q->dispatch_counter) ||
+				WARN_ON_ONCE(!percpu_ref_is_dying(
+						&q->dispatch_counter)));
 	else
 		synchronize_rcu();
 }

Thanks, 
Ming



* Re: [PATCH V4 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  2020-09-09 20:53     ` Sagi Grimberg
@ 2020-09-10  8:03       ` Ming Lei
  0 siblings, 0 replies; 10+ messages in thread
From: Ming Lei @ 2020-09-10  8:03 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Keith Busch, Jens Axboe, linux-block, linux-nvme,
	Christoph Hellwig, Bart Van Assche, Johannes Thumshirn,
	Chao Leng, Hannes Reinecke

On Wed, Sep 09, 2020 at 01:53:30PM -0700, Sagi Grimberg wrote:
> 
> > >   void blk_mq_quiesce_queue(struct request_queue *q)
> > >   {
> > > -	struct blk_mq_hw_ctx *hctx;
> > > -	unsigned int i;
> > > -	bool rcu = false;
> > > +	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
> > > +	bool was_quiesced = __blk_mq_quiesce_queue_nowait(q);
> > > -	__blk_mq_quiesce_queue_nowait(q);
> > > +	if (!was_quiesced && blocking)
> > > +		percpu_ref_kill(&q->dispatch_counter);
> > > -	queue_for_each_hw_ctx(q, hctx, i) {
> > > -		if (hctx->flags & BLK_MQ_F_BLOCKING)
> > > -			synchronize_srcu(hctx->srcu);
> > > -		else
> > > -			rcu = true;
> > > -	}
> > > -	if (rcu)
> > > +	if (blocking)
> > > +		wait_event(q->mq_quiesce_wq,
> > > +				percpu_ref_is_zero(&q->dispatch_counter));
> > > +	else
> > >   		synchronize_rcu();
> > >   }
> > 
> > In the previous version, you had ensured no thread can unquiesce a queue
> > while another is waiting for quiescence. Now that the locking is gone,
> > a thread could unquiesce the queue before percpu_ref reaches zero, so
> > the wait_event() may never complete on the resurrected percpu_ref.
> 
> Yea, where did that go?

The mutex is removed because:

1) As Bart mentioned, blk_mq_quiesce_queue() may be called in a context
which doesn't allow sleeping.

2) Both percpu_ref_kill() and percpu_ref_resurrect() are protected by
one global spinlock, so the two can safely be run concurrently.

3) A warning may be triggered when percpu_ref_kill() is run on a DEAD
percpu-refcount, or when percpu_ref_resurrect() is run on a live
percpu-refcount. We avoid the warning with test_and_{set|clear}_bit,
exactly because the actual quiesce/unquiesce action runs only once, as
sketched below.
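
The flag thus acts as the gatekeeper, so kill/resurrect each run at
most once per transition (a condensed sketch of the logic in patch 2):

	/* quiesce: only the first caller kills the ref */
	if (!blk_queue_flag_test_and_set(QUEUE_FLAG_QUIESCED, q))
		percpu_ref_kill(&q->dispatch_counter);

	/* unquiesce: only the caller that clears the flag resurrects it */
	if (blk_queue_flag_test_and_clear(QUEUE_FLAG_QUIESCED, q))
		percpu_ref_resurrect(&q->dispatch_counter);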


Thanks,
Ming



* Re: [PATCH V4 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  2020-09-10  1:53     ` Ming Lei
@ 2020-09-10 14:55       ` Keith Busch
  0 siblings, 0 replies; 10+ messages in thread
From: Keith Busch @ 2020-09-10 14:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng,
	Hannes Reinecke

On Thu, Sep 10, 2020 at 09:53:21AM +0800, Ming Lei wrote:
> On Wed, Sep 09, 2020 at 09:04:09AM -0700, Keith Busch wrote:
> > On Wed, Sep 09, 2020 at 06:41:14PM +0800, Ming Lei wrote:
> > >  void blk_mq_quiesce_queue(struct request_queue *q)
> > >  {
> > > -	struct blk_mq_hw_ctx *hctx;
> > > -	unsigned int i;
> > > -	bool rcu = false;
> > > +	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
> > > +	bool was_quiesced = __blk_mq_quiesce_queue_nowait(q);
> > >  
> > > -	__blk_mq_quiesce_queue_nowait(q);
> > > +	if (!was_quiesced && blocking)
> > > +		percpu_ref_kill(&q->dispatch_counter);
> > >  
> > > -	queue_for_each_hw_ctx(q, hctx, i) {
> > > -		if (hctx->flags & BLK_MQ_F_BLOCKING)
> > > -			synchronize_srcu(hctx->srcu);
> > > -		else
> > > -			rcu = true;
> > > -	}
> > > -	if (rcu)
> > > +	if (blocking)
> > > +		wait_event(q->mq_quiesce_wq,
> > > +				percpu_ref_is_zero(&q->dispatch_counter));
> > > +	else
> > >  		synchronize_rcu();
> > >  }
> > 
> > In the previous version, you had ensured no thread can unquiesce a queue
> > while another is waiting for quiescence. Now that the locking is gone,
> > a thread could unquiesce the queue before percpu_ref reaches zero, so
> > the wait_event() may never complete on the resurrected percpu_ref.
> > 
> > I don't think any drivers do such a thing today, but it just looks like
> > the implementation leaves open the possibility.
> 
> Such a driver can cause bigger trouble if it unquiesces a queue whose
> quiesce is still in progress.
> 
> However, we can avoid that by the following delta change:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7669fe815cf9..5632727d71fa 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -225,9 +225,16 @@ static void __blk_mq_quiesce_queue(struct request_queue *q, bool wait)
>  	if (!wait)
>  		return;
>  
> +	/*
> +	 * In case of BLK_MQ_F_BLOCKING, if the driver unquiesces its queue
> +	 * while a quiesce is still in progress, it can cause bigger trouble;
> +	 * simply warn once and return to avoid hanging here.
> +	 */
>  	if (blocking)
>  		wait_event(q->mq_quiesce_wq,
> -				percpu_ref_is_zero(&q->dispatch_counter));
> +				percpu_ref_is_zero(&q->dispatch_counter) ||
> +				WARN_ON_ONCE(!percpu_ref_is_dying(
> +						&q->dispatch_counter)));
>  	else
>  		synchronize_rcu();
>  }

Yeah, I'm okay with this. A warning and return should be good to
indicate driver sequence errors. So if you just want to fold the above
into this patch, then I don't think I have any remaining concerns.

The other race condition is if unquiesce resurrects the ref before
quiesce kills it, and there's already a warning there too.


end of thread

Thread overview: 10+ messages
2020-09-09 10:41 [PATCH V4 0/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
2020-09-09 10:41 ` [PATCH V4 1/4] block: use test_and_{set|clear}_bit to set/clear QUEUE_FLAG_QUIESCED Ming Lei
2020-09-09 10:41 ` [PATCH V4 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
2020-09-09 16:04   ` Keith Busch
2020-09-09 20:53     ` Sagi Grimberg
2020-09-10  8:03       ` Ming Lei
2020-09-10  1:53     ` Ming Lei
2020-09-10 14:55       ` Keith Busch
2020-09-09 10:41 ` [PATCH V4 3/4] blk-mq: add tagset quiesce interface Ming Lei
2020-09-09 10:41 ` [PATCH V4 4/4] nvme: use blk_mq_[un]quiesce_tagset Ming Lei
