linux-block.vger.kernel.org archive mirror
* [PATCH V3 0/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
@ 2020-09-08  8:15 Ming Lei
  2020-09-08  8:15 ` [PATCH V3 1/4] blk-mq: serialize queue quiesce and unquiesce by mutex Ming Lei
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Ming Lei @ 2020-09-08  8:15 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig, Keith Busch
  Cc: Ming Lei, Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng

Hi Jens,

The 1st patch adds .mq_quiesce_mutex for serializing queue quiesce/unquiesce,
and prepares for replacing srcu with percpu_ref.

The 2nd patch replaces srcu with percpu_ref.

The 3rd patch adds tagset quiesce interface.

The 4th patch applies tagset quiesce interface for NVMe subsystem.
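
For illustration, this is roughly how a driver with a shared tagset ends up
using the new interface once the series is applied (sketch only, assuming an
nvme-like 'ctrl->tagset' pointing at the shared blk_mq_tag_set):

	/* stop dispatch on every request queue sharing the tagset */
	blk_mq_quiesce_tagset(ctrl->tagset);

	/* ... reset / tear down the controller ... */

	/* re-allow dispatch and rerun the hw queues */
	blk_mq_unquiesce_tagset(ctrl->tagset);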


V3:
	- add tagset quiesce interface
	- apply tagset quiesce interface for NVMe
	- pass blktests(block, nvme)

V2:
	- add .mq_quiesce_lock
	- add comment on patch 2 wrt. handling hctx_lock() failure
	- trivial patch style change



Ming Lei (3):
  blk-mq: serialize queue quiesce and unquiesce by mutex
  blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  blk-mq: add tagset quiesce interface

Sagi Grimberg (1):
  nvme: use blk_mq_[un]quiesce_tagset

 block/blk-core.c         |   2 +
 block/blk-mq-sysfs.c     |   2 -
 block/blk-mq.c           | 177 +++++++++++++++++++++++++--------------
 block/blk-sysfs.c        |   6 +-
 drivers/nvme/host/core.c |  19 ++---
 include/linux/blk-mq.h   |  10 +--
 include/linux/blkdev.h   |   6 ++
 7 files changed, 140 insertions(+), 82 deletions(-)

Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: Chao Leng <lengchao@huawei.com>
-- 
2.25.2



* [PATCH V3 1/4] blk-mq: serialize queue quiesce and unquiesce by mutex
  2020-09-08  8:15 [PATCH V3 0/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
@ 2020-09-08  8:15 ` Ming Lei
  2020-09-08  8:38   ` Hannes Reinecke
                     ` (2 more replies)
  2020-09-08  8:15 ` [PATCH V3 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 16+ messages in thread
From: Ming Lei @ 2020-09-08  8:15 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig, Keith Busch
  Cc: Ming Lei, Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng

Add .mq_quiesce_mutex to the request queue, so that queue quiesce and
unquiesce can be serialized. Meanwhile we can avoid an unnecessary
synchronize_rcu() in case the queue has already been quiesced.

This prepares for replacing SRCU with percpu-refcount, so that we can avoid
the warning in percpu_ref_kill_and_confirm() in case blk_mq_quiesce_queue()
is run on an already-quiesced request queue.
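
A minimal illustration of what this prevents once the percpu-ref based
quiesce from the next patch is in place (sketch only):

	blk_mq_quiesce_queue(q);	/* kills q->dispatch_counter */
	blk_mq_quiesce_queue(q);	/* without the quiesced check under
					 * mq_quiesce_lock this would kill the
					 * ref again and hit the WARN in
					 * percpu_ref_kill_and_confirm(); with
					 * it, the second call only waits */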

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: Chao Leng <lengchao@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       |  2 ++
 block/blk-mq.c         | 11 +++++++++++
 include/linux/blkdev.h |  2 ++
 3 files changed, 15 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 093649bd252e..e7e787ea77be 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -564,6 +564,8 @@ struct request_queue *blk_alloc_queue(int node_id)
 	init_waitqueue_head(&q->mq_freeze_wq);
 	mutex_init(&q->mq_freeze_lock);
 
+	mutex_init(&q->mq_quiesce_lock);
+
 	/*
 	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
 	 * See blk_register_queue() for details.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4abb71459f94..13cc10b89629 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -224,6 +224,11 @@ void blk_mq_quiesce_queue(struct request_queue *q)
 	unsigned int i;
 	bool rcu = false;
 
+	mutex_lock(&q->mq_quiesce_lock);
+
+	if (blk_queue_quiesced(q))
+		goto exit;
+
 	blk_mq_quiesce_queue_nowait(q);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
@@ -234,6 +239,8 @@ void blk_mq_quiesce_queue(struct request_queue *q)
 	}
 	if (rcu)
 		synchronize_rcu();
+ exit:
+	mutex_unlock(&q->mq_quiesce_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
 
@@ -246,10 +253,14 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
  */
 void blk_mq_unquiesce_queue(struct request_queue *q)
 {
+	mutex_lock(&q->mq_quiesce_lock);
+
 	blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
 
 	/* dispatch requests which are inserted during quiescing */
 	blk_mq_run_hw_queues(q, true);
+
+	mutex_unlock(&q->mq_quiesce_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7b1e53084799..cc6fb4d0d078 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -572,6 +572,8 @@ struct request_queue {
 	 */
 	struct mutex		mq_freeze_lock;
 
+	struct mutex		mq_quiesce_lock;
+
 	struct blk_mq_tag_set	*tag_set;
 	struct list_head	tag_set_list;
 	struct bio_set		bio_split;
-- 
2.25.2



* [PATCH V3 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  2020-09-08  8:15 [PATCH V3 0/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
  2020-09-08  8:15 ` [PATCH V3 1/4] blk-mq: serialize queue quiesce and unquiesce by mutex Ming Lei
@ 2020-09-08  8:15 ` Ming Lei
  2020-09-08  8:44   ` Hannes Reinecke
                     ` (2 more replies)
  2020-09-08  8:15 ` [PATCH V3 3/4] blk-mq: add tagset quiesce interface Ming Lei
  2020-09-08  8:15 ` [PATCH V3 4/4] nvme: use blk_mq_[un]quiesce_tagset Ming Lei
  3 siblings, 3 replies; 16+ messages in thread
From: Ming Lei @ 2020-09-08  8:15 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig, Keith Busch
  Cc: Ming Lei, Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng

In case of BLK_MQ_F_BLOCKING, blk-mq uses SRCU to mark the read critical
section while dispatching requests, and request queue quiesce is then built
on top of SRCU. What we want is to add as little cost as possible to the
fast path.

With percpu-ref, the implementation is cleaner and simpler, and it is
sufficient for implementing queue quiesce. The main requirement is to make
sure all read sections observe QUEUE_FLAG_QUIESCED once
blk_mq_quiesce_queue() returns.

It also becomes much easier to add an async queue quiesce interface.

Meanwhile the memory footprint can be reduced with a per-request-queue
percpu-ref.

From the implementation viewpoint, the percpu_ref fast path is not seen to
be slower than SRCU's, and tree SRCU (the default option in most
distributions) could even be slower since a memory barrier is required in
both lock and unlock, while rcu_read_lock()/rcu_read_unlock() should be
much cheaper than smp_mb().

1) percpu_ref just holds the rcu_read_lock, then runs a check and an
   increment/decrement on the percpu variable:

   rcu_read_lock()
   if (__ref_is_percpu(ref, &percpu_count))
	this_cpu_inc(*percpu_count);
   rcu_read_unlock()

2) srcu tree:
        idx = READ_ONCE(ssp->srcu_idx) & 0x1;
        this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
        smp_mb(); /* B */  /* Avoid leaking the critical section. */
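
On the quiesce (writer) side, the change below boils down to a single kill
of the per-queue ref plus a wait for it to drain, instead of one
synchronize_srcu() per blocking hctx (sketch of the change below):

        percpu_ref_kill(&q->dispatch_counter);
        wait_event(q->mq_quiesce_wq,
                   percpu_ref_is_zero(&q->dispatch_counter));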

Also, from my test on null_blk (blocking mode), percpu-ref is not observed
to perform worse than SRCU; see the following test:

1) test steps:

rmmod null_blk > /dev/null 2>&1
modprobe null_blk nr_devices=1 submit_queues=1 blocking=1
fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 --ioengine=libaio \
	--iodepth=64 --runtime=60 --group_reporting=1  --name=nullb0 \
	--filename=/dev/nullb0 --numjobs=32

test machine: HP DL380, 16 cpu cores, 2 threads per core, dual
sockets/numa, Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz

2) test result:
- srcu quiesce: 6063K IOPS
- percpu-ref quiesce: 6113K IOPS

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: Chao Leng <lengchao@huawei.com>
---
 block/blk-mq-sysfs.c   |   2 -
 block/blk-mq.c         | 130 +++++++++++++++++++++--------------------
 block/blk-sysfs.c      |   6 +-
 include/linux/blk-mq.h |   8 ---
 include/linux/blkdev.h |   4 ++
 5 files changed, 77 insertions(+), 73 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 062229395a50..799db7937105 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -38,8 +38,6 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 
 	cancel_delayed_work_sync(&hctx->run_work);
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		cleanup_srcu_struct(hctx->srcu);
 	blk_free_flush_queue(hctx->fq);
 	sbitmap_free(&hctx->ctx_map);
 	free_cpumask_var(hctx->cpumask);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 13cc10b89629..60630a720449 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -220,26 +220,22 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
  */
 void blk_mq_quiesce_queue(struct request_queue *q)
 {
-	struct blk_mq_hw_ctx *hctx;
-	unsigned int i;
-	bool rcu = false;
+	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
 
 	mutex_lock(&q->mq_quiesce_lock);
 
-	if (blk_queue_quiesced(q))
-		goto exit;
-
-	blk_mq_quiesce_queue_nowait(q);
-
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (hctx->flags & BLK_MQ_F_BLOCKING)
-			synchronize_srcu(hctx->srcu);
-		else
-			rcu = true;
+	if (!blk_queue_quiesced(q)) {
+		blk_mq_quiesce_queue_nowait(q);
+		if (blocking)
+			percpu_ref_kill(&q->dispatch_counter);
 	}
-	if (rcu)
+
+	if (blocking)
+		wait_event(q->mq_quiesce_wq,
+			   percpu_ref_is_zero(&q->dispatch_counter));
+	else
 		synchronize_rcu();
- exit:
+
 	mutex_unlock(&q->mq_quiesce_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
@@ -255,7 +251,12 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
 {
 	mutex_lock(&q->mq_quiesce_lock);
 
-	blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
+	if (blk_queue_quiesced(q)) {
+		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
+
+		if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
+			percpu_ref_resurrect(&q->dispatch_counter);
+	}
 
 	/* dispatch requests which are inserted during quiescing */
 	blk_mq_run_hw_queues(q, true);
@@ -710,24 +711,21 @@ void blk_mq_complete_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
-static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
-	__releases(hctx->srcu)
+static void hctx_unlock(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-		rcu_read_unlock();
+	if (hctx->flags & BLK_MQ_F_BLOCKING)
+		percpu_ref_put(&hctx->queue->dispatch_counter);
 	else
-		srcu_read_unlock(hctx->srcu, srcu_idx);
+		rcu_read_unlock();
 }
 
-static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
-	__acquires(hctx->srcu)
+/* Returning false means that queue is being quiesced */
+static inline bool hctx_lock(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		/* shut up gcc false positive */
-		*srcu_idx = 0;
-		rcu_read_lock();
-	} else
-		*srcu_idx = srcu_read_lock(hctx->srcu);
+	if (hctx->flags & BLK_MQ_F_BLOCKING)
+		return percpu_ref_tryget_live(&hctx->queue->dispatch_counter);
+	rcu_read_lock();
+	return true;
 }
 
 /**
@@ -1506,8 +1504,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
  */
 static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 {
-	int srcu_idx;
-
 	/*
 	 * We should be running this queue from one of the CPUs that
 	 * are mapped to it.
@@ -1541,9 +1537,10 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 
 	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-	hctx_lock(hctx, &srcu_idx);
-	blk_mq_sched_dispatch_requests(hctx);
-	hctx_unlock(hctx, srcu_idx);
+	if (hctx_lock(hctx)) {
+		blk_mq_sched_dispatch_requests(hctx);
+		hctx_unlock(hctx);
+	}
 }
 
 static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
@@ -1655,7 +1652,6 @@ EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
  */
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
-	int srcu_idx;
 	bool need_run;
 
 	/*
@@ -1666,10 +1662,12 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
 	 * quiesced.
 	 */
-	hctx_lock(hctx, &srcu_idx);
+	if (!hctx_lock(hctx))
+		return;
+
 	need_run = !blk_queue_quiesced(hctx->queue) &&
 		blk_mq_hctx_has_pending(hctx);
-	hctx_unlock(hctx, srcu_idx);
+	hctx_unlock(hctx);
 
 	if (need_run)
 		__blk_mq_delay_run_hw_queue(hctx, async, 0);
@@ -2009,7 +2007,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	bool run_queue = true;
 
 	/*
-	 * RCU or SRCU read lock is needed before checking quiesced flag.
+	 * hctx_lock() is needed before checking quiesced flag.
 	 *
 	 * When queue is stopped or quiesced, ignore 'bypass_insert' from
 	 * blk_mq_request_issue_directly(), and return BLK_STS_OK to caller,
@@ -2057,11 +2055,14 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, blk_qc_t *cookie)
 {
 	blk_status_t ret;
-	int srcu_idx;
 
 	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
-	hctx_lock(hctx, &srcu_idx);
+	/* Insert request to queue in case of being quiesced */
+	if (!hctx_lock(hctx)) {
+		blk_mq_sched_insert_request(rq, false, false, false);
+		return;
+	}
 
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false, true);
 	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
@@ -2069,19 +2070,22 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
 
-	hctx_unlock(hctx, srcu_idx);
+	hctx_unlock(hctx);
 }
 
 blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
 {
 	blk_status_t ret;
-	int srcu_idx;
 	blk_qc_t unused_cookie;
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
-	hctx_lock(hctx, &srcu_idx);
+	/* Insert request to queue in case of being quiesced */
+	if (!hctx_lock(hctx)) {
+		blk_mq_sched_insert_request(rq, false, false, false);
+		return BLK_STS_OK;
+	}
 	ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true, last);
-	hctx_unlock(hctx, srcu_idx);
+	hctx_unlock(hctx);
 
 	return ret;
 }
@@ -2612,20 +2616,6 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	}
 }
 
-static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
-{
-	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
-
-	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
-			   __alignof__(struct blk_mq_hw_ctx)) !=
-		     sizeof(struct blk_mq_hw_ctx));
-
-	if (tag_set->flags & BLK_MQ_F_BLOCKING)
-		hw_ctx_size += sizeof(struct srcu_struct);
-
-	return hw_ctx_size;
-}
-
 static int blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
@@ -2663,7 +2653,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	struct blk_mq_hw_ctx *hctx;
 	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
 
-	hctx = kzalloc_node(blk_mq_hw_ctx_size(set), gfp, node);
+	hctx = kzalloc_node(sizeof(struct blk_mq_hw_ctx), gfp, node);
 	if (!hctx)
 		goto fail_alloc_hctx;
 
@@ -2706,8 +2696,6 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	if (!hctx->fq)
 		goto free_bitmap;
 
-	if (hctx->flags & BLK_MQ_F_BLOCKING)
-		init_srcu_struct(hctx->srcu);
 	blk_mq_hctx_kobj_init(hctx);
 
 	return hctx;
@@ -3187,6 +3175,13 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	mutex_unlock(&q->sysfs_lock);
 }
 
+static void blk_mq_dispatch_counter_release(struct percpu_ref *ref)
+{
+	struct request_queue *q = container_of(ref, struct request_queue,
+				dispatch_counter);
+	wake_up_all(&q->mq_quiesce_wq);
+}
+
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q,
 						  bool elevator_init)
@@ -3203,6 +3198,14 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (blk_mq_alloc_ctxs(q))
 		goto err_poll;
 
+	if (set->flags & BLK_MQ_F_BLOCKING) {
+		init_waitqueue_head(&q->mq_quiesce_wq);
+		if (percpu_ref_init(&q->dispatch_counter,
+					blk_mq_dispatch_counter_release,
+					PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
+			goto err_hctxs;
+	}
+
 	/* init q->mq_kobj and sw queues' kobjects */
 	blk_mq_sysfs_init(q);
 
@@ -3211,7 +3214,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
-		goto err_hctxs;
+		goto err_dispatch_counter;
 
 	INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
@@ -3245,6 +3248,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 	return q;
 
+err_dispatch_counter:
+	if (set->flags & BLK_MQ_F_BLOCKING)
+		percpu_ref_exit(&q->dispatch_counter);
 err_hctxs:
 	kfree(q->queue_hw_ctx);
 	q->nr_hw_queues = 0;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7dda709f3ccb..56b6c045e30c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -941,9 +941,13 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_queue_free_zone_bitmaps(q);
 
-	if (queue_is_mq(q))
+	if (queue_is_mq(q)) {
 		blk_mq_release(q);
 
+		if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
+			percpu_ref_exit(&q->dispatch_counter);
+	}
+
 	blk_trace_shutdown(q);
 	mutex_lock(&q->debugfs_mutex);
 	debugfs_remove_recursive(q->debugfs_dir);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b23eeca4d677..df642055f02c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -4,7 +4,6 @@
 
 #include <linux/blkdev.h>
 #include <linux/sbitmap.h>
-#include <linux/srcu.h>
 
 struct blk_mq_tags;
 struct blk_flush_queue;
@@ -173,13 +172,6 @@ struct blk_mq_hw_ctx {
 	 * q->unused_hctx_list.
 	 */
 	struct list_head	hctx_list;
-
-	/**
-	 * @srcu: Sleepable RCU. Use as lock when type of the hardware queue is
-	 * blocking (BLK_MQ_F_BLOCKING). Must be the last member - see also
-	 * blk_mq_hw_ctx_size().
-	 */
-	struct srcu_struct	srcu[];
 };
 
 /**
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cc6fb4d0d078..f27819574e86 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -574,6 +574,10 @@ struct request_queue {
 
 	struct mutex		mq_quiesce_lock;
 
+	/* only used for BLK_MQ_F_BLOCKING */
+	struct percpu_ref	dispatch_counter;
+	wait_queue_head_t	mq_quiesce_wq;
+
 	struct blk_mq_tag_set	*tag_set;
 	struct list_head	tag_set_list;
 	struct bio_set		bio_split;
-- 
2.25.2



* [PATCH V3 3/4] blk-mq: add tagset quiesce interface
  2020-09-08  8:15 [PATCH V3 0/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
  2020-09-08  8:15 ` [PATCH V3 1/4] blk-mq: serialize queue quiesce and unquiesce by mutex Ming Lei
  2020-09-08  8:15 ` [PATCH V3 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
@ 2020-09-08  8:15 ` Ming Lei
  2020-09-08  8:45   ` Hannes Reinecke
  2020-09-08  8:15 ` [PATCH V3 4/4] nvme: use blk_mq_[un]quiesce_tagset Ming Lei
  3 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2020-09-08  8:15 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig, Keith Busch
  Cc: Ming Lei, Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng

Drivers that have shared tagsets may need to quiesce a potentially large
number of request queues that all share a single tagset (e.g. nvme). Add an
interface to quiesce all the queues on a given tagset. This interface is
useful because it can speed up the quiesce by doing it in parallel.

For tagsets that have BLK_MQ_F_BLOCKING set, we kill each request queue's
dispatch percpu-refcount and then wait for all of the counters to become
zero. For tagsets that don't have BLK_MQ_F_BLOCKING set, a single
synchronize_rcu() is sufficient.

This patch is against Sagi's original post.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: Chao Leng <lengchao@huawei.com>
---
 block/blk-mq.c         | 68 ++++++++++++++++++++++++++++++++----------
 include/linux/blk-mq.h |  2 ++
 2 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 60630a720449..49933d7dceba 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -209,16 +209,7 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
 
-/**
- * blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
- * @q: request queue.
- *
- * Note: this function does not prevent that the struct request end_io()
- * callback function is invoked. Once this function is returned, we make
- * sure no dispatch can happen until the queue is unquiesced via
- * blk_mq_unquiesce_queue().
- */
-void blk_mq_quiesce_queue(struct request_queue *q)
+static void __blk_mq_quiesce_queue(struct request_queue *q, bool wait)
 {
 	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
 
@@ -230,14 +221,30 @@ void blk_mq_quiesce_queue(struct request_queue *q)
 			percpu_ref_kill(&q->dispatch_counter);
 	}
 
-	if (blocking)
-		wait_event(q->mq_quiesce_wq,
-			   percpu_ref_is_zero(&q->dispatch_counter));
-	else
-		synchronize_rcu();
+	if (wait) {
+		if (blocking)
+			wait_event(q->mq_quiesce_wq,
+				   percpu_ref_is_zero(&q->dispatch_counter));
+		else
+			synchronize_rcu();
+	}
 
 	mutex_unlock(&q->mq_quiesce_lock);
 }
+
+/**
+ * blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
+ * @q: request queue.
+ *
+ * Note: this function does not prevent that the struct request end_io()
+ * callback function is invoked. Once this function is returned, we make
+ * sure no dispatch can happen until the queue is unquiesced via
+ * blk_mq_unquiesce_queue().
+ */
+void blk_mq_quiesce_queue(struct request_queue *q)
+{
+	__blk_mq_quiesce_queue(q, true);
+}
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
 
 /*
@@ -265,6 +272,37 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
 
+void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	mutex_lock(&set->tag_list_lock);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		__blk_mq_quiesce_queue(q, false);
+
+	/* wait until all queues' quiesce is done */
+	if (set->flags & BLK_MQ_F_BLOCKING) {
+		list_for_each_entry(q, &set->tag_list, tag_set_list)
+			wait_event(q->mq_quiesce_wq,
+				   percpu_ref_is_zero(&q->dispatch_counter));
+	} else {
+		synchronize_rcu();
+	}
+	mutex_unlock(&set->tag_list_lock);
+}
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_tagset);
+
+void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	mutex_lock(&set->tag_list_lock);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		blk_mq_unquiesce_queue(q);
+	mutex_unlock(&set->tag_list_lock);
+}
+EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
+
 void blk_mq_wake_waiters(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index df642055f02c..90da3582b91d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -519,6 +519,8 @@ int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
+void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set);
+void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set);
 
 unsigned int blk_mq_rq_cpu(struct request *rq);
 
-- 
2.25.2



* [PATCH V3 4/4] nvme: use blk_mq_[un]quiesce_tagset
  2020-09-08  8:15 [PATCH V3 0/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
                   ` (2 preceding siblings ...)
  2020-09-08  8:15 ` [PATCH V3 3/4] blk-mq: add tagset quiesce interface Ming Lei
@ 2020-09-08  8:15 ` Ming Lei
  2020-09-08  8:46   ` Hannes Reinecke
  3 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2020-09-08  8:15 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig, Keith Busch
  Cc: Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng, Ming Lei

From: Sagi Grimberg <sagi@grimberg.me>

All controller namespaces share the same tagset, so we can use this
interface, which performs the optimal operation for parallel quiesce
based on the tagset type (i.e. blocking vs. non-blocking tagsets).

Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: Chao Leng <lengchao@huawei.com>

Add code to unquiesce ctrl->connect_q in nvme_stop_queues(), and avoid
calling blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset() if the tagset
isn't initialized.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ea1fa41fbba8..a6af8978a3ba 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4623,23 +4623,22 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ns *ns;
+	if (list_empty_careful(&ctrl->namespaces))
+		return;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_quiesce_queue(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+	blk_mq_quiesce_tagset(ctrl->tagset);
+
+	if (ctrl->connect_q)
+		blk_mq_unquiesce_queue(ctrl->connect_q);
 }
 EXPORT_SYMBOL_GPL(nvme_stop_queues);
 
 void nvme_start_queues(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ns *ns;
+	if (list_empty_careful(&ctrl->namespaces))
+		return;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_unquiesce_queue(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+	blk_mq_unquiesce_tagset(ctrl->tagset);
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
-- 
2.25.2



* Re: [PATCH V3 1/4] blk-mq: serialize queue quiesce and unquiesce by mutex
  2020-09-08  8:15 ` [PATCH V3 1/4] blk-mq: serialize queue quiesce and unquiesce by mutex Ming Lei
@ 2020-09-08  8:38   ` Hannes Reinecke
  2020-09-08 11:31   ` Johannes Thumshirn
  2020-09-08 17:54   ` Bart Van Assche
  2 siblings, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2020-09-08  8:38 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Keith Busch
  Cc: Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng

On 9/8/20 10:15 AM, Ming Lei wrote:
> Add .mq_quiesce_mutext to request queue, so that queue quiesce and
> unquiesce can be serialized. Meantime we can avoid unnecessary
> synchronize_rcu() in case that queue has been quiesced already.
> 
> Prepare for replace SRCU with percpu-refcount, so that we can avoid
> warning in percpu_ref_kill_and_confirm() in case that blk_mq_quiesce_queue()
> is run on already-quiesced request queue.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> Cc: Chao Leng <lengchao@huawei.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c       |  2 ++
>  block/blk-mq.c         | 11 +++++++++++
>  include/linux/blkdev.h |  2 ++
>  3 files changed, 15 insertions(+)
I was wondering if we couldn't make do with a simple test_and_set_bit() /
test_and_clear_bit() (rough sketch below), but that can also be done in a
later patch, seeing that this is a worthwhile improvement. So:

Reviewed-by: Hannes Reinecke <hare@suse.de>
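
A rough sketch of the bit-based idea (hypothetical, reusing the existing
blk_queue_flag_test_and_set() helper and the percpu-ref from patch 2,
BLK_MQ_F_BLOCKING case only):

	void blk_mq_quiesce_queue(struct request_queue *q)
	{
		/* only the caller that flips the flag kills the ref */
		if (!blk_queue_flag_test_and_set(QUEUE_FLAG_QUIESCED, q))
			percpu_ref_kill(&q->dispatch_counter);

		/* every caller waits for in-flight dispatches to drain */
		wait_event(q->mq_quiesce_wq,
			   percpu_ref_is_zero(&q->dispatch_counter));
	}

though unlike the mutex this wouldn't serialize against a concurrent
blk_mq_unquiesce_queue().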

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: [PATCH V3 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  2020-09-08  8:15 ` [PATCH V3 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
@ 2020-09-08  8:44   ` Hannes Reinecke
  2020-09-08  9:13   ` Chao Leng
  2020-09-08 15:31   ` Keith Busch
  2 siblings, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2020-09-08  8:44 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Keith Busch
  Cc: Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng

On 9/8/20 10:15 AM, Ming Lei wrote:
> In case of BLK_MQ_F_BLOCKING, blk-mq uses SRCU to mark read critical
> section during dispatching request, then request queue quiesce is based on
> SRCU. What we want to get is low cost added in fast path.
> 
> With percpu-ref, it is cleaner and simpler & enough for implementing queue
> quiesce. The main requirement is to make sure all read sections to observe
> QUEUE_FLAG_QUIESCED once blk_mq_quiesce_queue() returns.
> 
> Also it becomes much easier to add interface of async queue quiesce.
> 
> Meantime memory footprint can be reduced with per-request-queue percpu-ref.
> 
> From implementation viewpoint, in fast path, not see percpu_ref is
> slower than SRCU, and srcu tree(default option in most distributions)
> could be slower since memory barrier is required in both lock & unlock,
> and rcu_read_lock()/rcu_read_unlock() should be much cheap than
> smp_mb().
> 
> 1) percpu_ref just hold the rcu_read_lock, then run a check &
>    increase/decrease on the percpu variable:
> 
>    rcu_read_lock()
>    if (__ref_is_percpu(ref, &percpu_count))
> 	this_cpu_inc(*percpu_count);
>    rcu_read_unlock()
> 
> 2) srcu tree:
>         idx = READ_ONCE(ssp->srcu_idx) & 0x1;
>         this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
>         smp_mb(); /* B */  /* Avoid leaking the critical section. */
> 
> Also from my test on null_blk(blocking), not observe percpu-ref performs
> worse than srcu, see the following test:
> 
> 1) test steps:
> 
> rmmod null_blk > /dev/null 2>&1
> modprobe null_blk nr_devices=1 submit_queues=1 blocking=1
> fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 --ioengine=libaio \
> 	--iodepth=64 --runtime=60 --group_reporting=1  --name=nullb0 \
> 	--filename=/dev/nullb0 --numjobs=32
> 
> test machine: HP DL380, 16 cpu cores, 2 threads per core, dual
> sockets/numa, Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
> 
> 2) test result:
> - srcu quiesce: 6063K IOPS
> - percpu-ref quiesce: 6113K IOPS
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> Cc: Chao Leng <lengchao@huawei.com>
> ---
>  block/blk-mq-sysfs.c   |   2 -
>  block/blk-mq.c         | 130 +++++++++++++++++++++--------------------
>  block/blk-sysfs.c      |   6 +-
>  include/linux/blk-mq.h |   8 ---
>  include/linux/blkdev.h |   4 ++
>  5 files changed, 77 insertions(+), 73 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: [PATCH V3 3/4] blk-mq: add tagset quiesce interface
  2020-09-08  8:15 ` [PATCH V3 3/4] blk-mq: add tagset quiesce interface Ming Lei
@ 2020-09-08  8:45   ` Hannes Reinecke
  0 siblings, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2020-09-08  8:45 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Keith Busch
  Cc: Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng

On 9/8/20 10:15 AM, Ming Lei wrote:
> drivers that have shared tagsets may need to quiesce potentially a lot
> of request queues that all share a single tagset (e.g. nvme). Add an interface
> to quiesce all the queues on a given tagset. This interface is useful because
> it can speedup the quiesce by doing it in parallel.
> 
> For tagsets that have BLK_MQ_F_BLOCKING set, we kill request queue's dispatch
> percpu-refcount such that all of them wait for the counter becoming zero. For
> tagsets that don't have BLK_MQ_F_BLOCKING set, we simply call a single
> synchronize_rcu as this is sufficient.
> 
> This patch is against Sagi's original post.
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Sagi <ming.lei@redhat.com>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> Cc: Chao Leng <lengchao@huawei.com>
> ---
>  block/blk-mq.c         | 68 ++++++++++++++++++++++++++++++++----------
>  include/linux/blk-mq.h |  2 ++
>  2 files changed, 55 insertions(+), 15 deletions(-)
> 
Makes one wonder for which devices we _do not_ have a shared tagset :-)

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: [PATCH V3 4/4] nvme: use blk_mq_[un]quiesce_tagset
  2020-09-08  8:15 ` [PATCH V3 4/4] nvme: use blk_mq_[un]quiesce_tagset Ming Lei
@ 2020-09-08  8:46   ` Hannes Reinecke
  0 siblings, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2020-09-08  8:46 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Keith Busch
  Cc: Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng

On 9/8/20 10:15 AM, Ming Lei wrote:
> From: Sagi Grimberg <sagi@grimberg.me>
> 
> All controller namespaces share the same tagset, so we
> can use this interface which does the optimal operation
> for parallel quiesce based on the tagset type (e.g.
> blocking tagsets and non-blocking tagsets).
> 
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> Cc: Chao Leng <lengchao@huawei.com>
> 
> Add code to unquiesce ctrl->connect_q in nvme_stop_queues(), meantime
> avoid to call blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset() if
> this tagset isn't initialized.
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  drivers/nvme/host/core.c | 19 +++++++++----------
>  1 file changed, 9 insertions(+), 10 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: [PATCH V3 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  2020-09-08  8:15 ` [PATCH V3 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
  2020-09-08  8:44   ` Hannes Reinecke
@ 2020-09-08  9:13   ` Chao Leng
  2020-09-08  9:27     ` Ming Lei
  2020-09-08 15:31   ` Keith Busch
  2 siblings, 1 reply; 16+ messages in thread
From: Chao Leng @ 2020-09-08  9:13 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Keith Busch
  Cc: Sagi Grimberg, Bart Van Assche, Johannes Thumshirn



On 2020/9/8 16:15, Ming Lei wrote:
> In case of BLK_MQ_F_BLOCKING, blk-mq uses SRCU to mark read critical
> section during dispatching request, then request queue quiesce is based on
> SRCU. What we want to get is low cost added in fast path.
> 
> With percpu-ref, it is cleaner and simpler & enough for implementing queue
> quiesce. The main requirement is to make sure all read sections to observe
> QUEUE_FLAG_QUIESCED once blk_mq_quiesce_queue() returns.
> 
> Also it becomes much easier to add interface of async queue quiesce.
> 
> Meantime memory footprint can be reduced with per-request-queue percpu-ref.
> 
> > From implementation viewpoint, in fast path, not see percpu_ref is
> slower than SRCU, and srcu tree(default option in most distributions)
> could be slower since memory barrier is required in both lock & unlock,
> and rcu_read_lock()/rcu_read_unlock() should be much cheap than
> smp_mb().
> 
> 1) percpu_ref just hold the rcu_read_lock, then run a check &
>     increase/decrease on the percpu variable:
> 
>     rcu_read_lock()
>     if (__ref_is_percpu(ref, &percpu_count))
> 	this_cpu_inc(*percpu_count);
>     rcu_read_unlock()
> 
> 2) srcu tree:
>          idx = READ_ONCE(ssp->srcu_idx) & 0x1;
>          this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
>          smp_mb(); /* B */  /* Avoid leaking the critical section. */
> 
> Also from my test on null_blk(blocking), not observe percpu-ref performs
> worse than srcu, see the following test:
> 
> 1) test steps:
> 
> rmmod null_blk > /dev/null 2>&1
> modprobe null_blk nr_devices=1 submit_queues=1 blocking=1
> fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 --ioengine=libaio \
> 	--iodepth=64 --runtime=60 --group_reporting=1  --name=nullb0 \
> 	--filename=/dev/nullb0 --numjobs=32
> 
> test machine: HP DL380, 16 cpu cores, 2 threads per core, dual
> sockets/numa, Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
> 
> 2) test result:
> - srcu quiesce: 6063K IOPS
> - percpu-ref quiesce: 6113K IOPS
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> Cc: Chao Leng <lengchao@huawei.com>
> ---
>   block/blk-mq-sysfs.c   |   2 -
>   block/blk-mq.c         | 130 +++++++++++++++++++++--------------------
>   block/blk-sysfs.c      |   6 +-
>   include/linux/blk-mq.h |   8 ---
>   include/linux/blkdev.h |   4 ++
>   5 files changed, 77 insertions(+), 73 deletions(-)
> 
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 062229395a50..799db7937105 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -38,8 +38,6 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
>   
>   	cancel_delayed_work_sync(&hctx->run_work);
>   
> -	if (hctx->flags & BLK_MQ_F_BLOCKING)
> -		cleanup_srcu_struct(hctx->srcu);
>   	blk_free_flush_queue(hctx->fq);
>   	sbitmap_free(&hctx->ctx_map);
>   	free_cpumask_var(hctx->cpumask);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 13cc10b89629..60630a720449 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -220,26 +220,22 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
>    */
>   void blk_mq_quiesce_queue(struct request_queue *q)
>   {
> -	struct blk_mq_hw_ctx *hctx;
> -	unsigned int i;
> -	bool rcu = false;
> +	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
>   
>   	mutex_lock(&q->mq_quiesce_lock);
>   
> -	if (blk_queue_quiesced(q))
> -		goto exit;
> -
> -	blk_mq_quiesce_queue_nowait(q);
> -
> -	queue_for_each_hw_ctx(q, hctx, i) {
> -		if (hctx->flags & BLK_MQ_F_BLOCKING)
> -			synchronize_srcu(hctx->srcu);
> -		else
> -			rcu = true;
> +	if (!blk_queue_quiesced(q)) {
> +		blk_mq_quiesce_queue_nowait(q);
> +		if (blocking)
> +			percpu_ref_kill(&q->dispatch_counter);
>   	}
> -	if (rcu)
> +
> +	if (blocking)
> +		wait_event(q->mq_quiesce_wq,
> +			   percpu_ref_is_zero(&q->dispatch_counter));
> +	else
>   		synchronize_rcu();
> - exit:
> +
>   	mutex_unlock(&q->mq_quiesce_lock);
>   }
>   EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
> @@ -255,7 +251,12 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
>   {
>   	mutex_lock(&q->mq_quiesce_lock);
>   
> -	blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> +	if (blk_queue_quiesced(q)) {
> +		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> +
> +		if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
> +			percpu_ref_resurrect(&q->dispatch_counter);
> +	}
>   
>   	/* dispatch requests which are inserted during quiescing */
>   	blk_mq_run_hw_queues(q, true);
> @@ -710,24 +711,21 @@ void blk_mq_complete_request(struct request *rq)
>   }
>   EXPORT_SYMBOL(blk_mq_complete_request);
>   
> -static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
> -	__releases(hctx->srcu)
> +static void hctx_unlock(struct blk_mq_hw_ctx *hctx)
>   {
> -	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> -		rcu_read_unlock();
> +	if (hctx->flags & BLK_MQ_F_BLOCKING)
> +		percpu_ref_put(&hctx->queue->dispatch_counter);
>   	else
> -		srcu_read_unlock(hctx->srcu, srcu_idx);
> +		rcu_read_unlock();
>   }
>   
> -static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
> -	__acquires(hctx->srcu)
> +/* Returning false means that queue is being quiesced */
> +static inline bool hctx_lock(struct blk_mq_hw_ctx *hctx)
>   {
> -	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
> -		/* shut up gcc false positive */
> -		*srcu_idx = 0;
> -		rcu_read_lock();
> -	} else
> -		*srcu_idx = srcu_read_lock(hctx->srcu);
> +	if (hctx->flags & BLK_MQ_F_BLOCKING)
> +		return percpu_ref_tryget_live(&hctx->queue->dispatch_counter);
> +	rcu_read_lock();
> +	return true;
>   }
>   
>   /**
> @@ -1506,8 +1504,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
>    */
>   static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>   {
> -	int srcu_idx;
> -
>   	/*
>   	 * We should be running this queue from one of the CPUs that
>   	 * are mapped to it.
> @@ -1541,9 +1537,10 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>   
>   	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
>   
> -	hctx_lock(hctx, &srcu_idx);
> -	blk_mq_sched_dispatch_requests(hctx);
> -	hctx_unlock(hctx, srcu_idx);
> +	if (hctx_lock(hctx)) {
> +		blk_mq_sched_dispatch_requests(hctx);
> +		hctx_unlock(hctx);
> +	}
>   }
>   
>   static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
> @@ -1655,7 +1652,6 @@ EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
>    */
>   void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>   {
> -	int srcu_idx;
>   	bool need_run;
>   
>   	/*
> @@ -1666,10 +1662,12 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>   	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
>   	 * quiesced.
>   	 */
> -	hctx_lock(hctx, &srcu_idx);
> +	if (!hctx_lock(hctx))
> +		return;
> +
>   	need_run = !blk_queue_quiesced(hctx->queue) &&
>   		blk_mq_hctx_has_pending(hctx);
> -	hctx_unlock(hctx, srcu_idx);
> +	hctx_unlock(hctx);
>   
>   	if (need_run)
>   		__blk_mq_delay_run_hw_queue(hctx, async, 0);
> @@ -2009,7 +2007,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   	bool run_queue = true;
>   
>   	/*
> -	 * RCU or SRCU read lock is needed before checking quiesced flag.
> +	 * hctx_lock() is needed before checking quiesced flag.
>   	 *
>   	 * When queue is stopped or quiesced, ignore 'bypass_insert' from
>   	 * blk_mq_request_issue_directly(), and return BLK_STS_OK to caller,
> @@ -2057,11 +2055,14 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   		struct request *rq, blk_qc_t *cookie)
>   {
>   	blk_status_t ret;
> -	int srcu_idx;
>   
>   	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
>   
> -	hctx_lock(hctx, &srcu_idx);
> +	/* Insert request to queue in case of being quiesced */
> +	if (!hctx_lock(hctx)) {
> +		blk_mq_sched_insert_request(rq, false, false, false);
Suggest: use blk_mq_request_bypass_insert(); the rq should be dispatched first.
> +		return;
> +	}
>   
>   	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false, true);
>   	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
> @@ -2069,19 +2070,22 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   	else if (ret != BLK_STS_OK)
>   		blk_mq_end_request(rq, ret);
>   
> -	hctx_unlock(hctx, srcu_idx);
> +	hctx_unlock(hctx);
>   }
>   
>   blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
>   {
>   	blk_status_t ret;
> -	int srcu_idx;
>   	blk_qc_t unused_cookie;
>   	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>   
> -	hctx_lock(hctx, &srcu_idx);
> +	/* Insert request to queue in case of being quiesced */
> +	if (!hctx_lock(hctx)) {
> +		blk_mq_sched_insert_request(rq, false, false, false);
Same here.
> +		return BLK_STS_OK;
> +	}
>   	ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true, last);
> -	hctx_unlock(hctx, srcu_idx);
> +	hctx_unlock(hctx);
>   
>   	return ret;
>   }
> @@ -2612,20 +2616,6 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
>   	}
>   }
>   
> -static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
> -{
> -	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
> -
> -	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
> -			   __alignof__(struct blk_mq_hw_ctx)) !=
> -		     sizeof(struct blk_mq_hw_ctx));
> -
> -	if (tag_set->flags & BLK_MQ_F_BLOCKING)
> -		hw_ctx_size += sizeof(struct srcu_struct);
> -
> -	return hw_ctx_size;
> -}
> -
>   static int blk_mq_init_hctx(struct request_queue *q,
>   		struct blk_mq_tag_set *set,
>   		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
> @@ -2663,7 +2653,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
>   	struct blk_mq_hw_ctx *hctx;
>   	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
>   
> -	hctx = kzalloc_node(blk_mq_hw_ctx_size(set), gfp, node);
> +	hctx = kzalloc_node(sizeof(struct blk_mq_hw_ctx), gfp, node);
>   	if (!hctx)
>   		goto fail_alloc_hctx;
>   
> @@ -2706,8 +2696,6 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
>   	if (!hctx->fq)
>   		goto free_bitmap;
>   
> -	if (hctx->flags & BLK_MQ_F_BLOCKING)
> -		init_srcu_struct(hctx->srcu);
>   	blk_mq_hctx_kobj_init(hctx);
>   
>   	return hctx;
> @@ -3187,6 +3175,13 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>   	mutex_unlock(&q->sysfs_lock);
>   }
>   
> +static void blk_mq_dispatch_counter_release(struct percpu_ref *ref)
> +{
> +	struct request_queue *q = container_of(ref, struct request_queue,
> +				dispatch_counter);
> +	wake_up_all(&q->mq_quiesce_wq);
> +}
> +
>   struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>   						  struct request_queue *q,
>   						  bool elevator_init)
> @@ -3203,6 +3198,14 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>   	if (blk_mq_alloc_ctxs(q))
>   		goto err_poll;
>   
> +	if (set->flags & BLK_MQ_F_BLOCKING) {
> +		init_waitqueue_head(&q->mq_quiesce_wq);
> +		if (percpu_ref_init(&q->dispatch_counter,
> +					blk_mq_dispatch_counter_release,
> +					PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
> +			goto err_hctxs;
> +	}
> +
>   	/* init q->mq_kobj and sw queues' kobjects */
>   	blk_mq_sysfs_init(q);
>   
> @@ -3211,7 +3214,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>   
>   	blk_mq_realloc_hw_ctxs(set, q);
>   	if (!q->nr_hw_queues)
> -		goto err_hctxs;
> +		goto err_dispatch_counter;
>   
>   	INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
>   	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
> @@ -3245,6 +3248,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>   
>   	return q;
>   
> +err_dispatch_counter:
> +	if (set->flags & BLK_MQ_F_BLOCKING)
> +		percpu_ref_exit(&q->dispatch_counter);
>   err_hctxs:
>   	kfree(q->queue_hw_ctx);
>   	q->nr_hw_queues = 0;
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 7dda709f3ccb..56b6c045e30c 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -941,9 +941,13 @@ static void blk_release_queue(struct kobject *kobj)
>   
>   	blk_queue_free_zone_bitmaps(q);
>   
> -	if (queue_is_mq(q))
> +	if (queue_is_mq(q)) {
>   		blk_mq_release(q);
>   
> +		if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
> +			percpu_ref_exit(&q->dispatch_counter);
> +	}
> +
>   	blk_trace_shutdown(q);
>   	mutex_lock(&q->debugfs_mutex);
>   	debugfs_remove_recursive(q->debugfs_dir);
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index b23eeca4d677..df642055f02c 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -4,7 +4,6 @@
>   
>   #include <linux/blkdev.h>
>   #include <linux/sbitmap.h>
> -#include <linux/srcu.h>
>   
>   struct blk_mq_tags;
>   struct blk_flush_queue;
> @@ -173,13 +172,6 @@ struct blk_mq_hw_ctx {
>   	 * q->unused_hctx_list.
>   	 */
>   	struct list_head	hctx_list;
> -
> -	/**
> -	 * @srcu: Sleepable RCU. Use as lock when type of the hardware queue is
> -	 * blocking (BLK_MQ_F_BLOCKING). Must be the last member - see also
> -	 * blk_mq_hw_ctx_size().
> -	 */
> -	struct srcu_struct	srcu[];
>   };
>   
>   /**
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index cc6fb4d0d078..f27819574e86 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -574,6 +574,10 @@ struct request_queue {
>   
>   	struct mutex		mq_quiesce_lock;
>   
> +	/* only used for BLK_MQ_F_BLOCKING */
> +	struct percpu_ref	dispatch_counter;
> +	wait_queue_head_t	mq_quiesce_wq;
> +
>   	struct blk_mq_tag_set	*tag_set;
>   	struct list_head	tag_set_list;
>   	struct bio_set		bio_split;
> 


* Re: [PATCH V3 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  2020-09-08  9:13   ` Chao Leng
@ 2020-09-08  9:27     ` Ming Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2020-09-08  9:27 UTC (permalink / raw)
  To: Chao Leng
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Keith Busch, Sagi Grimberg, Bart Van Assche, Johannes Thumshirn

On Tue, Sep 08, 2020 at 05:13:22PM +0800, Chao Leng wrote:
> 
> 
> On 2020/9/8 16:15, Ming Lei wrote:
> > In case of BLK_MQ_F_BLOCKING, blk-mq uses SRCU to mark read critical
> > section during dispatching request, then request queue quiesce is based on
> > SRCU. What we want to get is low cost added in fast path.
> > 
> > With percpu-ref, it is cleaner and simpler & enough for implementing queue
> > quiesce. The main requirement is to make sure all read sections to observe
> > QUEUE_FLAG_QUIESCED once blk_mq_quiesce_queue() returns.
> > 
> > Also it becomes much easier to add interface of async queue quiesce.
> > 
> > Meantime memory footprint can be reduced with per-request-queue percpu-ref.
> > 
> > > From implementation viewpoint, in fast path, not see percpu_ref is
> > slower than SRCU, and srcu tree(default option in most distributions)
> > could be slower since memory barrier is required in both lock & unlock,
> > and rcu_read_lock()/rcu_read_unlock() should be much cheap than
> > smp_mb().
> > 
> > 1) percpu_ref just hold the rcu_read_lock, then run a check &
> >     increase/decrease on the percpu variable:
> > 
> >     rcu_read_lock()
> >     if (__ref_is_percpu(ref, &percpu_count))
> > 	this_cpu_inc(*percpu_count);
> >     rcu_read_unlock()
> > 
> > 2) srcu tree:
> >          idx = READ_ONCE(ssp->srcu_idx) & 0x1;
> >          this_cpu_inc(ssp->sda->srcu_lock_count[idx]);
> >          smp_mb(); /* B */  /* Avoid leaking the critical section. */
> > 
> > Also from my test on null_blk(blocking), not observe percpu-ref performs
> > worse than srcu, see the following test:
> > 
> > 1) test steps:
> > 
> > rmmod null_blk > /dev/null 2>&1
> > modprobe null_blk nr_devices=1 submit_queues=1 blocking=1
> > fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 --ioengine=libaio \
> > 	--iodepth=64 --runtime=60 --group_reporting=1  --name=nullb0 \
> > 	--filename=/dev/nullb0 --numjobs=32
> > 
> > test machine: HP DL380, 16 cpu cores, 2 threads per core, dual
> > sockets/numa, Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
> > 
> > 2) test result:
> > - srcu quiesce: 6063K IOPS
> > - percpu-ref quiesce: 6113K IOPS
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > Cc: Sagi Grimberg <sagi@grimberg.me>
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> > Cc: Chao Leng <lengchao@huawei.com>
> > ---
> >   block/blk-mq-sysfs.c   |   2 -
> >   block/blk-mq.c         | 130 +++++++++++++++++++++--------------------
> >   block/blk-sysfs.c      |   6 +-
> >   include/linux/blk-mq.h |   8 ---
> >   include/linux/blkdev.h |   4 ++
> >   5 files changed, 77 insertions(+), 73 deletions(-)
> > 
> > diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> > index 062229395a50..799db7937105 100644
> > --- a/block/blk-mq-sysfs.c
> > +++ b/block/blk-mq-sysfs.c
> > @@ -38,8 +38,6 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
> >   	cancel_delayed_work_sync(&hctx->run_work);
> > -	if (hctx->flags & BLK_MQ_F_BLOCKING)
> > -		cleanup_srcu_struct(hctx->srcu);
> >   	blk_free_flush_queue(hctx->fq);
> >   	sbitmap_free(&hctx->ctx_map);
> >   	free_cpumask_var(hctx->cpumask);
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 13cc10b89629..60630a720449 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -220,26 +220,22 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
> >    */
> >   void blk_mq_quiesce_queue(struct request_queue *q)
> >   {
> > -	struct blk_mq_hw_ctx *hctx;
> > -	unsigned int i;
> > -	bool rcu = false;
> > +	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
> >   	mutex_lock(&q->mq_quiesce_lock);
> > -	if (blk_queue_quiesced(q))
> > -		goto exit;
> > -
> > -	blk_mq_quiesce_queue_nowait(q);
> > -
> > -	queue_for_each_hw_ctx(q, hctx, i) {
> > -		if (hctx->flags & BLK_MQ_F_BLOCKING)
> > -			synchronize_srcu(hctx->srcu);
> > -		else
> > -			rcu = true;
> > +	if (!blk_queue_quiesced(q)) {
> > +		blk_mq_quiesce_queue_nowait(q);
> > +		if (blocking)
> > +			percpu_ref_kill(&q->dispatch_counter);
> >   	}
> > -	if (rcu)
> > +
> > +	if (blocking)
> > +		wait_event(q->mq_quiesce_wq,
> > +			   percpu_ref_is_zero(&q->dispatch_counter));
> > +	else
> >   		synchronize_rcu();
> > - exit:
> > +
> >   	mutex_unlock(&q->mq_quiesce_lock);
> >   }
> >   EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
> > @@ -255,7 +251,12 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
> >   {
> >   	mutex_lock(&q->mq_quiesce_lock);
> > -	blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> > +	if (blk_queue_quiesced(q)) {
> > +		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> > +
> > +		if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
> > +			percpu_ref_resurrect(&q->dispatch_counter);
> > +	}
> >   	/* dispatch requests which are inserted during quiescing */
> >   	blk_mq_run_hw_queues(q, true);
> > @@ -710,24 +711,21 @@ void blk_mq_complete_request(struct request *rq)
> >   }
> >   EXPORT_SYMBOL(blk_mq_complete_request);
> > -static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
> > -	__releases(hctx->srcu)
> > +static void hctx_unlock(struct blk_mq_hw_ctx *hctx)
> >   {
> > -	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> > -		rcu_read_unlock();
> > +	if (hctx->flags & BLK_MQ_F_BLOCKING)
> > +		percpu_ref_put(&hctx->queue->dispatch_counter);
> >   	else
> > -		srcu_read_unlock(hctx->srcu, srcu_idx);
> > +		rcu_read_unlock();
> >   }
> > -static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
> > -	__acquires(hctx->srcu)
> > +/* Returning false means that queue is being quiesced */
> > +static inline bool hctx_lock(struct blk_mq_hw_ctx *hctx)
> >   {
> > -	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
> > -		/* shut up gcc false positive */
> > -		*srcu_idx = 0;
> > -		rcu_read_lock();
> > -	} else
> > -		*srcu_idx = srcu_read_lock(hctx->srcu);
> > +	if (hctx->flags & BLK_MQ_F_BLOCKING)
> > +		return percpu_ref_tryget_live(&hctx->queue->dispatch_counter);
> > +	rcu_read_lock();
> > +	return true;
> >   }
> >   /**
> > @@ -1506,8 +1504,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
> >    */
> >   static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
> >   {
> > -	int srcu_idx;
> > -
> >   	/*
> >   	 * We should be running this queue from one of the CPUs that
> >   	 * are mapped to it.
> > @@ -1541,9 +1537,10 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
> >   	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
> > -	hctx_lock(hctx, &srcu_idx);
> > -	blk_mq_sched_dispatch_requests(hctx);
> > -	hctx_unlock(hctx, srcu_idx);
> > +	if (hctx_lock(hctx)) {
> > +		blk_mq_sched_dispatch_requests(hctx);
> > +		hctx_unlock(hctx);
> > +	}
> >   }
> >   static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
> > @@ -1655,7 +1652,6 @@ EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
> >    */
> >   void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
> >   {
> > -	int srcu_idx;
> >   	bool need_run;
> >   	/*
> > @@ -1666,10 +1662,12 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
> >   	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
> >   	 * quiesced.
> >   	 */
> > -	hctx_lock(hctx, &srcu_idx);
> > +	if (!hctx_lock(hctx))
> > +		return;
> > +
> >   	need_run = !blk_queue_quiesced(hctx->queue) &&
> >   		blk_mq_hctx_has_pending(hctx);
> > -	hctx_unlock(hctx, srcu_idx);
> > +	hctx_unlock(hctx);
> >   	if (need_run)
> >   		__blk_mq_delay_run_hw_queue(hctx, async, 0);
> > @@ -2009,7 +2007,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >   	bool run_queue = true;
> >   	/*
> > -	 * RCU or SRCU read lock is needed before checking quiesced flag.
> > +	 * hctx_lock() is needed before checking quiesced flag.
> >   	 *
> >   	 * When queue is stopped or quiesced, ignore 'bypass_insert' from
> >   	 * blk_mq_request_issue_directly(), and return BLK_STS_OK to caller,
> > @@ -2057,11 +2055,14 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >   		struct request *rq, blk_qc_t *cookie)
> >   {
> >   	blk_status_t ret;
> > -	int srcu_idx;
> >   	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
> > -	hctx_lock(hctx, &srcu_idx);
> > +	/* Insert request to queue in case of being quiesced */
> > +	if (!hctx_lock(hctx)) {
> > +		blk_mq_sched_insert_request(rq, false, false, false);
> Suggest: use blk_mq_request_bypass_insert(); the rq should be dispatched first.

No, when hctx_lock() fails here the request has never been queued to the
LLD, so we should insert it into the scheduler queue; see commit:

db03f88fae8a ("blk-mq: insert request not through ->queue_rq into sw/scheduler queue")
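
To make the fallback concrete, here is a rough sketch of how the
direct-issue path ends up looking with the new bool-returning
hctx_lock(); this is an illustration only, the actual hunk in patch 2
may differ in details such as the BLK_STS_RESOURCE handling:

static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
		struct request *rq, blk_qc_t *cookie)
{
	blk_status_t ret;

	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);

	/* queue is being quiesced: park rq in the scheduler queue */
	if (!hctx_lock(hctx)) {
		blk_mq_sched_insert_request(rq, false, false, false);
		return;
	}

	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false, true);
	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
		blk_mq_request_bypass_insert(rq, false, true);
	else if (ret != BLK_STS_OK)
		blk_mq_end_request(rq, ret);

	hctx_unlock(hctx);
}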



Thanks,
Ming


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 1/4] blk-mq: serialize queue quiesce and unquiesce by mutex
  2020-09-08  8:15 ` [PATCH V3 1/4] blk-mq: serialize queue quiesce and unquiesce by mutex Ming Lei
  2020-09-08  8:38   ` Hannes Reinecke
@ 2020-09-08 11:31   ` Johannes Thumshirn
  2020-09-08 17:54   ` Bart Van Assche
  2 siblings, 0 replies; 16+ messages in thread
From: Johannes Thumshirn @ 2020-09-08 11:31 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Keith Busch
  Cc: Sagi Grimberg, Bart Van Assche, Chao Leng

On 08/09/2020 10:16, Ming Lei wrote:
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
[...]
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Double S-o-b

Otherwise looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  2020-09-08  8:15 ` [PATCH V3 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
  2020-09-08  8:44   ` Hannes Reinecke
  2020-09-08  9:13   ` Chao Leng
@ 2020-09-08 15:31   ` Keith Busch
  2020-09-09  1:19     ` Ming Lei
  2 siblings, 1 reply; 16+ messages in thread
From: Keith Busch @ 2020-09-08 15:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng

On Tue, Sep 08, 2020 at 04:15:36PM +0800, Ming Lei wrote:
>  void blk_mq_quiesce_queue(struct request_queue *q)
>  {
> -	struct blk_mq_hw_ctx *hctx;
> -	unsigned int i;
> -	bool rcu = false;
> +	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
>  
>  	mutex_lock(&q->mq_quiesce_lock);
>  
> -	if (blk_queue_quiesced(q))
> -		goto exit;

Why remove the 'goto exit' on this condition? There shouldn't be a need
to synchronize dispatch again if a previous quiesce already did so.

> -	blk_mq_quiesce_queue_nowait(q);
> -
> -	queue_for_each_hw_ctx(q, hctx, i) {
> -		if (hctx->flags & BLK_MQ_F_BLOCKING)
> -			synchronize_srcu(hctx->srcu);
> -		else
> -			rcu = true;
> +	if (!blk_queue_quiesced(q)) {
> +		blk_mq_quiesce_queue_nowait(q);
> +		if (blocking)
> +			percpu_ref_kill(&q->dispatch_counter);
>  	}
> -	if (rcu)
> +
> +	if (blocking)
> +		wait_event(q->mq_quiesce_wq,
> +			   percpu_ref_is_zero(&q->dispatch_counter));
> +	else
>  		synchronize_rcu();
> - exit:
> +
>  	mutex_unlock(&q->mq_quiesce_lock);
>  }

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 1/4] blk-mq: serialize queue quiesce and unquiesce by mutex
  2020-09-08  8:15 ` [PATCH V3 1/4] blk-mq: serialize queue quiesce and unquiesce by mutex Ming Lei
  2020-09-08  8:38   ` Hannes Reinecke
  2020-09-08 11:31   ` Johannes Thumshirn
@ 2020-09-08 17:54   ` Bart Van Assche
  2020-09-09  1:16     ` Ming Lei
  2 siblings, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2020-09-08 17:54 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Keith Busch
  Cc: Sagi Grimberg, Johannes Thumshirn, Chao Leng

On 2020-09-08 01:15, Ming Lei wrote:
>  void blk_mq_unquiesce_queue(struct request_queue *q)
>  {
> +	mutex_lock(&q->mq_quiesce_lock);
> +
>  	blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
>  
>  	/* dispatch requests which are inserted during quiescing */
>  	blk_mq_run_hw_queues(q, true);
> +
> +	mutex_unlock(&q->mq_quiesce_lock);
>  }
Has the sunvdc driver been retested? It calls blk_mq_unquiesce_queue()
with a spinlock held. As you know, calling mutex_lock() while holding a
spinlock is not allowed.

There may be other drivers than the sunvdc driver that do this.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 1/4] blk-mq: serialize queue quiesce and unquiesce by mutex
  2020-09-08 17:54   ` Bart Van Assche
@ 2020-09-09  1:16     ` Ming Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2020-09-09  1:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Keith Busch, Johannes Thumshirn, Sagi Grimberg, Chao Leng

On Tue, Sep 08, 2020 at 10:54:14AM -0700, Bart Van Assche wrote:
> On 2020-09-08 01:15, Ming Lei wrote:
> >  void blk_mq_unquiesce_queue(struct request_queue *q)
> >  {
> > +	mutex_lock(&q->mq_quiesce_lock);
> > +
> >  	blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> >  
> >  	/* dispatch requests which are inserted during quiescing */
> >  	blk_mq_run_hw_queues(q, true);
> > +
> > +	mutex_unlock(&q->mq_quiesce_lock);
> >  }
> Has the sunvdc driver been retested? It calls blk_mq_unquiesce_queue()
> with a spinlock held. As you know calling mutex_lock() while holding a
> spinlock is not allowed.

I am wondering whether sunvdc is still actively used; a similar lock issue
has existed since 7996a8b5511a ("blk-mq: fix hang caused by
freeze/unfreeze sequence"), which was committed in May 2019.

+       spin_lock_irq(&port->vio.lock);
+       port->drain = 0;
+       blk_mq_unquiesce_queue(q);
+       blk_mq_unfreeze_queue(q);

A mutex_lock() has been taken inside blk_mq_unfreeze_queue() since commit 7996a8b5511a.

I have not actually seen such a report.

> There may be other drivers than the sunvdc driver that do this.

Most callers of blk_mq_unquiesce_queue() are easy to audit because
blk_mq_quiesce_queue() is used at the same callsite.
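
For illustration, the easy-to-audit pattern is roughly the following
(process context, no spinlocks held, names generic):

	blk_mq_quiesce_queue(q);	/* may sleep: takes q->mq_quiesce_lock and waits */
	/* update driver state while no dispatch can be in progress */
	blk_mq_unquiesce_queue(q);	/* also takes q->mq_quiesce_lock now */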

I will take a close look at this before posting the next version.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V3 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING
  2020-09-08 15:31   ` Keith Busch
@ 2020-09-09  1:19     ` Ming Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2020-09-09  1:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Bart Van Assche, Johannes Thumshirn, Chao Leng

On Tue, Sep 08, 2020 at 08:31:08AM -0700, Keith Busch wrote:
> On Tue, Sep 08, 2020 at 04:15:36PM +0800, Ming Lei wrote:
> >  void blk_mq_quiesce_queue(struct request_queue *q)
> >  {
> > -	struct blk_mq_hw_ctx *hctx;
> > -	unsigned int i;
> > -	bool rcu = false;
> > +	bool blocking = !!(q->tag_set->flags & BLK_MQ_F_BLOCKING);
> >  
> >  	mutex_lock(&q->mq_quiesce_lock);
> >  
> > -	if (blk_queue_quiesced(q))
> > -		goto exit;
> 
> Why remove the 'goto exit' on this condition? There shouldn't be a need
> to synchronize dispatch again if a previous quiesce already did so.

Hmm, this change is supposed to be done in the next patch, 'blk-mq: add tagset
quiesce interface'. That interface starts quiesce asynchronously across the
queues, so the dispatch synchronization has to be done in blk_mq_quiesce_queue().
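
Roughly, the idea is something like the sketch below; the function name
follows the series title, but the real patch 3 may differ in details
such as per-queue locking:

void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
{
	struct request_queue *q;

	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		/* start quiesce on every queue without waiting */
		blk_mq_quiesce_queue_nowait(q);
		if (set->flags & BLK_MQ_F_BLOCKING)
			percpu_ref_kill(&q->dispatch_counter);
	}

	/* one synchronization step covers all queues */
	if (set->flags & BLK_MQ_F_BLOCKING) {
		list_for_each_entry(q, &set->tag_list, tag_set_list)
			wait_event(q->mq_quiesce_wq,
				   percpu_ref_is_zero(&q->dispatch_counter));
	} else {
		synchronize_rcu();
	}
	mutex_unlock(&set->tag_list_lock);
}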

thanks,
Ming


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-09-09  1:19 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-08  8:15 [PATCH V3 0/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
2020-09-08  8:15 ` [PATCH V3 1/4] blk-mq: serialize queue quiesce and unquiesce by mutex Ming Lei
2020-09-08  8:38   ` Hannes Reinecke
2020-09-08 11:31   ` Johannes Thumshirn
2020-09-08 17:54   ` Bart Van Assche
2020-09-09  1:16     ` Ming Lei
2020-09-08  8:15 ` [PATCH V3 2/4] blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING Ming Lei
2020-09-08  8:44   ` Hannes Reinecke
2020-09-08  9:13   ` Chao Leng
2020-09-08  9:27     ` Ming Lei
2020-09-08 15:31   ` Keith Busch
2020-09-09  1:19     ` Ming Lei
2020-09-08  8:15 ` [PATCH V3 3/4] blk-mq: add tagset quiesce interface Ming Lei
2020-09-08  8:45   ` Hannes Reinecke
2020-09-08  8:15 ` [PATCH V3 4/4] nvme: use blk_mq_[un]quiesce_tagset Ming Lei
2020-09-08  8:46   ` Hannes Reinecke
