* [PATCHSET block/for-next] blk-mq: update freezing
@ 2014-06-18 15:21 Tejun Heo
  2014-06-18 15:21 ` [PATCH 1/6] blk-mq: make blk_mq_stop_hw_queue() reliably block queue processing Tejun Heo
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Tejun Heo @ 2014-06-18 15:21 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, kmo, nab

Hello, Jens.

While reading through blk-mq, I spotted several issues and this
patchset addresses them.  The biggest one is how freezing is
implemented.  Coupling it with bypassing doesn't seem to work well and
there's a subtle bug in the percpu switch implementation.

I don't think open-coding this level of percpu logic is a good idea.
It tends to be very error-prone and difficult to follow.  The barrier
problem is the only thing I spotted but it's very difficult to say
that it's correct.  percpu_ref already implements most of what's
needed for this sort of percpu switch; I added the missing bits in a
recent patchset and converted the blk-mq freezing mechanism to use it
in this patchset.

It's far simpler and easier to wrap one's head around, and, as it's
shared with other mechanisms including aio and cgroups, we should have
better testing coverage and more eyes scrutinizing it.

This patchset contains the following six patches.

 0001-blk-mq-make-blk_mq_stop_hw_queue-reliably-block-queu.patch
 0002-blk-mq-fix-a-memory-ordering-bug-in-blk_mq_queue_ent.patch
 0003-block-blk-mq-draining-can-t-be-skipped-even-if-bypas.patch
 0004-blk-mq-decouble-blk-mq-freezing-from-generic-bypassi.patch
 0005-blk-mq-collapse-__blk_mq_drain_queue-into-blk_mq_fre.patch
 0006-blk-mq-use-percpu_ref-for-mq-usage-count.patch

0001-0003 are fix patches that can be applied separately.

0004 decouples blk-mq freezing from queue bypassing.

0005-0006 replace the custom percpu switch with percpu_ref.

This patchset is on top of

 percpu/for-3.17 6fbc07bbe2b5 ("percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations")
+[1] [PATCHSET percpu/for-3.17] percpu: implement percpu_ref_reinit()

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-mq-percpu_ref

 block/blk-core.c       |   13 ++++---
 block/blk-mq.c         |   90 ++++++++++++++++---------------------------------
 block/blk-mq.h         |    2 -
 block/blk-sysfs.c      |    2 -
 include/linux/blkdev.h |    4 +-
 5 files changed, 44 insertions(+), 67 deletions(-)

Thanks.

--
tejun

[1] http://lkml.kernel.org/g/1403053685-28240-1-git-send-email-tj@kernel.org

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/6] blk-mq: make blk_mq_stop_hw_queue() reliably block queue processing
  2014-06-18 15:21 [PATCHSET block/for-next] blk-mq: update freezing Tejun Heo
@ 2014-06-18 15:21 ` Tejun Heo
  2014-06-19 10:10   ` Ming Lei
  2014-06-18 15:21 ` [PATCH 2/6] blk-mq: fix a memory ordering bug in blk_mq_queue_enter() Tejun Heo
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2014-06-18 15:21 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, kmo, nab, Tejun Heo

blk_mq_stop_hw_queue() has the following two issues.

* BLK_MQ_S_STOPPED should be set before canceling the work items;
  otherwise, a new instance may slip past the STOPPED check in the
  meantime.

* cancel_delayed_work() doesn't do anything to a work item that is
  already executing.  Use cancel_delayed_work_sync() to ensure that
  the currently in-flight one has finished.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
---
 block/blk-mq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e11f5f8..4787c3d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -862,9 +862,9 @@ EXPORT_SYMBOL(blk_mq_run_queues);
 
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
 {
-	cancel_delayed_work(&hctx->run_work);
-	cancel_delayed_work(&hctx->delay_work);
 	set_bit(BLK_MQ_S_STOPPED, &hctx->state);
+	cancel_delayed_work_sync(&hctx->run_work);
+	cancel_delayed_work_sync(&hctx->delay_work);
 }
 EXPORT_SYMBOL(blk_mq_stop_hw_queue);
 
-- 
1.9.3



* [PATCH 2/6] blk-mq: fix a memory ordering bug in blk_mq_queue_enter()
  2014-06-18 15:21 [PATCHSET block/for-next] blk-mq: update freezing Tejun Heo
  2014-06-18 15:21 ` [PATCH 1/6] blk-mq: make blk_mq_stop_hw_queue() reliably block queue processing Tejun Heo
@ 2014-06-18 15:21 ` Tejun Heo
  2014-06-18 15:21 ` [PATCH 3/6] block, blk-mq: draining can't be skipped even if bypass_depth was non-zero Tejun Heo
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2014-06-18 15:21 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, kmo, nab, Tejun Heo

blk-mq uses a percpu_counter to keep track of how many usages are in
flight.  The percpu_counter is drained while freezing to ensure that
no usage is left in-flight after freezing is complete.

blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism; unfortunately, it contains a subtle bug -
smp_wmb() in blk_mq_queue_enter() doesn't prevent the cpu from
fetching @q->bypass_depth before incrementing @q->mq_usage_counter,
and if freezing happens in between, the caller can slip through and
freezing can complete while there are still active users.

Use smp_mb() instead so that bypass_depth and mq_usage_counter
modifications and tests are properly interlocked.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4787c3d..1c09d57 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -81,7 +81,7 @@ static int blk_mq_queue_enter(struct request_queue *q)
 	int ret;
 
 	__percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
-	smp_wmb();
+	smp_mb();
 
 	/* we have problems freezing the queue if it's initializing */
 	if (!blk_queue_dying(q) &&
-- 
1.9.3



* [PATCH 3/6] block, blk-mq: draining can't be skipped even if bypass_depth was non-zero
  2014-06-18 15:21 [PATCHSET block/for-next] blk-mq: update freezing Tejun Heo
  2014-06-18 15:21 ` [PATCH 1/6] blk-mq: make blk_mq_stop_hw_queue() reliably block queue processing Tejun Heo
  2014-06-18 15:21 ` [PATCH 2/6] blk-mq: fix a memory ordering bug in blk_mq_queue_enter() Tejun Heo
@ 2014-06-18 15:21 ` Tejun Heo
  2014-06-18 15:21 ` [PATCH 4/6] blk-mq: decouble blk-mq freezing from generic bypassing Tejun Heo
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2014-06-18 15:21 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, kmo, nab, Tejun Heo

Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
skip queue draining if bypass_depth was already above zero.  The
assumption is that the one which bumped the bypass_depth should have
performed draining already; however, there's nothing which prevents a
new instance of bypassing/freezing from starting before the previous
one finishes draining.  The current code may allow the later
bypassing/freezing instances to complete while there still are
in-flight requests which haven't finished draining.

Fix it by draining regardless of bypass_depth.  We still skip draining
from blk_queue_bypass_start() while the queue is initializing to avoid
introducing excessive delays during boot.  INIT_DONE setting is moved
above the initial blk_queue_bypass_end() so that bypassing attempts
can't slip in between.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
---
 block/blk-core.c  | 11 +++++++----
 block/blk-mq.c    |  7 ++-----
 block/blk-sysfs.c |  2 +-
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f6f6b9a..2375130 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -438,14 +438,17 @@ static void __blk_drain_queue(struct request_queue *q, bool drain_all)
  */
 void blk_queue_bypass_start(struct request_queue *q)
 {
-	bool drain;
-
 	spin_lock_irq(q->queue_lock);
-	drain = !q->bypass_depth++;
+	q->bypass_depth++;
 	queue_flag_set(QUEUE_FLAG_BYPASS, q);
 	spin_unlock_irq(q->queue_lock);
 
-	if (drain) {
+	/*
+	 * Queues start drained.  Skip actual draining till init is
+	 * complete.  This avoids lengthy delays during queue init which
+	 * can happen many times during boot.
+	 */
+	if (blk_queue_init_done(q)) {
 		spin_lock_irq(q->queue_lock);
 		__blk_drain_queue(q, false);
 		spin_unlock_irq(q->queue_lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1c09d57..c0122a4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -131,15 +131,12 @@ static void __blk_mq_drain_queue(struct request_queue *q)
  */
 static void blk_mq_freeze_queue(struct request_queue *q)
 {
-	bool drain;
-
 	spin_lock_irq(q->queue_lock);
-	drain = !q->bypass_depth++;
+	q->bypass_depth++;
 	queue_flag_set(QUEUE_FLAG_BYPASS, q);
 	spin_unlock_irq(q->queue_lock);
 
-	if (drain)
-		__blk_mq_drain_queue(q);
+	__blk_mq_drain_queue(q);
 }
 
 void blk_mq_drain_queue(struct request_queue *q)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 23321fb..4db5abf 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -554,8 +554,8 @@ int blk_register_queue(struct gendisk *disk)
 	 * Initialization must be complete by now.  Finish the initial
 	 * bypass from queue allocation.
 	 */
-	blk_queue_bypass_end(q);
 	queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
+	blk_queue_bypass_end(q);
 
 	ret = blk_trace_init_sysfs(dev);
 	if (ret)
-- 
1.9.3



* [PATCH 4/6] blk-mq: decouble blk-mq freezing from generic bypassing
  2014-06-18 15:21 [PATCHSET block/for-next] blk-mq: update freezing Tejun Heo
                   ` (2 preceding siblings ...)
  2014-06-18 15:21 ` [PATCH 3/6] block, blk-mq: draining can't be skipped even if bypass_depth was non-zero Tejun Heo
@ 2014-06-18 15:21 ` Tejun Heo
  2014-06-18 15:21 ` [PATCH 5/6] blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue() Tejun Heo
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2014-06-18 15:21 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, kmo, nab, Tejun Heo

blk_mq freezing is entangled with generic bypassing which bypasses
blkcg and io scheduler and lets IO requests fall through the block
layer to the drivers in FIFO order.  This allows forward progress on
IOs with the advanced features disabled so that those features can be
configured or altered without worrying about stalling IO which may
lead to deadlock through memory allocation.

However, generic bypassing doesn't quite fit blk-mq.  blk-mq currently
doesn't make use of blkcg or ioscheds, and it maps bypassing to
freezing, which blocks request processing and drains all the in-flight
ones.  This causes problems as bypassing assumes that request
processing is online.  blk-mq works around this by conditionally
allowing request processing for the problem case - during queue
initialization.

Another oddity is that, except during queue cleanup, bypassing
started on the generic side prevents blk-mq from processing new
requests but doesn't drain the in-flight ones.  This shouldn't break
anything but again highlights that something isn't quite right here.

The root cause is conflating blk-mq freezing and generic bypassing
which are two different mechanisms.  The only intersecting purpose
that they serve is during queue cleanup.  Let's properly separate
blk-mq freezing from generic bypassing and simply use freezing where
necessary.

* request_queue->mq_freeze_depth is added and
  blk_mq_[un]freeze_queue() now operate on this counter instead of
  ->bypass_depth.  The replacement for QUEUE_FLAG_BYPASS isn't added
  but the counter is tested directly.  This will be further updated by
  later changes.

* blk_mq_drain_queue() is dropped and "__" prefix is dropped from
  blk_mq_freeze_queue().  Queue cleanup path now calls
  blk_mq_freeze_queue() directly.

* blk_queue_enter()'s fast path condition is simplified to simply
  check @q->mq_freeze_depth.  Previously, the condition was

	!blk_queue_dying(q) &&
	    (!blk_queue_bypass(q) || !blk_queue_init_done(q))

  mq_freeze_depth is incremented right after dying is set, and the
  blk_queue_init_done() exception isn't necessary as blk-mq doesn't
  start frozen.  That leaves only the blk_queue_bypass() test, which
  can be replaced by a @q->mq_freeze_depth test.

This change simplifies the code and reduces confusion in the area.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
---
 block/blk-core.c       |  2 +-
 block/blk-mq.c         | 22 ++++++----------------
 block/blk-mq.h         |  2 +-
 include/linux/blkdev.h |  1 +
 4 files changed, 9 insertions(+), 18 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2375130..0251947 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -514,7 +514,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	 * prevent that q->request_fn() gets invoked after draining finished.
 	 */
 	if (q->mq_ops) {
-		blk_mq_drain_queue(q);
+		blk_mq_freeze_queue(q);
 		spin_lock_irq(lock);
 	} else {
 		spin_lock_irq(lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c0122a4..f49b2ab 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -84,15 +84,14 @@ static int blk_mq_queue_enter(struct request_queue *q)
 	smp_mb();
 
 	/* we have problems freezing the queue if it's initializing */
-	if (!blk_queue_dying(q) &&
-	    (!blk_queue_bypass(q) || !blk_queue_init_done(q)))
+	if (!q->mq_freeze_depth)
 		return 0;
 
 	__percpu_counter_add(&q->mq_usage_counter, -1, 1000000);
 
 	spin_lock_irq(q->queue_lock);
 	ret = wait_event_interruptible_lock_irq(q->mq_freeze_wq,
-		!blk_queue_bypass(q) || blk_queue_dying(q),
+		!q->mq_freeze_depth || blk_queue_dying(q),
 		*q->queue_lock);
 	/* inc usage with lock hold to avoid freeze_queue runs here */
 	if (!ret && !blk_queue_dying(q))
@@ -129,31 +128,22 @@ static void __blk_mq_drain_queue(struct request_queue *q)
  * Guarantee no request is in use, so we can change any data structure of
  * the queue afterward.
  */
-static void blk_mq_freeze_queue(struct request_queue *q)
+void blk_mq_freeze_queue(struct request_queue *q)
 {
 	spin_lock_irq(q->queue_lock);
-	q->bypass_depth++;
-	queue_flag_set(QUEUE_FLAG_BYPASS, q);
+	q->mq_freeze_depth++;
 	spin_unlock_irq(q->queue_lock);
 
 	__blk_mq_drain_queue(q);
 }
 
-void blk_mq_drain_queue(struct request_queue *q)
-{
-	__blk_mq_drain_queue(q);
-}
-
 static void blk_mq_unfreeze_queue(struct request_queue *q)
 {
 	bool wake = false;
 
 	spin_lock_irq(q->queue_lock);
-	if (!--q->bypass_depth) {
-		queue_flag_clear(QUEUE_FLAG_BYPASS, q);
-		wake = true;
-	}
-	WARN_ON_ONCE(q->bypass_depth < 0);
+	wake = !--q->mq_freeze_depth;
+	WARN_ON_ONCE(q->mq_freeze_depth < 0);
 	spin_unlock_irq(q->queue_lock);
 	if (wake)
 		wake_up_all(&q->mq_freeze_wq);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 2646088..ca4964a 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -28,7 +28,7 @@ struct blk_mq_ctx {
 void __blk_mq_complete_request(struct request *rq);
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_init_flush(struct request_queue *q);
-void blk_mq_drain_queue(struct request_queue *q);
+void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 void blk_mq_clone_flush_request(struct request *flush_rq,
 		struct request *orig_rq);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 31e1105..2efe26a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -470,6 +470,7 @@ struct request_queue {
 	struct mutex		sysfs_lock;
 
 	int			bypass_depth;
+	int			mq_freeze_depth;
 
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
-- 
1.9.3



* [PATCH 5/6] blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()
  2014-06-18 15:21 [PATCHSET block/for-next] blk-mq: update freezing Tejun Heo
                   ` (3 preceding siblings ...)
  2014-06-18 15:21 ` [PATCH 4/6] blk-mq: decouble blk-mq freezing from generic bypassing Tejun Heo
@ 2014-06-18 15:21 ` Tejun Heo
  2014-06-18 15:21 ` [PATCH 6/6] blk-mq: use percpu_ref for mq usage count Tejun Heo
  2014-06-18 16:19 ` [PATCHSET block/for-next] blk-mq: update freezing Jens Axboe
  6 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2014-06-18 15:21 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, kmo, nab, Tejun Heo

Keeping __blk_mq_drain_queue() as a separate function doesn't buy us
anything and it's gonna be further simplified.  Let's flatten it into
its caller.

This patch doesn't make any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
---
 block/blk-mq.c | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f49b2ab..eda78c8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -108,8 +108,16 @@ static void blk_mq_queue_exit(struct request_queue *q)
 	__percpu_counter_add(&q->mq_usage_counter, -1, 1000000);
 }
 
-static void __blk_mq_drain_queue(struct request_queue *q)
+/*
+ * Guarantee no request is in use, so we can change any data structure of
+ * the queue afterward.
+ */
+void blk_mq_freeze_queue(struct request_queue *q)
 {
+	spin_lock_irq(q->queue_lock);
+	q->mq_freeze_depth++;
+	spin_unlock_irq(q->queue_lock);
+
 	while (true) {
 		s64 count;
 
@@ -124,19 +132,6 @@ static void __blk_mq_drain_queue(struct request_queue *q)
 	}
 }
 
-/*
- * Guarantee no request is in use, so we can change any data structure of
- * the queue afterward.
- */
-void blk_mq_freeze_queue(struct request_queue *q)
-{
-	spin_lock_irq(q->queue_lock);
-	q->mq_freeze_depth++;
-	spin_unlock_irq(q->queue_lock);
-
-	__blk_mq_drain_queue(q);
-}
-
 static void blk_mq_unfreeze_queue(struct request_queue *q)
 {
 	bool wake = false;
-- 
1.9.3



* [PATCH 6/6] blk-mq: use percpu_ref for mq usage count
  2014-06-18 15:21 [PATCHSET block/for-next] blk-mq: update freezing Tejun Heo
                   ` (4 preceding siblings ...)
  2014-06-18 15:21 ` [PATCH 5/6] blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue() Tejun Heo
@ 2014-06-18 15:21 ` Tejun Heo
  2014-06-18 16:19 ` [PATCHSET block/for-next] blk-mq: update freezing Jens Axboe
  6 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2014-06-18 15:21 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, kmo, nab, Tejun Heo

Currently, blk-mq uses a percpu_counter to keep track of how many
usages are in flight.  The percpu_counter is drained while freezing to
ensure that no usage is left in-flight after freezing is complete.
blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism.

This type of code has a relatively high chance of subtle bugs which
are extremely difficult to trigger, and it's way too hairy to be open
coded in blk-mq.  percpu_ref can serve the same purpose after the recent
changes.  This patch replaces the open-coded per-cpu usage counting
and draining mechanism with percpu_ref.

blk_mq_queue_enter() performs tryget_live on the ref and exit()
performs put.  blk_mq_freeze_queue() kills the ref and waits until the
reference count reaches zero.  blk_mq_unfreeze_queue() revives the ref
and wakes up the waiters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Kent Overstreet <kmo@daterainc.com>
---
 block/blk-mq.c         | 68 +++++++++++++++++++++-----------------------------
 include/linux/blkdev.h |  3 ++-
 2 files changed, 31 insertions(+), 40 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index eda78c8..7b5701a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -78,34 +78,32 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 
 static int blk_mq_queue_enter(struct request_queue *q)
 {
-	int ret;
-
-	__percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
-	smp_mb();
-
-	/* we have problems freezing the queue if it's initializing */
-	if (!q->mq_freeze_depth)
-		return 0;
-
-	__percpu_counter_add(&q->mq_usage_counter, -1, 1000000);
+	while (true) {
+		int ret;
 
-	spin_lock_irq(q->queue_lock);
-	ret = wait_event_interruptible_lock_irq(q->mq_freeze_wq,
-		!q->mq_freeze_depth || blk_queue_dying(q),
-		*q->queue_lock);
-	/* inc usage with lock hold to avoid freeze_queue runs here */
-	if (!ret && !blk_queue_dying(q))
-		__percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
-	else if (blk_queue_dying(q))
-		ret = -ENODEV;
-	spin_unlock_irq(q->queue_lock);
+		if (percpu_ref_tryget_live(&q->mq_usage_counter))
+			return 0;
 
-	return ret;
+		ret = wait_event_interruptible(q->mq_freeze_wq,
+				!q->mq_freeze_depth || blk_queue_dying(q));
+		if (blk_queue_dying(q))
+			return -ENODEV;
+		if (ret)
+			return ret;
+	}
 }
 
 static void blk_mq_queue_exit(struct request_queue *q)
 {
-	__percpu_counter_add(&q->mq_usage_counter, -1, 1000000);
+	percpu_ref_put(&q->mq_usage_counter);
+}
+
+static void blk_mq_usage_counter_release(struct percpu_ref *ref)
+{
+	struct request_queue *q =
+		container_of(ref, struct request_queue, mq_usage_counter);
+
+	wake_up_all(&q->mq_freeze_wq);
 }
 
 /*
@@ -118,18 +116,9 @@ void blk_mq_freeze_queue(struct request_queue *q)
 	q->mq_freeze_depth++;
 	spin_unlock_irq(q->queue_lock);
 
-	while (true) {
-		s64 count;
-
-		spin_lock_irq(q->queue_lock);
-		count = percpu_counter_sum(&q->mq_usage_counter);
-		spin_unlock_irq(q->queue_lock);
-
-		if (count == 0)
-			break;
-		blk_mq_run_queues(q, false);
-		msleep(10);
-	}
+	percpu_ref_kill(&q->mq_usage_counter);
+	blk_mq_run_queues(q, false);
+	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
 }
 
 static void blk_mq_unfreeze_queue(struct request_queue *q)
@@ -140,8 +129,10 @@ static void blk_mq_unfreeze_queue(struct request_queue *q)
 	wake = !--q->mq_freeze_depth;
 	WARN_ON_ONCE(q->mq_freeze_depth < 0);
 	spin_unlock_irq(q->queue_lock);
-	if (wake)
+	if (wake) {
+		percpu_ref_reinit(&q->mq_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
+	}
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
@@ -1785,7 +1776,7 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 	if (!q)
 		goto err_hctxs;
 
-	if (percpu_counter_init(&q->mq_usage_counter, 0))
+	if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release))
 		goto err_map;
 
 	setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
@@ -1878,7 +1869,7 @@ void blk_mq_free_queue(struct request_queue *q)
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
 	blk_mq_free_hw_queues(q, set);
 
-	percpu_counter_destroy(&q->mq_usage_counter);
+	percpu_ref_exit(&q->mq_usage_counter);
 
 	free_percpu(q->queue_ctx);
 	kfree(q->queue_hw_ctx);
@@ -2037,8 +2028,7 @@ static int __init blk_mq_init(void)
 {
 	blk_mq_cpu_init();
 
-	/* Must be called after percpu_counter_hotcpu_callback() */
-	hotcpu_notifier(blk_mq_queue_reinit_notify, -10);
+	hotcpu_notifier(blk_mq_queue_reinit_notify, 0);
 
 	return 0;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2efe26a..d5933dd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -21,6 +21,7 @@
 #include <linux/bsg.h>
 #include <linux/smp.h>
 #include <linux/rcupdate.h>
+#include <linux/percpu-refcount.h>
 
 #include <asm/scatterlist.h>
 
@@ -484,7 +485,7 @@ struct request_queue {
 #endif
 	struct rcu_head		rcu_head;
 	wait_queue_head_t	mq_freeze_wq;
-	struct percpu_counter	mq_usage_counter;
+	struct percpu_ref	mq_usage_counter;
 	struct list_head	all_q_node;
 
 	struct blk_mq_tag_set	*tag_set;
-- 
1.9.3



* Re: [PATCHSET block/for-next] blk-mq: update freezing
  2014-06-18 15:21 [PATCHSET block/for-next] blk-mq: update freezing Tejun Heo
                   ` (5 preceding siblings ...)
  2014-06-18 15:21 ` [PATCH 6/6] blk-mq: use percpu_ref for mq usage count Tejun Heo
@ 2014-06-18 16:19 ` Jens Axboe
  2014-06-28 12:13   ` Tejun Heo
  6 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2014-06-18 16:19 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, kmo, nab

On 2014-06-18 08:21, Tejun Heo wrote:
> Hello, Jens.
>
> While reading through blk-mq, I spotted several issues and this
> patchset addresses them.  The biggest one is how freezing is
> implemented.  Coupling it with bypassing doesn't seem to work well and
> there's a subtle bug in the perpcu switch implementation.
>
> I don't think open-coding this level of percpu logic is a good idea.
> It tends to be very error-prone and difficult to follow.  The barrier
> problem is the only thing I spotted but it's very difficult to say
> that it's correct.  percpu_ref already implements most of what's
> necessary to implement this sort of percpu switch and I added the
> missing bits in a recent patchset and converted blk-mq freezing
> mechanism to use it in this patch.
>
> It's far simpler and easier to wrap one's head around, and, as it's
> shared with other mechanisms including aio and cgroups, we should have
> better testing coverage and more eyes scrutinizing it.
>
> This patchset contains the following six patches.
>
>   0001-blk-mq-make-blk_mq_stop_hw_queue-reliably-block-queu.patch
>   0002-blk-mq-fix-a-memory-ordering-bug-in-blk_mq_queue_ent.patch
>   0003-block-blk-mq-draining-can-t-be-skipped-even-if-bypas.patch
>   0004-blk-mq-decouble-blk-mq-freezing-from-generic-bypassi.patch
>   0005-blk-mq-collapse-__blk_mq_drain_queue-into-blk_mq_fre.patch
>   0006-blk-mq-use-percpu_ref-for-mq-usage-count.patch
>
> 0001-0003 are fix patches that can be applied separately.
>
> 0004 decouples blk-mq freezing from queue bypassing.
>
> 0005-0006 replace the custom percpu switch with percpu_ref.
>
> This patchset is on top of
>
>   percpu/for-3.17 6fbc07bbe2b5 ("percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations")
> +[1] [PATCHSET percpu/for-3.17] percpu: implement percpu_ref_reinit()
>
> and available in the following git branch.
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-mq-percpu_ref
>
>   block/blk-core.c       |   13 ++++---
>   block/blk-mq.c         |   90 ++++++++++++++++---------------------------------
>   block/blk-mq.h         |    2 -
>   block/blk-sysfs.c      |    2 -
>   include/linux/blkdev.h |    4 +-
>   5 files changed, 44 insertions(+), 67 deletions(-)

Thanks Tejun, this looks pretty good. I was worried the tryget_live() 
would be too expensive, but that looks not to be the case. I like the 
cleanups to using a general mechanism. I'll run this through some 
functional and performance testing.

-- 
Jens Axboe



* Re: [PATCH 1/6] blk-mq: make blk_mq_stop_hw_queue() reliably block queue processing
  2014-06-18 15:21 ` [PATCH 1/6] blk-mq: make blk_mq_stop_hw_queue() reliably block queue processing Tejun Heo
@ 2014-06-19 10:10   ` Ming Lei
  2014-06-19 13:12     ` Tejun Heo
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2014-06-19 10:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Linux Kernel Mailing List, Kent Overstreet,
	Nicholas A. Bellinger

On Wed, Jun 18, 2014 at 11:21 PM, Tejun Heo <tj@kernel.org> wrote:
> blk_mq_stop_hw_queue() has the following two issues.
>
> * BLK_MQ_S_STOPPED should be set before canceling the work items;
>   otherwise, a new instance may proceed beyond STOPPED checking
>   inbetween.
>
> * cancel_delayed_work() doesn't do anything to work instance already
>   executing.  Use cancel_delayed_work_sync() to ensure that the
>   currently in-flight one is finished.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
> ---
>  block/blk-mq.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e11f5f8..4787c3d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -862,9 +862,9 @@ EXPORT_SYMBOL(blk_mq_run_queues);
>
>  void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
>  {
> -       cancel_delayed_work(&hctx->run_work);
> -       cancel_delayed_work(&hctx->delay_work);
>         set_bit(BLK_MQ_S_STOPPED, &hctx->state);
> +       cancel_delayed_work_sync(&hctx->run_work);
> +       cancel_delayed_work_sync(&hctx->delay_work);

The function can be called from atomic context, like virtio-blk.

Currently blk_mq_stop_hw_queue() is called when the queue is busy,
so it looks OK to keep using cancel_delayed_work().


Thanks,
-- 
Ming Lei


* Re: [PATCH 1/6] blk-mq: make blk_mq_stop_hw_queue() reliably block queue processing
  2014-06-19 10:10   ` Ming Lei
@ 2014-06-19 13:12     ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2014-06-19 13:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Linux Kernel Mailing List, Kent Overstreet,
	Nicholas A. Bellinger

Hello, Ming.

On Thu, Jun 19, 2014 at 06:10:28PM +0800, Ming Lei wrote:
> >  void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
> >  {
> > -       cancel_delayed_work(&hctx->run_work);
> > -       cancel_delayed_work(&hctx->delay_work);
> >         set_bit(BLK_MQ_S_STOPPED, &hctx->state);
> > +       cancel_delayed_work_sync(&hctx->run_work);
> > +       cancel_delayed_work_sync(&hctx->delay_work);
> 
> The function can be called from atomic context, like virtio-blk.
> 
> Currently blk_mq_stop_hw_queue() is called in case of queue
> being busy, so looks it is ok with cancel_delayed_work().

Hmm... you're right.  So the function is used to stop future
invocations of the queue processing function, and thus the order
doesn't matter.  We probably wanna document that.  I don't understand
why virtblk_freeze() is invoking it, though.  If freezing is complete,
there's no request in-flight on the driver side, so why does it need
to invoke stop too?

Jens, please drop this one.

Thanks.

-- 
tejun


* Re: [PATCHSET block/for-next] blk-mq: update freezing
  2014-06-18 16:19 ` [PATCHSET block/for-next] blk-mq: update freezing Jens Axboe
@ 2014-06-28 12:13   ` Tejun Heo
  2014-06-28 14:29     ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2014-06-28 12:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, kmo, nab

On Wed, Jun 18, 2014 at 09:19:26AM -0700, Jens Axboe wrote:
> >  percpu/for-3.17 6fbc07bbe2b5 ("percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations")
> >+[1] [PATCHSET percpu/for-3.17] percpu: implement percpu_ref_reinit()
> >
> >and available in the following git branch.
> >
> >  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-mq-percpu_ref
> >
> >  block/blk-core.c       |   13 ++++---
> >  block/blk-mq.c         |   90 ++++++++++++++++---------------------------------
> >  block/blk-mq.h         |    2 -
> >  block/blk-sysfs.c      |    2 -
> >  include/linux/blkdev.h |    4 +-
> >  5 files changed, 44 insertions(+), 67 deletions(-)
> 
> Thanks Tejun, this looks pretty good. I was worried the tryget_live() would
> be too expensive, but that looks not to be the case. I like the cleanups to
> using a general mechanism. I'll run this through some functional and
> performance testing.

FYI, the percpu_ref changes needed by this patchset are applied to
percpu/for-3.17 branch which is stable and can be pulled into the
block tree.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-3.17

Thanks.

-- 
tejun


* Re: [PATCHSET block/for-next] blk-mq: update freezing
  2014-06-28 12:13   ` Tejun Heo
@ 2014-06-28 14:29     ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2014-06-28 14:29 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, kmo, nab

On 2014-06-28 06:13, Tejun Heo wrote:
> On Wed, Jun 18, 2014 at 09:19:26AM -0700, Jens Axboe wrote:
>>>   percpu/for-3.17 6fbc07bbe2b5 ("percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations")
>>> +[1] [PATCHSET percpu/for-3.17] percpu: implement percpu_ref_reinit()
>>>
>>> and available in the following git branch.
>>>
>>>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-mq-percpu_ref
>>>
>>>   block/blk-core.c       |   13 ++++---
>>>   block/blk-mq.c         |   90 ++++++++++++++++---------------------------------
>>>   block/blk-mq.h         |    2 -
>>>   block/blk-sysfs.c      |    2 -
>>>   include/linux/blkdev.h |    4 +-
>>>   5 files changed, 44 insertions(+), 67 deletions(-)
>>
>> Thanks Tejun, this looks pretty good. I was worried the tryget_live() would
>> be too expensive, but that looks not to be the case. I like the cleanups to
>> using a general mechanism. I'll run this through some functional and
>> performance testing.
>
> FYI, the percpu_ref changes needed by this patchset are applied to
> percpu/for-3.17 branch which is stable and can be pulled into the
> block tree.
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-3.17

Great, thanks! Was just mulling over this yesterday. I'll get it pulled 
in as a base.

-- 
Jens Axboe


