All of lore.kernel.org
 help / color / mirror / Atom feed
* boot stall regression due to blk-mq: use percpu_ref for mq usage count
@ 2014-09-19 11:38 Christoph Hellwig
  2014-09-19 19:13 ` Jens Axboe
  2014-09-23  6:08 ` [PATCH block/for-3.18/core] blk-mq: start q->mq_usage_counter in atomic mode Tejun Heo
  0 siblings, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2014-09-19 11:38 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo; +Cc: linux-kernel, linux-scsi

Hi Jens, hi Tejun,

I've seen multi-second boot stalls in one of my KVM setups during
the initial scsi scan:

[    0.949892] scsi host0: Virtio SCSI HBA
[    1.007864] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
[    1.021299] scsi 0:0:1:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
[    1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz

<stall>

[   16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
[   16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
[   16.194099] osd: LOADED open-osd 0.2.1
[   16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[   16.208478] sd 0:0:0:0: [sda] Write Protect is off
[   16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[   16.223264] sd 0:0:1:0: [sdb] Write Protect is off
[   16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA



I've tracked this down to "blk-mq: use percpu_ref for mq usage count" in
a rather painful way as that one introduced enough other regressions
to mess up bisect.

If I revert the following commits:

dd840087086f3b93ac20f7472b4fca59aff7b79f
cddd5d17642cc6881352732693c2ae6930e9ce65
add703fda981b9719d37f371498b9f129acbd997

which are the above mentioned commit and two fixes to it the problem goes
away.

My qemu command line is below:

kvm \
	-m 2048 \
	-smp 1 \
	-kernel arch/x86/boot/bzImage \
	-append "root=/dev/vda console=tty0 console=ttyS0,115200n8 scsi_mod.use_blk_mq=Y" \
	-nographic \
	-drive if=virtio,file=/work/images/debian.qcow2,cache=none,serial="test1234" \
	-drive if=none,id=test,file=/work/images/test.img,cache=none,aio=native \
	-drive if=none,id=scratch,file=/work/images/scratch.img,cache=none,aio=native \
	-device virtio-scsi-pci,id=scsi \
	-device scsi-hd,drive=test \
	-device scsi-hd,drive=scratch \
	-drive file=/work/images/debian-7.3.0-amd64-netinst.iso,index=2,media=cdrom

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
  2014-09-19 11:38 boot stall regression due to blk-mq: use percpu_ref for mq usage count Christoph Hellwig
@ 2014-09-19 19:13 ` Jens Axboe
  2014-09-23  5:55   ` Christoph Hellwig
  2014-09-23  6:08 ` [PATCH block/for-3.18/core] blk-mq: start q->mq_usage_counter in atomic mode Tejun Heo
  1 sibling, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2014-09-19 19:13 UTC (permalink / raw)
  To: Christoph Hellwig, Tejun Heo; +Cc: linux-kernel, linux-scsi

On 09/19/2014 05:38 AM, Christoph Hellwig wrote:
> Hi Jens, hi Tejun,
> 
> I've seen multi-second boot stalls in one of my KVM setups during
> the initial scsi scan:
> 
> [    0.949892] scsi host0: Virtio SCSI HBA
> [    1.007864] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
> [    1.021299] scsi 0:0:1:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
> [    1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz
> 
> <stall>
> 
> [   16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
> [   16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
> [   16.194099] osd: LOADED open-osd 0.2.1
> [   16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
> [   16.208478] sd 0:0:0:0: [sda] Write Protect is off
> [   16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [   16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
> [   16.223264] sd 0:0:1:0: [sdb] Write Protect is off
> [   16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> 
> 
> 
> I've tracked this down to "blk-mq: use percpu_ref for mq usage count" in
> a rather painful way as that one introduced enough other regressions
> to mess up bisect.
> 
> If I revert the following commits:
> 
> dd840087086f3b93ac20f7472b4fca59aff7b79f
> cddd5d17642cc6881352732693c2ae6930e9ce65
> add703fda981b9719d37f371498b9f129acbd997
> 
> which are the above mentioned commit and two fixes to it the problem goes
> away.

Thanks for bisecting this, I ran into something that I think is the same
issue about a week ago. Tejun, any ideas before I dig into this one?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
  2014-09-19 19:13 ` Jens Axboe
@ 2014-09-23  5:55   ` Christoph Hellwig
  2014-09-23  5:56     ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2014-09-23  5:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, Tejun Heo, linux-kernel, linux-scsi

Jens,

can we simply get these commits reverted from now if there's no better
fix?  I'd hate to have this boot stall in the first kernel with blk-mq
support for scsi.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
  2014-09-23  5:55   ` Christoph Hellwig
@ 2014-09-23  5:56     ` Tejun Heo
  2014-09-23  5:57       ` Tejun Heo
  2014-09-23  5:59       ` Christoph Hellwig
  0 siblings, 2 replies; 23+ messages in thread
From: Tejun Heo @ 2014-09-23  5:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, linux-scsi

On Tue, Sep 23, 2014 at 07:55:54AM +0200, Christoph Hellwig wrote:
> Jens,
> 
> can we simply get these commits reverted from now if there's no better
> fix?  I'd hate to have this boot stall in the first kernel with blk-mq
> support for scsi.

Patches going out right now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
  2014-09-23  5:56     ` Tejun Heo
@ 2014-09-23  5:57       ` Tejun Heo
  2014-09-23  5:59       ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2014-09-23  5:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, linux-scsi

On Tue, Sep 23, 2014 at 01:56:48AM -0400, Tejun Heo wrote:
> On Tue, Sep 23, 2014 at 07:55:54AM +0200, Christoph Hellwig wrote:
> > Jens,
> > 
> > can we simply get these commits reverted from now if there's no better
> > fix?  I'd hate to have this boot stall in the first kernel with blk-mq
> > support for scsi.
> 
> Patches going out right now.

And the original implementation was broken, so...

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
  2014-09-23  5:56     ` Tejun Heo
  2014-09-23  5:57       ` Tejun Heo
@ 2014-09-23  5:59       ` Christoph Hellwig
  2014-09-23  6:01         ` Tejun Heo
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2014-09-23  5:59 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-scsi

On Tue, Sep 23, 2014 at 01:56:48AM -0400, Tejun Heo wrote:
> On Tue, Sep 23, 2014 at 07:55:54AM +0200, Christoph Hellwig wrote:
> > Jens,
> > 
> > can we simply get these commits reverted from now if there's no better
> > fix?  I'd hate to have this boot stall in the first kernel with blk-mq
> > support for scsi.
> 
> Patches going out right now.

"[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()"

looks way to big for 3.17, and the regression was introduced in the 3.17
merge window.  I'm not sure what was broken before, but it defintively
survived a lot of testing.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
  2014-09-23  5:59       ` Christoph Hellwig
@ 2014-09-23  6:01         ` Tejun Heo
  2014-09-23  6:03           ` Tejun Heo
  2014-09-23  6:09           ` Christoph Hellwig
  0 siblings, 2 replies; 23+ messages in thread
From: Tejun Heo @ 2014-09-23  6:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, linux-scsi

On Tue, Sep 23, 2014 at 07:59:24AM +0200, Christoph Hellwig wrote:
> "[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()"
> 
> looks way to big for 3.17, and the regression was introduced in the 3.17
> merge window.  I'm not sure what was broken before, but it defintively
> survived a lot of testing.

Do we even care about fixing it for 3.17?  scsi-mq isn't enabled by
default even for 3.18.  The open-coded percpu ref thing was subtly
broken there.  It'd be difficult to trigger but I'm fairly sure it'd
crap out in the wild once in a blue moon.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
  2014-09-23  6:01         ` Tejun Heo
@ 2014-09-23  6:03           ` Tejun Heo
  2014-09-23  6:09           ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2014-09-23  6:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, linux-scsi

On Tue, Sep 23, 2014 at 02:01:41AM -0400, Tejun Heo wrote:
> On Tue, Sep 23, 2014 at 07:59:24AM +0200, Christoph Hellwig wrote:
> > "[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()"
> > 
> > looks way to big for 3.17, and the regression was introduced in the 3.17
> > merge window.  I'm not sure what was broken before, but it defintively
> > survived a lot of testing.
> 
> Do we even care about fixing it for 3.17?  scsi-mq isn't enabled by
> default even for 3.18.  The open-coded percpu ref thing was subtly
> broken there.  It'd be difficult to trigger but I'm fairly sure it'd
> crap out in the wild once in a blue moon.

Hmmm... also, this is actually an issue with scsi that block layer is
working around.  For other drivers (virtio), the latency during
shutdown should be completely fine.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH block/for-3.18/core] blk-mq: start q->mq_usage_counter in atomic mode
  2014-09-19 11:38 boot stall regression due to blk-mq: use percpu_ref for mq usage count Christoph Hellwig
  2014-09-19 19:13 ` Jens Axboe
@ 2014-09-23  6:08 ` Tejun Heo
  2014-09-24 17:33   ` Tejun Heo
  1 sibling, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2014-09-23  6:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, linux-scsi

>From 83b06f4fc6ca2f7f3d706a168b71c248bdada668 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Tue, 23 Sep 2014 01:58:34 -0400

blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze.  percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period.  This means that draining a blk-mq
takes measureable wallclock time.  One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.

Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs.  This means that SCSI probing may take above ten seconds when
scsi-mq is used.

  [    0.949892] scsi host0: Virtio SCSI HBA
  [    1.007864] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
  [    1.021299] scsi 0:0:1:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
  [    1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz

  <stall>

  [   16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
  [   16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
  [   16.194099] osd: LOADED open-osd 0.2.1
  [   16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
  [   16.208478] sd 0:0:0:0: [sda] Write Protect is off
  [   16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
  [   16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
  [   16.223264] sd 0:0:1:0: [sdb] Write Protect is off
  [   16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

This is also the reason why request_queues start in bypass mode which
is ended on blk_register_queue() as shutting down a fully functional
queue also involves a RCU grace period and the queues for non-existent
SCSI devices never reach registration.

blk-mq basically needs to do the same thing - start the mq in a
degraded mode which is faster to shut down and then make it fully
functional only after the queue reaches registration.  percpu_ref
recently grew facilities to force atomic operation until explicitly
switched to percpu mode, which can be used for this purpose.  This
patch makes blk-mq initialize q->mq_usage_counter in atomic mode and
switch it to percpu mode only once blk_register_queue() is reached.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
Hello,

This patch is on top of

  block/for-3.18/core fe052529e465 ("scsi: move blk_mq_start_request call earlier")
+ [PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()
  http://lkml.kernel.org/g/1411451718-17807-1-git-send-email-tj@kernel.org

and available in the following git branch.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git review-blk_mq-latency-fix

Thanks.

 block/blk-mq-sysfs.c   |  6 ++++++
 block/blk-mq.c         |  6 +++++-
 block/blk-sysfs.c      | 11 +++++++++--
 include/linux/blk-mq.h |  1 +
 4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index ed52178..371d880 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -402,6 +402,12 @@ static void blk_mq_sysfs_init(struct request_queue *q)
 	}
 }
 
+/* see blk_register_queue() */
+void blk_mq_finish_init(struct request_queue *q)
+{
+	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
+}
+
 int blk_mq_register_disk(struct gendisk *disk)
 {
 	struct device *dev = disk_to_dev(disk);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 91c49b3..4ae58b1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1783,8 +1783,12 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 	if (!q)
 		goto err_hctxs;
 
+	/*
+	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
+	 * See blk_register_queue() for details.
+	 */
 	if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release,
-			    0, GFP_KERNEL))
+			    PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
 		goto err_map;
 
 	setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 17f5c84..521ae90 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -551,12 +551,19 @@ int blk_register_queue(struct gendisk *disk)
 		return -ENXIO;
 
 	/*
-	 * Initialization must be complete by now.  Finish the initial
-	 * bypass from queue allocation.
+	 * SCSI probing may synchronously create and destroy a lot of
+	 * request_queues for non-existent devices.  Shutting down a fully
+	 * functional queue takes measureable wallclock time as RCU grace
+	 * periods are involved.  To avoid excessive latency in these
+	 * cases, a request_queue starts out in a degraded mode which is
+	 * faster to shut down and is made fully functional here as
+	 * request_queues for non-existent devices never get registered.
 	 */
 	if (!blk_queue_init_done(q)) {
 		queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
 		blk_queue_bypass_end(q);
+		if (q->mq_ops)
+			blk_mq_finish_init(q);
 	}
 
 	ret = blk_trace_init_sysfs(dev);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3253495..4495dd4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -144,6 +144,7 @@ enum {
 };
 
 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
+void blk_mq_finish_init(struct request_queue *q);
 int blk_mq_register_disk(struct gendisk *);
 void blk_mq_unregister_disk(struct gendisk *);
 
-- 
1.9.3


-- 
tejun

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
  2014-09-23  6:01         ` Tejun Heo
  2014-09-23  6:03           ` Tejun Heo
@ 2014-09-23  6:09           ` Christoph Hellwig
  2014-09-23  6:11             ` Tejun Heo
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2014-09-23  6:09 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-scsi

On Tue, Sep 23, 2014 at 02:01:41AM -0400, Tejun Heo wrote:
> On Tue, Sep 23, 2014 at 07:59:24AM +0200, Christoph Hellwig wrote:
> > "[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()"
> > 
> > looks way to big for 3.17, and the regression was introduced in the 3.17
> > merge window.  I'm not sure what was broken before, but it defintively
> > survived a lot of testing.
> 
> Do we even care about fixing it for 3.17?  scsi-mq isn't enabled by
> default even for 3.18.  The open-coded percpu ref thing was subtly
> broken there.  It'd be difficult to trigger but I'm fairly sure it'd
> crap out in the wild once in a blue moon.

It's compiled in by default, and people are extremly eager to test it.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
  2014-09-23  6:09           ` Christoph Hellwig
@ 2014-09-23  6:11             ` Tejun Heo
  2014-09-23 14:57               ` Elliott, Robert (Server Storage)
  2014-09-23 15:49               ` Jens Axboe
  0 siblings, 2 replies; 23+ messages in thread
From: Tejun Heo @ 2014-09-23  6:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, linux-scsi

On Tue, Sep 23, 2014 at 08:09:06AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 23, 2014 at 02:01:41AM -0400, Tejun Heo wrote:
> > On Tue, Sep 23, 2014 at 07:59:24AM +0200, Christoph Hellwig wrote:
> > > "[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()"
> > > 
> > > looks way to big for 3.17, and the regression was introduced in the 3.17
> > > merge window.  I'm not sure what was broken before, but it defintively
> > > survived a lot of testing.
> > 
> > Do we even care about fixing it for 3.17?  scsi-mq isn't enabled by
> > default even for 3.18.  The open-coded percpu ref thing was subtly
> > broken there.  It'd be difficult to trigger but I'm fairly sure it'd
> > crap out in the wild once in a blue moon.
> 
> It's compiled in by default, and people are extremly eager to test it.

Ugh, I don't know.  It's not like we have a very good baseline we can
go back to and reverting it for -stable and then redoing it seems
kinda excessive for a yet experimental feature.  Jens?

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: boot stall regression due to blk-mq: use percpu_ref for mq usage count
  2014-09-23  6:11             ` Tejun Heo
@ 2014-09-23 14:57               ` Elliott, Robert (Server Storage)
  2014-09-23 15:49               ` Jens Axboe
  1 sibling, 0 replies; 23+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-09-23 14:57 UTC (permalink / raw)
  To: Tejun Heo, Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, linux-scsi



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Tejun Heo
> Sent: Tuesday, 23 September, 2014 1:12 AM
> To: Christoph Hellwig
> Cc: Jens Axboe; linux-kernel@vger.kernel.org; linux-scsi@vger.kernel.org
> Subject: Re: boot stall regression due to blk-mq: use percpu_ref for mq usage
> count
> 
> On Tue, Sep 23, 2014 at 08:09:06AM +0200, Christoph Hellwig wrote:
> > On Tue, Sep 23, 2014 at 02:01:41AM -0400, Tejun Heo wrote:
> > > On Tue, Sep 23, 2014 at 07:59:24AM +0200, Christoph Hellwig wrote:
> > > > "[PATCHSET percpu/for-3.18] percpu_ref: implement
> switch_to_atomic/percpu()"
> > > >
> > > > looks way to big for 3.17, and the regression was introduced in the
> 3.17
> > > > merge window.  I'm not sure what was broken before, but it defintively
> > > > survived a lot of testing.
> > >
> > > Do we even care about fixing it for 3.17?  scsi-mq isn't enabled by
> > > default even for 3.18.  The open-coded percpu ref thing was subtly
> > > broken there.  It'd be difficult to trigger but I'm fairly sure it'd
> > > crap out in the wild once in a blue moon.
> >
> > It's compiled in by default, and people are extremly eager to test it.
> 
> Ugh, I don't know.  It's not like we have a very good baseline we can
> go back to and reverting it for -stable and then redoing it seems
> kinda excessive for a yet experimental feature.  Jens?

scsi_mod.use_blk_mq is not listed in 3.17rc6 Documentation/kernel-
parameters.txt or Documentation/scsi/scsi-parameters.txt (which
does admit "This document may not be entirely up to date and
comprehensive.").

Perhaps a description with an "experimental" warning could be 
slipped into scsi-parameters.txt in 3.17, with plans to remove
that warning in 3.18.

---
Rob Elliott    HP Server Storage




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
  2014-09-23  6:11             ` Tejun Heo
  2014-09-23 14:57               ` Elliott, Robert (Server Storage)
@ 2014-09-23 15:49               ` Jens Axboe
  2014-09-23 19:24                 ` [PATCH block/for-3.17-fixes/core] blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe Tejun Heo
  1 sibling, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2014-09-23 15:49 UTC (permalink / raw)
  To: Tejun Heo, Christoph Hellwig; +Cc: linux-kernel, linux-scsi

On 09/23/2014 12:11 AM, Tejun Heo wrote:
> On Tue, Sep 23, 2014 at 08:09:06AM +0200, Christoph Hellwig wrote:
>> On Tue, Sep 23, 2014 at 02:01:41AM -0400, Tejun Heo wrote:
>>> On Tue, Sep 23, 2014 at 07:59:24AM +0200, Christoph Hellwig wrote:
>>>> "[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()"
>>>>
>>>> looks way to big for 3.17, and the regression was introduced in the 3.17
>>>> merge window.  I'm not sure what was broken before, but it defintively
>>>> survived a lot of testing.
>>>
>>> Do we even care about fixing it for 3.17?  scsi-mq isn't enabled by
>>> default even for 3.18.  The open-coded percpu ref thing was subtly
>>> broken there.  It'd be difficult to trigger but I'm fairly sure it'd
>>> crap out in the wild once in a blue moon.
>>
>> It's compiled in by default, and people are extremly eager to test it.
> 
> Ugh, I don't know.  It's not like we have a very good baseline we can
> go back to and reverting it for -stable and then redoing it seems
> kinda excessive for a yet experimental feature.  Jens?

It's not just scsi-mq, there are active users of blk-mq in the current
tree - like virtio_blk, mtip32xx. None of those are affected by the RCU
slowdown due to these changes, so it's not a big deal to them. But it is
a big deal if we can't tell people to test scsi-mq in 3.17, that was the
entire point of having it there but not default to on. So yeah, this
really should be fixed for 3.17.

I'm not aware of any reports on the existing enter count breaking things
for them. So while it may not be perfect, reverting the percpu ref count
changes for 3.17 may be the best option that we have.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH block/for-3.17-fixes/core] blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe
  2014-09-23 15:49               ` Jens Axboe
@ 2014-09-23 19:24                 ` Tejun Heo
  2014-09-23 19:29                   ` Jens Axboe
  2014-09-24  8:23                   ` Christoph Hellwig
  0 siblings, 2 replies; 23+ messages in thread
From: Tejun Heo @ 2014-09-23 19:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-kernel, linux-scsi

blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze.  percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period.  This means that draining a blk-mq
takes measureable wallclock time.  One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.

Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs.  This means that SCSI probing may take more than ten seconds
when scsi-mq is used.

This will be properly fixed by implementing a mechanism to keep
q->mq_usage_counter in atomic mode till genhd registration; however,
that involves rather big updates to percpu_ref which is difficult to
apply late in the devel cycle (v3.17-rc6 at the moment).  As a
stop-gap measure till the proper fix can be implemented in the next
cycle, this patch introduces __percpu_ref_kill_expedited() and makes
blk_mq_freeze_queue() use it.  This is heavy-handed but should work
for testing the experimental SCSI blk-mq implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
---
Hello, Jens, Christoph.

How about this one?  This is kinda ugly but should work fine in most
cases and easy to apply to v3.17 and take out during v3.18.

Thanks.

 block/blk-mq.c                  |   11 ++++++++++-
 include/linux/percpu-refcount.h |    1 +
 lib/percpu-refcount.c           |   16 ++++++++++++++++
 3 files changed, 27 insertions(+), 1 deletion(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -120,7 +120,16 @@ void blk_mq_freeze_queue(struct request_
 	spin_unlock_irq(q->queue_lock);
 
 	if (freeze) {
-		percpu_ref_kill(&q->mq_usage_counter);
+		/*
+		 * XXX: Temporary kludge to work around SCSI blk-mq stall.
+		 * SCSI synchronously creates and destroys many queues
+		 * back-to-back during probe leading to lengthy stalls.
+		 * This will be fixed by keeping ->mq_usage_counter in
+		 * atomic mode until genhd registration, but, for now,
+		 * let's work around using expedited synchronization.
+		 */
+		__percpu_ref_kill_expedited(&q->mq_usage_counter);
+
 		blk_mq_run_queues(q, false);
 	}
 	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -72,6 +72,7 @@ void percpu_ref_reinit(struct percpu_ref
 void percpu_ref_exit(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_kill);
+void __percpu_ref_kill_expedited(struct percpu_ref *ref);
 
 /**
  * percpu_ref_kill - drop the initial ref
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -189,3 +189,19 @@ void percpu_ref_kill_and_confirm(struct
 	call_rcu_sched(&ref->rcu, percpu_ref_kill_rcu);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
+
+/*
+ * XXX: Temporary kludge to work around SCSI blk-mq stall.  Used only by
+ * block/blk-mq.c::blk_mq_freeze_queue().  Will be removed during v3.18
+ * devel cycle.  Do not use anywhere else.
+ */
+void __percpu_ref_kill_expedited(struct percpu_ref *ref)
+{
+	WARN_ONCE(ref->pcpu_count_ptr & PCPU_REF_DEAD,
+		  "percpu_ref_kill() called more than once on %pf!",
+		  ref->release);
+
+	ref->pcpu_count_ptr |= PCPU_REF_DEAD;
+	synchronize_sched_expedited();
+	percpu_ref_kill_rcu(&ref->rcu);
+}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH block/for-3.17-fixes/core] blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe
  2014-09-23 19:24                 ` [PATCH block/for-3.17-fixes/core] blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe Tejun Heo
@ 2014-09-23 19:29                   ` Jens Axboe
  2014-09-23 19:48                     ` Tejun Heo
  2014-09-24  8:23                   ` Christoph Hellwig
  1 sibling, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2014-09-23 19:29 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Christoph Hellwig, linux-kernel, linux-scsi

On 09/23/2014 01:24 PM, Tejun Heo wrote:
> blk-mq uses percpu_ref for its usage counter which tracks the number
> of in-flight commands and used to synchronously drain the queue on
> freeze.  percpu_ref shutdown takes measureable wallclock time as it
> involves a sched RCU grace period.  This means that draining a blk-mq
> takes measureable wallclock time.  One would think that this shouldn't
> matter as queue shutdown should be a rare event which takes place
> asynchronously w.r.t. userland.
> 
> Unfortunately, SCSI probing involves synchronously setting up and then
> tearing down a lot of request_queues back-to-back for non-existent
> LUNs.  This means that SCSI probing may take more than ten seconds
> when scsi-mq is used.
> 
> This will be properly fixed by implementing a mechanism to keep
> q->mq_usage_counter in atomic mode till genhd registration; however,
> that involves rather big updates to percpu_ref which is difficult to
> apply late in the devel cycle (v3.17-rc6 at the moment).  As a
> stop-gap measure till the proper fix can be implemented in the next
> cycle, this patch introduces __percpu_ref_kill_expedited() and makes
> blk_mq_freeze_queue() use it.  This is heavy-handed but should work
> for testing the experimental SCSI blk-mq implementation.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Christoph Hellwig <hch@infradead.org>
> Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
> Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
> Cc: Kent Overstreet <kmo@daterainc.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> ---
> Hello, Jens, Christoph.
> 
> How about this one?  This is kinda ugly but should work fine in most
> cases and easy to apply to v3.17 and take out during v3.18.

The hack looks fine to me, if it passes Christophs boot stall test. The
hack isn't THAT nasty, since you already have a series queued up to fix
it for real for 3.18.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH block/for-3.17-fixes/core] blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe
  2014-09-23 19:29                   ` Jens Axboe
@ 2014-09-23 19:48                     ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2014-09-23 19:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-kernel, linux-scsi

Oops, I forgot to cc Kent.  Kent, the patch is at

 http://lkml.kernel.org/g/20140923192432.GA24441@mtj.dyndns.org

On Tue, Sep 23, 2014 at 01:29:03PM -0600, Jens Axboe wrote:
> The hack looks fine to me, if it passes Christophs boot stall test. The
> hack isn't THAT nasty, since you already have a series queued up to fix
> it for real for 3.18.

Yeah, then I can pull in block fixes branch into a percpu for-3.18
branch, revert the hack and then apply the patches to implement the
proper fix.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH block/for-3.17-fixes/core] blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe
  2014-09-23 19:24                 ` [PATCH block/for-3.17-fixes/core] blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe Tejun Heo
  2014-09-23 19:29                   ` Jens Axboe
@ 2014-09-24  8:23                   ` Christoph Hellwig
  2014-09-24 14:30                     ` Jens Axboe
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2014-09-24  8:23 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, Christoph Hellwig, linux-kernel, linux-scsi

Thanks, this fixes the boot stall for me.

Tested-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH block/for-3.17-fixes/core] blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe
  2014-09-24  8:23                   ` Christoph Hellwig
@ 2014-09-24 14:30                     ` Jens Axboe
  2014-09-24 14:33                       ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2014-09-24 14:30 UTC (permalink / raw)
  To: Christoph Hellwig, Tejun Heo; +Cc: linux-kernel, linux-scsi

On 09/24/2014 02:23 AM, Christoph Hellwig wrote:
> Thanks, this fixes the boot stall for me.
> 
> Tested-by: Christoph Hellwig <hch@lst.de>

Sweet, I'll shove this off today, so we are done for 3.17 release.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH block/for-3.17-fixes/core] blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe
  2014-09-24 14:30                     ` Jens Axboe
@ 2014-09-24 14:33                       ` Tejun Heo
  2014-09-24 14:33                         ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2014-09-24 14:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-kernel, linux-scsi

On Wed, Sep 24, 2014 at 08:30:33AM -0600, Jens Axboe wrote:
> On 09/24/2014 02:23 AM, Christoph Hellwig wrote:
> > Thanks, this fixes the boot stall for me.
> > 
> > Tested-by: Christoph Hellwig <hch@lst.de>
> 
> Sweet, I'll shove this off today, so we are done for 3.17 release.

Cool, once it hits the block branch, I'll pull it into percpu/for-3.18
and apply the patches for proper fix on top.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH block/for-3.17-fixes/core] blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe
  2014-09-24 14:33                       ` Tejun Heo
@ 2014-09-24 14:33                         ` Jens Axboe
  2014-09-24 17:20                           ` [PATCH percpu/for-3.18] Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe" Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2014-09-24 14:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Christoph Hellwig, linux-kernel, linux-scsi

On 09/24/2014 08:33 AM, Tejun Heo wrote:
> On Wed, Sep 24, 2014 at 08:30:33AM -0600, Jens Axboe wrote:
>> On 09/24/2014 02:23 AM, Christoph Hellwig wrote:
>>> Thanks, this fixes the boot stall for me.
>>>
>>> Tested-by: Christoph Hellwig <hch@lst.de>
>>
>> Sweet, I'll shove this off today, so we are done for 3.17 release.
> 
> Cool, once it hits the block branch, I'll pull it into percpu/for-3.18
> and apply the patches for proper fix on top.

It's in for-linus now.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH percpu/for-3.18] Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
  2014-09-24 14:33                         ` Jens Axboe
@ 2014-09-24 17:20                           ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2014-09-24 17:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-kernel, linux-scsi, Kent Overstreet

>From 9eca80461a45177e456219a9cd944c27675d6512 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 24 Sep 2014 13:07:33 -0400

This reverts commit 0a30288da1aec914e158c2d7a3482a85f632750f, which
was a temporary fix for SCSI blk-mq stall issue.  The following
patches will fix the issue properly by introducing atomic mode to
percpu_ref.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
---
Hello,

I pulled in block/for-linus into percpu/for-3.18 and reverted the temp
fix.  Will apply the patchset to implement the proper fix on top of
it.

Thanks.

 block/blk-mq.c                  | 11 +----------
 include/linux/percpu-refcount.h |  1 -
 lib/percpu-refcount.c           | 16 ----------------
 3 files changed, 1 insertion(+), 27 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 255d79c..44a78ae 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -119,16 +119,7 @@ void blk_mq_freeze_queue(struct request_queue *q)
 	spin_unlock_irq(q->queue_lock);
 
 	if (freeze) {
-		/*
-		 * XXX: Temporary kludge to work around SCSI blk-mq stall.
-		 * SCSI synchronously creates and destroys many queues
-		 * back-to-back during probe leading to lengthy stalls.
-		 * This will be fixed by keeping ->mq_usage_counter in
-		 * atomic mode until genhd registration, but, for now,
-		 * let's work around using expedited synchronization.
-		 */
-		__percpu_ref_kill_expedited(&q->mq_usage_counter);
-
+		percpu_ref_kill(&q->mq_usage_counter);
 		blk_mq_run_queues(q, false);
 	}
 	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 11b38ce..5df6784 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -72,7 +72,6 @@ void percpu_ref_reinit(struct percpu_ref *ref);
 void percpu_ref_exit(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_kill);
-void __percpu_ref_kill_expedited(struct percpu_ref *ref);
 
 /**
  * percpu_ref_kill - drop the initial ref
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index c6c31e2..559ee0b 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -189,19 +189,3 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 	call_rcu_sched(&ref->rcu, percpu_ref_kill_rcu);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
-
-/*
- * XXX: Temporary kludge to work around SCSI blk-mq stall.  Used only by
- * block/blk-mq.c::blk_mq_freeze_queue().  Will be removed during v3.18
- * devel cycle.  Do not use anywhere else.
- */
-void __percpu_ref_kill_expedited(struct percpu_ref *ref)
-{
-	WARN_ONCE(ref->pcpu_count_ptr & PCPU_REF_DEAD,
-		  "percpu_ref_kill() called more than once on %pf!",
-		  ref->release);
-
-	ref->pcpu_count_ptr |= PCPU_REF_DEAD;
-	synchronize_sched_expedited();
-	percpu_ref_kill_rcu(&ref->rcu);
-}
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH block/for-3.18/core] blk-mq: start q->mq_usage_counter in atomic mode
  2014-09-23  6:08 ` [PATCH block/for-3.18/core] blk-mq: start q->mq_usage_counter in atomic mode Tejun Heo
@ 2014-09-24 17:33   ` Tejun Heo
  2014-09-24 17:43     ` [PATCH percpu/for-3.18] blk-mq, percpu_ref: " Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2014-09-24 17:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, linux-scsi

On Tue, Sep 23, 2014 at 02:08:12AM -0400, Tejun Heo wrote:
> From 83b06f4fc6ca2f7f3d706a168b71c248bdada668 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Tue, 23 Sep 2014 01:58:34 -0400
> 
> blk-mq uses percpu_ref for its usage counter which tracks the number
> of in-flight commands and used to synchronously drain the queue on
> freeze.  percpu_ref shutdown takes measureable wallclock time as it
> involves a sched RCU grace period.  This means that draining a blk-mq
> takes measureable wallclock time.  One would think that this shouldn't
> matter as queue shutdown should be a rare event which takes place
> asynchronously w.r.t. userland.

Applied to percpu/for-3.18 on top of the klude, its revert and the
required percpu_ref changes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH percpu/for-3.18] blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
  2014-09-24 17:33   ` Tejun Heo
@ 2014-09-24 17:43     ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2014-09-24 17:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, linux-scsi

FYI, I added a paragraph explaining what happened to the temp fix at
the end.  The following is the applied version.

Thanks.
----- 8< -----
>From 17497acbdce9506fd6a75115dee4ab80c3cc5ee5 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 24 Sep 2014 13:31:50 -0400

blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze.  percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period.  This means that draining a blk-mq
takes measureable wallclock time.  One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.

Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs.  This means that SCSI probing may take above ten seconds when
scsi-mq is used.

  [    0.949892] scsi host0: Virtio SCSI HBA
  [    1.007864] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
  [    1.021299] scsi 0:0:1:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
  [    1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz

  <stall>

  [   16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
  [   16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
  [   16.194099] osd: LOADED open-osd 0.2.1
  [   16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
  [   16.208478] sd 0:0:0:0: [sda] Write Protect is off
  [   16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
  [   16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
  [   16.223264] sd 0:0:1:0: [sdb] Write Protect is off
  [   16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

This is also the reason why request_queues start in bypass mode which
is ended on blk_register_queue() as shutting down a fully functional
queue also involves a RCU grace period and the queues for non-existent
SCSI devices never reach registration.

blk-mq basically needs to do the same thing - start the mq in a
degraded mode which is faster to shut down and then make it fully
functional only after the queue reaches registration.  percpu_ref
recently grew facilities to force atomic operation until explicitly
switched to percpu mode, which can be used for this purpose.  This
patch makes blk-mq initialize q->mq_usage_counter in atomic mode and
switch it to percpu mode only once blk_register_queue() is reached.

Note that this issue was previously worked around by 0a30288da1ae
("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during
probe") for v3.17.  The temp fix was reverted in preparation of adding
persistent atomic mode to percpu_ref by 9eca80461a45 ("Revert "blk-mq,
percpu_ref: implement a kludge for SCSI blk-mq stall during probe"").
This patch and the prerequisite percpu_ref changes will be merged
during v3.18 devel cycle.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
Fixes: add703fda981 ("blk-mq: use percpu_ref for mq usage count")
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 block/blk-mq-sysfs.c   |  6 ++++++
 block/blk-mq.c         |  6 +++++-
 block/blk-sysfs.c      | 11 +++++++++--
 include/linux/blk-mq.h |  1 +
 4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index ed52178..371d880 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -402,6 +402,12 @@ static void blk_mq_sysfs_init(struct request_queue *q)
 	}
 }
 
+/* see blk_register_queue() */
+void blk_mq_finish_init(struct request_queue *q)
+{
+	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
+}
+
 int blk_mq_register_disk(struct gendisk *disk)
 {
 	struct device *dev = disk_to_dev(disk);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d85fe01..38f4a16 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1795,8 +1795,12 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 	if (!q)
 		goto err_hctxs;
 
+	/*
+	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
+	 * See blk_register_queue() for details.
+	 */
 	if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release,
-			    0, GFP_KERNEL))
+			    PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
 		goto err_map;
 
 	setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 17f5c84..521ae90 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -551,12 +551,19 @@ int blk_register_queue(struct gendisk *disk)
 		return -ENXIO;
 
 	/*
-	 * Initialization must be complete by now.  Finish the initial
-	 * bypass from queue allocation.
+	 * SCSI probing may synchronously create and destroy a lot of
+	 * request_queues for non-existent devices.  Shutting down a fully
+	 * functional queue takes measureable wallclock time as RCU grace
+	 * periods are involved.  To avoid excessive latency in these
+	 * cases, a request_queue starts out in a degraded mode which is
+	 * faster to shut down and is made fully functional here as
+	 * request_queues for non-existent devices never get registered.
 	 */
 	if (!blk_queue_init_done(q)) {
 		queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
 		blk_queue_bypass_end(q);
+		if (q->mq_ops)
+			blk_mq_finish_init(q);
 	}
 
 	ret = blk_trace_init_sysfs(dev);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a1e31f2..c13a0c0 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,7 @@ enum {
 };
 
 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
+void blk_mq_finish_init(struct request_queue *q);
 int blk_mq_register_disk(struct gendisk *);
 void blk_mq_unregister_disk(struct gendisk *);
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2014-09-24 17:43 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-19 11:38 boot stall regression due to blk-mq: use percpu_ref for mq usage count Christoph Hellwig
2014-09-19 19:13 ` Jens Axboe
2014-09-23  5:55   ` Christoph Hellwig
2014-09-23  5:56     ` Tejun Heo
2014-09-23  5:57       ` Tejun Heo
2014-09-23  5:59       ` Christoph Hellwig
2014-09-23  6:01         ` Tejun Heo
2014-09-23  6:03           ` Tejun Heo
2014-09-23  6:09           ` Christoph Hellwig
2014-09-23  6:11             ` Tejun Heo
2014-09-23 14:57               ` Elliott, Robert (Server Storage)
2014-09-23 15:49               ` Jens Axboe
2014-09-23 19:24                 ` [PATCH block/for-3.17-fixes/core] blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe Tejun Heo
2014-09-23 19:29                   ` Jens Axboe
2014-09-23 19:48                     ` Tejun Heo
2014-09-24  8:23                   ` Christoph Hellwig
2014-09-24 14:30                     ` Jens Axboe
2014-09-24 14:33                       ` Tejun Heo
2014-09-24 14:33                         ` Jens Axboe
2014-09-24 17:20                           ` [PATCH percpu/for-3.18] Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe" Tejun Heo
2014-09-23  6:08 ` [PATCH block/for-3.18/core] blk-mq: start q->mq_usage_counter in atomic mode Tejun Heo
2014-09-24 17:33   ` Tejun Heo
2014-09-24 17:43     ` [PATCH percpu/for-3.18] blk-mq, percpu_ref: " Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.