* [PATCH 0/4] dm: simplify request-based DM a bit and an RFC-like perf tweak
@ 2015-02-24 16:44 Mike Snitzer
  2015-02-24 16:44 ` [PATCH 1/4] dm: remove unnecessary wrapper around blk_lld_busy Mike Snitzer
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Mike Snitzer @ 2015-02-24 16:44 UTC (permalink / raw)
  To: dm-devel; +Cc: axboe, shivakrishna.merla, jmoyer

These changes were born out of me staring at the DM core code that
deals with request-based DM, in response to this thread:
https://www.redhat.com/archives/dm-devel/2015-February/msg00118.html

I've tested patches 1-3.  Patch 4 is more of an RFC patch that fell out
of a discussion I had with Jeff Moyer ("phro" below):

<phro> are you sure it isn't just i/o completion that's pulling more requests off the queue?
<phro> at least at the lower layer, that's what happens.  you get an interrupt for i/o completion, and in that context you submit any pending i/o
<snitm> in drivers/md/dm.c: dm_end_request -> rq_completed -> blk_run_queue_async
<snitm> so could very well be
<phro> ok, only difference is you kick it off to a work queue
<phro> for immediate dispatch
<snitm> right
<phro> so that makes sense.  as soon as an i/o is completed, a new one is issued
<snitm> could impose a delay on the running of the queue
<phro> right
<snitm> e.g. how dm.c:dm_request_fn does blk_delay_queue(q, HZ / 10);
<snitm> (which is what promotes merging in the ->lld_busy_fn returning true case)
<snitm> that _could_ be enough to slow things down slightly
<phro> it's worth a shot
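
For context, here is roughly what the two block-layer helpers discussed
above do.  This is a simplified sketch of the 4.0-era block/blk-core.c
bodies, not the verbatim upstream code:

	void blk_run_queue_async(struct request_queue *q)
	{
		/* run ->request_fn() as soon as kblockd gets the CPU */
		if (likely(!blk_queue_stopped(q) && !blk_queue_dead(q)))
			mod_delayed_work(kblockd_workqueue, &q->delay_work, 0);
	}

	void blk_delay_queue(struct request_queue *q, unsigned long msecs)
	{
		/* run ->request_fn() only once the delay (in milliseconds)
		 * expires, giving the elevator a window to merge I/O */
		if (likely(!blk_queue_dead(q)))
			queue_delayed_work(kblockd_workqueue, &q->delay_work,
					   msecs_to_jiffies(msecs));
	}

Both defer the queue run to the same kblockd workqueue; all patch 4
changes is how long queued requests sit before ->request_fn() runs.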

Mike Snitzer (4):
  dm: remove unnecessary wrapper around blk_lld_busy
  dm: remove request-based DM queue's lld_busy_fn hook
  dm: remove request-based logic from make_request_fn wrapper
  dm: delay running the queue slightly during request completion

 block/blk-core.c              |  5 ++--
 drivers/md/dm-mpath.c         |  2 +-
 drivers/md/dm-table.c         | 14 -----------
 drivers/md/dm.c               | 54 +++++++++++--------------------------------
 drivers/md/dm.h               |  1 -
 include/linux/blkdev.h        |  2 --
 include/linux/device-mapper.h |  5 ----
 7 files changed, 17 insertions(+), 66 deletions(-)

-- 
1.9.3 (Apple Git-50)


* [PATCH 1/4] dm: remove unnecessary wrapper around blk_lld_busy
  2015-02-24 16:44 [PATCH 0/4] dm: simplify request-based DM a bit and an RFC-like perf tweak Mike Snitzer
@ 2015-02-24 16:44 ` Mike Snitzer
  2015-02-24 16:44 ` [PATCH 2/4] dm: remove request-based DM queue's lld_busy_fn hook Mike Snitzer
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Mike Snitzer @ 2015-02-24 16:44 UTC (permalink / raw)
  To: dm-devel; +Cc: axboe, shivakrishna.merla, jmoyer

There is no need for DM to export a wrapper around the already exported
blk_lld_busy().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-mpath.c         | 2 +-
 drivers/md/dm.c               | 6 ------
 include/linux/device-mapper.h | 5 -----
 3 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index d376dc8..add6391 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1627,7 +1627,7 @@ static int __pgpath_busy(struct pgpath *pgpath)
 {
 	struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev);
 
-	return dm_underlying_device_busy(q);
+	return blk_lld_busy(q);
 }
 
 /*
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 73f2880..1766ba0 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2003,12 +2003,6 @@ out:
 	dm_put_live_table(md, srcu_idx);
 }
 
-int dm_underlying_device_busy(struct request_queue *q)
-{
-	return blk_lld_busy(q);
-}
-EXPORT_SYMBOL_GPL(dm_underlying_device_busy);
-
 static int dm_lld_busy(struct request_queue *q)
 {
 	int r;
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 2646aed..3e4db11 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -604,9 +604,4 @@ static inline unsigned long to_bytes(sector_t n)
 	return (n << SECTOR_SHIFT);
 }
 
-/*-----------------------------------------------------------------
- * Helper for block layer and dm core operations
- *---------------------------------------------------------------*/
-int dm_underlying_device_busy(struct request_queue *q);
-
 #endif	/* _LINUX_DEVICE_MAPPER_H */
-- 
1.9.3 (Apple Git-50)


* [PATCH 2/4] dm: remove request-based DM queue's lld_busy_fn hook
  2015-02-24 16:44 [PATCH 0/4] dm: simplify request-based DM a bit and an RFC-like perf tweak Mike Snitzer
  2015-02-24 16:44 ` [PATCH 1/4] dm: remove unnecessary wrapper around blk_lld_busy Mike Snitzer
@ 2015-02-24 16:44 ` Mike Snitzer
  2015-02-24 16:44 ` [PATCH 3/4] dm: remove request-based logic from make_request_fn wrapper Mike Snitzer
  2015-02-24 16:44 ` [RFC PATCH 4/4] dm: delay running the queue slightly during request completion Mike Snitzer
  3 siblings, 0 replies; 21+ messages in thread
From: Mike Snitzer @ 2015-02-24 16:44 UTC (permalink / raw)
  To: dm-devel; +Cc: axboe, shivakrishna.merla, jmoyer

DM multipath is the only caller of blk_lld_busy() -- which calls a
queue's lld_busy_fn hook.  Request-based DM doesn't support stacking
multipath devices so there is no reason to register the lld_busy_fn hook
on a multipath device's queue using blk_queue_lld_busy().

As such, remove functions dm_lld_busy and dm_table_any_busy_target.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-table.c | 14 --------------
 drivers/md/dm.c       | 17 -----------------
 drivers/md/dm.h       |  1 -
 3 files changed, 32 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 6554d91..0573120 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1677,20 +1677,6 @@ int dm_table_any_congested(struct dm_table *t, int bdi_bits)
 	return r;
 }
 
-int dm_table_any_busy_target(struct dm_table *t)
-{
-	unsigned i;
-	struct dm_target *ti;
-
-	for (i = 0; i < t->num_targets; i++) {
-		ti = t->targets + i;
-		if (ti->type->busy && ti->type->busy(ti))
-			return 1;
-	}
-
-	return 0;
-}
-
 struct mapped_device *dm_table_get_md(struct dm_table *t)
 {
 	return t->md;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 1766ba0..897af6c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2003,22 +2003,6 @@ out:
 	dm_put_live_table(md, srcu_idx);
 }
 
-static int dm_lld_busy(struct request_queue *q)
-{
-	int r;
-	struct mapped_device *md = q->queuedata;
-	struct dm_table *map = dm_get_live_table_fast(md);
-
-	if (!map || test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))
-		r = 1;
-	else
-		r = dm_table_any_busy_target(map);
-
-	dm_put_live_table_fast(md);
-
-	return r;
-}
-
 static int dm_any_congested(void *congested_data, int bdi_bits)
 {
 	int r = bdi_bits;
@@ -2540,7 +2524,6 @@ static int dm_init_request_based_queue(struct mapped_device *md)
 	dm_init_md_queue(md);
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
-	blk_queue_lld_busy(md->queue, dm_lld_busy);
 
 	/* Also initialize the request-based DM worker thread */
 	init_kthread_worker(&md->kworker);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 59f53e7..db49586 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -70,7 +70,6 @@ void dm_table_presuspend_undo_targets(struct dm_table *t);
 void dm_table_postsuspend_targets(struct dm_table *t);
 int dm_table_resume_targets(struct dm_table *t);
 int dm_table_any_congested(struct dm_table *t, int bdi_bits);
-int dm_table_any_busy_target(struct dm_table *t);
 unsigned dm_table_get_type(struct dm_table *t);
 struct target_type *dm_table_get_immutable_target_type(struct dm_table *t);
 bool dm_table_request_based(struct dm_table *t);
-- 
1.9.3 (Apple Git-50)


* [PATCH 3/4] dm: remove request-based logic from make_request_fn wrapper
  2015-02-24 16:44 [PATCH 0/4] dm: simplify request-based DM a bit and an RFC-like perf tweak Mike Snitzer
  2015-02-24 16:44 ` [PATCH 1/4] dm: remove unnecessary wrapper around blk_lld_busy Mike Snitzer
  2015-02-24 16:44 ` [PATCH 2/4] dm: remove request-based DM queue's lld_busy_fn hook Mike Snitzer
@ 2015-02-24 16:44 ` Mike Snitzer
  2015-02-24 16:44 ` [RFC PATCH 4/4] dm: delay running the queue slightly during request completion Mike Snitzer
  3 siblings, 0 replies; 21+ messages in thread
From: Mike Snitzer @ 2015-02-24 16:44 UTC (permalink / raw)
  To: dm-devel; +Cc: axboe, shivakrishna.merla, jmoyer

The old dm_request() method used for q->make_request_fn had a branch for
request-based DM support but it isn't needed given that
dm_init_request_based_queue() sets it to the standard blk_queue_bio()
anyway.

Clean up dm_init_md_queue() to be DM device-type agnostic and have
dm_setup_md_queue() properly finish queue setup based on DM device-type
(bio-based vs request-based).

Also, remove the export for blk_queue_bio() now that DM no longer calls
it directly.  Doing so required a forward declaration in blk-core.c.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c       |  5 +++--
 drivers/md/dm.c        | 29 ++++++++++++-----------------
 include/linux/blkdev.h |  2 --
 3 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 794c3e7..4bcd739 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -719,6 +719,8 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 }
 EXPORT_SYMBOL(blk_init_queue_node);
 
+static void blk_queue_bio(struct request_queue *q, struct bio *bio);
+
 struct request_queue *
 blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 			 spinlock_t *lock)
@@ -1563,7 +1565,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
 	blk_rq_bio_prep(req->q, req, bio);
 }
 
-void blk_queue_bio(struct request_queue *q, struct bio *bio)
+static void blk_queue_bio(struct request_queue *q, struct bio *bio)
 {
 	const bool sync = !!(bio->bi_rw & REQ_SYNC);
 	struct blk_plug *plug;
@@ -1671,7 +1673,6 @@ out_unlock:
 		spin_unlock_irq(q->queue_lock);
 	}
 }
-EXPORT_SYMBOL_GPL(blk_queue_bio);	/* for device mapper only */
 
 /*
  * If bio->bi_dev is a partition, remap the location
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 897af6c..fc92899 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1690,7 +1690,7 @@ out:
  * The request function that just remaps the bio built up by
  * dm_merge_bvec.
  */
-static void _dm_request(struct request_queue *q, struct bio *bio)
+static void dm_make_request(struct request_queue *q, struct bio *bio)
 {
 	int rw = bio_data_dir(bio);
 	struct mapped_device *md = q->queuedata;
@@ -1722,16 +1722,6 @@ int dm_request_based(struct mapped_device *md)
 	return blk_queue_stackable(md->queue);
 }
 
-static void dm_request(struct request_queue *q, struct bio *bio)
-{
-	struct mapped_device *md = q->queuedata;
-
-	if (dm_request_based(md))
-		blk_queue_bio(q, bio);
-	else
-		_dm_request(q, bio);
-}
-
 static void dm_dispatch_clone_request(struct request *clone, struct request *rq)
 {
 	int r;
@@ -2097,9 +2087,8 @@ static void dm_init_md_queue(struct mapped_device *md)
 	md->queue->queuedata = md;
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
-	blk_queue_make_request(md->queue, dm_request);
+
 	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
-	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
 }
 
 /*
@@ -2330,7 +2319,7 @@ int dm_queue_merge_is_compulsory(struct request_queue *q)
 	if (!q->merge_bvec_fn)
 		return 0;
 
-	if (q->make_request_fn == dm_request) {
+	if (q->make_request_fn == dm_make_request) {
 		dev_md = q->queuedata;
 		if (test_bit(DMF_MERGE_IS_OPTIONAL, &dev_md->flags))
 			return 0;
@@ -2540,9 +2529,15 @@ static int dm_init_request_based_queue(struct mapped_device *md)
  */
 int dm_setup_md_queue(struct mapped_device *md)
 {
-	if (dm_md_type_request_based(md) && !dm_init_request_based_queue(md)) {
-		DMWARN("Cannot initialize queue for request-based mapped device");
-		return -EINVAL;
+	if (dm_md_type_request_based(md)) {
+		if (!dm_init_request_based_queue(md)) {
+			DMWARN("Cannot initialize queue for request-based mapped device");
+			return -EINVAL;
+		}
+	} else {
+		/* bio-based specific initialization */
+		blk_queue_make_request(md->queue, dm_make_request);
+		blk_queue_merge_bvec(md->queue, dm_merge_bvec);
 	}
 
 	return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..5d93a66 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -821,8 +821,6 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
-extern void blk_queue_bio(struct request_queue *q, struct bio *bio);
-
 /*
  * A queue has just exitted congestion.  Note this in the global counter of
  * congested queues, and wake up anyone who was waiting for requests to be
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 4/4] dm: delay running the queue slightly during request completion
  2015-02-24 16:44 [PATCH 0/4] dm: simplify request-based DM a bit and an RFC-like perf tweak Mike Snitzer
                   ` (2 preceding siblings ...)
  2015-02-24 16:44 ` [PATCH 3/4] dm: remove request-based logic from make_request_fn wrapper Mike Snitzer
@ 2015-02-24 16:44 ` Mike Snitzer
  2015-02-24 16:51   ` Jens Axboe
  3 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2015-02-24 16:44 UTC (permalink / raw)
  To: dm-devel; +Cc: axboe, shivakrishna.merla, jmoyer

On really fast storage it can be beneficial to delay running the
request_queue to allow the elevator more opportunity to merge requests.

Otherwise, it has been observed that requests are being sent to
q->request_fn much quicker than is ideal on IOPS-bound backends.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index fc92899..92091e0 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1034,7 +1034,7 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
 	 * queue lock again.
 	 */
 	if (run_queue)
-		blk_run_queue_async(md->queue);
+		blk_delay_queue(md->queue, HZ / 10);
 
 	/*
 	 * dm_put() must be at the end of this function. See the comment above
-- 
1.9.3 (Apple Git-50)


* Re: [RFC PATCH 4/4] dm: delay running the queue slightly during request completion
  2015-02-24 16:44 ` [RFC PATCH 4/4] dm: delay running the queue slightly during request completion Mike Snitzer
@ 2015-02-24 16:51   ` Jens Axboe
  2015-02-24 17:22     ` [RFC PATCH 4/4 v2] " Mike Snitzer
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2015-02-24 16:51 UTC (permalink / raw)
  To: Mike Snitzer, dm-devel; +Cc: shivakrishna.merla, jmoyer

On 02/24/2015 08:44 AM, Mike Snitzer wrote:
> On really fast storage it can be beneficial to delay running the
> request_queue to allow the elevator more opportunity to merge requests.
>
> Otherwise, it has been observed that requests are being sent to
> q->request_fn much quicker than is ideal on IOPS-bound backends.
>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>   drivers/md/dm.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index fc92899..92091e0 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1034,7 +1034,7 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
>   	 * queue lock again.
>   	 */
>   	if (run_queue)
> -		blk_run_queue_async(md->queue);
> +		blk_delay_queue(md->queue, HZ / 10);

This looks dangerous... How will this impact sync IO? Heuristics like 
this will always come back and bite you in the ass.

A slightly more friendly heuristic might be to delay running the queue, 
if you still have pending IO. That would give you a more sawtooth like 
queue depth management, so it would potentially slow down a bit, but the 
upside would be more efficient merging since it would allow some 
requests to sit a little bit before being dispatched.

-- 
Jens Axboe


* [RFC PATCH 4/4 v2] dm: delay running the queue slightly during request completion
  2015-02-24 16:51   ` Jens Axboe
@ 2015-02-24 17:22     ` Mike Snitzer
  2015-02-24 17:52       ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2015-02-24 17:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jmoyer, dm-devel, shivakrishna.merla

On Tue, Feb 24 2015 at 11:51P -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 02/24/2015 08:44 AM, Mike Snitzer wrote:
> >On really fast storage it can be beneficial to delay running the
> >request_queue to allow the elevator more opportunity to merge requests.
> >
> >Otherwise, it has been observed that requests are being sent to
> >q->request_fn much quicker than is ideal on IOPS-bound backends.
> >
> >Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >---
> >  drivers/md/dm.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >index fc92899..92091e0 100644
> >--- a/drivers/md/dm.c
> >+++ b/drivers/md/dm.c
> >@@ -1034,7 +1034,7 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
> >  	 * queue lock again.
> >  	 */
> >  	if (run_queue)
> >-		blk_run_queue_async(md->queue);
> >+		blk_delay_queue(md->queue, HZ / 10);
> 
> This looks dangerous... How will this impact sync IO? Heuristics like this
> will always come back and bite you in the ass.
> 
> A slightly more friendly heuristic might be to delay running the queue, if
> you still have pending IO. That would give you a more sawtooth like queue
> depth management, so it would potentially slow down a bit, but the upside
> would be more efficient merging since it would allow some requests to sit a
> little bit before being dispatched.

OK, thanks for the suggestion, sending RFC patches FTW:

From: Mike Snitzer <snitzer@redhat.com>
Subject: [PATCH] dm: delay running the queue slightly during request
 completion

On really fast storage it can be beneficial to delay running the
request_queue to allow the elevator more opportunity to merge requests.

Otherwise, it has been observed that requests are being sent to
q->request_fn much quicker than is ideal on IOPS-bound backends.

To avoid impacting sync IO, the delay when running the queue is only
used if there is pending IO.  As Jens put it when suggesting this
heuristic:
 "That would give you a more sawtooth like queue depth management, so it
 would potentially slow down a bit, but the upside would be more
 efficient merging since it would allow some requests to sit a little
 bit before being dispatched."

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index fc92899..85b8919 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1033,8 +1033,12 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
 	 * back into ->request_fn() could deadlock attempting to grab the
 	 * queue lock again.
 	 */
-	if (run_queue)
-		blk_run_queue_async(md->queue);
+	if (run_queue) {
+		if (md->queue->nr_pending)
+			blk_delay_queue(md->queue, HZ / 10);
+		else
+			blk_run_queue_async(md->queue);
+	}
 
 	/*
 	 * dm_put() must be at the end of this function. See the comment above
-- 
1.9.3 (Apple Git-50)


* Re: [RFC PATCH 4/4 v2] dm: delay running the queue slightly during request completion
  2015-02-24 17:22     ` [RFC PATCH 4/4 v2] " Mike Snitzer
@ 2015-02-24 17:52       ` Jens Axboe
  2015-02-24 18:12         ` Mike Snitzer
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2015-02-24 17:52 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: jmoyer, dm-devel, shivakrishna.merla

On 02/24/2015 09:22 AM, Mike Snitzer wrote:
> On Tue, Feb 24 2015 at 11:51P -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
>
>> On 02/24/2015 08:44 AM, Mike Snitzer wrote:
>>> On really fast storage it can be beneficial to delay running the
>>> request_queue to allow the elevator more opportunity to merge requests.
>>>
>>> Otherwise, it has been observed that requests are being sent to
>>> q->request_fn much quicker than is ideal on IOPS-bound backends.
>>>
>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>> ---
>>>   drivers/md/dm.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>> index fc92899..92091e0 100644
>>> --- a/drivers/md/dm.c
>>> +++ b/drivers/md/dm.c
>>> @@ -1034,7 +1034,7 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
>>>   	 * queue lock again.
>>>   	 */
>>>   	if (run_queue)
>>> -		blk_run_queue_async(md->queue);
>>> +		blk_delay_queue(md->queue, HZ / 10);
>>
>> This looks dangerous... How will this impact sync IO? Heuristics like this
>> will always come back and bite you in the ass.
>>
>> A slightly more friendly heuristic might be to delay running the queue, if
>> you still have pending IO. That would give you a more sawtooth like queue
>> depth management, so it would potentially slow down a bit, but the upside
>> would be more efficient merging since it would allow some requests to sit a
>> little bit before being dispatched.
>
> OK, thanks for the suggestion, sending RFC patches FTW:
>
> From: Mike Snitzer <snitzer@redhat.com>
> Subject: [PATCH] dm: delay running the queue slightly during request
>   completion
>
> On really fast storage it can be beneficial to delay running the
> request_queue to allow the elevator more opportunity to merge requests.
>
> Otherwise, it has been observed that requests are being sent to
> q->request_fn much quicker than is ideal on IOPS-bound backends.
>
> To avoid impacting sync IO, the delay when running the queue is only
> used if there is pending IO.  As Jens put it when suggesting this
> heuristic:
>   "That would give you a more sawtooth like queue depth management, so it
>   would potentially slow down a bit, but the upside would be more
>   efficient merging since it would allow some requests to sit a little
>   bit before being dispatched."
>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>   drivers/md/dm.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index fc92899..85b8919 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1033,8 +1033,12 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
>   	 * back into ->request_fn() could deadlock attempting to grab the
>   	 * queue lock again.
>   	 */
> -	if (run_queue)
> -		blk_run_queue_async(md->queue);
> +	if (run_queue) {
> +		if (md->queue->nr_pending)
> +			blk_delay_queue(md->queue, HZ / 10);
> +		else
> +			blk_run_queue_async(md->queue);
> +	}

So all of this needs to be tested and performance vetted. But my 
original suggestion was something like:

if (run_queue && !md->queue->nr_pending)
	blk_run_queue_async(md->queue);

which might be a bit extreme, but if we hit 0, that's the only case 
where you truly do need to run the queue. So that kind of logic would 
give you the highest chance of merge success, potentially at the cost of 
reduced performance for other cases.

That aside, where did you pull this ->nr_pending from? I think you need 
to look at that again...

-- 
Jens Axboe


* Re: [RFC PATCH 4/4 v2] dm: delay running the queue slightly during request completion
  2015-02-24 17:52       ` Jens Axboe
@ 2015-02-24 18:12         ` Mike Snitzer
  2015-02-24 18:16           ` Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2015-02-24 18:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jmoyer, dm-devel, shivakrishna.merla

On Tue, Feb 24 2015 at 12:52pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 02/24/2015 09:22 AM, Mike Snitzer wrote:
> >On Tue, Feb 24 2015 at 11:51P -0500,
> >Jens Axboe <axboe@kernel.dk> wrote:
> >
> >>On 02/24/2015 08:44 AM, Mike Snitzer wrote:
> >>>On really fast storage it can be beneficial to delay running the
> >>>request_queue to allow the elevator more opportunity to merge requests.
> >>>
> >>>Otherwise, it has been observed that requests are being sent to
> >>>q->request_fn much quicker than is ideal on IOPS-bound backends.
> >>>
> >>>Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >>>---
> >>>  drivers/md/dm.c | 2 +-
> >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>>diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>>index fc92899..92091e0 100644
> >>>--- a/drivers/md/dm.c
> >>>+++ b/drivers/md/dm.c
> >>>@@ -1034,7 +1034,7 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
> >>>  	 * queue lock again.
> >>>  	 */
> >>>  	if (run_queue)
> >>>-		blk_run_queue_async(md->queue);
> >>>+		blk_delay_queue(md->queue, HZ / 10);
> >>
> >>This looks dangerous... How will this impact sync IO? Heuristics like this
> >>will always come back and bite you in the ass.
> >>
> >>A slightly more friendly heuristic might be to delay running the queue, if
> >>you still have pending IO. That would give you a more sawtooth like queue
> >>depth management, so it would potentially slow down a bit, but the upside
> >>would be more efficient merging since it would allow some requests to sit a
> >>little bit before being dispatched.
> >
> >OK, thanks for the suggestion, sending RFC patches FTW:
> >
> >From: Mike Snitzer <snitzer@redhat.com>
> >Subject: [PATCH] dm: delay running the queue slightly during request
> >  completion
> >
> >On really fast storage it can be beneficial to delay running the
> >request_queue to allow the elevator more opportunity to merge requests.
> >
> >Otherwise, it has been observed that requests are being sent to
> >q->request_fn much quicker than is ideal on IOPS-bound backends.
> >
> >To avoid impacting sync IO, the delay when running the queue is only
> >used if there is pending IO.  As Jens put it when suggesting this
> >heuristic:
> >  "That would give you a more sawtooth like queue depth management, so it
> >  would potentially slow down a bit, but the upside would be more
> >  efficient merging since it would allow some requests to sit a little
> >  bit before being dispatched."
> >
> >Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >---
> >  drivers/md/dm.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> >diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >index fc92899..85b8919 100644
> >--- a/drivers/md/dm.c
> >+++ b/drivers/md/dm.c
> >@@ -1033,8 +1033,12 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
> >  	 * back into ->request_fn() could deadlock attempting to grab the
> >  	 * queue lock again.
> >  	 */
> >-	if (run_queue)
> >-		blk_run_queue_async(md->queue);
> >+	if (run_queue) {
> >+		if (md->queue->nr_pending)
> >+			blk_delay_queue(md->queue, HZ / 10);
> >+		else
> >+			blk_run_queue_async(md->queue);
> >+	}
> 
> So all of this needs to be tested and performance vetted. But my
> original suggestion was something like:
> 
> if (run_queue && !md->queue->nr_pending)
> 	blk_run_queue_async(md->queue);
> 
> which might be a bit extreme, but if we hit 0, that's the only case
> where you truly do need to run the queue. So that kind of logic
> would give you the highest chance of merge success, potentially at
> the cost of reduced performance for other cases.

Yeah, I was wondering about not running the queue at all when discussing
with Jeff earlier today.  Seemed extreme, and Jeff thought it could
cause performance to really take a hit.
 
> That aside, where did you pull this ->nr_pending from? I think you
> need to look at that again...

Um, as in q->nr_pending doesn't reflect the number of pending requests?
I looked at blk-core.c, saw nr_pending and ran with it...

But looking closer it is only used if CONFIG_PM.  So what is the right
way to check for pending requests on the queue?


* Re: [RFC PATCH 4/4 v2] dm: delay running the queue slightly during request completion
  2015-02-24 18:12         ` Mike Snitzer
@ 2015-02-24 18:16           ` Jens Axboe
  2015-02-24 18:32             ` Mike Snitzer
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2015-02-24 18:16 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: jmoyer, dm-devel, shivakrishna.merla

On 02/24/2015 10:12 AM, Mike Snitzer wrote:
> On Tue, Feb 24 2015 at 12:52pm -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
>
>> On 02/24/2015 09:22 AM, Mike Snitzer wrote:
>>> On Tue, Feb 24 2015 at 11:51P -0500,
>>> Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>>> On 02/24/2015 08:44 AM, Mike Snitzer wrote:
>>>>> On really fast storage it can be beneficial to delay running the
>>>>> request_queue to allow the elevator more opportunity to merge requests.
>>>>>
>>>>> Otherwise, it has been observed that requests are being sent to
>>>>> q->request_fn much quicker than is ideal on IOPS-bound backends.
>>>>>
>>>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>>>> ---
>>>>>   drivers/md/dm.c | 2 +-
>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>>>> index fc92899..92091e0 100644
>>>>> --- a/drivers/md/dm.c
>>>>> +++ b/drivers/md/dm.c
>>>>> @@ -1034,7 +1034,7 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
>>>>>   	 * queue lock again.
>>>>>   	 */
>>>>>   	if (run_queue)
>>>>> -		blk_run_queue_async(md->queue);
>>>>> +		blk_delay_queue(md->queue, HZ / 10);
>>>>
>>>> This looks dangerous... How will this impact sync IO? Heuristics like this
>>>> will always come back and bite you in the ass.
>>>>
>>>> A slightly more friendly heuristic might be to delay running the queue, if
>>>> you still have pending IO. That would give you a more sawtooth like queue
>>>> depth management, so it would potentially slow down a bit, but the upside
>>>> would be more efficient merging since it would allow some requests to sit a
>>>> little bit before being dispatched.
>>>
>>> OK, thanks for the suggestion, sending RFC patches FTW:
>>>
>>> From: Mike Snitzer <snitzer@redhat.com>
>>> Subject: [PATCH] dm: delay running the queue slightly during request
>>>   completion
>>>
>>> On really fast storage it can be beneficial to delay running the
>>> request_queue to allow the elevator more opportunity to merge requests.
>>>
>>> Otherwise, it has been observed that requests are being sent to
>>> q->request_fn much quicker than is ideal on IOPS-bound backends.
>>>
>>> To avoid impacting sync IO, the delay when running the queue is only
>>> used if there is pending IO.  As Jens put it when suggesting this
>>> heuristic:
>>>   "That would give you a more sawtooth like queue depth management, so it
>>>   would potentially slow down a bit, but the upside would be more
>>>   efficient merging since it would allow some requests to sit a little
>>>   bit before being dispatched."
>>>
>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>> ---
>>>   drivers/md/dm.c | 8 ++++++--
>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>> index fc92899..85b8919 100644
>>> --- a/drivers/md/dm.c
>>> +++ b/drivers/md/dm.c
>>> @@ -1033,8 +1033,12 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
>>>   	 * back into ->request_fn() could deadlock attempting to grab the
>>>   	 * queue lock again.
>>>   	 */
>>> -	if (run_queue)
>>> -		blk_run_queue_async(md->queue);
>>> +	if (run_queue) {
>>> +		if (md->queue->nr_pending)
>>> +			blk_delay_queue(md->queue, HZ / 10);
>>> +		else
>>> +			blk_run_queue_async(md->queue);
>>> +	}
>>
>> So all of this needs to be tested and performance vetted. But my
>> original suggestion was something like:
>>
>> if (run_queue && !md->queue->nr_pending)
>> 	blk_run_queue_async(md->queue);
>>
>> which might be a bit extreme, but if we hit 0, that's the only case
>> where you truly do need to run the queue. So that kind of logic
>> would give you the highest chance of merge success, potentially at
>> the cost of reduced performance for other cases.
>
> Yeah, I was wondering about not running the queue at all when discussing
> with Jeff earlier today.  Seemed extreme, and Jeff thought it could
> cause performance to really take a hit.

Whether you have to or not depends on how you break out of queueing 
loops. But you definitely don't have to run it on every single request 
completion...

>> That aside, where did you pull this ->nr_pending from? I think you
>> need to look at that again...
>
> Um, as in q->nr_pending doesn't reflect the number of pending requests?
> I looked at blk-core.c, saw nr_pending and ran with it...
>
> But looking closer it is only used if CONFIG_PM.  So what is the right
> way to check for pending requests on the queue?

You probably want to track it yourself. You could look at the request 
counters, but that would introduce a dependency on the old request path.

-- 
Jens Axboe


* Re: [RFC PATCH 4/4 v2] dm: delay running the queue slightly during request completion
  2015-02-24 18:16           ` Jens Axboe
@ 2015-02-24 18:32             ` Mike Snitzer
  2015-02-25  0:56               ` awful request merge results while simulating high IOPS multipath Mike Snitzer
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2015-02-24 18:32 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jmoyer, dm-devel, shivakrishna.merla

On Tue, Feb 24 2015 at  1:16pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 02/24/2015 10:12 AM, Mike Snitzer wrote:
> >On Tue, Feb 24 2015 at 12:52pm -0500,
> >Jens Axboe <axboe@kernel.dk> wrote:
> >
> >>On 02/24/2015 09:22 AM, Mike Snitzer wrote:
> >>>On Tue, Feb 24 2015 at 11:51P -0500,
> >>>Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>>>On 02/24/2015 08:44 AM, Mike Snitzer wrote:
> >>>>>On really fast storage it can be beneficial to delay running the
> >>>>>request_queue to allow the elevator more opportunity to merge requests.
> >>>>>
> >>>>>Otherwise, it has been observed that requests are being sent to
> >>>>>q->request_fn much quicker than is ideal on IOPS-bound backends.
> >>>>>
> >>>>>Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >>>>>---
> >>>>>  drivers/md/dm.c | 2 +-
> >>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>
> >>>>>diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>>>>index fc92899..92091e0 100644
> >>>>>--- a/drivers/md/dm.c
> >>>>>+++ b/drivers/md/dm.c
> >>>>>@@ -1034,7 +1034,7 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
> >>>>>  	 * queue lock again.
> >>>>>  	 */
> >>>>>  	if (run_queue)
> >>>>>-		blk_run_queue_async(md->queue);
> >>>>>+		blk_delay_queue(md->queue, HZ / 10);
> >>>>
> >>>>This looks dangerous... How will this impact sync IO? Heuristics like this
> >>>>will always come back and bite you in the ass.
> >>>>
> >>>>A slightly more friendly heuristic might be to delay running the queue, if
> >>>>you still have pending IO. That would give you a more sawtooth like queue
> >>>>depth management, so it would potentially slow down a bit, but the upside
> >>>>would be more efficient merging since it would allow some requests to sit a
> >>>>little bit before being dispatched.
> >>>
> >>>OK, thanks for the suggestion, sending RFC patches FTW:
> >>>
> >>>From: Mike Snitzer <snitzer@redhat.com>
> >>>Subject: [PATCH] dm: delay running the queue slightly during request
> >>>  completion
> >>>
> >>>On really fast storage it can be beneficial to delay running the
> >>>request_queue to allow the elevator more opportunity to merge requests.
> >>>
> >>>Otherwise, it has been observed that requests are being sent to
> >>>q->request_fn much quicker than is ideal on IOPS-bound backends.
> >>>
> >>>To avoid impacting sync IO, the delay when running the queue is only
> >>>used if there is pending IO.  As Jens put it when suggesting this
> >>>heuristic:
> >>>  "That would give you a more sawtooth like queue depth management, so it
> >>>  would potentially slow down a bit, but the upside would be more
> >>>  efficient merging since it would allow some requests to sit a little
> >>>  bit before being dispatched."
> >>>
> >>>Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >>>---
> >>>  drivers/md/dm.c | 8 ++++++--
> >>>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>>
> >>>diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>>index fc92899..85b8919 100644
> >>>--- a/drivers/md/dm.c
> >>>+++ b/drivers/md/dm.c
> >>>@@ -1033,8 +1033,12 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
> >>>  	 * back into ->request_fn() could deadlock attempting to grab the
> >>>  	 * queue lock again.
> >>>  	 */
> >>>-	if (run_queue)
> >>>-		blk_run_queue_async(md->queue);
> >>>+	if (run_queue) {
> >>>+		if (md->queue->nr_pending)
> >>>+			blk_delay_queue(md->queue, HZ / 10);
> >>>+		else
> >>>+			blk_run_queue_async(md->queue);
> >>>+	}
> >>
> >>So all of this needs to be tested and performance vetted. But my
> >>original suggestion was something like:
> >>
> >>if (run_queue && !md->queue->nr_pending)
> >>	blk_run_queue_async(md->queue);
> >>
> >>which might be a bit extreme, but if we hit 0, that's the only case
> >>where you truly do need to run the queue. So that kind of logic
> >>would give you the highest chance of merge success, potentially at
> >>the cost of reduced performance for other cases.
> >
> >Yeah, I was wondering about not running the queue at all when discussing
> >with Jeff earlier today.  Seemed extreme, and Jeff thought it could
> >cause performance to really take a hit.
> 
> Whether you have to or not depends on how you break out of queueing
> loops. But you definitely don't have to run it on every single
> request completion...

OK, thanks for clarifying.

Will see how the initial RFC patch I shared works for Netapp's testcase
but based on those results will work to arrive at a more
generic/intelligent solution.

> >>That aside, where did you pull this ->nr_pending from? I think you
> >>need to look at that again...
> >
> >Um, as in q->nr_pending doesn't reflect the number of pending requests?
> >I looked at blk-core.c, saw nr_pending and ran with it...
> >
> >But looking closer it is only used if CONFIG_PM.  So what is the right
> >way to check for pending requests on the queue?
> 
> You probably want to track it yourself. You could look at the
> request counters, but that would introduce a dependency on the old
> request path.

True.
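
For what it's worth, dm.c already tracks its own in-flight counts for
I/O accounting: rq_completed() decrements md->pending[rw], and
md_in_flight() sums both directions.  A self-tracked variant of the
heuristic could therefore look something like this untested sketch
(not a proposed patch):

	if (run_queue) {
		/* if requests are still in flight, delay the run so
		 * queued I/O gets a chance to merge in the elevator;
		 * only kick the queue immediately once we go idle */
		if (md_in_flight(md))
			blk_delay_queue(md->queue, HZ / 10);
		else
			blk_run_queue_async(md->queue);
	}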


* awful request merge results while simulating high IOPS multipath
  2015-02-24 18:32             ` Mike Snitzer
@ 2015-02-25  0:56               ` Mike Snitzer
  2015-02-25  4:14                 ` Keith Busch
  2015-02-25  4:38                 ` FIXED! [was: awful request merge results while simulating high IOPS] multipath Mike Snitzer
  0 siblings, 2 replies; 21+ messages in thread
From: Mike Snitzer @ 2015-02-25  0:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, jmoyer, shivakrishna.merla, dm-devel, Bart Van Assche

On Tue, Feb 24 2015 at  1:32pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Feb 24 2015 at  1:16pm -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
> > >>
> > >>So all of this needs to be tested and performance vetted. But my
> > >>original suggestion was something like:
> > >>
> > >>if (run_queue && !md->queue->nr_pending)
> > >>	blk_run_queue_async(md->queue);
> > >>
> > >>which might be a bit extreme, but if we hit 0, that's the only case
> > >>where you truly do need to run the queue. So that kind of logic
> > >>would give you the highest chance of merge success, potentially at
> > >>the cost of reduced performance for other cases.
> > >
> > >Yeah, I was wondering about not running the queue at all when discussing
> > >with Jeff earlier today.  Seemed extreme, and Jeff thought it could
> > >cause performance to really take a hit.
> > 
> > Whether you have to or not depends on how you break out of queueing
> > loops. But you definitely don't have to run it on every single
> > request completion...
> 
> OK, thanks for clarifying.
> 
> Will see how the initial RFC patch I shared works for Netapp's testcase
> but based on those results will work to arrive at a more
> generic/intelligent solution.

That initial RFC didn't do well.  And given my results below, I'm not
holding out any hope that this io completion change, which we already
discussed as being more sane, will have any impact:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=9068f2f15d5fc7a5c9dc6e6963913216ede0c745
 
Here is the sorry state of affairs:

Summary:
========

Request-based DM is somehow breaking the block layer's ability to properly merge requests.
Only good news is I seem to have a testbed that I can use to chase this issue down.

Test kernel is v4.0-rc1 with a few DM patches, see:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next


baseline:
=========

Writing to an STEC PCIe SSD using 64 threads, each issuing 4K sequential IOs.
There are clearly merges happening.

fio job:
--------
[64_seq_write]
filename=/dev/skd0
rw=write
rwmixread=0
blocksize=4k
iodepth=16
direct=1
numjobs=64
#nrfiles=1
runtime=100
ioengine=libaio
time_based

deadline:
---------
Run status group 0 (all jobs):
  WRITE: io=33356MB, aggrb=341513KB/s, minb=4082KB/s, maxb=5463KB/s, mint=100001msec, maxt=100014msec

Disk stats (read/write):
  skd0: ios=0/638099, merge=0/7483318, ticks=0/8015451, in_queue=8073672, util=100.00%

cfq:
----
rotational=0
------------
Run status group 0 (all jobs):
  WRITE: io=20970MB, aggrb=214663KB/s, minb=3202KB/s, maxb=3620KB/s, mint=100001msec, maxt=100030msec

Disk stats (read/write):
  skd0: ios=47/390277, merge=0/4939075, ticks=0/7082618, in_queue=7109995, util=100.00%

rotational=1
------------
Run status group 0 (all jobs):
  WRITE: io=30075MB, aggrb=307130KB/s, minb=2613KB/s, maxb=10247KB/s, mint=100001msec, maxt=100271msec

Disk stats (read/write):
  skd0: ios=57/2217835, merge=0/5414906, ticks=0/6110715, in_queue=6120362, util=99.97%


multipath:
==========

Same workload as above, but this time through a request-based DM multipath target.
There are clearly _no_ merges happening.

echo "0 1563037696 multipath 0 0 1 1 service-time 0 1 2 /dev/skd0 1000 1" | dmsetup create skd_mpath

[64_seq_write]
#filename=/dev/mapper/skd_mpath
filename=/dev/dm-9
rw=write
rwmixread=0
blocksize=4k
iodepth=16
direct=1
numjobs=64
#nrfiles=1
runtime=100
ioengine=libaio
time_based

deadline:
---------
Run status group 0 (all jobs):
  WRITE: io=22559MB, aggrb=230976KB/s, minb=3565KB/s, maxb=3659KB/s, mint=100001msec, maxt=100011msec

Disk stats (read/write):
    dm-9: ios=71/5772755, merge=0/0, ticks=3/15309410, in_queue=15360084, util=100.00%, aggrios=164/5775057, aggrmerge=0/0, aggrticks=16/14960469, aggrin_queue=14956943, aggrutil=99.92%
  skd0: ios=164/5775057, merge=0/0, ticks=16/14960469, in_queue=14956943, util=99.92%

cfq:
----
rotational=0
------------
Run status group 0 (all jobs):
  WRITE: io=19477MB, aggrb=199424KB/s, minb=3001KB/s, maxb=3268KB/s, mint=100001msec, maxt=100010msec

Disk stats (read/write):
    dm-9: ios=40/4811814, merge=0/159762, ticks=0/14866276, in_queue=14912510, util=100.00%, aggrios=224/4814462, aggrmerge=0/0, aggrticks=13/14399688, aggrin_queue=14396342, aggrutil=99.92%
  skd0: ios=224/4814462, merge=0/0, ticks=13/14399688, in_queue=14396342, util=99.92%

rotational=1
------------
Run status group 0 (all jobs):
  WRITE: io=21051MB, aggrb=215188KB/s, minb=2147KB/s, maxb=8671KB/s, mint=100001msec, maxt=100175msec

Disk stats (read/write):
    dm-9: ios=59/1468500, merge=0/3836597, ticks=2/5928782, in_queue=5959348, util=100.00%, aggrios=152/1469016, aggrmerge=0/0, aggrticks=5/395471, aggrin_queue=395287, aggrutil=71.17%
  skd0: ios=152/1469016, merge=0/0, ticks=5/395471, in_queue=395287, util=71.17%


* Re: awful request merge results while simulating high IOPS multipath
  2015-02-25  0:56               ` awful request merge results while simulating high IOPS multipath Mike Snitzer
@ 2015-02-25  4:14                 ` Keith Busch
  2015-02-25 15:11                   ` Jens Axboe
  2015-02-25  4:38                 ` FIXED! [was: awful request merge results while simulating high IOPS] multipath Mike Snitzer
  1 sibling, 1 reply; 21+ messages in thread
From: Keith Busch @ 2015-02-25  4:14 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, shivakrishna.merla, Bart Van Assche, dm-devel,
	Keith Busch, jmoyer

To be honest, we also see underwhelming performance on high IOPS PCIe
SSDs. We know of some bottlenecks and ideas to test, but hoping to work
out the kinks in those ideas before Vault in two weeks.

On Tue, 24 Feb 2015, Mike Snitzer wrote:
> On Tue, Feb 24 2015 at  1:32pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
>
>> On Tue, Feb 24 2015 at  1:16pm -0500,
>> Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> So all of this needs to be tested and performance vetted. But my
>>>>> original suggestion was something like:
>>>>>
>>>>> if (run_queue && !md->queue->nr_pending)
>>>>> 	blk_run_queue_async(md->queue);
>>>>>
>>>>> which might be a bit extreme, but if we hit 0, that's the only case
>>>>> where you truly do need to run the queue. So that kind of logic
>>>>> would give you the highest chance of merge success, potentially at
>>>>> the cost of reduced performance for other cases.
>>>>
>>>> Yeah, I was wondering about not running the queue at all when discussing
>>>> with Jeff earlier today.  Seemed extreme, and Jeff thought it could
>>>> cause performance to really take a hit.
>>>
>>> Whether you have to or not depends on how you break out of queueing
>>> loops. But you definitely don't have to run it on every single
>>> request completion...
>>
>> OK, thanks for clarifying.
>>
>> Will see how the initial RFC patch I shared works for Netapp's testcase
>> but based on those results will work to arrive at a more
>> generic/intelligent solution.
>
> That initial RFC didn't do well.  And given my results below, I'm not
> holding out any hope that this io completion change, which we already
> discussed as being more sane, will have any impact:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=9068f2f15d5fc7a5c9dc6e6963913216ede0c745
>
> Here is the sorry state of affairs:
>
> Summary:
> ========
>
> Request-based DM is somehow breaking the block layer's ability to properly merge requests.
> Only good news is I seem to have a testbed that I can use to chase this issue down.
>
> Test kernel is v4.0-rc1 with a few DM patches, see:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next
>
>
> baseline:
> =========
>
> Writing to an STEC PCIe SSD using 64 threads, each issuing 4K sequential IOs.
> There are clearly merges happening.
>
> fio job:
> --------
> [64_seq_write]
> filename=/dev/skd0
> rw=write
> rwmixread=0
> blocksize=4k
> iodepth=16
> direct=1
> numjobs=64
> #nrfiles=1
> runtime=100
> ioengine=libaio
> time_based
>
> deadline:
> ---------
> Run status group 0 (all jobs):
>  WRITE: io=33356MB, aggrb=341513KB/s, minb=4082KB/s, maxb=5463KB/s, mint=100001msec, maxt=100014msec
>
> Disk stats (read/write):
>  skd0: ios=0/638099, merge=0/7483318, ticks=0/8015451, in_queue=8073672, util=100.00%
>
> cfq:
> ----
> rotational=0
> ------------
> Run status group 0 (all jobs):
>  WRITE: io=20970MB, aggrb=214663KB/s, minb=3202KB/s, maxb=3620KB/s, mint=100001msec, maxt=100030msec
>
> Disk stats (read/write):
>  skd0: ios=47/390277, merge=0/4939075, ticks=0/7082618, in_queue=7109995, util=100.00%
>
> rotational=1
> ------------
> Run status group 0 (all jobs):
>  WRITE: io=30075MB, aggrb=307130KB/s, minb=2613KB/s, maxb=10247KB/s, mint=100001msec, maxt=100271msec
>
> Disk stats (read/write):
>  skd0: ios=57/2217835, merge=0/5414906, ticks=0/6110715, in_queue=6120362, util=99.97%
>
>
> multipath:
> ==========
>
> Same workload as above, but this time through a request-based DM multipath target.
> There are clearly _no_ merges happening.
>
> echo "0 1563037696 multipath 0 0 1 1 service-time 0 1 2 /dev/skd0 1000 1" | dmsetup create skd_mpath
>
> [64_seq_write]
> #filename=/dev/mapper/skd_mpath
> filename=/dev/dm-9
> rw=write
> rwmixread=0
> blocksize=4k
> iodepth=16
> direct=1
> numjobs=64
> #nrfiles=1
> runtime=100
> ioengine=libaio
> time_based
>
> deadline:
> ---------
> Run status group 0 (all jobs):
>  WRITE: io=22559MB, aggrb=230976KB/s, minb=3565KB/s, maxb=3659KB/s, mint=100001msec, maxt=100011msec
>
> Disk stats (read/write):
>    dm-9: ios=71/5772755, merge=0/0, ticks=3/15309410, in_queue=15360084, util=100.00%, aggrios=164/5775057, aggrmerge=0/0, aggrticks=16/14960469, aggrin_queue=14956943, aggrutil=99.92%
>  skd0: ios=164/5775057, merge=0/0, ticks=16/14960469, in_queue=14956943, util=99.92%
>
> cfq:
> ----
> rotational=0
> ------------
> Run status group 0 (all jobs):
>  WRITE: io=19477MB, aggrb=199424KB/s, minb=3001KB/s, maxb=3268KB/s, mint=100001msec, maxt=100010msec
>
> Disk stats (read/write):
>    dm-9: ios=40/4811814, merge=0/159762, ticks=0/14866276, in_queue=14912510, util=100.00%, aggrios=224/4814462, aggrmerge=0/0, aggrticks=13/14399688, aggrin_queue=14396342, aggrutil=99.92%
>  skd0: ios=224/4814462, merge=0/0, ticks=13/14399688, in_queue=14396342, util=99.92%
>
> rotational=1
> ------------
> Run status group 0 (all jobs):
>  WRITE: io=21051MB, aggrb=215188KB/s, minb=2147KB/s, maxb=8671KB/s, mint=100001msec, maxt=100175msec
>
> Disk stats (read/write):
>    dm-9: ios=59/1468500, merge=0/3836597, ticks=2/5928782, in_queue=5959348, util=100.00%, aggrios=152/1469016, aggrmerge=0/0, aggrticks=5/395471, aggrin_queue=395287, aggrutil=71.17%
>  skd0: ios=152/1469016, merge=0/0, ticks=5/395471, in_queue=395287, util=71.17%
>


* FIXED! [was: awful request merge results while simulating high IOPS] multipath
  2015-02-25  0:56               ` awful request merge results while simulating high IOPS multipath Mike Snitzer
  2015-02-25  4:14                 ` Keith Busch
@ 2015-02-25  4:38                 ` Mike Snitzer
  2015-02-25  4:41                   ` Mike Snitzer
  1 sibling, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2015-02-25  4:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, jmoyer, shivakrishna.merla, dm-devel, Bart Van Assche

New multipath results using these 2 additional DM patches:

https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=d379ecbac533d3ac45c7f155b1fc1b760abace6e
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=d379ecbac533d3ac45c7f155b1fc1b760abace6e

(the important one being the last patch, thanks to Jens for the suggestion!)

Both deadline and noop schedulers work really well with this fio test now
(cfq is slightly slower):

[64_seq_write]
#filename=/dev/mapper/skd_mpath
filename=/dev/dm-7
rw=write
rwmixread=0
blocksize=4k
iodepth=16
direct=1
numjobs=64
runtime=100
ioengine=libaio
time_based

deadline:
---------
Run status group 0 (all jobs):
  WRITE: io=64402MB, aggrb=659426KB/s, minb=8960KB/s, maxb=10348KB/s, mint=100001msec, maxt=100007msec

Disk stats (read/write):
    dm-7: ios=78/951571, merge=0/14721874, ticks=1/6766375, in_queue=6868921, util=100.00%, aggrios=257/951897, aggrmerge=0/0, aggrticks=3/3467932, aggrin_queue=3467835, aggrutil=99.09%
  skd0: ios=257/951897, merge=0/0, ticks=3/3467932, in_queue=3467835, util=99.09%

(deadline is _really_ fast with this test now, noop is fast too, see below)

cfq:
----
rotational=0
------------
Run status group 0 (all jobs):
  WRITE: io=18719MB, aggrb=191652KB/s, minb=2639KB/s, maxb=4210KB/s, mint=100001msec, maxt=100018msec

Disk stats (read/write):
    dm-7: ios=91/368755, merge=0/4407385, ticks=2/6783504, in_queue=6815287, util=100.00%, aggrios=273/368851, aggrmerge=0/0, aggrticks=5/2962495, aggrin_queue=2962487, aggrutil=95.40%
  skd0: ios=273/368851, merge=0/0, ticks=5/2962495, in_queue=2962487, util=95.40%

rotational=1
------------
Run status group 0 (all jobs):
  WRITE: io=19660MB, aggrb=200940KB/s, minb=2304KB/s, maxb=6441KB/s, mint=100001msec, maxt=100188msec

Disk stats (read/write):
    dm-7: ios=91/786111, merge=0/4179472, ticks=2/5820606, in_queue=5834058, util=99.98%, aggrios=182/786507, aggrmerge=0/0, aggrticks=4/131800, aggrin_queue=131758, aggrutil=70.55%
  skd0: ios=182/786507, merge=0/0, ticks=4/131800, in_queue=131758, util=70.55%

noop:
-----
Run status group 0 (all jobs):
  WRITE: io=58264MB, aggrb=596517KB/s, minb=9301KB/s, maxb=9354KB/s, mint=100001msec, maxt=100018msec

Disk stats (read/write):
    dm-7: ios=116/998937, merge=0/13874486, ticks=4/6853397, in_queue=6934793, util=100.00%, aggrios=295/999061, aggrmerge=0/0, aggrticks=8/3776141, aggrin_queue=3776117, aggrutil=98.81%
  skd0: ios=295/999061, merge=0/0, ticks=8/3776141, in_queue=3776117, util=98.81%


* Re: FIXED! [was: awful request merge results while simulating high IOPS] multipath
  2015-02-25  4:38                 ` FIXED! [was: awful request merge results while simulating high IOPS] multipath Mike Snitzer
@ 2015-02-25  4:41                   ` Mike Snitzer
  0 siblings, 0 replies; 21+ messages in thread
From: Mike Snitzer @ 2015-02-25  4:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, jmoyer, shivakrishna.merla, dm-devel, Bart Van Assche

On Tue, Feb 24 2015 at 11:38pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> New multipath results using these 2 additional DM patches:
> 
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=d379ecbac533d3ac45c7f155b1fc1b760abace6e
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=d379ecbac533d3ac45c7f155b1fc1b760abace6e
> 
> (the important one being the last patch, thanks to Jens for the suggestion!)

Bleh, pasted the same url twice; this is the less significant patch of
the 2:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=489ab49e738667ba9d40b6d2b0b47910b5194208


* Re: awful request merge results while simulating high IOPS multipath
  2015-02-25  4:14                 ` Keith Busch
@ 2015-02-25 15:11                   ` Jens Axboe
  2015-02-25 18:17                     ` Busch, Keith
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2015-02-25 15:11 UTC (permalink / raw)
  To: Keith Busch, Mike Snitzer
  Cc: jmoyer, shivakrishna.merla, dm-devel, Bart Van Assche

On 02/24/2015 08:14 PM, Keith Busch wrote:
> To be honest, we also see underwhelming performance on high IOPS PCIe
> SSDs. We know of some bottlenecks and ideas to test, but hoping to work
> out the kinks in those ideas before Vault in two weeks.

Related to merging, or?

-- 
Jens Axboe


* Re: awful request merge results while simulating high IOPS multipath
  2015-02-25 15:11                   ` Jens Axboe
@ 2015-02-25 18:17                     ` Busch, Keith
  2015-02-25 22:10                       ` Mike Snitzer
  0 siblings, 1 reply; 21+ messages in thread
From: Busch, Keith @ 2015-02-25 18:17 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer
  Cc: jmoyer, shivakrishna.merla, dm-devel, Bart Van Assche

Sorry, my reply was a non sequitur to this thread. We don't do merging in NVMe.

Our first bottleneck appears to be the device mapper's single lock request queue.

> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Wednesday, February 25, 2015 8:12 AM
> To: Busch, Keith; Mike Snitzer
> Cc: jmoyer@redhat.com; dm-devel@redhat.com; shivakrishna.merla@netapp.com; Bart Van Assche
> Subject: Re: awful request merge results while simulating high IOPS multipath
> 
> On 02/24/2015 08:14 PM, Keith Busch wrote:
> > To be honest, we also see underwhelming performance on high IOPS PCIe
> > SSDs. We know of some bottlenecks and ideas to test, but hoping to work
> > out the kinks in those ideas before Vault in two weeks.
> 
> Related to merging, or?
> 
> --
> Jens Axboe

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: awful request merge results while simulating high IOPS multipath
  2015-02-25 18:17                     ` Busch, Keith
@ 2015-02-25 22:10                       ` Mike Snitzer
  2015-02-25 23:57                         ` Keith Busch
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2015-02-25 22:10 UTC (permalink / raw)
  To: Busch, Keith
  Cc: Jens Axboe, jmoyer, shivakrishna.merla, dm-devel, Bart Van Assche

On Wed, Feb 25 2015 at  1:17pm -0500,
Busch, Keith <keith.busch@intel.com> wrote:

> Sorry, my reply was a non sequitur to this thread. We don't do merging
> in NVMe.

NVMe may not, but current dm-multipath's top-level queue will.  And any
future blk-mq enabled dm-multipath (which I'm starting to look into now)
will need to as well.
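
For context, that merging happens when bios are queued to the top-level
request_queue: the elevator looks for an existing request the bio can
extend before a new request is allocated.  Heavily simplified from the
logic in block/blk-core.c (the helpers shown live in block-internal
headers), roughly:

static void example_queue_bio(struct request_queue *q, struct bio *bio)
{
        struct request *rq;

        spin_lock_irq(q->queue_lock);
        switch (elv_merge(q, &rq, bio)) {
        case ELEVATOR_BACK_MERGE:
                bio_attempt_back_merge(q, rq, bio);     /* extend rq's tail */
                break;
        case ELEVATOR_FRONT_MERGE:
                bio_attempt_front_merge(q, rq, bio);    /* extend rq's head */
                break;
        default:
                /* ELEVATOR_NO_MERGE: allocate a new request for this bio */
                break;
        }
        spin_unlock_irq(q->queue_lock);
}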

> Our first bottleneck appears to be the device mapper's single-lock
> request queue.

Obviously if we switched dm-multipath over to blk-mq we'd eliminate
that.  I'll see how things go and will share any changes I come up
with.
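
A very rough sketch of what that conversion could look like; every name
below (dm_mq_ops, dm_mq_queue_rq, the tag_set member) is a placeholder,
not merged code:

static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
                          const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;

        blk_mq_start_request(rq);
        /* ... map rq to an underlying path and issue the clone ... */
        return BLK_MQ_RQ_QUEUE_OK;
}

static struct blk_mq_ops dm_mq_ops = {
        .queue_rq       = dm_mq_queue_rq,
        .map_queue      = blk_mq_map_queue,
};

static int dm_mq_init_queue(struct mapped_device *md)
{
        md->tag_set.ops = &dm_mq_ops;
        md->tag_set.nr_hw_queues = 1;           /* start simple */
        md->tag_set.queue_depth = 2048;         /* see sizing question below */
        md->tag_set.numa_node = NUMA_NO_NODE;
        md->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;      /* keep merging */

        if (blk_mq_alloc_tag_set(&md->tag_set))
                return -ENOMEM;

        md->queue = blk_mq_init_queue(&md->tag_set);
        return IS_ERR(md->queue) ? PTR_ERR(md->queue) : 0;
}

The win is that blk-mq replaces the single queue_lock with per-context
software queues, which is exactly the bottleneck Keith describes.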

FYI, here is a related exchange Jens and I had on the LSF-only mailing
list:

On Tue, Feb 24 2015 at  1:43pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 02/24/2015 10:37 AM, Mike Snitzer wrote:
>
> >I agree.  I'd hate to be called up front to tap dance around some yet to
> >be analyzed issue.  But discussing the best way to update multipath for
> >blk-mq devices is fair game.
> >
> >As is, the current blk-mq code doesn't have any IO scheduler so the
> >overall approach that DM multipath _attempts_ to take (namely leaning on
> >the elevator to create larger requests that are then balanced across the
> >underlying paths) is a non-starter.
> 
> No it isn't: blk-mq still provides merging, and the logic would very
> much be the same there... I think the crux of the problem is the way
> too frequent queue runs; that'd similarly be a problem on the blk-mq
> front.
> 
> -- 
> Jens Axboe
> 
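
The "way too frequent queue runs" Jens mentions is the completion-side
behavior the RFC patch in this series pokes at: rather than kicking the
queue immediately on every completion, defer the run slightly so the
elevator sees more I/O to merge.  A sketch of the idea (the delay value
is illustrative, not a recommendation):

static void example_rq_completed(struct request_queue *q)
{
        /*
         * Immediate run: each completion triggers an immediate dispatch,
         * which starves the elevator of merge candidates at high IOPS.
         *
         *      blk_run_queue_async(q);
         *
         * Deferred run: give new I/O a few milliseconds to arrive and
         * merge before dispatching again.
         */
        blk_delay_queue(q, 10);         /* 10 msecs; illustrative value */
}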

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: awful request merge results while simulating high IOPS multipath
  2015-02-25 22:10                       ` Mike Snitzer
@ 2015-02-25 23:57                         ` Keith Busch
  2015-02-26  0:11                           ` Mike Snitzer
  0 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2015-02-25 23:57 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, shivakrishna.merla, Bart Van Assche, dm-devel, Busch,
	Keith, jmoyer

On Wed, 25 Feb 2015, Mike Snitzer wrote:
> On Wed, Feb 25 2015 at  1:17pm -0500,
> Busch, Keith <keith.busch@intel.com> wrote:
>> Our first bottleneck appears to be the device mapper's single-lock
>> request queue.
>
> Obviously if we switched dm-multipath over to blk-mq we'd eliminate
> that.  I'll see how things go and will share any changes I come up
> with.

Yes, I'm also looking at blk-mq. It appears conversion helps a lot.
I'm not sure though how many tags or h/w contexts to allocate to ensure
there's enough but not too many. You can't use the underlying device's
blk-tagset count (assuming you could even get to it) since those are
potentially shared among many block devices.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: awful request merge results while simulating high IOPS multipath
  2015-02-25 23:57                         ` Keith Busch
@ 2015-02-26  0:11                           ` Mike Snitzer
  2015-02-26  0:28                             ` Keith Busch
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Snitzer @ 2015-02-26  0:11 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, dm-devel, shivakrishna.merla, Bart Van Assche, jmoyer

On Wed, Feb 25 2015 at  6:57pm -0500,
Keith Busch <keith.busch@intel.com> wrote:

> On Wed, 25 Feb 2015, Mike Snitzer wrote:
> >On Wed, Feb 25 2015 at  1:17pm -0500,
> >Busch, Keith <keith.busch@intel.com> wrote:
> >>Our first bottleneck appears to be the device mapper's single-lock
> >>request queue.
> >
> >Obviously if we switched dm-multipath over to blk-mq we'd eliminate
> >that.  I'll see how things go and will share any changes I come up
> >with.
> 
> Yes, I'm also looking at blk-mq. It appears conversion helps a lot.

Oh, so you've already started a conversion of request-based DM?

> I'm not sure though how many tags or h/w contexts to allocate to ensure
> there's enough but not too many. You can't use the underlying device's
> blk-tagset count (assuming you could even get to it) since those are
> potentially shared among many block devices.

I'm very new to the blk-mq model, so I have some learning to do before
I'll be much help.

But there really won't be a one-size-fits-all amount for those resources,
will there?  A multipath device can have _a lot_ of underlying paths.
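
To make the trade-off concrete, one purely illustrative heuristic would
be to scale the top-level depth with the path count and clamp it (none
of these names or numbers come from real code):

static unsigned int dm_mq_guess_queue_depth(unsigned int nr_paths)
{
        /* assume a typical per-path LLD queue depth; pure guesswork */
        unsigned int per_path_depth = 64;

        return clamp_t(unsigned int, nr_paths * per_path_depth, 64, 4096);
}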

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: awful request merge results while simulating high IOPS multipath
  2015-02-26  0:11                           ` Mike Snitzer
@ 2015-02-26  0:28                             ` Keith Busch
  0 siblings, 0 replies; 21+ messages in thread
From: Keith Busch @ 2015-02-26  0:28 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, shivakrishna.merla, Bart Van Assche, jmoyer,
	Keith Busch, dm-devel

On Wed, 25 Feb 2015, Mike Snitzer wrote:
> On Wed, Feb 25 2015 at  6:57pm -0500,
> Keith Busch <keith.busch@intel.com> wrote:
>
>> Yes, I'm also looking at blk-mq. It appears conversion helps a lot.
>
> Oh, so you've already started a conversion of request-based DM?

Actually no, I've only barely started looking at it. I'm guessing at the
improvement based on using a bio-based device mapper on top of nvme. In
my experience blk-mq performs similar to or better than bio on benchmarks,
so I hope it's a fair assumption here too.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-02-26  0:28 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-24 16:44 [PATCH 0/4] dm: simplify request-based DM a bit and an RFC-like perf tweak Mike Snitzer
2015-02-24 16:44 ` [PATCH 1/4] dm: remove unnecessary wrapper around blk_lld_busy Mike Snitzer
2015-02-24 16:44 ` [PATCH 2/4] dm: remove request-based DM queue's lld_busy_fn hook Mike Snitzer
2015-02-24 16:44 ` [PATCH 3/4] dm: remove request-based logic from make_request_fn wrapper Mike Snitzer
2015-02-24 16:44 ` [RFC PATCH 4/4] dm: delay running the queue slightly during request completion Mike Snitzer
2015-02-24 16:51   ` Jens Axboe
2015-02-24 17:22     ` [RFC PATCH 4/4 v2] " Mike Snitzer
2015-02-24 17:52       ` Jens Axboe
2015-02-24 18:12         ` Mike Snitzer
2015-02-24 18:16           ` Jens Axboe
2015-02-24 18:32             ` Mike Snitzer
2015-02-25  0:56               ` awful request merge results while simulating high IOPS multipath Mike Snitzer
2015-02-25  4:14                 ` Keith Busch
2015-02-25 15:11                   ` Jens Axboe
2015-02-25 18:17                     ` Busch, Keith
2015-02-25 22:10                       ` Mike Snitzer
2015-02-25 23:57                         ` Keith Busch
2015-02-26  0:11                           ` Mike Snitzer
2015-02-26  0:28                             ` Keith Busch
2015-02-25  4:38                 ` FIXED! [was: awful request merge results while simulating high IOPS] multipath Mike Snitzer
2015-02-25  4:41                   ` Mike Snitzer
