* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()"
       [not found] <529c2394-1b58-b9d8-d462-1f3de1b78ac8@oracle.com>
@ 2020-09-10 14:24 ` Mike Snitzer
  2020-09-10 19:29   ` Vijayendra Suman
  2020-09-11 12:20   ` Ming Lei
  0 siblings, 2 replies; 28+ messages in thread
From: Mike Snitzer @ 2020-09-10 14:24 UTC (permalink / raw)
  To: Vijayendra Suman; +Cc: linux-block, dm-devel

[cc'ing dm-devel and linux-block because this is upstream concern too]

On Wed, Sep 09 2020 at  1:00pm -0400,
Vijayendra Suman <vijayendra.suman@oracle.com> wrote:

>    Hello Mike,
> 
>    While Running pgbench tool with  5.4.17 kernel build
> 
>    Following performance degrade is found out
> 
>    buffer read/write metric : -17.2%
>    cache read/write metric : -18.7%
>    disk read/write metric : -19%
> 
>    buffer
>    number of transactions actually processed: 840972
>    latency average = 24.013 ms
>    tps = 4664.153934 (including connections establishing)
>    tps = 4664.421492 (excluding connections establishing)
> 
>    cache
>    number of transactions actually processed: 551345
>    latency average = 36.949 ms
>    tps = 3031.223905 (including connections establishing)
>    tps = 3031.402581 (excluding connections establishing)
> 
>    After revert of Commit
>    2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call
>    blk_queue_split() in dm_process_bio()")

I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's
backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ?

>    Performance is Counter measurement
> 
>    buffer ->
>    number of transactions actually processed: 1135735
>    latency average = 17.799 ms
>    tps = 6292.586749 (including connections establishing)
>    tps = 6292.875089 (excluding connections establishing)
> 
>    cache ->
>    number of transactions actually processed: 648177
>    latency average = 31.217 ms
>    tps = 3587.755975 (including connections establishing)
>    tps = 3587.966359 (excluding connections establishing)
> 
>    Following is your commit
> 
>    diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>    index cf71a2277d60..1e6e0c970e19 100644
>    --- a/drivers/md/dm.c
>    +++ b/drivers/md/dm.c
>    @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device
>    *md,
>             * won't be imposed.
>             */
>            if (current->bio_list) {
>    -               blk_queue_split(md->queue, &bio);
>    -               if (!is_abnormal_io(bio))
>    +               if (is_abnormal_io(bio))
>    +                       blk_queue_split(md->queue, &bio);
>    +               else
>                            dm_queue_split(md, ti, &bio);
>            }
> 
>    Could you have a look if it is safe to revert this commit.

No, it really isn't a good idea given what was documented in the commit
header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the
excessive splitting is not conducive to performance either.

So I think we need to identify _why_ reverting this commit is causing
such a performance improvement.  Why is calling blk_queue_split() before
dm_queue_split() benefiting your pgbench workload?

Thanks,
Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()"
  2020-09-10 14:24 ` Revert "dm: always call blk_queue_split() in dm_process_bio()" Mike Snitzer
@ 2020-09-10 19:29   ` Vijayendra Suman
  2020-09-15  1:33     ` Mike Snitzer
  2020-09-11 12:20   ` Ming Lei
  1 sibling, 1 reply; 28+ messages in thread
From: Vijayendra Suman @ 2020-09-10 19:29 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel, linux-block, RAMANAN_GOVINDARAJAN, Somu Krishnasamy

Hello Mike,

I checked with the upstream kernel; the performance measurements are similar
and show an improvement when 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 is
reverted.

On 9/10/2020 7:54 PM, Mike Snitzer wrote:
> [cc'ing dm-devel and linux-block because this is upstream concern too]
>
> On Wed, Sep 09 2020 at  1:00pm -0400,
> Vijayendra Suman <vijayendra.suman@oracle.com> wrote:
>
>>     Hello Mike,
>>
>>     While Running pgbench tool with  5.4.17 kernel build
>>
>>     Following performance degrade is found out
>>
>>     buffer read/write metric : -17.2%
>>     cache read/write metric : -18.7%
>>     disk read/write metric : -19%
>>
>>     buffer
>>     number of transactions actually processed: 840972
>>     latency average = 24.013 ms
>>     tps = 4664.153934 (including connections establishing)
>>     tps = 4664.421492 (excluding connections establishing)
>>
>>     cache
>>     number of transactions actually processed: 551345
>>     latency average = 36.949 ms
>>     tps = 3031.223905 (including connections establishing)
>>     tps = 3031.402581 (excluding connections establishing)
>>
>>     After revert of Commit
>>     2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call
>>     blk_queue_split() in dm_process_bio()")
> I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's
> backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ?
Yes
>>     Performance is Counter measurement
>>
>>     buffer ->
>>     number of transactions actually processed: 1135735
>>     latency average = 17.799 ms
>>     tps = 6292.586749 (including connections establishing)
>>     tps = 6292.875089 (excluding connections establishing)
>>
>>     cache ->
>>     number of transactions actually processed: 648177
>>     latency average = 31.217 ms
>>     tps = 3587.755975 (including connections establishing)
>>     tps = 3587.966359 (excluding connections establishing)
>>
>>     Following is your commit
>>
>>     diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>     index cf71a2277d60..1e6e0c970e19 100644
>>     --- a/drivers/md/dm.c
>>     +++ b/drivers/md/dm.c
>>     @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device
>>     *md,
>>              * won't be imposed.
>>              */
>>             if (current->bio_list) {
>>     -               blk_queue_split(md->queue, &bio);
>>     -               if (!is_abnormal_io(bio))
>>     +               if (is_abnormal_io(bio))
>>     +                       blk_queue_split(md->queue, &bio);
>>     +               else
>>                             dm_queue_split(md, ti, &bio);
>>             }
>>
>>     Could you have a look if it is safe to revert this commit.
> No, it really isn't a good idea given what was documented in the commit
> header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the
> excessive splitting is not conducive to performance either.
>
> So I think we need to identify _why_ reverting this commit is causing
> such a performance improvement.  Why is calling blk_queue_split() before
> dm_queue_split() benefiting your pgbench workload?
Let me know if you would like me to test a patch.
>
> Thanks,
> Mike
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()"
  2020-09-10 14:24 ` Revert "dm: always call blk_queue_split() in dm_process_bio()" Mike Snitzer
  2020-09-10 19:29   ` Vijayendra Suman
@ 2020-09-11 12:20   ` Ming Lei
  2020-09-11 16:13     ` Mike Snitzer
  1 sibling, 1 reply; 28+ messages in thread
From: Ming Lei @ 2020-09-11 12:20 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Vijayendra Suman, dm-devel, linux-block

On Thu, Sep 10, 2020 at 10:24:39AM -0400, Mike Snitzer wrote:
> [cc'ing dm-devel and linux-block because this is upstream concern too]
> 
> On Wed, Sep 09 2020 at  1:00pm -0400,
> Vijayendra Suman <vijayendra.suman@oracle.com> wrote:
> 
> >    Hello Mike,
> > 
> >    While Running pgbench tool with  5.4.17 kernel build
> > 
> >    Following performance degrade is found out
> > 
> >    buffer read/write metric : -17.2%
> >    cache read/write metric : -18.7%
> >    disk read/write metric : -19%
> > 
> >    buffer
> >    number of transactions actually processed: 840972
> >    latency average = 24.013 ms
> >    tps = 4664.153934 (including connections establishing)
> >    tps = 4664.421492 (excluding connections establishing)
> > 
> >    cache
> >    number of transactions actually processed: 551345
> >    latency average = 36.949 ms
> >    tps = 3031.223905 (including connections establishing)
> >    tps = 3031.402581 (excluding connections establishing)
> > 
> >    After revert of Commit
> >    2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call
> >    blk_queue_split() in dm_process_bio()")
> 
> I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's
> backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ?
> 
> >    Performance is Counter measurement
> > 
> >    buffer ->
> >    number of transactions actually processed: 1135735
> >    latency average = 17.799 ms
> >    tps = 6292.586749 (including connections establishing)
> >    tps = 6292.875089 (excluding connections establishing)
> > 
> >    cache ->
> >    number of transactions actually processed: 648177
> >    latency average = 31.217 ms
> >    tps = 3587.755975 (including connections establishing)
> >    tps = 3587.966359 (excluding connections establishing)
> > 
> >    Following is your commit
> > 
> >    diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >    index cf71a2277d60..1e6e0c970e19 100644
> >    --- a/drivers/md/dm.c
> >    +++ b/drivers/md/dm.c
> >    @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device
> >    *md,
> >             * won't be imposed.
> >             */
> >            if (current->bio_list) {
> >    -               blk_queue_split(md->queue, &bio);
> >    -               if (!is_abnormal_io(bio))
> >    +               if (is_abnormal_io(bio))
> >    +                       blk_queue_split(md->queue, &bio);
> >    +               else
> >                            dm_queue_split(md, ti, &bio);
> >            }
> > 
> >    Could you have a look if it is safe to revert this commit.
> 
> No, it really isn't a good idea given what was documented in the commit
> header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the
> excessive splitting is not conducive to performance either.
> 
> So I think we need to identify _why_ reverting this commit is causing
> such a performance improvement.  Why is calling blk_queue_split() before
> dm_queue_split() benefiting your pgbench workload?

blk_queue_split() takes every queue limit into account, while dm_queue_split()
only splits the bio according to max len(offset, chunk size), so the
split bio may not be optimal from the device's viewpoint.

Maybe DM can switch to blk_queue_split() if the 'chunk_sectors' limit is
power-of-2 aligned.
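
For reference, the chunk-based clamp that blk_queue_split() applies (via
blk_max_size_offset()) boils down to something like this -- a simplified
sketch (helper name made up) of the current power-of-2 code, not a proposed
change:

	static inline unsigned int max_len_in_chunk(struct request_queue *q,
						    sector_t offset)
	{
		unsigned int chunk_sectors = q->limits.chunk_sectors;

		if (!chunk_sectors)
			return q->limits.max_sectors;

		/* room left before the next chunk_sectors boundary */
		return min(q->limits.max_sectors, chunk_sectors -
			   (unsigned int)(offset & (chunk_sectors - 1)));
	}

whereas dm_queue_split() only caps the bio at the target's max_io_len,
relative to the start of the target.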


Thanks,
Ming

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()"
  2020-09-11 12:20   ` Ming Lei
@ 2020-09-11 16:13     ` Mike Snitzer
  2020-09-11 21:53       ` [PATCH 0/3] block: a few chunk_sectors fixes/improvements Mike Snitzer
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2020-09-11 16:13 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, dm-devel, Vijayendra Suman

On Fri, Sep 11 2020 at  8:20am -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Thu, Sep 10, 2020 at 10:24:39AM -0400, Mike Snitzer wrote:
> > [cc'ing dm-devel and linux-block because this is upstream concern too]
> > 
> > On Wed, Sep 09 2020 at  1:00pm -0400,
> > Vijayendra Suman <vijayendra.suman@oracle.com> wrote:
> > 
> > >    Hello Mike,
> > > 
> > >    While Running pgbench tool with  5.4.17 kernel build
> > > 
> > >    Following performance degrade is found out
> > > 
> > >    buffer read/write metric : -17.2%
> > >    cache read/write metric : -18.7%
> > >    disk read/write metric : -19%
> > > 
> > >    buffer
> > >    number of transactions actually processed: 840972
> > >    latency average = 24.013 ms
> > >    tps = 4664.153934 (including connections establishing)
> > >    tps = 4664.421492 (excluding connections establishing)
> > > 
> > >    cache
> > >    number of transactions actually processed: 551345
> > >    latency average = 36.949 ms
> > >    tps = 3031.223905 (including connections establishing)
> > >    tps = 3031.402581 (excluding connections establishing)
> > > 
> > >    After revert of Commit
> > >    2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call
> > >    blk_queue_split() in dm_process_bio()")
> > 
> > I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's
> > backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ?
> > 
> > >    Performance is Counter measurement
> > > 
> > >    buffer ->
> > >    number of transactions actually processed: 1135735
> > >    latency average = 17.799 ms
> > >    tps = 6292.586749 (including connections establishing)
> > >    tps = 6292.875089 (excluding connections establishing)
> > > 
> > >    cache ->
> > >    number of transactions actually processed: 648177
> > >    latency average = 31.217 ms
> > >    tps = 3587.755975 (including connections establishing)
> > >    tps = 3587.966359 (excluding connections establishing)
> > > 
> > >    Following is your commit
> > > 
> > >    diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > >    index cf71a2277d60..1e6e0c970e19 100644
> > >    --- a/drivers/md/dm.c
> > >    +++ b/drivers/md/dm.c
> > >    @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device
> > >    *md,
> > >             * won't be imposed.
> > >             */
> > >            if (current->bio_list) {
> > >    -               blk_queue_split(md->queue, &bio);
> > >    -               if (!is_abnormal_io(bio))
> > >    +               if (is_abnormal_io(bio))
> > >    +                       blk_queue_split(md->queue, &bio);
> > >    +               else
> > >                            dm_queue_split(md, ti, &bio);
> > >            }
> > > 
> > >    Could you have a look if it is safe to revert this commit.
> > 
> > No, it really isn't a good idea given what was documented in the commit
> > header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the
> > excessive splitting is not conducive to performance either.
> > 
> > So I think we need to identify _why_ reverting this commit is causing
> > such a performance improvement.  Why is calling blk_queue_split() before
> > dm_queue_split() benefiting your pgbench workload?
> 
> blk_queue_split() takes every queue limit into account, while dm_queue_split()
> only splits the bio according to max len(offset, chunk size), so the
> split bio may not be optimal from the device's viewpoint.

Yes, but the issue is blk_queue_split() is doing the wrong thing for the
case described in the header for commit
120c9257f5f19e5d1e87efcbb5531b7cd81b7d74
 
> Maybe DM can switch to blk_queue_split() if the 'chunk_sectors' limit is
> power-of-2 aligned.

I'm not seeing the relation to chunk_sectors being a power of 2 -- other than
that being all block core supports.  But chunk_sectors isn't set for DM; you
added chunk_sectors for MD or something and DM was caught out, so
blk_queue_split() falls back to splitting on max_sectors.
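
To make that concrete (numbers purely illustrative): with a 256KiB dm-stripe
chunk, ti->max_io_len is 512 sectors, so dm_queue_split() cuts a large bio on
512-sector boundaries that line up with the stripe chunks; blk_queue_split(),
seeing chunk_sectors == 0, instead cuts on max_sectors (say 1280 sectors) from
wherever the bio happens to start, with no relation to the stripe geometry.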

You're saying DM should set 'chunk_sectors' IFF it'd be a power of 2?
While I could do that, it seems like just continuing a sequence of
hacks around the earlier imposed chunk_sectors infrastructure, which was a
half-measure to begin with.  I think the chunk_sectors logic in block core
needs to be enhanced -- but I'll take a closer look.

Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 0/3] block: a few chunk_sectors fixes/improvements
  2020-09-11 16:13     ` Mike Snitzer
@ 2020-09-11 21:53       ` Mike Snitzer
  2020-09-11 21:53         ` [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully Mike Snitzer
                           ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Mike Snitzer @ 2020-09-11 21:53 UTC (permalink / raw)
  To: Ming Lei; +Cc: Vijayendra Suman, dm-devel, linux-block

Hi,

Here are some changes that seem needed but that stop short of fixing the
initial report (of DM pgbench regression).  Would be nice if you could
review these block patches since they stand on their own.

I still have to look closer at how to get blk_queue_split() the info DM
has (in ti->max_io_len). Ideally a variant of blk_queue_split() could be
created that allows a 'chunk_sectors' override to be passed in, e.g.:
dm_process_bio() would call: blk_queue_split(&bio, ti->max_io_len);
And the provided ti->max_io_len would be used instead of a global (one
size fits all) q->limits.chunk_sectors.  The reason why this is
needed/ideal is due to DM's stacked nature.  Different offsets into a DM
device could yield entirely different max_io_len (or chunk_sectors)
settings.

BUT short of standing up a new variant of blk_queue_split() with per bio
override for 'chunk_sectors' (which is likely a non-starter for a few
reasons, the recursive nature of bio_split being the biggest): I'll need to
update all DM targets that call dm_set_target_max_io_len() to also do
this in each target's .io_hints hook:
  limits->chunk_sectors = lcm_not_zero(ti->max_io_len, limits->chunk_sectors);
Won't be perfect for some stacked devices (given it'll constrain
chunk_sectors to be less optimal as the IO limits are stacked) but it
should be an improvement -- and hopefully fix this pgbench regression.
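
For dm-stripe the .io_hints change would look roughly like this (untested
sketch, not an actual patch; the io_min/io_opt calls are what the target
already sets, the chunk_sectors assignment is the addition described above):

	static void stripe_io_hints(struct dm_target *ti,
				    struct queue_limits *limits)
	{
		struct stripe_c *sc = ti->private;
		unsigned int chunk_size = sc->chunk_size << SECTOR_SHIFT;

		blk_limits_io_min(limits, chunk_size);
		blk_limits_io_opt(limits, chunk_size * sc->stripes);

		/* proposed addition: feed the max_io_len established by
		 * dm_set_target_max_io_len() in the constructor into
		 * chunk_sectors so blk_queue_split() can split on a
		 * stripe-aligned boundary (needs linux/lcm.h) */
		limits->chunk_sectors = lcm_not_zero(ti->max_io_len,
						     limits->chunk_sectors);
	}

Every other target that calls dm_set_target_max_io_len() would need the
equivalent line in its .io_hints hook.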

Thanks,
Mike

Mike Snitzer (3):
  block: fix blk_rq_get_max_sectors() to flow more carefully
  block: use lcm_not_zero() when stacking chunk_sectors
  block: allow 'chunk_sectors' to be non-power-of-2

 block/blk-settings.c   | 22 ++++++++++++----------
 include/linux/blkdev.h | 31 ++++++++++++++++++++++---------
 2 files changed, 34 insertions(+), 19 deletions(-)

-- 
2.15.0

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-11 21:53       ` [PATCH 0/3] block: a few chunk_sectors fixes/improvements Mike Snitzer
@ 2020-09-11 21:53         ` Mike Snitzer
  2020-09-12 13:52           ` Ming Lei
  2020-09-14  0:46           ` Damien Le Moal
  2020-09-11 21:53         ` [PATCH 2/3] block: use lcm_not_zero() when stacking chunk_sectors Mike Snitzer
  2020-09-11 21:53         ` [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 Mike Snitzer
  2 siblings, 2 replies; 28+ messages in thread
From: Mike Snitzer @ 2020-09-11 21:53 UTC (permalink / raw)
  To: Ming Lei; +Cc: Vijayendra Suman, dm-devel, linux-block

blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
those operations.

Also, there is no need to avoid blk_max_size_offset() if
'chunk_sectors' isn't set because it falls back to 'max_sectors'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 include/linux/blkdev.h | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bb5636cc17b9..453a3d735d66 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 						  sector_t offset)
 {
 	struct request_queue *q = rq->q;
+	int op;
+	unsigned int max_sectors;
 
 	if (blk_rq_is_passthrough(rq))
 		return q->limits.max_hw_sectors;
 
-	if (!q->limits.chunk_sectors ||
-	    req_op(rq) == REQ_OP_DISCARD ||
-	    req_op(rq) == REQ_OP_SECURE_ERASE)
-		return blk_queue_get_max_sectors(q, req_op(rq));
+	op = req_op(rq);
+	max_sectors = blk_queue_get_max_sectors(q, op);
 
-	return min(blk_max_size_offset(q, offset),
-			blk_queue_get_max_sectors(q, req_op(rq)));
+	switch (op) {
+	case REQ_OP_DISCARD:
+	case REQ_OP_SECURE_ERASE:
+	case REQ_OP_WRITE_SAME:
+	case REQ_OP_WRITE_ZEROES:
+		return max_sectors;
+	}
+
+	return min(blk_max_size_offset(q, offset), max_sectors);
 }
 
 static inline unsigned int blk_rq_count_bios(struct request *rq)
-- 
2.15.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/3] block: use lcm_not_zero() when stacking chunk_sectors
  2020-09-11 21:53       ` [PATCH 0/3] block: a few chunk_sectors fixes/improvements Mike Snitzer
  2020-09-11 21:53         ` [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully Mike Snitzer
@ 2020-09-11 21:53         ` Mike Snitzer
  2020-09-12 13:58           ` Ming Lei
  2020-09-11 21:53         ` [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 Mike Snitzer
  2 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2020-09-11 21:53 UTC (permalink / raw)
  To: Ming Lei; +Cc: Vijayendra Suman, dm-devel, linux-block

Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using
lcm_not_zero() rather than min_not_zero() -- otherwise the final
'chunk_sectors' could result in sub-optimal alignment of IO to
component devices in the IO stack.

Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size'
then it is a bug in the driver and the device should be flagged as
'misaligned'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-settings.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 76a7e03bcd6c..b09642d5f15e 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -534,6 +534,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 
 	t->io_min = max(t->io_min, b->io_min);
 	t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
+	t->chunk_sectors = lcm_not_zero(t->chunk_sectors, b->chunk_sectors);
 
 	/* Physical block size a multiple of the logical block size? */
 	if (t->physical_block_size & (t->logical_block_size - 1)) {
@@ -556,6 +557,13 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		ret = -1;
 	}
 
+	/* chunk_sectors a multiple of the physical block size? */
+	if (t->chunk_sectors & (t->physical_block_size - 1)) {
+		t->chunk_sectors = 0;
+		t->misaligned = 1;
+		ret = -1;
+	}
+
 	t->raid_partial_stripes_expensive =
 		max(t->raid_partial_stripes_expensive,
 		    b->raid_partial_stripes_expensive);
@@ -594,10 +602,6 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 			t->discard_granularity;
 	}
 
-	if (b->chunk_sectors)
-		t->chunk_sectors = min_not_zero(t->chunk_sectors,
-						b->chunk_sectors);
-
 	t->zoned = max(t->zoned, b->zoned);
 	return ret;
 }
-- 
2.15.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2
  2020-09-11 21:53       ` [PATCH 0/3] block: a few chunk_sectors fixes/improvements Mike Snitzer
  2020-09-11 21:53         ` [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully Mike Snitzer
  2020-09-11 21:53         ` [PATCH 2/3] block: use lcm_not_zero() when stacking chunk_sectors Mike Snitzer
@ 2020-09-11 21:53         ` Mike Snitzer
  2020-09-12 14:06           ` Ming Lei
  2020-09-14  0:55           ` Damien Le Moal
  2 siblings, 2 replies; 28+ messages in thread
From: Mike Snitzer @ 2020-09-11 21:53 UTC (permalink / raw)
  To: Ming Lei; +Cc: Vijayendra Suman, dm-devel, linux-block

It is possible for a block device to use a non power-of-2 for chunk
size which results in a full-stripe size that is also a non
power-of-2.

Update blk_queue_chunk_sectors() and blk_max_size_offset() to
accommodate drivers that need a non power-of-2 chunk_sectors.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-settings.c   | 10 ++++------
 include/linux/blkdev.h | 12 +++++++++---
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index b09642d5f15e..e40a162cc946 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -172,15 +172,13 @@ EXPORT_SYMBOL(blk_queue_max_hw_sectors);
  *
  * Description:
  *    If a driver doesn't want IOs to cross a given chunk size, it can set
- *    this limit and prevent merging across chunks. Note that the chunk size
- *    must currently be a power-of-2 in sectors. Also note that the block
- *    layer must accept a page worth of data at any offset. So if the
- *    crossing of chunks is a hard limitation in the driver, it must still be
- *    prepared to split single page bios.
+ *    this limit and prevent merging across chunks. Note that the block layer
+ *    must accept a page worth of data at any offset. So if the crossing of
+ *    chunks is a hard limitation in the driver, it must still be prepared
+ *    to split single page bios.
  **/
 void blk_queue_chunk_sectors(struct request_queue *q, unsigned int chunk_sectors)
 {
-	BUG_ON(!is_power_of_2(chunk_sectors));
 	q->limits.chunk_sectors = chunk_sectors;
 }
 EXPORT_SYMBOL(blk_queue_chunk_sectors);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 453a3d735d66..e72bcce22143 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1059,11 +1059,17 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 static inline unsigned int blk_max_size_offset(struct request_queue *q,
 					       sector_t offset)
 {
-	if (!q->limits.chunk_sectors)
+	unsigned int chunk_sectors = q->limits.chunk_sectors;
+
+	if (!chunk_sectors)
 		return q->limits.max_sectors;
 
-	return min(q->limits.max_sectors, (unsigned int)(q->limits.chunk_sectors -
-			(offset & (q->limits.chunk_sectors - 1))));
+	if (is_power_of_2(chunk_sectors))
+		chunk_sectors -= (offset & (chunk_sectors - 1));
+	else
+		chunk_sectors -= sector_div(offset, chunk_sectors);
+
+	return min(q->limits.max_sectors, chunk_sectors);
 }
 
 static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
-- 
2.15.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-11 21:53         ` [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully Mike Snitzer
@ 2020-09-12 13:52           ` Ming Lei
  2020-09-14  0:43             ` Damien Le Moal
  2020-09-14 14:49             ` Mike Snitzer
  2020-09-14  0:46           ` Damien Le Moal
  1 sibling, 2 replies; 28+ messages in thread
From: Ming Lei @ 2020-09-12 13:52 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Vijayendra Suman, dm-devel, linux-block

On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> those operations.

Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
chunk_sectors is set:

        return min(blk_max_size_offset(q, offset),
                        blk_queue_get_max_sectors(q, req_op(rq)));
 
> Also, there is no need to avoid blk_max_size_offset() if
> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  include/linux/blkdev.h | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bb5636cc17b9..453a3d735d66 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>  						  sector_t offset)
>  {
>  	struct request_queue *q = rq->q;
> +	int op;
> +	unsigned int max_sectors;
>  
>  	if (blk_rq_is_passthrough(rq))
>  		return q->limits.max_hw_sectors;
>  
> -	if (!q->limits.chunk_sectors ||
> -	    req_op(rq) == REQ_OP_DISCARD ||
> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> -		return blk_queue_get_max_sectors(q, req_op(rq));
> +	op = req_op(rq);
> +	max_sectors = blk_queue_get_max_sectors(q, op);
>  
> -	return min(blk_max_size_offset(q, offset),
> -			blk_queue_get_max_sectors(q, req_op(rq)));
> +	switch (op) {
> +	case REQ_OP_DISCARD:
> +	case REQ_OP_SECURE_ERASE:
> +	case REQ_OP_WRITE_SAME:
> +	case REQ_OP_WRITE_ZEROES:
> +		return max_sectors;
> +	}
> +
> +	return min(blk_max_size_offset(q, offset), max_sectors);
>  }

It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
needs to be considered.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] block: use lcm_not_zero() when stacking chunk_sectors
  2020-09-11 21:53         ` [PATCH 2/3] block: use lcm_not_zero() when stacking chunk_sectors Mike Snitzer
@ 2020-09-12 13:58           ` Ming Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Ming Lei @ 2020-09-12 13:58 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-block, dm-devel, Vijayendra Suman

On Fri, Sep 11, 2020 at 05:53:37PM -0400, Mike Snitzer wrote:
> Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using
> lcm_not_zero() rather than min_not_zero() -- otherwise the final
> 'chunk_sectors' could result in sub-optimal alignment of IO to
> component devices in the IO stack.
> 
> Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size'
> then it is a bug in the driver and the device should be flagged as
> 'misaligned'.
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  block/blk-settings.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 76a7e03bcd6c..b09642d5f15e 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -534,6 +534,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  
>  	t->io_min = max(t->io_min, b->io_min);
>  	t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
> +	t->chunk_sectors = lcm_not_zero(t->chunk_sectors, b->chunk_sectors);
>  
>  	/* Physical block size a multiple of the logical block size? */
>  	if (t->physical_block_size & (t->logical_block_size - 1)) {
> @@ -556,6 +557,13 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  		ret = -1;
>  	}
>  
> +	/* chunk_sectors a multiple of the physical block size? */
> +	if (t->chunk_sectors & (t->physical_block_size - 1)) {
> +		t->chunk_sectors = 0;
> +		t->misaligned = 1;
> +		ret = -1;
> +	}
> +
>  	t->raid_partial_stripes_expensive =
>  		max(t->raid_partial_stripes_expensive,
>  		    b->raid_partial_stripes_expensive);
> @@ -594,10 +602,6 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  			t->discard_granularity;
>  	}
>  
> -	if (b->chunk_sectors)
> -		t->chunk_sectors = min_not_zero(t->chunk_sectors,
> -						b->chunk_sectors);
> -
>  	t->zoned = max(t->zoned, b->zoned);
>  	return ret;
>  }

Looks fine:

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2
  2020-09-11 21:53         ` [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 Mike Snitzer
@ 2020-09-12 14:06           ` Ming Lei
  2020-09-14  2:43             ` Keith Busch
  2020-09-14  0:55           ` Damien Le Moal
  1 sibling, 1 reply; 28+ messages in thread
From: Ming Lei @ 2020-09-12 14:06 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-block, dm-devel, Vijayendra Suman

On Fri, Sep 11, 2020 at 05:53:38PM -0400, Mike Snitzer wrote:
> It is possible for a block device to use a non power-of-2 for chunk
> size which results in a full-stripe size that is also a non
> power-of-2.
> 
> Update blk_queue_chunk_sectors() and blk_max_size_offset() to
> accommodate drivers that need a non power-of-2 chunk_sectors.
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  block/blk-settings.c   | 10 ++++------
>  include/linux/blkdev.h | 12 +++++++++---
>  2 files changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index b09642d5f15e..e40a162cc946 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -172,15 +172,13 @@ EXPORT_SYMBOL(blk_queue_max_hw_sectors);
>   *
>   * Description:
>   *    If a driver doesn't want IOs to cross a given chunk size, it can set
> - *    this limit and prevent merging across chunks. Note that the chunk size
> - *    must currently be a power-of-2 in sectors. Also note that the block
> - *    layer must accept a page worth of data at any offset. So if the
> - *    crossing of chunks is a hard limitation in the driver, it must still be
> - *    prepared to split single page bios.
> + *    this limit and prevent merging across chunks. Note that the block layer
> + *    must accept a page worth of data at any offset. So if the crossing of
> + *    chunks is a hard limitation in the driver, it must still be prepared
> + *    to split single page bios.
>   **/
>  void blk_queue_chunk_sectors(struct request_queue *q, unsigned int chunk_sectors)
>  {
> -	BUG_ON(!is_power_of_2(chunk_sectors));
>  	q->limits.chunk_sectors = chunk_sectors;
>  }
>  EXPORT_SYMBOL(blk_queue_chunk_sectors);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 453a3d735d66..e72bcce22143 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1059,11 +1059,17 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
>  static inline unsigned int blk_max_size_offset(struct request_queue *q,
>  					       sector_t offset)
>  {
> -	if (!q->limits.chunk_sectors)
> +	unsigned int chunk_sectors = q->limits.chunk_sectors;
> +
> +	if (!chunk_sectors)
>  		return q->limits.max_sectors;
>  
> -	return min(q->limits.max_sectors, (unsigned int)(q->limits.chunk_sectors -
> -			(offset & (q->limits.chunk_sectors - 1))));
> +	if (is_power_of_2(chunk_sectors))
> +		chunk_sectors -= (offset & (chunk_sectors - 1));
> +	else
> +		chunk_sectors -= sector_div(offset, chunk_sectors);
> +
> +	return min(q->limits.max_sectors, chunk_sectors);
>  }
>  
>  static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> -- 
> 2.15.0
> 

is_power_of_2() is cheap enough for fast path, so looks fine to support
non-power-of-2 chunk sectors.

Maybe NVMe PCI can remove the power_of_2() limit too.

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-12 13:52           ` Ming Lei
@ 2020-09-14  0:43             ` Damien Le Moal
  2020-09-14 14:52               ` Mike Snitzer
  2020-09-15  2:03               ` Ming Lei
  2020-09-14 14:49             ` Mike Snitzer
  1 sibling, 2 replies; 28+ messages in thread
From: Damien Le Moal @ 2020-09-14  0:43 UTC (permalink / raw)
  To: Ming Lei, Mike Snitzer; +Cc: Vijayendra Suman, dm-devel, linux-block

On 2020/09/12 22:53, Ming Lei wrote:
> On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
>> those operations.
> 
> Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
> chunk_sectors is set:
> 
>         return min(blk_max_size_offset(q, offset),
>                         blk_queue_get_max_sectors(q, req_op(rq)));
>  
>> Also, there is no need to avoid blk_max_size_offset() if
>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
>>
>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>> ---
>>  include/linux/blkdev.h | 19 +++++++++++++------
>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index bb5636cc17b9..453a3d735d66 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>>  						  sector_t offset)
>>  {
>>  	struct request_queue *q = rq->q;
>> +	int op;
>> +	unsigned int max_sectors;
>>  
>>  	if (blk_rq_is_passthrough(rq))
>>  		return q->limits.max_hw_sectors;
>>  
>> -	if (!q->limits.chunk_sectors ||
>> -	    req_op(rq) == REQ_OP_DISCARD ||
>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
>> -		return blk_queue_get_max_sectors(q, req_op(rq));
>> +	op = req_op(rq);
>> +	max_sectors = blk_queue_get_max_sectors(q, op);
>>  
>> -	return min(blk_max_size_offset(q, offset),
>> -			blk_queue_get_max_sectors(q, req_op(rq)));
>> +	switch (op) {
>> +	case REQ_OP_DISCARD:
>> +	case REQ_OP_SECURE_ERASE:
>> +	case REQ_OP_WRITE_SAME:
>> +	case REQ_OP_WRITE_ZEROES:
>> +		return max_sectors;
>> +	}>> +
>> +	return min(blk_max_size_offset(q, offset), max_sectors);
>>  }
> 
> It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
> needs to be considered.

That limit is needed for zoned block devices to ensure that *any* write request,
no matter the command, does not cross zone boundaries. Otherwise, the write would
be immediately failed by the device.
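
(For zoned block devices chunk_sectors is set to the zone size, so that
offset-based clamp is exactly what keeps any write inside its zone.  As an
illustration: with 256MiB zones chunk_sectors is 524288, and a write starting
524160 sectors into its zone is clamped to 524288 - 524160 = 128 sectors,
i.e. it ends exactly on the zone boundary.)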

> 
> 
> Thanks,
> Ming
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-11 21:53         ` [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully Mike Snitzer
  2020-09-12 13:52           ` Ming Lei
@ 2020-09-14  0:46           ` Damien Le Moal
  2020-09-14 15:03             ` Mike Snitzer
  1 sibling, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2020-09-14  0:46 UTC (permalink / raw)
  To: Mike Snitzer, Ming Lei; +Cc: Vijayendra Suman, dm-devel, linux-block

On 2020/09/12 6:53, Mike Snitzer wrote:
> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> those operations.
> 
> Also, there is no need to avoid blk_max_size_offset() if
> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  include/linux/blkdev.h | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bb5636cc17b9..453a3d735d66 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>  						  sector_t offset)
>  {
>  	struct request_queue *q = rq->q;
> +	int op;
> +	unsigned int max_sectors;
>  
>  	if (blk_rq_is_passthrough(rq))
>  		return q->limits.max_hw_sectors;
>  
> -	if (!q->limits.chunk_sectors ||
> -	    req_op(rq) == REQ_OP_DISCARD ||
> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> -		return blk_queue_get_max_sectors(q, req_op(rq));
> +	op = req_op(rq);
> +	max_sectors = blk_queue_get_max_sectors(q, op);
>  
> -	return min(blk_max_size_offset(q, offset),
> -			blk_queue_get_max_sectors(q, req_op(rq)));
> +	switch (op) {
> +	case REQ_OP_DISCARD:
> +	case REQ_OP_SECURE_ERASE:
> +	case REQ_OP_WRITE_SAME:
> +	case REQ_OP_WRITE_ZEROES:
> +		return max_sectors;
> +	}

Doesn't this break md devices? (I think md does use chunk_sectors for stride
size, no?)

As mentioned in my reply to Ming's email, this will allow these commands to
potentially cross over zone boundaries on zoned block devices, which would be an
immediate command failure.

> +
> +	return min(blk_max_size_offset(q, offset), max_sectors);
>  }
>  
>  static inline unsigned int blk_rq_count_bios(struct request *rq)
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2
  2020-09-11 21:53         ` [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 Mike Snitzer
  2020-09-12 14:06           ` Ming Lei
@ 2020-09-14  0:55           ` Damien Le Moal
  1 sibling, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2020-09-14  0:55 UTC (permalink / raw)
  To: Mike Snitzer, Ming Lei; +Cc: Vijayendra Suman, dm-devel, linux-block

On 2020/09/12 6:53, Mike Snitzer wrote:
> It is possible for a block device to use a non power-of-2 for chunk
> size which results in a full-stripe size that is also a non
> power-of-2.
> 
> Update blk_queue_chunk_sectors() and blk_max_size_offset() to
> accommodate drivers that need a non power-of-2 chunk_sectors.
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  block/blk-settings.c   | 10 ++++------
>  include/linux/blkdev.h | 12 +++++++++---
>  2 files changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index b09642d5f15e..e40a162cc946 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -172,15 +172,13 @@ EXPORT_SYMBOL(blk_queue_max_hw_sectors);
>   *
>   * Description:
>   *    If a driver doesn't want IOs to cross a given chunk size, it can set
> - *    this limit and prevent merging across chunks. Note that the chunk size
> - *    must currently be a power-of-2 in sectors. Also note that the block
> - *    layer must accept a page worth of data at any offset. So if the
> - *    crossing of chunks is a hard limitation in the driver, it must still be
> - *    prepared to split single page bios.
> + *    this limit and prevent merging across chunks. Note that the block layer
> + *    must accept a page worth of data at any offset. So if the crossing of
> + *    chunks is a hard limitation in the driver, it must still be prepared
> + *    to split single page bios.
>   **/
>  void blk_queue_chunk_sectors(struct request_queue *q, unsigned int chunk_sectors)
>  {
> -	BUG_ON(!is_power_of_2(chunk_sectors));
>  	q->limits.chunk_sectors = chunk_sectors;
>  }
>  EXPORT_SYMBOL(blk_queue_chunk_sectors);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 453a3d735d66..e72bcce22143 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1059,11 +1059,17 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
>  static inline unsigned int blk_max_size_offset(struct request_queue *q,
>  					       sector_t offset)
>  {
> -	if (!q->limits.chunk_sectors)
> +	unsigned int chunk_sectors = q->limits.chunk_sectors;
> +
> +	if (!chunk_sectors)
>  		return q->limits.max_sectors;
>  
> -	return min(q->limits.max_sectors, (unsigned int)(q->limits.chunk_sectors -
> -			(offset & (q->limits.chunk_sectors - 1))));
> +	if (is_power_of_2(chunk_sectors))
> +		chunk_sectors -= (offset & (chunk_sectors - 1));

I do not think you need the outer parenthesis here.

> +	else
> +		chunk_sectors -= sector_div(offset, chunk_sectors);
> +
> +	return min(q->limits.max_sectors, chunk_sectors);
>  }
>  
>  static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> 

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2
  2020-09-12 14:06           ` Ming Lei
@ 2020-09-14  2:43             ` Keith Busch
  0 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2020-09-14  2:43 UTC (permalink / raw)
  To: Ming Lei; +Cc: Mike Snitzer, Vijayendra Suman, dm-devel, linux-block

On Sat, Sep 12, 2020 at 10:06:30PM +0800, Ming Lei wrote:
> On Fri, Sep 11, 2020 at 05:53:38PM -0400, Mike Snitzer wrote:
> > It is possible for a block device to use a non power-of-2 for chunk
> > size which results in a full-stripe size that is also a non
> > power-of-2.
> > 
> > Update blk_queue_chunk_sectors() and blk_max_size_offset() to
> > accommodate drivers that need a non power-of-2 chunk_sectors.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > ---
> >  block/blk-settings.c   | 10 ++++------
> >  include/linux/blkdev.h | 12 +++++++++---
> >  2 files changed, 13 insertions(+), 9 deletions(-)
> > 
> > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > index b09642d5f15e..e40a162cc946 100644
> > --- a/block/blk-settings.c
> > +++ b/block/blk-settings.c
> > @@ -172,15 +172,13 @@ EXPORT_SYMBOL(blk_queue_max_hw_sectors);
> >   *
> >   * Description:
> >   *    If a driver doesn't want IOs to cross a given chunk size, it can set
> > - *    this limit and prevent merging across chunks. Note that the chunk size
> > - *    must currently be a power-of-2 in sectors. Also note that the block
> > - *    layer must accept a page worth of data at any offset. So if the
> > - *    crossing of chunks is a hard limitation in the driver, it must still be
> > - *    prepared to split single page bios.
> > + *    this limit and prevent merging across chunks. Note that the block layer
> > + *    must accept a page worth of data at any offset. So if the crossing of
> > + *    chunks is a hard limitation in the driver, it must still be prepared
> > + *    to split single page bios.
> >   **/
> >  void blk_queue_chunk_sectors(struct request_queue *q, unsigned int chunk_sectors)
> >  {
> > -	BUG_ON(!is_power_of_2(chunk_sectors));
> >  	q->limits.chunk_sectors = chunk_sectors;
> >  }
> >  EXPORT_SYMBOL(blk_queue_chunk_sectors);
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 453a3d735d66..e72bcce22143 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1059,11 +1059,17 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
> >  static inline unsigned int blk_max_size_offset(struct request_queue *q,
> >  					       sector_t offset)
> >  {
> > -	if (!q->limits.chunk_sectors)
> > +	unsigned int chunk_sectors = q->limits.chunk_sectors;
> > +
> > +	if (!chunk_sectors)
> >  		return q->limits.max_sectors;
> >  
> > -	return min(q->limits.max_sectors, (unsigned int)(q->limits.chunk_sectors -
> > -			(offset & (q->limits.chunk_sectors - 1))));
> > +	if (is_power_of_2(chunk_sectors))
> > +		chunk_sectors -= (offset & (chunk_sectors - 1));
> > +	else
> > +		chunk_sectors -= sector_div(offset, chunk_sectors);
> > +
> > +	return min(q->limits.max_sectors, chunk_sectors);
> >  }
> >  
> >  static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> > -- 
> > 2.15.0
> > 
> 
> is_power_of_2() is cheap enough for fast path, so looks fine to support
> non-power-of-2 chunk sectors.
> 
> Maybe NVMe PCI can remove the power_of_2() limit too.

I'd need to see the justification for that. The boundary is just a
suggestion in NVMe. The majority of IO never crosses it so the
calculation is usually wasted CPU cycles. Crossing the boundary is going
to have to be very costly on the device side in order to justify the
host side per-IO overhead for a non-power-of-2 split. 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-12 13:52           ` Ming Lei
  2020-09-14  0:43             ` Damien Le Moal
@ 2020-09-14 14:49             ` Mike Snitzer
  2020-09-15  1:50               ` Ming Lei
  1 sibling, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2020-09-14 14:49 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, dm-devel, Vijayendra Suman

On Sat, Sep 12 2020 at  9:52am -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
> > blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> > REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> > those operations.
> 
> Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
> chunk_sectors is set:
> 
>         return min(blk_max_size_offset(q, offset),
>                         blk_queue_get_max_sectors(q, req_op(rq)));

Yes, but blk_rq_get_max_sectors() is a bit of a mess structurally.  The
duality of imposing chunk_sectors and/or considering offset when
calculating the return value is very confusing.

> > Also, there is no need to avoid blk_max_size_offset() if
> > 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > ---
> >  include/linux/blkdev.h | 19 +++++++++++++------
> >  1 file changed, 13 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index bb5636cc17b9..453a3d735d66 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> >  						  sector_t offset)
> >  {
> >  	struct request_queue *q = rq->q;
> > +	int op;
> > +	unsigned int max_sectors;
> >  
> >  	if (blk_rq_is_passthrough(rq))
> >  		return q->limits.max_hw_sectors;
> >  
> > -	if (!q->limits.chunk_sectors ||
> > -	    req_op(rq) == REQ_OP_DISCARD ||
> > -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> > -		return blk_queue_get_max_sectors(q, req_op(rq));
> > +	op = req_op(rq);
> > +	max_sectors = blk_queue_get_max_sectors(q, op);
> >  
> > -	return min(blk_max_size_offset(q, offset),
> > -			blk_queue_get_max_sectors(q, req_op(rq)));
> > +	switch (op) {
> > +	case REQ_OP_DISCARD:
> > +	case REQ_OP_SECURE_ERASE:
> > +	case REQ_OP_WRITE_SAME:
> > +	case REQ_OP_WRITE_ZEROES:
> > +		return max_sectors;
> > +	}
> > +
> > +	return min(blk_max_size_offset(q, offset), max_sectors);
> >  }
> 
> It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
> needs to be considered.

Yes, I see that now.  But why don't they need to be considered for
REQ_OP_DISCARD and REQ_OP_SECURE_ERASE?  Is it because the intent of the
block core is to offer late splitting of bios?  If so, then why impose
chunk_sectors so early?

Obviously this patch 1/3 should be dropped.  I didn't treat
chunk_sectors with proper priority.

But like I said above, blk_rq_get_max_sectors() vs blk_max_size_offset()
is not at all straightforward.  And the code looks prone to imposing
limits that shouldn't be imposed (or vice-versa).

Also, when falling back to max_sectors, why not consider offset to treat
max_sectors like a granularity?  That would allow for much more consistent IO
patterns.
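
Something along these lines (purely hypothetical sketch, not a patch):

	/* when chunk_sectors isn't set, clamp to the next max_sectors-aligned
	 * boundary instead of a flat max_sectors, so back-to-back requests
	 * stay aligned */
	static inline unsigned int blk_max_size_to_granularity(struct request_queue *q,
							       sector_t offset)
	{
		unsigned int granularity = q->limits.max_sectors;

		return granularity - (unsigned int)sector_div(offset, granularity);
	}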

Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-14  0:43             ` Damien Le Moal
@ 2020-09-14 14:52               ` Mike Snitzer
  2020-09-14 23:28                 ` Damien Le Moal
  2020-09-15  2:03               ` Ming Lei
  1 sibling, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2020-09-14 14:52 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Ming Lei, Vijayendra Suman, dm-devel, linux-block

On Sun, Sep 13 2020 at  8:43pm -0400,
Damien Le Moal <Damien.LeMoal@wdc.com> wrote:

> On 2020/09/12 22:53, Ming Lei wrote:
> > On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
> >> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> >> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> >> those operations.
> > 
> > Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
> > chunk_sectors is set:
> > 
> >         return min(blk_max_size_offset(q, offset),
> >                         blk_queue_get_max_sectors(q, req_op(rq)));
> >  
> >> Also, there is no need to avoid blk_max_size_offset() if
> >> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> >>
> >> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >> ---
> >>  include/linux/blkdev.h | 19 +++++++++++++------
> >>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> >> index bb5636cc17b9..453a3d735d66 100644
> >> --- a/include/linux/blkdev.h
> >> +++ b/include/linux/blkdev.h
> >> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> >>  						  sector_t offset)
> >>  {
> >>  	struct request_queue *q = rq->q;
> >> +	int op;
> >> +	unsigned int max_sectors;
> >>  
> >>  	if (blk_rq_is_passthrough(rq))
> >>  		return q->limits.max_hw_sectors;
> >>  
> >> -	if (!q->limits.chunk_sectors ||
> >> -	    req_op(rq) == REQ_OP_DISCARD ||
> >> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> >> -		return blk_queue_get_max_sectors(q, req_op(rq));
> >> +	op = req_op(rq);
> >> +	max_sectors = blk_queue_get_max_sectors(q, op);
> >>  
> >> -	return min(blk_max_size_offset(q, offset),
> >> -			blk_queue_get_max_sectors(q, req_op(rq)));
> >> +	switch (op) {
> >> +	case REQ_OP_DISCARD:
> >> +	case REQ_OP_SECURE_ERASE:
> >> +	case REQ_OP_WRITE_SAME:
> >> +	case REQ_OP_WRITE_ZEROES:
> >> +		return max_sectors;
> >> +	}>> +
> >> +	return min(blk_max_size_offset(q, offset), max_sectors);
> >>  }
> > 
> > It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
> > needs to be considered.
> 
> That limit is needed for zoned block devices to ensure that *any* write request,
> no matter the command, does not cross zone boundaries. Otherwise, the write would
> be immediately failed by the device.

Thanks for the additional context, sorry to make you so concerned! ;)

Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-14  0:46           ` Damien Le Moal
@ 2020-09-14 15:03             ` Mike Snitzer
  2020-09-15  1:09               ` Damien Le Moal
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2020-09-14 15:03 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-block, dm-devel, Vijayendra Suman, Ming Lei

On Sun, Sep 13 2020 at  8:46pm -0400,
Damien Le Moal <Damien.LeMoal@wdc.com> wrote:

> On 2020/09/12 6:53, Mike Snitzer wrote:
> > blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> > REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> > those operations.
> > 
> > Also, there is no need to avoid blk_max_size_offset() if
> > 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > ---
> >  include/linux/blkdev.h | 19 +++++++++++++------
> >  1 file changed, 13 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index bb5636cc17b9..453a3d735d66 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> >  						  sector_t offset)
> >  {
> >  	struct request_queue *q = rq->q;
> > +	int op;
> > +	unsigned int max_sectors;
> >  
> >  	if (blk_rq_is_passthrough(rq))
> >  		return q->limits.max_hw_sectors;
> >  
> > -	if (!q->limits.chunk_sectors ||
> > -	    req_op(rq) == REQ_OP_DISCARD ||
> > -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> > -		return blk_queue_get_max_sectors(q, req_op(rq));
> > +	op = req_op(rq);
> > +	max_sectors = blk_queue_get_max_sectors(q, op);
> >  
> > -	return min(blk_max_size_offset(q, offset),
> > -			blk_queue_get_max_sectors(q, req_op(rq)));
> > +	switch (op) {
> > +	case REQ_OP_DISCARD:
> > +	case REQ_OP_SECURE_ERASE:
> > +	case REQ_OP_WRITE_SAME:
> > +	case REQ_OP_WRITE_ZEROES:
> > +		return max_sectors;
> > +	}
> 
> Doesn't this break md devices ? (I think does use chunk_sectors for stride size,
> no ?)
> 
> As mentioned in my reply to Ming's email, this will allow these commands to
> potentially cross over zone boundaries on zoned block devices, which would be an
> immediate command failure.

Depending on the implementation, it is beneficial to get a large
discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's
optimization for handling large discards and issuing N discards, one per
stripe).  The same could apply to other commands.

Like all devices, zoned devices should impose command-specific limits in
the queue_limits (rather than lean on chunk_sectors as a
one-size-fits-all limit).

But that aside, yes, I agree I didn't pay close enough attention to the
implications of deferring the splitting of these commands until they
are issued to the underlying storage.  This chunk_sectors early splitting
override is a bit of a mess... I'm not quite following the logic, given we
were supposed to be waiting to split bios as late as possible.

Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-14 14:52               ` Mike Snitzer
@ 2020-09-14 23:28                 ` Damien Le Moal
  0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2020-09-14 23:28 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Ming Lei, Vijayendra Suman, dm-devel, linux-block

On 2020/09/14 23:52, Mike Snitzer wrote:
> On Sun, Sep 13 2020 at  8:43pm -0400,
> Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> 
>> On 2020/09/12 22:53, Ming Lei wrote:
>>> On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
>>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
>>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
>>>> those operations.
>>>
>>> Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
>>> chunk_sectors is set:
>>>
>>>         return min(blk_max_size_offset(q, offset),
>>>                         blk_queue_get_max_sectors(q, req_op(rq)));
>>>  
>>>> Also, there is no need to avoid blk_max_size_offset() if
>>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
>>>>
>>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>>> ---
>>>>  include/linux/blkdev.h | 19 +++++++++++++------
>>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index bb5636cc17b9..453a3d735d66 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>>>>  						  sector_t offset)
>>>>  {
>>>>  	struct request_queue *q = rq->q;
>>>> +	int op;
>>>> +	unsigned int max_sectors;
>>>>  
>>>>  	if (blk_rq_is_passthrough(rq))
>>>>  		return q->limits.max_hw_sectors;
>>>>  
>>>> -	if (!q->limits.chunk_sectors ||
>>>> -	    req_op(rq) == REQ_OP_DISCARD ||
>>>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
>>>> -		return blk_queue_get_max_sectors(q, req_op(rq));
>>>> +	op = req_op(rq);
>>>> +	max_sectors = blk_queue_get_max_sectors(q, op);
>>>>  
>>>> -	return min(blk_max_size_offset(q, offset),
>>>> -			blk_queue_get_max_sectors(q, req_op(rq)));
>>>> +	switch (op) {
>>>> +	case REQ_OP_DISCARD:
>>>> +	case REQ_OP_SECURE_ERASE:
>>>> +	case REQ_OP_WRITE_SAME:
>>>> +	case REQ_OP_WRITE_ZEROES:
>>>> +		return max_sectors;
>>>> +	}
>>>> +
>>>> +	return min(blk_max_size_offset(q, offset), max_sectors);
>>>>  }
>>>
>>> It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
>>> needs to be considered.
>>
>> That limit is needed for zoned block devices to ensure that *any* write request,
>> no matter the command, do not cross zone boundaries. Otherwise, the write would
>> be immediately failed by the device.
> 
> Thanks for the additional context, sorry to make you so concerned! ;)

No worries :)



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-14 15:03             ` Mike Snitzer
@ 2020-09-15  1:09               ` Damien Le Moal
  2020-09-15  4:21                 ` Damien Le Moal
  0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2020-09-15  1:09 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Ming Lei, Vijayendra Suman, dm-devel, linux-block

On 2020/09/15 0:04, Mike Snitzer wrote:
> On Sun, Sep 13 2020 at  8:46pm -0400,
> Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> 
>> On 2020/09/12 6:53, Mike Snitzer wrote:
>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
>>> those operations.
>>>
>>> Also, there is no need to avoid blk_max_size_offset() if
>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
>>>
>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>> ---
>>>  include/linux/blkdev.h | 19 +++++++++++++------
>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index bb5636cc17b9..453a3d735d66 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>>>  						  sector_t offset)
>>>  {
>>>  	struct request_queue *q = rq->q;
>>> +	int op;
>>> +	unsigned int max_sectors;
>>>  
>>>  	if (blk_rq_is_passthrough(rq))
>>>  		return q->limits.max_hw_sectors;
>>>  
>>> -	if (!q->limits.chunk_sectors ||
>>> -	    req_op(rq) == REQ_OP_DISCARD ||
>>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
>>> -		return blk_queue_get_max_sectors(q, req_op(rq));
>>> +	op = req_op(rq);
>>> +	max_sectors = blk_queue_get_max_sectors(q, op);
>>>  
>>> -	return min(blk_max_size_offset(q, offset),
>>> -			blk_queue_get_max_sectors(q, req_op(rq)));
>>> +	switch (op) {
>>> +	case REQ_OP_DISCARD:
>>> +	case REQ_OP_SECURE_ERASE:
>>> +	case REQ_OP_WRITE_SAME:
>>> +	case REQ_OP_WRITE_ZEROES:
>>> +		return max_sectors;
>>> +	}
>>
>> Doesn't this break md devices ? (I think it does use chunk_sectors for stride size,
>> no ?)
>>
>> As mentioned in my reply to Ming's email, this will allow these commands to
>> potentially cross over zone boundaries on zoned block devices, which would be an
>> immediate command failure.
> 
> Depending on the implementation it is beneficial to get a large
> discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's
> optimization for handling large discards and issuing N discards, one per
> stripe).  Same could apply for other commands.
> 
> Like all devices, zoned devices should impose command specific limits in
> the queue_limits (and not lean on chunk_sectors to do a
> one-size-fits-all).

Yes, understood. But I think that  in the case of md, chunk_sectors is used to
indicate the boundary between drives for a raid volume. So it does indeed make
sense to limit the IO size on submission since otherwise, the md driver itself
would have to split that bio again anyway.

> But that aside, yes I agree I didn't pay close enough attention to the
> implications of deferring the splitting of these commands until they
> were issued to underlying storage.  This chunk_sectors early splitting
> override is a bit of a mess... not quite following the logic given we
> were supposed to be waiting to split bios as late as possible.

My view is that the multipage bvec (BIOs almost as large as we want) and late
splitting is beneficial to get larger effective BIO sent to the device as having
more pages on hand allows bigger segments in the bio instead of always having at
most PAGE_SIZE per segment. The effect of this is very visible with blktrace. A
lot of requests end up being much larger than the device max_segments * page_size.

However, if there is already a known limit on the BIO size when the BIO is being
built, it does not make much sense to try to grow a bio beyond that limit since
it will have to be split by the driver anyway. chunk_sectors is one such limit
used for md (I think) to indicate boundaries between drives of a raid volume.
And we reuse it (abuse it ?) for zoned block devices to ensure that any command
does not cross over zone boundaries since that triggers errors for writes within
sequential zones or read/write crossing over zones of different types
(conventional->sequential zone boundary).
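
For reference, the clamp that enforces this is blk_max_size_offset(); from
memory it is roughly the following (chunk_sectors is assumed to be a power
of 2 here, so treat this as a sketch rather than the exact tree):

	static inline unsigned int blk_max_size_offset(struct request_queue *q,
						       sector_t offset)
	{
		if (!q->limits.chunk_sectors)
			return q->limits.max_sectors;

		return min(q->limits.max_sectors,
			   (unsigned int)(q->limits.chunk_sectors -
				(offset & (q->limits.chunk_sectors - 1))));
	}

So with chunk_sectors set to the zone size, any request is clamped so that
it ends at the next zone boundary.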

I may not have the entire picture correctly here, but so far, this is my
understanding.

Cheers.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()"
  2020-09-10 19:29   ` Vijayendra Suman
@ 2020-09-15  1:33     ` Mike Snitzer
  2020-09-15 17:03       ` Mike Snitzer
  2020-09-16 14:56       ` Vijayendra Suman
  0 siblings, 2 replies; 28+ messages in thread
From: Mike Snitzer @ 2020-09-15  1:33 UTC (permalink / raw)
  To: Vijayendra Suman
  Cc: linux-block, Somu Krishnasamy, dm-devel, RAMANAN_GOVINDARAJAN

On Thu, Sep 10 2020 at  3:29pm -0400,
Vijayendra Suman <vijayendra.suman@oracle.com> wrote:

> Hello Mike,
> 
> I checked with upstream, performance measurement is similar and
> shows performance improvement when
> 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 is reverted.
> 
> On 9/10/2020 7:54 PM, Mike Snitzer wrote:
> >[cc'ing dm-devel and linux-block because this is upstream concern too]
> >
> >On Wed, Sep 09 2020 at  1:00pm -0400,
> >Vijayendra Suman <vijayendra.suman@oracle.com> wrote:
> >
> >>    Hello Mike,
> >>
> >>    While Running pgbench tool with  5.4.17 kernel build
> >>
> >>    Following performance degrade is found out
> >>
> >>    buffer read/write metric : -17.2%
> >>    cache read/write metric : -18.7%
> >>    disk read/write metric : -19%
> >>
> >>    buffer
> >>    number of transactions actually processed: 840972
> >>    latency average = 24.013 ms
> >>    tps = 4664.153934 (including connections establishing)
> >>    tps = 4664.421492 (excluding connections establishing)
> >>
> >>    cache
> >>    number of transactions actually processed: 551345
> >>    latency average = 36.949 ms
> >>    tps = 3031.223905 (including connections establishing)
> >>    tps = 3031.402581 (excluding connections establishing)
> >>
> >>    After revert of Commit
> >>    2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call
> >>    blk_queue_split() in dm_process_bio()")
> >
> >I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's
> >backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ?
>
> Yes
>
> >>    Performance is Counter measurement
> >>
> >>    buffer ->
> >>    number of transactions actually processed: 1135735
> >>    latency average = 17.799 ms
> >>    tps = 6292.586749 (including connections establishing)
> >>    tps = 6292.875089 (excluding connections establishing)
> >>
> >>    cache ->
> >>    number of transactions actually processed: 648177
> >>    latency average = 31.217 ms
> >>    tps = 3587.755975 (including connections establishing)
> >>    tps = 3587.966359 (excluding connections establishing)
> >>
> >>    Following is your commit
> >>
> >>    diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>    index cf71a2277d60..1e6e0c970e19 100644
> >>    --- a/drivers/md/dm.c
> >>    +++ b/drivers/md/dm.c
> >>    @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device
> >>    *md,
> >>             * won't be imposed.
> >>             */
> >>            if (current->bio_list) {
> >>    -               blk_queue_split(md->queue, &bio);
> >>    -               if (!is_abnormal_io(bio))
> >>    +               if (is_abnormal_io(bio))
> >>    +                       blk_queue_split(md->queue, &bio);
> >>    +               else
> >>                            dm_queue_split(md, ti, &bio);
> >>            }
> >>
> >>    Could you have a look if it is safe to revert this commit.
> >No, it really isn't a good idea given what was documented in the commit
> >header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the
> >excessive splitting is not conducive to performance either.
> >
> >So I think we need to identify _why_ reverting this commit is causing
> >such a performance improvement.  Why is calling blk_queue_split() before
> >dm_queue_split() benefiting your pgbench workload?
>
> Let me know if you want to check some patch.

Hi,

Could you please test this branch?:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.10
(or apply at least the first 4 patches, commit 63f85d97be69^..b6a80963621fa)

So far I've done various DM regression testing.  But I haven't tested
with pgbench or with the misaaligned IO scenario documented in the
header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74.  But I'll
test that scenario tomorrow.

Any chance you could provide some hints on how you're running pgbench
just so I can try to test/reproduce/verify locally?

Thanks,
Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-14 14:49             ` Mike Snitzer
@ 2020-09-15  1:50               ` Ming Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Ming Lei @ 2020-09-15  1:50 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Vijayendra Suman, dm-devel, linux-block

On Mon, Sep 14, 2020 at 10:49:28AM -0400, Mike Snitzer wrote:
> On Sat, Sep 12 2020 at  9:52am -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
> > > blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> > > REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> > > those operations.
> > 
> > Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
> > chunk_sectors is set:
> > 
> >         return min(blk_max_size_offset(q, offset),
> >                         blk_queue_get_max_sectors(q, req_op(rq)));
> 
> Yes, but blk_rq_get_max_sectors() is a bit of a mess structurally.  The
> duality of imposing chunk_sectors and/or considering offset when
> calculating the return is very confusing.
> 
> > > Also, there is no need to avoid blk_max_size_offset() if
> > > 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> > > 
> > > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > > ---
> > >  include/linux/blkdev.h | 19 +++++++++++++------
> > >  1 file changed, 13 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > > index bb5636cc17b9..453a3d735d66 100644
> > > --- a/include/linux/blkdev.h
> > > +++ b/include/linux/blkdev.h
> > > @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> > >  						  sector_t offset)
> > >  {
> > >  	struct request_queue *q = rq->q;
> > > +	int op;
> > > +	unsigned int max_sectors;
> > >  
> > >  	if (blk_rq_is_passthrough(rq))
> > >  		return q->limits.max_hw_sectors;
> > >  
> > > -	if (!q->limits.chunk_sectors ||
> > > -	    req_op(rq) == REQ_OP_DISCARD ||
> > > -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> > > -		return blk_queue_get_max_sectors(q, req_op(rq));
> > > +	op = req_op(rq);
> > > +	max_sectors = blk_queue_get_max_sectors(q, op);
> > >  
> > > -	return min(blk_max_size_offset(q, offset),
> > > -			blk_queue_get_max_sectors(q, req_op(rq)));
> > > +	switch (op) {
> > > +	case REQ_OP_DISCARD:
> > > +	case REQ_OP_SECURE_ERASE:
> > > +	case REQ_OP_WRITE_SAME:
> > > +	case REQ_OP_WRITE_ZEROES:
> > > +		return max_sectors;
> > > +	}
> > > +
> > > +	return min(blk_max_size_offset(q, offset), max_sectors);
> > >  }
> > 
> > It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
> > needs to be considered.
> 
> Yes, I see that now.  But why don't they need to be considered for
> REQ_OP_DISCARD and REQ_OP_SECURE_ERASE?

This behavior was introduced in the following commit, and I guess it is
because we support multi-range discard requests; maybe Jens can explain more.

commit e548ca4ee4595f65b262661d166310ad8a149bec
Author: Jens Axboe <axboe@fb.com>
Date:   Fri May 29 13:11:32 2015 -0600

    block: don't honor chunk sizes for data-less IO

    We don't need to honor chunk sizes for IO that doesn't carry any
    data.

    Signed-off-by: Jens Axboe <axboe@fb.com>

> Is it because the intent of the
> block core is to offer late splitting of bios?

The block layer doesn't have late bio splitting; a bio is only split
via __blk_queue_split() before a request is allocated.

blk_rq_get_max_sectors() is only called by the rq merge code; it really
should have been defined in block/blk.h instead of a public header.

> If so, then why impose
> chunk_sectors so early?

Not sure I understand your question. 'chunk_sectors' is first used
during bio splitting (get_max_io_size(), called from blk_bio_segment_split()).

> 
> Obviously this patch 1/3 should be dropped.  I didn't treat
> chunk_sectors with proper priority.
> 
> But like I said above, blk_rq_get_max_sectors() vs blk_max_size_offset()
> is not at all straight-forward.  And the code looks prone to imposing
> limits that shouldn't be (or vice-versa).
> 
> Also, when falling back to max_sectors, why not consider offset to treat
> max_sectors like a granularity?  Would allow for much more consistent IO
> patterns.

blk_rq_get_max_sectors() is called when checking whether a bio or rq can
be merged into the current request, and we have already considered all
kinds of queue limits when doing the bio splitting, so it is not necessary
to consider them again here when merging requests.
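
For example, the size check in the back-merge path is basically the
following (paraphrased from block/blk-merge.c, so take it as a sketch):

	if (blk_rq_sectors(req) + bio_sectors(bio) >
	    blk_rq_get_max_sectors(req, blk_rq_pos(req))) {
		req_set_nomerge(req->q, req);
		return 0;
	}

It only decides whether an already-split bio may still be merged into the
request; all the splitting decisions happened earlier.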


Thanks,
Ming

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-14  0:43             ` Damien Le Moal
  2020-09-14 14:52               ` Mike Snitzer
@ 2020-09-15  2:03               ` Ming Lei
  2020-09-15  2:15                 ` Damien Le Moal
  1 sibling, 1 reply; 28+ messages in thread
From: Ming Lei @ 2020-09-15  2:03 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Mike Snitzer, Vijayendra Suman, dm-devel, linux-block

On Mon, Sep 14, 2020 at 12:43:06AM +0000, Damien Le Moal wrote:
> On 2020/09/12 22:53, Ming Lei wrote:
> > On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
> >> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> >> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> >> those operations.
> > 
> > Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
> > chunk_sectors is set:
> > 
> >         return min(blk_max_size_offset(q, offset),
> >                         blk_queue_get_max_sectors(q, req_op(rq)));
> >  
> >> Also, there is no need to avoid blk_max_size_offset() if
> >> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> >>
> >> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >> ---
> >>  include/linux/blkdev.h | 19 +++++++++++++------
> >>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> >> index bb5636cc17b9..453a3d735d66 100644
> >> --- a/include/linux/blkdev.h
> >> +++ b/include/linux/blkdev.h
> >> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> >>  						  sector_t offset)
> >>  {
> >>  	struct request_queue *q = rq->q;
> >> +	int op;
> >> +	unsigned int max_sectors;
> >>  
> >>  	if (blk_rq_is_passthrough(rq))
> >>  		return q->limits.max_hw_sectors;
> >>  
> >> -	if (!q->limits.chunk_sectors ||
> >> -	    req_op(rq) == REQ_OP_DISCARD ||
> >> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> >> -		return blk_queue_get_max_sectors(q, req_op(rq));
> >> +	op = req_op(rq);
> >> +	max_sectors = blk_queue_get_max_sectors(q, op);
> >>  
> >> -	return min(blk_max_size_offset(q, offset),
> >> -			blk_queue_get_max_sectors(q, req_op(rq)));
> >> +	switch (op) {
> >> +	case REQ_OP_DISCARD:
> >> +	case REQ_OP_SECURE_ERASE:
> >> +	case REQ_OP_WRITE_SAME:
> >> +	case REQ_OP_WRITE_ZEROES:
> >> +		return max_sectors;
> >> +	}
> >> +
> >> +	return min(blk_max_size_offset(q, offset), max_sectors);
> >>  }
> > 
> > It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
> > needs to be considered.
> 
> That limit is needed for zoned block devices to ensure that *any* write request,
> no matter the command, do not cross zone boundaries. Otherwise, the write would
> be immediately failed by the device.

It looks like both blk_bio_write_zeroes_split() and blk_bio_write_same_split()
don't consider the chunk_sectors limit; is that an issue for zoned block devices?
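
For reference, the write zeroes split helper is roughly the following (from
memory, just a sketch; the write same one has the same shape):

	static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
			struct bio *bio, unsigned *nsegs)
	{
		*nsegs = 0;

		if (!q->limits.max_write_zeroes_sectors)
			return NULL;

		if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
			return NULL;

		return bio_split(bio, q->limits.max_write_zeroes_sectors,
				 GFP_NOIO, &q->bio_split);
	}

so only max_write_zeroes_sectors is applied and chunk_sectors is never
looked at.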


thanks,
Ming

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-15  2:03               ` Ming Lei
@ 2020-09-15  2:15                 ` Damien Le Moal
  0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2020-09-15  2:15 UTC (permalink / raw)
  To: Ming Lei; +Cc: Mike Snitzer, Vijayendra Suman, dm-devel, linux-block

On 2020/09/15 11:04, Ming Lei wrote:
> On Mon, Sep 14, 2020 at 12:43:06AM +0000, Damien Le Moal wrote:
>> On 2020/09/12 22:53, Ming Lei wrote:
>>> On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
>>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
>>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
>>>> those operations.
>>>
>>> Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
>>> chunk_sectors is set:
>>>
>>>         return min(blk_max_size_offset(q, offset),
>>>                         blk_queue_get_max_sectors(q, req_op(rq)));
>>>  
>>>> Also, there is no need to avoid blk_max_size_offset() if
>>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
>>>>
>>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>>> ---
>>>>  include/linux/blkdev.h | 19 +++++++++++++------
>>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index bb5636cc17b9..453a3d735d66 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>>>>  						  sector_t offset)
>>>>  {
>>>>  	struct request_queue *q = rq->q;
>>>> +	int op;
>>>> +	unsigned int max_sectors;
>>>>  
>>>>  	if (blk_rq_is_passthrough(rq))
>>>>  		return q->limits.max_hw_sectors;
>>>>  
>>>> -	if (!q->limits.chunk_sectors ||
>>>> -	    req_op(rq) == REQ_OP_DISCARD ||
>>>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
>>>> -		return blk_queue_get_max_sectors(q, req_op(rq));
>>>> +	op = req_op(rq);
>>>> +	max_sectors = blk_queue_get_max_sectors(q, op);
>>>>  
>>>> -	return min(blk_max_size_offset(q, offset),
>>>> -			blk_queue_get_max_sectors(q, req_op(rq)));
>>>> +	switch (op) {
>>>> +	case REQ_OP_DISCARD:
>>>> +	case REQ_OP_SECURE_ERASE:
>>>> +	case REQ_OP_WRITE_SAME:
>>>> +	case REQ_OP_WRITE_ZEROES:
>>>> +		return max_sectors;
>>>> +	}
>>>> +
>>>> +	return min(blk_max_size_offset(q, offset), max_sectors);
>>>>  }
>>>
>>> It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
>>> needs to be considered.
>>
>> That limit is needed for zoned block devices to ensure that *any* write request,
>> no matter the command, do not cross zone boundaries. Otherwise, the write would
>> be immediately failed by the device.
> 
> It looks like both blk_bio_write_zeroes_split() and blk_bio_write_same_split()
> don't consider the chunk_sectors limit; is that an issue for zoned block devices?

Huh... I never looked at these. Yes, it will be a problem: write zeroes for NVMe
ZNS drives and write same for SCSI/ZBC drives. So yes, definitely something
that needs to be fixed. The users of these will be file systems which, in the
case of zoned block devices, would be file systems with zone support. f2fs does
not use these commands, and btrfs (zone support posted recently) needs to be
checked. But since the FS itself is zone aware, the requests will be zone aligned.

But definitely worth fixing.

Thanks !

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-15  1:09               ` Damien Le Moal
@ 2020-09-15  4:21                 ` Damien Le Moal
  2020-09-15  8:01                   ` Ming Lei
  0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2020-09-15  4:21 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Ming Lei, Vijayendra Suman, dm-devel, linux-block

On 2020/09/15 10:10, Damien Le Moal wrote:
> On 2020/09/15 0:04, Mike Snitzer wrote:
>> On Sun, Sep 13 2020 at  8:46pm -0400,
>> Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>>
>>> On 2020/09/12 6:53, Mike Snitzer wrote:
>>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
>>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
>>>> those operations.
>>>>
>>>> Also, there is no need to avoid blk_max_size_offset() if
>>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
>>>>
>>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>>> ---
>>>>  include/linux/blkdev.h | 19 +++++++++++++------
>>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index bb5636cc17b9..453a3d735d66 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>>>>  						  sector_t offset)
>>>>  {
>>>>  	struct request_queue *q = rq->q;
>>>> +	int op;
>>>> +	unsigned int max_sectors;
>>>>  
>>>>  	if (blk_rq_is_passthrough(rq))
>>>>  		return q->limits.max_hw_sectors;
>>>>  
>>>> -	if (!q->limits.chunk_sectors ||
>>>> -	    req_op(rq) == REQ_OP_DISCARD ||
>>>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
>>>> -		return blk_queue_get_max_sectors(q, req_op(rq));
>>>> +	op = req_op(rq);
>>>> +	max_sectors = blk_queue_get_max_sectors(q, op);
>>>>  
>>>> -	return min(blk_max_size_offset(q, offset),
>>>> -			blk_queue_get_max_sectors(q, req_op(rq)));
>>>> +	switch (op) {
>>>> +	case REQ_OP_DISCARD:
>>>> +	case REQ_OP_SECURE_ERASE:
>>>> +	case REQ_OP_WRITE_SAME:
>>>> +	case REQ_OP_WRITE_ZEROES:
>>>> +		return max_sectors;
>>>> +	}
>>>
>>> Doesn't this break md devices ? (I think it does use chunk_sectors for stride size,
>>> no ?)
>>>
>>> As mentioned in my reply to Ming's email, this will allow these commands to
>>> potentially cross over zone boundaries on zoned block devices, which would be an
>>> immediate command failure.
>>
>> Depending on the implementation it is beneficial to get a large
>> discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's
>> optimization for handling large discards and issuing N discards, one per
>> stripe).  Same could apply for other commands.
>>
>> Like all devices, zoned devices should impose command specific limits in
>> the queue_limits (and not lean on chunk_sectors to do a
>> one-size-fits-all).
> 
> Yes, understood. But I think that  in the case of md, chunk_sectors is used to
> indicate the boundary between drives for a raid volume. So it does indeed make
> sense to limit the IO size on submission since otherwise, the md driver itself
> would have to split that bio again anyway.
> 
>> But that aside, yes I agree I didn't pay close enough attention to the
>> implications of deferring the splitting of these commands until they
>> were issued to underlying storage.  This chunk_sectors early splitting
>> override is a bit of a mess... not quite following the logic given we
>> were supposed to be waiting to split bios as late as possible.
> 
> My view is that the multipage bvec (BIOs almost as large as we want) and late
> splitting is beneficial to get larger effective BIO sent to the device as having
> more pages on hand allows bigger segments in the bio instead of always having at
> most PAGE_SIZE per segment. The effect of this is very visible with blktrace. A
> lot of requests end up being much larger than the device max_segments * page_size.
> 
> However, if there is already a known limit on the BIO size when the BIO is being
> built, it does not make much sense to try to grow a bio beyond that limit since
> it will have to be split by the driver anyway. chunk_sectors is one such limit
> used for md (I think) to indicate boundaries between drives of a raid volume.
> And we reuse it (abuse it ?) for zoned block devices to ensure that any command
> does not cross over zone boundaries since that triggers errors for writes within
> sequential zones or read/write crossing over zones of different types
> (conventional->sequential zone boundary).
> 
> I may not have the entire picture correctly here, but so far, this is my
> understanding.

And I was wrong :) In light of Ming's comment + a little code refresher reading,
indeed, chunk_sectors will split BIOs so that *requests* do not exceed that
limit, but the initial BIO submission may be much larger regardless of
chunk_sectors.

Ming, I think the point here is that building a large BIO first and splitting it
later (as opposed to limiting the bio size by stopping bio_add_page()) is more
efficient as there is only one bio submit instead of many, right ?


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully
  2020-09-15  4:21                 ` Damien Le Moal
@ 2020-09-15  8:01                   ` Ming Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Ming Lei @ 2020-09-15  8:01 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-block, dm-devel, Mike Snitzer, Vijayendra Suman

On Tue, Sep 15, 2020 at 04:21:54AM +0000, Damien Le Moal wrote:
> On 2020/09/15 10:10, Damien Le Moal wrote:
> > On 2020/09/15 0:04, Mike Snitzer wrote:
> >> On Sun, Sep 13 2020 at  8:46pm -0400,
> >> Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> >>
> >>> On 2020/09/12 6:53, Mike Snitzer wrote:
> >>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> >>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> >>>> those operations.
> >>>>
> >>>> Also, there is no need to avoid blk_max_size_offset() if
> >>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> >>>>
> >>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >>>> ---
> >>>>  include/linux/blkdev.h | 19 +++++++++++++------
> >>>>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> >>>> index bb5636cc17b9..453a3d735d66 100644
> >>>> --- a/include/linux/blkdev.h
> >>>> +++ b/include/linux/blkdev.h
> >>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> >>>>  						  sector_t offset)
> >>>>  {
> >>>>  	struct request_queue *q = rq->q;
> >>>> +	int op;
> >>>> +	unsigned int max_sectors;
> >>>>  
> >>>>  	if (blk_rq_is_passthrough(rq))
> >>>>  		return q->limits.max_hw_sectors;
> >>>>  
> >>>> -	if (!q->limits.chunk_sectors ||
> >>>> -	    req_op(rq) == REQ_OP_DISCARD ||
> >>>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> >>>> -		return blk_queue_get_max_sectors(q, req_op(rq));
> >>>> +	op = req_op(rq);
> >>>> +	max_sectors = blk_queue_get_max_sectors(q, op);
> >>>>  
> >>>> -	return min(blk_max_size_offset(q, offset),
> >>>> -			blk_queue_get_max_sectors(q, req_op(rq)));
> >>>> +	switch (op) {
> >>>> +	case REQ_OP_DISCARD:
> >>>> +	case REQ_OP_SECURE_ERASE:
> >>>> +	case REQ_OP_WRITE_SAME:
> >>>> +	case REQ_OP_WRITE_ZEROES:
> >>>> +		return max_sectors;
> >>>> +	}
> >>>
> >>> Doesn't this break md devices ? (I think it does use chunk_sectors for stride size,
> >>> no ?)
> >>>
> >>> As mentioned in my reply to Ming's email, this will allow these commands to
> >>> potentially cross over zone boundaries on zoned block devices, which would be an
> >>> immediate command failure.
> >>
> >> Depending on the implementation it is beneficial to get a large
> >> discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's
> >> optimization for handling large discards and issuing N discards, one per
> >> stripe).  Same could apply for other commands.
> >>
> >> Like all devices, zoned devices should impose command specific limits in
> >> the queue_limits (and not lean on chunk_sectors to do a
> >> one-size-fits-all).
> > 
> > Yes, understood. But I think that  in the case of md, chunk_sectors is used to
> > indicate the boundary between drives for a raid volume. So it does indeed make
> > sense to limit the IO size on submission since otherwise, the md driver itself
> > would have to split that bio again anyway.
> > 
> >> But that aside, yes I agree I didn't pay close enough attention to the
> >> implications of deferring the splitting of these commands until they
> >> were issued to underlying storage.  This chunk_sectors early splitting
> >> override is a bit of a mess... not quite following the logic given we
> >> were supposed to be waiting to split bios as late as possible.
> > 
> > My view is that the multipage bvec (BIOs almost as large as we want) and late
> > splitting is beneficial to get larger effective BIO sent to the device as having
> > more pages on hand allows bigger segments in the bio instead of always having at
> > most PAGE_SIZE per segment. The effect of this is very visible with blktrace. A
> > lot of requests end up being much larger than the device max_segments * page_size.
> > 
> > However, if there is already a known limit on the BIO size when the BIO is being
> > built, it does not make much sense to try to grow a bio beyond that limit since
> > it will have to be split by the driver anyway. chunk_sectors is one such limit
> > used for md (I think) to indicate boundaries between drives of a raid volume.
> > And we reuse it (abuse it ?) for zoned block devices to ensure that any command
> > does not cross over zone boundaries since that triggers errors for writes within
> > sequential zones or read/write crossing over zones of different types
> > (conventional->sequential zone boundary).
> > 
> > I may not have the entire picture correctly here, but so far, this is my
> > understanding.
> 
> And I was wrong :) In light of Ming's comment + a little code refresher reading,
> indeed, chunk_sectors will split BIOs so that *requests* do not exceed that
> limit, but the initial BIO submission may be much larger regardless of
> chunk_sectors.
> 
> Ming, I think the point here is that building a large BIO first and splitting it
> later (as opposed to limiting the bio size by stopping bio_add_page()) is more
> efficient as there is only one bio submit instead of many, right ?

Yeah, this way allows generic_make_request() (now submit_bio_noacct()) to handle
arbitrarily sized bios, so bio_add_page() becomes more efficient and much simpler,
and stacking drivers are simplified too; for example, the original
q->merge_bvec_fn() was removed.

On the other hand, the cost of bio splitting is added.

Especially for stacking drivers, a bio may be split twice: once in the
stacking driver and once in the underlying device driver.

Fortunately the underlying queue's limits are propagated to the stacking queue,
so in theory the bio splitting in the stacking driver's ->submit_bio is enough
most of the time.



Thanks,
Ming

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()"
  2020-09-15  1:33     ` Mike Snitzer
@ 2020-09-15 17:03       ` Mike Snitzer
  2020-09-16 14:56       ` Vijayendra Suman
  1 sibling, 0 replies; 28+ messages in thread
From: Mike Snitzer @ 2020-09-15 17:03 UTC (permalink / raw)
  To: Vijayendra Suman
  Cc: linux-block, Somu Krishnasamy, dm-devel, RAMANAN_GOVINDARAJAN

On Mon, Sep 14 2020 at  9:33pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Thu, Sep 10 2020 at  3:29pm -0400,
> Vijayendra Suman <vijayendra.suman@oracle.com> wrote:
> 
> > Hello Mike,
> > 
> > I checked with upstream, performance measurement is similar and
> > shows performance improvement when
> > 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 is reverted.
> > 
> > On 9/10/2020 7:54 PM, Mike Snitzer wrote:
> > >[cc'ing dm-devel and linux-block because this is upstream concern too]
> > >
> > >On Wed, Sep 09 2020 at  1:00pm -0400,
> > >Vijayendra Suman <vijayendra.suman@oracle.com> wrote:
> > >
> > >>    Hello Mike,
> > >>
> > >>    While Running pgbench tool with  5.4.17 kernel build
> > >>
> > >>    Following performance degrade is found out
> > >>
> > >>    buffer read/write metric : -17.2%
> > >>    cache read/write metric : -18.7%
> > >>    disk read/write metric : -19%
> > >>
> > >>    buffer
> > >>    number of transactions actually processed: 840972
> > >>    latency average = 24.013 ms
> > >>    tps = 4664.153934 (including connections establishing)
> > >>    tps = 4664.421492 (excluding connections establishing)
> > >>
> > >>    cache
> > >>    number of transactions actually processed: 551345
> > >>    latency average = 36.949 ms
> > >>    tps = 3031.223905 (including connections establishing)
> > >>    tps = 3031.402581 (excluding connections establishing)
> > >>
> > >>    After revert of Commit
> > >>    2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call
> > >>    blk_queue_split() in dm_process_bio()")
> > >
> > >I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's
> > >backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ?
> >
> > Yes
> >
> > >>    Performance is Counter measurement
> > >>
> > >>    buffer ->
> > >>    number of transactions actually processed: 1135735
> > >>    latency average = 17.799 ms
> > >>    tps = 6292.586749 (including connections establishing)
> > >>    tps = 6292.875089 (excluding connections establishing)
> > >>
> > >>    cache ->
> > >>    number of transactions actually processed: 648177
> > >>    latency average = 31.217 ms
> > >>    tps = 3587.755975 (including connections establishing)
> > >>    tps = 3587.966359 (excluding connections establishing)
> > >>
> > >>    Following is your commit
> > >>
> > >>    diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > >>    index cf71a2277d60..1e6e0c970e19 100644
> > >>    --- a/drivers/md/dm.c
> > >>    +++ b/drivers/md/dm.c
> > >>    @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device
> > >>    *md,
> > >>             * won't be imposed.
> > >>             */
> > >>            if (current->bio_list) {
> > >>    -               blk_queue_split(md->queue, &bio);
> > >>    -               if (!is_abnormal_io(bio))
> > >>    +               if (is_abnormal_io(bio))
> > >>    +                       blk_queue_split(md->queue, &bio);
> > >>    +               else
> > >>                            dm_queue_split(md, ti, &bio);
> > >>            }
> > >>
> > >>    Could you have a look if it is safe to revert this commit.
> > >No, it really isn't a good idea given what was documented in the commit
> > >header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the
> > >excessive splitting is not conducive to performance either.
> > >
> > >So I think we need to identify _why_ reverting this commit is causing
> > >such a performance improvement.  Why is calling blk_queue_split() before
> > >dm_queue_split() benefiting your pgbench workload?
> >
> > Let me know if you want to check some patch.
> 
> Hi,
> 
> Could you please test this branch?:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.10
> (or apply at least the first 4 patches, commit 63f85d97be69^..b6a80963621fa)
> 
> So far I've done various DM regression testing.  But I haven't tested
> with pgbench or with the misaligned IO scenario documented in the
> header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74.  But I'll
> test that scenario tomorrow.

Training DM core to set chunk_sectors and always use blk_queue_split
resolves the inefficient splitting documented in the header for
commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74.
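
Conceptually the DM side boils down to stacking each target's max_io_len
into the queue's chunk_sectors when the table's queue_limits are
calculated, and then letting the unconditional blk_queue_split() do the
work.  Roughly (hypothetical sketch only -- the actual patches in the
dm-5.10 branch are authoritative):

	/* sketch: account for per-target boundaries via chunk_sectors */
	if (ti->max_io_len)
		limits->chunk_sectors = min_not_zero(limits->chunk_sectors,
						(unsigned int)ti->max_io_len);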

xfs_io -d -c 'pread -b 2m 224s 4072s' /dev/mapper/stripe_dev

before, so with commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74:

253,2    5        1     0.000000000  4382  Q   R 224 + 2064 [xfs_io]
253,2    5        2     0.000003414  4382  X   R 224 / 256 [xfs_io]
253,2    5        3     0.000017838  4382  X   R 256 / 512 [xfs_io]
253,2    5        4     0.000019852  4382  X   R 512 / 768 [xfs_io]
253,2    5        5     0.000031316  4382  X   R 768 / 1024 [xfs_io]
253,2    5        6     0.000034333  4382  X   R 1024 / 1280 [xfs_io]
253,2    5        7     0.000037684  4382  X   R 1280 / 1536 [xfs_io]
253,2    5        8     0.000041011  4382  X   R 1536 / 1792 [xfs_io]
253,2    5        9     0.000043962  4382  X   R 1792 / 2048 [xfs_io]
253,2    5       10     0.000074765  4382  Q   R 2288 + 2008 [xfs_io]
253,2    5       11     0.000075020  4382  X   R 2288 / 2304 [xfs_io]
253,2    5       12     0.000077009  4382  X   R 2304 / 2560 [xfs_io]
253,2    5       13     0.000080509  4382  X   R 2560 / 2816 [xfs_io]
253,2    5       14     0.000084182  4382  X   R 2816 / 3072 [xfs_io]
253,2    5       15     0.000087274  4382  X   R 3072 / 3328 [xfs_io]
253,2    5       16     0.000090342  4382  X   R 3328 / 3584 [xfs_io]
253,2    5       17     0.000095348  4382  X   R 3584 / 3840 [xfs_io]
253,2    5       18     0.000097776  4382  X   R 3840 / 4096 [xfs_io]

after, so with the 'dm-5.10' branch referenced above, meaning dm_process_bio
with unconditional blk_queue_split (with chunk_sectors set):

253,2   17        1     0.000000000  2176  Q   R 224 + 2280 [xfs_io]
253,2   17        2     0.000001978  2176  X   R 224 / 256 [xfs_io]
253,2   17        3     0.000017882  2176  X   R 256 / 512 [xfs_io]
253,2   17        4     0.000020406  2176  X   R 512 / 768 [xfs_io]
253,2   17        5     0.000031298  2176  X   R 768 / 1024 [xfs_io]
253,2   17        6     0.000034654  2176  X   R 1024 / 1280 [xfs_io]
253,2   17        7     0.000038474  2176  X   R 1280 / 1536 [xfs_io]
253,2   17        8     0.000042299  2176  X   R 1536 / 1792 [xfs_io]
253,2   17        9     0.000054088  2176  X   R 1792 / 2048 [xfs_io]
253,2   17       10     0.000057884  2176  X   R 2048 / 2304 [xfs_io]
253,2   17       11     0.000081358  2176  Q   R 2504 + 1792 [xfs_io]
253,2   17       12     0.000081778  2176  X   R 2504 / 2560 [xfs_io]
253,2   17       13     0.000083496  2176  X   R 2560 / 2816 [xfs_io]
253,2   17       14     0.000085301  2176  X   R 2816 / 3072 [xfs_io]
253,2   17       15     0.000092374  2176  X   R 3072 / 3328 [xfs_io]
253,2   17       16     0.000094774  2176  X   R 3328 / 3584 [xfs_io]
253,2   17       17     0.000097977  2176  X   R 3584 / 3840 [xfs_io]
253,2   17       18     0.000100094  2176  X   R 3840 / 4096 [xfs_io]

> Any chance you could provide some hints on how you're running pgbench
> just so I can try to test/reproduce/verify locally?

I'm going to defer to you on pgbench testing.

What is your underlying storage?

Could it be that DM using unconditional blk_queue_split() is helping
your pgbench workload because it splits IO more (so smaller IO, lower
latency per IO)?

Do you have comparison blktrace data?

Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()"
  2020-09-15  1:33     ` Mike Snitzer
  2020-09-15 17:03       ` Mike Snitzer
@ 2020-09-16 14:56       ` Vijayendra Suman
  1 sibling, 0 replies; 28+ messages in thread
From: Vijayendra Suman @ 2020-09-16 14:56 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: linux-block, Somu Krishnasamy, dm-devel, RAMANAN_GOVINDARAJAN

Hello Mike,

On 9/15/2020 7:03 AM, Mike Snitzer wrote:
> On Thu, Sep 10 2020 at  3:29pm -0400,
> Vijayendra Suman <vijayendra.suman@oracle.com> wrote:
>
>> Hello Mike,
>>
>> I checked with upstream, performance measurement is similar and
>> shows performance improvement when
>> 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 is reverted.
>>
>> On 9/10/2020 7:54 PM, Mike Snitzer wrote:
>>> [cc'ing dm-devel and linux-block because this is upstream concern too]
>>>
>>> On Wed, Sep 09 2020 at  1:00pm -0400,
>>> Vijayendra Suman <vijayendra.suman@oracle.com> wrote:
>>>
>>>>     Hello Mike,
>>>>
>>>>     While Running pgbench tool with  5.4.17 kernel build
>>>>
>>>>     Following performance degrade is found out
>>>>
>>>>     buffer read/write metric : -17.2%
>>>>     cache read/write metric : -18.7%
>>>>     disk read/write metric : -19%
>>>>
>>>>     buffer
>>>>     number of transactions actually processed: 840972
>>>>     latency average = 24.013 ms
>>>>     tps = 4664.153934 (including connections establishing)
>>>>     tps = 4664.421492 (excluding connections establishing)
>>>>
>>>>     cache
>>>>     number of transactions actually processed: 551345
>>>>     latency average = 36.949 ms
>>>>     tps = 3031.223905 (including connections establishing)
>>>>     tps = 3031.402581 (excluding connections establishing)
>>>>
>>>>     After revert of Commit
>>>>     2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call
>>>>     blk_queue_split() in dm_process_bio()")
>>> I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's
>>> backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ?
>> Yes
>>
>>>>     Performance is Counter measurement
>>>>
>>>>     buffer ->
>>>>     number of transactions actually processed: 1135735
>>>>     latency average = 17.799 ms
>>>>     tps = 6292.586749 (including connections establishing)
>>>>     tps = 6292.875089 (excluding connections establishing)
>>>>
>>>>     cache ->
>>>>     number of transactions actually processed: 648177
>>>>     latency average = 31.217 ms
>>>>     tps = 3587.755975 (including connections establishing)
>>>>     tps = 3587.966359 (excluding connections establishing)
>>>>
>>>>     Following is your commit
>>>>
>>>>     diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>>>     index cf71a2277d60..1e6e0c970e19 100644
>>>>     --- a/drivers/md/dm.c
>>>>     +++ b/drivers/md/dm.c
>>>>     @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device
>>>>     *md,
>>>>              * won't be imposed.
>>>>              */
>>>>             if (current->bio_list) {
>>>>     -               blk_queue_split(md->queue, &bio);
>>>>     -               if (!is_abnormal_io(bio))
>>>>     +               if (is_abnormal_io(bio))
>>>>     +                       blk_queue_split(md->queue, &bio);
>>>>     +               else
>>>>                             dm_queue_split(md, ti, &bio);
>>>>             }
>>>>
>>>>     Could you have a look if it is safe to revert this commit.
>>> No, it really isn't a good idea given what was documented in the commit
>>> header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the
>>> excessive splitting is not conducive to performance either.
>>>
>>> So I think we need to identify _why_ reverting this commit is causing
>>> such a performance improvement.  Why is calling blk_queue_split() before
>>> dm_queue_split() benefiting your pgbench workload?
>> Let me know if you want to check some patch.
> Hi,
>
> Could you please test this branch?:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.10
> (or apply at least the first 4 patches, commit 63f85d97be69^..b6a80963621fa)

With the above-mentioned patch set, I get the following results:

buffer ->

number of transactions actually processed: 1001957
latency average = 20.135 ms
tps = 5562.323947 (including connections establishing)
tps = 5562.649168 (excluding connections establishing)

cache ->

number of transactions actually processed: 581273
latency average = 34.745 ms
tps = 3223.520038 (including connections establishing)
tps = 3223.717013 (excluding connections establishing)

With the above patches there is a performance improvement.

> So far I've done various DM regression testing.  But I haven't tested
> with pgbench or with the misaligned IO scenario documented in the
> header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74.  But I'll
> test that scenario tomorrow.
>
> Any chance you could provide some hints on how you're running pgbench
> just so I can try to test/reproduce/verify locally?
A PostgreSQL setup script runs as part of the setup within RUN to
install the PostgreSQL DB, configure the /etc/postgresql.conf file and
initialize the DB.
The RUN script starts the PostgreSQL service and binds it to half of
the CPUs; a DB is created at a default size (I think 16M) and scaled up
to the required size based on whether it is a buffer, cache or disk run.

After this, PostgreSQL pgbench is run in readonly and readwrite modes
(bound to the other half of the CPUs on the system).

The performance issue was seen in readwrite mode.

>
> Thanks,
> Mike
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-09-16 14:56 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <529c2394-1b58-b9d8-d462-1f3de1b78ac8@oracle.com>
2020-09-10 14:24 ` Revert "dm: always call blk_queue_split() in dm_process_bio()" Mike Snitzer
2020-09-10 19:29   ` Vijayendra Suman
2020-09-15  1:33     ` Mike Snitzer
2020-09-15 17:03       ` Mike Snitzer
2020-09-16 14:56       ` Vijayendra Suman
2020-09-11 12:20   ` Ming Lei
2020-09-11 16:13     ` Mike Snitzer
2020-09-11 21:53       ` [PATCH 0/3] block: a few chunk_sectors fixes/improvements Mike Snitzer
2020-09-11 21:53         ` [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully Mike Snitzer
2020-09-12 13:52           ` Ming Lei
2020-09-14  0:43             ` Damien Le Moal
2020-09-14 14:52               ` Mike Snitzer
2020-09-14 23:28                 ` Damien Le Moal
2020-09-15  2:03               ` Ming Lei
2020-09-15  2:15                 ` Damien Le Moal
2020-09-14 14:49             ` Mike Snitzer
2020-09-15  1:50               ` Ming Lei
2020-09-14  0:46           ` Damien Le Moal
2020-09-14 15:03             ` Mike Snitzer
2020-09-15  1:09               ` Damien Le Moal
2020-09-15  4:21                 ` Damien Le Moal
2020-09-15  8:01                   ` Ming Lei
2020-09-11 21:53         ` [PATCH 2/3] block: use lcm_not_zero() when stacking chunk_sectors Mike Snitzer
2020-09-12 13:58           ` Ming Lei
2020-09-11 21:53         ` [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 Mike Snitzer
2020-09-12 14:06           ` Ming Lei
2020-09-14  2:43             ` Keith Busch
2020-09-14  0:55           ` Damien Le Moal
