* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()" [not found] <529c2394-1b58-b9d8-d462-1f3de1b78ac8@oracle.com> @ 2020-09-10 14:24 ` Mike Snitzer 2020-09-10 19:29 ` Vijayendra Suman 2020-09-11 12:20 ` Ming Lei 0 siblings, 2 replies; 28+ messages in thread From: Mike Snitzer @ 2020-09-10 14:24 UTC (permalink / raw) To: Vijayendra Suman; +Cc: linux-block, dm-devel [cc'ing dm-devel and linux-block because this is an upstream concern too] On Wed, Sep 09 2020 at 1:00pm -0400, Vijayendra Suman <vijayendra.suman@oracle.com> wrote: > Hello Mike, > > While running the pgbench tool with a 5.4.17 kernel build, > > the following performance degradation was found: > > buffer read/write metric : -17.2% > cache read/write metric : -18.7% > disk read/write metric : -19% > > buffer > number of transactions actually processed: 840972 > latency average = 24.013 ms > tps = 4664.153934 (including connections establishing) > tps = 4664.421492 (excluding connections establishing) > > cache > number of transactions actually processed: 551345 > latency average = 36.949 ms > tps = 3031.223905 (including connections establishing) > tps = 3031.402581 (excluding connections establishing) > > After reverting commit > 2892100bc85ae446088cebe0c00ba9b194c0ac9d (Revert "dm: always call > blk_queue_split() in dm_process_bio()") I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74?
> Performance is Counter measurement > > buffer -> > number of transactions actually processed: 1135735 > latency average = 17.799 ms > tps = 6292.586749 (including connections establishing) > tps = 6292.875089 (excluding connections establishing) > > cache -> > number of transactions actually processed: 648177 > latency average = 31.217 ms > tps = 3587.755975 (including connections establishing) > tps = 3587.966359 (excluding connections establishing) > > Following is your commit > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > index cf71a2277d60..1e6e0c970e19 100644 > --- a/drivers/md/dm.c > +++ b/drivers/md/dm.c > @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device > *md, > * won't be imposed. > */ > if (current->bio_list) { > - blk_queue_split(md->queue, &bio); > - if (!is_abnormal_io(bio)) > + if (is_abnormal_io(bio)) > + blk_queue_split(md->queue, &bio); > + else > dm_queue_split(md, ti, &bio); > } > > Could you have a look if it is safe to revert this commit. No, it really isn't a good idea given what was documented in the commit header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the excessive splitting is not conducive to performance either. So I think we need to identify _why_ reverting this commit is causing such a performance improvement. Why is calling blk_queue_split() before dm_queue_split() benefiting your pgbench workload? Thanks, Mike ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()" 2020-09-10 14:24 ` Revert "dm: always call blk_queue_split() in dm_process_bio()" Mike Snitzer @ 2020-09-10 19:29 ` Vijayendra Suman 2020-09-15 1:33 ` Mike Snitzer 2020-09-11 12:20 ` Ming Lei 1 sibling, 1 reply; 28+ messages in thread From: Vijayendra Suman @ 2020-09-10 19:29 UTC (permalink / raw) To: Mike Snitzer Cc: dm-devel, linux-block, RAMANAN_GOVINDARAJAN, Somu Krishnasamy Hello Mike, I checked with upstream, performance measurement is similar and shows performance improvement when 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 is reverted. On 9/10/2020 7:54 PM, Mike Snitzer wrote: > [cc'ing dm-devel and linux-block because this is upstream concern too] > > On Wed, Sep 09 2020 at 1:00pm -0400, > Vijayendra Suman <vijayendra.suman@oracle.com> wrote: > >> Hello Mike, >> >> While Running pgbench tool with 5.4.17 kernel build >> >> Following performance degrade is found out >> >> buffer read/write metric : -17.2% >> cache read/write metric : -18.7% >> disk read/write metric : -19% >> >> buffer >> number of transactions actually processed: 840972 >> latency average = 24.013 ms >> tps = 4664.153934 (including connections establishing) >> tps = 4664.421492 (excluding connections establishing) >> >> cache >> number of transactions actually processed: 551345 >> latency average = 36.949 ms >> tps = 3031.223905 (including connections establishing) >> tps = 3031.402581 (excluding connections establishing) >> >> After revert of Commit >> 2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call >> blk_queue_split() in dm_process_bio()") > I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's > backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ? 
Yes >> Performance is Counter measurement >> >> buffer -> >> number of transactions actually processed: 1135735 >> latency average = 17.799 ms >> tps = 6292.586749 (including connections establishing) >> tps = 6292.875089 (excluding connections establishing) >> >> cache -> >> number of transactions actually processed: 648177 >> latency average = 31.217 ms >> tps = 3587.755975 (including connections establishing) >> tps = 3587.966359 (excluding connections establishing) >> >> Following is your commit >> >> diff --git a/drivers/md/dm.c b/drivers/md/dm.c >> index cf71a2277d60..1e6e0c970e19 100644 >> --- a/drivers/md/dm.c >> +++ b/drivers/md/dm.c >> @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device >> *md, >> * won't be imposed. >> */ >> if (current->bio_list) { >> - blk_queue_split(md->queue, &bio); >> - if (!is_abnormal_io(bio)) >> + if (is_abnormal_io(bio)) >> + blk_queue_split(md->queue, &bio); >> + else >> dm_queue_split(md, ti, &bio); >> } >> >> Could you have a look if it is safe to revert this commit. > No, it really isn't a good idea given what was documented in the commit > header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the > excessive splitting is not conducive to performance either. > > So I think we need to identify _why_ reverting this commit is causing > such a performance improvement. Why is calling blk_queue_split() before > dm_queue_split() benefiting your pgbench workload? Let me know if you want to check some patch. > > Thanks, > Mike > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()" 2020-09-10 19:29 ` Vijayendra Suman @ 2020-09-15 1:33 ` Mike Snitzer 2020-09-15 17:03 ` Mike Snitzer 2020-09-16 14:56 ` Vijayendra Suman 0 siblings, 2 replies; 28+ messages in thread From: Mike Snitzer @ 2020-09-15 1:33 UTC (permalink / raw) To: Vijayendra Suman Cc: linux-block, Somu Krishnasamy, dm-devel, RAMANAN_GOVINDARAJAN On Thu, Sep 10 2020 at 3:29pm -0400, Vijayendra Suman <vijayendra.suman@oracle.com> wrote: > Hello Mike, > > I checked with upstream, performance measurement is similar and > shows performance improvement when > 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 is reverted. > > On 9/10/2020 7:54 PM, Mike Snitzer wrote: > >[cc'ing dm-devel and linux-block because this is upstream concern too] > > > >On Wed, Sep 09 2020 at 1:00pm -0400, > >Vijayendra Suman <vijayendra.suman@oracle.com> wrote: > > > >> Hello Mike, > >> > >> While Running pgbench tool with 5.4.17 kernel build > >> > >> Following performance degrade is found out > >> > >> buffer read/write metric : -17.2% > >> cache read/write metric : -18.7% > >> disk read/write metric : -19% > >> > >> buffer > >> number of transactions actually processed: 840972 > >> latency average = 24.013 ms > >> tps = 4664.153934 (including connections establishing) > >> tps = 4664.421492 (excluding connections establishing) > >> > >> cache > >> number of transactions actually processed: 551345 > >> latency average = 36.949 ms > >> tps = 3031.223905 (including connections establishing) > >> tps = 3031.402581 (excluding connections establishing) > >> > >> After revert of Commit > >> 2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call > >> blk_queue_split() in dm_process_bio()") > > > >I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's > >backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ? 
> > Yes > > >> Performance is Counter measurement > >> > >> buffer -> > >> number of transactions actually processed: 1135735 > >> latency average = 17.799 ms > >> tps = 6292.586749 (including connections establishing) > >> tps = 6292.875089 (excluding connections establishing) > >> > >> cache -> > >> number of transactions actually processed: 648177 > >> latency average = 31.217 ms > >> tps = 3587.755975 (including connections establishing) > >> tps = 3587.966359 (excluding connections establishing) > >> > >> Following is your commit > >> > >> diff --git a/drivers/md/dm.c b/drivers/md/dm.c > >> index cf71a2277d60..1e6e0c970e19 100644 > >> --- a/drivers/md/dm.c > >> +++ b/drivers/md/dm.c > >> @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device > >> *md, > >> * won't be imposed. > >> */ > >> if (current->bio_list) { > >> - blk_queue_split(md->queue, &bio); > >> - if (!is_abnormal_io(bio)) > >> + if (is_abnormal_io(bio)) > >> + blk_queue_split(md->queue, &bio); > >> + else > >> dm_queue_split(md, ti, &bio); > >> } > >> > >> Could you have a look if it is safe to revert this commit. > >No, it really isn't a good idea given what was documented in the commit > >header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the > >excessive splitting is not conducive to performance either. > > > >So I think we need to identify _why_ reverting this commit is causing > >such a performance improvement. Why is calling blk_queue_split() before > >dm_queue_split() benefiting your pgbench workload? > > Let me know if you want to check some patch. Hi, Could you please test this branch?: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.10 (or apply at least the first 4 patches, commit 63f85d97be69^..b6a80963621fa) So far I've done various DM regression testing. But I haven't tested with pgbench or with the misaaligned IO scenario documented in the header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74. 
But I'll test that scenario tomorrow. Any chance you could provide some hints on how you're running pgbench just so I can try to test/reproduce/verify locally? Thanks, Mike ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()" 2020-09-15 1:33 ` Mike Snitzer @ 2020-09-15 17:03 ` Mike Snitzer 2020-09-16 14:56 ` Vijayendra Suman 1 sibling, 0 replies; 28+ messages in thread From: Mike Snitzer @ 2020-09-15 17:03 UTC (permalink / raw) To: Vijayendra Suman Cc: linux-block, Somu Krishnasamy, dm-devel, RAMANAN_GOVINDARAJAN On Mon, Sep 14 2020 at 9:33pm -0400, Mike Snitzer <snitzer@redhat.com> wrote: > On Thu, Sep 10 2020 at 3:29pm -0400, > Vijayendra Suman <vijayendra.suman@oracle.com> wrote: > > > Hello Mike, > > > > I checked with upstream, performance measurement is similar and > > shows performance improvement when > > 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 is reverted. > > > > On 9/10/2020 7:54 PM, Mike Snitzer wrote: > > >[cc'ing dm-devel and linux-block because this is upstream concern too] > > > > > >On Wed, Sep 09 2020 at 1:00pm -0400, > > >Vijayendra Suman <vijayendra.suman@oracle.com> wrote: > > > > > >> Hello Mike, > > >> > > >> While Running pgbench tool with 5.4.17 kernel build > > >> > > >> Following performance degrade is found out > > >> > > >> buffer read/write metric : -17.2% > > >> cache read/write metric : -18.7% > > >> disk read/write metric : -19% > > >> > > >> buffer > > >> number of transactions actually processed: 840972 > > >> latency average = 24.013 ms > > >> tps = 4664.153934 (including connections establishing) > > >> tps = 4664.421492 (excluding connections establishing) > > >> > > >> cache > > >> number of transactions actually processed: 551345 > > >> latency average = 36.949 ms > > >> tps = 3031.223905 (including connections establishing) > > >> tps = 3031.402581 (excluding connections establishing) > > >> > > >> After revert of Commit > > >> 2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call > > >> blk_queue_split() in dm_process_bio()") > > > > > >I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's > > >backport of upstream commit 
120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ? > > > > Yes > > > > >> Performance is Counter measurement > > >> > > >> buffer -> > > >> number of transactions actually processed: 1135735 > > >> latency average = 17.799 ms > > >> tps = 6292.586749 (including connections establishing) > > >> tps = 6292.875089 (excluding connections establishing) > > >> > > >> cache -> > > >> number of transactions actually processed: 648177 > > >> latency average = 31.217 ms > > >> tps = 3587.755975 (including connections establishing) > > >> tps = 3587.966359 (excluding connections establishing) > > >> > > >> Following is your commit > > >> > > >> diff --git a/drivers/md/dm.c b/drivers/md/dm.c > > >> index cf71a2277d60..1e6e0c970e19 100644 > > >> --- a/drivers/md/dm.c > > >> +++ b/drivers/md/dm.c > > >> @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device > > >> *md, > > >> * won't be imposed. > > >> */ > > >> if (current->bio_list) { > > >> - blk_queue_split(md->queue, &bio); > > >> - if (!is_abnormal_io(bio)) > > >> + if (is_abnormal_io(bio)) > > >> + blk_queue_split(md->queue, &bio); > > >> + else > > >> dm_queue_split(md, ti, &bio); > > >> } > > >> > > >> Could you have a look if it is safe to revert this commit. > > >No, it really isn't a good idea given what was documented in the commit > > >header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the > > >excessive splitting is not conducive to performance either. > > > > > >So I think we need to identify _why_ reverting this commit is causing > > >such a performance improvement. Why is calling blk_queue_split() before > > >dm_queue_split() benefiting your pgbench workload? > > > > Let me know if you want to check some patch. > > Hi, > > Could you please test this branch?: > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.10 > (or apply at least the first 4 patches, commit 63f85d97be69^..b6a80963621fa) > > So far I've done various DM regression testing. 
But I haven't tested > with pgbench or with the misaligned IO scenario documented in the > header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74. But I'll > test that scenario tomorrow. Training DM core to set chunk_sectors and always use blk_queue_split resolves the inefficient splitting documented in the header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74.

xfs_io -d -c 'pread -b 2m 224s 4072s' /dev/mapper/stripe_dev

before, so with commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74:

253,2 5 1 0.000000000 4382 Q R 224 + 2064 [xfs_io]
253,2 5 2 0.000003414 4382 X R 224 / 256 [xfs_io]
253,2 5 3 0.000017838 4382 X R 256 / 512 [xfs_io]
253,2 5 4 0.000019852 4382 X R 512 / 768 [xfs_io]
253,2 5 5 0.000031316 4382 X R 768 / 1024 [xfs_io]
253,2 5 6 0.000034333 4382 X R 1024 / 1280 [xfs_io]
253,2 5 7 0.000037684 4382 X R 1280 / 1536 [xfs_io]
253,2 5 8 0.000041011 4382 X R 1536 / 1792 [xfs_io]
253,2 5 9 0.000043962 4382 X R 1792 / 2048 [xfs_io]
253,2 5 10 0.000074765 4382 Q R 2288 + 2008 [xfs_io]
253,2 5 11 0.000075020 4382 X R 2288 / 2304 [xfs_io]
253,2 5 12 0.000077009 4382 X R 2304 / 2560 [xfs_io]
253,2 5 13 0.000080509 4382 X R 2560 / 2816 [xfs_io]
253,2 5 14 0.000084182 4382 X R 2816 / 3072 [xfs_io]
253,2 5 15 0.000087274 4382 X R 3072 / 3328 [xfs_io]
253,2 5 16 0.000090342 4382 X R 3328 / 3584 [xfs_io]
253,2 5 17 0.000095348 4382 X R 3584 / 3840 [xfs_io]
253,2 5 18 0.000097776 4382 X R 3840 / 4096 [xfs_io]

after, so with the 'dm-5.10' branch referenced above, meaning dm_process_bio w/ unconditional blk_queue_split (w/ chunk_sectors):

253,2 17 1 0.000000000 2176 Q R 224 + 2280 [xfs_io]
253,2 17 2 0.000001978 2176 X R 224 / 256 [xfs_io]
253,2 17 3 0.000017882 2176 X R 256 / 512 [xfs_io]
253,2 17 4 0.000020406 2176 X R 512 / 768 [xfs_io]
253,2 17 5 0.000031298 2176 X R 768 / 1024 [xfs_io]
253,2 17 6 0.000034654 2176 X R 1024 / 1280 [xfs_io]
253,2 17 7 0.000038474 2176 X R 1280 / 1536 [xfs_io]
253,2 17 8 0.000042299 2176 X R 1536 / 1792 [xfs_io]
253,2 17 9 0.000054088 2176 X R 1792 / 2048 [xfs_io]
253,2 17 10 0.000057884 2176 X R 2048 / 2304 [xfs_io]
253,2 17 11 0.000081358 2176 Q R 2504 + 1792 [xfs_io]
253,2 17 12 0.000081778 2176 X R 2504 / 2560 [xfs_io]
253,2 17 13 0.000083496 2176 X R 2560 / 2816 [xfs_io]
253,2 17 14 0.000085301 2176 X R 2816 / 3072 [xfs_io]
253,2 17 15 0.000092374 2176 X R 3072 / 3328 [xfs_io]
253,2 17 16 0.000094774 2176 X R 3328 / 3584 [xfs_io]
253,2 17 17 0.000097977 2176 X R 3584 / 3840 [xfs_io]
253,2 17 18 0.000100094 2176 X R 3840 / 4096 [xfs_io]

> Any chance you could provide some hints on how you're running pgbench > just so I can try to test/reproduce/verify locally? I'm going to defer to you on pgbench testing. What is your underlying storage? Could it be that DM using unconditional blk_queue_split() is helping your pgbench workload because it splits IO more (so smaller IO, lower latency per IO)? Do you have comparison blktrace data? Mike ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()" 2020-09-15 1:33 ` Mike Snitzer 2020-09-15 17:03 ` Mike Snitzer @ 2020-09-16 14:56 ` Vijayendra Suman 1 sibling, 0 replies; 28+ messages in thread From: Vijayendra Suman @ 2020-09-16 14:56 UTC (permalink / raw) To: Mike Snitzer Cc: linux-block, Somu Krishnasamy, dm-devel, RAMANAN_GOVINDARAJAN Hello Mike, On 9/15/2020 7:03 AM, Mike Snitzer wrote: > On Thu, Sep 10 2020 at 3:29pm -0400, > Vijayendra Suman <vijayendra.suman@oracle.com> wrote: > >> Hello Mike, >> >> I checked with upstream, performance measurement is similar and >> shows performance improvement when >> 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 is reverted. >> >> On 9/10/2020 7:54 PM, Mike Snitzer wrote: >>> [cc'ing dm-devel and linux-block because this is upstream concern too] >>> >>> On Wed, Sep 09 2020 at 1:00pm -0400, >>> Vijayendra Suman <vijayendra.suman@oracle.com> wrote: >>> >>>> Hello Mike, >>>> >>>> While Running pgbench tool with 5.4.17 kernel build >>>> >>>> Following performance degrade is found out >>>> >>>> buffer read/write metric : -17.2% >>>> cache read/write metric : -18.7% >>>> disk read/write metric : -19% >>>> >>>> buffer >>>> number of transactions actually processed: 840972 >>>> latency average = 24.013 ms >>>> tps = 4664.153934 (including connections establishing) >>>> tps = 4664.421492 (excluding connections establishing) >>>> >>>> cache >>>> number of transactions actually processed: 551345 >>>> latency average = 36.949 ms >>>> tps = 3031.223905 (including connections establishing) >>>> tps = 3031.402581 (excluding connections establishing) >>>> >>>> After revert of Commit >>>> 2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call >>>> blk_queue_split() in dm_process_bio()") >>> I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's >>> backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ? 
>> Yes >> >>>> Performance is Counter measurement >>>> >>>> buffer -> >>>> number of transactions actually processed: 1135735 >>>> latency average = 17.799 ms >>>> tps = 6292.586749 (including connections establishing) >>>> tps = 6292.875089 (excluding connections establishing) >>>> >>>> cache -> >>>> number of transactions actually processed: 648177 >>>> latency average = 31.217 ms >>>> tps = 3587.755975 (including connections establishing) >>>> tps = 3587.966359 (excluding connections establishing) >>>> >>>> Following is your commit >>>> >>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c >>>> index cf71a2277d60..1e6e0c970e19 100644 >>>> --- a/drivers/md/dm.c >>>> +++ b/drivers/md/dm.c >>>> @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device >>>> *md, >>>> * won't be imposed. >>>> */ >>>> if (current->bio_list) { >>>> - blk_queue_split(md->queue, &bio); >>>> - if (!is_abnormal_io(bio)) >>>> + if (is_abnormal_io(bio)) >>>> + blk_queue_split(md->queue, &bio); >>>> + else >>>> dm_queue_split(md, ti, &bio); >>>> } >>>> >>>> Could you have a look if it is safe to revert this commit. >>> No, it really isn't a good idea given what was documented in the commit >>> header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the >>> excessive splitting is not conducive to performance either. >>> >>> So I think we need to identify _why_ reverting this commit is causing >>> such a performance improvement. Why is calling blk_queue_split() before >>> dm_queue_split() benefiting your pgbench workload? >> Let me know if you want to check some patch. 
> Hi, > > Could you please test this branch?: > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.10 > (or apply at least the first 4 patches, commit 63f85d97be69^..b6a80963621fa) With the above-mentioned patch set, I get the following results buffer -> number of transactions actually processed: 1001957 latency average = 20.135 ms tps = 5562.323947 (including connections establishing) tps = 5562.649168 (excluding connections establishing) cache -> number of transactions actually processed: 581273 latency average = 34.745 ms tps = 3223.520038 (including connections establishing) tps = 3223.717013 (excluding connections establishing) With the above patches there is a performance improvement. > So far I've done various DM regression testing. But I haven't tested > with pgbench or with the misaligned IO scenario documented in the > header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74. But I'll > test that scenario tomorrow. > > Any chance you could provide some hints on how you're running pgbench > just so I can try to test/reproduce/verify locally? A PostgreSQL setup script runs as part of the setup within RUN to install the PostgreSQL DB, configure the /etc/postgresql.conf file and initialize the DB. The RUN script starts the PostgreSQL service and binds it to half the CPUs; a DB is created at a default size (I think 16M) and scaled up to the required size based on whether it is a buffer, cache or disk run. After this, PostgreSQL pgbench is run in readonly and readwrite modes (bound to the other half of the CPUs on the system). The performance issue was seen in readwrite mode. > > Thanks, > Mike > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()" 2020-09-10 14:24 ` Revert "dm: always call blk_queue_split() in dm_process_bio()" Mike Snitzer 2020-09-10 19:29 ` Vijayendra Suman @ 2020-09-11 12:20 ` Ming Lei 2020-09-11 16:13 ` Mike Snitzer 1 sibling, 1 reply; 28+ messages in thread From: Ming Lei @ 2020-09-11 12:20 UTC (permalink / raw) To: Mike Snitzer; +Cc: Vijayendra Suman, dm-devel, linux-block On Thu, Sep 10, 2020 at 10:24:39AM -0400, Mike Snitzer wrote: > [cc'ing dm-devel and linux-block because this is upstream concern too] > > On Wed, Sep 09 2020 at 1:00pm -0400, > Vijayendra Suman <vijayendra.suman@oracle.com> wrote: > > > Hello Mike, > > > > While Running pgbench tool with 5.4.17 kernel build > > > > Following performance degrade is found out > > > > buffer read/write metric : -17.2% > > cache read/write metric : -18.7% > > disk read/write metric : -19% > > > > buffer > > number of transactions actually processed: 840972 > > latency average = 24.013 ms > > tps = 4664.153934 (including connections establishing) > > tps = 4664.421492 (excluding connections establishing) > > > > cache > > number of transactions actually processed: 551345 > > latency average = 36.949 ms > > tps = 3031.223905 (including connections establishing) > > tps = 3031.402581 (excluding connections establishing) > > > > After revert of Commit > > 2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call > > blk_queue_split() in dm_process_bio()") > > I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's > backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ? 
> > > Performance is Counter measurement > > > > buffer -> > > number of transactions actually processed: 1135735 > > latency average = 17.799 ms > > tps = 6292.586749 (including connections establishing) > > tps = 6292.875089 (excluding connections establishing) > > > > cache -> > > number of transactions actually processed: 648177 > > latency average = 31.217 ms > > tps = 3587.755975 (including connections establishing) > > tps = 3587.966359 (excluding connections establishing) > > > > Following is your commit > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > > index cf71a2277d60..1e6e0c970e19 100644 > > --- a/drivers/md/dm.c > > +++ b/drivers/md/dm.c > > @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device > > *md, > > * won't be imposed. > > */ > > if (current->bio_list) { > > - blk_queue_split(md->queue, &bio); > > - if (!is_abnormal_io(bio)) > > + if (is_abnormal_io(bio)) > > + blk_queue_split(md->queue, &bio); > > + else > > dm_queue_split(md, ti, &bio); > > } > > > > Could you have a look if it is safe to revert this commit. > > No, it really isn't a good idea given what was documented in the commit > header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the > excessive splitting is not conducive to performance either. > > So I think we need to identify _why_ reverting this commit is causing > such a performance improvement. Why is calling blk_queue_split() before > dm_queue_split() benefiting your pgbench workload? blk_queue_split() takes every queue limit into account, while dm_queue_split() only splits the bio according to max len(offset, chunk size), so the split bio may not be the optimal one from the device's viewpoint. Maybe DM can switch to blk_queue_split() if the 'chunk_sectors' limit is a power of 2. Thanks, Ming ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Revert "dm: always call blk_queue_split() in dm_process_bio()" 2020-09-11 12:20 ` Ming Lei @ 2020-09-11 16:13 ` Mike Snitzer 2020-09-11 21:53 ` [PATCH 0/3] block: a few chunk_sectors fixes/improvements Mike Snitzer 0 siblings, 1 reply; 28+ messages in thread From: Mike Snitzer @ 2020-09-11 16:13 UTC (permalink / raw) To: Ming Lei; +Cc: linux-block, dm-devel, Vijayendra Suman On Fri, Sep 11 2020 at 8:20am -0400, Ming Lei <ming.lei@redhat.com> wrote: > On Thu, Sep 10, 2020 at 10:24:39AM -0400, Mike Snitzer wrote: > > [cc'ing dm-devel and linux-block because this is upstream concern too] > > > > On Wed, Sep 09 2020 at 1:00pm -0400, > > Vijayendra Suman <vijayendra.suman@oracle.com> wrote: > > > > > Hello Mike, > > > > > > While Running pgbench tool with 5.4.17 kernel build > > > > > > Following performance degrade is found out > > > > > > buffer read/write metric : -17.2% > > > cache read/write metric : -18.7% > > > disk read/write metric : -19% > > > > > > buffer > > > number of transactions actually processed: 840972 > > > latency average = 24.013 ms > > > tps = 4664.153934 (including connections establishing) > > > tps = 4664.421492 (excluding connections establishing) > > > > > > cache > > > number of transactions actually processed: 551345 > > > latency average = 36.949 ms > > > tps = 3031.223905 (including connections establishing) > > > tps = 3031.402581 (excluding connections establishing) > > > > > > After revert of Commit > > > 2892100bc85ae446088cebe0c00ba9b194c0ac9d ( Revert "dm: always call > > > blk_queue_split() in dm_process_bio()") > > > > I assume 2892100bc85ae446088cebe0c00ba9b194c0ac9d is 5.4-stable's > > backport of upstream commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 ? 
> > > > > Performance is Counter measurement > > > > > > buffer -> > > > number of transactions actually processed: 1135735 > > > latency average = 17.799 ms > > > tps = 6292.586749 (including connections establishing) > > > tps = 6292.875089 (excluding connections establishing) > > > > > > cache -> > > > number of transactions actually processed: 648177 > > > latency average = 31.217 ms > > > tps = 3587.755975 (including connections establishing) > > > tps = 3587.966359 (excluding connections establishing) > > > > > > Following is your commit > > > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > > > index cf71a2277d60..1e6e0c970e19 100644 > > > --- a/drivers/md/dm.c > > > +++ b/drivers/md/dm.c > > > @@ -1760,8 +1760,9 @@ static blk_qc_t dm_process_bio(struct mapped_device > > > *md, > > > * won't be imposed. > > > */ > > > if (current->bio_list) { > > > - blk_queue_split(md->queue, &bio); > > > - if (!is_abnormal_io(bio)) > > > + if (is_abnormal_io(bio)) > > > + blk_queue_split(md->queue, &bio); > > > + else > > > dm_queue_split(md, ti, &bio); > > > } > > > > > > Could you have a look if it is safe to revert this commit. > > > > No, it really isn't a good idea given what was documented in the commit > > header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 -- the > > excessive splitting is not conducive to performance either. > > > > So I think we need to identify _why_ reverting this commit is causing > > such a performance improvement. Why is calling blk_queue_split() before > > dm_queue_split() benefiting your pgbench workload? > > blk_queue_split() takes every queue's limit into account, and dm_queue_split() > only splits bio according to max len(offset, chunk size), so the > splitted bio may not be optimal one from device viewpoint. 
Yes, but the issue is blk_queue_split() is doing the wrong thing for the case described in the header for commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 > Maybe DM can switch to blk_queue_split() if 'chunk_sectors' limit is power-2 > aligned. Not seeing relation to chunk_sectors being power of 2 -- other than that is all block core supports. But chunk_sectors isn't set for DM, you added chunk_sectors for MD or something and DM was caught out, so blk_queue_split() falls back to splitting on max_sectors. You're saying DM should set 'chunk_sectors' IFF it'd be a power of 2? While I could do that, it seems like just continuing a sequence of hacks around earlier imposed chunk_sectors infrastructure that was a half-measure to begin with. Think chunk_sectors logic in block core needs to be enhanced -- but I'll take a closer look. Mike ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 0/3] block: a few chunk_sectors fixes/improvements 2020-09-11 16:13 ` Mike Snitzer @ 2020-09-11 21:53 ` Mike Snitzer 2020-09-11 21:53 ` [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully Mike Snitzer ` (2 more replies) 0 siblings, 3 replies; 28+ messages in thread From: Mike Snitzer @ 2020-09-11 21:53 UTC (permalink / raw) To: Ming Lei; +Cc: Vijayendra Suman, dm-devel, linux-block Hi, Here are some changes that seem needed but that stop short of fixing the initial report (of DM pgbench regression). Would be nice if you could review these block patches since they stand on their own. I still have to look closer at how to get blk_queue_split() the info DM has (in ti->max_io_len). Ideally a variant of blk_queue_split() could be created that allows a 'chunk_sectors' override to be passed in, e.g.: dm_process_bio() would call: blk_queue_split(&bio, ti->max_io_len); And the provided ti->max_io_len would be used instead of a global (one size fits all) q->limits.chunk_sectors. The reason why this is needed/ideal is due to DM's stacked nature. Different offsets into a DM device could yield entirely different max_io_len (or chunk_sectors) settings. BUT short of standing up a new variant of blk_queue_split() with a per-bio override for 'chunk_sectors' (which is likely a non-starter for a few reasons, the recursive nature of bio_split being the biggest): I'll need to update all DM targets that call dm_set_target_max_io_len() to also do this in each target's .io_hints hook: limits->chunk_sectors = lcm_not_zero(ti->max_io_len, limits->chunk_sectors); Won't be perfect for some stacked devices (given it'll constrain chunk_sectors to be less optimal as the IO limits are stacked) but it should be an improvement -- and hopefully fix this pgbench regression.
Thanks, Mike Mike Snitzer (3): block: fix blk_rq_get_max_sectors() to flow more carefully block: use lcm_not_zero() when stacking chunk_sectors block: allow 'chunk_sectors' to be non-power-of-2 block/blk-settings.c | 22 ++++++++++++---------- include/linux/blkdev.h | 31 ++++++++++++++++++++++--------- 2 files changed, 34 insertions(+), 19 deletions(-) -- 2.15.0 ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-11 21:53 ` [PATCH 0/3] block: a few chunk_sectors fixes/improvements Mike Snitzer @ 2020-09-11 21:53 ` Mike Snitzer 2020-09-12 13:52 ` Ming Lei 2020-09-14 0:46 ` Damien Le Moal 2020-09-11 21:53 ` [PATCH 2/3] block: use lcm_not_zero() when stacking chunk_sectors Mike Snitzer 2020-09-11 21:53 ` [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 Mike Snitzer 2 siblings, 2 replies; 28+ messages in thread From: Mike Snitzer @ 2020-09-11 21:53 UTC (permalink / raw) To: Ming Lei; +Cc: Vijayendra Suman, dm-devel, linux-block blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for those operations. Also, there is no need to avoid blk_max_size_offset() if 'chunk_sectors' isn't set because it falls back to 'max_sectors'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- include/linux/blkdev.h | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index bb5636cc17b9..453a3d735d66 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, sector_t offset) { struct request_queue *q = rq->q; + int op; + unsigned int max_sectors; if (blk_rq_is_passthrough(rq)) return q->limits.max_hw_sectors; - if (!q->limits.chunk_sectors || - req_op(rq) == REQ_OP_DISCARD || - req_op(rq) == REQ_OP_SECURE_ERASE) - return blk_queue_get_max_sectors(q, req_op(rq)); + op = req_op(rq); + max_sectors = blk_queue_get_max_sectors(q, op); - return min(blk_max_size_offset(q, offset), - blk_queue_get_max_sectors(q, req_op(rq))); + switch (op) { + case REQ_OP_DISCARD: + case REQ_OP_SECURE_ERASE: + case REQ_OP_WRITE_SAME: + case REQ_OP_WRITE_ZEROES: + return max_sectors; + } + + return min(blk_max_size_offset(q, offset), max_sectors); } static inline unsigned 
int blk_rq_count_bios(struct request *rq) -- 2.15.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-11 21:53 ` [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully Mike Snitzer @ 2020-09-12 13:52 ` Ming Lei 2020-09-14 0:43 ` Damien Le Moal 2020-09-14 14:49 ` Mike Snitzer 2020-09-14 0:46 ` Damien Le Moal 1 sibling, 2 replies; 28+ messages in thread From: Ming Lei @ 2020-09-12 13:52 UTC (permalink / raw) To: Mike Snitzer; +Cc: Vijayendra Suman, dm-devel, linux-block On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote: > blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and > REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for > those operations. Actually WRITE_SAME & WRITE_ZEROS are handled by the following if chunk_sectors is set: return min(blk_max_size_offset(q, offset), blk_queue_get_max_sectors(q, req_op(rq))); > Also, there is no need to avoid blk_max_size_offset() if > 'chunk_sectors' isn't set because it falls back to 'max_sectors'. > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > include/linux/blkdev.h | 19 +++++++++++++------ > 1 file changed, 13 insertions(+), 6 deletions(-) > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index bb5636cc17b9..453a3d735d66 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, > sector_t offset) > { > struct request_queue *q = rq->q; > + int op; > + unsigned int max_sectors; > > if (blk_rq_is_passthrough(rq)) > return q->limits.max_hw_sectors; > > - if (!q->limits.chunk_sectors || > - req_op(rq) == REQ_OP_DISCARD || > - req_op(rq) == REQ_OP_SECURE_ERASE) > - return blk_queue_get_max_sectors(q, req_op(rq)); > + op = req_op(rq); > + max_sectors = blk_queue_get_max_sectors(q, op); > > - return min(blk_max_size_offset(q, offset), > - blk_queue_get_max_sectors(q, req_op(rq))); > + switch (op) { > + case REQ_OP_DISCARD: > + case 
REQ_OP_SECURE_ERASE: > + case REQ_OP_WRITE_SAME: > + case REQ_OP_WRITE_ZEROES: > + return max_sectors; > + } > + > + return min(blk_max_size_offset(q, offset), max_sectors); > } It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS needs to be considered. Thanks, Ming ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-12 13:52 ` Ming Lei @ 2020-09-14 0:43 ` Damien Le Moal 2020-09-14 14:52 ` Mike Snitzer 2020-09-15 2:03 ` Ming Lei 2020-09-14 14:49 ` Mike Snitzer 1 sibling, 2 replies; 28+ messages in thread From: Damien Le Moal @ 2020-09-14 0:43 UTC (permalink / raw) To: Ming Lei, Mike Snitzer; +Cc: Vijayendra Suman, dm-devel, linux-block On 2020/09/12 22:53, Ming Lei wrote: > On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote: >> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and >> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for >> those operations. > > Actually WRITE_SAME & WRITE_ZEROS are handled by the following if > chunk_sectors is set: > > return min(blk_max_size_offset(q, offset), > blk_queue_get_max_sectors(q, req_op(rq))); > >> Also, there is no need to avoid blk_max_size_offset() if >> 'chunk_sectors' isn't set because it falls back to 'max_sectors'. >> >> Signed-off-by: Mike Snitzer <snitzer@redhat.com> >> --- >> include/linux/blkdev.h | 19 +++++++++++++------ >> 1 file changed, 13 insertions(+), 6 deletions(-) >> >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h >> index bb5636cc17b9..453a3d735d66 100644 >> --- a/include/linux/blkdev.h >> +++ b/include/linux/blkdev.h >> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, >> sector_t offset) >> { >> struct request_queue *q = rq->q; >> + int op; >> + unsigned int max_sectors; >> >> if (blk_rq_is_passthrough(rq)) >> return q->limits.max_hw_sectors; >> >> - if (!q->limits.chunk_sectors || >> - req_op(rq) == REQ_OP_DISCARD || >> - req_op(rq) == REQ_OP_SECURE_ERASE) >> - return blk_queue_get_max_sectors(q, req_op(rq)); >> + op = req_op(rq); >> + max_sectors = blk_queue_get_max_sectors(q, op); >> >> - return min(blk_max_size_offset(q, offset), >> - blk_queue_get_max_sectors(q, req_op(rq))); >> + switch (op) { >> + case REQ_OP_DISCARD: 
>> + case REQ_OP_SECURE_ERASE: >> + case REQ_OP_WRITE_SAME: >> + case REQ_OP_WRITE_ZEROES: >> + return max_sectors; >> + } >> + >> + return min(blk_max_size_offset(q, offset), max_sectors); >> } > > It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS > needs to be considered. That limit is needed for zoned block devices to ensure that *any* write request, no matter the command, does not cross zone boundaries. Otherwise, the write would be immediately failed by the device. > > > Thanks, > Ming > > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-14 0:43 ` Damien Le Moal @ 2020-09-14 14:52 ` Mike Snitzer 2020-09-14 23:28 ` Damien Le Moal 2020-09-15 2:03 ` Ming Lei 1 sibling, 1 reply; 28+ messages in thread From: Mike Snitzer @ 2020-09-14 14:52 UTC (permalink / raw) To: Damien Le Moal; +Cc: Ming Lei, Vijayendra Suman, dm-devel, linux-block On Sun, Sep 13 2020 at 8:43pm -0400, Damien Le Moal <Damien.LeMoal@wdc.com> wrote: > On 2020/09/12 22:53, Ming Lei wrote: > > On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote: > >> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and > >> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for > >> those operations. > > > > Actually WRITE_SAME & WRITE_ZEROS are handled by the following if > > chunk_sectors is set: > > > > return min(blk_max_size_offset(q, offset), > > blk_queue_get_max_sectors(q, req_op(rq))); > > > >> Also, there is no need to avoid blk_max_size_offset() if > >> 'chunk_sectors' isn't set because it falls back to 'max_sectors'. 
> >> > >> Signed-off-by: Mike Snitzer <snitzer@redhat.com> > >> --- > >> include/linux/blkdev.h | 19 +++++++++++++------ > >> 1 file changed, 13 insertions(+), 6 deletions(-) > >> > >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > >> index bb5636cc17b9..453a3d735d66 100644 > >> --- a/include/linux/blkdev.h > >> +++ b/include/linux/blkdev.h > >> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, > >> sector_t offset) > >> { > >> struct request_queue *q = rq->q; > >> + int op; > >> + unsigned int max_sectors; > >> > >> if (blk_rq_is_passthrough(rq)) > >> return q->limits.max_hw_sectors; > >> > >> - if (!q->limits.chunk_sectors || > >> - req_op(rq) == REQ_OP_DISCARD || > >> - req_op(rq) == REQ_OP_SECURE_ERASE) > >> - return blk_queue_get_max_sectors(q, req_op(rq)); > >> + op = req_op(rq); > >> + max_sectors = blk_queue_get_max_sectors(q, op); > >> > >> - return min(blk_max_size_offset(q, offset), > >> - blk_queue_get_max_sectors(q, req_op(rq))); > >> + switch (op) { > >> + case REQ_OP_DISCARD: > >> + case REQ_OP_SECURE_ERASE: > >> + case REQ_OP_WRITE_SAME: > >> + case REQ_OP_WRITE_ZEROES: > >> + return max_sectors; > >> + }>> + > >> + return min(blk_max_size_offset(q, offset), max_sectors); > >> } > > > > It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS > > needs to be considered. > > That limit is needed for zoned block devices to ensure that *any* write request, > no matter the command, do not cross zone boundaries. Otherwise, the write would > be immediately failed by the device. Thanks for the additional context, sorry to make you so concerned! ;) Mike ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-14 14:52 ` Mike Snitzer @ 2020-09-14 23:28 ` Damien Le Moal 0 siblings, 0 replies; 28+ messages in thread From: Damien Le Moal @ 2020-09-14 23:28 UTC (permalink / raw) To: Mike Snitzer; +Cc: Ming Lei, Vijayendra Suman, dm-devel, linux-block On 2020/09/14 23:52, Mike Snitzer wrote: > On Sun, Sep 13 2020 at 8:43pm -0400, > Damien Le Moal <Damien.LeMoal@wdc.com> wrote: > >> On 2020/09/12 22:53, Ming Lei wrote: >>> On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote: >>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and >>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for >>>> those operations. >>> >>> Actually WRITE_SAME & WRITE_ZEROS are handled by the following if >>> chunk_sectors is set: >>> >>> return min(blk_max_size_offset(q, offset), >>> blk_queue_get_max_sectors(q, req_op(rq))); >>> >>>> Also, there is no need to avoid blk_max_size_offset() if >>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'. 
>>>> >>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com> >>>> --- >>>> include/linux/blkdev.h | 19 +++++++++++++------ >>>> 1 file changed, 13 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h >>>> index bb5636cc17b9..453a3d735d66 100644 >>>> --- a/include/linux/blkdev.h >>>> +++ b/include/linux/blkdev.h >>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, >>>> sector_t offset) >>>> { >>>> struct request_queue *q = rq->q; >>>> + int op; >>>> + unsigned int max_sectors; >>>> >>>> if (blk_rq_is_passthrough(rq)) >>>> return q->limits.max_hw_sectors; >>>> >>>> - if (!q->limits.chunk_sectors || >>>> - req_op(rq) == REQ_OP_DISCARD || >>>> - req_op(rq) == REQ_OP_SECURE_ERASE) >>>> - return blk_queue_get_max_sectors(q, req_op(rq)); >>>> + op = req_op(rq); >>>> + max_sectors = blk_queue_get_max_sectors(q, op); >>>> >>>> - return min(blk_max_size_offset(q, offset), >>>> - blk_queue_get_max_sectors(q, req_op(rq))); >>>> + switch (op) { >>>> + case REQ_OP_DISCARD: >>>> + case REQ_OP_SECURE_ERASE: >>>> + case REQ_OP_WRITE_SAME: >>>> + case REQ_OP_WRITE_ZEROES: >>>> + return max_sectors; >>>> + }>> + >>>> + return min(blk_max_size_offset(q, offset), max_sectors); >>>> } >>> >>> It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS >>> needs to be considered. >> >> That limit is needed for zoned block devices to ensure that *any* write request, >> no matter the command, do not cross zone boundaries. Otherwise, the write would >> be immediately failed by the device. > > Thanks for the additional context, sorry to make you so concerned! ;) No worries :) -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-14 0:43 ` Damien Le Moal 2020-09-14 14:52 ` Mike Snitzer @ 2020-09-15 2:03 ` Ming Lei 2020-09-15 2:15 ` Damien Le Moal 1 sibling, 1 reply; 28+ messages in thread From: Ming Lei @ 2020-09-15 2:03 UTC (permalink / raw) To: Damien Le Moal; +Cc: Mike Snitzer, Vijayendra Suman, dm-devel, linux-block On Mon, Sep 14, 2020 at 12:43:06AM +0000, Damien Le Moal wrote: > On 2020/09/12 22:53, Ming Lei wrote: > > On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote: > >> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and > >> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for > >> those operations. > > > > Actually WRITE_SAME & WRITE_ZEROS are handled by the following if > > chunk_sectors is set: > > > > return min(blk_max_size_offset(q, offset), > > blk_queue_get_max_sectors(q, req_op(rq))); > > > >> Also, there is no need to avoid blk_max_size_offset() if > >> 'chunk_sectors' isn't set because it falls back to 'max_sectors'. 
> >> > >> Signed-off-by: Mike Snitzer <snitzer@redhat.com> > >> --- > >> include/linux/blkdev.h | 19 +++++++++++++------ > >> 1 file changed, 13 insertions(+), 6 deletions(-) > >> > >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > >> index bb5636cc17b9..453a3d735d66 100644 > >> --- a/include/linux/blkdev.h > >> +++ b/include/linux/blkdev.h > >> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, > >> sector_t offset) > >> { > >> struct request_queue *q = rq->q; > >> + int op; > >> + unsigned int max_sectors; > >> > >> if (blk_rq_is_passthrough(rq)) > >> return q->limits.max_hw_sectors; > >> > >> - if (!q->limits.chunk_sectors || > >> - req_op(rq) == REQ_OP_DISCARD || > >> - req_op(rq) == REQ_OP_SECURE_ERASE) > >> - return blk_queue_get_max_sectors(q, req_op(rq)); > >> + op = req_op(rq); > >> + max_sectors = blk_queue_get_max_sectors(q, op); > >> > >> - return min(blk_max_size_offset(q, offset), > >> - blk_queue_get_max_sectors(q, req_op(rq))); > >> + switch (op) { > >> + case REQ_OP_DISCARD: > >> + case REQ_OP_SECURE_ERASE: > >> + case REQ_OP_WRITE_SAME: > >> + case REQ_OP_WRITE_ZEROES: > >> + return max_sectors; > >> + }>> + > >> + return min(blk_max_size_offset(q, offset), max_sectors); > >> } > > > > It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS > > needs to be considered. > > That limit is needed for zoned block devices to ensure that *any* write request, > no matter the command, do not cross zone boundaries. Otherwise, the write would > be immediately failed by the device. Looks both blk_bio_write_zeroes_split() and blk_bio_write_same_split() don't consider chunk_sectors limit, is that an issue for zone block? thanks, Ming ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-15 2:03 ` Ming Lei @ 2020-09-15 2:15 ` Damien Le Moal 0 siblings, 0 replies; 28+ messages in thread From: Damien Le Moal @ 2020-09-15 2:15 UTC (permalink / raw) To: Ming Lei; +Cc: Mike Snitzer, Vijayendra Suman, dm-devel, linux-block On 2020/09/15 11:04, Ming Lei wrote: > On Mon, Sep 14, 2020 at 12:43:06AM +0000, Damien Le Moal wrote: >> On 2020/09/12 22:53, Ming Lei wrote: >>> On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote: >>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and >>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for >>>> those operations. >>> >>> Actually WRITE_SAME & WRITE_ZEROS are handled by the following if >>> chunk_sectors is set: >>> >>> return min(blk_max_size_offset(q, offset), >>> blk_queue_get_max_sectors(q, req_op(rq))); >>> >>>> Also, there is no need to avoid blk_max_size_offset() if >>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'. 
>>>> >>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com> >>>> --- >>>> include/linux/blkdev.h | 19 +++++++++++++------ >>>> 1 file changed, 13 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h >>>> index bb5636cc17b9..453a3d735d66 100644 >>>> --- a/include/linux/blkdev.h >>>> +++ b/include/linux/blkdev.h >>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, >>>> sector_t offset) >>>> { >>>> struct request_queue *q = rq->q; >>>> + int op; >>>> + unsigned int max_sectors; >>>> >>>> if (blk_rq_is_passthrough(rq)) >>>> return q->limits.max_hw_sectors; >>>> >>>> - if (!q->limits.chunk_sectors || >>>> - req_op(rq) == REQ_OP_DISCARD || >>>> - req_op(rq) == REQ_OP_SECURE_ERASE) >>>> - return blk_queue_get_max_sectors(q, req_op(rq)); >>>> + op = req_op(rq); >>>> + max_sectors = blk_queue_get_max_sectors(q, op); >>>> >>>> - return min(blk_max_size_offset(q, offset), >>>> - blk_queue_get_max_sectors(q, req_op(rq))); >>>> + switch (op) { >>>> + case REQ_OP_DISCARD: >>>> + case REQ_OP_SECURE_ERASE: >>>> + case REQ_OP_WRITE_SAME: >>>> + case REQ_OP_WRITE_ZEROES: >>>> + return max_sectors; >>>> + }>> + >>>> + return min(blk_max_size_offset(q, offset), max_sectors); >>>> } >>> >>> It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS >>> needs to be considered. >> >> That limit is needed for zoned block devices to ensure that *any* write request, >> no matter the command, do not cross zone boundaries. Otherwise, the write would >> be immediately failed by the device. > > Looks both blk_bio_write_zeroes_split() and blk_bio_write_same_split() > don't consider chunk_sectors limit, is that an issue for zone block? Hu... Never looked at these. Yes, it will be a problem. write zeroes for NVMe ZNS drives and write same for SCSI/ZBC drives. So yes, definitely something that needs to be fixed. 
Users of these will be file systems; in the case of zoned block devices, that means file systems with zone support. f2fs does not use these commands, and btrfs (posted recently) needs to be checked. But since the FS itself is zone aware, its requests will be zone aligned. Still definitely worth fixing. Thanks ! -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-12 13:52 ` Ming Lei 2020-09-14 0:43 ` Damien Le Moal @ 2020-09-14 14:49 ` Mike Snitzer 2020-09-15 1:50 ` Ming Lei 1 sibling, 1 reply; 28+ messages in thread From: Mike Snitzer @ 2020-09-14 14:49 UTC (permalink / raw) To: Ming Lei; +Cc: linux-block, dm-devel, Vijayendra Suman On Sat, Sep 12 2020 at 9:52am -0400, Ming Lei <ming.lei@redhat.com> wrote: > On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote: > > blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and > > REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for > > those operations. > > Actually WRITE_SAME & WRITE_ZEROS are handled by the following if > chunk_sectors is set: > > return min(blk_max_size_offset(q, offset), > blk_queue_get_max_sectors(q, req_op(rq))); Yes, but blk_rq_get_max_sectors() is a bit of a mess structurally. The duality of imposing chunk_sectors and/or considering offset when calculating the return is very confusing. > > Also, there is no need to avoid blk_max_size_offset() if > > 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> > > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > > --- > > include/linux/blkdev.h | 19 +++++++++++++------ > > 1 file changed, 13 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > > index bb5636cc17b9..453a3d735d66 100644 > > --- a/include/linux/blkdev.h > > +++ b/include/linux/blkdev.h > > @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, > > sector_t offset) > > { > > struct request_queue *q = rq->q; > > + int op; > > + unsigned int max_sectors; > > > > if (blk_rq_is_passthrough(rq)) > > return q->limits.max_hw_sectors; > > > > - if (!q->limits.chunk_sectors || > > - req_op(rq) == REQ_OP_DISCARD || > > - req_op(rq) == REQ_OP_SECURE_ERASE) > > - return blk_queue_get_max_sectors(q, req_op(rq)); > > + op = req_op(rq); > > + max_sectors = blk_queue_get_max_sectors(q, op); > > > > - return min(blk_max_size_offset(q, offset), > > - blk_queue_get_max_sectors(q, req_op(rq))); > > + switch (op) { > > + case REQ_OP_DISCARD: > > + case REQ_OP_SECURE_ERASE: > > + case REQ_OP_WRITE_SAME: > > + case REQ_OP_WRITE_ZEROES: > > + return max_sectors; > > + } > > + > > + return min(blk_max_size_offset(q, offset), max_sectors); > > } > > It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS > needs to be considered. Yes, I see that now. But why don't they need to be considered for REQ_OP_DISCARD and REQ_OP_SECURE_ERASE? Is it because the intent of the block core is to offer late splitting of bios? If so, then why impose chunk_sectors so early? Obviously this patch 1/3 should be dropped. I didn't treat chunk_sectors with proper priority. But like I said above, blk_rq_get_max_sectors() vs blk_max_size_offset() is not at all straight-forward. And the code looks prone to imposing limits that shouldn't be (or vice-versa). Also, when falling back to max_sectors, why not consider offset to treat max_sectors like a granularity? 
Would allow for much more consistent IO patterns. Mike ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-14 14:49 ` Mike Snitzer @ 2020-09-15 1:50 ` Ming Lei 0 siblings, 0 replies; 28+ messages in thread From: Ming Lei @ 2020-09-15 1:50 UTC (permalink / raw) To: Mike Snitzer; +Cc: Vijayendra Suman, dm-devel, linux-block On Mon, Sep 14, 2020 at 10:49:28AM -0400, Mike Snitzer wrote: > On Sat, Sep 12 2020 at 9:52am -0400, > Ming Lei <ming.lei@redhat.com> wrote: > > > On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote: > > > blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and > > > REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for > > > those operations. > > > > Actually WRITE_SAME & WRITE_ZEROS are handled by the following if > > chunk_sectors is set: > > > > return min(blk_max_size_offset(q, offset), > > blk_queue_get_max_sectors(q, req_op(rq))); > > Yes, but blk_rq_get_max_sectors() is a bit of a mess structurally. he > duality of imposing chunk_sectors and/or considering offset when > calculating the return is very confused. > > > > Also, there is no need to avoid blk_max_size_offset() if > > > 'chunk_sectors' isn't set because it falls back to 'max_sectors'. 
> > > > > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > > > --- > > > include/linux/blkdev.h | 19 +++++++++++++------ > > > 1 file changed, 13 insertions(+), 6 deletions(-) > > > > > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > > > index bb5636cc17b9..453a3d735d66 100644 > > > --- a/include/linux/blkdev.h > > > +++ b/include/linux/blkdev.h > > > @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, > > > sector_t offset) > > > { > > > struct request_queue *q = rq->q; > > > + int op; > > > + unsigned int max_sectors; > > > > > > if (blk_rq_is_passthrough(rq)) > > > return q->limits.max_hw_sectors; > > > > > > - if (!q->limits.chunk_sectors || > > > - req_op(rq) == REQ_OP_DISCARD || > > > - req_op(rq) == REQ_OP_SECURE_ERASE) > > > - return blk_queue_get_max_sectors(q, req_op(rq)); > > > + op = req_op(rq); > > > + max_sectors = blk_queue_get_max_sectors(q, op); > > > > > > - return min(blk_max_size_offset(q, offset), > > > - blk_queue_get_max_sectors(q, req_op(rq))); > > > + switch (op) { > > > + case REQ_OP_DISCARD: > > > + case REQ_OP_SECURE_ERASE: > > > + case REQ_OP_WRITE_SAME: > > > + case REQ_OP_WRITE_ZEROES: > > > + return max_sectors; > > > + } > > > + > > > + return min(blk_max_size_offset(q, offset), max_sectors); > > > } > > > > It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS > > needs to be considered. > > Yes, I see that now. But why don't they need to be considered for > REQ_OP_DISCARD and REQ_OP_SECURE_ERASE? This behavior was introduced in the following commit, and I guess it is because we support multi-range discard requests; maybe Jens can explain more. commit e548ca4ee4595f65b262661d166310ad8a149bec Author: Jens Axboe <axboe@fb.com> Date: Fri May 29 13:11:32 2015 -0600 block: don't honor chunk sizes for data-less IO We don't need to honor chunk sizes for IO that doesn't carry any data.
Signed-off-by: Jens Axboe <axboe@fb.com> > Is it because the intent of the > block core is to offer late splitting of bios? The block layer doesn't have late bio splitting; a bio is only split via __blk_queue_split() before a request is allocated. blk_rq_get_max_sectors() is only called by the rq merge code; it really should have been defined in block/blk.h instead of a public header. > If so, then why impose > chunk_sectors so early? Not sure I understand your question. 'chunk_sectors' is first applied during bio splitting (get_max_io_size() from blk_bio_segment_split()). > > Obviously this patch 1/3 should be dropped. I didn't treat > chunk_sectors with proper priority. > > But like I said above, blk_rq_get_max_sectors() vs blk_max_size_offset() > is not at all straight-forward. And the code looks prone to imposing > limits that shouldn't be (or vice-versa). > > Also, when falling back to max_sectors, why not consider offset to treat > max_sectors like a granularity? Would allow for much more consistent IO > patterns. blk_rq_get_max_sectors() is called when a bio or rq may be merged into the current request, and we have already considered all the queue limits when doing bio splitting, so it is not necessary to consider them again here during rq merging. Thanks, Ming ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-11 21:53 ` [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully Mike Snitzer 2020-09-12 13:52 ` Ming Lei @ 2020-09-14 0:46 ` Damien Le Moal 2020-09-14 15:03 ` Mike Snitzer 1 sibling, 1 reply; 28+ messages in thread From: Damien Le Moal @ 2020-09-14 0:46 UTC (permalink / raw) To: Mike Snitzer, Ming Lei; +Cc: Vijayendra Suman, dm-devel, linux-block On 2020/09/12 6:53, Mike Snitzer wrote: > blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and > REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for > those operations. > > Also, there is no need to avoid blk_max_size_offset() if > 'chunk_sectors' isn't set because it falls back to 'max_sectors'. > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > include/linux/blkdev.h | 19 +++++++++++++------ > 1 file changed, 13 insertions(+), 6 deletions(-) > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index bb5636cc17b9..453a3d735d66 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, > sector_t offset) > { > struct request_queue *q = rq->q; > + int op; > + unsigned int max_sectors; > > if (blk_rq_is_passthrough(rq)) > return q->limits.max_hw_sectors; > > - if (!q->limits.chunk_sectors || > - req_op(rq) == REQ_OP_DISCARD || > - req_op(rq) == REQ_OP_SECURE_ERASE) > - return blk_queue_get_max_sectors(q, req_op(rq)); > + op = req_op(rq); > + max_sectors = blk_queue_get_max_sectors(q, op); > > - return min(blk_max_size_offset(q, offset), > - blk_queue_get_max_sectors(q, req_op(rq))); > + switch (op) { > + case REQ_OP_DISCARD: > + case REQ_OP_SECURE_ERASE: > + case REQ_OP_WRITE_SAME: > + case REQ_OP_WRITE_ZEROES: > + return max_sectors; > + } Doesn't this break md devices? (I think md does use chunk_sectors for the stride size, no?)
As mentioned in my reply to Ming's email, this will allow these commands to potentially cross over zone boundaries on zoned block devices, which would be an immediate command failure. > + > + return min(blk_max_size_offset(q, offset), max_sectors); > } > > static inline unsigned int blk_rq_count_bios(struct request *rq) > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-14 0:46 ` Damien Le Moal @ 2020-09-14 15:03 ` Mike Snitzer 2020-09-15 1:09 ` Damien Le Moal 0 siblings, 1 reply; 28+ messages in thread From: Mike Snitzer @ 2020-09-14 15:03 UTC (permalink / raw) To: Damien Le Moal; +Cc: linux-block, dm-devel, Vijayendra Suman, Ming Lei On Sun, Sep 13 2020 at 8:46pm -0400, Damien Le Moal <Damien.LeMoal@wdc.com> wrote: > On 2020/09/12 6:53, Mike Snitzer wrote: > > blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and > > REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for > > those operations. > > > > Also, there is no need to avoid blk_max_size_offset() if > > 'chunk_sectors' isn't set because it falls back to 'max_sectors'. > > > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > > --- > > include/linux/blkdev.h | 19 +++++++++++++------ > > 1 file changed, 13 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > > index bb5636cc17b9..453a3d735d66 100644 > > --- a/include/linux/blkdev.h > > +++ b/include/linux/blkdev.h > > @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, > > sector_t offset) > > { > > struct request_queue *q = rq->q; > > + int op; > > + unsigned int max_sectors; > > > > if (blk_rq_is_passthrough(rq)) > > return q->limits.max_hw_sectors; > > > > - if (!q->limits.chunk_sectors || > > - req_op(rq) == REQ_OP_DISCARD || > > - req_op(rq) == REQ_OP_SECURE_ERASE) > > - return blk_queue_get_max_sectors(q, req_op(rq)); > > + op = req_op(rq); > > + max_sectors = blk_queue_get_max_sectors(q, op); > > > > - return min(blk_max_size_offset(q, offset), > > - blk_queue_get_max_sectors(q, req_op(rq))); > > + switch (op) { > > + case REQ_OP_DISCARD: > > + case REQ_OP_SECURE_ERASE: > > + case REQ_OP_WRITE_SAME: > > + case REQ_OP_WRITE_ZEROES: > > + return max_sectors; > > + } > > Doesn't this break md devices ? 
(I think does use chunk_sectors for stride size, > no ?) > > As mentioned in my reply to Ming's email, this will allow these commands to > potentially cross over zone boundaries on zoned block devices, which would be an > immediate command failure. Depending on the implementation it is beneficial to get a large discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's optimization for handling large discards and issuing N discards, one per stripe). Same could apply for other commands. Like all devices, zoned devices should impose command specific limits in the queue_limits (and not lean on chunk_sectors to do a one-size-fits-all). But that aside, yes I agree I didn't pay close enough attention to the implications of deferring the splitting of these commands until they were issued to underlying storage. This chunk_sectors early splitting override is a bit of a mess... not quite following the logic given we were supposed to be waiting to split bios as late as possible. Mike ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-14 15:03 ` Mike Snitzer @ 2020-09-15 1:09 ` Damien Le Moal 2020-09-15 4:21 ` Damien Le Moal 0 siblings, 1 reply; 28+ messages in thread From: Damien Le Moal @ 2020-09-15 1:09 UTC (permalink / raw) To: Mike Snitzer; +Cc: Ming Lei, Vijayendra Suman, dm-devel, linux-block On 2020/09/15 0:04, Mike Snitzer wrote: > On Sun, Sep 13 2020 at 8:46pm -0400, > Damien Le Moal <Damien.LeMoal@wdc.com> wrote: > >> On 2020/09/12 6:53, Mike Snitzer wrote: >>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and >>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for >>> those operations. >>> >>> Also, there is no need to avoid blk_max_size_offset() if >>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'. >>> >>> Signed-off-by: Mike Snitzer <snitzer@redhat.com> >>> --- >>> include/linux/blkdev.h | 19 +++++++++++++------ >>> 1 file changed, 13 insertions(+), 6 deletions(-) >>> >>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h >>> index bb5636cc17b9..453a3d735d66 100644 >>> --- a/include/linux/blkdev.h >>> +++ b/include/linux/blkdev.h >>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, >>> sector_t offset) >>> { >>> struct request_queue *q = rq->q; >>> + int op; >>> + unsigned int max_sectors; >>> >>> if (blk_rq_is_passthrough(rq)) >>> return q->limits.max_hw_sectors; >>> >>> - if (!q->limits.chunk_sectors || >>> - req_op(rq) == REQ_OP_DISCARD || >>> - req_op(rq) == REQ_OP_SECURE_ERASE) >>> - return blk_queue_get_max_sectors(q, req_op(rq)); >>> + op = req_op(rq); >>> + max_sectors = blk_queue_get_max_sectors(q, op); >>> >>> - return min(blk_max_size_offset(q, offset), >>> - blk_queue_get_max_sectors(q, req_op(rq))); >>> + switch (op) { >>> + case REQ_OP_DISCARD: >>> + case REQ_OP_SECURE_ERASE: >>> + case REQ_OP_WRITE_SAME: >>> + case REQ_OP_WRITE_ZEROES: >>> + return max_sectors; >>> + } >> >> 
Doesn't this break md devices ? (I think does use chunk_sectors for stride size, >> no ?) >> >> As mentioned in my reply to Ming's email, this will allow these commands to >> potentially cross over zone boundaries on zoned block devices, which would be an >> immediate command failure. > > Depending on the implementation it is beneficial to get a large > discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's > optimization for handling large discards and issuing N discards, one per > stripe). Same could apply for other commands. > > Like all devices, zoned devices should impose command specific limits in > the queue_limits (and not lean on chunk_sectors to do a > one-size-fits-all). Yes, understood. But I think that in the case of md, chunk_sectors is used to indicate the boundary between drives for a raid volume. So it does indeed make sense to limit the IO size on submission since otherwise, the md driver itself would have to split that bio again anyway. > But that aside, yes I agree I didn't pay close enough attention to the > implications of deferring the splitting of these commands until they > were issued to underlying storage. This chunk_sectors early splitting > override is a bit of a mess... not quite following the logic given we > were supposed to be waiting to split bios as late as possible. My view is that the multipage bvec (BIOs almost as large as we want) and late splitting is beneficial to get larger effective BIO sent to the device as having more pages on hand allows bigger segments in the bio instead of always having at most PAGE_SIZE per segment. The effect of this is very visible with blktrace. A lot of requests end up being much larger than the device max_segments * page_size. However, if there is already a known limit on the BIO size when the BIO is being built, it does not make much sense to try to grow a bio beyond that limit since it will have to be split by the driver anyway. 
chunk_sectors is one such limit used for md (I think) to indicate boundaries between drives of a raid volume. And we reuse it (abuse it ?) for zoned block devices to ensure that any command does not cross over zone boundaries since that triggers errors for writes within sequential zones or read/write crossing over zones of different types (conventional->sequential zone boundary). I may not have the entire picture correctly here, but so far, this is my understanding. Cheers. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-15 1:09 ` Damien Le Moal @ 2020-09-15 4:21 ` Damien Le Moal 2020-09-15 8:01 ` Ming Lei 0 siblings, 1 reply; 28+ messages in thread From: Damien Le Moal @ 2020-09-15 4:21 UTC (permalink / raw) To: Mike Snitzer; +Cc: Ming Lei, Vijayendra Suman, dm-devel, linux-block On 2020/09/15 10:10, Damien Le Moal wrote: > On 2020/09/15 0:04, Mike Snitzer wrote: >> On Sun, Sep 13 2020 at 8:46pm -0400, >> Damien Le Moal <Damien.LeMoal@wdc.com> wrote: >> >>> On 2020/09/12 6:53, Mike Snitzer wrote: >>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and >>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for >>>> those operations. >>>> >>>> Also, there is no need to avoid blk_max_size_offset() if >>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'. >>>> >>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com> >>>> --- >>>> include/linux/blkdev.h | 19 +++++++++++++------ >>>> 1 file changed, 13 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h >>>> index bb5636cc17b9..453a3d735d66 100644 >>>> --- a/include/linux/blkdev.h >>>> +++ b/include/linux/blkdev.h >>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, >>>> sector_t offset) >>>> { >>>> struct request_queue *q = rq->q; >>>> + int op; >>>> + unsigned int max_sectors; >>>> >>>> if (blk_rq_is_passthrough(rq)) >>>> return q->limits.max_hw_sectors; >>>> >>>> - if (!q->limits.chunk_sectors || >>>> - req_op(rq) == REQ_OP_DISCARD || >>>> - req_op(rq) == REQ_OP_SECURE_ERASE) >>>> - return blk_queue_get_max_sectors(q, req_op(rq)); >>>> + op = req_op(rq); >>>> + max_sectors = blk_queue_get_max_sectors(q, op); >>>> >>>> - return min(blk_max_size_offset(q, offset), >>>> - blk_queue_get_max_sectors(q, req_op(rq))); >>>> + switch (op) { >>>> + case REQ_OP_DISCARD: >>>> + case REQ_OP_SECURE_ERASE: >>>> + case 
REQ_OP_WRITE_SAME: >>>> + case REQ_OP_WRITE_ZEROES: >>>> + return max_sectors; >>>> + } >>> >>> Doesn't this break md devices ? (I think does use chunk_sectors for stride size, >>> no ?) >>> >>> As mentioned in my reply to Ming's email, this will allow these commands to >>> potentially cross over zone boundaries on zoned block devices, which would be an >>> immediate command failure. >> >> Depending on the implementation it is beneficial to get a large >> discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's >> optimization for handling large discards and issuing N discards, one per >> stripe). Same could apply for other commands. >> >> Like all devices, zoned devices should impose command specific limits in >> the queue_limits (and not lean on chunk_sectors to do a >> one-size-fits-all). > > Yes, understood. But I think that in the case of md, chunk_sectors is used to > indicate the boundary between drives for a raid volume. So it does indeed make > sense to limit the IO size on submission since otherwise, the md driver itself > would have to split that bio again anyway. > >> But that aside, yes I agree I didn't pay close enough attention to the >> implications of deferring the splitting of these commands until they >> were issued to underlying storage. This chunk_sectors early splitting >> override is a bit of a mess... not quite following the logic given we >> were supposed to be waiting to split bios as late as possible. > > My view is that the multipage bvec (BIOs almost as large as we want) and late > splitting is beneficial to get larger effective BIO sent to the device as having > more pages on hand allows bigger segments in the bio instead of always having at > most PAGE_SIZE per segment. The effect of this is very visible with blktrace. A > lot of requests end up being much larger than the device max_segments * page_size. 
> > However, if there is already a known limit on the BIO size when the BIO is being > built, it does not make much sense to try to grow a bio beyond that limit since > it will have to be split by the driver anyway. chunk_sectors is one such limit > used for md (I think) to indicate boundaries between drives of a raid volume. > And we reuse it (abuse it ?) for zoned block devices to ensure that any command > does not cross over zone boundaries since that triggers errors for writes within > sequential zones or read/write crossing over zones of different types > (conventional->sequential zone boundary). > > I may not have the entire picture correctly here, but so far, this is my > understanding. And I was wrong :) In light of Ming's comment + a little code refresher reading, indeed, chunk_sectors will split BIOs so that *requests* do not exceed that limit, but the initial BIO submission may be much larger regardless of chunk_sectors. Ming, I think the point here is that building a large BIO first and splitting it later (as opposed to limiting the bio size by stopping bio_add_page()) is more efficient as there is only one bio submit instead of many, right ? -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully 2020-09-15 4:21 ` Damien Le Moal @ 2020-09-15 8:01 ` Ming Lei 0 siblings, 0 replies; 28+ messages in thread From: Ming Lei @ 2020-09-15 8:01 UTC (permalink / raw) To: Damien Le Moal; +Cc: linux-block, dm-devel, Mike Snitzer, Vijayendra Suman On Tue, Sep 15, 2020 at 04:21:54AM +0000, Damien Le Moal wrote: > On 2020/09/15 10:10, Damien Le Moal wrote: > > On 2020/09/15 0:04, Mike Snitzer wrote: > >> On Sun, Sep 13 2020 at 8:46pm -0400, > >> Damien Le Moal <Damien.LeMoal@wdc.com> wrote: > >> > >>> On 2020/09/12 6:53, Mike Snitzer wrote: > >>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and > >>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for > >>>> those operations. > >>>> > >>>> Also, there is no need to avoid blk_max_size_offset() if > >>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'. > >>>> > >>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com> > >>>> --- > >>>> include/linux/blkdev.h | 19 +++++++++++++------ > >>>> 1 file changed, 13 insertions(+), 6 deletions(-) > >>>> > >>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > >>>> index bb5636cc17b9..453a3d735d66 100644 > >>>> --- a/include/linux/blkdev.h > >>>> +++ b/include/linux/blkdev.h > >>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, > >>>> sector_t offset) > >>>> { > >>>> struct request_queue *q = rq->q; > >>>> + int op; > >>>> + unsigned int max_sectors; > >>>> > >>>> if (blk_rq_is_passthrough(rq)) > >>>> return q->limits.max_hw_sectors; > >>>> > >>>> - if (!q->limits.chunk_sectors || > >>>> - req_op(rq) == REQ_OP_DISCARD || > >>>> - req_op(rq) == REQ_OP_SECURE_ERASE) > >>>> - return blk_queue_get_max_sectors(q, req_op(rq)); > >>>> + op = req_op(rq); > >>>> + max_sectors = blk_queue_get_max_sectors(q, op); > >>>> > >>>> - return min(blk_max_size_offset(q, offset), > >>>> - blk_queue_get_max_sectors(q, 
req_op(rq))); > >>>> + switch (op) { > >>>> + case REQ_OP_DISCARD: > >>>> + case REQ_OP_SECURE_ERASE: > >>>> + case REQ_OP_WRITE_SAME: > >>>> + case REQ_OP_WRITE_ZEROES: > >>>> + return max_sectors; > >>>> + } > >>> > >>> Doesn't this break md devices ? (I think does use chunk_sectors for stride size, > >>> no ?) > >>> > >>> As mentioned in my reply to Ming's email, this will allow these commands to > >>> potentially cross over zone boundaries on zoned block devices, which would be an > >>> immediate command failure. > >> > >> Depending on the implementation it is beneficial to get a large > >> discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's > >> optimization for handling large discards and issuing N discards, one per > >> stripe). Same could apply for other commands. > >> > >> Like all devices, zoned devices should impose command specific limits in > >> the queue_limits (and not lean on chunk_sectors to do a > >> one-size-fits-all). > > > > Yes, understood. But I think that in the case of md, chunk_sectors is used to > > indicate the boundary between drives for a raid volume. So it does indeed make > > sense to limit the IO size on submission since otherwise, the md driver itself > > would have to split that bio again anyway. > > > >> But that aside, yes I agree I didn't pay close enough attention to the > >> implications of deferring the splitting of these commands until they > >> were issued to underlying storage. This chunk_sectors early splitting > >> override is a bit of a mess... not quite following the logic given we > >> were supposed to be waiting to split bios as late as possible. > > > > My view is that the multipage bvec (BIOs almost as large as we want) and late > > splitting is beneficial to get larger effective BIO sent to the device as having > > more pages on hand allows bigger segments in the bio instead of always having at > > most PAGE_SIZE per segment. The effect of this is very visible with blktrace. 
A > > lot of requests end up being much larger than the device max_segments * page_size. > > > > However, if there is already a known limit on the BIO size when the BIO is being > > built, it does not make much sense to try to grow a bio beyond that limit since > > it will have to be split by the driver anyway. chunk_sectors is one such limit > > used for md (I think) to indicate boundaries between drives of a raid volume. > > And we reuse it (abuse it ?) for zoned block devices to ensure that any command > > does not cross over zone boundaries since that triggers errors for writes within > > sequential zones or read/write crossing over zones of different types > > (conventional->sequential zone boundary). > > > > I may not have the entire picture correctly here, but so far, this is my > > understanding. > > And I was wrong :) In light of Ming's comment + a little code refresher reading, > indeed, chunk_sectors will split BIOs so that *requests* do not exceed that > limit, but the initial BIO submission may be much larger regardless of > chunk_sectors. > > Ming, I think the point here is that building a large BIO first and splitting it > later (as opposed to limiting the bio size by stopping bio_add_page()) is more > efficient as there is only one bio submit instead of many, right ? Yeah, this way allows generic_make_request(submit_bio_noacct) to handle arbitrarily sized bios, so bio_add_page() becomes more efficient and much simpler, and stacking drivers are simplified too; for example, the original q->merge_bvec_fn() was killed. On the other hand, the cost of bio splitting is added. Especially for a stacking driver, the bio may be split twice: once in the stacking driver and once in the underlying device driver. Fortunately the underlying queue's limits are propagated to the stacking queue, so in theory the bio splitting in the stacking driver's ->submit_bio is enough most of the time. Thanks, Ming ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 2/3] block: use lcm_not_zero() when stacking chunk_sectors 2020-09-11 21:53 ` [PATCH 0/3] block: a few chunk_sectors fixes/improvements Mike Snitzer 2020-09-11 21:53 ` [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully Mike Snitzer @ 2020-09-11 21:53 ` Mike Snitzer 2020-09-12 13:58 ` Ming Lei 2020-09-11 21:53 ` [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 Mike Snitzer 2 siblings, 1 reply; 28+ messages in thread From: Mike Snitzer @ 2020-09-11 21:53 UTC (permalink / raw) To: Ming Lei; +Cc: Vijayendra Suman, dm-devel, linux-block Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using lcm_not_zero() rather than min_not_zero() -- otherwise the final 'chunk_sectors' could result in sub-optimal alignment of IO to component devices in the IO stack. Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size' then it is a bug in the driver and the device should be flagged as 'misaligned'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- block/blk-settings.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/block/blk-settings.c b/block/blk-settings.c index 76a7e03bcd6c..b09642d5f15e 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -534,6 +534,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, t->io_min = max(t->io_min, b->io_min); t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); + t->chunk_sectors = lcm_not_zero(t->chunk_sectors, b->chunk_sectors); /* Physical block size a multiple of the logical block size? */ if (t->physical_block_size & (t->logical_block_size - 1)) { @@ -556,6 +557,13 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, ret = -1; } + /* chunk_sectors a multiple of the physical block size? 
*/ + if (t->chunk_sectors & (t->physical_block_size - 1)) { + t->chunk_sectors = 0; + t->misaligned = 1; + ret = -1; + } + t->raid_partial_stripes_expensive = max(t->raid_partial_stripes_expensive, b->raid_partial_stripes_expensive); @@ -594,10 +602,6 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, t->discard_granularity; } - if (b->chunk_sectors) - t->chunk_sectors = min_not_zero(t->chunk_sectors, - b->chunk_sectors); - t->zoned = max(t->zoned, b->zoned); return ret; } -- 2.15.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 2/3] block: use lcm_not_zero() when stacking chunk_sectors 2020-09-11 21:53 ` [PATCH 2/3] block: use lcm_not_zero() when stacking chunk_sectors Mike Snitzer @ 2020-09-12 13:58 ` Ming Lei 0 siblings, 0 replies; 28+ messages in thread From: Ming Lei @ 2020-09-12 13:58 UTC (permalink / raw) To: Mike Snitzer; +Cc: linux-block, dm-devel, Vijayendra Suman On Fri, Sep 11, 2020 at 05:53:37PM -0400, Mike Snitzer wrote: > Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using > lcm_not_zero() rather than min_not_zero() -- otherwise the final > 'chunk_sectors' could result in sub-optimal alignment of IO to > component devices in the IO stack. > > Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size' > then it is a bug in the driver and the device should be flagged as > 'misaligned'. > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > block/blk-settings.c | 12 ++++++++---- > 1 file changed, 8 insertions(+), 4 deletions(-) > > diff --git a/block/blk-settings.c b/block/blk-settings.c > index 76a7e03bcd6c..b09642d5f15e 100644 > --- a/block/blk-settings.c > +++ b/block/blk-settings.c > @@ -534,6 +534,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, > > t->io_min = max(t->io_min, b->io_min); > t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); > + t->chunk_sectors = lcm_not_zero(t->chunk_sectors, b->chunk_sectors); > > /* Physical block size a multiple of the logical block size? */ > if (t->physical_block_size & (t->logical_block_size - 1)) { > @@ -556,6 +557,13 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, > ret = -1; > } > > + /* chunk_sectors a multiple of the physical block size? 
*/ > + if (t->chunk_sectors & (t->physical_block_size - 1)) { > + t->chunk_sectors = 0; > + t->misaligned = 1; > + ret = -1; > + } > + > t->raid_partial_stripes_expensive = > max(t->raid_partial_stripes_expensive, > b->raid_partial_stripes_expensive); > @@ -594,10 +602,6 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, > t->discard_granularity; > } > > - if (b->chunk_sectors) > - t->chunk_sectors = min_not_zero(t->chunk_sectors, > - b->chunk_sectors); > - > t->zoned = max(t->zoned, b->zoned); > return ret; > } Looks fine: Reviewed-by: Ming Lei <ming.lei@redhat.com> Thanks, Ming ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 2020-09-11 21:53 ` [PATCH 0/3] block: a few chunk_sectors fixes/improvements Mike Snitzer 2020-09-11 21:53 ` [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully Mike Snitzer 2020-09-11 21:53 ` [PATCH 2/3] block: use lcm_not_zero() when stacking chunk_sectors Mike Snitzer @ 2020-09-11 21:53 ` Mike Snitzer 2020-09-12 14:06 ` Ming Lei 2020-09-14 0:55 ` Damien Le Moal 2 siblings, 2 replies; 28+ messages in thread From: Mike Snitzer @ 2020-09-11 21:53 UTC (permalink / raw) To: Ming Lei; +Cc: Vijayendra Suman, dm-devel, linux-block It is possible for a block device to use a non power-of-2 for chunk size which results in a full-stripe size that is also a non power-of-2. Update blk_queue_chunk_sectors() and blk_max_size_offset() to accommodate drivers that need a non power-of-2 chunk_sectors. Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- block/blk-settings.c | 10 ++++------ include/linux/blkdev.h | 12 +++++++++--- 2 files changed, 13 insertions(+), 9 deletions(-) diff --git a/block/blk-settings.c b/block/blk-settings.c index b09642d5f15e..e40a162cc946 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -172,15 +172,13 @@ EXPORT_SYMBOL(blk_queue_max_hw_sectors); * * Description: * If a driver doesn't want IOs to cross a given chunk size, it can set - * this limit and prevent merging across chunks. Note that the chunk size - * must currently be a power-of-2 in sectors. Also note that the block - * layer must accept a page worth of data at any offset. So if the - * crossing of chunks is a hard limitation in the driver, it must still be - * prepared to split single page bios. + * this limit and prevent merging across chunks. Note that the block layer + * must accept a page worth of data at any offset. So if the crossing of + * chunks is a hard limitation in the driver, it must still be prepared + * to split single page bios. 
**/ void blk_queue_chunk_sectors(struct request_queue *q, unsigned int chunk_sectors) { - BUG_ON(!is_power_of_2(chunk_sectors)); q->limits.chunk_sectors = chunk_sectors; } EXPORT_SYMBOL(blk_queue_chunk_sectors); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 453a3d735d66..e72bcce22143 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1059,11 +1059,17 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q, static inline unsigned int blk_max_size_offset(struct request_queue *q, sector_t offset) { - if (!q->limits.chunk_sectors) + unsigned int chunk_sectors = q->limits.chunk_sectors; + + if (!chunk_sectors) return q->limits.max_sectors; - return min(q->limits.max_sectors, (unsigned int)(q->limits.chunk_sectors - - (offset & (q->limits.chunk_sectors - 1)))); + if (is_power_of_2(chunk_sectors)) + chunk_sectors -= (offset & (chunk_sectors - 1)); + else + chunk_sectors -= sector_div(offset, chunk_sectors); + + return min(q->limits.max_sectors, chunk_sectors); } static inline unsigned int blk_rq_get_max_sectors(struct request *rq, -- 2.15.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 2020-09-11 21:53 ` [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 Mike Snitzer @ 2020-09-12 14:06 ` Ming Lei 2020-09-14 2:43 ` Keith Busch 2020-09-14 0:55 ` Damien Le Moal 1 sibling, 1 reply; 28+ messages in thread From: Ming Lei @ 2020-09-12 14:06 UTC (permalink / raw) To: Mike Snitzer; +Cc: linux-block, dm-devel, Vijayendra Suman On Fri, Sep 11, 2020 at 05:53:38PM -0400, Mike Snitzer wrote: > It is possible for a block device to use a non power-of-2 for chunk > size which results in a full-stripe size that is also a non > power-of-2. > > Update blk_queue_chunk_sectors() and blk_max_size_offset() to > accommodate drivers that need a non power-of-2 chunk_sectors. > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > block/blk-settings.c | 10 ++++------ > include/linux/blkdev.h | 12 +++++++++--- > 2 files changed, 13 insertions(+), 9 deletions(-) > > diff --git a/block/blk-settings.c b/block/blk-settings.c > index b09642d5f15e..e40a162cc946 100644 > --- a/block/blk-settings.c > +++ b/block/blk-settings.c > @@ -172,15 +172,13 @@ EXPORT_SYMBOL(blk_queue_max_hw_sectors); > * > * Description: > * If a driver doesn't want IOs to cross a given chunk size, it can set > - * this limit and prevent merging across chunks. Note that the chunk size > - * must currently be a power-of-2 in sectors. Also note that the block > - * layer must accept a page worth of data at any offset. So if the > - * crossing of chunks is a hard limitation in the driver, it must still be > - * prepared to split single page bios. > + * this limit and prevent merging across chunks. Note that the block layer > + * must accept a page worth of data at any offset. So if the crossing of > + * chunks is a hard limitation in the driver, it must still be prepared > + * to split single page bios. 
> **/ > void blk_queue_chunk_sectors(struct request_queue *q, unsigned int chunk_sectors) > { > - BUG_ON(!is_power_of_2(chunk_sectors)); > q->limits.chunk_sectors = chunk_sectors; > } > EXPORT_SYMBOL(blk_queue_chunk_sectors); > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 453a3d735d66..e72bcce22143 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -1059,11 +1059,17 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q, > static inline unsigned int blk_max_size_offset(struct request_queue *q, > sector_t offset) > { > - if (!q->limits.chunk_sectors) > + unsigned int chunk_sectors = q->limits.chunk_sectors; > + > + if (!chunk_sectors) > return q->limits.max_sectors; > > - return min(q->limits.max_sectors, (unsigned int)(q->limits.chunk_sectors - > - (offset & (q->limits.chunk_sectors - 1)))); > + if (is_power_of_2(chunk_sectors)) > + chunk_sectors -= (offset & (chunk_sectors - 1)); > + else > + chunk_sectors -= sector_div(offset, chunk_sectors); > + > + return min(q->limits.max_sectors, chunk_sectors); > } > > static inline unsigned int blk_rq_get_max_sectors(struct request *rq, > -- > 2.15.0 > is_power_of_2() is cheap enough for fast path, so looks fine to support non-power-of-2 chunk sectors. Maybe NVMe PCI can remove the power_of_2() limit too. Reviewed-by: Ming Lei <ming.lei@redhat.com> Thanks, Ming ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 2020-09-12 14:06 ` Ming Lei @ 2020-09-14 2:43 ` Keith Busch 0 siblings, 0 replies; 28+ messages in thread From: Keith Busch @ 2020-09-14 2:43 UTC (permalink / raw) To: Ming Lei; +Cc: Mike Snitzer, Vijayendra Suman, dm-devel, linux-block On Sat, Sep 12, 2020 at 10:06:30PM +0800, Ming Lei wrote: > On Fri, Sep 11, 2020 at 05:53:38PM -0400, Mike Snitzer wrote: > > It is possible for a block device to use a non power-of-2 for chunk > > size which results in a full-stripe size that is also a non > > power-of-2. > > > > Update blk_queue_chunk_sectors() and blk_max_size_offset() to > > accommodate drivers that need a non power-of-2 chunk_sectors. > > > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > > --- > > block/blk-settings.c | 10 ++++------ > > include/linux/blkdev.h | 12 +++++++++--- > > 2 files changed, 13 insertions(+), 9 deletions(-) > > > > diff --git a/block/blk-settings.c b/block/blk-settings.c > > index b09642d5f15e..e40a162cc946 100644 > > --- a/block/blk-settings.c > > +++ b/block/blk-settings.c > > @@ -172,15 +172,13 @@ EXPORT_SYMBOL(blk_queue_max_hw_sectors); > > * > > * Description: > > * If a driver doesn't want IOs to cross a given chunk size, it can set > > - * this limit and prevent merging across chunks. Note that the chunk size > > - * must currently be a power-of-2 in sectors. Also note that the block > > - * layer must accept a page worth of data at any offset. So if the > > - * crossing of chunks is a hard limitation in the driver, it must still be > > - * prepared to split single page bios. > > + * this limit and prevent merging across chunks. Note that the block layer > > + * must accept a page worth of data at any offset. So if the crossing of > > + * chunks is a hard limitation in the driver, it must still be prepared > > + * to split single page bios. 
> > **/ > > void blk_queue_chunk_sectors(struct request_queue *q, unsigned int chunk_sectors) > > { > > - BUG_ON(!is_power_of_2(chunk_sectors)); > > q->limits.chunk_sectors = chunk_sectors; > > } > > EXPORT_SYMBOL(blk_queue_chunk_sectors); > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > > index 453a3d735d66..e72bcce22143 100644 > > --- a/include/linux/blkdev.h > > +++ b/include/linux/blkdev.h > > @@ -1059,11 +1059,17 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q, > > static inline unsigned int blk_max_size_offset(struct request_queue *q, > > sector_t offset) > > { > > - if (!q->limits.chunk_sectors) > > + unsigned int chunk_sectors = q->limits.chunk_sectors; > > + > > + if (!chunk_sectors) > > return q->limits.max_sectors; > > > > - return min(q->limits.max_sectors, (unsigned int)(q->limits.chunk_sectors - > > - (offset & (q->limits.chunk_sectors - 1)))); > > + if (is_power_of_2(chunk_sectors)) > > + chunk_sectors -= (offset & (chunk_sectors - 1)); > > + else > > + chunk_sectors -= sector_div(offset, chunk_sectors); > > + > > + return min(q->limits.max_sectors, chunk_sectors); > > } > > > > static inline unsigned int blk_rq_get_max_sectors(struct request *rq, > > -- > > 2.15.0 > > > > is_power_of_2() is cheap enough for fast path, so looks fine to support > non-power-of-2 chunk sectors. > > Maybe NVMe PCI can remove the power_of_2() limit too. I'd need to see the justification for that. The boundary is just a suggestion in NVMe. The majority of IO never crosses it so the calculation is usually wasted CPU cycles. Crossing the boundary is going to have to be very costly on the device side in order to justify the host side per-IO overhead for a non-power-of-2 split. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 2020-09-11 21:53 ` [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 Mike Snitzer 2020-09-12 14:06 ` Ming Lei @ 2020-09-14 0:55 ` Damien Le Moal 1 sibling, 0 replies; 28+ messages in thread From: Damien Le Moal @ 2020-09-14 0:55 UTC (permalink / raw) To: Mike Snitzer, Ming Lei; +Cc: Vijayendra Suman, dm-devel, linux-block On 2020/09/12 6:53, Mike Snitzer wrote: > It is possible for a block device to use a non power-of-2 for chunk > size which results in a full-stripe size that is also a non > power-of-2. > > Update blk_queue_chunk_sectors() and blk_max_size_offset() to > accommodate drivers that need a non power-of-2 chunk_sectors. > > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > block/blk-settings.c | 10 ++++------ > include/linux/blkdev.h | 12 +++++++++--- > 2 files changed, 13 insertions(+), 9 deletions(-) > > diff --git a/block/blk-settings.c b/block/blk-settings.c > index b09642d5f15e..e40a162cc946 100644 > --- a/block/blk-settings.c > +++ b/block/blk-settings.c > @@ -172,15 +172,13 @@ EXPORT_SYMBOL(blk_queue_max_hw_sectors); > * > * Description: > * If a driver doesn't want IOs to cross a given chunk size, it can set > - * this limit and prevent merging across chunks. Note that the chunk size > - * must currently be a power-of-2 in sectors. Also note that the block > - * layer must accept a page worth of data at any offset. So if the > - * crossing of chunks is a hard limitation in the driver, it must still be > - * prepared to split single page bios. > + * this limit and prevent merging across chunks. Note that the block layer > + * must accept a page worth of data at any offset. So if the crossing of > + * chunks is a hard limitation in the driver, it must still be prepared > + * to split single page bios. 
> **/
>  void blk_queue_chunk_sectors(struct request_queue *q, unsigned int chunk_sectors)
>  {
> -	BUG_ON(!is_power_of_2(chunk_sectors));
>  	q->limits.chunk_sectors = chunk_sectors;
>  }
>  EXPORT_SYMBOL(blk_queue_chunk_sectors);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 453a3d735d66..e72bcce22143 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1059,11 +1059,17 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
>  static inline unsigned int blk_max_size_offset(struct request_queue *q,
>  					       sector_t offset)
>  {
> -	if (!q->limits.chunk_sectors)
> +	unsigned int chunk_sectors = q->limits.chunk_sectors;
> +
> +	if (!chunk_sectors)
>  		return q->limits.max_sectors;
>  
> -	return min(q->limits.max_sectors, (unsigned int)(q->limits.chunk_sectors -
> -			(offset & (q->limits.chunk_sectors - 1))));
> +	if (is_power_of_2(chunk_sectors))
> +		chunk_sectors -= (offset & (chunk_sectors - 1));

I do not think you need the outer parenthesis here.

> +	else
> +		chunk_sectors -= sector_div(offset, chunk_sectors);
> +
> +	return min(q->limits.max_sectors, chunk_sectors);
> }
> 
> static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> -- 

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread
end of thread, other threads:[~2020-09-16 14:56 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <529c2394-1b58-b9d8-d462-1f3de1b78ac8@oracle.com>
2020-09-10 14:24 ` Revert "dm: always call blk_queue_split() in dm_process_bio()" Mike Snitzer
2020-09-10 19:29   ` Vijayendra Suman
2020-09-15  1:33     ` Mike Snitzer
2020-09-15 17:03       ` Mike Snitzer
2020-09-16 14:56         ` Vijayendra Suman
2020-09-11 12:20   ` Ming Lei
2020-09-11 16:13     ` Mike Snitzer
2020-09-11 21:53       ` [PATCH 0/3] block: a few chunk_sectors fixes/improvements Mike Snitzer
2020-09-11 21:53         ` [PATCH 1/3] block: fix blk_rq_get_max_sectors() to flow more carefully Mike Snitzer
2020-09-12 13:52           ` Ming Lei
2020-09-14  0:43             ` Damien Le Moal
2020-09-14 14:52               ` Mike Snitzer
2020-09-14 23:28                 ` Damien Le Moal
2020-09-15  2:03                   ` Ming Lei
2020-09-15  2:15                     ` Damien Le Moal
2020-09-14 14:49             ` Mike Snitzer
2020-09-15  1:50               ` Ming Lei
2020-09-14  0:46           ` Damien Le Moal
2020-09-14 15:03             ` Mike Snitzer
2020-09-15  1:09               ` Damien Le Moal
2020-09-15  4:21                 ` Damien Le Moal
2020-09-15  8:01                   ` Ming Lei
2020-09-11 21:53         ` [PATCH 2/3] block: use lcm_not_zero() when stacking chunk_sectors Mike Snitzer
2020-09-12 13:58           ` Ming Lei
2020-09-11 21:53         ` [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2 Mike Snitzer
2020-09-12 14:06           ` Ming Lei
2020-09-14  2:43             ` Keith Busch
2020-09-14  0:55           ` Damien Le Moal