* Switching to MQ by default may generate some bug reports
@ 2017-08-03 8:51 Mel Gorman
2017-08-03 9:17 ` Ming Lei
2017-08-03 9:21 ` Paolo Valente
0 siblings, 2 replies; 29+ messages in thread
From: Mel Gorman @ 2017-08-03 8:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, linux-block, Paolo Valente
Hi Christoph,
I know the reasons for switching to MQ by default, but just be aware that it's
not without hazards; the biggest issues I've seen come from switching from
CFQ to BFQ. On my home grid, there is some experimental automatic testing
running every few weeks searching for regressions. Yesterday, it noticed
that creating some work files for a postgres simulator called pgioperf
was 38.33% slower, and it auto-bisected to the switch to MQ. The test simply
writes two files linearly in preparation for another benchmark and is nothing
remarkable. The relevant part of the report is
Last good/First bad commit
==========================
Last good commit: 6d311fa7d2c18659d040b9beba5e41fe24c2a6f5
First bad commit: 5c279bd9e40624f4ab6e688671026d6005b066fa
From 5c279bd9e40624f4ab6e688671026d6005b066fa Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 16 Jun 2017 10:27:55 +0200
Subject: [PATCH] scsi: default to scsi-mq
Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O
path now that we had plenty of testing, and have I/O schedulers for
blk-mq. The module option to disable the blk-mq path is kept around for
now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
drivers/scsi/Kconfig | 11 -----------
drivers/scsi/scsi.c | 4 ----
2 files changed, 15 deletions(-)
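For anyone bisecting to this commit, the module option the commit message says is kept around can be used to fall back to the legacy I/O path at boot. A minimal sketch; the parameter below is the scsi_mod option referred to above, but verify it against your kernel's kernel-parameters documentation before relying on it:

```sh
# Kernel command line fragment to keep the legacy (non-blk-mq) SCSI path:
scsi_mod.use_blk_mq=0

# Or, equivalently, as a module option if scsi_mod is built as a module:
#   options scsi_mod use_blk_mq=0     (in /etc/modprobe.d/scsi.conf)
```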
Comparison
==========
                              initial              initial                 last                penup                first
                           good-v4.12     bad-16f73eb02d7e        good-6d311fa7        good-d06c587d         bad-5c279bd9
User     min            0.06 (   0.00%)     0.14 (-133.33%)      0.14 (-133.33%)      0.06 (   0.00%)      0.19 (-216.67%)
User     mean           0.06 (   0.00%)     0.14 (-133.33%)      0.14 (-133.33%)      0.06 (   0.00%)      0.19 (-216.67%)
User     stddev         0.00 (   0.00%)     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
User     coeffvar       0.00 (   0.00%)     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
User     max            0.06 (   0.00%)     0.14 (-133.33%)      0.14 (-133.33%)      0.06 (   0.00%)      0.19 (-216.67%)
System   min           10.04 (   0.00%)    10.75 (  -7.07%)     10.05 (  -0.10%)     10.16 (  -1.20%)     10.73 (  -6.87%)
System   mean          10.04 (   0.00%)    10.75 (  -7.07%)     10.05 (  -0.10%)     10.16 (  -1.20%)     10.73 (  -6.87%)
System   stddev         0.00 (   0.00%)     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
System   coeffvar       0.00 (   0.00%)     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
System   max           10.04 (   0.00%)    10.75 (  -7.07%)     10.05 (  -0.10%)     10.16 (  -1.20%)     10.73 (  -6.87%)
Elapsed  min          251.53 (   0.00%)   351.05 ( -39.57%)    252.83 (  -0.52%)    252.96 (  -0.57%)    347.93 ( -38.33%)
Elapsed  mean         251.53 (   0.00%)   351.05 ( -39.57%)    252.83 (  -0.52%)    252.96 (  -0.57%)    347.93 ( -38.33%)
Elapsed  stddev         0.00 (   0.00%)     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Elapsed  coeffvar       0.00 (   0.00%)     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
Elapsed  max          251.53 (   0.00%)   351.05 ( -39.57%)    252.83 (  -0.52%)    252.96 (  -0.57%)    347.93 ( -38.33%)
CPU      min            4.00 (   0.00%)     3.00 (  25.00%)      4.00 (   0.00%)      4.00 (   0.00%)      3.00 (  25.00%)
CPU      mean           4.00 (   0.00%)     3.00 (  25.00%)      4.00 (   0.00%)      4.00 (   0.00%)      3.00 (  25.00%)
CPU      stddev         0.00 (   0.00%)     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
CPU      coeffvar       0.00 (   0.00%)     0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
CPU      max            4.00 (   0.00%)     3.00 (  25.00%)      4.00 (   0.00%)      4.00 (   0.00%)      3.00 (  25.00%)
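As an aside, the bracketed deltas in these reports are plain relative differences against the baseline column, so for time-like metrics (lower is better) a slower kernel shows a negative percentage. A sketch of the arithmetic; "pct_delta" is a made-up name, not part of mmtests itself:

```shell
# delta = (baseline - value) / baseline * 100
pct_delta() {
    awk -v base="$1" -v val="$2" 'BEGIN { printf "%.2f\n", (base - val) / base * 100 }'
}

pct_delta 251.53 347.93   # Elapsed mean, good-v4.12 vs bad-5c279bd9 -> -38.33
pct_delta 0.06 0.19       # User mean -> -216.67
```

This is where the 38.33% figure in the opening paragraph comes from.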
The "Elapsed mean" line is what the testing and auto-bisection were paying
attention to. Commit 16f73eb02d7e is simply the head commit at the time the
continuous testing started. The first "bad commit" is the last column.
It's not the only slowdown observed in other testing while examining whether
it's ok to switch to MQ by default. The biggest slowdown observed was with a
modified version of dbench4 -- the modifications use shorter, but
representative, load files to avoid timing artifacts and report the time to
complete a load file instead of throughput, as throughput is kind of
meaningless for dbench4.
dbench4 Loadfile Execution Time
                          4.12.0                4.12.0
                      legacy-cfq                mq-bfq
Amean      1       80.67 (  0.00%)      83.68 ( -3.74%)
Amean      2       92.87 (  0.00%)     121.63 (-30.96%)
Amean      4      102.72 (  0.00%)     474.33 (-361.77%)
Amean     32     2543.93 (  0.00%)    1927.65 ( 24.23%)
The units are "milliseconds to complete a load file", so as thread count
increased, there were some fairly bad slowdowns. The most dramatic slowdown
was observed on a machine with a controller with an on-board cache:
                          4.12.0                4.12.0
                      legacy-cfq                mq-bfq
Amean      1      289.09 (  0.00%)     128.43 ( 55.57%)
Amean      2      491.32 (  0.00%)     794.04 ( -61.61%)
Amean      4      875.26 (  0.00%)    9331.79 (-966.17%)
Amean      8     2074.30 (  0.00%)     317.79 ( 84.68%)
Amean     16     3380.47 (  0.00%)     669.51 ( 80.19%)
Amean     32     7427.25 (  0.00%)    8821.75 ( -18.78%)
Amean    256    53376.81 (  0.00%)   69006.94 ( -29.28%)
The slowdown wasn't universal but at 4 threads, it was severe. There
are other examples but it'd just be a lot of noise and not change the
central point.
The major problems were all observed when switching from CFQ to BFQ on
single-disk rotary storage. It's not machine-specific, as 5 separate machines
noticed problems with dbench and fio when switching to MQ on kernel 4.12.
Weirdly, I've seen cases of read starvation in the presence of heavy writers
when using fio to generate the workload. Jan Kara suggested that it may be
because the read workload is not being identified as "interactive", but I
didn't dig into the details myself and have zero understanding of BFQ. I was
only interested in answering the question "is it safe to switch the default
and will the performance be similar enough to avoid bug reports?" and
concluded that the answer is "no".
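For reference, the starvation scenario above can be approximated with a fio job along these lines. This is a hypothetical sketch, not the actual mmtests job definition; the directory, sizes, and runtimes are assumptions:

```ini
; Hypothetical fio job: buffered sequential writers vs one direct random reader
[global]
directory=/mnt/testdisk
runtime=300
time_based=1

[heavy-writers]
numjobs=4
rw=write
bs=1M
size=4G

[random-reader]
numjobs=1
rw=randread
direct=1
bs=4k
size=1G
```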
For what it's worth, I've noticed on SSDs that switching from legacy-deadline
to mq-deadline also slowed things down, but in many cases the slowdown was
small enough that it may be tolerable and not generate many bug reports. Also,
mq-deadline appears to receive more attention, so issues there are probably
going to be noticed faster.
I'm not suggesting for a second that you fix this or switch back to legacy by
default. Because it's BFQ, Paolo is cc'd; it'll have to be fixed eventually,
but you might see "workload foo is slower on 4.13" reports that bisect to this
commit. Which filesystem is used changes the results, but at least btrfs,
ext3, ext4 and xfs experience slowdowns.
For Paolo: if you want to try preemptively dealing with regression reports
before 4.13 releases, then all the tests in question can be reproduced with
https://github.com/gormanm/mmtests . The most relevant test configurations
I've seen so far are
configs/config-global-dhp__io-dbench4-async
configs/config-global-dhp__io-fio-randread-async-randwrite
configs/config-global-dhp__io-fio-randread-async-seqwrite
configs/config-global-dhp__io-fio-randread-sync-heavywrite
configs/config-global-dhp__io-fio-randread-sync-randwrite
configs/config-global-dhp__pgioperf
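A hypothetical invocation for one of these configurations is sketched below; the script name and argument order should be checked against the mmtests README, and RUNNAME is a placeholder:

```sh
# Hedged sketch of reproducing one configuration with mmtests
git clone https://github.com/gormanm/mmtests
cd mmtests
./run-mmtests.sh --config configs/config-global-dhp__io-dbench4-async RUNNAME
```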
--
Mel Gorman
SUSE Labs
* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  8:51 Switching to MQ by default may generate some bug reports Mel Gorman
@ 2017-08-03  9:17 ` Ming Lei
  2017-08-03  9:32   ` Ming Lei
  2017-08-03  9:42   ` Mel Gorman
  2017-08-03  9:21 ` Paolo Valente
1 sibling, 2 replies; 29+ messages in thread
From: Ming Lei @ 2017-08-03 9:17 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List, linux-block, Paolo Valente

Hi Mel Gorman,

On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> Hi Christoph,
>
> I know the reasons for switching to MQ by default but just be aware that it's
> not without hazards [...] Yesterday, it noticed
> that creating some work files for a postgres simulator called pgioperf
> was 38.33% slower and it auto-bisected to the switch to MQ. [...]

We saw some SCSI-MQ performance issues too; please see if the following
patchset fixes your issue:

http://marc.info/?l=linux-block&m=150151989915776&w=2

Thanks,
Ming
* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  9:17 ` Ming Lei
@ 2017-08-03  9:32   ` Ming Lei
  2017-08-03  9:42   ` Mel Gorman
1 sibling, 0 replies; 29+ messages in thread
From: Ming Lei @ 2017-08-03 9:32 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List, linux-block, Paolo Valente, Ming Lei

On Thu, Aug 3, 2017 at 5:17 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> We saw some SCSI-MQ performance issues too; please see if the following
> patchset fixes your issue:
>
> http://marc.info/?l=linux-block&m=150151989915776&w=2

BTW, the above patches (V1) can be found in the following tree:

https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V1

V2 is already done but not posted yet, because the performance test on
SRP isn't completed:

https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V2

Thanks,
Ming Lei
* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  9:17 ` Ming Lei
  2017-08-03  9:32   ` Ming Lei
@ 2017-08-03  9:42   ` Mel Gorman
  2017-08-03  9:44     ` Paolo Valente
  2017-08-03  9:57     ` Ming Lei
1 sibling, 2 replies; 29+ messages in thread
From: Mel Gorman @ 2017-08-03 9:42 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List, linux-block, Paolo Valente

On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
> We saw some SCSI-MQ performance issues too; please see if the following
> patchset fixes your issue:
>
> http://marc.info/?l=linux-block&m=150151989915776&w=2

That series is dealing with problems with legacy-deadline vs mq-none, whereas
the bulk of the problems reported in this mail are related to legacy-CFQ vs
mq-BFQ.

-- 
Mel Gorman
SUSE Labs
* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  9:42 ` Mel Gorman
@ 2017-08-03  9:44   ` Paolo Valente
  2017-08-03 10:46     ` Mel Gorman
  2017-08-03  9:57   ` Ming Lei
1 sibling, 1 reply; 29+ messages in thread
From: Paolo Valente @ 2017-08-03 9:44 UTC (permalink / raw)
To: Mel Gorman
Cc: Ming Lei, Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List, linux-block

> On 03 Aug 2017, at 11:42, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> That series is dealing with problems with legacy-deadline vs mq-none, whereas
> the bulk of the problems reported in this mail are related to legacy-CFQ vs
> mq-BFQ.

Out of curiosity: do you get no regression with mq-none or mq-deadline?

Thanks,
Paolo
* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  9:44 ` Paolo Valente
@ 2017-08-03 10:46   ` Mel Gorman
0 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2017-08-03 10:46 UTC (permalink / raw)
To: Paolo Valente
Cc: Ming Lei, Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List, linux-block

On Thu, Aug 03, 2017 at 11:44:06AM +0200, Paolo Valente wrote:
> Out of curiosity: do you get no regression with mq-none or mq-deadline?

I didn't test mq-none, as the underlying storage was not fast enough to make
a legacy-noop vs mq-none comparison meaningful. legacy-deadline vs mq-deadline
did show small regressions on some workloads, but not as dramatic, and small
enough that they would go unnoticed in some cases.

-- 
Mel Gorman
SUSE Labs
* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  9:42 ` Mel Gorman
  2017-08-03  9:44   ` Paolo Valente
@ 2017-08-03  9:57   ` Ming Lei
  2017-08-03 10:47     ` Mel Gorman
1 sibling, 1 reply; 29+ messages in thread
From: Ming Lei @ 2017-08-03 9:57 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List, linux-block, Paolo Valente

On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> That series is dealing with problems with legacy-deadline vs mq-none, whereas
> the bulk of the problems reported in this mail are related to legacy-CFQ vs
> mq-BFQ.

The series deals with none and all mq schedulers, and you can see the
improvement on mq-deadline in the cover letter. :-)

Thanks,
Ming Lei
* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  9:57 ` Ming Lei
@ 2017-08-03 10:47   ` Mel Gorman
  2017-08-03 11:48     ` Ming Lei
0 siblings, 1 reply; 29+ messages in thread
From: Mel Gorman @ 2017-08-03 10:47 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List, linux-block, Paolo Valente

On Thu, Aug 03, 2017 at 05:57:50PM +0800, Ming Lei wrote:
> The series deals with none and all mq schedulers, and you can see the
> improvement on mq-deadline in the cover letter. :-)

Would it be expected to fix a 2x to 4x slowdown as experienced by BFQ that
was not observed on other schedulers?

-- 
Mel Gorman
SUSE Labs
* Re: Switching to MQ by default may generate some bug reports
  2017-08-03 10:47 ` Mel Gorman
@ 2017-08-03 11:48   ` Ming Lei
0 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2017-08-03 11:48 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List, linux-block, Paolo Valente

On Thu, Aug 3, 2017 at 6:47 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> Would it be expected to fix a 2x to 4x slowdown as experienced by BFQ that
> was not observed on other schedulers?

Actually, if you look at the cover letter, you will see that this patchset
increases sequential I/O IOPS by more than 10X on mq-deadline, so it would be
reasonable to see an effect on the 2x to 4x BFQ slowdown too, but I didn't
test BFQ.

Thanks,
Ming Lei
* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  8:51 Switching to MQ by default may generate some bug reports Mel Gorman
  2017-08-03  9:17 ` Ming Lei
@ 2017-08-03  9:21 ` Paolo Valente
  2017-08-03 11:01   ` Mel Gorman
1 sibling, 1 reply; 29+ messages in thread
From: Paolo Valente @ 2017-08-03 9:21 UTC (permalink / raw)
To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block

> On 03 Aug 2017, at 10:51, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> I know the reasons for switching to MQ by default but just be aware that it's
> not without hazards [...]
>
> For Paolo: if you want to try preemptively dealing with regression reports
> before 4.13 releases, then all the tests in question can be reproduced with
> https://github.com/gormanm/mmtests . The most relevant test configurations
> I've seen so far are
>
> configs/config-global-dhp__io-dbench4-async
> configs/config-global-dhp__io-fio-randread-async-randwrite
> configs/config-global-dhp__io-fio-randread-async-seqwrite
> configs/config-global-dhp__io-fio-randread-sync-heavywrite
> configs/config-global-dhp__io-fio-randread-sync-randwrite
> configs/config-global-dhp__pgioperf

Hi Mel,
as already happened with the latest Phoronix benchmark article (and with
other test results reported several months ago on this list), bad results
may be caused (also) by the fact that the low-latency, default
configuration of BFQ is being used. This configuration is the default one
because the motivation for yet another scheduler such as BFQ is that it
drastically reduces latency for interactive and soft real-time tasks
(e.g., opening an app or playing/streaming a video) when there is some
background I/O. Low-latency heuristics are willing to sacrifice throughput
when this provides a large benefit in terms of the above latency.

Things do change if, instead, one wants to use BFQ for tasks that don't
need this kind of low-latency guarantee, but only the highest possible
sustained throughput. This seems to be the case for all the tests you have
listed above. In that case, it doesn't make much sense to leave the
low-latency heuristics on: throughput can only get worse for these tests,
and the elapsed time can only increase.

How to switch the low-latency heuristics off?

echo 0 > /sys/block/<dev>/queue/iosched/low_latency

Of course, BFQ may not be optimal for every workload even with low-latency
mode switched off. In addition, there may still be some bug. I'll repeat
your tests on a machine of mine ASAP.

Thanks,
Paolo
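The sysfs knob above can be flipped for every device currently using bfq with a small loop. A sketch; discovery via globbing is an assumption, and the writes require root:

```sh
# Turn BFQ's low-latency heuristics off wherever the knob exists
# (a harmless no-op on devices not using bfq).
for f in /sys/block/*/queue/iosched/low_latency; do
    [ -e "$f" ] && echo 0 > "$f"
done
```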
* Re: Switching to MQ by default may generate some bug reports 2017-08-03 9:21 ` Paolo Valente @ 2017-08-03 11:01 ` Mel Gorman 2017-08-04 7:26 ` Paolo Valente 0 siblings, 1 reply; 29+ messages in thread From: Mel Gorman @ 2017-08-03 11:01 UTC (permalink / raw) To: Paolo Valente; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block On Thu, Aug 03, 2017 at 11:21:59AM +0200, Paolo Valente wrote: > > For Paulo, if you want to try preemptively dealing with regression reports > > before 4.13 releases then all the tests in question can be reproduced with > > https://github.com/gormanm/mmtests . The most relevant test configurations > > I've seen so far are > > > > configs/config-global-dhp__io-dbench4-async > > configs/config-global-dhp__io-fio-randread-async-randwrite > > configs/config-global-dhp__io-fio-randread-async-seqwrite > > configs/config-global-dhp__io-fio-randread-sync-heavywrite > > configs/config-global-dhp__io-fio-randread-sync-randwrite > > configs/config-global-dhp__pgioperf > > > > Hi Mel, > as it already happened with the latest Phoronix benchmark article (and > with other test results reported several months ago on this list), bad > results may be caused (also) by the fact that the low-latency, default > configuration of BFQ is being used. I took that into account BFQ with low-latency was also tested and the impact was not a universal improvement although it can be a noticable improvement. From the same machine; dbench4 Loadfile Execution Time 4.12.0 4.12.0 4.12.0 legacy-cfq mq-bfq mq-bfq-tput Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%) Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%) Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%) Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%) However, it's not a universal gain and there are also fairness issues. 
For example, this is a fio configuration with a single random reader and a single random writer on the same machine fio Throughput 4.12.0 4.12.0 4.12.0 legacy-cfq mq-bfq mq-bfq-tput Hmean kb/sec-writer-write 398.15 ( 0.00%) 4659.18 (1070.21%) 4934.52 (1139.37%) Hmean kb/sec-reader-read 507.00 ( 0.00%) 66.36 ( -86.91%) 14.68 ( -97.10%) With CFQ, there is some fairness between the readers and writers and with BFQ, there is a strong preference to writers. Again, this is not universal. It'll be a mix and sometimes it'll be classed as a gain and sometimes a regression. While I accept that BFQ can be tuned, tuning IO schedulers is not something that normal users get right and they'll only look at "out of box" performance which, right now, will trigger bug reports. This is neither good nor bad, it simply is. > This configuration is the default > one because the motivation for yet-another-scheduler as BFQ is that it > drastically reduces latency for interactive and soft real-time tasks > (e.g., opening an app or playing/streaming a video), when there is > some background I/O. Low-latency heuristics are willing to sacrifice > throughput when this provides a large benefit in terms of the above > latency. > I had seen this assertion so one of the fio configurations had multiple heavy writers in the background and a random reader of small files to simulate that scenario. The intent was to simulate heavy IO in the presence of application startup 4.12.0 4.12.0 4.12.0 legacy-cfq mq-bfq mq-bfq-tput Hmean kb/sec-writer-write 1997.75 ( 0.00%) 2035.65 ( 1.90%) 2014.50 ( 0.84%) Hmean kb/sec-reader-read 128.50 ( 0.00%) 79.46 ( -38.16%) 12.78 ( -90.06%) Write throughput is steady-ish across each IO scheduler but readers get starved badly which I expect would slow application startup and disabling low_latency makes it much worse. The mmtests configuration in question is global-dhp__io-fio-randread-sync-heavywrite albeit editted to create a fresh XFS filesystem on a test partition. 
This is not exactly equivalent to real application startup but that can be difficult to quantify properly. > Of course, BFQ may not be optimal for every workload, even if > low-latency mode is switched off. In addition, there may still be > some bug. I'll repeat your tests on a machine of mine ASAP. > The intent here is not to rag on BFQ because I know it's going to have some wins and some losses and will take time to fix up. The primary intent was to flag that 4.13 might have some "blah blah blah is slower on 4.13" reports due to the switching of defaults that will bisect to a misleading commit. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
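For readers reproducing the comparisons above: the three columns differ only in which scheduler is selected and, for mq-bfq-tput, in BFQ's low_latency tunable being switched off. A minimal sketch of how a test harness might drive this through sysfs is below; the helper name and the `sysfs` parameter are illustrative inventions (the parameter exists only so the function can be exercised against a fake directory tree without root), while the `queue/scheduler` and `queue/iosched/low_latency` attributes themselves are real.

```python
import os

def set_io_scheduler(dev, sched, low_latency=None, sysfs="/sys"):
    """Select an I/O scheduler for a block device and, for bfq,
    optionally toggle its low_latency heuristic. Assumes a blk-mq
    kernel with the requested scheduler available for this device."""
    queue = os.path.join(sysfs, "block", dev, "queue")
    # Writing the scheduler name to queue/scheduler switches schedulers
    # at runtime.
    with open(os.path.join(queue, "scheduler"), "w") as f:
        f.write(sched)
    if low_latency is not None:
        # BFQ exposes its tunables under queue/iosched/; low_latency=0
        # corresponds to the "mq-bfq-tput" column in the tables above.
        with open(os.path.join(queue, "iosched", "low_latency"), "w") as f:
            f.write("1" if low_latency else "0")
```

For example, `set_io_scheduler("sda", "bfq", low_latency=False)` would (run as root on a real system) set up the mq-bfq-tput configuration.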
* Re: Switching to MQ by default may generate some bug reports 2017-08-03 11:01 ` Mel Gorman @ 2017-08-04 7:26 ` Paolo Valente 2017-08-04 11:01 ` Mel Gorman 0 siblings, 1 reply; 29+ messages in thread From: Paolo Valente @ 2017-08-04 7:26 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block > Il giorno 03 ago 2017, alle ore 13:01, Mel Gorman <mgorman@techsingularity.net> ha scritto: > > On Thu, Aug 03, 2017 at 11:21:59AM +0200, Paolo Valente wrote: >>> For Paulo, if you want to try preemptively dealing with regression reports >>> before 4.13 releases then all the tests in question can be reproduced with >>> https://github.com/gormanm/mmtests . The most relevant test configurations >>> I've seen so far are >>> >>> configs/config-global-dhp__io-dbench4-async >>> configs/config-global-dhp__io-fio-randread-async-randwrite >>> configs/config-global-dhp__io-fio-randread-async-seqwrite >>> configs/config-global-dhp__io-fio-randread-sync-heavywrite >>> configs/config-global-dhp__io-fio-randread-sync-randwrite >>> configs/config-global-dhp__pgioperf >>> >> >> Hi Mel, >> as it already happened with the latest Phoronix benchmark article (and >> with other test results reported several months ago on this list), bad >> results may be caused (also) by the fact that the low-latency, default >> configuration of BFQ is being used. > > I took that into account BFQ with low-latency was also tested and the > impact was not a universal improvement although it can be a noticable > improvement. From the same machine; > > dbench4 Loadfile Execution Time > 4.12.0 4.12.0 4.12.0 > legacy-cfq mq-bfq mq-bfq-tput > Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%) > Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%) > Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%) > Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%) > Thanks for trying with low_latency disabled. 
If I read the numbers correctly, we move from a worst case of 361%
higher execution time to a worst case of 11%, with a best case of 20%
lower execution time.

I asked you about none and mq-deadline in a previous email, because
actually we have a double change here: a change of the I/O stack, and
a change of the scheduler, with the first change probably not irrelevant
with respect to the second one.

Are we sure that part of the small losses and gains with bfq-mq-tput
aren't due to the change of I/O stack? My problem is that it may be
hard to find issues or anomalies in BFQ that justify a 5% or 11% loss
in two cases, while the same scheduler has a 4% and a 20% gain in the
other two cases.

By chance, according to what you have measured so far, is there any
test where, instead, you expect or have seen bfq-mq-tput to always
lose? I could start from there.

> However, it's not a universal gain and there are also fairness issues.
> For example, this is a fio configuration with a single random reader and
> a single random writer on the same machine
>
> fio Throughput
>                                      4.12.0             4.12.0             4.12.0
>                                  legacy-cfq             mq-bfq        mq-bfq-tput
> Hmean     kb/sec-writer-write    398.15 (  0.00%)  4659.18 (1070.21%)  4934.52 (1139.37%)
> Hmean     kb/sec-reader-read     507.00 (  0.00%)    66.36 ( -86.91%)    14.68 ( -97.10%)
>
> With CFQ, there is some fairness between the readers and writers and
> with BFQ, there is a strong preference to writers. Again, this is not
> universal. It'll be a mix and sometimes it'll be classed as a gain and
> sometimes a regression.
>

Yes, that's why I didn't pay too much attention so far to such an
issue. I preferred to tune for maximum responsiveness and minimal
latency for soft real-time applications, rather than reducing a kind of
unfairness for which no user happened to complain (so far). Do you
have some real application (or benchmark simulating a real
application) in which we can see actual problems because of this form
of unfairness?
I was thinking of, e.g., two virtual machines, one doing heavy writes and the other heavy reads. But in that case, cgroups have to be used, and I'm not sure we would still see this problem. Any suggestion is welcome. In any case, if needed, changing read/write throughput ratio should not be a problem. > While I accept that BFQ can be tuned, tuning IO schedulers is not something > that normal users get right and they'll only look at "out of box" performance > which, right now, will trigger bug reports. This is neither good nor bad, > it simply is. > >> This configuration is the default >> one because the motivation for yet-another-scheduler as BFQ is that it >> drastically reduces latency for interactive and soft real-time tasks >> (e.g., opening an app or playing/streaming a video), when there is >> some background I/O. Low-latency heuristics are willing to sacrifice >> throughput when this provides a large benefit in terms of the above >> latency. >> > > I had seen this assertion so one of the fio configurations had multiple > heavy writers in the background and a random reader of small files to > simulate that scenario. The intent was to simulate heavy IO in the presence > of application startup > > 4.12.0 4.12.0 4.12.0 > legacy-cfq mq-bfq mq-bfq-tput > Hmean kb/sec-writer-write 1997.75 ( 0.00%) 2035.65 ( 1.90%) 2014.50 ( 0.84%) > Hmean kb/sec-reader-read 128.50 ( 0.00%) 79.46 ( -38.16%) 12.78 ( -90.06%) > > Write throughput is steady-ish across each IO scheduler but readers get > starved badly which I expect would slow application startup and disabling > low_latency makes it much worse. A greedy random reader that goes on steadily mimics an application startup only for the first handful of seconds. Where can I find the exact script/configuration you used, to check more precisely what is going on and whether BFQ is actually behaving very badly for some reason? 
> The mmtests configuration in question > is global-dhp__io-fio-randread-sync-heavywrite albeit editted to create > a fresh XFS filesystem on a test partition. > > This is not exactly equivalent to real application startup but that can > be difficult to quantify properly. > If you do want to check application startup, then just 1) start some background workload, 2) drop caches, 3) start the app, 4) measure how long it takes to start. Otherwise, the comm_startup_lat test in the S suite [1] does all of this for you. [1] https://github.com/Algodev-github/S >> Of course, BFQ may not be optimal for every workload, even if >> low-latency mode is switched off. In addition, there may still be >> some bug. I'll repeat your tests on a machine of mine ASAP. >> > > The intent here is not to rag on BFQ because I know it's going to have some > wins and some losses and will take time to fix up. The primary intent was > to flag that 4.13 might have some "blah blah blah is slower on 4.13" reports > due to the switching of defaults that will bisect to a misleading commit. > I see, and being ready in advance is extremely helpful for me. Thanks, Paolo > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
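For readers without mmtests to hand, the "single random reader against heavy background writers" scenario discussed above maps naturally onto a fio job file. The sketch below is an illustrative guess at such a job, NOT the actual global-dhp__io-fio-randread-sync-heavywrite configuration; every size, runtime, and thread count here is a made-up placeholder.

```ini
; Illustrative sketch only -- not the mmtests job file.
[global]
directory=/testfs
bs=4k
direct=0
runtime=300
time_based

[heavy-writers]
rw=write
size=4g
numjobs=4

[random-reader]
rw=randread
size=1g
numjobs=1
```

Comparing the per-job bandwidth fio reports under each scheduler reproduces the kind of reader/writer fairness split shown in the Hmean tables.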
* Re: Switching to MQ by default may generate some bug reports 2017-08-04 7:26 ` Paolo Valente @ 2017-08-04 11:01 ` Mel Gorman 2017-08-04 22:05 ` Paolo Valente 0 siblings, 1 reply; 29+ messages in thread From: Mel Gorman @ 2017-08-04 11:01 UTC (permalink / raw) To: Paolo Valente; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote: > > I took that into account BFQ with low-latency was also tested and the > > impact was not a universal improvement although it can be a noticable > > improvement. From the same machine; > > > > dbench4 Loadfile Execution Time > > 4.12.0 4.12.0 4.12.0 > > legacy-cfq mq-bfq mq-bfq-tput > > Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%) > > Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%) > > Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%) > > Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%) > > > > Thanks for trying with low_latency disabled. If I read numbers > correctly, we move from a worst case of 361% higher execution time to > a worst case of 11%. With a best case of 20% of lower execution time. > Yes. > I asked you about none and mq-deadline in a previous email, because > actually we have a double change here: change of the I/O stack, and > change of the scheduler, with the first change probably not irrelevant > with respect to the second one. > True. However, the difference between legacy-deadline mq-deadline is roughly around the 5-10% mark across workloads for SSD. It's not universally true but the impact is not as severe. While this is not proof that the stack change is the sole root cause, it makes it less likely. > By chance, according to what you have measured so far, is there any > test where, instead, you expect or have seen bfq-mq-tput to always > lose? I could start from there. > global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that it could be the stack change. 
global-dhp__io-dbench4-fsync-ext4 was a universal loss across any machine tested. This is global-dhp__io-dbench4-fsync from mmtests using ext4 as a filesystem. The same is not true for XFS so the filesystem matters. > > However, it's not a universal gain and there are also fairness issues. > > For example, this is a fio configuration with a single random reader and > > a single random writer on the same machine > > > > fio Throughput > > 4.12.0 4.12.0 4.12.0 > > legacy-cfq mq-bfq mq-bfq-tput > > Hmean kb/sec-writer-write 398.15 ( 0.00%) 4659.18 (1070.21%) 4934.52 (1139.37%) > > Hmean kb/sec-reader-read 507.00 ( 0.00%) 66.36 ( -86.91%) 14.68 ( -97.10%) > > > > With CFQ, there is some fairness between the readers and writers and > > with BFQ, there is a strong preference to writers. Again, this is not > > universal. It'll be a mix and sometimes it'll be classed as a gain and > > sometimes a regression. > > > > Yes, that's why I didn't pay too much attention so far to such an > issue. I preferred to tune for maximum responsiveness and minimal > latency for soft real-time applications, w.r.t. to reducing a kind of > unfairness for which no user happened to complain (so far). Do you > have some real application (or benchmark simulating a real > application) in which we can see actual problems because of this form > of unfairness? I don't have data on that. This was a preliminary study only to see if a switch was safe running workloads that would appear in internal bug reports related to benchmarking. > I was thinking of, e.g., two virtual machines, one > doing heavy writes and the other heavy reads. But in that case, > cgroups have to be used, and I'm not sure we would still see this > problem. Any suggestion is welcome. > I haven't spent time designing such a thing. Even if I did, I know I would get hit within weeks of a switch during distro development with reports related to fio, dbench and other basic IO benchmarks. 
> > I had seen this assertion so one of the fio configurations had multiple > > heavy writers in the background and a random reader of small files to > > simulate that scenario. The intent was to simulate heavy IO in the presence > > of application startup > > > > 4.12.0 4.12.0 4.12.0 > > legacy-cfq mq-bfq mq-bfq-tput > > Hmean kb/sec-writer-write 1997.75 ( 0.00%) 2035.65 ( 1.90%) 2014.50 ( 0.84%) > > Hmean kb/sec-reader-read 128.50 ( 0.00%) 79.46 ( -38.16%) 12.78 ( -90.06%) > > > > Write throughput is steady-ish across each IO scheduler but readers get > > starved badly which I expect would slow application startup and disabling > > low_latency makes it much worse. > > A greedy random reader that goes on steadily mimics an application startup > only for the first handful of seconds. > Sure, but if during those handful of seconds the throughput is 10% of what is used to be, it'll still be noticable. > Where can I find the exact script/configuration you used, to check > more precisely what is going on and whether BFQ is actually behaving very > badly for some reason? > https://github.com/gormanm/mmtests All the configuration files are in configs/ so global-dhp__io-dbench4-fsync-ext4 maps to global-dhp__io-dbench4-fsync but it has to be editted if you want to format a test partition. Otherwise, you'd just need to make sure the current directory was ext4 and ignore any filesystem aging artifacts. > > The mmtests configuration in question > > is global-dhp__io-fio-randread-sync-heavywrite albeit editted to create > > a fresh XFS filesystem on a test partition. > > > > This is not exactly equivalent to real application startup but that can > > be difficult to quantify properly. > > > > If you do want to check application startup, then just 1) start some > background workload, 2) drop caches, 3) start the app, 4) measure how > long it takes to start. Otherwise, the comm_startup_lat test in the > S suite [1] does all of this for you. 
>

I did have something like this before but found it unreliable because it
couldn't tell the difference between when an application has a window
and when it's ready for use. Evolution, for example, may start up and
start displaying, but then clicking on a mail may stall for a few
seconds. It's difficult to quantify meaningfully, which is why I
eventually gave up and relied instead on proxy measures.

--
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread
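As an aside for anyone reading the tables quoted in this thread: the Amean and Hmean rows are arithmetic and harmonic means of the samples (the harmonic mean is the conventional choice for throughput, since it is dominated by the slowest samples), and the percentages are relative to the first (baseline) column, with positive always meaning an improvement. A small sketch of that convention, with made-up sample data, assuming the sign conventions inferred from the quoted numbers:

```python
from statistics import mean, harmonic_mean

def gain_higher_better(base, value):
    # Throughput-style rows (e.g. "Hmean kb/sec-..."): positive = better.
    return (value - base) / base * 100.0

def gain_lower_better(base, value):
    # Execution-time rows (e.g. dbench "Amean"): positive = better.
    return (base - value) / base * 100.0

# Made-up per-iteration reader throughput samples (kb/sec):
baseline_samples = [500.0, 510.0, 504.0]
test_samples = [80.0, 75.0, 83.0]

delta = gain_higher_better(harmonic_mean(baseline_samples),
                           harmonic_mean(test_samples))
```

Checking against the quoted tables: 507.00 kb/sec falling to 66.36 kb/sec gives -86.91%, and a dbench time of 80.67s rising to 83.68s gives -3.74%, matching the annotations.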
* Re: Switching to MQ by default may generate some bug reports 2017-08-04 11:01 ` Mel Gorman @ 2017-08-04 22:05 ` Paolo Valente 2017-08-05 11:54 ` Mel Gorman 2017-08-07 17:32 ` Paolo Valente 0 siblings, 2 replies; 29+ messages in thread From: Paolo Valente @ 2017-08-04 22:05 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block > Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <mgorman@techsingularity.net> ha scritto: > > On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote: >>> I took that into account BFQ with low-latency was also tested and the >>> impact was not a universal improvement although it can be a noticable >>> improvement. From the same machine; >>> >>> dbench4 Loadfile Execution Time >>> 4.12.0 4.12.0 4.12.0 >>> legacy-cfq mq-bfq mq-bfq-tput >>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%) >>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%) >>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%) >>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%) >>> >> >> Thanks for trying with low_latency disabled. If I read numbers >> correctly, we move from a worst case of 361% higher execution time to >> a worst case of 11%. With a best case of 20% of lower execution time. >> > > Yes. > >> I asked you about none and mq-deadline in a previous email, because >> actually we have a double change here: change of the I/O stack, and >> change of the scheduler, with the first change probably not irrelevant >> with respect to the second one. >> > > True. However, the difference between legacy-deadline mq-deadline is > roughly around the 5-10% mark across workloads for SSD. It's not > universally true but the impact is not as severe. While this is not > proof that the stack change is the sole root cause, it makes it less > likely. > I'm getting a little lost here. 
If I'm not mistaken, you are saying, since the difference between two virtually identical schedulers (legacy-deadline and mq-deadline) is only around 5-10%, while the difference between cfq and mq-bfq-tput is higher, then in the latter case it is not the stack's fault. Yet the loss of mq-bfq-tput in the above test is exactly in the 5-10% range? What am I missing? Other tests with mq-bfq-tput not yet reported? >> By chance, according to what you have measured so far, is there any >> test where, instead, you expect or have seen bfq-mq-tput to always >> lose? I could start from there. >> > > global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that > it could be the stack change. > > global-dhp__io-dbench4-fsync-ext4 was a universal loss across any > machine tested. This is global-dhp__io-dbench4-fsync from mmtests using > ext4 as a filesystem. The same is not true for XFS so the filesystem > matters. > Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as soon as I can, thanks. >>> However, it's not a universal gain and there are also fairness issues. >>> For example, this is a fio configuration with a single random reader and >>> a single random writer on the same machine >>> >>> fio Throughput >>> 4.12.0 4.12.0 4.12.0 >>> legacy-cfq mq-bfq mq-bfq-tput >>> Hmean kb/sec-writer-write 398.15 ( 0.00%) 4659.18 (1070.21%) 4934.52 (1139.37%) >>> Hmean kb/sec-reader-read 507.00 ( 0.00%) 66.36 ( -86.91%) 14.68 ( -97.10%) >>> >>> With CFQ, there is some fairness between the readers and writers and >>> with BFQ, there is a strong preference to writers. Again, this is not >>> universal. It'll be a mix and sometimes it'll be classed as a gain and >>> sometimes a regression. >>> >> >> Yes, that's why I didn't pay too much attention so far to such an >> issue. I preferred to tune for maximum responsiveness and minimal >> latency for soft real-time applications, w.r.t. to reducing a kind of >> unfairness for which no user happened to complain (so far). 
>> Do you
>> have some real application (or benchmark simulating a real
>> application) in which we can see actual problems because of this form
>> of unfairness?
>
> I don't have data on that. This was a preliminary study only to see if
> a switch was safe running workloads that would appear in internal bug
> reports related to benchmarking.
>
>> I was thinking of, e.g., two virtual machines, one
>> doing heavy writes and the other heavy reads. But in that case,
>> cgroups have to be used, and I'm not sure we would still see this
>> problem. Any suggestion is welcome.
>>
>
> I haven't spent time designing such a thing. Even if I did, I know I would
> get hit within weeks of a switch during distro development with reports
> related to fio, dbench and other basic IO benchmarks.
>

I see.

>>> I had seen this assertion so one of the fio configurations had multiple
>>> heavy writers in the background and a random reader of small files to
>>> simulate that scenario. The intent was to simulate heavy IO in the presence
>>> of application startup
>>>
>>>                                      4.12.0             4.12.0             4.12.0
>>>                                  legacy-cfq             mq-bfq        mq-bfq-tput
>>> Hmean     kb/sec-writer-write   1997.75 (  0.00%)  2035.65 (   1.90%)  2014.50 (   0.84%)
>>> Hmean     kb/sec-reader-read     128.50 (  0.00%)    79.46 ( -38.16%)    12.78 ( -90.06%)
>>>
>>> Write throughput is steady-ish across each IO scheduler but readers get
>>> starved badly which I expect would slow application startup and disabling
>>> low_latency makes it much worse.
>>
>> A greedy random reader that goes on steadily mimics an application startup
>> only for the first handful of seconds.
>>
>
> Sure, but if during those handful of seconds the throughput is 10% of
> what is used to be, it'll still be noticeable.
>

I did not have the time yet to repeat this test (I will try soon), but
I had time to think about it a little bit. And I soon realized that
actually this is not a responsiveness test against a background
workload, or it is at most an extreme corner case for it.
Both the write and the read thread start at the same time. So, we are
mimicking a user starting, e.g., a file copy and, exactly at the same
time, an app (in addition, the file copy starts to cause heavy writes
immediately). BFQ uses time patterns to guess which processes to
privilege, and the time patterns of the writer and reader are
indistinguishable here. Only tagging processes with extra information
would help, but that is a different story. And in this case tagging
would help for a not-so-frequent use case.

In addition, a greedy random reader may mimic the start-up of only
very simple applications. Even a simple terminal such as xterm does
some I/O (not completely random, but I guess we don't need to be
overpicky), then it stops doing I/O and passes the ball to the X
server, which does some I/O, stops and passes the ball back to xterm
for its final start-up phase. More and more processes are involved,
and more and more complex I/O patterns are issued as applications
become more complex. This is the reason why we strived to benchmark
application start-up by truly starting real applications and measuring
their start-up time (see below).

>> Where can I find the exact script/configuration you used, to check
>> more precisely what is going on and whether BFQ is actually behaving very
>> badly for some reason?
>>
>
> https://github.com/gormanm/mmtests
>
> All the configuration files are in configs/ so
> global-dhp__io-dbench4-fsync-ext4 maps to global-dhp__io-dbench4-fsync but
> it has to be editted if you want to format a test partition. Otherwise,
> you'd just need to make sure the current directory was ext4 and ignore
> any filesystem aging artifacts.
>

Thank you, I'll do it ASAP.

>>> The mmtests configuration in question
>>> is global-dhp__io-fio-randread-sync-heavywrite albeit editted to create
>>> a fresh XFS filesystem on a test partition.
>>>
>>> This is not exactly equivalent to real application startup but that can
>>> be difficult to quantify properly.
>>>
>>
>> If you do want to check application startup, then just 1) start some
>> background workload, 2) drop caches, 3) start the app, 4) measure how
>> long it takes to start. Otherwise, the comm_startup_lat test in the
>> S suite [1] does all of this for you.
>>
>
> I did have something like this before but found it unreliable because it
> couldn't tell the difference between when an application has a window
> and when it's ready for use. Evolution for example may start up and
> start displaing but then clicking on a mail may stall for a few seconds.
> It's difficult to quantify meaningfully which is why I eventually gave
> up and relied instead on proxy measures.
>

Right, that's why we looked for other applications that were as
popular, but for which we could get reliable and precise measures.
One such application is a terminal, another one a shell. On the
opposite end of the size spectrum, other such applications are
libreoffice/openoffice.

For, e.g., gnome-terminal, it is enough to invoke "time gnome-terminal
-e /bin/true". By the stopwatch, such a command measures very
precisely the time that elapses from when you start the terminal to
when you can start typing a command in its window. Similarly, "xterm
/bin/true", "ssh localhost exit", "bash -c exit", "lowriter
--terminate-after-init". Of course, these tricks certainly cause a
few more block reads than the real, bare application start-up, but,
even if the difference were noticeable in terms of time, what matters
is to measure the execution time of these commands without background
workload, and then compare it against their execution time with some
background workload. If it takes, say, 5 seconds without background
workload, and still about 5 seconds with background workload and a
given scheduler, but, with another scheduler, it takes 40 seconds with
background workload (all real numbers, actually), then you can draw
some sound conclusion on responsiveness for each of the two
schedulers.
In addition, as for coverage, we made the empirical assumption that
start-up time measured with each of the above easy-to-benchmark
applications gives an idea of the time that it would take with any
application of the same size and complexity. User feedback has
confirmed this assumption so far. Of course there may well be
exceptions.

Thanks,
Paolo

> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread
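The stopwatch procedure described in the exchange above (time a trivial application invocation with and without background I/O, then compare) can be sketched as follows. This is a minimal illustration, not the comm_startup_lat test itself; the cache-drop step requires root and is skipped otherwise, and the choice of command and background workload are up to the tester.

```python
import subprocess
import time

def drop_caches():
    # Flush dirty data, then ask the kernel to drop clean caches so the
    # measured start-up actually performs block reads. Needs root; if
    # the write fails, fall through and accept warm-cache numbers.
    subprocess.run(["sync"], check=True)
    try:
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")
    except OSError:
        pass

def startup_time(cmd):
    # Wall-clock time of e.g. ["gnome-terminal", "-e", "/bin/true"],
    # mirroring 'time gnome-terminal -e /bin/true' from the thread.
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start
```

Usage would be: measure `startup_time(...)` after `drop_caches()` on an idle system, then repeat while a heavy writer (say, a large file copy) runs, and compare the two times per scheduler.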
* Re: Switching to MQ by default may generate some bug reports 2017-08-04 22:05 ` Paolo Valente @ 2017-08-05 11:54 ` Mel Gorman 2017-08-07 17:35 ` Paolo Valente 2017-08-07 17:32 ` Paolo Valente 1 sibling, 1 reply; 29+ messages in thread From: Mel Gorman @ 2017-08-05 11:54 UTC (permalink / raw) To: Paolo Valente; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block On Sat, Aug 05, 2017 at 12:05:00AM +0200, Paolo Valente wrote: > > > > True. However, the difference between legacy-deadline mq-deadline is > > roughly around the 5-10% mark across workloads for SSD. It's not > > universally true but the impact is not as severe. While this is not > > proof that the stack change is the sole root cause, it makes it less > > likely. > > > > I'm getting a little lost here. If I'm not mistaken, you are saying, > since the difference between two virtually identical schedulers > (legacy-deadline and mq-deadline) is only around 5-10%, while the > difference between cfq and mq-bfq-tput is higher, then in the latter > case it is not the stack's fault. Yet the loss of mq-bfq-tput in the > above test is exactly in the 5-10% range? What am I missing? Other > tests with mq-bfq-tput not yet reported? > Unfortunately it's due to very broad generalisations. 10 configurations from mmtests were used in total when I was checking this. Multiply those by 4 for each tested filesystem and then multiply again for each io scheduler on a total of 7 machines taking 3-4 weeks to execute all tests. The deltas between each configuration on different machines varies a lot. It also is an impractical amount of information to present and discuss and the point of the original mail was to highlight that switching the default may create some bug reports so as not be too surprised or panic. The general trend observed was that legacy-deadline vs mq-deadline generally showed a small regression switching to mq-deadline but it was not universal and it wasn't consistent. 
If nothing else, IO tests that are borderline are difficult to test for
significance as the distributions are multimodal. However, it was
generally close enough to conclude "this could be tolerated and more mq
work is on the way". Still, it's impossible to give a precise range of
how much of a hit it would take, but it generally seemed to be around
the 5% mark.

CFQ switching to BFQ was often more dramatic. Sometimes it doesn't
really matter and sometimes turning off low_latency helped enough.
bonnie, which is a single IO issuer, didn't show much difference in
throughput. It had a few problems with file create/delete, but the
absolute times there are so small that tiny differences look relatively
large and were ignored. For the moment, I'll be temporarily ignoring
bonnie because it was a sniff-test only and I didn't expect many
surprises from a single IO issuer.

The workload that cropped up as being most alarming was dbench, which is
ironic given that it's not actually that IO intensive and tends to be
limited by fsync times. The benchmark has a number of other weaknesses.
It's more often dominated by scheduler performance, can be gamed by
starving all but one thread of IO to give "better" results and is
sensitive to the exact timing of when writeback occurs, which mmtests
tries to mitigate by reducing the loadfile size. If it turns out that
it's the only benchmark that really suffers then I think we could live
with it or find ways of tuning around it, but fio concerned me.

The fio results were a concern because of the different read/write
throughputs and the fact that it was not consistently reads or writes
that were favoured. These changes are not necessarily good or bad, but
I've seen in the past that writes that get starved tend to impact
workloads that periodically fsync dirty data (think databases) and had
to be tuned by reducing dirty_ratio. I've also seen cases where syncing
of metadata on some filesystems would cause large stalls if there was a
lot of write starvation.
I regretted not adding pgioperf (basic simulator of postgres IO behaviour) to the original set of tests because it tends to be very good at detecting fsync stalls due to write starvation. > > <SNIP> > > Sure, but if during those handful of seconds the throughput is 10% of > > what is used to be, it'll still be noticeable. > > > > I did not have the time yet to repeat this test (I will try soon), but > I had the time think about it a little bit. And I soon realized that > actually this is not a responsiveness test against background > workload, or, it is at most an extreme corner case for it. Both the > write and the read thread start at the same time. So, we are > mimicking a user starting, e.g., a file copy, and, exactly at the same > time, an app(in addition, the file copy starts to cause heavy writes > immediately). > Yes, although it's not entirely unrealistic to have light random readers and heavy writers starting at the same time. A write-intensive database can behave like this. Also, I wouldn't panic about needing time to repeat this test. This is not blocking me as such as all I was interested in was checking if the switch could be safely made now or should it be deferred while keeping an eye on how it's doing. It's perfectly possible others will make the switch and find the majority of their workloads are fine. If others report bugs and they're using rotary storage then it should be obvious to ask them to test with the legacy block layer and work from there. At least then, there should be better reference workloads to look from. Unfortunately, given the scope and the time it takes to test, I had little choice except to shotgun a few workloads and see what happened. > BFQ uses time patterns to guess which processes to privilege, and the > time patterns of the writer and reader are indistinguishable here. > Only tagging processes with extra information would help, but that is > a different story. 
And in this case tagging would help for a > not-so-frequent use case. > Hopefully there will not be a reliance on tagging processes. If we're lucky, I just happened to pick a few IO workloads that seemed to suffer particularly badly. > In addition, a greedy random reader may mimick the start-up of only > very simple applications. Even a simple terminal such as xterm does > some I/O (not completely random, but I guess we don't need to be > overpicky), then it stops doing I/O and passes the ball to the X > server, which does some I/O, stops and passes the ball back to xterm > for its final start-up phase. More and more processes are involved, > and more and more complex I/O patterns are issued as applications > become more complex. This is the reason why we strived to benchmark > application start-up by truly starting real applications and measuring > their start-up time (see below). > Which is fair enough, can't argue with that. Again, the intent here is not to rag on BFQ. I had a few configurations that looked alarming which I sometimes use as an early warning that complex workloads may have problems that are harder to debug. It's not always true. Sometimes the early warnings are red herrings. I've had a long dislike for dbench4 too but each time I got rid of it, it showed up again on some random bug report which is the only reason I included it in this evaluation. > > I did have something like this before but found it unreliable because it > > couldn't tell the difference between when an application has a window > > and when it's ready for use. Evolution for example may start up and > > start displaing but then clicking on a mail may stall for a few seconds. > > It's difficult to quantify meaningfully which is why I eventually gave > > up and relied instead on proxy measures. > > > > Right, that's why we looked for other applications that were as > popular, but for which we could get reliable and precise measures. 
> One such application is a terminal, another one a shell. On the > opposite end of the size spectrum, other such applications are > libreoffice/openoffice. > Seems reasonable. > For, e.g., gnome-terminal, it is enough to invoke "time gnome-terminal > -e /bin/true". By the stopwatch, such a command measures very > precisely the time that elapses from when you start the terminal, to > when you can start typing a command in its window. Similarly, "xterm > /bin/true", "ssh localhost exit", "bash -c exit", "lowriter > --terminate-after-init". Of course, these tricks certainly cause a > few more block reads than the real, bare application start-up, but, > even if the difference were noticeable in terms of time, what matters > is to measure the execution time of these commands without background > workload, and then compare it against their execution time with some > background workload. If it takes, say, 5 seconds without background > workload, and still about 5 seconds with background workload and a > given scheduler, but, with another scheduler, it takes 40 seconds with > background workload (all real numbers, actually), then you can draw > some sound conclusion on responsiveness for each of the two > schedulers. > Again, that is a fair enough methodology and will work in many cases. It's somewhat impractical for me. When I'm checking patches (be they new patches I developed, am backporting or looking at new kernels), I am usually checking a range of workloads across multiple machines and it's only when I'm doing live analysis of a problem that I'm directly using a machine. > In addition, as for coverage, we made the empirical assumption that > start-up time measured with each of the above easy-to-benchmark > applications gives an idea of the time that it would take with any > application of the same size and complexity. User feedback confirmed > these assumptions so far. Of course there may well be exceptions. 
> FWIW, I also have anecdotal evidence from at least one user that using BFQ is way better on their desktop than CFQ ever was even under the best of circumstances. I've had problems directly measuring it empirically but this was also the first time I switched on BFQ to see what fell out so it's early days yet. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
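The start-up-latency methodology quoted above is easy to script. The sketch below only illustrates the idea and is not part of any test suite in this thread: the trivial `sh -c exit` stands in for the `bash -c exit` style start-up probe, and a `dd` writer stands in for the heavy background workload (both are illustrative assumptions; substitute real applications and workloads as described in the quoted text).

```shell
#!/bin/sh
# Sketch of the start-up responsiveness comparison discussed above.
# Commands and the bg.tmp file name are hypothetical placeholders.

measure() {
    # Print wall-clock seconds for one invocation of the given command.
    start=$(date +%s.%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.2f", e - s }'
}

idle=$(measure sh -c exit)            # baseline: no background I/O

# Background writer standing in for the "heavy writes" workload.
dd if=/dev/zero of=bg.tmp bs=1M count=64 conv=fsync >/dev/null 2>&1 &
bgpid=$!
loaded=$(measure sh -c exit)          # same probe under background writes
wait "$bgpid"; rm -f bg.tmp

echo "idle=${idle}s loaded=${loaded}s"
```

A large gap between the two numbers under one scheduler but not another is exactly the 5-seconds-vs-40-seconds responsiveness difference described in the quoted example.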
* Re: Switching to MQ by default may generate some bug reports 2017-08-05 11:54 ` Mel Gorman @ 2017-08-07 17:35 ` Paolo Valente 0 siblings, 0 replies; 29+ messages in thread From: Paolo Valente @ 2017-08-07 17:35 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block > Il giorno 05 ago 2017, alle ore 13:54, Mel Gorman <mgorman@techsingularity.net> ha scritto: > ... > >> In addition, as for coverage, we made the empiric assumption that >> start-up time measured with each of the above easy-to-benchmark >> applications gives an idea of the time that it would take with any >> application of the same size and complexity. User feedback confirmed >> this assumptions so far. Of course there may well be exceptions. >> > > FWIW, I also have anecdotal evidence from at least one user that using > BFQ is way better on their desktop than CFQ ever was even under the best > of circumstances. I've had problems directly measuring it empirically but > this was also the first time I switched on BFQ to see what fell out so > it's early days yet. > Yeah, I'm constantly trying (without great success so far :) ) to turn this folklore into shared, repeatable tests and numbers. The latter could then be reliably evaluated, questioned or defended. Thanks, Paolo > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Switching to MQ by default may generate some bug reports 2017-08-04 22:05 ` Paolo Valente 2017-08-05 11:54 ` Mel Gorman @ 2017-08-07 17:32 ` Paolo Valente 2017-08-07 18:42 ` Paolo Valente 2017-08-08 10:30 ` Mel Gorman 1 sibling, 2 replies; 29+ messages in thread From: Paolo Valente @ 2017-08-07 17:32 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block > Il giorno 05 ago 2017, alle ore 00:05, Paolo Valente <paolo.valente@linaro.org> ha scritto: > >> >> Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <mgorman@techsingularity.net> ha scritto: >> >> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote: >>>> I took that into account BFQ with low-latency was also tested and the >>>> impact was not a universal improvement although it can be a noticable >>>> improvement. From the same machine; >>>> >>>> dbench4 Loadfile Execution Time >>>> 4.12.0 4.12.0 4.12.0 >>>> legacy-cfq mq-bfq mq-bfq-tput >>>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%) >>>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%) >>>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%) >>>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%) >>>> >>> >>> Thanks for trying with low_latency disabled. If I read numbers >>> correctly, we move from a worst case of 361% higher execution time to >>> a worst case of 11%. With a best case of 20% of lower execution time. >>> >> >> Yes. >> >>> I asked you about none and mq-deadline in a previous email, because >>> actually we have a double change here: change of the I/O stack, and >>> change of the scheduler, with the first change probably not irrelevant >>> with respect to the second one. >>> >> >> True. However, the difference between legacy-deadline mq-deadline is >> roughly around the 5-10% mark across workloads for SSD. It's not >> universally true but the impact is not as severe. 
While this is not >> proof that the stack change is the sole root cause, it makes it less >> likely. >> > > I'm getting a little lost here. If I'm not mistaken, you are saying, > since the difference between two virtually identical schedulers > (legacy-deadline and mq-deadline) is only around 5-10%, while the > difference between cfq and mq-bfq-tput is higher, then in the latter > case it is not the stack's fault. Yet the loss of mq-bfq-tput in the > above test is exactly in the 5-10% range? What am I missing? Other > tests with mq-bfq-tput not yet reported? > >>> By chance, according to what you have measured so far, is there any >>> test where, instead, you expect or have seen bfq-mq-tput to always >>> lose? I could start from there. >>> >> >> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that >> it could be the stack change. >> >> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any >> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using >> ext4 as a filesystem. The same is not true for XFS so the filesystem >> matters. >> > > Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as > soon as I can, thanks. > > I've run this test and tried to further investigate this regression. For the moment, the gist seems to be that blk-mq plays an important role, not only with bfq (unless I'm considering the wrong numbers). Even if your main purpose in this thread was just to give a heads-up, I guess it may be useful to share what I have found out. In addition, I want to ask for some help, to try to get closer to the possible causes of at least this regression. If you think it would be better to open a new thread on this stuff, I'll do it. First, I got mixed results on my system. I'll focus only on the the case where mq-bfq-tput achieves its worst relative performance w.r.t. to cfq, which happens with 64 clients. Still, also in this case mq-bfq is better than cfq in all average values, but Flush. 
I don't know which are the best/right values to look at, so, here's the final report for both schedulers: CFQ Operation Count AvgLat MaxLat -------------------------------------------------- Flush 13120 20.069 348.594 Close 133696 0.008 14.642 LockX 512 0.009 0.059 Rename 7552 1.857 415.418 ReadX 270720 0.141 535.632 WriteX 89591 421.961 6363.271 Unlink 34048 1.281 662.467 UnlockX 512 0.007 0.057 FIND_FIRST 62016 0.086 25.060 SET_FILE_INFORMATION 15616 0.995 176.621 QUERY_FILE_INFORMATION 28734 0.004 1.372 QUERY_PATH_INFORMATION 170240 0.163 820.292 QUERY_FS_INFORMATION 28736 0.017 4.110 NTCreateX 178688 0.437 905.567 MQ-BFQ-TPUT Operation Count AvgLat MaxLat -------------------------------------------------- Flush 13504 75.828 11196.035 Close 136896 0.004 3.855 LockX 640 0.005 0.031 Rename 8064 1.020 288.989 ReadX 297600 0.081 685.850 WriteX 93515 391.637 12681.517 Unlink 34880 0.500 146.928 UnlockX 640 0.004 0.032 FIND_FIRST 63680 0.045 222.491 SET_FILE_INFORMATION 16000 0.436 686.115 QUERY_FILE_INFORMATION 30464 0.003 0.773 QUERY_PATH_INFORMATION 175552 0.044 148.449 QUERY_FS_INFORMATION 29888 0.009 1.984 NTCreateX 183152 0.289 300.867 Are these results in line with yours for this test? Anyway, to investigate this regression more in depth, I took two further steps. First, I repeated the same test with bfq-sq, my out-of-tree version of bfq for legacy block (identical to mq-bfq apart from the changes needed for bfq to live in blk-mq). 
I got: BFQ-SQ-TPUT Operation Count AvgLat MaxLat -------------------------------------------------- Flush 12618 30.212 484.099 Close 123884 0.008 10.477 LockX 512 0.010 0.170 Rename 7296 2.032 426.409 ReadX 262179 0.251 985.478 WriteX 84072 461.398 7283.003 Unlink 33076 1.685 848.734 UnlockX 512 0.007 0.036 FIND_FIRST 58690 0.096 220.720 SET_FILE_INFORMATION 14976 1.792 466.435 QUERY_FILE_INFORMATION 26575 0.004 2.194 QUERY_PATH_INFORMATION 158125 0.112 614.063 QUERY_FS_INFORMATION 28224 0.017 1.385 NTCreateX 167877 0.827 945.644 So, the worst-case regression is now around 15%. This made me suspect that blk-mq influences results a lot for this test. To crosscheck, I compared legacy-deadline and mq-deadline too. LEGACY-DEADLINE Operation Count AvgLat MaxLat -------------------------------------------------- Flush 13267 9.622 298.206 Close 135692 0.007 10.627 LockX 640 0.008 0.066 Rename 7827 0.544 481.123 ReadX 285929 0.220 2698.442 WriteX 92309 430.867 5191.608 Unlink 34534 1.133 619.235 UnlockX 640 0.008 0.724 FIND_FIRST 63289 0.086 56.851 SET_FILE_INFORMATION 16000 1.254 844.065 QUERY_FILE_INFORMATION 29883 0.004 0.618 QUERY_PATH_INFORMATION 173232 0.089 1295.651 QUERY_FS_INFORMATION 29632 0.017 4.813 NTCreateX 181464 0.479 2214.343 MQ-DEADLINE Operation Count AvgLat MaxLat -------------------------------------------------- Flush 13760 90.542 13221.495 Close 137654 0.008 27.133 LockX 640 0.009 0.115 Rename 8064 1.062 246.759 ReadX 297956 0.051 347.018 WriteX 94698 425.636 15090.020 Unlink 35077 0.580 208.462 UnlockX 640 0.007 0.291 FIND_FIRST 66630 0.566 530.339 SET_FILE_INFORMATION 16000 1.419 811.494 QUERY_FILE_INFORMATION 30717 0.004 1.108 QUERY_PATH_INFORMATION 176153 0.182 517.419 QUERY_FS_INFORMATION 30857 0.018 18.562 NTCreateX 184145 0.281 582.076 So, with both bfq and deadline there seems to be a serious regression, especially on MaxLat, when moving from legacy block to blk-mq. 
The regression is much worse with deadline, as legacy-deadline has the lowest max latency among all the schedulers, whereas mq-deadline has the highest one. Regardless of the actual culprit of this regression, I would like to investigate this issue further. In this respect, I would like to ask for a little help. I would like to isolate the workloads generating the highest latencies. To this purpose, I had a look at the loadfile client-tiny.txt, and I still have a doubt: is every item in the loadfile executed somehow several times (for each value of the number of clients), or is it executed only once? More precisely, IIUC, for each operation reported in the above results, there are several items (lines) in the loadfile. So, is each of these items executed only once? I'm asking because, if it is executed only once, then I guess I can find the critical tasks more easily. Finally, if it is actually executed only once, is it expected that the latency for such a task is one order of magnitude higher than the average latency for that group of tasks? I mean, is such a task intrinsically much heavier, and thus expectedly much longer, or is the fact that latency is much higher for this task a sign that something in the kernel misbehaves for that task? While waiting for some feedback, I'm going to execute your test showing great unfairness between writes and reads, and to also check whether responsiveness worsens if the write workload for that test is being executed in the background. Thanks, Paolo > ... >> -- >> Mel Gorman >> SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
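The relative-change percentages used throughout this thread (the mmtests Amean columns, the "around 15%" worst case, and so on) are simple to recompute from the dbench reports. A minimal sketch, fed with the Flush AvgLat values quoted above for cfq (20.069 ms) and mq-bfq-tput (75.828 ms):

```shell
# Signed percent change between a baseline and a new measurement
# (positive = regression for a latency metric).
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

pct_change 20.069 75.828   # Flush AvgLat, cfq -> mq-bfq-tput: prints 277.8
```

The same helper applied to the MaxLat columns makes the blk-mq latency blow-ups (e.g. Flush 348.594 ms vs 11196.035 ms) immediately visible.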
* Re: Switching to MQ by default may generate some bug reports 2017-08-07 17:32 ` Paolo Valente @ 2017-08-07 18:42 ` Paolo Valente 2017-08-08 8:06 ` Paolo Valente 2017-08-08 10:30 ` Mel Gorman 1 sibling, 1 reply; 29+ messages in thread From: Paolo Valente @ 2017-08-07 18:42 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block > Il giorno 07 ago 2017, alle ore 19:32, Paolo Valente <paolo.valente@linaro.org> ha scritto: > >> >> Il giorno 05 ago 2017, alle ore 00:05, Paolo Valente <paolo.valente@linaro.org> ha scritto: >> >>> >>> Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <mgorman@techsingularity.net> ha scritto: >>> >>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote: >>>>> I took that into account BFQ with low-latency was also tested and the >>>>> impact was not a universal improvement although it can be a noticable >>>>> improvement. From the same machine; >>>>> >>>>> dbench4 Loadfile Execution Time >>>>> 4.12.0 4.12.0 4.12.0 >>>>> legacy-cfq mq-bfq mq-bfq-tput >>>>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%) >>>>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%) >>>>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%) >>>>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%) >>>>> >>>> >>>> Thanks for trying with low_latency disabled. If I read numbers >>>> correctly, we move from a worst case of 361% higher execution time to >>>> a worst case of 11%. With a best case of 20% of lower execution time. >>>> >>> >>> Yes. >>> >>>> I asked you about none and mq-deadline in a previous email, because >>>> actually we have a double change here: change of the I/O stack, and >>>> change of the scheduler, with the first change probably not irrelevant >>>> with respect to the second one. >>>> >>> >>> True. However, the difference between legacy-deadline mq-deadline is >>> roughly around the 5-10% mark across workloads for SSD. 
It's not >>> universally true but the impact is not as severe. While this is not >>> proof that the stack change is the sole root cause, it makes it less >>> likely. >>> >> >> I'm getting a little lost here. If I'm not mistaken, you are saying, >> since the difference between two virtually identical schedulers >> (legacy-deadline and mq-deadline) is only around 5-10%, while the >> difference between cfq and mq-bfq-tput is higher, then in the latter >> case it is not the stack's fault. Yet the loss of mq-bfq-tput in the >> above test is exactly in the 5-10% range? What am I missing? Other >> tests with mq-bfq-tput not yet reported? >> >>>> By chance, according to what you have measured so far, is there any >>>> test where, instead, you expect or have seen bfq-mq-tput to always >>>> lose? I could start from there. >>>> >>> >>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that >>> it could be the stack change. >>> >>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any >>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using >>> ext4 as a filesystem. The same is not true for XFS so the filesystem >>> matters. >>> >> >> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as >> soon as I can, thanks. >> >> > > I've run this test and tried to further investigate this regression. > For the moment, the gist seems to be that blk-mq plays an important > role, not only with bfq (unless I'm considering the wrong numbers). > Even if your main purpose in this thread was just to give a heads-up, > I guess it may be useful to share what I have found out. In addition, > I want to ask for some help, to try to get closer to the possible > causes of at least this regression. If you think it would be better > to open a new thread on this stuff, I'll do it. > > First, I got mixed results on my system. I'll focus only on the the > case where mq-bfq-tput achieves its worst relative performance w.r.t. 
> to cfq, which happens with 64 clients. Still, also in this case > mq-bfq is better than cfq in all average values, but Flush. I don't > know which are the best/right values to look at, so, here's the final > report for both schedulers: > > CFQ > > Operation Count AvgLat MaxLat > -------------------------------------------------- > Flush 13120 20.069 348.594 > Close 133696 0.008 14.642 > LockX 512 0.009 0.059 > Rename 7552 1.857 415.418 > ReadX 270720 0.141 535.632 > WriteX 89591 421.961 6363.271 > Unlink 34048 1.281 662.467 > UnlockX 512 0.007 0.057 > FIND_FIRST 62016 0.086 25.060 > SET_FILE_INFORMATION 15616 0.995 176.621 > QUERY_FILE_INFORMATION 28734 0.004 1.372 > QUERY_PATH_INFORMATION 170240 0.163 820.292 > QUERY_FS_INFORMATION 28736 0.017 4.110 > NTCreateX 178688 0.437 905.567 > > MQ-BFQ-TPUT > > Operation Count AvgLat MaxLat > -------------------------------------------------- > Flush 13504 75.828 11196.035 > Close 136896 0.004 3.855 > LockX 640 0.005 0.031 > Rename 8064 1.020 288.989 > ReadX 297600 0.081 685.850 > WriteX 93515 391.637 12681.517 > Unlink 34880 0.500 146.928 > UnlockX 640 0.004 0.032 > FIND_FIRST 63680 0.045 222.491 > SET_FILE_INFORMATION 16000 0.436 686.115 > QUERY_FILE_INFORMATION 30464 0.003 0.773 > QUERY_PATH_INFORMATION 175552 0.044 148.449 > QUERY_FS_INFORMATION 29888 0.009 1.984 > NTCreateX 183152 0.289 300.867 > > Are these results in line with yours for this test? > > Anyway, to investigate this regression more in depth, I took two > further steps. First, I repeated the same test with bfq-sq, my > out-of-tree version of bfq for legacy block (identical to mq-bfq apart > from the changes needed for bfq to live in blk-mq). 
I got: > > BFQ-SQ-TPUT > > Operation Count AvgLat MaxLat > -------------------------------------------------- > Flush 12618 30.212 484.099 > Close 123884 0.008 10.477 > LockX 512 0.010 0.170 > Rename 7296 2.032 426.409 > ReadX 262179 0.251 985.478 > WriteX 84072 461.398 7283.003 > Unlink 33076 1.685 848.734 > UnlockX 512 0.007 0.036 > FIND_FIRST 58690 0.096 220.720 > SET_FILE_INFORMATION 14976 1.792 466.435 > QUERY_FILE_INFORMATION 26575 0.004 2.194 > QUERY_PATH_INFORMATION 158125 0.112 614.063 > QUERY_FS_INFORMATION 28224 0.017 1.385 > NTCreateX 167877 0.827 945.644 > > So, the worst-case regression is now around 15%. This made me suspect > that blk-mq influences results a lot for this test. To crosscheck, I > compared legacy-deadline and mq-deadline too. > Ok, found the problem for the 15% loss in bfq-sq. bfq-sq gets occasionally confused by the workload, and grants device idling to processes that, for this specific workload, would be better to de-schedule immediately. If we set slice_idle to 0, then bfq-sq becomes more or less equivalent to cfq (for some operations apparently even much better): bfq-sq-tput-0idle Operation Count AvgLat MaxLat -------------------------------------------------- Flush 13013 17.888 280.517 Close 133004 0.008 20.698 LockX 512 0.008 0.088 Rename 7427 2.041 193.232 ReadX 270534 0.138 408.534 WriteX 88598 429.615 6272.212 Unlink 33734 1.205 559.152 UnlockX 512 0.011 1.808 FIND_FIRST 61762 0.087 23.012 SET_FILE_INFORMATION 15337 1.322 220.155 QUERY_FILE_INFORMATION 28415 0.004 0.559 QUERY_PATH_INFORMATION 169423 0.150 580.570 QUERY_FS_INFORMATION 28547 0.019 24.466 NTCreateX 177618 0.544 681.795 I'll try soon with mq-bfq too, for which I expect however a deeper investigation to be needed. 
Thanks, Paolo > LEGACY-DEADLINE > > Operation Count AvgLat MaxLat > -------------------------------------------------- > Flush 13267 9.622 298.206 > Close 135692 0.007 10.627 > LockX 640 0.008 0.066 > Rename 7827 0.544 481.123 > ReadX 285929 0.220 2698.442 > WriteX 92309 430.867 5191.608 > Unlink 34534 1.133 619.235 > UnlockX 640 0.008 0.724 > FIND_FIRST 63289 0.086 56.851 > SET_FILE_INFORMATION 16000 1.254 844.065 > QUERY_FILE_INFORMATION 29883 0.004 0.618 > QUERY_PATH_INFORMATION 173232 0.089 1295.651 > QUERY_FS_INFORMATION 29632 0.017 4.813 > NTCreateX 181464 0.479 2214.343 > > > MQ-DEADLINE > > Operation Count AvgLat MaxLat > -------------------------------------------------- > Flush 13760 90.542 13221.495 > Close 137654 0.008 27.133 > LockX 640 0.009 0.115 > Rename 8064 1.062 246.759 > ReadX 297956 0.051 347.018 > WriteX 94698 425.636 15090.020 > Unlink 35077 0.580 208.462 > UnlockX 640 0.007 0.291 > FIND_FIRST 66630 0.566 530.339 > SET_FILE_INFORMATION 16000 1.419 811.494 > QUERY_FILE_INFORMATION 30717 0.004 1.108 > QUERY_PATH_INFORMATION 176153 0.182 517.419 > QUERY_FS_INFORMATION 30857 0.018 18.562 > NTCreateX 184145 0.281 582.076 > > So, with both bfq and deadline there seems to be a serious regression, > especially on MaxLat, when moving from legacy block to blk-mq. The > regression is much worse with deadline, as legacy-deadline has the > lowest max latency among all the schedulers, whereas mq-deadline has > the highest one. > > Regardless of the actual culprit of this regression, I would like to > investigate further this issue. In this respect, I would like to ask > for a little help. I would like to isolate the workloads generating > the highest latencies. To this purpose, I had a look at the loadfile > client-tiny.txt, and I still have a doubt: is every item in the > loadfile executed somehow several times (for each value of the number > of clients), or is it executed only once? 
More precisely, IIUC, for > each operation reported in the above results, there are several items > (lines) in the loadfile. So, is each of these items executed only > once? > > I'm asking because, if it is executed only once, then I guess I can > find the critical tasks ore easily. Finally, if it is actually > executed only once, is it expected that the latency for such a task is > one order of magnitude higher than that of the average latency for > that group of tasks? I mean, is such a task intrinsically much > heavier, and then expectedly much longer, or is the fact that latency > is much higher for this task a sign that something in the kernel > misbehaves for that task? > > While waiting for some feedback, I'm going to execute your test > showing great unfairness between writes and reads, and to also check > whether responsiveness does worsen if the write workload for that test > is being executed in the background. > > Thanks, > Paolo > >> ... >>> -- >>> Mel Gorman >>> SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
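For anyone reproducing the slice_idle experiment from the message above, the tunable lives in the scheduler's iosched directory under sysfs. A hedged sketch follows; the device name sda is an assumption, root is required for the writes, and the slice_idle file only exists for schedulers that expose it (bfq, cfq), so the script degrades to a status message otherwise.

```shell
#!/bin/sh
# Switch the I/O scheduler and disable idling, as in the
# bfq-sq-tput-0idle run above. Device name is an assumption.
dev=sda
q=/sys/block/$dev/queue

if [ -w "$q/scheduler" ]; then
    echo bfq > "$q/scheduler" 2>/dev/null   # or cfq on the legacy path
    if [ -w "$q/iosched/slice_idle" ]; then
        echo 0 > "$q/iosched/slice_idle"
        status="$dev: $(cat "$q/scheduler"), slice_idle=$(cat "$q/iosched/slice_idle")"
    else
        status="$dev: current scheduler has no slice_idle tunable"
    fi
else
    status="skipped: no writable scheduler file for $dev"
fi
echo "$status"
```

Remember that, as discussed earlier in the thread, slice_idle=0 trades the fairness/responsiveness heuristics for throughput, so it corresponds to the -tput configurations in the tables.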
* Re: Switching to MQ by default may generate some bug reports 2017-08-07 18:42 ` Paolo Valente @ 2017-08-08 8:06 ` Paolo Valente 2017-08-08 17:33 ` Paolo Valente 0 siblings, 1 reply; 29+ messages in thread From: Paolo Valente @ 2017-08-08 8:06 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block > Il giorno 07 ago 2017, alle ore 20:42, Paolo Valente <paolo.valente@linaro.org> ha scritto: > >> >> Il giorno 07 ago 2017, alle ore 19:32, Paolo Valente <paolo.valente@linaro.org> ha scritto: >> >>> >>> Il giorno 05 ago 2017, alle ore 00:05, Paolo Valente <paolo.valente@linaro.org> ha scritto: >>> >>>> >>>> Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <mgorman@techsingularity.net> ha scritto: >>>> >>>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote: >>>>>> I took that into account BFQ with low-latency was also tested and the >>>>>> impact was not a universal improvement although it can be a noticable >>>>>> improvement. From the same machine; >>>>>> >>>>>> dbench4 Loadfile Execution Time >>>>>> 4.12.0 4.12.0 4.12.0 >>>>>> legacy-cfq mq-bfq mq-bfq-tput >>>>>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%) >>>>>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%) >>>>>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%) >>>>>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%) >>>>>> >>>>> >>>>> Thanks for trying with low_latency disabled. If I read numbers >>>>> correctly, we move from a worst case of 361% higher execution time to >>>>> a worst case of 11%. With a best case of 20% of lower execution time. >>>>> >>>> >>>> Yes. >>>> >>>>> I asked you about none and mq-deadline in a previous email, because >>>>> actually we have a double change here: change of the I/O stack, and >>>>> change of the scheduler, with the first change probably not irrelevant >>>>> with respect to the second one. >>>>> >>>> >>>> True. 
However, the difference between legacy-deadline mq-deadline is >>>> roughly around the 5-10% mark across workloads for SSD. It's not >>>> universally true but the impact is not as severe. While this is not >>>> proof that the stack change is the sole root cause, it makes it less >>>> likely. >>>> >>> >>> I'm getting a little lost here. If I'm not mistaken, you are saying, >>> since the difference between two virtually identical schedulers >>> (legacy-deadline and mq-deadline) is only around 5-10%, while the >>> difference between cfq and mq-bfq-tput is higher, then in the latter >>> case it is not the stack's fault. Yet the loss of mq-bfq-tput in the >>> above test is exactly in the 5-10% range? What am I missing? Other >>> tests with mq-bfq-tput not yet reported? >>> >>>>> By chance, according to what you have measured so far, is there any >>>>> test where, instead, you expect or have seen bfq-mq-tput to always >>>>> lose? I could start from there. >>>>> >>>> >>>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that >>>> it could be the stack change. >>>> >>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any >>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using >>>> ext4 as a filesystem. The same is not true for XFS so the filesystem >>>> matters. >>>> >>> >>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as >>> soon as I can, thanks. >>> >>> >> >> I've run this test and tried to further investigate this regression. >> For the moment, the gist seems to be that blk-mq plays an important >> role, not only with bfq (unless I'm considering the wrong numbers). >> Even if your main purpose in this thread was just to give a heads-up, >> I guess it may be useful to share what I have found out. In addition, >> I want to ask for some help, to try to get closer to the possible >> causes of at least this regression. If you think it would be better >> to open a new thread on this stuff, I'll do it. 
>> >> First, I got mixed results on my system. I'll focus only on the the >> case where mq-bfq-tput achieves its worst relative performance w.r.t. >> to cfq, which happens with 64 clients. Still, also in this case >> mq-bfq is better than cfq in all average values, but Flush. I don't >> know which are the best/right values to look at, so, here's the final >> report for both schedulers: >> >> CFQ >> >> Operation Count AvgLat MaxLat >> -------------------------------------------------- >> Flush 13120 20.069 348.594 >> Close 133696 0.008 14.642 >> LockX 512 0.009 0.059 >> Rename 7552 1.857 415.418 >> ReadX 270720 0.141 535.632 >> WriteX 89591 421.961 6363.271 >> Unlink 34048 1.281 662.467 >> UnlockX 512 0.007 0.057 >> FIND_FIRST 62016 0.086 25.060 >> SET_FILE_INFORMATION 15616 0.995 176.621 >> QUERY_FILE_INFORMATION 28734 0.004 1.372 >> QUERY_PATH_INFORMATION 170240 0.163 820.292 >> QUERY_FS_INFORMATION 28736 0.017 4.110 >> NTCreateX 178688 0.437 905.567 >> >> MQ-BFQ-TPUT >> >> Operation Count AvgLat MaxLat >> -------------------------------------------------- >> Flush 13504 75.828 11196.035 >> Close 136896 0.004 3.855 >> LockX 640 0.005 0.031 >> Rename 8064 1.020 288.989 >> ReadX 297600 0.081 685.850 >> WriteX 93515 391.637 12681.517 >> Unlink 34880 0.500 146.928 >> UnlockX 640 0.004 0.032 >> FIND_FIRST 63680 0.045 222.491 >> SET_FILE_INFORMATION 16000 0.436 686.115 >> QUERY_FILE_INFORMATION 30464 0.003 0.773 >> QUERY_PATH_INFORMATION 175552 0.044 148.449 >> QUERY_FS_INFORMATION 29888 0.009 1.984 >> NTCreateX 183152 0.289 300.867 >> >> Are these results in line with yours for this test? >> >> Anyway, to investigate this regression more in depth, I took two >> further steps. First, I repeated the same test with bfq-sq, my >> out-of-tree version of bfq for legacy block (identical to mq-bfq apart >> from the changes needed for bfq to live in blk-mq). 
I got: >> >> BFQ-SQ-TPUT >> >> Operation Count AvgLat MaxLat >> -------------------------------------------------- >> Flush 12618 30.212 484.099 >> Close 123884 0.008 10.477 >> LockX 512 0.010 0.170 >> Rename 7296 2.032 426.409 >> ReadX 262179 0.251 985.478 >> WriteX 84072 461.398 7283.003 >> Unlink 33076 1.685 848.734 >> UnlockX 512 0.007 0.036 >> FIND_FIRST 58690 0.096 220.720 >> SET_FILE_INFORMATION 14976 1.792 466.435 >> QUERY_FILE_INFORMATION 26575 0.004 2.194 >> QUERY_PATH_INFORMATION 158125 0.112 614.063 >> QUERY_FS_INFORMATION 28224 0.017 1.385 >> NTCreateX 167877 0.827 945.644 >> >> So, the worst-case regression is now around 15%. This made me suspect >> that blk-mq influences results a lot for this test. To crosscheck, I >> compared legacy-deadline and mq-deadline too. >> > > Ok, found the problem for the 15% loss in bfq-sq. bfq-sq gets > occasionally confused by the workload, and grants device idling to > processes that, for this specific workload, would be better to > de-schedule immediately. If we set slice_idle to 0, then bfq-sq > becomes more or less equivalent to cfq (for some operations apparently > even much better): > > bfq-sq-tput-0idle > > Operation Count AvgLat MaxLat > -------------------------------------------------- > Flush 13013 17.888 280.517 > Close 133004 0.008 20.698 > LockX 512 0.008 0.088 > Rename 7427 2.041 193.232 > ReadX 270534 0.138 408.534 > WriteX 88598 429.615 6272.212 > Unlink 33734 1.205 559.152 > UnlockX 512 0.011 1.808 > FIND_FIRST 61762 0.087 23.012 > SET_FILE_INFORMATION 15337 1.322 220.155 > QUERY_FILE_INFORMATION 28415 0.004 0.559 > QUERY_PATH_INFORMATION 169423 0.150 580.570 > QUERY_FS_INFORMATION 28547 0.019 24.466 > NTCreateX 177618 0.544 681.795 > > I'll try soon with mq-bfq too, for which I expect however a deeper > investigation to be needed. > Hi, to test mq-bfq (with both slice_idle==0 and slice_idle>0), I have also applied Ming patches, and Ah, victory! 
Regardless of the value of slice_idle: mq-bfq-tput Operation Count AvgLat MaxLat -------------------------------------------------- Flush 13183 70.381 1025.407 Close 134539 0.004 1.011 LockX 512 0.005 0.025 Rename 7721 0.740 404.979 ReadX 274422 0.126 873.364 WriteX 90535 408.371 7400.585 Unlink 34276 0.634 581.067 UnlockX 512 0.003 0.029 FIND_FIRST 62664 0.052 321.027 SET_FILE_INFORMATION 15981 0.234 124.739 QUERY_FILE_INFORMATION 29042 0.003 1.731 QUERY_PATH_INFORMATION 171769 0.032 522.415 QUERY_FS_INFORMATION 28958 0.009 3.043 NTCreateX 179643 0.298 687.466 Throughput 9.11183 MB/sec 64 clients 64 procs max_latency=7400.588 ms Unlike bfq-sq, setting slice_idle to 0 doesn't provide any benefit, which makes me suspect that there is some other issue in blk-mq (only a suspicion). I think I may have already understood how to guarantee that bfq almost never idles the device uselessly for this workload too. Yet, since in blk-mq there is no gain even after excluding useless idling, I'll wait for at least Ming's patches to be merged before possibly proposing this contribution. Maybe some other little issue related to this lack of gain in blk-mq will be found and solved in the meantime. Next, I'll move on to the read-write unfairness problem. 
Thanks, Paolo > Thanks, > Paolo > >> LEGACY-DEADLINE >> >> Operation Count AvgLat MaxLat >> -------------------------------------------------- >> Flush 13267 9.622 298.206 >> Close 135692 0.007 10.627 >> LockX 640 0.008 0.066 >> Rename 7827 0.544 481.123 >> ReadX 285929 0.220 2698.442 >> WriteX 92309 430.867 5191.608 >> Unlink 34534 1.133 619.235 >> UnlockX 640 0.008 0.724 >> FIND_FIRST 63289 0.086 56.851 >> SET_FILE_INFORMATION 16000 1.254 844.065 >> QUERY_FILE_INFORMATION 29883 0.004 0.618 >> QUERY_PATH_INFORMATION 173232 0.089 1295.651 >> QUERY_FS_INFORMATION 29632 0.017 4.813 >> NTCreateX 181464 0.479 2214.343 >> >> >> MQ-DEADLINE >> >> Operation Count AvgLat MaxLat >> -------------------------------------------------- >> Flush 13760 90.542 13221.495 >> Close 137654 0.008 27.133 >> LockX 640 0.009 0.115 >> Rename 8064 1.062 246.759 >> ReadX 297956 0.051 347.018 >> WriteX 94698 425.636 15090.020 >> Unlink 35077 0.580 208.462 >> UnlockX 640 0.007 0.291 >> FIND_FIRST 66630 0.566 530.339 >> SET_FILE_INFORMATION 16000 1.419 811.494 >> QUERY_FILE_INFORMATION 30717 0.004 1.108 >> QUERY_PATH_INFORMATION 176153 0.182 517.419 >> QUERY_FS_INFORMATION 30857 0.018 18.562 >> NTCreateX 184145 0.281 582.076 >> >> So, with both bfq and deadline there seems to be a serious regression, >> especially on MaxLat, when moving from legacy block to blk-mq. The >> regression is much worse with deadline, as legacy-deadline has the >> lowest max latency among all the schedulers, whereas mq-deadline has >> the highest one. >> >> Regardless of the actual culprit of this regression, I would like to >> investigate further this issue. In this respect, I would like to ask >> for a little help. I would like to isolate the workloads generating >> the highest latencies. To this purpose, I had a look at the loadfile >> client-tiny.txt, and I still have a doubt: is every item in the >> loadfile executed somehow several times (for each value of the number >> of clients), or is it executed only once? 
More precisely, IIUC, for >> each operation reported in the above results, there are several items >> (lines) in the loadfile. So, is each of these items executed only >> once? >> >> I'm asking because, if it is executed only once, then I guess I can >> find the critical tasks more easily. Finally, if it is actually >> executed only once, is it expected that the latency for such a task is >> one order of magnitude higher than that of the average latency for >> that group of tasks? I mean, is such a task intrinsically much >> heavier, and then expectedly much longer, or is the fact that latency >> is much higher for this task a sign that something in the kernel >> misbehaves for that task? >> >> While waiting for some feedback, I'm going to execute your test >> showing great unfairness between writes and reads, and to also check >> whether responsiveness does worsen if the write workload for that test >> is being executed in the background. >> >> Thanks, >> Paolo >> >>> ... >>>> -- >>>> Mel Gorman >>>> SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Switching to MQ by default may generate some bug reports 2017-08-08 8:06 ` Paolo Valente @ 2017-08-08 17:33 ` Paolo Valente 2017-08-08 18:27 ` Mel Gorman 2017-08-09 21:49 ` Paolo Valente 0 siblings, 2 replies; 29+ messages in thread From: Paolo Valente @ 2017-08-08 17:33 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block > Il giorno 08 ago 2017, alle ore 10:06, Paolo Valente <paolo.valente@linaro.org> ha scritto: > >> >> Il giorno 07 ago 2017, alle ore 20:42, Paolo Valente <paolo.valente@linaro.org> ha scritto: >> >>> >>> Il giorno 07 ago 2017, alle ore 19:32, Paolo Valente <paolo.valente@linaro.org> ha scritto: >>> >>>> >>>> Il giorno 05 ago 2017, alle ore 00:05, Paolo Valente <paolo.valente@linaro.org> ha scritto: >>>> >>>>> >>>>> Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <mgorman@techsingularity.net> ha scritto: >>>>> >>>>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote: >>>>>>> I took that into account BFQ with low-latency was also tested and the >>>>>>> impact was not a universal improvement although it can be a noticable >>>>>>> improvement. From the same machine; >>>>>>> >>>>>>> dbench4 Loadfile Execution Time >>>>>>> 4.12.0 4.12.0 4.12.0 >>>>>>> legacy-cfq mq-bfq mq-bfq-tput >>>>>>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%) >>>>>>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%) >>>>>>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%) >>>>>>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%) >>>>>>> >>>>>> >>>>>> Thanks for trying with low_latency disabled. If I read numbers >>>>>> correctly, we move from a worst case of 361% higher execution time to >>>>>> a worst case of 11%. With a best case of 20% of lower execution time. >>>>>> >>>>> >>>>> Yes. 
>>>>> >>>>>> I asked you about none and mq-deadline in a previous email, because >>>>>> actually we have a double change here: change of the I/O stack, and >>>>>> change of the scheduler, with the first change probably not irrelevant >>>>>> with respect to the second one. >>>>>> >>>>> >>>>> True. However, the difference between legacy-deadline mq-deadline is >>>>> roughly around the 5-10% mark across workloads for SSD. It's not >>>>> universally true but the impact is not as severe. While this is not >>>>> proof that the stack change is the sole root cause, it makes it less >>>>> likely. >>>>> >>>> >>>> I'm getting a little lost here. If I'm not mistaken, you are saying, >>>> since the difference between two virtually identical schedulers >>>> (legacy-deadline and mq-deadline) is only around 5-10%, while the >>>> difference between cfq and mq-bfq-tput is higher, then in the latter >>>> case it is not the stack's fault. Yet the loss of mq-bfq-tput in the >>>> above test is exactly in the 5-10% range? What am I missing? Other >>>> tests with mq-bfq-tput not yet reported? >>>> >>>>>> By chance, according to what you have measured so far, is there any >>>>>> test where, instead, you expect or have seen bfq-mq-tput to always >>>>>> lose? I could start from there. >>>>>> >>>>> >>>>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that >>>>> it could be the stack change. >>>>> >>>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any >>>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using >>>>> ext4 as a filesystem. The same is not true for XFS so the filesystem >>>>> matters. >>>>> >>>> >>>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as >>>> soon as I can, thanks. >>>> >>>> >>> >>> I've run this test and tried to further investigate this regression. >>> For the moment, the gist seems to be that blk-mq plays an important >>> role, not only with bfq (unless I'm considering the wrong numbers). 
>>> Even if your main purpose in this thread was just to give a heads-up, >>> I guess it may be useful to share what I have found out. In addition, >>> I want to ask for some help, to try to get closer to the possible >>> causes of at least this regression. If you think it would be better >>> to open a new thread on this stuff, I'll do it. >>> >>> First, I got mixed results on my system. I'll focus only on the >>> case where mq-bfq-tput achieves its worst relative performance w.r.t. >>> cfq, which happens with 64 clients. Still, also in this case >>> mq-bfq is better than cfq in all average values, but Flush. I don't >>> know which are the best/right values to look at, so, here's the final >>> report for both schedulers: >>> >>> CFQ >>> >>> Operation Count AvgLat MaxLat >>> -------------------------------------------------- >>> Flush 13120 20.069 348.594 >>> Close 133696 0.008 14.642 >>> LockX 512 0.009 0.059 >>> Rename 7552 1.857 415.418 >>> ReadX 270720 0.141 535.632 >>> WriteX 89591 421.961 6363.271 >>> Unlink 34048 1.281 662.467 >>> UnlockX 512 0.007 0.057 >>> FIND_FIRST 62016 0.086 25.060 >>> SET_FILE_INFORMATION 15616 0.995 176.621 >>> QUERY_FILE_INFORMATION 28734 0.004 1.372 >>> QUERY_PATH_INFORMATION 170240 0.163 820.292 >>> QUERY_FS_INFORMATION 28736 0.017 4.110 >>> NTCreateX 178688 0.437 905.567 >>> >>> MQ-BFQ-TPUT >>> >>> Operation Count AvgLat MaxLat >>> -------------------------------------------------- >>> Flush 13504 75.828 11196.035 >>> Close 136896 0.004 3.855 >>> LockX 640 0.005 0.031 >>> Rename 8064 1.020 288.989 >>> ReadX 297600 0.081 685.850 >>> WriteX 93515 391.637 12681.517 >>> Unlink 34880 0.500 146.928 >>> UnlockX 640 0.004 0.032 >>> FIND_FIRST 63680 0.045 222.491 >>> SET_FILE_INFORMATION 16000 0.436 686.115 >>> QUERY_FILE_INFORMATION 30464 0.003 0.773 >>> QUERY_PATH_INFORMATION 175552 0.044 148.449 >>> QUERY_FS_INFORMATION 29888 0.009 1.984 >>> NTCreateX 183152 0.289 300.867 >>> >>> Are these results in line with yours for this
test? >>> >>> Anyway, to investigate this regression more in depth, I took two >>> further steps. First, I repeated the same test with bfq-sq, my >>> out-of-tree version of bfq for legacy block (identical to mq-bfq apart >>> from the changes needed for bfq to live in blk-mq). I got: >>> >>> BFQ-SQ-TPUT >>> >>> Operation Count AvgLat MaxLat >>> -------------------------------------------------- >>> Flush 12618 30.212 484.099 >>> Close 123884 0.008 10.477 >>> LockX 512 0.010 0.170 >>> Rename 7296 2.032 426.409 >>> ReadX 262179 0.251 985.478 >>> WriteX 84072 461.398 7283.003 >>> Unlink 33076 1.685 848.734 >>> UnlockX 512 0.007 0.036 >>> FIND_FIRST 58690 0.096 220.720 >>> SET_FILE_INFORMATION 14976 1.792 466.435 >>> QUERY_FILE_INFORMATION 26575 0.004 2.194 >>> QUERY_PATH_INFORMATION 158125 0.112 614.063 >>> QUERY_FS_INFORMATION 28224 0.017 1.385 >>> NTCreateX 167877 0.827 945.644 >>> >>> So, the worst-case regression is now around 15%. This made me suspect >>> that blk-mq influences results a lot for this test. To crosscheck, I >>> compared legacy-deadline and mq-deadline too. >>> >> >> Ok, found the problem for the 15% loss in bfq-sq. bfq-sq gets >> occasionally confused by the workload, and grants device idling to >> processes that, for this specific workload, would be better to >> de-schedule immediately. 
If we set slice_idle to 0, then bfq-sq >> becomes more or less equivalent to cfq (for some operations apparently >> even much better): >> >> bfq-sq-tput-0idle >> >> Operation Count AvgLat MaxLat >> -------------------------------------------------- >> Flush 13013 17.888 280.517 >> Close 133004 0.008 20.698 >> LockX 512 0.008 0.088 >> Rename 7427 2.041 193.232 >> ReadX 270534 0.138 408.534 >> WriteX 88598 429.615 6272.212 >> Unlink 33734 1.205 559.152 >> UnlockX 512 0.011 1.808 >> FIND_FIRST 61762 0.087 23.012 >> SET_FILE_INFORMATION 15337 1.322 220.155 >> QUERY_FILE_INFORMATION 28415 0.004 0.559 >> QUERY_PATH_INFORMATION 169423 0.150 580.570 >> QUERY_FS_INFORMATION 28547 0.019 24.466 >> NTCreateX 177618 0.544 681.795 >> >> I'll try soon with mq-bfq too, for which I expect however a deeper >> investigation to be needed. >> > > Hi, > to test mq-bfq (with both slice_idle==0 and slice_idle>0), I have also > applied Ming patches, and Ah, victory! > > Regardless of the value of slice idle: > > mq-bfq-tput > > Operation Count AvgLat MaxLat > -------------------------------------------------- > Flush 13183 70.381 1025.407 > Close 134539 0.004 1.011 > LockX 512 0.005 0.025 > Rename 7721 0.740 404.979 > ReadX 274422 0.126 873.364 > WriteX 90535 408.371 7400.585 > Unlink 34276 0.634 581.067 > UnlockX 512 0.003 0.029 > FIND_FIRST 62664 0.052 321.027 > SET_FILE_INFORMATION 15981 0.234 124.739 > QUERY_FILE_INFORMATION 29042 0.003 1.731 > QUERY_PATH_INFORMATION 171769 0.032 522.415 > QUERY_FS_INFORMATION 28958 0.009 3.043 > NTCreateX 179643 0.298 687.466 > > Throughput 9.11183 MB/sec 64 clients 64 procs max_latency=7400.588 ms > > Differently from bfq-sq, setting slice_idle to 0 doesn't provide any > benefit, which lets me suspect that there is some other issue in > blk-mq (only a suspect). I think I may have already understood how to > guarantee that bfq almost never idles the device uselessly also for > this workload. 
Yet, since in blk-mq there is no gain even after > excluding useless idling, I'll wait for at least Ming's patches to be > merged before possibly proposing this contribution. Maybe some other > little issue related to this lack of gain in blk-mq will be found and > solved in the meantime. > > Moving to the read-write unfairness problem. >
I've reproduced the unfairness issue (rand reader throttled by heavy writers) with bfq, using configs/config-global-dhp__io-fio-randread-sync-heavywrite, but with an important side problem: cfq suffers from exactly the same unfairness (785kB/s writers, 13.4kB/s reader). Of course, this happens on my system, with a HITACHI HTS727550A9E364.

This discrepancy with your results makes it a little bit harder for me to understand how best to proceed, as I see no regression. Anyway, since this reader-throttling issue seems relevant, I have investigated it a little more in depth. The cause of the throttling is that the fdatasync frequently performed by the writers in this test turns the I/O of the writers into 100% sync I/O. And neither bfq nor cfq differentiates bandwidth between sync reads and sync writes. Basically, both cfq and bfq are willing to dispatch the I/O requests of each writer for a time slot equal to that devoted to the reader. But write requests, after reaching the device, occupy it for much more time than reads. This delays the completion of the reader's requests and, the I/O being sync, also the issuing of the reader's next requests. The final result is that the device spends most of the time serving write requests, while the reader issues its read requests very slowly.

It might not be so difficult to balance this unfairness, although I'm a little worried about changing bfq without being able to see the regression you report. In case I give it a try, could I then count on some testing on your machines?
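The mechanism just described, equal scheduler time slots but asymmetric device service times, can be sketched with a toy model (all numbers below are illustrative assumptions, not measurements from this thread):

```python
def reader_device_share(n_writers, write_service_ms, read_service_ms):
    # One scheduling round: the reader and each writer dispatch one sync
    # request apiece; the device time consumed is the sum of per-request
    # service times, so the reader's share shrinks as writes get slower
    # or more numerous.
    total = n_writers * write_service_ms + read_service_ms
    return read_service_ms / total

# With 4 fsync-bound writers at 40 ms of device time per write against
# 8 ms reads, the reader gets under 5% of the device:
share = reader_device_share(4, 40.0, 8.0)
```

The model only illustrates why fair dispatching does not imply fair device time; it says nothing about which scheduler change would rebalance it.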
Thanks, Paolo > Thanks, > Paolo > >> Thanks, >> Paolo >> >>> LEGACY-DEADLINE >>> >>> Operation Count AvgLat MaxLat >>> -------------------------------------------------- >>> Flush 13267 9.622 298.206 >>> Close 135692 0.007 10.627 >>> LockX 640 0.008 0.066 >>> Rename 7827 0.544 481.123 >>> ReadX 285929 0.220 2698.442 >>> WriteX 92309 430.867 5191.608 >>> Unlink 34534 1.133 619.235 >>> UnlockX 640 0.008 0.724 >>> FIND_FIRST 63289 0.086 56.851 >>> SET_FILE_INFORMATION 16000 1.254 844.065 >>> QUERY_FILE_INFORMATION 29883 0.004 0.618 >>> QUERY_PATH_INFORMATION 173232 0.089 1295.651 >>> QUERY_FS_INFORMATION 29632 0.017 4.813 >>> NTCreateX 181464 0.479 2214.343 >>> >>> >>> MQ-DEADLINE >>> >>> Operation Count AvgLat MaxLat >>> -------------------------------------------------- >>> Flush 13760 90.542 13221.495 >>> Close 137654 0.008 27.133 >>> LockX 640 0.009 0.115 >>> Rename 8064 1.062 246.759 >>> ReadX 297956 0.051 347.018 >>> WriteX 94698 425.636 15090.020 >>> Unlink 35077 0.580 208.462 >>> UnlockX 640 0.007 0.291 >>> FIND_FIRST 66630 0.566 530.339 >>> SET_FILE_INFORMATION 16000 1.419 811.494 >>> QUERY_FILE_INFORMATION 30717 0.004 1.108 >>> QUERY_PATH_INFORMATION 176153 0.182 517.419 >>> QUERY_FS_INFORMATION 30857 0.018 18.562 >>> NTCreateX 184145 0.281 582.076 >>> >>> So, with both bfq and deadline there seems to be a serious regression, >>> especially on MaxLat, when moving from legacy block to blk-mq. The >>> regression is much worse with deadline, as legacy-deadline has the >>> lowest max latency among all the schedulers, whereas mq-deadline has >>> the highest one. >>> >>> Regardless of the actual culprit of this regression, I would like to >>> investigate further this issue. In this respect, I would like to ask >>> for a little help. I would like to isolate the workloads generating >>> the highest latencies. 
To this purpose, I had a look at the loadfile >>> client-tiny.txt, and I still have a doubt: is every item in the >>> loadfile executed somehow several times (for each value of the number >>> of clients), or is it executed only once? More precisely, IIUC, for >>> each operation reported in the above results, there are several items >>> (lines) in the loadfile. So, is each of these items executed only >>> once? >>> >>> I'm asking because, if it is executed only once, then I guess I can >>> find the critical tasks more easily. Finally, if it is actually >>> executed only once, is it expected that the latency for such a task is >>> one order of magnitude higher than that of the average latency for >>> that group of tasks? I mean, is such a task intrinsically much >>> heavier, and then expectedly much longer, or is the fact that latency >>> is much higher for this task a sign that something in the kernel >>> misbehaves for that task? >>> >>> While waiting for some feedback, I'm going to execute your test >>> showing great unfairness between writes and reads, and to also check >>> whether responsiveness does worsen if the write workload for that test >>> is being executed in the background. >>> >>> Thanks, >>> Paolo >>> >>>> ... >>>>> -- >>>>> Mel Gorman >>>>> SUSE Labs
* Re: Switching to MQ by default may generate some bug reports 2017-08-08 17:33 ` Paolo Valente @ 2017-08-08 18:27 ` Mel Gorman 2017-08-09 21:49 ` Paolo Valente 1 sibling, 0 replies; 29+ messages in thread From: Mel Gorman @ 2017-08-08 18:27 UTC (permalink / raw) To: Paolo Valente; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block On Tue, Aug 08, 2017 at 07:33:37PM +0200, Paolo Valente wrote: > > Differently from bfq-sq, setting slice_idle to 0 doesn't provide any > > benefit, which lets me suspect that there is some other issue in > > blk-mq (only a suspect). I think I may have already understood how to > > guarantee that bfq almost never idles the device uselessly also for > > this workload. Yet, since in blk-mq there is no gain even after > > excluding useless idling, I'll wait for at least Ming's patches to be > > merged before possibly proposing this contribution. Maybe some other > > little issue related to this lack of gain in blk-mq will be found and > > solved in the meantime. > > > > Moving to the read-write unfairness problem. > > > > I've reproduced the unfairness issue (rand reader throttled by heavy > writers) with bfq, using > configs/config-global-dhp__io-fio-randread-sync-heavywrite, but with > an important side problem: cfq suffers from exactly the same > unfairness (785kB/s writers, 13.4kB/s reader). Of course, this > happens in my system, with a HITACHI HTS727550A9E364. > It's interesting that CFQ suffers the same on your system. It's possible that this is down to luck and the results depend not only on the disk but the number of CPUs. At absolute minimum we saw different latency figures from dbench even if the only observation is "different machines behave differently, news at 11". If the results are inconsistent, then the value of the benchmark can be dropped as a basis of comparison between IO schedulers (although I'll be keeping it for detecting regressions between releases).
When the v4 results from Ming's patches complete, I'll double check the results from this config. > This discrepancy with your results makes a little bit harder for me to > understand how to better proceed, as I see no regression. Anyway, > since this reader-throttling issue seems relevant, I have investigated > it a little more in depth. The cause of the throttling is that the > fdatasync frequently performed by the writers in this test turns the > I/O of the writers into a 100% sync I/O. And neither bfq or cfq > differentiate bandwidth between sync reads and sync writes. Basically > both cfq and bfq are willing to dispatch the I/O requests of each > writer for a time slot equal to that devoted to the reader. But write > requests, after reaching the device, use the latter for much more time > than reads. This delays the completion of the requests of the reader, > and, being the I/O sync, the issuing of the next I/O requests by the > reader. The final result is that the device spends most of the time > serving write requests, while the reader issues its read requests very > slowly. > That is certainly plausible and implies that the actual results depend too heavily on random timing factors and disk model to be really useful. > It might not be so difficult to balance this unfairness, although I'm > a little worried about changing bfq without being able to see the > regression you report. In case I give it a try, could I then count on > some testing on your machines? > Yes, with the caveat that results take a variable amount of time depending on how many problems I'm juggling in the air and how many of them are occupying time on the machines. -- Mel Gorman SUSE Labs
* Re: Switching to MQ by default may generate some bug reports 2017-08-08 17:33 ` Paolo Valente 2017-08-08 18:27 ` Mel Gorman @ 2017-08-09 21:49 ` Paolo Valente 2017-08-10 8:44 ` Mel Gorman 1 sibling, 1 reply; 29+ messages in thread From: Paolo Valente @ 2017-08-09 21:49 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block > Il giorno 08 ago 2017, alle ore 19:33, Paolo Valente <paolo.valente@linaro.org> ha scritto: > >> >> Il giorno 08 ago 2017, alle ore 10:06, Paolo Valente <paolo.valente@linaro.org> ha scritto: >> >>> >>> Il giorno 07 ago 2017, alle ore 20:42, Paolo Valente <paolo.valente@linaro.org> ha scritto: >>> >>>> >>>> Il giorno 07 ago 2017, alle ore 19:32, Paolo Valente <paolo.valente@linaro.org> ha scritto: >>>> >>>>> >>>>> Il giorno 05 ago 2017, alle ore 00:05, Paolo Valente <paolo.valente@linaro.org> ha scritto: >>>>> >>>>>> >>>>>> Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <mgorman@techsingularity.net> ha scritto: >>>>>> >>>>>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote: >>>>>>>> I took that into account BFQ with low-latency was also tested and the >>>>>>>> impact was not a universal improvement although it can be a noticable >>>>>>>> improvement. From the same machine; >>>>>>>> >>>>>>>> dbench4 Loadfile Execution Time >>>>>>>> 4.12.0 4.12.0 4.12.0 >>>>>>>> legacy-cfq mq-bfq mq-bfq-tput >>>>>>>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%) >>>>>>>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%) >>>>>>>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%) >>>>>>>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%) >>>>>>>> >>>>>>> >>>>>>> Thanks for trying with low_latency disabled. If I read numbers >>>>>>> correctly, we move from a worst case of 361% higher execution time to >>>>>>> a worst case of 11%. With a best case of 20% of lower execution time. >>>>>>> >>>>>> >>>>>> Yes. 
>>>>>> >>>>>>> I asked you about none and mq-deadline in a previous email, because >>>>>>> actually we have a double change here: change of the I/O stack, and >>>>>>> change of the scheduler, with the first change probably not irrelevant >>>>>>> with respect to the second one. >>>>>>> >>>>>> >>>>>> True. However, the difference between legacy-deadline mq-deadline is >>>>>> roughly around the 5-10% mark across workloads for SSD. It's not >>>>>> universally true but the impact is not as severe. While this is not >>>>>> proof that the stack change is the sole root cause, it makes it less >>>>>> likely. >>>>>> >>>>> >>>>> I'm getting a little lost here. If I'm not mistaken, you are saying, >>>>> since the difference between two virtually identical schedulers >>>>> (legacy-deadline and mq-deadline) is only around 5-10%, while the >>>>> difference between cfq and mq-bfq-tput is higher, then in the latter >>>>> case it is not the stack's fault. Yet the loss of mq-bfq-tput in the >>>>> above test is exactly in the 5-10% range? What am I missing? Other >>>>> tests with mq-bfq-tput not yet reported? >>>>> >>>>>>> By chance, according to what you have measured so far, is there any >>>>>>> test where, instead, you expect or have seen bfq-mq-tput to always >>>>>>> lose? I could start from there. >>>>>>> >>>>>> >>>>>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that >>>>>> it could be the stack change. >>>>>> >>>>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any >>>>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using >>>>>> ext4 as a filesystem. The same is not true for XFS so the filesystem >>>>>> matters. >>>>>> >>>>> >>>>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as >>>>> soon as I can, thanks. >>>>> >>>>> >>>> >>>> I've run this test and tried to further investigate this regression. 
>>>> For the moment, the gist seems to be that blk-mq plays an important >>>> role, not only with bfq (unless I'm considering the wrong numbers). >>>> Even if your main purpose in this thread was just to give a heads-up, >>>> I guess it may be useful to share what I have found out. In addition, >>>> I want to ask for some help, to try to get closer to the possible >>>> causes of at least this regression. If you think it would be better >>>> to open a new thread on this stuff, I'll do it. >>>> >>>> First, I got mixed results on my system. I'll focus only on the the >>>> case where mq-bfq-tput achieves its worst relative performance w.r.t. >>>> to cfq, which happens with 64 clients. Still, also in this case >>>> mq-bfq is better than cfq in all average values, but Flush. I don't >>>> know which are the best/right values to look at, so, here's the final >>>> report for both schedulers: >>>> >>>> CFQ >>>> >>>> Operation Count AvgLat MaxLat >>>> -------------------------------------------------- >>>> Flush 13120 20.069 348.594 >>>> Close 133696 0.008 14.642 >>>> LockX 512 0.009 0.059 >>>> Rename 7552 1.857 415.418 >>>> ReadX 270720 0.141 535.632 >>>> WriteX 89591 421.961 6363.271 >>>> Unlink 34048 1.281 662.467 >>>> UnlockX 512 0.007 0.057 >>>> FIND_FIRST 62016 0.086 25.060 >>>> SET_FILE_INFORMATION 15616 0.995 176.621 >>>> QUERY_FILE_INFORMATION 28734 0.004 1.372 >>>> QUERY_PATH_INFORMATION 170240 0.163 820.292 >>>> QUERY_FS_INFORMATION 28736 0.017 4.110 >>>> NTCreateX 178688 0.437 905.567 >>>> >>>> MQ-BFQ-TPUT >>>> >>>> Operation Count AvgLat MaxLat >>>> -------------------------------------------------- >>>> Flush 13504 75.828 11196.035 >>>> Close 136896 0.004 3.855 >>>> LockX 640 0.005 0.031 >>>> Rename 8064 1.020 288.989 >>>> ReadX 297600 0.081 685.850 >>>> WriteX 93515 391.637 12681.517 >>>> Unlink 34880 0.500 146.928 >>>> UnlockX 640 0.004 0.032 >>>> FIND_FIRST 63680 0.045 222.491 >>>> SET_FILE_INFORMATION 16000 0.436 686.115 >>>> QUERY_FILE_INFORMATION 30464 
0.003 0.773 >>>> QUERY_PATH_INFORMATION 175552 0.044 148.449 >>>> QUERY_FS_INFORMATION 29888 0.009 1.984 >>>> NTCreateX 183152 0.289 300.867 >>>> >>>> Are these results in line with yours for this test? >>>> >>>> Anyway, to investigate this regression more in depth, I took two >>>> further steps. First, I repeated the same test with bfq-sq, my >>>> out-of-tree version of bfq for legacy block (identical to mq-bfq apart >>>> from the changes needed for bfq to live in blk-mq). I got: >>>> >>>> BFQ-SQ-TPUT >>>> >>>> Operation Count AvgLat MaxLat >>>> -------------------------------------------------- >>>> Flush 12618 30.212 484.099 >>>> Close 123884 0.008 10.477 >>>> LockX 512 0.010 0.170 >>>> Rename 7296 2.032 426.409 >>>> ReadX 262179 0.251 985.478 >>>> WriteX 84072 461.398 7283.003 >>>> Unlink 33076 1.685 848.734 >>>> UnlockX 512 0.007 0.036 >>>> FIND_FIRST 58690 0.096 220.720 >>>> SET_FILE_INFORMATION 14976 1.792 466.435 >>>> QUERY_FILE_INFORMATION 26575 0.004 2.194 >>>> QUERY_PATH_INFORMATION 158125 0.112 614.063 >>>> QUERY_FS_INFORMATION 28224 0.017 1.385 >>>> NTCreateX 167877 0.827 945.644 >>>> >>>> So, the worst-case regression is now around 15%. This made me suspect >>>> that blk-mq influences results a lot for this test. To crosscheck, I >>>> compared legacy-deadline and mq-deadline too. >>>> >>> >>> Ok, found the problem for the 15% loss in bfq-sq. bfq-sq gets >>> occasionally confused by the workload, and grants device idling to >>> processes that, for this specific workload, would be better to >>> de-schedule immediately. 
If we set slice_idle to 0, then bfq-sq >>> becomes more or less equivalent to cfq (for some operations apparently >>> even much better): >>> >>> bfq-sq-tput-0idle >>> >>> Operation Count AvgLat MaxLat >>> -------------------------------------------------- >>> Flush 13013 17.888 280.517 >>> Close 133004 0.008 20.698 >>> LockX 512 0.008 0.088 >>> Rename 7427 2.041 193.232 >>> ReadX 270534 0.138 408.534 >>> WriteX 88598 429.615 6272.212 >>> Unlink 33734 1.205 559.152 >>> UnlockX 512 0.011 1.808 >>> FIND_FIRST 61762 0.087 23.012 >>> SET_FILE_INFORMATION 15337 1.322 220.155 >>> QUERY_FILE_INFORMATION 28415 0.004 0.559 >>> QUERY_PATH_INFORMATION 169423 0.150 580.570 >>> QUERY_FS_INFORMATION 28547 0.019 24.466 >>> NTCreateX 177618 0.544 681.795 >>> >>> I'll try soon with mq-bfq too, for which I expect however a deeper >>> investigation to be needed. >>> >> >> Hi, >> to test mq-bfq (with both slice_idle==0 and slice_idle>0), I have also >> applied Ming patches, and Ah, victory! >> >> Regardless of the value of slice idle: >> >> mq-bfq-tput >> >> Operation Count AvgLat MaxLat >> -------------------------------------------------- >> Flush 13183 70.381 1025.407 >> Close 134539 0.004 1.011 >> LockX 512 0.005 0.025 >> Rename 7721 0.740 404.979 >> ReadX 274422 0.126 873.364 >> WriteX 90535 408.371 7400.585 >> Unlink 34276 0.634 581.067 >> UnlockX 512 0.003 0.029 >> FIND_FIRST 62664 0.052 321.027 >> SET_FILE_INFORMATION 15981 0.234 124.739 >> QUERY_FILE_INFORMATION 29042 0.003 1.731 >> QUERY_PATH_INFORMATION 171769 0.032 522.415 >> QUERY_FS_INFORMATION 28958 0.009 3.043 >> NTCreateX 179643 0.298 687.466 >> >> Throughput 9.11183 MB/sec 64 clients 64 procs max_latency=7400.588 ms >> >> Differently from bfq-sq, setting slice_idle to 0 doesn't provide any >> benefit, which lets me suspect that there is some other issue in >> blk-mq (only a suspect). I think I may have already understood how to >> guarantee that bfq almost never idles the device uselessly also for >> this workload. 
Yet, since in blk-mq there is no gain even after >> excluding useless idling, I'll wait for at least Ming's patches to be >> merged before possibly proposing this contribution. Maybe some other >> little issue related to this lack of gain in blk-mq will be found and >> solved in the meantime. >> >> Moving to the read-write unfairness problem. >> > > I've reproduced the unfairness issue (rand reader throttled by heavy > writers) with bfq, using > configs/config-global-dhp__io-fio-randread-sync-heavywrite, but with > an important side problem: cfq suffers from exactly the same > unfairness (785kB/s writers, 13.4kB/s reader). Of course, this > happens in my system, with a HITACHI HTS727550A9E364. > > This discrepancy with your results makes a little bit harder for me to > understand how to better proceed, as I see no regression. Anyway, > since this reader-throttling issue seems relevant, I have investigated > it a little more in depth. The cause of the throttling is that the > fdatasync frequently performed by the writers in this test turns the > I/O of the writers into a 100% sync I/O. And neither bfq or cfq > differentiate bandwidth between sync reads and sync writes. Basically > both cfq and bfq are willing to dispatch the I/O requests of each > writer for a time slot equal to that devoted to the reader. But write > requests, after reaching the device, use the latter for much more time > than reads. This delays the completion of the requests of the reader, > and, being the I/O sync, the issuing of the next I/O requests by the > reader. The final result is that the device spends most of the time > serving write requests, while the reader issues its read requests very > slowly. > > It might not be so difficult to balance this unfairness, although I'm > a little worried about changing bfq without being able to see the > regression you report. In case I give it a try, could I then count on > some testing on your machines? 
>
Hi Mel,
I've investigated this test case a little bit more, and the outcome is
unfortunately rather drastic, unless I'm missing some important point.
It is impossible to control the rate of the reader with the exact
configuration of this test. In fact, since iodepth is equal to 1, the
reader issues one I/O request at a time. When one such request is
dispatched, after some write requests have already been dispatched
(and then queued in the device), the time to serve the request is
controlled only by the device. The longer the device makes the read
request wait before being served, the later the reader will see the
completion of its request, and then the later the reader will issue a
new request, and so on. So, for this test, it is mainly the device
controller that decides the rate of the reader.

On the other hand, the scheduler can regain control of the bandwidth
of the reader if the reader issues more than one request at a time.
Anyway, before analyzing this second, controllable case, I wanted to
test responsiveness with this heavy write workload in the background.
And it was very bad! After some hours of mild panic, I found out that
this failure depends on a bug in bfq, a bug that, luckily, happens to
be triggered by these heavy writes as a background workload ...

I've already found and am testing a fix for this bug. Yet, it will
probably take me some weeks to submit this fix, because I'm finally
going on vacation.
Thanks,
Paolo

> Thanks,
> Paolo
>
>> Thanks,
>> Paolo
>>
>>> Thanks,
>>> Paolo
>>>
>>>> LEGACY-DEADLINE
>>>>
>>>> Operation                Count    AvgLat    MaxLat
>>>> --------------------------------------------------
>>>> Flush                    13267     9.622   298.206
>>>> Close                   135692     0.007    10.627
>>>> LockX                      640     0.008     0.066
>>>> Rename                    7827     0.544   481.123
>>>> ReadX                   285929     0.220  2698.442
>>>> WriteX                   92309   430.867  5191.608
>>>> Unlink                   34534     1.133   619.235
>>>> UnlockX                    640     0.008     0.724
>>>> FIND_FIRST               63289     0.086    56.851
>>>> SET_FILE_INFORMATION     16000     1.254   844.065
>>>> QUERY_FILE_INFORMATION   29883     0.004     0.618
>>>> QUERY_PATH_INFORMATION  173232     0.089  1295.651
>>>> QUERY_FS_INFORMATION     29632     0.017     4.813
>>>> NTCreateX               181464     0.479  2214.343
>>>>
>>>>
>>>> MQ-DEADLINE
>>>>
>>>> Operation                Count    AvgLat    MaxLat
>>>> --------------------------------------------------
>>>> Flush                    13760    90.542 13221.495
>>>> Close                   137654     0.008    27.133
>>>> LockX                      640     0.009     0.115
>>>> Rename                    8064     1.062   246.759
>>>> ReadX                   297956     0.051   347.018
>>>> WriteX                   94698   425.636 15090.020
>>>> Unlink                   35077     0.580   208.462
>>>> UnlockX                    640     0.007     0.291
>>>> FIND_FIRST               66630     0.566   530.339
>>>> SET_FILE_INFORMATION     16000     1.419   811.494
>>>> QUERY_FILE_INFORMATION   30717     0.004     1.108
>>>> QUERY_PATH_INFORMATION  176153     0.182   517.419
>>>> QUERY_FS_INFORMATION     30857     0.018    18.562
>>>> NTCreateX               184145     0.281   582.076
>>>>
>>>> So, with both bfq and deadline there seems to be a serious regression,
>>>> especially on MaxLat, when moving from legacy block to blk-mq. The
>>>> regression is much worse with deadline, as legacy-deadline has the
>>>> lowest max latency among all the schedulers, whereas mq-deadline has
>>>> the highest one.
>>>>
>>>> Regardless of the actual culprit of this regression, I would like to
>>>> investigate this issue further. In this respect, I would like to ask
>>>> for a little help. I would like to isolate the workloads generating
>>>> the highest latencies.
>>>> To this purpose, I had a look at the loadfile
>>>> client-tiny.txt, and I still have a doubt: is every item in the
>>>> loadfile executed somehow several times (for each value of the number
>>>> of clients), or is it executed only once? More precisely, IIUC, for
>>>> each operation reported in the above results, there are several items
>>>> (lines) in the loadfile. So, is each of these items executed only
>>>> once?
>>>>
>>>> I'm asking because, if it is executed only once, then I guess I can
>>>> find the critical tasks more easily. Finally, if it is actually
>>>> executed only once, is it expected that the latency for such a task is
>>>> one order of magnitude higher than the average latency for
>>>> that group of tasks? I mean, is such a task intrinsically much
>>>> heavier, and then expectedly much longer, or is the fact that latency
>>>> is much higher for this task a sign that something in the kernel
>>>> misbehaves for that task?
>>>>
>>>> While waiting for some feedback, I'm going to execute your test
>>>> showing great unfairness between writes and reads, and to also check
>>>> whether responsiveness worsens if the write workload for that test
>>>> is being executed in the background.
>>>>
>>>> Thanks,
>>>> Paolo
>>>>
>>>>> ...
>>>>>> --
>>>>>> Mel Gorman
>>>>>> SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: Switching to MQ by default may generate some bug reports
  2017-08-09 21:49 ` Paolo Valente
@ 2017-08-10  8:44 ` Mel Gorman
  0 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2017-08-10 8:44 UTC (permalink / raw)
To: Paolo Valente; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block

On Wed, Aug 09, 2017 at 11:49:17PM +0200, Paolo Valente wrote:
> > This discrepancy with your results makes it a little bit harder for
> > me to understand how to best proceed, as I see no regression. Anyway,
> > since this reader-throttling issue seems relevant, I have investigated
> > it a little more in depth. The cause of the throttling is that the
> > fdatasync frequently performed by the writers in this test turns the
> > I/O of the writers into 100% sync I/O. And neither bfq nor cfq
> > differentiates bandwidth between sync reads and sync writes. Basically
> > both cfq and bfq are willing to dispatch the I/O requests of each
> > writer for a time slot equal to that devoted to the reader. But write
> > requests, after reaching the device, occupy it for much more time
> > than reads. This delays the completion of the requests of the reader
> > and, the I/O being sync, the issuing of the reader's next requests.
> > The final result is that the device spends most of the time serving
> > write requests, while the reader issues its read requests very slowly.
> >
> > It might not be so difficult to balance this unfairness, although I'm
> > a little worried about changing bfq without being able to see the
> > regression you report. In case I give it a try, could I then count on
> > some testing on your machines?
> >
>
> Hi Mel,
> I've investigated this test case a little bit more, and the outcome is
> unfortunately rather drastic, unless I'm missing some important point.
> It is impossible to control the rate of the reader with the exact
> configuration of this test.

Correct, both are simply competing for access to IO.
Very broadly speaking, it's only checking for loose (but not perfect)
fairness with different IO patterns. While it's not a recent problem,
historically (2+ years ago) we had problems whereby a heavy reader or
writer could starve IO completely. It had odd effects like some
multi-threaded benchmarks being artificially good simply because one
thread would dominate, artificially complete faster and exit
prematurely. "Fixing" it had a tendency to help real workloads while
hurting some benchmarks, so it's not straight-forward to control for
properly. Bottom line, I'm not necessarily worried if a particular
benchmark shows an apparent regression once I understand why and can
convince myself that a "real" workload benefits from it (preferably
proving it).

> In fact, since iodepth is equal to 1, the
> reader issues one I/O request at a time. When one such request is
> dispatched, after some write requests have already been dispatched
> (and then queued in the device), the time to serve the request is
> controlled only by the device. The longer the device makes the read
> request wait before being served, the later the reader will see the
> completion of its request, and then the later the reader will issue a
> new request, and so on. So, for this test, it is mainly the device
> controller that decides the rate of the reader.
>

Understood. It's less than ideal but not a completely silly test
either. That said, the fio tests are relatively new compared to some of
the tests monitored by mmtests looking for issues. It can take time to
finalise a test configuration before it's giving useful data 100% of
the time.

> On the other hand, the scheduler can regain control of the
> bandwidth of the reader if the reader issues more than one request at
> a time.

Ok, I'll take it as a todo item to increase the depth as a depth of 1
is not that interesting as such. It's also on my todo list to add fio
configs that add think time.
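[Editor's note: a hedged illustration of the iodepth point above, using POSIX `pread` from Python rather than fio's I/O engines. With one request outstanding, each read can only be issued after the previous one completes, so the device paces the reader; with several outstanding requests (here approximated with threads), the block layer again has multiple reader requests to arbitrate among the writers'. The file layout and depth values are arbitrary.]

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096

def serial_reads(fd, offsets):
    """iodepth=1: each read is issued only after the previous one
    completes, so completion latency at the device gates the reader."""
    return [os.pread(fd, BLOCK, off) for off in offsets]

def concurrent_reads(fd, offsets, depth=8):
    """iodepth>1: up to `depth` reads are outstanding at once, giving
    the I/O scheduler multiple reader requests to dispatch against a
    competing write stream."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        return list(pool.map(lambda off: os.pread(fd, BLOCK, off), offsets))

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data")
        with open(path, "wb") as f:
            f.write(os.urandom(64 * BLOCK))
        fd = os.open(path, os.O_RDONLY)
        offsets = [i * BLOCK for i in range(64)]
        try:
            a = serial_reads(fd, offsets)
            b = concurrent_reads(fd, offsets)
        finally:
            os.close(fd)
        assert a == b  # same data either way; only the queue depth differs
        print("both modes read", sum(map(len, a)), "bytes")
```

In fio terms this corresponds to raising `iodepth` (with an async engine) so the scheduler, rather than the device alone, influences the reader's share of the bandwidth.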
> Anyway, before analyzing this second, controllable case, I
> wanted to test responsiveness with this heavy write workload in the
> background. And it was very bad! After some hours of mild panic, I
> found out that this failure depends on a bug in bfq, a bug that,
> luckily, happens to be triggered by these heavy writes as a background
> workload ...
>
> I've already found and am testing a fix for this bug. Yet, it will
> probably take me some weeks to submit this fix, because I'm finally
> going on vacation.
>

This is obviously both good and bad. Bad in that the bug exists at all,
good in that you detected it and a fix is possible. I don't think you
have to panic considering that some of the pending fixes include Ming's
work, which won't be merged for quite some time, and tests take a long
time anyway. Whenever you get around to a fix after your vacation, just
cc me and I'll queue it across a range of machines so you have some
independent tests. A review from me would not be worth much as I haven't
spent the time to fully understand BFQ yet.

If the fixes do not hit until the next merge window, or the window after
that, then someone who cares enough can do a performance-based -stable
backport. If there are any bugs in the meantime (e.g. after 4.13 comes
out) then there will be a series for the reporter to test. I think it's
still reasonably positive that issues with MQ being enabled by default
were detected within weeks, with potential fixes in the pipeline. That's
better than months passing before a distro picked up a suitable kernel
and enough time passed for a coherent bug report to show up that's
better than "my computer is slow".

Thanks for the hard work and prompt research.

--
Mel Gorman
SUSE Labs
* Re: Switching to MQ by default may generate some bug reports
  2017-08-07 17:32 ` Paolo Valente
  2017-08-07 18:42 ` Paolo Valente
@ 2017-08-08 10:30 ` Mel Gorman
  2017-08-08 10:43 ` Ming Lei
  2017-08-08 17:16 ` Paolo Valente
  1 sibling, 2 replies; 29+ messages in thread
From: Mel Gorman @ 2017-08-08 10:30 UTC (permalink / raw)
To: Paolo Valente; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block

On Mon, Aug 07, 2017 at 07:32:41PM +0200, Paolo Valente wrote:
> >> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
> >> machine tested. This is global-dhp__io-dbench4-fsync from mmtests
> >> using ext4 as a filesystem. The same is not true for XFS, so the
> >> filesystem matters.
> >>
> >
> > Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
> > soon as I can, thanks.
> >
> >
>
> I've run this test and tried to further investigate this regression.
> For the moment, the gist seems to be that blk-mq plays an important
> role, not only with bfq (unless I'm considering the wrong numbers).
> Even if your main purpose in this thread was just to give a heads-up,
> I guess it may be useful to share what I have found out. In addition,
> I want to ask for some help, to try to get closer to the possible
> causes of at least this regression. If you think it would be better
> to open a new thread on this stuff, I'll do it.
>

I don't think it's necessary unless Christoph or Jens object, and I
doubt they will.

> First, I got mixed results on my system.

For what it's worth, this is standard. In my experience, IO benchmarks
are always multi-modal, particularly on rotary storage. Cases of
universal win or universal loss for a scheduler or set of tuning are
rare.

> I'll focus only on the
> case where mq-bfq-tput achieves its worst relative performance w.r.t.
> cfq, which happens with 64 clients. Still, also in this case
> mq-bfq is better than cfq in all average values, but Flush.
> I don't
> know which are the best/right values to look at, so here's the final
> report for both schedulers:
>

For what it's worth, it has often been observed that dbench overall
performance was dominated by flush costs. This is also true for the
standard reported throughput figures rather than the modified load file
elapsed time that mmtests reports. In dbench3 it was even worse, where
the "performance" was dominated by whether the temporary files were
deleted before writeback started.

> CFQ
>
> Operation                Count    AvgLat    MaxLat
> --------------------------------------------------
> Flush                    13120    20.069   348.594
> Close                   133696     0.008    14.642
> LockX                      512     0.009     0.059
> Rename                    7552     1.857   415.418
> ReadX                   270720     0.141   535.632
> WriteX                   89591   421.961  6363.271
> Unlink                   34048     1.281   662.467
> UnlockX                    512     0.007     0.057
> FIND_FIRST               62016     0.086    25.060
> SET_FILE_INFORMATION     15616     0.995   176.621
> QUERY_FILE_INFORMATION   28734     0.004     1.372
> QUERY_PATH_INFORMATION  170240     0.163   820.292
> QUERY_FS_INFORMATION     28736     0.017     4.110
> NTCreateX               178688     0.437   905.567
>
> MQ-BFQ-TPUT
>
> Operation                Count    AvgLat    MaxLat
> --------------------------------------------------
> Flush                    13504    75.828 11196.035
> Close                   136896     0.004     3.855
> LockX                      640     0.005     0.031
> Rename                    8064     1.020   288.989
> ReadX                   297600     0.081   685.850
> WriteX                   93515   391.637 12681.517
> Unlink                   34880     0.500   146.928
> UnlockX                    640     0.004     0.032
> FIND_FIRST               63680     0.045   222.491
> SET_FILE_INFORMATION     16000     0.436   686.115
> QUERY_FILE_INFORMATION   30464     0.003     0.773
> QUERY_PATH_INFORMATION  175552     0.044   148.449
> QUERY_FS_INFORMATION     29888     0.009     1.984
> NTCreateX               183152     0.289   300.867
>
> Are these results in line with yours for this test?
>

Very broadly speaking yes, but it varies. On a small machine, the
differences in flush latency are visible but not as dramatic. It only
has a few CPUs. On a machine that tops out with 32 CPUs, it is more
noticeable.
On the one machine I have that topped out with CFQ/BFQ at 64 threads,
the latency of flush is vaguely similar

                             CFQ                 BFQ                 BFQ-TPUT
latency avg-Flush-64       287.05 (  0.00%)    389.14 ( -35.57%)    349.90 ( -21.90%)
latency avg-Close-64         0.00 (  0.00%)      0.00 ( -33.33%)      0.00 (   0.00%)
latency avg-LockX-64         0.01 (  0.00%)      0.01 ( -16.67%)      0.01 (   0.00%)
latency avg-Rename-64        0.18 (  0.00%)      0.21 ( -16.39%)      0.18 (   3.28%)
latency avg-ReadX-64         0.10 (  0.00%)      0.15 ( -40.95%)      0.15 ( -40.95%)
latency avg-WriteX-64        0.86 (  0.00%)      0.81 (   6.18%)      0.74 (  13.75%)
latency avg-Unlink-64        1.49 (  0.00%)      1.52 (  -2.28%)      1.14 (  23.69%)
latency avg-UnlockX-64       0.00 (  0.00%)      0.00 (   0.00%)      0.00 (   0.00%)
latency avg-NTCreateX-64     0.26 (  0.00%)      0.30 ( -16.15%)      0.21 (  19.62%)

So, different figures to yours, but the general observation that flush
latency is higher holds.

> Anyway, to investigate this regression more in depth, I took two
> further steps. First, I repeated the same test with bfq-sq, my
> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
> from the changes needed for bfq to live in blk-mq). I got:
>
> <SNIP>
>
> So, with both bfq and deadline there seems to be a serious regression,
> especially on MaxLat, when moving from legacy block to blk-mq. The
> regression is much worse with deadline, as legacy-deadline has the
> lowest max latency among all the schedulers, whereas mq-deadline has
> the highest one.
>

I wouldn't worry too much about max latency simply because a large
outlier can be due to multiple factors and it will be variable.
However, I accept that deadline is not necessarily great either.

> Regardless of the actual culprit of this regression, I would like to
> investigate this issue further. In this respect, I would like to ask
> for a little help. I would like to isolate the workloads generating
> the highest latencies.
> To this purpose, I had a look at the loadfile
> client-tiny.txt, and I still have a doubt: is every item in the
> loadfile executed somehow several times (for each value of the number
> of clients), or is it executed only once? More precisely, IIUC, for
> each operation reported in the above results, there are several items
> (lines) in the loadfile. So, is each of these items executed only
> once?
>

The load file is executed multiple times. The normal loadfile was
basically just the same commands, or very similar commands, run multiple
times within a single load file. This made the workload too sensitive to
the exact time the workload finished and too coarse.

> I'm asking because, if it is executed only once, then I guess I can
> find the critical tasks more easily. Finally, if it is actually
> executed only once, is it expected that the latency for such a task is
> one order of magnitude higher than the average latency for
> that group of tasks? I mean, is such a task intrinsically much
> heavier, and then expectedly much longer, or is the fact that latency
> is much higher for this task a sign that something in the kernel
> misbehaves for that task?
>

I don't think it's quite as easily isolated. It's all the operations in
combination that replicate the behaviour. If it was just a single
operation like "fsync" then it would be fairly straight-forward, but the
full mix is relevant as it matters when writeback kicks off, when merges
happen, how much dirty data was outstanding when writeback or sync
started etc.

I see you've made other responses to the thread, so rather than respond
individually:

o I've queued a subset of tests with Ming's v3 patchset as that was the
  latest branch at the time I looked. It'll take quite some time to
  execute as the grid I use to collect data is backlogged with other
  work

o I've included pgioperf this time because it is good at demonstrating
  oddities related to fsync.
  Granted, it's mostly simulating a database workload that is typically
  recommended to use the deadline scheduler, but I think it's still a
  useful demonstration

o If you want a patch set queued that may improve workload pattern
  detection for dbench then I can add that to the grid, with the caveat
  that results take time. It'll be a blind test as I'm not actively
  debugging IO-related problems right now.

o I'll keep an eye out for other workloads that demonstrate empirically
  better performance, given that a stopwatch and desktop performance is
  tough to quantify even though I'm typically working in other areas.
  While I don't spend a lot of time on IO-related problems, it would
  still be preferred if switching to MQ by default was a safe option,
  so I'm interested enough to keep it in mind.

--
Mel Gorman
SUSE Labs
* Re: Switching to MQ by default may generate some bug reports 2017-08-08 10:30 ` Mel Gorman @ 2017-08-08 10:43 ` Ming Lei 2017-08-08 11:27 ` Mel Gorman 2017-08-08 17:16 ` Paolo Valente 1 sibling, 1 reply; 29+ messages in thread From: Ming Lei @ 2017-08-08 10:43 UTC (permalink / raw) To: Mel Gorman Cc: Paolo Valente, Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List, linux-block Hi Mel Gorman, On Tue, Aug 8, 2017 at 6:30 PM, Mel Gorman <mgorman@techsingularity.net> wrote: .... > > o I've queued a subset of tests with Ming's v3 patchset as that was the > latest branch at the time I looked. It'll take quite some time to execute > as the grid I use to collect data is backlogged with other work The latest patchset is in the following post: http://marc.info/?l=linux-block&m=150191624318513&w=2 And you can find it in my github: https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V4 -- Ming Lei ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Switching to MQ by default may generate some bug reports 2017-08-08 10:43 ` Ming Lei @ 2017-08-08 11:27 ` Mel Gorman 2017-08-08 11:49 ` Ming Lei 0 siblings, 1 reply; 29+ messages in thread From: Mel Gorman @ 2017-08-08 11:27 UTC (permalink / raw) To: Ming Lei Cc: Paolo Valente, Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List, linux-block On Tue, Aug 08, 2017 at 06:43:03PM +0800, Ming Lei wrote: > Hi Mel Gorman, > > On Tue, Aug 8, 2017 at 6:30 PM, Mel Gorman <mgorman@techsingularity.net> wrote: > .... > > > > o I've queued a subset of tests with Ming's v3 patchset as that was the > > latest branch at the time I looked. It'll take quite some time to execute > > as the grid I use to collect data is backlogged with other work > > The latest patchset is in the following post: > > http://marc.info/?l=linux-block&m=150191624318513&w=2 > > And you can find it in my github: > > https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V4 > Unfortunately, the tests were queued last Friday and are partially complete depending on when machines become available. As it is, v3 will take a few days to complete and a requeue would incur further delays. If you believe the results will be substantially different then I'll discard v3 and requeue. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Switching to MQ by default may generate some bug reports 2017-08-08 11:27 ` Mel Gorman @ 2017-08-08 11:49 ` Ming Lei 2017-08-08 11:55 ` Mel Gorman 0 siblings, 1 reply; 29+ messages in thread From: Ming Lei @ 2017-08-08 11:49 UTC (permalink / raw) To: Mel Gorman Cc: Paolo Valente, Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List, linux-block On Tue, Aug 8, 2017 at 7:27 PM, Mel Gorman <mgorman@techsingularity.net> wrote: > On Tue, Aug 08, 2017 at 06:43:03PM +0800, Ming Lei wrote: >> Hi Mel Gorman, >> >> On Tue, Aug 8, 2017 at 6:30 PM, Mel Gorman <mgorman@techsingularity.net> wrote: >> .... >> > >> > o I've queued a subset of tests with Ming's v3 patchset as that was the >> > latest branch at the time I looked. It'll take quite some time to execute >> > as the grid I use to collect data is backlogged with other work >> >> The latest patchset is in the following post: >> >> http://marc.info/?l=linux-block&m=150191624318513&w=2 >> >> And you can find it in my github: >> >> https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V4 >> > > Unfortunately, the tests were queued last Friday and are partially complete > depending on when machines become available. As it is, v3 will take a few > days to complete and a requeue would incur further delays. If you believe > the results will be substantially different then I'll discard v3 and requeue. Firstly V3 on github(never posted out) causes boot hang if CPU cores is >= 16, so you need to check if the test is still running, :-( Also V3 on github may not perform well on IB SRP(or other low latency SCSI disk), so I improve bio merge in V4 and make IB SRP's perf better too, and it depends on devices. I suggest to focus on V2 posted in mail list(V4 in github). -- Ming Lei ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Switching to MQ by default may generate some bug reports 2017-08-08 11:49 ` Ming Lei @ 2017-08-08 11:55 ` Mel Gorman 0 siblings, 0 replies; 29+ messages in thread From: Mel Gorman @ 2017-08-08 11:55 UTC (permalink / raw) To: Ming Lei Cc: Paolo Valente, Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List, linux-block On Tue, Aug 08, 2017 at 07:49:53PM +0800, Ming Lei wrote: > On Tue, Aug 8, 2017 at 7:27 PM, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Tue, Aug 08, 2017 at 06:43:03PM +0800, Ming Lei wrote: > >> Hi Mel Gorman, > >> > >> On Tue, Aug 8, 2017 at 6:30 PM, Mel Gorman <mgorman@techsingularity.net> wrote: > >> .... > >> > > >> > o I've queued a subset of tests with Ming's v3 patchset as that was the > >> > latest branch at the time I looked. It'll take quite some time to execute > >> > as the grid I use to collect data is backlogged with other work > >> > >> The latest patchset is in the following post: > >> > >> http://marc.info/?l=linux-block&m=150191624318513&w=2 > >> > >> And you can find it in my github: > >> > >> https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V4 > >> > > > > Unfortunately, the tests were queued last Friday and are partially complete > > depending on when machines become available. As it is, v3 will take a few > > days to complete and a requeue would incur further delays. If you believe > > the results will be substantially different then I'll discard v3 and requeue. > > Firstly V3 on github(never posted out) causes boot hang if CPU cores is >= 16, > so you need to check if the test is still running, :-( > By co-incidence, the few machines that have completed had core counts below this so I'll discard existing results and requeue. Thanks. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Switching to MQ by default may generate some bug reports 2017-08-08 10:30 ` Mel Gorman 2017-08-08 10:43 ` Ming Lei @ 2017-08-08 17:16 ` Paolo Valente 1 sibling, 0 replies; 29+ messages in thread From: Paolo Valente @ 2017-08-08 17:16 UTC (permalink / raw) To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block > Il giorno 08 ago 2017, alle ore 12:30, Mel Gorman <mgorman@techsingularity.net> ha scritto: > > On Mon, Aug 07, 2017 at 07:32:41PM +0200, Paolo Valente wrote: >>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any >>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using >>>> ext4 as a filesystem. The same is not true for XFS so the filesystem >>>> matters. >>>> >>> >>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as >>> soon as I can, thanks. >>> >>> >> >> I've run this test and tried to further investigate this regression. >> For the moment, the gist seems to be that blk-mq plays an important >> role, not only with bfq (unless I'm considering the wrong numbers). >> Even if your main purpose in this thread was just to give a heads-up, >> I guess it may be useful to share what I have found out. In addition, >> I want to ask for some help, to try to get closer to the possible >> causes of at least this regression. If you think it would be better >> to open a new thread on this stuff, I'll do it. >> > > I don't think it's necessary unless Christoph or Jens object and I doubt > they will. > >> First, I got mixed results on my system. > > For what it's worth, this is standard. In my experience, IO benchmarks > are always multi-modal, particularly on rotary storage. Cases of universal > win or universal loss for a scheduler or set of tuning are rare. > >> I'll focus only on the the >> case where mq-bfq-tput achieves its worst relative performance w.r.t. >> to cfq, which happens with 64 clients. Still, also in this case >> mq-bfq is better than cfq in all average values, but Flush. 
I don't >> know which are the best/right values to look at, so, here's the final >> report for both schedulers: >> > > For what it's worth, it has often been observed that dbench overall > performance was dominated by flush costs. This is also true for the > standard reported throughput figures rather than the modified load file > elapsed time that mmtests reports. In dbench3 it was even worse where the > "performance" was dominated by whether the temporary files were deleted > before writeback started. > >> CFQ >> >> Operation Count AvgLat MaxLat >> -------------------------------------------------- >> Flush 13120 20.069 348.594 >> Close 133696 0.008 14.642 >> LockX 512 0.009 0.059 >> Rename 7552 1.857 415.418 >> ReadX 270720 0.141 535.632 >> WriteX 89591 421.961 6363.271 >> Unlink 34048 1.281 662.467 >> UnlockX 512 0.007 0.057 >> FIND_FIRST 62016 0.086 25.060 >> SET_FILE_INFORMATION 15616 0.995 176.621 >> QUERY_FILE_INFORMATION 28734 0.004 1.372 >> QUERY_PATH_INFORMATION 170240 0.163 820.292 >> QUERY_FS_INFORMATION 28736 0.017 4.110 >> NTCreateX 178688 0.437 905.567 >> >> MQ-BFQ-TPUT >> >> Operation Count AvgLat MaxLat >> -------------------------------------------------- >> Flush 13504 75.828 11196.035 >> Close 136896 0.004 3.855 >> LockX 640 0.005 0.031 >> Rename 8064 1.020 288.989 >> ReadX 297600 0.081 685.850 >> WriteX 93515 391.637 12681.517 >> Unlink 34880 0.500 146.928 >> UnlockX 640 0.004 0.032 >> FIND_FIRST 63680 0.045 222.491 >> SET_FILE_INFORMATION 16000 0.436 686.115 >> QUERY_FILE_INFORMATION 30464 0.003 0.773 >> QUERY_PATH_INFORMATION 175552 0.044 148.449 >> QUERY_FS_INFORMATION 29888 0.009 1.984 >> NTCreateX 183152 0.289 300.867 >> >> Are these results in line with yours for this test? >> > > Very broadly speaking yes, but it varies. On a small machine, the differences > in flush latency are visible but not as dramatic. It only has a few > CPUs. On a machine that tops out with 32 CPUs, it is more noticable. 
On > the one machine I have that topped out with CFQ/BFQ at 64 threads, the > latency of flush is vaguely similar > > CFQ BFQ BFQ-TPUT > latency avg-Flush-64 287.05 ( 0.00%) 389.14 ( -35.57%) 349.90 ( -21.90%) > latency avg-Close-64 0.00 ( 0.00%) 0.00 ( -33.33%) 0.00 ( 0.00%) > latency avg-LockX-64 0.01 ( 0.00%) 0.01 ( -16.67%) 0.01 ( 0.00%) > latency avg-Rename-64 0.18 ( 0.00%) 0.21 ( -16.39%) 0.18 ( 3.28%) > latency avg-ReadX-64 0.10 ( 0.00%) 0.15 ( -40.95%) 0.15 ( -40.95%) > latency avg-WriteX-64 0.86 ( 0.00%) 0.81 ( 6.18%) 0.74 ( 13.75%) > latency avg-Unlink-64 1.49 ( 0.00%) 1.52 ( -2.28%) 1.14 ( 23.69%) > latency avg-UnlockX-64 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) > latency avg-NTCreateX-64 0.26 ( 0.00%) 0.30 ( -16.15%) 0.21 ( 19.62%) > > So, different figures to yours but the general observation that flush > latency is higher holds. > >> Anyway, to investigate this regression more in depth, I took two >> further steps. First, I repeated the same test with bfq-sq, my >> out-of-tree version of bfq for legacy block (identical to mq-bfq apart >> from the changes needed for bfq to live in blk-mq). I got: >> >> <SNIP> >> >> So, with both bfq and deadline there seems to be a serious regression, >> especially on MaxLat, when moving from legacy block to blk-mq. The >> regression is much worse with deadline, as legacy-deadline has the >> lowest max latency among all the schedulers, whereas mq-deadline has >> the highest one. >> > > I wouldn't worry too much about max latency simply because a large > outliier can be due to multiple factors and it will be variable. > However, I accept that deadline is not necessarily great either. > >> Regardless of the actual culprit of this regression, I would like to >> investigate further this issue. In this respect, I would like to ask >> for a little help. I would like to isolate the workloads generating >> the highest latencies. 
>> To this purpose, I had a look at the loadfile
>> client-tiny.txt, and I still have a doubt: is every item in the
>> loadfile executed several times (once for each value of the number of
>> clients), or is it executed only once? More precisely, IIUC, for each
>> operation reported in the above results, there are several items
>> (lines) in the loadfile. So, is each of these items executed only
>> once?
>>
>
> The load file is executed multiple times. The normal loadfile was
> basically just the same commands, or very similar commands, run
> multiple times within a single load file. This made the workload too
> sensitive to the exact time the workload finished and too coarse.
>
>> I'm asking because, if it is executed only once, then I guess I can
>> find the critical tasks more easily. Finally, if it is actually
>> executed only once, is it expected that the latency for such a task is
>> one order of magnitude higher than the average latency for that group
>> of tasks? I mean, is such a task intrinsically much heavier, and hence
>> expectedly much longer, or is the fact that latency is much higher for
>> this task a sign that something in the kernel misbehaves for that
>> task?
>>
>
> I don't think it's quite as easily isolated. It's all the operations in
> combination that replicate the behaviour. If it was just a single
> operation like "fsync" then it would be fairly straightforward, but the
> full mix is relevant as it matters when writeback kicks off, when
> merges happen, how much dirty data was outstanding when writeback or
> sync started, etc.
>
> I see you've made other responses to the thread, so rather than respond
> individually:
>
> o I've queued a subset of tests with Ming's v3 patchset as that was the
>   latest branch at the time I looked. It'll take quite some time to
>   execute as the grid I use to collect data is backlogged with other
>   work.
>
> o I've included pgioperf this time because it is good at demonstrating
>   oddities related to fsync. Granted, it's mostly simulating a database
>   workload that is typically recommended to use the deadline scheduler,
>   but I think it's still a useful demonstration.
>
> o If you want a patch set queued that may improve workload pattern
>   detection for dbench, then I can add that to the grid with the caveat
>   that results take time. It'll be a blind test as I'm not actively
>   debugging IO-related problems right now.
>
> o I'll keep an eye out for other workloads that demonstrate empirically
>   better performance, given that desktop performance measured with a
>   stopwatch is tough to quantify, even though I'm typically working in
>   other areas. While I don't spend a lot of time on IO-related
>   problems, it would still be preferred if switching to MQ by default
>   was a safe option, so I'm interested enough to keep it in mind.

Hi Mel,
thanks for your thorough responses (I'm about to write something about
the read-write unfairness issue with, again, some surprise). I want to
reply only to your last point above. With our responsiveness benchmark
you of course don't need a stopwatch, but, yes, to get minimally
comprehensive results you need a machine with at least one desktop
application, such as a terminal, installed.

Thanks,
Paolo

> --
> Mel Gorman
> SUSE Labs
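Mel's point about fsync is that its latency depends on how much dirty data is outstanding when the flush is issued, which is why pgioperf is good at surfacing scheduler oddities. As a rough illustration only (this is not pgioperf itself, just a minimal write-plus-fsync loop in the same spirit), per-fsync latency can be measured like this:

```python
import os
import tempfile
import time

def fsync_latencies(num_writes=50, chunk_size=4096):
    """Write chunk_size bytes, fsync, and record each fsync's latency.

    A minimal sketch of the WAL-style pattern a database simulator such
    as pgioperf exercises; real runs use far larger files and separate
    reader/writer processes.
    """
    latencies = []
    fd, path = tempfile.mkstemp()
    try:
        for _ in range(num_writes):
            os.write(fd, b"\0" * chunk_size)
            start = time.monotonic()
            os.fsync(fd)
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies

lats = fsync_latencies()
print(f"avg {sum(lats) / len(lats) * 1e6:.0f}us, max {max(lats) * 1e6:.0f}us")
```

The max/avg ratio is the interesting number here: a scheduler that defers or batches writeback aggressively can occasionally stall an fsync far above the average, which is exactly the kind of outlier the MaxLat column in the earlier results captures.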
end of thread, other threads:[~2017-08-10  8:44 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-03  8:51 Switching to MQ by default may generate some bug reports Mel Gorman
2017-08-03  9:17 ` Ming Lei
2017-08-03  9:32   ` Ming Lei
2017-08-03  9:42     ` Mel Gorman
2017-08-03  9:44       ` Paolo Valente
2017-08-03 10:46         ` Mel Gorman
2017-08-03  9:57     ` Ming Lei
2017-08-03 10:47       ` Mel Gorman
2017-08-03 11:48         ` Ming Lei
2017-08-03  9:21 ` Paolo Valente
2017-08-03 11:01   ` Mel Gorman
2017-08-04  7:26     ` Paolo Valente
2017-08-04 11:01       ` Mel Gorman
2017-08-04 22:05         ` Paolo Valente
2017-08-05 11:54           ` Mel Gorman
2017-08-07 17:35             ` Paolo Valente
2017-08-07 17:32         ` Paolo Valente
2017-08-07 18:42           ` Paolo Valente
2017-08-08  8:06             ` Paolo Valente
2017-08-08 17:33               ` Paolo Valente
2017-08-08 18:27                 ` Mel Gorman
2017-08-09 21:49                   ` Paolo Valente
2017-08-10  8:44                     ` Mel Gorman
2017-08-08 10:30           ` Mel Gorman
2017-08-08 10:43             ` Ming Lei
2017-08-08 11:27               ` Mel Gorman
2017-08-08 11:49                 ` Ming Lei
2017-08-08 11:55                   ` Mel Gorman
2017-08-08 17:16             ` Paolo Valente