* Switching to MQ by default may generate some bug reports
@ 2017-08-03  8:51 Mel Gorman
  2017-08-03  9:17 ` Ming Lei
  2017-08-03  9:21   ` Paolo Valente
  0 siblings, 2 replies; 40+ messages in thread
From: Mel Gorman @ 2017-08-03  8:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, linux-block, Paolo Valente

Hi Christoph,

I know the reasons for switching to MQ by default but just be aware that it's
not without hazards, although the biggest issues I've seen are with switching
from CFQ to BFQ. On my home grid, there is some experimental automatic testing
running every few weeks searching for regressions. Yesterday, it noticed
that creating some work files for a postgres simulator called pgioperf
was 38.33% slower and it auto-bisected to the switch to MQ. This is just
linearly writing two files for testing on another benchmark and is not
remarkable. The relevant part of the report is

Last good/First bad commit
==========================
Last good commit: 6d311fa7d2c18659d040b9beba5e41fe24c2a6f5
First bad commit: 5c279bd9e40624f4ab6e688671026d6005b066fa
From 5c279bd9e40624f4ab6e688671026d6005b066fa Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 16 Jun 2017 10:27:55 +0200
Subject: [PATCH] scsi: default to scsi-mq
Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O
path now that we had plenty of testing, and have I/O schedulers for
blk-mq.  The module option to disable the blk-mq path is kept around for
now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
 drivers/scsi/Kconfig | 11 -----------
 drivers/scsi/scsi.c  |  4 ----
 2 files changed, 15 deletions(-)

Comparison
==========
                                initial                initial                   last                  penup                  first
                             good-v4.12       bad-16f73eb02d7e          good-6d311fa7          good-d06c587d           bad-5c279bd9
User    min             0.06 (   0.00%)        0.14 (-133.33%)        0.14 (-133.33%)        0.06 (   0.00%)        0.19 (-216.67%)
User    mean            0.06 (   0.00%)        0.14 (-133.33%)        0.14 (-133.33%)        0.06 (   0.00%)        0.19 (-216.67%)
User    stddev          0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
User    coeffvar        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
User    max             0.06 (   0.00%)        0.14 (-133.33%)        0.14 (-133.33%)        0.06 (   0.00%)        0.19 (-216.67%)
System  min            10.04 (   0.00%)       10.75 (  -7.07%)       10.05 (  -0.10%)       10.16 (  -1.20%)       10.73 (  -6.87%)
System  mean           10.04 (   0.00%)       10.75 (  -7.07%)       10.05 (  -0.10%)       10.16 (  -1.20%)       10.73 (  -6.87%)
System  stddev          0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
System  coeffvar        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
System  max            10.04 (   0.00%)       10.75 (  -7.07%)       10.05 (  -0.10%)       10.16 (  -1.20%)       10.73 (  -6.87%)
Elapsed min           251.53 (   0.00%)      351.05 ( -39.57%)      252.83 (  -0.52%)      252.96 (  -0.57%)      347.93 ( -38.33%)
Elapsed mean          251.53 (   0.00%)      351.05 ( -39.57%)      252.83 (  -0.52%)      252.96 (  -0.57%)      347.93 ( -38.33%)
Elapsed stddev          0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
Elapsed coeffvar        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
Elapsed max           251.53 (   0.00%)      351.05 ( -39.57%)      252.83 (  -0.52%)      252.96 (  -0.57%)      347.93 ( -38.33%)
CPU     min             4.00 (   0.00%)        3.00 (  25.00%)        4.00 (   0.00%)        4.00 (   0.00%)        3.00 (  25.00%)
CPU     mean            4.00 (   0.00%)        3.00 (  25.00%)        4.00 (   0.00%)        4.00 (   0.00%)        3.00 (  25.00%)
CPU     stddev          0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
CPU     coeffvar        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
CPU     max             4.00 (   0.00%)        3.00 (  25.00%)        4.00 (   0.00%)        4.00 (   0.00%)        3.00 (  25.00%)

The "Elapsed mean" line is what the testing and auto-bisection was paying
attention to. Commit 16f73eb02d7e is simply the head commit at the time
the continuous testing started. The first "bad commit" is the last column.

It's not the only slowdown that has been observed in other testing when
examining whether it's ok to switch to MQ by default. The biggest slowdown
observed was with a modified version of dbench4 -- the modifications use
shorter, but representative, load files to avoid timing artifacts and
report the time to complete a load file instead of throughput, as throughput
is kind of meaningless for dbench4.

dbench4 Loadfile Execution Time
                             4.12.0                 4.12.0
                         legacy-cfq                 mq-bfq
Amean     1        80.67 (   0.00%)       83.68 (  -3.74%)
Amean     2        92.87 (   0.00%)      121.63 ( -30.96%)
Amean     4       102.72 (   0.00%)      474.33 (-361.77%)
Amean     32     2543.93 (   0.00%)     1927.65 (  24.23%)

The units are "milliseconds to complete a load file", so as thread count
increased, there were some fairly bad slowdowns. The most dramatic
slowdown was observed on a machine with a controller with on-board cache:

                              4.12.0                 4.12.0
                          legacy-cfq                 mq-bfq
Amean     1        289.09 (   0.00%)      128.43 (  55.57%)
Amean     2        491.32 (   0.00%)      794.04 ( -61.61%)
Amean     4        875.26 (   0.00%)     9331.79 (-966.17%)
Amean     8       2074.30 (   0.00%)      317.79 (  84.68%)
Amean     16      3380.47 (   0.00%)      669.51 (  80.19%)
Amean     32      7427.25 (   0.00%)     8821.75 ( -18.78%)
Amean     256    53376.81 (   0.00%)    69006.94 ( -29.28%)

The slowdown wasn't universal but at 4 threads, it was severe. There
are other examples but it'd just be a lot of noise and not change the
central point.

The major problems were all observed switching from CFQ to BFQ on single-disk
rotary storage. It's not machine specific, as 5 separate machines noticed
problems with dbench and fio when switching to MQ on kernel 4.12. Weirdly,
I've seen cases of read starvation in the presence of heavy writers
using fio to generate the workload, which was surprising to me. Jan Kara
suggested that it may be because the read workload is not being identified
as "interactive" but I didn't dig into the details myself and have zero
understanding of BFQ. I was only interested in answering the question "is
it safe to switch the default and will the performance be similar enough
to avoid bug reports?" and concluded that the answer is "no".

For what it's worth, I've noticed on SSDs that switching from legacy-deadline
to mq-deadline also caused slowdowns, but in many cases the slowdown was small
enough that it may be tolerable and not generate many bug reports. Also,
mq-deadline appears to receive more attention so issues there are probably
going to be noticed faster.

I'm not suggesting for a second that you fix this or switch back to legacy
by default because it's BFQ; Paolo is cc'd and it'll have to be fixed
eventually, but you might see "workload foo is slower on 4.13" reports that
bisect to this commit. Which filesystem is used changes the results, but at
least btrfs, ext3, ext4 and xfs experience slowdowns.
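
For anyone triaging such reports, the commit message above notes that the
module option to disable the blk-mq path is kept around, so it should be
possible to double-check a bisection by booting with the legacy path
re-enabled. A rough sketch only -- the parameter name is taken from
drivers/scsi/scsi.c of that era, so verify it on the kernel being tested:

cat /sys/module/scsi_mod/parameters/use_blk_mq   # Y means scsi-mq is active
# append to the kernel command line (or set as a module option) and reboot:
scsi_mod.use_blk_mq=0

If the slowdown disappears with the legacy path, the report is really about
the change of default rather than the workload itself.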

For Paolo, if you want to try preemptively dealing with regression reports
before 4.13 releases then all the tests in question can be reproduced with
https://github.com/gormanm/mmtests . The most relevant test configurations
I've seen so far are

configs/config-global-dhp__io-dbench4-async
configs/config-global-dhp__io-fio-randread-async-randwrite
configs/config-global-dhp__io-fio-randread-async-seqwrite
configs/config-global-dhp__io-fio-randread-sync-heavywrite
configs/config-global-dhp__io-fio-randread-sync-randwrite
configs/config-global-dhp__pgioperf
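
As a rough sketch of how one of these is typically driven (treat the run
name as a placeholder and double-check the flags against the mmtests
documentation for the version in use):

git clone https://github.com/gormanm/mmtests.git
cd mmtests
./run-mmtests.sh --config configs/config-global-dhp__io-dbench4-async mq-bfq-test
# results are collected under work/log/ and can then be compared between runs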

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  8:51 Switching to MQ by default may generate some bug reports Mel Gorman
@ 2017-08-03  9:17 ` Ming Lei
  2017-08-03  9:32   ` Ming Lei
  2017-08-03  9:42   ` Mel Gorman
  2017-08-03  9:21   ` Paolo Valente
  1 sibling, 2 replies; 40+ messages in thread
From: Ming Lei @ 2017-08-03  9:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List,
	linux-block, Paolo Valente

Hi Mel Gorman,

On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> Hi Christoph,
>
> I know the reasons for switching to MQ by default but just be aware that it's
> not without hazards albeit it the biggest issues I've seen are switching
> CFQ to BFQ. On my home grid, there is some experimental automatic testing
> running every few weeks searching for regressions. Yesterday, it noticed
> that creating some work files for a postgres simulator called pgioperf
> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> linearly writing two files for testing on another benchmark and is not
> remarkable. The relevant part of the report is

We saw some SCSI-MQ performance issues too; please see if the following
patchset fixes your issue:

http://marc.info/?l=linux-block&m=150151989915776&w=2

Thanks,
Ming

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
@ 2017-08-03  9:21   ` Paolo Valente
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Valente @ 2017-08-03  9:21 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block


> On 3 Aug 2017, at 10:51, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> Hi Christoph,
> 
> I know the reasons for switching to MQ by default but just be aware that it's
> not without hazards albeit it the biggest issues I've seen are switching
> CFQ to BFQ. On my home grid, there is some experimental automatic testing
> running every few weeks searching for regressions. Yesterday, it noticed
> that creating some work files for a postgres simulator called pgioperf
> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> linearly writing two files for testing on another benchmark and is not
> remarkable. The relevant part of the report is
> 
> Last good/First bad commit
> ==========================
> Last good commit: 6d311fa7d2c18659d040b9beba5e41fe24c2a6f5
> First bad commit: 5c279bd9e40624f4ab6e688671026d6005b066fa
> From 5c279bd9e40624f4ab6e688671026d6005b066fa Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Fri, 16 Jun 2017 10:27:55 +0200
> Subject: [PATCH] scsi: default to scsi-mq
> Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O
> path now that we had plenty of testing, and have I/O schedulers for
> blk-mq.  The module option to disable the blk-mq path is kept around for
> now.
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> drivers/scsi/Kconfig | 11 -----------
> drivers/scsi/scsi.c  |  4 ----
> 2 files changed, 15 deletions(-)
> 
> Comparison
> ==========
>                                initial                initial                   last                  penup                  first
>                             good-v4.12       bad-16f73eb02d7e          good-6d311fa7          good-d06c587d           bad-5c279bd9
> User    min             0.06 (   0.00%)        0.14 (-133.33%)        0.14 (-133.33%)        0.06 (   0.00%)        0.19 (-216.67%)
> User    mean            0.06 (   0.00%)        0.14 (-133.33%)        0.14 (-133.33%)        0.06 (   0.00%)        0.19 (-216.67%)
> User    stddev          0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
> User    coeffvar        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
> User    max             0.06 (   0.00%)        0.14 (-133.33%)        0.14 (-133.33%)        0.06 (   0.00%)        0.19 (-216.67%)
> System  min            10.04 (   0.00%)       10.75 (  -7.07%)       10.05 (  -0.10%)       10.16 (  -1.20%)       10.73 (  -6.87%)
> System  mean           10.04 (   0.00%)       10.75 (  -7.07%)       10.05 (  -0.10%)       10.16 (  -1.20%)       10.73 (  -6.87%)
> System  stddev          0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
> System  coeffvar        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
> System  max            10.04 (   0.00%)       10.75 (  -7.07%)       10.05 (  -0.10%)       10.16 (  -1.20%)       10.73 (  -6.87%)
> Elapsed min           251.53 (   0.00%)      351.05 ( -39.57%)      252.83 (  -0.52%)      252.96 (  -0.57%)      347.93 ( -38.33%)
> Elapsed mean          251.53 (   0.00%)      351.05 ( -39.57%)      252.83 (  -0.52%)      252.96 (  -0.57%)      347.93 ( -38.33%)
> Elapsed stddev          0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
> Elapsed coeffvar        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
> Elapsed max           251.53 (   0.00%)      351.05 ( -39.57%)      252.83 (  -0.52%)      252.96 (  -0.57%)      347.93 ( -38.33%)
> CPU     min             4.00 (   0.00%)        3.00 (  25.00%)        4.00 (   0.00%)        4.00 (   0.00%)        3.00 (  25.00%)
> CPU     mean            4.00 (   0.00%)        3.00 (  25.00%)        4.00 (   0.00%)        4.00 (   0.00%)        3.00 (  25.00%)
> CPU     stddev          0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
> CPU     coeffvar        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
> CPU     max             4.00 (   0.00%)        3.00 (  25.00%)        4.00 (   0.00%)        4.00 (   0.00%)        3.00 (  25.00%)
> 
> The "Elapsed mean" line is what the testing and auto-bisection was paying
> attention to. Commit 16f73eb02d7e is simply the head commit at the time
> the continuous testing started. The first "bad commit" is the last column.
> 
> It's not the only slowdown that has been observed from other testing when
> examining whether it's ok to switch to MQ by default. The biggest slowdown
> observed was with a modified version of dbench4 -- the modifications use
> shorter, but representative, load files to avoid timing artifacts and
> reports time to complete a load file instead of throughput as throughput
> is kind of meaningless for dbench4
> 
> dbench4 Loadfile Execution Time
>                             4.12.0                 4.12.0
>                         legacy-cfq                 mq-bfq
> Amean     1        80.67 (   0.00%)       83.68 (  -3.74%)
> Amean     2        92.87 (   0.00%)      121.63 ( -30.96%)
> Amean     4       102.72 (   0.00%)      474.33 (-361.77%)
> Amean     32     2543.93 (   0.00%)     1927.65 (  24.23%)
> 
> The units are "milliseconds to complete a load file" so as thread count
> increased, there were some fairly bad slowdowns. The most dramatic
> slowdown was observed on a machine with a controller with on-board cache
> 
>                              4.12.0                 4.12.0
>                          legacy-cfq                 mq-bfq
> Amean     1        289.09 (   0.00%)      128.43 (  55.57%)
> Amean     2        491.32 (   0.00%)      794.04 ( -61.61%)
> Amean     4        875.26 (   0.00%)     9331.79 (-966.17%)
> Amean     8       2074.30 (   0.00%)      317.79 (  84.68%)
> Amean     16      3380.47 (   0.00%)      669.51 (  80.19%)
> Amean     32      7427.25 (   0.00%)     8821.75 ( -18.78%)
> Amean     256    53376.81 (   0.00%)    69006.94 ( -29.28%)
> 
> The slowdown wasn't universal but at 4 threads, it was severe. There
> are other examples but it'd just be a lot of noise and not change the
> central point.
> 
> The major problems were all observed switching from CFQ to BFQ on single disk
> rotary storage. It's not machine specific as 5 separate machines noticed
> problems with dbench and fio when switching to MQ on kernel 4.12. Weirdly,
> I've seen cases of read starvation in the presence of heavy writers
> using fio to generate the workload which was surprising to me. Jan Kara
> suggested that it may be because the read workload is not being identified
> as "interactive" but I didn't dig into the details myself and have zero
> understanding of BFQ. I was only interested in answering the question "is
> it safe to switch the default and will the performance be similar enough
> to avoid bug reports?" and concluded that the answer is "no".
> 
> For what it's worth, I've noticed on SSDs that switching from legacy-mq
> to deadline-mq also slowed down but in many cases the slowdown was small
> enough that it may be tolerable and not generate many bug reports. Also,
> mq-deadline appears to receive more attention so issues there are probably
> going to be noticed faster.
> 
> I'm not suggesting for a second that you fix this or switch back to legacy
> by default because it's BFQ, Paulo is cc'd and it'll have to be fixed
> eventually but you might see "workload foo is slower on 4.13" reports that
> bisect to this commit. What filesystem is used changes the results but at
> least btrfs, ext3, ext4 and xfs experience slowdowns.
> 
> For Paulo, if you want to try preemptively dealing with regression reports
> before 4.13 releases then all the tests in question can be reproduced with
> https://github.com/gormanm/mmtests . The most relevant test configurations
> I've seen so far are
> 
> configs/config-global-dhp__io-dbench4-async
> configs/config-global-dhp__io-fio-randread-async-randwrite
> configs/config-global-dhp__io-fio-randread-async-seqwrite
> configs/config-global-dhp__io-fio-randread-sync-heavywrite
> configs/config-global-dhp__io-fio-randread-sync-randwrite
> configs/config-global-dhp__pgioperf
> 

Hi Mel,
as already happened with the latest Phoronix benchmark article (and
with other test results reported several months ago on this list), bad
results may be caused (also) by the fact that the low-latency, default
configuration of BFQ is being used.  This configuration is the default
one because the motivation for yet another scheduler such as BFQ is that it
drastically reduces latency for interactive and soft real-time tasks
(e.g., opening an app or playing/streaming a video), when there is
some background I/O.  Low-latency heuristics are willing to sacrifice
throughput when this provides a large benefit in terms of the above
latency.

Things do change if, instead, one wants to use BFQ for tasks that
don't need these kinds of low-latency guarantees, but need only the
highest possible sustained throughput.  This seems to be the case for
all the tests you have listed above.  In this case, it doesn't make
much sense to leave low-latency heuristics on.  Throughput may only
get worse for these tests, and the elapsed time can only increase.

How to switch low-latency heuristics off?
echo 0 > /sys/block/<dev>/queue/iosched/low_latency
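
For completeness, a minimal sequence for trying this on a given disk might
look like the following, with sdX as a placeholder for the rotational test
device:

echo bfq > /sys/block/sdX/queue/scheduler          # select BFQ under blk-mq
cat /sys/block/sdX/queue/iosched/low_latency       # 1 = low-latency heuristics on (default)
echo 0 > /sys/block/sdX/queue/iosched/low_latency  # throughput-oriented mode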

Of course, BFQ may not be optimal for every workload, even if
low-latency mode is switched off.  In addition, there may still be
some bug.  I'll repeat your tests on a machine of mine ASAP.

Thanks,
Paolo

> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  9:17 ` Ming Lei
@ 2017-08-03  9:32   ` Ming Lei
  2017-08-03  9:42   ` Mel Gorman
  1 sibling, 0 replies; 40+ messages in thread
From: Ming Lei @ 2017-08-03  9:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List,
	linux-block, Paolo Valente, Ming Lei

On Thu, Aug 3, 2017 at 5:17 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> Hi Mel Gorman,
>
> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
>> Hi Christoph,
>>
>> I know the reasons for switching to MQ by default but just be aware that it's
>> not without hazards albeit it the biggest issues I've seen are switching
>> CFQ to BFQ. On my home grid, there is some experimental automatic testing
>> running every few weeks searching for regressions. Yesterday, it noticed
>> that creating some work files for a postgres simulator called pgioperf
>> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>> linearly writing two files for testing on another benchmark and is not
>> remarkable. The relevant part of the report is
>
> We saw some SCSI-MQ performance issue too, please see if the following
> patchset fixes your issue:
>
> http://marc.info/?l=linux-block&m=150151989915776&w=2

BTW, the above patches (V1) can be found in the following tree:

      https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V1

V2 has already been done but not posted out yet, because the performance test
on SRP isn't completed:

      https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V2
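
If it helps, one way to pull a branch for testing (plain git; branch names
exactly as above):

git fetch https://github.com/ming1/linux.git blk-mq-dispatch_for_scsi.V1
git checkout -b blk-mq-dispatch_for_scsi.V1 FETCH_HEAD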


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  9:17 ` Ming Lei
  2017-08-03  9:32   ` Ming Lei
@ 2017-08-03  9:42   ` Mel Gorman
  2017-08-03  9:44       ` Paolo Valente
  2017-08-03  9:57     ` Ming Lei
  1 sibling, 2 replies; 40+ messages in thread
From: Mel Gorman @ 2017-08-03  9:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List,
	linux-block, Paolo Valente

On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
> Hi Mel Gorman,
> 
> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> > Hi Christoph,
> >
> > I know the reasons for switching to MQ by default but just be aware that it's
> > not without hazards albeit it the biggest issues I've seen are switching
> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
> > running every few weeks searching for regressions. Yesterday, it noticed
> > that creating some work files for a postgres simulator called pgioperf
> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> > linearly writing two files for testing on another benchmark and is not
> > remarkable. The relevant part of the report is
> 
> We saw some SCSI-MQ performance issue too, please see if the following
> patchset fixes your issue:
> 
> http://marc.info/?l=linux-block&m=150151989915776&w=2
> 

That series is dealing with problems with legacy-deadline vs mq-none, whereas
the bulk of the problems reported in this mail are related to
legacy-CFQ vs mq-BFQ.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
@ 2017-08-03  9:44       ` Paolo Valente
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Valente @ 2017-08-03  9:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ming Lei, Christoph Hellwig, Jens Axboe,
	Linux Kernel Mailing List, linux-block


> On 3 Aug 2017, at 11:42, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
>> Hi Mel Gorman,
>> 
>> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
>>> Hi Christoph,
>>> 
>>> I know the reasons for switching to MQ by default but just be aware that it's
>>> not without hazards albeit it the biggest issues I've seen are switching
>>> CFQ to BFQ. On my home grid, there is some experimental automatic testing
>>> running every few weeks searching for regressions. Yesterday, it noticed
>>> that creating some work files for a postgres simulator called pgioperf
>>> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>>> linearly writing two files for testing on another benchmark and is not
>>> remarkable. The relevant part of the report is
>> 
>> We saw some SCSI-MQ performance issue too, please see if the following
>> patchset fixes your issue:
>> 
>> http://marc.info/?l=linux-block&m=150151989915776&w=2
>> 
> 
> That series is dealing with problems with legacy-deadline vs mq-none where
> as the bulk of the problems reported in this mail are related to
> legacy-CFQ vs mq-BFQ.
> 

Out of curiosity: do you get no regressions with mq-none or mq-deadline?

Thanks,
Paolo

> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  9:42   ` Mel Gorman
  2017-08-03  9:44       ` Paolo Valente
@ 2017-08-03  9:57     ` Ming Lei
  2017-08-03 10:47       ` Mel Gorman
  1 sibling, 1 reply; 40+ messages in thread
From: Ming Lei @ 2017-08-03  9:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List,
	linux-block, Paolo Valente

On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
>> Hi Mel Gorman,
>>
>> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
>> > Hi Christoph,
>> >
>> > I know the reasons for switching to MQ by default but just be aware that it's
>> > not without hazards albeit it the biggest issues I've seen are switching
>> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
>> > running every few weeks searching for regressions. Yesterday, it noticed
>> > that creating some work files for a postgres simulator called pgioperf
>> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>> > linearly writing two files for testing on another benchmark and is not
>> > remarkable. The relevant part of the report is
>>
>> We saw some SCSI-MQ performance issue too, please see if the following
>> patchset fixes your issue:
>>
>> http://marc.info/?l=linux-block&m=150151989915776&w=2
>>
>
> That series is dealing with problems with legacy-deadline vs mq-none where
> as the bulk of the problems reported in this mail are related to
> legacy-CFQ vs mq-BFQ.

The series deals with none and all of the mq schedulers, and you can see
the improvement on mq-deadline in the cover letter. :-)

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  9:44       ` Paolo Valente
  (?)
@ 2017-08-03 10:46       ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2017-08-03 10:46 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Ming Lei, Christoph Hellwig, Jens Axboe,
	Linux Kernel Mailing List, linux-block

On Thu, Aug 03, 2017 at 11:44:06AM +0200, Paolo Valente wrote:
> > That series is dealing with problems with legacy-deadline vs mq-none where
> > as the bulk of the problems reported in this mail are related to
> > legacy-CFQ vs mq-BFQ.
> > 
> 
> Out-of-curiosity: you get no regression with mq-none or mq-deadline?
> 

I didn't test mq-none as the underlying storage was not fast enough to
make a legacy-noop vs mq-none comparison meaningful. legacy-deadline vs
mq-deadline did show small regressions on some workloads, but they were not
as dramatic and were small enough that they might go unnoticed in some cases.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  9:57     ` Ming Lei
@ 2017-08-03 10:47       ` Mel Gorman
  2017-08-03 11:48         ` Ming Lei
  0 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2017-08-03 10:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List,
	linux-block, Paolo Valente

On Thu, Aug 03, 2017 at 05:57:50PM +0800, Ming Lei wrote:
> On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> > On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
> >> Hi Mel Gorman,
> >>
> >> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> >> > Hi Christoph,
> >> >
> >> > I know the reasons for switching to MQ by default but just be aware that it's
> >> > not without hazards albeit it the biggest issues I've seen are switching
> >> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
> >> > running every few weeks searching for regressions. Yesterday, it noticed
> >> > that creating some work files for a postgres simulator called pgioperf
> >> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> >> > linearly writing two files for testing on another benchmark and is not
> >> > remarkable. The relevant part of the report is
> >>
> >> We saw some SCSI-MQ performance issue too, please see if the following
> >> patchset fixes your issue:
> >>
> >> http://marc.info/?l=linux-block&m=150151989915776&w=2
> >>
> >
> > That series is dealing with problems with legacy-deadline vs mq-none where
> > as the bulk of the problems reported in this mail are related to
> > legacy-CFQ vs mq-BFQ.
> 
> The serials deals with none and all mq schedulers, and you can see
> the improvement on mq-deadline in cover letter, :-)
> 

Would it be expected to fix a 2x to 4x slowdown as experienced by BFQ
that was not observed on other schedulers?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-03  9:21   ` Paolo Valente
  (?)
@ 2017-08-03 11:01   ` Mel Gorman
  2017-08-04  7:26       ` Paolo Valente
  -1 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2017-08-03 11:01 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block

On Thu, Aug 03, 2017 at 11:21:59AM +0200, Paolo Valente wrote:
> > For Paulo, if you want to try preemptively dealing with regression reports
> > before 4.13 releases then all the tests in question can be reproduced with
> > https://github.com/gormanm/mmtests . The most relevant test configurations
> > I've seen so far are
> > 
> > configs/config-global-dhp__io-dbench4-async
> > configs/config-global-dhp__io-fio-randread-async-randwrite
> > configs/config-global-dhp__io-fio-randread-async-seqwrite
> > configs/config-global-dhp__io-fio-randread-sync-heavywrite
> > configs/config-global-dhp__io-fio-randread-sync-randwrite
> > configs/config-global-dhp__pgioperf
> > 
> 
> Hi Mel,
> as it already happened with the latest Phoronix benchmark article (and
> with other test results reported several months ago on this list), bad
> results may be caused (also) by the fact that the low-latency, default
> configuration of BFQ is being used. 

I took that into account; BFQ with low_latency disabled was also tested and
the impact was not a universal improvement, although it can be a noticeable
improvement. From the same machine;

dbench4 Loadfile Execution Time
                             4.12.0                 4.12.0                 4.12.0
                         legacy-cfq                 mq-bfq            mq-bfq-tput
Amean     1        80.67 (   0.00%)       83.68 (  -3.74%)       84.70 (  -5.00%)
Amean     2        92.87 (   0.00%)      121.63 ( -30.96%)       88.74 (   4.45%)
Amean     4       102.72 (   0.00%)      474.33 (-361.77%)      113.97 ( -10.95%)
Amean     32     2543.93 (   0.00%)     1927.65 (  24.23%)     2038.74 (  19.86%)

However, it's not a universal gain and there are also fairness issues.
For example, this is a fio configuration with a single random reader and
a single random writer on the same machine

fio Throughput
                                              4.12.0                 4.12.0                 4.12.0
                                          legacy-cfq                 mq-bfq            mq-bfq-tput
Hmean     kb/sec-writer-write      398.15 (   0.00%)     4659.18 (1070.21%)     4934.52 (1139.37%)
Hmean     kb/sec-reader-read       507.00 (   0.00%)       66.36 ( -86.91%)       14.68 ( -97.10%)

With CFQ, there is some fairness between the readers and writers and
with BFQ, there is a strong preference to writers. Again, this is not
universal. It'll be a mix and sometimes it'll be classed as a gain and
sometimes a regression.

While I accept that BFQ can be tuned, tuning IO schedulers is not something
that normal users get right and they'll only look at "out of box" performance
which, right now, will trigger bug reports. This is neither good nor bad,
it simply is.

> This configuration is the default
> one because the motivation for yet-another-scheduler as BFQ is that it
> drastically reduces latency for interactive and soft real-time tasks
> (e.g., opening an app or playing/streaming a video), when there is
> some background I/O.  Low-latency heuristics are willing to sacrifice
> throughput when this provides a large benefit in terms of the above
> latency.
> 

I had seen this assertion so one of the fio configurations had multiple
heavy writers in the background and a random reader of small files to
simulate that scenario. The intent was to simulate heavy IO in the presence
of application startup

                                              4.12.0                 4.12.0                 4.12.0
                                          legacy-cfq                 mq-bfq            mq-bfq-tput
Hmean     kb/sec-writer-write     1997.75 (   0.00%)     2035.65 (   1.90%)     2014.50 (   0.84%)
Hmean     kb/sec-reader-read       128.50 (   0.00%)       79.46 ( -38.16%)       12.78 ( -90.06%)

Write throughput is steady-ish across each IO scheduler but readers get
starved badly, which I expect would slow application startup, and disabling
low_latency makes it much worse. The mmtests configuration in question
is global-dhp__io-fio-randread-sync-heavywrite, albeit edited to create
a fresh XFS filesystem on a test partition.

This is not exactly equivalent to real application startup but that can
be difficult to quantify properly.
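
The rough shape of that workload, as an illustrative fio sketch only -- this
is not the actual mmtests job file, and the sizes, job counts and target
directory are invented:

cat > heavywrite-vs-randread.fio <<'EOF'
[global]
directory=/mnt/test
ioengine=psync
runtime=300
time_based

[heavy-writers]
rw=write
bs=1M
size=4G
numjobs=4

[small-random-reader]
rw=randread
bs=4k
size=1G
numjobs=1
EOF
fio heavywrite-vs-randread.fio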

> Of course, BFQ may not be optimal for every workload, even if
> low-latency mode is switched off.  In addition, there may still be
> some bug.  I'll repeat your tests on a machine of mine ASAP.
> 

The intent here is not to rag on BFQ because I know it's going to have some
wins and some losses and will take time to fix up. The primary intent was
to flag that 4.13 might have some "blah blah blah is slower on 4.13" reports
due to the switching of defaults that will bisect to a misleading commit.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-03 10:47       ` Mel Gorman
@ 2017-08-03 11:48         ` Ming Lei
  0 siblings, 0 replies; 40+ messages in thread
From: Ming Lei @ 2017-08-03 11:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, Jens Axboe, Linux Kernel Mailing List,
	linux-block, Paolo Valente

On Thu, Aug 3, 2017 at 6:47 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Thu, Aug 03, 2017 at 05:57:50PM +0800, Ming Lei wrote:
>> On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
>> > On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
>> >> Hi Mel Gorman,
>> >>
>> >> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
>> >> > Hi Christoph,
>> >> >
>> >> > I know the reasons for switching to MQ by default but just be aware that it's
>> >> > not without hazards albeit it the biggest issues I've seen are switching
>> >> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
>> >> > running every few weeks searching for regressions. Yesterday, it noticed
>> >> > that creating some work files for a postgres simulator called pgioperf
>> >> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>> >> > linearly writing two files for testing on another benchmark and is not
>> >> > remarkable. The relevant part of the report is
>> >>
>> >> We saw some SCSI-MQ performance issue too, please see if the following
>> >> patchset fixes your issue:
>> >>
>> >> http://marc.info/?l=linux-block&m=150151989915776&w=2
>> >>
>> >
>> > That series is dealing with problems with legacy-deadline vs mq-none where
>> > as the bulk of the problems reported in this mail are related to
>> > legacy-CFQ vs mq-BFQ.
>>
>> The serials deals with none and all mq schedulers, and you can see
>> the improvement on mq-deadline in cover letter, :-)
>>
>
> Would it be expected to fix a 2x to 4x slowdown as experienced by BFQ
> that was not observed on other schedulers?

Actually if you look at the cover letter, you will see this patchset
increases sequential I/O IOPS by >10X on mq-deadline, so it would be
reasonable to expect it to address a 2x to 4x BFQ slowdown, but I didn't
test BFQ.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-03 11:01   ` Mel Gorman
@ 2017-08-04  7:26       ` Paolo Valente
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Valente @ 2017-08-04  7:26 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block


> On 3 Aug 2017, at 13:01, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> On Thu, Aug 03, 2017 at 11:21:59AM +0200, Paolo Valente wrote:
>>> For Paulo, if you want to try preemptively dealing with regression reports
>>> before 4.13 releases then all the tests in question can be reproduced with
>>> https://github.com/gormanm/mmtests . The most relevant test configurations
>>> I've seen so far are
>>> 
>>> configs/config-global-dhp__io-dbench4-async
>>> configs/config-global-dhp__io-fio-randread-async-randwrite
>>> configs/config-global-dhp__io-fio-randread-async-seqwrite
>>> configs/config-global-dhp__io-fio-randread-sync-heavywrite
>>> configs/config-global-dhp__io-fio-randread-sync-randwrite
>>> configs/config-global-dhp__pgioperf
>>> 
>> 
>> Hi Mel,
>> as it already happened with the latest Phoronix benchmark article (and
>> with other test results reported several months ago on this list), bad
>> results may be caused (also) by the fact that the low-latency, default
>> configuration of BFQ is being used. 
> 
> I took that into account BFQ with low-latency was also tested and the
> impact was not a universal improvement although it can be a noticable
> improvement. From the same machine;
> 
> dbench4 Loadfile Execution Time
>                             4.12.0                 4.12.0                 4.12.0
>                         legacy-cfq                 mq-bfq            mq-bfq-tput
> Amean     1        80.67 (   0.00%)       83.68 (  -3.74%)       84.70 (  -5.00%)
> Amean     2        92.87 (   0.00%)      121.63 ( -30.96%)       88.74 (   4.45%)
> Amean     4       102.72 (   0.00%)      474.33 (-361.77%)      113.97 ( -10.95%)
> Amean     32     2543.93 (   0.00%)     1927.65 (  24.23%)     2038.74 (  19.86%)
> 

Thanks for trying with low_latency disabled.  If I read numbers
correctly, we move from a worst case of 361% higher execution time to
a worst case of 11%.  With a best case of 20% of lower execution time.

I asked you about none and mq-deadline in a previous email, because
actually we have a double change here: change of the I/O stack, and
change of the scheduler, with the first change probably not irrelevant
with respect to the second one.

Are we sure that part of the small losses and gains with bfq-mq-tput
aren't due to the change of I/O stack?  My problem is that it may be
hard to find issues or anomalies in BFQ that justify a 5% or 11% loss
in two cases, while the same scheduler has a 4% and a 20% gain in the
other two cases.

By chance, according to what you have measured so far, is there any
test where, instead, you expect or have seen bfq-mq-tput to always
lose?  I could start from there.

> However, it's not a universal gain and there are also fairness issues.
> For example, this is a fio configuration with a single random reader and
> a single random writer on the same machine
> 
> fio Throughput
>                                               4.12.0                 4.12.0                 4.12.0
>                                           legacy-cfq                 mq-bfq            mq-bfq-tput
> Hmean     kb/sec-writer-write      398.15 (   0.00%)     4659.18 (1070.21%)     4934.52 (1139.37%)
> Hmean     kb/sec-reader-read       507.00 (   0.00%)       66.36 ( -86.91%)       14.68 ( -97.10%)
> 
> With CFQ, there is some fairness between the readers and writers and
> with BFQ, there is a strong preference to writers. Again, this is not
> universal. It'll be a mix and sometimes it'll be classed as a gain and
> sometimes a regression.
> 

Yes, that's why I didn't pay too much attention so far to such an
issue.  I preferred to tune for maximum responsiveness and minimal
latency for soft real-time applications, w.r.t.  to reducing a kind of
unfairness for which no user happened to complain (so far).  Do you
have some real application (or benchmark simulating a real
application) in which we can see actual problems because of this form
of unfairness?  I was thinking of, e.g., two virtual machines, one
doing heavy writes and the other heavy reads.  But in that case,
cgroups have to be used, and I'm not sure we would still see this
problem.  Any suggestion is welcome.

In any case, if needed, changing read/write throughput ratio should
not be a problem.

> While I accept that BFQ can be tuned, tuning IO schedulers is not something
> that normal users get right and they'll only look at "out of box" performance
> which, right now, will trigger bug reports. This is neither good nor bad,
> it simply is.
> 
>> This configuration is the default
>> one because the motivation for yet-another-scheduler as BFQ is that it
>> drastically reduces latency for interactive and soft real-time tasks
>> (e.g., opening an app or playing/streaming a video), when there is
>> some background I/O.  Low-latency heuristics are willing to sacrifice
>> throughput when this provides a large benefit in terms of the above
>> latency.
>> 
> 
> I had seen this assertion so one of the fio configurations had multiple
> heavy writers in the background and a random reader of small files to
> simulate that scenario. The intent was to simulate heavy IO in the presence
> of application startup
> 
>                                               4.12.0                 4.12.0                 4.12.0
>                                           legacy-cfq                 mq-bfq            mq-bfq-tput
> Hmean     kb/sec-writer-write     1997.75 (   0.00%)     2035.65 (   1.90%)     2014.50 (   0.84%)
> Hmean     kb/sec-reader-read       128.50 (   0.00%)       79.46 ( -38.16%)       12.78 ( -90.06%)
> 
> Write throughput is steady-ish across each IO scheduler but readers get
> starved badly which I expect would slow application startup and disabling
> low_latency makes it much worse.

A greedy random reader that goes on steadily mimics an application startup
only for the first handful of seconds.

Where can I find the exact script/configuration you used, to check
more precisely what is going on and whether BFQ is actually behaving very
badly for some reason?

> The mmtests configuration in question
> is global-dhp__io-fio-randread-sync-heavywrite albeit editted to create
> a fresh XFS filesystem on a test partition.
> 
> This is not exactly equivalent to real application startup but that can
> be difficult to quantify properly.
> 

If you do want to check application startup, then just 1) start some
background workload, 2) drop caches, 3) start the app, 4) measure how
long it takes to start.  Otherwise, the comm_startup_lat test in the
S suite [1] does all of this for you.

[1] https://github.com/Algodev-github/S

>> Of course, BFQ may not be optimal for every workload, even if
>> low-latency mode is switched off.  In addition, there may still be
>> some bug.  I'll repeat your tests on a machine of mine ASAP.
>>=20
>=20
> The intent here is not to rag on BFQ because I know it's going to have =
some
> wins and some losses and will take time to fix up. The primary intent =
was
> to flag that 4.13 might have some "blah blah blah is slower on 4.13" =
reports
> due to the switching of defaults that will bisect to a misleading =
commit.
>=20

I see, and being ready in advance is extremely helpful for me.

Thanks,
Paolo

> --=20
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
@ 2017-08-04  7:26       ` Paolo Valente
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Valente @ 2017-08-04  7:26 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block


> Il giorno 03 ago 2017, alle ore 13:01, Mel Gorman <mgorman@techsingularity.net> ha scritto:
> 
> On Thu, Aug 03, 2017 at 11:21:59AM +0200, Paolo Valente wrote:
>>> For Paulo, if you want to try preemptively dealing with regression reports
>>> before 4.13 releases then all the tests in question can be reproduced with
>>> https://github.com/gormanm/mmtests . The most relevant test configurations
>>> I've seen so far are
>>> 
>>> configs/config-global-dhp__io-dbench4-async
>>> configs/config-global-dhp__io-fio-randread-async-randwrite
>>> configs/config-global-dhp__io-fio-randread-async-seqwrite
>>> configs/config-global-dhp__io-fio-randread-sync-heavywrite
>>> configs/config-global-dhp__io-fio-randread-sync-randwrite
>>> configs/config-global-dhp__pgioperf
>>> 
>> 
>> Hi Mel,
>> as it already happened with the latest Phoronix benchmark article (and
>> with other test results reported several months ago on this list), bad
>> results may be caused (also) by the fact that the low-latency, default
>> configuration of BFQ is being used. 
> 
> I took that into account BFQ with low-latency was also tested and the
> impact was not a universal improvement although it can be a noticable
> improvement. From the same machine;
> 
> dbench4 Loadfile Execution Time
>                             4.12.0                 4.12.0                 4.12.0
>                         legacy-cfq                 mq-bfq            mq-bfq-tput
> Amean     1        80.67 (   0.00%)       83.68 (  -3.74%)       84.70 (  -5.00%)
> Amean     2        92.87 (   0.00%)      121.63 ( -30.96%)       88.74 (   4.45%)
> Amean     4       102.72 (   0.00%)      474.33 (-361.77%)      113.97 ( -10.95%)
> Amean     32     2543.93 (   0.00%)     1927.65 (  24.23%)     2038.74 (  19.86%)
> 

Thanks for trying with low_latency disabled.  If I read numbers
correctly, we move from a worst case of 361% higher execution time to
a worst case of 11%.  With a best case of 20% of lower execution time.

I asked you about none and mq-deadline in a previous email, because
actually we have a double change here: change of the I/O stack, and
change of the scheduler, with the first change probably not irrelevant
with respect to the second one.
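
(For completeness, the two changes can also be separated by hand; a
minimal sketch of the knobs I mean, with sdX as a placeholder device:

  # legacy vs blk-mq I/O path for SCSI, chosen at boot (module option):
  #   scsi_mod.use_blk_mq=0   -> legacy path (cfq, deadline, ...)
  #   scsi_mod.use_blk_mq=1   -> blk-mq path (mq-deadline, bfq, none, ...)

  # scheduler selection per device at run time:
  modprobe bfq                                  # if built as a module
  echo bfq > /sys/block/sdX/queue/scheduler

  # "mq-bfq-tput", i.e., bfq with the low-latency heuristics disabled:
  echo 0 > /sys/block/sdX/queue/iosched/low_latency

I'm sure you have all of this scripted already; it's just to be explicit
about which combinations I have in mind.)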

Are we sure that part of the small losses and gains with bfq-mq-tput
isn't due to the change of I/O stack?  My problem is that it may be
hard to find issues or anomalies in BFQ that justify a 5% or 11% loss
in two cases, while the same scheduler has a 4% and a 20% gain in the
other two cases.

By chance, according to what you have measured so far, is there any
test where, instead, you expect or have seen bfq-mq-tput to always
lose?  I could start from there.

> However, it's not a universal gain and there are also fairness issues.
> For example, this is a fio configuration with a single random reader and
> a single random writer on the same machine
> 
> fio Throughput
>                                              4.12.0                 4.12.0                 4.12.0
>                                          legacy-cfq                 mq-bfq            mq-bfq-tput
> Hmean     kb/sec-writer-write      398.15 (   0.00%)     4659.18 (1070.21%)     4934.52 (1139.37%)
> Hmean     kb/sec-reader-read       507.00 (   0.00%)       66.36 ( -86.91%)       14.68 ( -97.10%)
> 
> With CFQ, there is some fairness between the readers and writers and
> with BFQ, there is a strong preference to writers. Again, this is not
> universal. It'll be a mix and sometimes it'll be classed as a gain and
> sometimes a regression.
> 

Yes, that's why I didn't pay too much attention so far to such an
issue.  I preferred to tune for maximum responsiveness and minimal
latency for soft real-time applications, w.r.t.  to reducing a kind of
unfairness for which no user happened to complain (so far).  Do you
have some real application (or benchmark simulating a real
application) in which we can see actual problems because of this form
of unfairness?  I was thinking of, e.g., two virtual machines, one
doing heavy writes and the other heavy reads.  But in that case,
cgroups have to be used, and I'm not sure we would still see this
problem.  Any suggestion is welcome.

In any case, if needed, changing read/write throughput ratio should
not be a problem.

> While I accept that BFQ can be tuned, tuning IO schedulers is not something
> that normal users get right and they'll only look at "out of box" performance
> which, right now, will trigger bug reports. This is neither good nor bad,
> it simply is.
> 
>> This configuration is the default
>> one because the motivation for yet-another-scheduler as BFQ is that it
>> drastically reduces latency for interactive and soft real-time tasks
>> (e.g., opening an app or playing/streaming a video), when there is
>> some background I/O.  Low-latency heuristics are willing to sacrifice
>> throughput when this provides a large benefit in terms of the above
>> latency.
>> 
> 
> I had seen this assertion so one of the fio configurations had multiple
> heavy writers in the background and a random reader of small files to
> simulate that scenario. The intent was to simulate heavy IO in the presence
> of application startup
> 
>                                              4.12.0                 4.12.0                 4.12.0
>                                          legacy-cfq                 mq-bfq            mq-bfq-tput
> Hmean     kb/sec-writer-write     1997.75 (   0.00%)     2035.65 (   1.90%)     2014.50 (   0.84%)
> Hmean     kb/sec-reader-read       128.50 (   0.00%)       79.46 ( -38.16%)       12.78 ( -90.06%)
> 
> Write throughput is steady-ish across each IO scheduler but readers get
> starved badly which I expect would slow application startup and disabling
> low_latency makes it much worse.

A greedy random reader that goes on steadily mimics an application startup
only for the first handful of seconds.

Where can I find the exact script/configuration you used, to check
more precisely what is going on and whether BFQ is actually behaving very
badly for some reason?

> The mmtests configuration in question
> is global-dhp__io-fio-randread-sync-heavywrite albeit editted to create
> a fresh XFS filesystem on a test partition.
> 
> This is not exactly equivalent to real application startup but that can
> be difficult to quantify properly.
> 

If you do want to check application startup, then just 1) start some
background workload, 2) drop caches, 3) start the app, 4) measure how
long it takes to start.  Otherwise, the comm_startup_lat test in the
S suite [1] does all of this for you.

[1] https://github.com/Algodev-github/S
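
For instance, a rough sketch of the manual version (the fio writer below
is just a placeholder for whatever background workload you prefer, and it
writes in the current directory):

  # 1) background workload
  fio --name=bg-writers --rw=randwrite --size=1g --numjobs=4 &
  # 2) drop caches
  sync; echo 3 > /proc/sys/vm/drop_caches
  # 3) and 4) start the app and measure how long it takes
  time xterm /bin/true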

>> Of course, BFQ may not be optimal for every workload, even if
>> low-latency mode is switched off.  In addition, there may still be
>> some bug.  I'll repeat your tests on a machine of mine ASAP.
>> 
> 
> The intent here is not to rag on BFQ because I know it's going to have some
> wins and some losses and will take time to fix up. The primary intent was
> to flag that 4.13 might have some "blah blah blah is slower on 4.13" reports
> due to the switching of defaults that will bisect to a misleading commit.
> 

I see, and being ready in advance is extremely helpful for me.

Thanks,
Paolo

> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-04  7:26       ` Paolo Valente
  (?)
@ 2017-08-04 11:01       ` Mel Gorman
  2017-08-04 22:05           ` Paolo Valente
  -1 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2017-08-04 11:01 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block

On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
> > I took that into account BFQ with low-latency was also tested and the
> > impact was not a universal improvement although it can be a noticable
> > improvement. From the same machine;
> > 
> > dbench4 Loadfile Execution Time
> >                             4.12.0                 4.12.0                 4.12.0
> >                         legacy-cfq                 mq-bfq            mq-bfq-tput
> > Amean     1        80.67 (   0.00%)       83.68 (  -3.74%)       84.70 (  -5.00%)
> > Amean     2        92.87 (   0.00%)      121.63 ( -30.96%)       88.74 (   4.45%)
> > Amean     4       102.72 (   0.00%)      474.33 (-361.77%)      113.97 ( -10.95%)
> > Amean     32     2543.93 (   0.00%)     1927.65 (  24.23%)     2038.74 (  19.86%)
> > 
> 
> Thanks for trying with low_latency disabled.  If I read numbers
> correctly, we move from a worst case of 361% higher execution time to
> a worst case of 11%.  With a best case of 20% of lower execution time.
> 

Yes.

> I asked you about none and mq-deadline in a previous email, because
> actually we have a double change here: change of the I/O stack, and
> change of the scheduler, with the first change probably not irrelevant
> with respect to the second one.
> 

True. However, the difference between legacy-deadline and mq-deadline is
roughly around the 5-10% mark across workloads for SSD. It's not
universally true but the impact is not as severe. While this is not
proof that the stack change is the sole root cause, it makes it less
likely.

> By chance, according to what you have measured so far, is there any
> test where, instead, you expect or have seen bfq-mq-tput to always
> lose?  I could start from there.
> 

global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
it could be the stack change.

global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
ext4 as a filesystem. The same is not true for XFS so the filesystem
matters.

> > However, it's not a universal gain and there are also fairness issues.
> > For example, this is a fio configuration with a single random reader and
> > a single random writer on the same machine
> > 
> > fio Throughput
> >                                              4.12.0                 4.12.0                 4.12.0
> >                                          legacy-cfq                 mq-bfq            mq-bfq-tput
> > Hmean     kb/sec-writer-write      398.15 (   0.00%)     4659.18 (1070.21%)     4934.52 (1139.37%)
> > Hmean     kb/sec-reader-read       507.00 (   0.00%)       66.36 ( -86.91%)       14.68 ( -97.10%)
> > 
> > With CFQ, there is some fairness between the readers and writers and
> > with BFQ, there is a strong preference to writers. Again, this is not
> > universal. It'll be a mix and sometimes it'll be classed as a gain and
> > sometimes a regression.
> > 
> 
> Yes, that's why I didn't pay too much attention so far to such an
> issue.  I preferred to tune for maximum responsiveness and minimal
> latency for soft real-time applications, w.r.t.  to reducing a kind of
> unfairness for which no user happened to complain (so far).  Do you
> have some real application (or benchmark simulating a real
> application) in which we can see actual problems because of this form
> of unfairness? 

I don't have data on that. This was a preliminary study only to see if
a switch was safe running workloads that would appear in internal bug
reports related to benchmarking.

> I was thinking of, e.g., two virtual machines, one
> doing heavy writes and the other heavy reads.  But in that case,
> cgroups have to be used, and I'm not sure we would still see this
> problem.  Any suggestion is welcome.
> 

I haven't spent time designing such a thing. Even if I did, I know I would
get hit within weeks of a switch during distro development with reports
related to fio, dbench and other basic IO benchmarks.

> > I had seen this assertion so one of the fio configurations had multiple
> > heavy writers in the background and a random reader of small files to
> > simulate that scenario. The intent was to simulate heavy IO in the presence
> > of application startup
> > 
> >                                              4.12.0                 4.12.0                 4.12.0
> >                                          legacy-cfq                 mq-bfq            mq-bfq-tput
> > Hmean     kb/sec-writer-write     1997.75 (   0.00%)     2035.65 (   1.90%)     2014.50 (   0.84%)
> > Hmean     kb/sec-reader-read       128.50 (   0.00%)       79.46 ( -38.16%)       12.78 ( -90.06%)
> > 
> > Write throughput is steady-ish across each IO scheduler but readers get
> > starved badly which I expect would slow application startup and disabling
> > low_latency makes it much worse.
> 
> A greedy random reader that goes on steadily mimics an application startup
> only for the first handful of seconds.
> 

Sure, but if during those handful of seconds the throughput is 10% of
what it used to be, it'll still be noticeable.

> Where can I find the exact script/configuration you used, to check
> more precisely what is going on and whether BFQ is actually behaving very
> badly for some reason?
> 

https://github.com/gormanm/mmtests

All the configuration files are in configs/ so
global-dhp__io-dbench4-fsync-ext4 maps to global-dhp__io-dbench4-fsync but
it has to be edited if you want to format a test partition.  Otherwise,
you'd just need to make sure the current directory was ext4 and ignore
any filesystem aging artifacts.
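
Roughly, and from memory, so double-check against the README:

  git clone https://github.com/gormanm/mmtests
  cd mmtests
  # edit configs/config-global-dhp__io-dbench4-fsync so that the
  # test-partition/filesystem variables point at a scratch device, then:
  ./run-mmtests.sh --config configs/config-global-dhp__io-dbench4-fsync test-bfq

where "test-bfq" is just the name the results get stored under.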

> > The mmtests configuration in question
> > is global-dhp__io-fio-randread-sync-heavywrite albeit editted to create
> > a fresh XFS filesystem on a test partition.
> > 
> > This is not exactly equivalent to real application startup but that can
> > be difficult to quantify properly.
> > 
> 
> If you do want to check application startup, then just 1) start some
> background workload, 2) drop caches, 3) start the app, 4) measure how
> long it takes to start.  Otherwise, the comm_startup_lat test in the
> S suite [1] does all of this for you.
> 

I did have something like this before but found it unreliable because it
couldn't tell the difference between when an application has a window
and when it's ready for use. Evolution for example may start up and
start displaying but then clicking on a mail may stall for a few seconds.
It's difficult to quantify meaningfully which is why I eventually gave
up and relied instead on proxy measures.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
@ 2017-08-04 22:05           ` Paolo Valente
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Valente @ 2017-08-04 22:05 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block


> Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <mgorman@techsingularity.net> ha scritto:
> 
> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>> I took that into account BFQ with low-latency was also tested and the
>>> impact was not a universal improvement although it can be a noticable
>>> improvement. From the same machine;
>>> 
>>> dbench4 Loadfile Execution Time
>>>                            4.12.0                 4.12.0                 4.12.0
>>>                        legacy-cfq                 mq-bfq            mq-bfq-tput
>>> Amean     1        80.67 (   0.00%)       83.68 (  -3.74%)       84.70 (  -5.00%)
>>> Amean     2        92.87 (   0.00%)      121.63 ( -30.96%)       88.74 (   4.45%)
>>> Amean     4       102.72 (   0.00%)      474.33 (-361.77%)      113.97 ( -10.95%)
>>> Amean     32     2543.93 (   0.00%)     1927.65 (  24.23%)     2038.74 (  19.86%)
>>> 
>> 
>> Thanks for trying with low_latency disabled.  If I read numbers
>> correctly, we move from a worst case of 361% higher execution time to
>> a worst case of 11%.  With a best case of 20% of lower execution time.
>> 
> 
> Yes.
> 
>> I asked you about none and mq-deadline in a previous email, because
>> actually we have a double change here: change of the I/O stack, and
>> change of the scheduler, with the first change probably not irrelevant
>> with respect to the second one.
>> 
> 
> True. However, the difference between legacy-deadline mq-deadline is
> roughly around the 5-10% mark across workloads for SSD. It's not
> universally true but the impact is not as severe. While this is not
> proof that the stack change is the sole root cause, it makes it less
> likely.
> 

I'm getting a little lost here.  If I'm not mistaken, you are saying,
since the difference between two virtually identical schedulers
(legacy-deadline and mq-deadline) is only around 5-10%, while the
difference between cfq and mq-bfq-tput is higher, then in the latter
case it is not the stack's fault.  Yet the loss of mq-bfq-tput in the
above test is exactly in the 5-10% range?  What am I missing?  Other
tests with mq-bfq-tput not yet reported?

>> By chance, according to what you have measured so far, is there any
>> test where, instead, you expect or have seen bfq-mq-tput to always
>> lose?  I could start from there.
>> 
> 
> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
> it could be the stack change.
> 
> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
> ext4 as a filesystem. The same is not true for XFS so the filesystem
> matters.
> 

Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
soon as I can, thanks.


>>> However, it's not a universal gain and there are also fairness issues.
>>> For example, this is a fio configuration with a single random reader and
>>> a single random writer on the same machine
>>> 
>>> fio Throughput
>>>                                             4.12.0                 4.12.0                 4.12.0
>>>                                         legacy-cfq                 mq-bfq            mq-bfq-tput
>>> Hmean     kb/sec-writer-write      398.15 (   0.00%)     4659.18 (1070.21%)     4934.52 (1139.37%)
>>> Hmean     kb/sec-reader-read       507.00 (   0.00%)       66.36 ( -86.91%)       14.68 ( -97.10%)
>>> 
>>> With CFQ, there is some fairness between the readers and writers and
>>> with BFQ, there is a strong preference to writers. Again, this is not
>>> universal. It'll be a mix and sometimes it'll be classed as a gain and
>>> sometimes a regression.
>>> 
>> 
>> Yes, that's why I didn't pay too much attention so far to such an
>> issue.  I preferred to tune for maximum responsiveness and minimal
>> latency for soft real-time applications, w.r.t.  to reducing a kind of
>> unfairness for which no user happened to complain (so far).  Do you
>> have some real application (or benchmark simulating a real
>> application) in which we can see actual problems because of this form
>> of unfairness? 
> 
> I don't have data on that. This was a preliminary study only to see if
> a switch was safe running workloads that would appear in internal bug
> reports related to benchmarking.
> 
>> I was thinking of, e.g., two virtual machines, one
>> doing heavy writes and the other heavy reads.  But in that case,
>> cgroups have to be used, and I'm not sure we would still see this
>> problem.  Any suggestion is welcome.
>> 
> 
> I haven't spent time designing such a thing. Even if I did, I know I would
> get hit within weeks of a switch during distro development with reports
> related to fio, dbench and other basic IO benchmarks.
> 

I see.

>>> I had seen this assertion so one of the fio configurations had multiple
>>> heavy writers in the background and a random reader of small files to
>>> simulate that scenario. The intent was to simulate heavy IO in the presence
>>> of application startup
>>> 
>>>                                             4.12.0                 4.12.0                 4.12.0
>>>                                         legacy-cfq                 mq-bfq            mq-bfq-tput
>>> Hmean     kb/sec-writer-write     1997.75 (   0.00%)     2035.65 (   1.90%)     2014.50 (   0.84%)
>>> Hmean     kb/sec-reader-read       128.50 (   0.00%)       79.46 ( -38.16%)       12.78 ( -90.06%)
>>> 
>>> Write throughput is steady-ish across each IO scheduler but readers get
>>> starved badly which I expect would slow application startup and disabling
>>> low_latency makes it much worse.
>> 
>> A greedy random reader that goes on steadily mimics an application startup
>> only for the first handful of seconds.
>> 
> 
> Sure, but if during those handful of seconds the throughput is 10% of
> what is used to be, it'll still be noticeable.
> 

I did not have the time yet to repeat this test (I will try soon), but
I had the time to think about it a little bit.  And I soon realized that
actually this is not a responsiveness test against background
workload, or, it is at most an extreme corner case for it.  Both the
write and the read thread start at the same time.  So, we are
mimicking a user starting, e.g., a file copy, and, exactly at the same
time, an app (in addition, the file copy starts to cause heavy writes
immediately).

BFQ uses time patterns to guess which processes to privilege, and the
time patterns of the writer and reader are indistinguishable here.
Only tagging processes with extra information would help, but that is
a different story.  And in this case tagging would help for a
not-so-frequent use case.

In addition, a greedy random reader may mimic the start-up of only
very simple applications.  Even a simple terminal such as xterm does
some I/O (not completely random, but I guess we don't need to be
overpicky), then it stops doing I/O and passes the ball to the X
server, which does some I/O, stops and passes the ball back to xterm
for its final start-up phase.  More and more processes are involved,
and more and more complex I/O patterns are issued as applications
become more complex.  This is the reason why we strived to benchmark
application start-up by truly starting real applications and measuring
their start-up time (see below).

>> Where can I find the exact script/configuration you used, to check
>> more precisely what is going on and whether BFQ is actually behaving very
>> badly for some reason?
>> 
> 
> https://github.com/gormanm/mmtests
> 
> All the configuration files are in configs/ so
> global-dhp__io-dbench4-fsync-ext4 maps to global-dhp__io-dbench4-fsync but
> it has to be editted if you want to format a test partition.  Otherwise,
> you'd just need to make sure the current directory was ext4 and ignore
> any filesystem aging artifacts.
> 

Thank you, I'll do it ASAP.

>>> The mmtests configuration in question
>>> is global-dhp__io-fio-randread-sync-heavywrite albeit editted to create
>>> a fresh XFS filesystem on a test partition.
>>> 
>>> This is not exactly equivalent to real application startup but that can
>>> be difficult to quantify properly.
>>> 
>> 
>> If you do want to check application startup, then just 1) start some
>> background workload, 2) drop caches, 3) start the app, 4) measure how
>> long it takes to start.  Otherwise, the comm_startup_lat test in the
>> S suite [1] does all of this for you.
>> 
> 
> I did have something like this before but found it unreliable because it
> couldn't tell the difference between when an application has a window
> and when it's ready for use. Evolution for example may start up and
> start displaing but then clicking on a mail may stall for a few seconds.
> It's difficult to quantify meaningfully which is why I eventually gave
> up and relied instead on proxy measures.
> 

Right, that's why we looked for other applications that were as
popular, but for which we could get reliable and precise measures.
One such application is a terminal, another one a shell.  On the
opposite end of the size spectrum, other such applications are
libreoffice/openoffice.

For, e.g., gnome-terminal, it is enough to invoke "time gnome-terminal
-e /bin/true".  By the stopwatch, such a command measures very
precisely the time that elapses from when you start the terminal, to
when you can start typing a command in its window.  Similarly, "xterm
/bin/true", "ssh localhost exit", "bash -c exit", "lowriter
--terminate-after-init".  Of course, these tricks certainly cause a
few more block reads than the real, bare application start-up, but,
even if the difference were noticeable in terms of time, what matters
is to measure the execution time of these commands without background
workload, and then compare it against their execution time with some
background workload.  If it takes, say, 5 seconds without background
workload, and still about 5 seconds with background workload and a
given scheduler, but, with another scheduler, it takes 40 seconds with
background workload (all real numbers, actually), then you can draw
some sound conclusions on responsiveness for each of the two
schedulers.
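
In script form the comparison boils down to something like this, where
dd on a scratch file (paths are placeholders) stands in for the
background workload:

  sync; echo 3 > /proc/sys/vm/drop_caches
  time xterm /bin/true                    # baseline, no background I/O

  dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=10000 &
  sync; echo 3 > /proc/sys/vm/drop_caches
  time xterm /bin/true                    # same command, under write load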

In addition, as for coverage, we made the empirical assumption that
start-up time measured with each of the above easy-to-benchmark
applications gives an idea of the time that it would take with any
application of the same size and complexity.  User feedback confirmed
this assumption so far.  Of course there may well be exceptions.

Thanks,
Paolo

> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-04 22:05           ` Paolo Valente
  (?)
@ 2017-08-05 11:54           ` Mel Gorman
  2017-08-07 17:35               ` Paolo Valente
  -1 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2017-08-05 11:54 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block

On Sat, Aug 05, 2017 at 12:05:00AM +0200, Paolo Valente wrote:
> > 
> > True. However, the difference between legacy-deadline mq-deadline is
> > roughly around the 5-10% mark across workloads for SSD. It's not
> > universally true but the impact is not as severe. While this is not
> > proof that the stack change is the sole root cause, it makes it less
> > likely.
> > 
> 
> I'm getting a little lost here.  If I'm not mistaken, you are saying,
> since the difference between two virtually identical schedulers
> (legacy-deadline and mq-deadline) is only around 5-10%, while the
> difference between cfq and mq-bfq-tput is higher, then in the latter
> case it is not the stack's fault.  Yet the loss of mq-bfq-tput in the
> above test is exactly in the 5-10% range?  What am I missing?  Other
> tests with mq-bfq-tput not yet reported?
> 

Unfortunately it's due to very broad generalisations. 10 configurations
from mmtests were used in total when I was checking this. Multiply those by
4 for each tested filesystem and then multiply again for each IO scheduler
on a total of 7 machines, taking 3-4 weeks to execute all tests. The deltas
between each configuration on different machines vary a lot. It is also
an impractical amount of information to present and discuss, and the
point of the original mail was to highlight that switching the default
may create some bug reports, so that no one is too surprised or panics.

The general trend observed was that switching from legacy-deadline to
mq-deadline showed a small regression, but it was not universal and it
wasn't consistent. If nothing else, IO tests that are borderline
are difficult to test for significance as distributions are multimodal.
However, it was generally close enough to conclude "this could be tolerated
and more mq work is on the way". However, it's impossible to give a precise
range of how much of a hit it would take but it generally seemed to be
around the 5% mark.

CFQ switching to BFQ was often more dramatic. Sometimes it doesn't really
matter and sometimes turning off low_latency helped enough. bonnie, which
is a single IO issuer, didn't show much difference in throughput. It had
a few problems with file create/delete but the absolute times there are
so small that tiny differences look relatively large and were ignored.
For the moment, I'll be temporarily ignoring bonnie because it was a
sniff-test only and I didn't expect many surprises from a single IO issuer.

The workload that cropped up as most alarming was dbench, which is ironic
given that it's not actually that IO intensive and tends to be limited by
fsync times. The benchmark has a number of other weaknesses.  It's more
often dominated by scheduler performance, can be gamed by starving all
but one thread of IO to give "better" results and is sensitive to the
exact timing of when writeback occurs, which mmtests tries to mitigate by
reducing the loadfile size. If it turns out that it's the only benchmark
that really suffers then I think we would live with it or find ways of
tuning around it, but fio concerned me.

The fio ones were a concern because of the different read/write throughputs
and the fact that it was not consistently reads or writes that were favoured.
These changes are not necessarily good or bad, but I've seen in the past
that writes that get starved tend to impact workloads that periodically
fsync dirty data (think databases), which then had to be tuned by reducing
dirty_ratio. I've also seen cases where syncing of metadata on some
filesystems would cause large stalls if there was a lot of write starvation.
I regretted not adding pgioperf (a basic simulator of postgres IO behaviour)
to the original set of tests because it tends to be very good at detecting
fsync stalls due to write starvation.
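
For reference, the tuning I mean is along these lines (the values are
only illustrative, not a recommendation):

  # defaults are typically 20 and 10
  sysctl vm.dirty_ratio vm.dirty_background_ratio
  # allow less dirty data to accumulate before writeback kicks in
  sysctl -w vm.dirty_ratio=10
  sysctl -w vm.dirty_background_ratio=5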

> > <SNIP>
> > Sure, but if during those handful of seconds the throughput is 10% of
> > what is used to be, it'll still be noticeable.
> > 
> 
> I did not have the time yet to repeat this test (I will try soon), but
> I had the time think about it a little bit.  And I soon realized that
> actually this is not a responsiveness test against background
> workload, or, it is at most an extreme corner case for it.  Both the
> write and the read thread start at the same time.  So, we are
> mimicking a user starting, e.g., a file copy, and, exactly at the same
> time, an app(in addition, the file copy starts to cause heavy writes
> immediately).
> 

Yes, although it's not entirely unrealistic to have light random readers
and heavy writers starting at the same time. A write-intensive database
can behave like this.
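
If you want a quick approximation of that mix outside mmtests (this is
not the exact job file, just a sketch; the directory, sizes and job
counts are placeholders):

  fio --directory=/mnt/test --size=1g --ioengine=psync \
      --name=heavy-writers --rw=randwrite --numjobs=4 \
      --name=light-reader --rw=randread --numjobs=1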

Also, I wouldn't panic about needing time to repeat this test. This is
not blocking me as such; all I was interested in was checking whether the
switch could be safely made now or whether it should be deferred while
keeping an eye on how it's doing. It's perfectly possible others will make
the switch and find the majority of their workloads are fine. If others
report bugs and they're using rotary storage then it should be obvious to
ask them to test with the legacy block layer and work from there. At least
then, there should be better reference workloads to start from. Unfortunately,
given the scope and the time it takes to test, I had little choice except
to shotgun a few workloads and see what happened.

> BFQ uses time patterns to guess which processes to privilege, and the
> time patterns of the writer and reader are indistinguishable here.
> Only tagging processes with extra information would help, but that is
> a different story.  And in this case tagging would help for a
> not-so-frequent use case.
> 

Hopefully there will not be a reliance on tagging processes. If we're
lucky, I just happened to pick a few IO workloads that seemed to suffer
particularly badly.

> In addition, a greedy random reader may mimick the start-up of only
> very simple applications.  Even a simple terminal such as xterm does
> some I/O (not completely random, but I guess we don't need to be
> overpicky), then it stops doing I/O and passes the ball to the X
> server, which does some I/O, stops and passes the ball back to xterm
> for its final start-up phase.  More and more processes are involved,
> and more and more complex I/O patterns are issued as applications
> become more complex.  This is the reason why we strived to benchmark
> application start-up by truly starting real applications and measuring
> their start-up time (see below).
> 

Which is fair enough, can't argue with that. Again, the intent here is
not to rag on BFQ. I had a few configurations that looked alarming which I
sometimes use as an early warning that complex workloads may have problems
that are harder to debug. It's not always true. Sometimes the early warnings
are red herrings. I've had a long dislike for dbench4 too but each time I
got rid of it, it showed up again on some random bug report which is the
only reason I included it in this evaluation.

> > I did have something like this before but found it unreliable because it
> > couldn't tell the difference between when an application has a window
> > and when it's ready for use. Evolution for example may start up and
> > start displaing but then clicking on a mail may stall for a few seconds.
> > It's difficult to quantify meaningfully which is why I eventually gave
> > up and relied instead on proxy measures.
> > 
> 
> Right, that's why we looked for other applications that were as
> popular, but for which we could get reliable and precise measures.
> One such application is a terminal, another one a shell.  On the
> opposite end of the size spectrum, another other such applications are
> libreoffice/openoffice.
> 

Seems reasonable.

> For, e.g, gnome-terminal, it is enough to invoke "time gnome-terminal
> -e /bin/true".  By the stopwatch, such a command measures very
> precisely the time that elapses from when you start the terminal, to
> when you can start typing a command in its window.  Similarly, "xterm
> /bin/true", "ssh localhost exit", "bash -c exit", "lowriter
> --terminate-after-init".  Of course, these tricks certainly cause a
> few more block reads than the real, bare application start-up, but,
> even if the difference were noticeable in terms of time, what matters
> is to measure the execution time of these commands without background
> workload, and then compare it against their execution time with some
> background workload.  If it takes, say, 5 seconds without background
> workload, and still about 5 seconds with background workload and a
> given scheduler, but, with another scheduler, it takes 40 seconds with
> background workload (all real numbers, actually), then you can draw
> some sound conclusion on responsiveness for the each of the two
> schedulers.
> 

Again, that is a fair enough methodology and will work in many cases.
It's somewhat impractical for me. When I'm checking patches (be they new
patches I developed, patches I'm backporting, or new kernels), I usually
am checking a range of workloads across multiple machines and it's only
when I'm doing live analysis of a problem that I'm directly using a machine.

> In addition, as for coverage, we made the empiric assumption that
> start-up time measured with each of the above easy-to-benchmark
> applications gives an idea of the time that it would take with any
> application of the same size and complexity.  User feedback confirmed
> this assumptions so far.  Of course there may well be exceptions.
> 

FWIW, I also have anecdotal evidence from at least one user that using
BFQ is way better on their desktop than CFQ ever was even under the best
of circumstances. I've had problems directly measuring it empirically but
this was also the first time I switched on BFQ to see what fell out so
it's early days yet.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-04 22:05           ` Paolo Valente
@ 2017-08-07 17:32             ` Paolo Valente
  -1 siblings, 0 replies; 40+ messages in thread
From: Paolo Valente @ 2017-08-07 17:32 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block


> Il giorno 05 ago 2017, alle ore 00:05, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
>> 
>> Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <mgorman@techsingularity.net> ha scritto:
>> 
>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>>> I took that into account BFQ with low-latency was also tested and the
>>>> impact was not a universal improvement although it can be a noticable
>>>> improvement. From the same machine;
>>>> 
>>>> dbench4 Loadfile Execution Time
>>>>                           4.12.0                 4.12.0                 4.12.0
>>>>                       legacy-cfq                 mq-bfq            mq-bfq-tput
>>>> Amean     1        80.67 (   0.00%)       83.68 (  -3.74%)       84.70 (  -5.00%)
>>>> Amean     2        92.87 (   0.00%)      121.63 ( -30.96%)       88.74 (   4.45%)
>>>> Amean     4       102.72 (   0.00%)      474.33 (-361.77%)      113.97 ( -10.95%)
>>>> Amean     32     2543.93 (   0.00%)     1927.65 (  24.23%)     2038.74 (  19.86%)
>>>> 
>>> 
>>> Thanks for trying with low_latency disabled.  If I read numbers
>>> correctly, we move from a worst case of 361% higher execution time to
>>> a worst case of 11%.  With a best case of 20% of lower execution time.
>>> 
>> 
>> Yes.
>> 
>>> I asked you about none and mq-deadline in a previous email, because
>>> actually we have a double change here: change of the I/O stack, and
>>> change of the scheduler, with the first change probably not irrelevant
>>> with respect to the second one.
>>> 
>> 
>> True. However, the difference between legacy-deadline mq-deadline is
>> roughly around the 5-10% mark across workloads for SSD. It's not
>> universally true but the impact is not as severe. While this is not
>> proof that the stack change is the sole root cause, it makes it less
>> likely.
>> 
> 
> I'm getting a little lost here.  If I'm not mistaken, you are saying,
> since the difference between two virtually identical schedulers
> (legacy-deadline and mq-deadline) is only around 5-10%, while the
> difference between cfq and mq-bfq-tput is higher, then in the latter
> case it is not the stack's fault.  Yet the loss of mq-bfq-tput in the
> above test is exactly in the 5-10% range?  What am I missing?  Other
> tests with mq-bfq-tput not yet reported?
> 
>>> By chance, according to what you have measured so far, is there any
>>> test where, instead, you expect or have seen bfq-mq-tput to always
>>> lose?  I could start from there.
>>> 
>> 
>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
>> it could be the stack change.
>> 
>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>> matters.
>> 
> 
> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
> soon as I can, thanks.
> 
> 

I've run this test and tried to further investigate this regression.
For the moment, the gist seems to be that blk-mq plays an important
role, not only with bfq (unless I'm considering the wrong numbers).
Even if your main purpose in this thread was just to give a heads-up,
I guess it may be useful to share what I have found out.  In addition,
I want to ask for some help, to try to get closer to the possible
causes of at least this regression.  If you think it would be better
to open a new thread on this stuff, I'll do it.

First, I got mixed results on my system.  I'll focus only on the
case where mq-bfq-tput achieves its worst relative performance w.r.t.
cfq, which happens with 64 clients.  Still, also in this case
mq-bfq is better than cfq in all average values except Flush.  I don't
know which are the best/right values to look at, so, here's the final
report for both schedulers:

CFQ

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    13120    20.069   348.594
 Close                   133696     0.008    14.642
 LockX                      512     0.009     0.059
 Rename                    7552     1.857   415.418
 ReadX                   270720     0.141   535.632
 WriteX                   89591   421.961  6363.271
 Unlink                   34048     1.281   662.467
 UnlockX                    512     0.007     0.057
 FIND_FIRST               62016     0.086    25.060
 SET_FILE_INFORMATION     15616     0.995   176.621
 QUERY_FILE_INFORMATION   28734     0.004     1.372
 QUERY_PATH_INFORMATION  170240     0.163   820.292
 QUERY_FS_INFORMATION     28736     0.017     4.110
 NTCreateX               178688     0.437   905.567

MQ-BFQ-TPUT

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    13504    75.828 11196.035
 Close                   136896     0.004     3.855
 LockX                      640     0.005     0.031
 Rename                    8064     1.020   288.989
 ReadX                   297600     0.081   685.850
 WriteX                   93515   391.637 12681.517
 Unlink                   34880     0.500   146.928
 UnlockX                    640     0.004     0.032
 FIND_FIRST               63680     0.045   222.491
 SET_FILE_INFORMATION     16000     0.436   686.115
 QUERY_FILE_INFORMATION   30464     0.003     0.773
 QUERY_PATH_INFORMATION  175552     0.044   148.449
 QUERY_FS_INFORMATION     29888     0.009     1.984
 NTCreateX               183152     0.289   300.867

Are these results in line with yours for this test?

Anyway, to investigate this regression more in depth, I took two
further steps.  First, I repeated the same test with bfq-sq, my
out-of-tree version of bfq for legacy block (identical to mq-bfq apart
from the changes needed for bfq to live in blk-mq).  I got:

BFQ-SQ-TPUT

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    12618    30.212   484.099
 Close                   123884     0.008    10.477
 LockX                      512     0.010     0.170
 Rename                    7296     2.032   426.409
 ReadX                   262179     0.251   985.478
 WriteX                   84072   461.398  7283.003
 Unlink                   33076     1.685   848.734
 UnlockX                    512     0.007     0.036
 FIND_FIRST               58690     0.096   220.720
 SET_FILE_INFORMATION     14976     1.792   466.435
 QUERY_FILE_INFORMATION   26575     0.004     2.194
 QUERY_PATH_INFORMATION  158125     0.112   614.063
 QUERY_FS_INFORMATION     28224     0.017     1.385
 NTCreateX               167877     0.827   945.644

So, the worst-case regression is now around 15%.  This made me suspect
that blk-mq influences results a lot for this test.  To crosscheck, I
compared legacy-deadline and mq-deadline too.

LEGACY-DEADLINE

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    13267     9.622   298.206
 Close                   135692     0.007    10.627
 LockX                      640     0.008     0.066
 Rename                    7827     0.544   481.123
 ReadX                   285929     0.220  2698.442
 WriteX                   92309   430.867  5191.608
 Unlink                   34534     1.133   619.235
 UnlockX                    640     0.008     0.724
 FIND_FIRST               63289     0.086    56.851
 SET_FILE_INFORMATION     16000     1.254   844.065
 QUERY_FILE_INFORMATION   29883     0.004     0.618
 QUERY_PATH_INFORMATION  173232     0.089  1295.651
 QUERY_FS_INFORMATION     29632     0.017     4.813
 NTCreateX               181464     0.479  2214.343


MQ-DEADLINE

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    13760    90.542 13221.495
 Close                   137654     0.008    27.133
 LockX                      640     0.009     0.115
 Rename                    8064     1.062   246.759
 ReadX                   297956     0.051   347.018
 WriteX                   94698   425.636 15090.020
 Unlink                   35077     0.580   208.462
 UnlockX                    640     0.007     0.291
 FIND_FIRST               66630     0.566   530.339
 SET_FILE_INFORMATION     16000     1.419   811.494
 QUERY_FILE_INFORMATION   30717     0.004     1.108
 QUERY_PATH_INFORMATION  176153     0.182   517.419
 QUERY_FS_INFORMATION     30857     0.018    18.562
 NTCreateX               184145     0.281   582.076

So, with both bfq and deadline there seems to be a serious regression,
especially on MaxLat, when moving from legacy block to blk-mq.  The
regression is much worse with deadline, as legacy-deadline has the
lowest max latency among all the schedulers, whereas mq-deadline has
the highest one.
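
(As a side note, the per-operation deltas above can be extracted
automatically; here is a minimal sketch, not the script actually used,
that compares two dbench summary tables like those reported in this
thread.  The file names are placeholders.)

  # print the relative change in AvgLat/MaxLat per operation between two dbench reports
  import sys

  def parse(path):
      ops = {}
      for line in open(path):
          parts = line.split()
          # data rows look like: "Flush  13760  90.542  13221.495"
          if len(parts) == 4 and parts[1].isdigit():
              ops[parts[0]] = (float(parts[2]), float(parts[3]))
      return ops

  base, test = parse(sys.argv[1]), parse(sys.argv[2])   # e.g. legacy.txt blk-mq.txt
  for op in sorted(base, key=lambda o: -test.get(o, (0.0, 0.0))[1]):
      if op in test:
          (ba, bm), (ta, tm) = base[op], test[op]
          print(f"{op:24s} AvgLat {100*(ta-ba)/ba:+8.1f}%   MaxLat {100*(tm-bm)/bm:+8.1f}%")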

Regardless of the actual culprit of this regression, I would like to
investigate this issue further, and in this respect I would like to
ask for a little help.  My goal is to isolate the workloads generating
the highest latencies.  For this purpose, I had a look at the loadfile
client-tiny.txt, and I still have a doubt: is every item in the
loadfile executed several times (once for each value of the number
of clients), or is it executed only once?  More precisely, IIUC, for
each operation reported in the above results, there are several items
(lines) in the loadfile.  So, is each of these items executed only
once?

I'm asking because, if it is executed only once, then I guess I can
find the critical tasks more easily.  Finally, if it is actually
executed only once, is it expected that the latency for such a task is
one order of magnitude higher than the average latency for
that group of tasks?  I mean, is such a task intrinsically much
heavier, and thus expected to take much longer, or is the fact that its
latency is much higher a sign that something in the kernel
misbehaves for that task?
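
(For reference, a minimal sketch of counting the loadfile items per
operation, to relate them to the Count column above; it assumes that
the first whitespace-separated token of each non-comment line in the
loadfile names the operation.)

  # count loadfile lines per operation, e.g.: python3 count_ops.py client-tiny.txt
  import sys
  from collections import Counter

  counts = Counter()
  with open(sys.argv[1]) as f:
      for line in f:
          line = line.strip()
          if line and not line.startswith("#"):
              counts[line.split()[0]] += 1

  for op, n in counts.most_common():
      print(f"{op:24s} {n}")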

While waiting for some feedback, I'm going to run your test
showing great unfairness between writes and reads, and also check
whether responsiveness worsens when the write workload for that test
is running in the background.
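
(A rough, minimal sketch of such a check, not the mmtests test itself:
it starts a sequential writer in the background and times small reads
of an existing file.  Both file paths and the iteration count are
arbitrary placeholders.)

  # probe read latency while a background write stream keeps the device busy
  import os, random, threading, time

  READ_FILE = "/tmp/readprobe.bin"    # pre-created file, ideally a few GiB
  WRITE_FILE = "/tmp/bgwriter.bin"
  stop = threading.Event()

  def bg_writer():
      buf = b"\0" * (1 << 20)
      with open(WRITE_FILE, "wb") as f:
          while not stop.is_set():
              f.write(buf)
              os.fsync(f.fileno())    # keep dirty data flowing to the device

  threading.Thread(target=bg_writer, daemon=True).start()

  size = os.path.getsize(READ_FILE)
  fd = os.open(READ_FILE, os.O_RDONLY)
  lats = []
  for _ in range(200):
      os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)   # avoid page-cache hits
      off = random.randrange(0, max(1, size - 4096))
      t0 = time.monotonic()
      os.pread(fd, 4096, off)
      lats.append((time.monotonic() - t0) * 1000)
  stop.set()
  lats.sort()
  print(f"read latency [ms]: p50={lats[len(lats)//2]:.1f} max={lats[-1]:.1f}")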

Thanks,
Paolo

> ...
>> -- 
>> Mel Gorman
>> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
@ 2017-08-07 17:35               ` Paolo Valente
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Valente @ 2017-08-07 17:35 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block


> On 5 Aug 2017, at 13:54, Mel Gorman <mgorman@techsingularity.net> wrote:
> ...
> 
>> In addition, as for coverage, we made the empiric assumption that
>> start-up time measured with each of the above easy-to-benchmark
>> applications gives an idea of the time that it would take with any
>> application of the same size and complexity.  User feedback confirmed
>> this assumption so far.  Of course there may well be exceptions.
>> 
> 
> FWIW, I also have anecdotal evidence from at least one user that using
> BFQ is way better on their desktop than CFQ ever was even under the best
> of circumstances. I've had problems directly measuring it empirically but
> this was also the first time I switched on BFQ to see what fell out so
> it's early days yet.
> 

Yeah, I'm constantly trying (without great success so far :) ) to turn
this folklore into shared, repeatable tests and numbers.  The latter
could then be reliably evaluated, questioned or defended.
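
(As an example of the kind of repeatable number meant here, a minimal
sketch that times a cold application start N times; the command and the
iteration count are arbitrary, dropping caches requires root, and a
background write workload can be added to stress responsiveness.)

  # measure cold start-up time of a command, e.g. while a writer runs in the background
  import statistics, subprocess, time

  CMD = ["xterm", "-e", "true"]       # stand-in for a real application
  samples = []
  for _ in range(10):
      subprocess.run(["sync"])
      with open("/proc/sys/vm/drop_caches", "w") as f:
          f.write("3\n")              # make the next start a cold one
      t0 = time.monotonic()
      subprocess.run(CMD)
      samples.append(time.monotonic() - t0)

  print(f"start-up time: mean={statistics.mean(samples):.2f}s "
        f"stdev={statistics.stdev(samples):.2f}s")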

Thanks,
Paolo

> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
@ 2017-08-07 18:42               ` Paolo Valente
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Valente @ 2017-08-07 18:42 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block


> On 7 Aug 2017, at 19:32, Paolo Valente <paolo.valente@linaro.org> wrote:
> 
>> 
>> On 5 Aug 2017, at 00:05, Paolo Valente <paolo.valente@linaro.org> wrote:
>> 
>>> 
>>> On 4 Aug 2017, at 13:01, Mel Gorman <mgorman@techsingularity.net> wrote:
>>> 
>>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>>>> I took that into account.  BFQ with low-latency was also tested and the
>>>>> impact was not a universal improvement although it can be a noticeable
>>>>> improvement. From the same machine;
>>>>> 
>>>>> dbench4 Loadfile Execution Time
>>>>>                          4.12.0                 4.12.0                 4.12.0
>>>>>                      legacy-cfq                 mq-bfq            mq-bfq-tput
>>>>> Amean     1        80.67 (   0.00%)       83.68 (  -3.74%)       84.70 (  -5.00%)
>>>>> Amean     2        92.87 (   0.00%)      121.63 ( -30.96%)       88.74 (   4.45%)
>>>>> Amean     4       102.72 (   0.00%)      474.33 (-361.77%)      113.97 ( -10.95%)
>>>>> Amean     32     2543.93 (   0.00%)     1927.65 (  24.23%)     2038.74 (  19.86%)
>>>>> 
>>>> 
>>>> Thanks for trying with low_latency disabled.  If I read numbers
>>>> correctly, we move from a worst case of 361% higher execution time to
>>>> a worst case of 11%.  With a best case of 20% of lower execution time.
>>>> 
>>> 
>>> Yes.
>>> 
>>>> I asked you about none and mq-deadline in a previous email, because
>>>> actually we have a double change here: change of the I/O stack, and
>>>> change of the scheduler, with the first change probably not irrelevant
>>>> with respect to the second one.
>>>> 
>>> 
>>> True. However, the difference between legacy-deadline mq-deadline is
>>> roughly around the 5-10% mark across workloads for SSD. It's not
>>> universally true but the impact is not as severe. While this is not
>>> proof that the stack change is the sole root cause, it makes it less
>>> likely.
>>> 
>> 
>> I'm getting a little lost here.  If I'm not mistaken, you are saying,
>> since the difference between two virtually identical schedulers
>> (legacy-deadline and mq-deadline) is only around 5-10%, while the
>> difference between cfq and mq-bfq-tput is higher, then in the latter
>> case it is not the stack's fault.  Yet the loss of mq-bfq-tput in the
>> above test is exactly in the 5-10% range?  What am I missing?  Other
>> tests with mq-bfq-tput not yet reported?
>> 
>>>> By chance, according to what you have measured so far, is there any
>>>> test where, instead, you expect or have seen bfq-mq-tput to always
>>>> lose?  I could start from there.
>>>> 
>>> 
>>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
>>> it could be the stack change.
>>> 
>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>>> matters.
>>> 
>> 
>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
>> soon as I can, thanks.
>> 
>> 
> 
> I've run this test and tried to further investigate this regression.
> For the moment, the gist seems to be that blk-mq plays an important
> role, not only with bfq (unless I'm considering the wrong numbers).
> Even if your main purpose in this thread was just to give a heads-up,
> I guess it may be useful to share what I have found out.  In addition,
> I want to ask for some help, to try to get closer to the possible
> causes of at least this regression.  If you think it would be better
> to open a new thread on this stuff, I'll do it.
> 
> First, I got mixed results on my system.  I'll focus only on the
> case where mq-bfq-tput achieves its worst relative performance w.r.t.
> cfq, which happens with 64 clients.  Still, also in this case
> mq-bfq is better than cfq in all average values, but Flush.  I don't
> know which are the best/right values to look at, so, here's the final
> report for both schedulers:
> 
> CFQ
> 
> Operation                Count    AvgLat    MaxLat
> --------------------------------------------------
> Flush                    13120    20.069   348.594
> Close                   133696     0.008    14.642
> LockX                      512     0.009     0.059
> Rename                    7552     1.857   415.418
> ReadX                   270720     0.141   535.632
> WriteX                   89591   421.961  6363.271
> Unlink                   34048     1.281   662.467
> UnlockX                    512     0.007     0.057
> FIND_FIRST               62016     0.086    25.060
> SET_FILE_INFORMATION     15616     0.995   176.621
> QUERY_FILE_INFORMATION   28734     0.004     1.372
> QUERY_PATH_INFORMATION  170240     0.163   820.292
> QUERY_FS_INFORMATION     28736     0.017     4.110
> NTCreateX               178688     0.437   905.567
> 
> MQ-BFQ-TPUT
> 
> Operation                Count    AvgLat    MaxLat
> --------------------------------------------------
> Flush                    13504    75.828 11196.035
> Close                   136896     0.004     3.855
> LockX                      640     0.005     0.031
> Rename                    8064     1.020   288.989
> ReadX                   297600     0.081   685.850
> WriteX                   93515   391.637 12681.517
> Unlink                   34880     0.500   146.928
> UnlockX                    640     0.004     0.032
> FIND_FIRST               63680     0.045   222.491
> SET_FILE_INFORMATION     16000     0.436   686.115
> QUERY_FILE_INFORMATION   30464     0.003     0.773
> QUERY_PATH_INFORMATION  175552     0.044   148.449
> QUERY_FS_INFORMATION     29888     0.009     1.984
> NTCreateX               183152     0.289   300.867
> 
> Are these results in line with yours for this test?
> 
> Anyway, to investigate this regression more in depth, I took two
> further steps.  First, I repeated the same test with bfq-sq, my
> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
> from the changes needed for bfq to live in blk-mq).  I got:
> 
> BFQ-SQ-TPUT
> 
> Operation                Count    AvgLat    MaxLat
> --------------------------------------------------
> Flush                    12618    30.212   484.099
> Close                   123884     0.008    10.477
> LockX                      512     0.010     0.170
> Rename                    7296     2.032   426.409
> ReadX                   262179     0.251   985.478
> WriteX                   84072   461.398  7283.003
> Unlink                   33076     1.685   848.734
> UnlockX                    512     0.007     0.036
> FIND_FIRST               58690     0.096   220.720
> SET_FILE_INFORMATION     14976     1.792   466.435
> QUERY_FILE_INFORMATION   26575     0.004     2.194
> QUERY_PATH_INFORMATION  158125     0.112   614.063
> QUERY_FS_INFORMATION     28224     0.017     1.385
> NTCreateX               167877     0.827   945.644
> 
> So, the worst-case regression is now around 15%.  This made me suspect
> that blk-mq influences results a lot for this test.  To crosscheck, I
> compared legacy-deadline and mq-deadline too.
> 

Ok, I found the cause of the 15% loss in bfq-sq.  bfq-sq occasionally
gets confused by the workload, and grants device idling to processes
that, for this specific workload, would better be de-scheduled
immediately.  If we set slice_idle to 0, then bfq-sq becomes more or
less equivalent to cfq (for some operations apparently even much
better):

bfq-sq-tput-0idle

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    13013    17.888   280.517
 Close                   133004     0.008    20.698
 LockX                      512     0.008     0.088
 Rename                    7427     2.041   193.232
 ReadX                   270534     0.138   408.534
 WriteX                   88598   429.615  6272.212
 Unlink                   33734     1.205   559.152
 UnlockX                    512     0.011     1.808
 FIND_FIRST               61762     0.087    23.012
 SET_FILE_INFORMATION     15337     1.322   220.155
 QUERY_FILE_INFORMATION   28415     0.004     0.559
 QUERY_PATH_INFORMATION  169423     0.150   580.570
 QUERY_FS_INFORMATION     28547     0.019    24.466
 NTCreateX               177618     0.544   681.795
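
(For reference, a minimal sketch of how slice_idle can be read and
changed at run time through sysfs; the device name is only an example,
writing requires root, and the file exists only while a scheduler that
exposes slice_idle, such as cfq or bfq, is active.)

  # show the current slice_idle and optionally set a new value
  import pathlib, sys

  DEV = "sda"   # example device name, adjust to the disk under test
  path = pathlib.Path(f"/sys/block/{DEV}/queue/iosched/slice_idle")

  print("slice_idle =", path.read_text().strip())
  if len(sys.argv) > 1:
      path.write_text(sys.argv[1] + "\n")   # e.g. "0" to disable device idling
      print("slice_idle =", path.read_text().strip())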

I'll soon try with mq-bfq too, for which, however, I expect a deeper
investigation to be needed.

Thanks,
Paolo

> LEGACY-DEADLINE
> 
> Operation                Count    AvgLat    MaxLat
> --------------------------------------------------
> Flush                    13267     9.622   298.206
> Close                   135692     0.007    10.627
> LockX                      640     0.008     0.066
> Rename                    7827     0.544   481.123
> ReadX                   285929     0.220  2698.442
> WriteX                   92309   430.867  5191.608
> Unlink                   34534     1.133   619.235
> UnlockX                    640     0.008     0.724
> FIND_FIRST               63289     0.086    56.851
> SET_FILE_INFORMATION     16000     1.254   844.065
> QUERY_FILE_INFORMATION   29883     0.004     0.618
> QUERY_PATH_INFORMATION  173232     0.089  1295.651
> QUERY_FS_INFORMATION     29632     0.017     4.813
> NTCreateX               181464     0.479  2214.343
> 
> 
> MQ-DEADLINE
> 
> Operation                Count    AvgLat    MaxLat
> --------------------------------------------------
> Flush                    13760    90.542 13221.495
> Close                   137654     0.008    27.133
> LockX                      640     0.009     0.115
> Rename                    8064     1.062   246.759
> ReadX                   297956     0.051   347.018
> WriteX                   94698   425.636 15090.020
> Unlink                   35077     0.580   208.462
> UnlockX                    640     0.007     0.291
> FIND_FIRST               66630     0.566   530.339
> SET_FILE_INFORMATION     16000     1.419   811.494
> QUERY_FILE_INFORMATION   30717     0.004     1.108
> QUERY_PATH_INFORMATION  176153     0.182   517.419
> QUERY_FS_INFORMATION     30857     0.018    18.562
> NTCreateX               184145     0.281   582.076
> 
> So, with both bfq and deadline there seems to be a serious regression,
> especially on MaxLat, when moving from legacy block to blk-mq.  The
> regression is much worse with deadline, as legacy-deadline has the
> lowest max latency among all the schedulers, whereas mq-deadline has
> the highest one.
> 
> Regardless of the actual culprit of this regression, I would like to
> investigate further this issue.  In this respect, I would like to ask
> for a little help.  I would like to isolate the workloads generating
> the highest latencies.  To this purpose, I had a look at the loadfile
> client-tiny.txt, and I still have a doubt: is every item in the
> loadfile executed somehow several times (for each value of the number
> of clients), or is it executed only once?  More precisely, IIUC, for
> each operation reported in the above results, there are several items
> (lines) in the loadfile.  So, is each of these items executed only
> once?
> 
> I'm asking because, if it is executed only once, then I guess I can
> find the critical tasks more easily.  Finally, if it is actually
> executed only once, is it expected that the latency for such a task is
> one order of magnitude higher than that of the average latency for
> that group of tasks?  I mean, is such a task intrinsically much
> heavier, and then expectedly much longer, or is the fact that latency
> is much higher for this task a sign that something in the kernel
> misbehaves for that task?
> 
> While waiting for some feedback, I'm going to execute your test
> showing great unfairness between writes and reads, and to also check
> whether responsiveness does worsen if the write workload for that test
> is being executed in the background.
> 
> Thanks,
> Paolo
> 
>> ...
>>> -- 
>>> Mel Gorman
>>> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
@ 2017-08-08  8:06                 ` Paolo Valente
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Valente @ 2017-08-08  8:06 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block


> On 7 Aug 2017, at 20:42, Paolo Valente <paolo.valente@linaro.org> wrote:
> 
>> 
>> On 7 Aug 2017, at 19:32, Paolo Valente <paolo.valente@linaro.org> wrote:
>> 
>>> 
>>> On 5 Aug 2017, at 00:05, Paolo Valente <paolo.valente@linaro.org> wrote:
>>> 
>>>> 
>>>> On 4 Aug 2017, at 13:01, Mel Gorman <mgorman@techsingularity.net> wrote:
>>>> 
>>>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>>>>> I took that into account.  BFQ with low-latency was also tested and the
>>>>>> impact was not a universal improvement although it can be a noticeable
>>>>>> improvement. From the same machine;
>>>>>> 
>>>>>> dbench4 Loadfile Execution Time
>>>>>>                         4.12.0                 4.12.0                 4.12.0
>>>>>>                     legacy-cfq                 mq-bfq            mq-bfq-tput
>>>>>> Amean     1        80.67 (   0.00%)       83.68 (  -3.74%)       84.70 (  -5.00%)
>>>>>> Amean     2        92.87 (   0.00%)      121.63 ( -30.96%)       88.74 (   4.45%)
>>>>>> Amean     4       102.72 (   0.00%)      474.33 (-361.77%)      113.97 ( -10.95%)
>>>>>> Amean     32     2543.93 (   0.00%)     1927.65 (  24.23%)     2038.74 (  19.86%)
>>>>>> 
>>>>> 
>>>>> Thanks for trying with low_latency disabled.  If I read numbers
>>>>> correctly, we move from a worst case of 361% higher execution time to
>>>>> a worst case of 11%.  With a best case of 20% of lower execution time.
>>>>> 
>>>> 
>>>> Yes.
>>>> 
>>>>> I asked you about none and mq-deadline in a previous email, because
>>>>> actually we have a double change here: change of the I/O stack, and
>>>>> change of the scheduler, with the first change probably not irrelevant
>>>>> with respect to the second one.
>>>>> 
>>>> 
>>>> True. However, the difference between legacy-deadline and mq-deadline is
>>>> roughly around the 5-10% mark across workloads for SSD. It's not
>>>> universally true but the impact is not as severe. While this is not
>>>> proof that the stack change is the sole root cause, it makes it less
>>>> likely.
>>>> 
>>> 
>>> I'm getting a little lost here.  If I'm not mistaken, you are saying,
>>> since the difference between two virtually identical schedulers
>>> (legacy-deadline and mq-deadline) is only around 5-10%, while the
>>> difference between cfq and mq-bfq-tput is higher, then in the latter
>>> case it is not the stack's fault.  Yet the loss of mq-bfq-tput in the
>>> above test is exactly in the 5-10% range?  What am I missing?  Other
>>> tests with mq-bfq-tput not yet reported?
>>> 
>>>>> By chance, according to what you have measured so far, is there any
>>>>> test where, instead, you expect or have seen bfq-mq-tput to always
>>>>> lose?  I could start from there.
>>>>> 
>>>> 
>>>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
>>>> it could be the stack change.
>>>> 
>>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>>>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>>>> matters.
>>>> 
>>> 
>>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
>>> soon as I can, thanks.
>>> 
>>> 
>> 
>> I've run this test and tried to further investigate this regression.
>> For the moment, the gist seems to be that blk-mq plays an important
>> role, not only with bfq (unless I'm considering the wrong numbers).
>> Even if your main purpose in this thread was just to give a heads-up,
>> I guess it may be useful to share what I have found out.  In addition,
>> I want to ask for some help, to try to get closer to the possible
>> causes of at least this regression.  If you think it would be better
>> to open a new thread on this stuff, I'll do it.
>> 
>> First, I got mixed results on my system.  I'll focus only on the
>> case where mq-bfq-tput achieves its worst relative performance
>> w.r.t. cfq, which happens with 64 clients.  Still, also in this case
>> mq-bfq is better than cfq in all average values, but Flush.  I don't
>> know which are the best/right values to look at, so, here's the final
>> report for both schedulers:
>> 
>> CFQ
>> 
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    13120    20.069   348.594
>> Close                   133696     0.008    14.642
>> LockX                      512     0.009     0.059
>> Rename                    7552     1.857   415.418
>> ReadX                   270720     0.141   535.632
>> WriteX                   89591   421.961  6363.271
>> Unlink                   34048     1.281   662.467
>> UnlockX                    512     0.007     0.057
>> FIND_FIRST               62016     0.086    25.060
>> SET_FILE_INFORMATION     15616     0.995   176.621
>> QUERY_FILE_INFORMATION   28734     0.004     1.372
>> QUERY_PATH_INFORMATION  170240     0.163   820.292
>> QUERY_FS_INFORMATION     28736     0.017     4.110
>> NTCreateX               178688     0.437   905.567
>> 
>> MQ-BFQ-TPUT
>> 
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    13504    75.828 11196.035
>> Close                   136896     0.004     3.855
>> LockX                      640     0.005     0.031
>> Rename                    8064     1.020   288.989
>> ReadX                   297600     0.081   685.850
>> WriteX                   93515   391.637 12681.517
>> Unlink                   34880     0.500   146.928
>> UnlockX                    640     0.004     0.032
>> FIND_FIRST               63680     0.045   222.491
>> SET_FILE_INFORMATION     16000     0.436   686.115
>> QUERY_FILE_INFORMATION   30464     0.003     0.773
>> QUERY_PATH_INFORMATION  175552     0.044   148.449
>> QUERY_FS_INFORMATION     29888     0.009     1.984
>> NTCreateX               183152     0.289   300.867
>> 
>> Are these results in line with yours for this test?
>> 
>> Anyway, to investigate this regression more in depth, I took two
>> further steps.  First, I repeated the same test with bfq-sq, my
>> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
>> from the changes needed for bfq to live in blk-mq).  I got:
>> 
>> BFQ-SQ-TPUT
>> 
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    12618    30.212   484.099
>> Close                   123884     0.008    10.477
>> LockX                      512     0.010     0.170
>> Rename                    7296     2.032   426.409
>> ReadX                   262179     0.251   985.478
>> WriteX                   84072   461.398  7283.003
>> Unlink                   33076     1.685   848.734
>> UnlockX                    512     0.007     0.036
>> FIND_FIRST               58690     0.096   220.720
>> SET_FILE_INFORMATION     14976     1.792   466.435
>> QUERY_FILE_INFORMATION   26575     0.004     2.194
>> QUERY_PATH_INFORMATION  158125     0.112   614.063
>> QUERY_FS_INFORMATION     28224     0.017     1.385
>> NTCreateX               167877     0.827   945.644
>> 
>> So, the worst-case regression is now around 15%.  This made me suspect
>> that blk-mq influences results a lot for this test.  To crosscheck, I
>> compared legacy-deadline and mq-deadline too.
>> 
> 
> Ok, found the problem for the 15% loss in bfq-sq.  bfq-sq gets
> occasionally confused by the workload, and grants device idling to
> processes that, for this specific workload, would be better to
> de-schedule immediately.  If we set slice_idle to 0, then bfq-sq
> becomes more or less equivalent to cfq (for some operations apparently
> even much better):
> 
> bfq-sq-tput-0idle
> 
> Operation                Count    AvgLat    MaxLat
> --------------------------------------------------
> Flush                    13013    17.888   280.517
> Close                   133004     0.008    20.698
> LockX                      512     0.008     0.088
> Rename                    7427     2.041   193.232
> ReadX                   270534     0.138   408.534
> WriteX                   88598   429.615  6272.212
> Unlink                   33734     1.205   559.152
> UnlockX                    512     0.011     1.808
> FIND_FIRST               61762     0.087    23.012
> SET_FILE_INFORMATION     15337     1.322   220.155
> QUERY_FILE_INFORMATION   28415     0.004     0.559
> QUERY_PATH_INFORMATION  169423     0.150   580.570
> QUERY_FS_INFORMATION     28547     0.019    24.466
> NTCreateX               177618     0.544   681.795
> 
> I'll try soon with mq-bfq too, for which I expect however a deeper
> investigation to be needed.
> 

Hi,
to test mq-bfq (with both slice_idle==0 and slice_idle>0), I have also
applied Ming's patches, and ah, victory!

Regardless of the value of slice idle:

mq-bfq-tput

 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    13183    70.381  1025.407
 Close                   134539     0.004     1.011
 LockX                      512     0.005     0.025
 Rename                    7721     0.740   404.979
 ReadX                   274422     0.126   873.364
 WriteX                   90535   408.371  7400.585
 Unlink                   34276     0.634   581.067
 UnlockX                    512     0.003     0.029
 FIND_FIRST               62664     0.052   321.027
 SET_FILE_INFORMATION     15981     0.234   124.739
 QUERY_FILE_INFORMATION   29042     0.003     1.731
 QUERY_PATH_INFORMATION  171769     0.032   522.415
 QUERY_FS_INFORMATION     28958     0.009     3.043
 NTCreateX               179643     0.298   687.466

Throughput 9.11183 MB/sec  64 clients  64 procs  max_latency=7400.588 ms

Unlike bfq-sq, setting slice_idle to 0 doesn't provide any
benefit, which leads me to suspect that there is some other issue in
blk-mq (only a suspicion).  I think I may have already understood how to
guarantee that bfq almost never idles the device uselessly also for
this workload.  Yet, since in blk-mq there is no gain even after
excluding useless idling, I'll wait for at least Ming's patches to be
merged before possibly proposing this contribution.  Maybe some other
little issue related to this lack of gain in blk-mq will be found and
solved in the meantime.
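
For reference, everything compared above is switched through plain sysfs
attributes and boot parameters; a minimal sketch of the knobs involved
(sda and the scheduler names are only examples, and the iosched entries
exist only while cfq/bfq is the active scheduler):

  # legacy block vs blk-mq for SCSI is selected at boot:
  #   scsi_mod.use_blk_mq=0  -> legacy path (cfq, deadline, bfq-sq)
  #   scsi_mod.use_blk_mq=1  -> blk-mq path (mq-deadline, bfq, none)

  # pick the scheduler for the device under test
  cat /sys/block/sda/queue/scheduler
  echo bfq > /sys/block/sda/queue/scheduler

  # cfq/bfq only: the -tput configurations disable low_latency,
  # and the 0idle runs additionally set slice_idle to 0
  echo 0 > /sys/block/sda/queue/iosched/low_latency
  echo 0 > /sys/block/sda/queue/iosched/slice_idle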

Moving to the read-write unfairness problem.

Thanks,
Paolo

> Thanks,
> Paolo
> 
>> LEGACY-DEADLINE
>> 
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    13267     9.622   298.206
>> Close                   135692     0.007    10.627
>> LockX                      640     0.008     0.066
>> Rename                    7827     0.544   481.123
>> ReadX                   285929     0.220  2698.442
>> WriteX                   92309   430.867  5191.608
>> Unlink                   34534     1.133   619.235
>> UnlockX                    640     0.008     0.724
>> FIND_FIRST               63289     0.086    56.851
>> SET_FILE_INFORMATION     16000     1.254   844.065
>> QUERY_FILE_INFORMATION   29883     0.004     0.618
>> QUERY_PATH_INFORMATION  173232     0.089  1295.651
>> QUERY_FS_INFORMATION     29632     0.017     4.813
>> NTCreateX               181464     0.479  2214.343
>> 
>> 
>> MQ-DEADLINE
>> 
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    13760    90.542 13221.495
>> Close                   137654     0.008    27.133
>> LockX                      640     0.009     0.115
>> Rename                    8064     1.062   246.759
>> ReadX                   297956     0.051   347.018
>> WriteX                   94698   425.636 15090.020
>> Unlink                   35077     0.580   208.462
>> UnlockX                    640     0.007     0.291
>> FIND_FIRST               66630     0.566   530.339
>> SET_FILE_INFORMATION     16000     1.419   811.494
>> QUERY_FILE_INFORMATION   30717     0.004     1.108
>> QUERY_PATH_INFORMATION  176153     0.182   517.419
>> QUERY_FS_INFORMATION     30857     0.018    18.562
>> NTCreateX               184145     0.281   582.076
>> 
>> So, with both bfq and deadline there seems to be a serious regression,
>> especially on MaxLat, when moving from legacy block to blk-mq.  The
>> regression is much worse with deadline, as legacy-deadline has the
>> lowest max latency among all the schedulers, whereas mq-deadline has
>> the highest one.
>> 
>> Regardless of the actual culprit of this regression, I would like to
>> investigate further this issue.  In this respect, I would like to ask
>> for a little help.  I would like to isolate the workloads generating
>> the highest latencies.  To this purpose, I had a look at the loadfile
>> client-tiny.txt, and I still have a doubt: is every item in the
>> loadfile executed somehow several times (for each value of the number
>> of clients), or is it executed only once?  More precisely, IIUC, for
>> each operation reported in the above results, there are several items
>> (lines) in the loadfile.  So, is each of these items executed only
>> once?
>> 
>> I'm asking because, if it is executed only once, then I guess I can
>> find the critical tasks more easily.  Finally, if it is actually
>> executed only once, is it expected that the latency for such a task is
>> one order of magnitude higher than that of the average latency for
>> that group of tasks?  I mean, is such a task intrinsically much
>> heavier, and then expectedly much longer, or is the fact that latency
>> is much higher for this task a sign that something in the kernel
>> misbehaves for that task?
>> 
>> While waiting for some feedback, I'm going to execute your test
>> showing great unfairness between writes and reads, and to also check
>> whether responsiveness does worsen if the write workload for that test
>> is being executed in the background.
>> 
>> Thanks,
>> Paolo
>> 
>>> ...
>>>> -- 
>>>> Mel Gorman
>>>> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-07 17:32             ` Paolo Valente
  (?)
  (?)
@ 2017-08-08 10:30             ` Mel Gorman
  2017-08-08 10:43               ` Ming Lei
  2017-08-08 17:16                 ` Paolo Valente
  -1 siblings, 2 replies; 40+ messages in thread
From: Mel Gorman @ 2017-08-08 10:30 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block

On Mon, Aug 07, 2017 at 07:32:41PM +0200, Paolo Valente wrote:
> >> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
> >> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
> >> ext4 as a filesystem. The same is not true for XFS so the filesystem
> >> matters.
> >> 
> > 
> > Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
> > soon as I can, thanks.
> > 
> > 
> 
> I've run this test and tried to further investigate this regression.
> For the moment, the gist seems to be that blk-mq plays an important
> role, not only with bfq (unless I'm considering the wrong numbers).
> Even if your main purpose in this thread was just to give a heads-up,
> I guess it may be useful to share what I have found out.  In addition,
> I want to ask for some help, to try to get closer to the possible
> causes of at least this regression.  If you think it would be better
> to open a new thread on this stuff, I'll do it.
> 

I don't think it's necessary unless Christoph or Jens object and I doubt
they will.

> First, I got mixed results on my system. 

For what it's worth, this is standard. In my experience, IO benchmarks
are always multi-modal, particularly on rotary storage. Cases of universal
win or universal loss for a scheduler or set of tuning are rare.

> I'll focus only on the
> case where mq-bfq-tput achieves its worst relative performance
> w.r.t. cfq, which happens with 64 clients.  Still, also in this case
> mq-bfq is better than cfq in all average values, but Flush.  I don't
> know which are the best/right values to look at, so, here's the final
> report for both schedulers:
> 

For what it's worth, it has often been observed that dbench overall
performance was dominated by flush costs. This is also true for the
standard reported throughput figures rather than the modified load file
elapsed time that mmtests reports. In dbench3 it was even worse where the
"performance" was dominated by whether the temporary files were deleted
before writeback started.
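
(For anyone reproducing these figures outside an automated grid: the
configs named in this thread are driven by the mmtests harness, roughly
as below; this is from memory of its README, the run name is arbitrary
and the exact flags may have changed.)

  # from a checkout of https://github.com/gormanm/mmtests
  ./run-mmtests.sh --no-monitor \
        --config configs/config-global-dhp__io-dbench4-fsync dbench4-fsync-test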

> CFQ
> 
>  Operation                Count    AvgLat    MaxLat
>  --------------------------------------------------
>  Flush                    13120    20.069   348.594
>  Close                   133696     0.008    14.642
>  LockX                      512     0.009     0.059
>  Rename                    7552     1.857   415.418
>  ReadX                   270720     0.141   535.632
>  WriteX                   89591   421.961  6363.271
>  Unlink                   34048     1.281   662.467
>  UnlockX                    512     0.007     0.057
>  FIND_FIRST               62016     0.086    25.060
>  SET_FILE_INFORMATION     15616     0.995   176.621
>  QUERY_FILE_INFORMATION   28734     0.004     1.372
>  QUERY_PATH_INFORMATION  170240     0.163   820.292
>  QUERY_FS_INFORMATION     28736     0.017     4.110
>  NTCreateX               178688     0.437   905.567
> 
> MQ-BFQ-TPUT
> 
> Operation                Count    AvgLat    MaxLat
>  --------------------------------------------------
>  Flush                    13504    75.828 11196.035
>  Close                   136896     0.004     3.855
>  LockX                      640     0.005     0.031
>  Rename                    8064     1.020   288.989
>  ReadX                   297600     0.081   685.850
>  WriteX                   93515   391.637 12681.517
>  Unlink                   34880     0.500   146.928
>  UnlockX                    640     0.004     0.032
>  FIND_FIRST               63680     0.045   222.491
>  SET_FILE_INFORMATION     16000     0.436   686.115
>  QUERY_FILE_INFORMATION   30464     0.003     0.773
>  QUERY_PATH_INFORMATION  175552     0.044   148.449
>  QUERY_FS_INFORMATION     29888     0.009     1.984
>  NTCreateX               183152     0.289   300.867
> 
> Are these results in line with yours for this test?
> 

Very broadly speaking yes, but it varies. On a small machine, the differences
in flush latency are visible but not as dramatic. It only has a few
CPUs. On a machine that tops out with 32 CPUs, it is more noticeable. On
the one machine I have that topped out with CFQ/BFQ at 64 threads, the
latency of flush is vaguely similar:

			CFQ			BFQ			BFQ-TPUT
latency	avg-Flush-64 	287.05	( 0.00%)	389.14	( -35.57%)	349.90	( -21.90%)
latency	avg-Close-64 	0.00	( 0.00%)	0.00	( -33.33%)	0.00	( 0.00%)
latency	avg-LockX-64 	0.01	( 0.00%)	0.01	( -16.67%)	0.01	( 0.00%)
latency	avg-Rename-64 	0.18	( 0.00%)	0.21	( -16.39%)	0.18	( 3.28%)
latency	avg-ReadX-64 	0.10	( 0.00%)	0.15	( -40.95%)	0.15	( -40.95%)
latency	avg-WriteX-64 	0.86	( 0.00%)	0.81	( 6.18%)	0.74	( 13.75%)
latency	avg-Unlink-64 	1.49	( 0.00%)	1.52	( -2.28%)	1.14	( 23.69%)
latency	avg-UnlockX-64 	0.00	( 0.00%)	0.00	( 0.00%)	0.00	( 0.00%)
latency	avg-NTCreateX-64 	0.26	( 0.00%)	0.30	( -16.15%)	0.21	( 19.62%)

So, different figures to yours but the general observation that flush
latency is higher holds.

> Anyway, to investigate this regression more in depth, I took two
> further steps.  First, I repeated the same test with bfq-sq, my
> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
> from the changes needed for bfq to live in blk-mq).  I got:
> 
> <SNIP>
> 
> So, with both bfq and deadline there seems to be a serious regression,
> especially on MaxLat, when moving from legacy block to blk-mq.  The
> regression is much worse with deadline, as legacy-deadline has the
> lowest max latency among all the schedulers, whereas mq-deadline has
> the highest one.
> 

I wouldn't worry too much about max latency simply because a large
outlier can be due to multiple factors and it will be variable.
However, I accept that deadline is not necessarily great either.

> Regardless of the actual culprit of this regression, I would like to
> investigate further this issue.  In this respect, I would like to ask
> for a little help.  I would like to isolate the workloads generating
> the highest latencies.  To this purpose, I had a look at the loadfile
> client-tiny.txt, and I still have a doubt: is every item in the
> loadfile executed somehow several times (for each value of the number
> of clients), or is it executed only once?  More precisely, IIUC, for
> each operation reported in the above results, there are several items
> (lines) in the loadfile.  So, is each of these items executed only
> once?
> 

The load file is executed multiple times. The normal loadfile was
basically just the same commands, or very similar commands, run multiple
times within a single load file. This made the workload too sensitive to
the exact time the workload finished and too coarse.
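
(Concretely, each client count is a separate dbench run over the same
loadfile, executed repeatedly until the time limit; roughly along these
lines, though the exact options the harness passes may differ and the
paths are only examples.)

  dbench -c /path/to/client-tiny.txt -D /mnt/testdisk -t 600 64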

> I'm asking because, if it is executed only once, then I guess I can
> find the critical tasks more easily.  Finally, if it is actually
> executed only once, is it expected that the latency for such a task is
> one order of magnitude higher than that of the average latency for
> that group of tasks?  I mean, is such a task intrinsically much
> heavier, and then expectedly much longer, or is the fact that latency
> is much higher for this task a sign that something in the kernel
> misbehaves for that task?
> 

I don't think it's quite as easily isolated. It's all the operations in
combination that replicate the behaviour. If it was just a single operation
like "fsync" then it would be fairly straight-forward but the full mix
is relevant as it matters when writeback kicks off, when merges happen,
how much dirty data was outstanding when writeback or sync started etc.

I see you've made other responses to the thread so rather than respond
individually 

o I've queued a subset of tests with Ming's v3 patchset as that was the
  latest branch at the time I looked. It'll take quite some time to execute
  as the grid I use to collect data is backlogged with other work

o I've included pgioperf this time because it is good at demonstrating
  oddities related to fsync. Granted it's mostly simulating a database
  workload that is typically recommended to use the deadline scheduler but I
  think it's still a useful demonstration 

o If you want a patch set queued that may improve workload pattern
  detection for dbench then I can add that to the grid with the caveat that
  results take time. It'll be a blind test as I'm not actively debugging
  IO-related problems right now.

o I'll keep an eye out for other workloads that demonstrate empirically
  better performance given that a stopwatch and desktop performance is
  tough to quantify even though I'm typically working in other areas. While
  I don't spend a lot of time on IO-related problems, it would still
  be preferred if switching to MQ by default was a safe option so I'm
  interested enough to keep it in mind.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-08 10:30             ` Mel Gorman
@ 2017-08-08 10:43               ` Ming Lei
  2017-08-08 11:27                 ` Mel Gorman
  2017-08-08 17:16                 ` Paolo Valente
  1 sibling, 1 reply; 40+ messages in thread
From: Ming Lei @ 2017-08-08 10:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Paolo Valente, Christoph Hellwig, Jens Axboe,
	Linux Kernel Mailing List, linux-block

Hi Mel Gorman,

On Tue, Aug 8, 2017 at 6:30 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
....
>
> o I've queued a subset of tests with Ming's v3 patchset as that was the
>   latest branch at the time I looked. It'll take quite some time to execute
>   as the grid I use to collect data is backlogged with other work

The latest patchset is in the following post:

      http://marc.info/?l=linux-block&m=150191624318513&w=2

And you can find it in my github:

      https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V4
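
(For anyone wanting to test it, the branch above can be checked out with
something like the following; the local branch name is arbitrary.)

      git fetch https://github.com/ming1/linux.git blk-mq-dispatch_for_scsi.V4
      git checkout -b blk-mq-dispatch-test FETCH_HEAD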

-- 
Ming Lei

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-08 10:43               ` Ming Lei
@ 2017-08-08 11:27                 ` Mel Gorman
  2017-08-08 11:49                   ` Ming Lei
  0 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2017-08-08 11:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: Paolo Valente, Christoph Hellwig, Jens Axboe,
	Linux Kernel Mailing List, linux-block

On Tue, Aug 08, 2017 at 06:43:03PM +0800, Ming Lei wrote:
> Hi Mel Gorman,
> 
> On Tue, Aug 8, 2017 at 6:30 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> ....
> >
> > o I've queued a subset of tests with Ming's v3 patchset as that was the
> >   latest branch at the time I looked. It'll take quite some time to execute
> >   as the grid I use to collect data is backlogged with other work
> 
> The latest patchset is in the following post:
> 
>       http://marc.info/?l=linux-block&m=150191624318513&w=2
> 
> And you can find it in my github:
> 
>       https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V4
> 

Unfortunately, the tests were queued last Friday and are partially complete
depending on when machines become available. As it is, v3 will take a few
days to complete and a requeue would incur further delays. If you believe
the results will be substantially different then I'll discard v3 and requeue.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-08 11:27                 ` Mel Gorman
@ 2017-08-08 11:49                   ` Ming Lei
  2017-08-08 11:55                     ` Mel Gorman
  0 siblings, 1 reply; 40+ messages in thread
From: Ming Lei @ 2017-08-08 11:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Paolo Valente, Christoph Hellwig, Jens Axboe,
	Linux Kernel Mailing List, linux-block

On Tue, Aug 8, 2017 at 7:27 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Tue, Aug 08, 2017 at 06:43:03PM +0800, Ming Lei wrote:
>> Hi Mel Gorman,
>>
>> On Tue, Aug 8, 2017 at 6:30 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
>> ....
>> >
>> > o I've queued a subset of tests with Ming's v3 patchset as that was the
>> >   latest branch at the time I looked. It'll take quite some time to execute
>> >   as the grid I use to collect data is backlogged with other work
>>
>> The latest patchset is in the following post:
>>
>>       http://marc.info/?l=linux-block&m=150191624318513&w=2
>>
>> And you can find it in my github:
>>
>>       https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V4
>>
>
> Unfortunately, the tests were queued last Friday and are partially complete
> depending on when machines become available. As it is, v3 will take a few
> days to complete and a requeue would incur further delays. If you believe
> the results will be substantially different then I'll discard v3 and requeue.

Firstly, V3 on github (never posted out) causes a boot hang if the CPU
core count is >= 16, so you need to check whether the test is still running :-(

Also, V3 on github may not perform well on IB SRP (or other low-latency
SCSI disks), so I improved bio merging in V4, which makes IB SRP's
performance better too; how much it helps depends on the device.

I suggest focusing on V2 as posted on the mailing list (V4 on github).

-- 
Ming Lei

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-08 11:49                   ` Ming Lei
@ 2017-08-08 11:55                     ` Mel Gorman
  0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2017-08-08 11:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Paolo Valente, Christoph Hellwig, Jens Axboe,
	Linux Kernel Mailing List, linux-block

On Tue, Aug 08, 2017 at 07:49:53PM +0800, Ming Lei wrote:
> On Tue, Aug 8, 2017 at 7:27 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> > On Tue, Aug 08, 2017 at 06:43:03PM +0800, Ming Lei wrote:
> >> Hi Mel Gorman,
> >>
> >> On Tue, Aug 8, 2017 at 6:30 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> >> ....
> >> >
> >> > o I've queued a subset of tests with Ming's v3 patchset as that was the
> >> >   latest branch at the time I looked. It'll take quite some time to execute
> >> >   as the grid I use to collect data is backlogged with other work
> >>
> >> The latest patchset is in the following post:
> >>
> >>       http://marc.info/?l=linux-block&m=150191624318513&w=2
> >>
> >> And you can find it in my github:
> >>
> >>       https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V4
> >>
> >
> > Unfortunately, the tests were queued last Friday and are partially complete
> > depending on when machines become available. As it is, v3 will take a few
> > days to complete and a requeue would incur further delays. If you believe
> > the results will be substantially different then I'll discard v3 and requeue.
> 
> Firstly V3 on github(never posted out) causes boot hang if CPU cores is >= 16,
> so you need to check if the test is still running, :-(
> 

By coincidence, the few machines that have completed had core counts
below this so I'll discard existing results and requeue.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
@ 2017-08-08 17:16                 ` Paolo Valente
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Valente @ 2017-08-08 17:16 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block


> On 08 Aug 2017, at 12:30, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> On Mon, Aug 07, 2017 at 07:32:41PM +0200, Paolo Valente wrote:
>>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>>>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>>>> matters.
>>>> 
>>> 
>>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
>>> soon as I can, thanks.
>>> 
>>> 
>> 
>> I've run this test and tried to further investigate this regression.
>> For the moment, the gist seems to be that blk-mq plays an important
>> role, not only with bfq (unless I'm considering the wrong numbers).
>> Even if your main purpose in this thread was just to give a heads-up,
>> I guess it may be useful to share what I have found out.  In addition,
>> I want to ask for some help, to try to get closer to the possible
>> causes of at least this regression.  If you think it would be better
>> to open a new thread on this stuff, I'll do it.
>> 
> 
> I don't think it's necessary unless Christoph or Jens object and I doubt
> they will.
> 
>> First, I got mixed results on my system. 
> 
> For what it's worth, this is standard. In my experience, IO benchmarks
> are always multi-modal, particularly on rotary storage. Cases of universal
> win or universal loss for a scheduler or set of tuning are rare.
> 
>> I'll focus only on the
>> case where mq-bfq-tput achieves its worst relative performance
>> w.r.t. cfq, which happens with 64 clients.  Still, also in this case
>> mq-bfq is better than cfq in all average values, but Flush.  I don't
>> know which are the best/right values to look at, so, here's the final
>> report for both schedulers:
>> 
> 
> For what it's worth, it has often been observed that dbench overall
> performance was dominated by flush costs. This is also true for the
> standard reported throughput figures rather than the modified load file
> elapsed time that mmtests reports. In dbench3 it was even worse where the
> "performance" was dominated by whether the temporary files were deleted
> before writeback started.
> 
>> CFQ
>> 
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    13120    20.069   348.594
>> Close                   133696     0.008    14.642
>> LockX                      512     0.009     0.059
>> Rename                    7552     1.857   415.418
>> ReadX                   270720     0.141   535.632
>> WriteX                   89591   421.961  6363.271
>> Unlink                   34048     1.281   662.467
>> UnlockX                    512     0.007     0.057
>> FIND_FIRST               62016     0.086    25.060
>> SET_FILE_INFORMATION     15616     0.995   176.621
>> QUERY_FILE_INFORMATION   28734     0.004     1.372
>> QUERY_PATH_INFORMATION  170240     0.163   820.292
>> QUERY_FS_INFORMATION     28736     0.017     4.110
>> NTCreateX               178688     0.437   905.567
>> 
>> MQ-BFQ-TPUT
>> 
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    13504    75.828 11196.035
>> Close                   136896     0.004     3.855
>> LockX                      640     0.005     0.031
>> Rename                    8064     1.020   288.989
>> ReadX                   297600     0.081   685.850
>> WriteX                   93515   391.637 12681.517
>> Unlink                   34880     0.500   146.928
>> UnlockX                    640     0.004     0.032
>> FIND_FIRST               63680     0.045   222.491
>> SET_FILE_INFORMATION     16000     0.436   686.115
>> QUERY_FILE_INFORMATION   30464     0.003     0.773
>> QUERY_PATH_INFORMATION  175552     0.044   148.449
>> QUERY_FS_INFORMATION     29888     0.009     1.984
>> NTCreateX               183152     0.289   300.867
>> 
>> Are these results in line with yours for this test?
>> 
> 
> Very broadly speaking yes, but it varies. On a small machine, the differences
> in flush latency are visible but not as dramatic. It only has a few
> CPUs. On a machine that tops out with 32 CPUs, it is more noticeable. On
> the one machine I have that topped out with CFQ/BFQ at 64 threads, the
> latency of flush is vaguely similar:
> 
> 			CFQ			BFQ			BFQ-TPUT
> latency	avg-Flush-64 	287.05	( 0.00%)	389.14	( -35.57%)	349.90	( -21.90%)
> latency	avg-Close-64 	0.00	( 0.00%)	0.00	( -33.33%)	0.00	( 0.00%)
> latency	avg-LockX-64 	0.01	( 0.00%)	0.01	( -16.67%)	0.01	( 0.00%)
> latency	avg-Rename-64 	0.18	( 0.00%)	0.21	( -16.39%)	0.18	( 3.28%)
> latency	avg-ReadX-64 	0.10	( 0.00%)	0.15	( -40.95%)	0.15	( -40.95%)
> latency	avg-WriteX-64 	0.86	( 0.00%)	0.81	( 6.18%)	0.74	( 13.75%)
> latency	avg-Unlink-64 	1.49	( 0.00%)	1.52	( -2.28%)	1.14	( 23.69%)
> latency	avg-UnlockX-64 	0.00	( 0.00%)	0.00	( 0.00%)	0.00	( 0.00%)
> latency	avg-NTCreateX-64 	0.26	( 0.00%)	0.30	( -16.15%)	0.21	( 19.62%)
> 
> So, different figures to yours but the general observation that flush
> latency is higher holds.
> 
>> Anyway, to investigate this regression more in depth, I took two
>> further steps.  First, I repeated the same test with bfq-sq, my
>> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
>> from the changes needed for bfq to live in blk-mq).  I got:
>> 
>> <SNIP>
>> 
>> So, with both bfq and deadline there seems to be a serious regression,
>> especially on MaxLat, when moving from legacy block to blk-mq.  The
>> regression is much worse with deadline, as legacy-deadline has the
>> lowest max latency among all the schedulers, whereas mq-deadline has
>> the highest one.
>> 
> 
> I wouldn't worry too much about max latency simply because a large
> outlier can be due to multiple factors and it will be variable.
> However, I accept that deadline is not necessarily great either.
> 
>> Regardless of the actual culprit of this regression, I would like to
>> investigate further this issue.  In this respect, I would like to ask
>> for a little help.  I would like to isolate the workloads generating
>> the highest latencies.  To this purpose, I had a look at the loadfile
>> client-tiny.txt, and I still have a doubt: is every item in the
>> loadfile executed somehow several times (for each value of the number
>> of clients), or is it executed only once?  More precisely, IIUC, for
>> each operation reported in the above results, there are several items
>> (lines) in the loadfile.  So, is each of these items executed only
>> once?
>> 
> 
> The load file is executed multiple times. The normal loadfile was
> basically just the same commands, or very similar commands, run multiple
> times within a single load file. This made the workload too sensitive to
> the exact time the workload finished and too coarse.
> 
>> I'm asking because, if it is executed only once, then I guess I can
> find the critical tasks more easily.  Finally, if it is actually
>> executed only once, is it expected that the latency for such a task is
>> one order of magnitude higher than that of the average latency for
>> that group of tasks?  I mean, is such a task intrinsically much
>> heavier, and then expectedly much longer, or is the fact that latency
>> is much higher for this task a sign that something in the kernel
>> misbehaves for that task?
>> 
> 
> I don't think it's quite as easily isolated. It's all the operations in
> combination that replicate the behaviour. If it was just a single operation
> like "fsync" then it would be fairly straight-forward but the full mix
> is relevant as it matters when writeback kicks off, when merges happen,
> how much dirty data was outstanding when writeback or sync started etc.
> 
> I see you've made other responses to the thread so rather than respond
> individually 
> 
> o I've queued a subset of tests with Ming's v3 patchset as that was the
>  latest branch at the time I looked. It'll take quite some time to execute
>  as the grid I use to collect data is backlogged with other work
> 
> o I've included pgioperf this time because it is good at demonstrating
>  oddities related to fsync. Granted it's mostly simulating a database
>  workload that is typically recommended to use the deadline scheduler but I
>  think it's still a useful demonstration 
> 
> o If you want a patch set queued that may improve workload pattern
>  detection for dbench then I can add that to the grid with the caveat that
>  results take time. It'll be a blind test as I'm not actively debugging
>  IO-related problems right now.
> 
> o I'll keep an eye out for other workloads that demonstrate empirically
>  better performance given that a stopwatch and desktop performance is
>  tough to quantify even though I'm typically working in other areas. While
>  I don't spend a lot of time on IO-related problems, it would still
>  be preferred if switching to MQ by default was a safe option so I'm
>  interested enough to keep it in mind.
> 

Hi Mel,
thanks for your thorough responses (I'm about to write something about
the read-write unfairness issue, with, again, some surprise).

I want to reply only to your last point above.  With our
responsiveness benchmark of course you don't need a stopwatch, but,
yes, to get some minimally comprehensive results you need a machine
with at least a desktop application like a terminal installed.
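
(A crude stand-in for that kind of measurement, not the actual
benchmark suite: generate background writes, drop caches so the start
is cold, and time how long an application takes to come up; xterm and
the target path are only examples.)

  dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=16384 conv=fdatasync &
  sync
  echo 3 > /proc/sys/vm/drop_caches
  time xterm -e true
  wait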

Thanks,
Paolo

> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-08  8:06                 ` Paolo Valente
@ 2017-08-08 17:33                   ` Paolo Valente
  -1 siblings, 0 replies; 40+ messages in thread
From: Paolo Valente @ 2017-08-08 17:33 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block


> On 08 Aug 2017, at 10:06, Paolo Valente <paolo.valente@linaro.org> wrote:
> 
>> 
>> On 07 Aug 2017, at 20:42, Paolo Valente <paolo.valente@linaro.org> wrote:
>> 
>>> 
>>> On 07 Aug 2017, at 19:32, Paolo Valente <paolo.valente@linaro.org> wrote:
>>> 
>>>> 
>>>> On 05 Aug 2017, at 00:05, Paolo Valente <paolo.valente@linaro.org> wrote:
>>>> 
>>>>> 
>>>>> On 04 Aug 2017, at 13:01, Mel Gorman <mgorman@techsingularity.net> wrote:
>>>>> 
>>>>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>>>>>> I took that into account. BFQ with low-latency was also tested and the
>>>>>>> impact was not a universal improvement although it can be a noticeable
>>>>>>> improvement. From the same machine:
>>>>>>> 
>>>>>>> dbench4 Loadfile Execution Time
>>>>>>>                         4.12.0                 4.12.0                 4.12.0
>>>>>>>                     legacy-cfq                 mq-bfq            mq-bfq-tput
>>>>>>> Amean     1        80.67 (   0.00%)       83.68 (  -3.74%)       84.70 (  -5.00%)
>>>>>>> Amean     2        92.87 (   0.00%)      121.63 ( -30.96%)       88.74 (   4.45%)
>>>>>>> Amean     4       102.72 (   0.00%)      474.33 (-361.77%)      113.97 ( -10.95%)
>>>>>>> Amean     32     2543.93 (   0.00%)     1927.65 (  24.23%)     2038.74 (  19.86%)
>>>>>>> 
>>>>>> 
>>>>>> Thanks for trying with low_latency disabled.  If I read the numbers
>>>>>> correctly, we move from a worst case of 361% higher execution time to
>>>>>> a worst case of 11%.  With a best case of 20% lower execution time.
>>>>>> 
>>>>> 
>>>>> Yes.
>>>>> 
>>>>>> I asked you about none and mq-deadline in a previous email, because
>>>>>> actually we have a double change here: change of the I/O stack, and
>>>>>> change of the scheduler, with the first change probably not irrelevant
>>>>>> with respect to the second one.
>>>>>> 
>>>>> 
>>>>> True. However, the difference between legacy-deadline and mq-deadline is
>>>>> roughly around the 5-10% mark across workloads for SSD. It's not
>>>>> universally true but the impact is not as severe. While this is not
>>>>> proof that the stack change is the sole root cause, it makes it less
>>>>> likely.
>>>>> 
>>>> 
>>>> I'm getting a little lost here.  If I'm not mistaken, you are saying,
>>>> since the difference between two virtually identical schedulers
>>>> (legacy-deadline and mq-deadline) is only around 5-10%, while the
>>>> difference between cfq and mq-bfq-tput is higher, then in the latter
>>>> case it is not the stack's fault.  Yet the loss of mq-bfq-tput in the
>>>> above test is exactly in the 5-10% range?  What am I missing?  Other
>>>> tests with mq-bfq-tput not yet reported?
>>>> 
>>>>>> By chance, according to what you have measured so far, is there any
>>>>>> test where, instead, you expect or have seen bfq-mq-tput to always
>>>>>> lose?  I could start from there.
>>>>>> 
>>>>> 
>>>>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
>>>>> it could be the stack change.
>>>>> 
>>>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>>>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>>>>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>>>>> matters.
>>>>> 
>>>> 
>>>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
>>>> soon as I can, thanks.
>>>> 
>>>> 
>>> 
>>> I've run this test and tried to further investigate this regression.
>>> For the moment, the gist seems to be that blk-mq plays an important
>>> role, not only with bfq (unless I'm considering the wrong numbers).
>>> Even if your main purpose in this thread was just to give a heads-up,
>>> I guess it may be useful to share what I have found out.  In addition,
>>> I want to ask for some help, to try to get closer to the possible
>>> causes of at least this regression.  If you think it would be better
>>> to open a new thread on this stuff, I'll do it.
>>> 
>>> First, I got mixed results on my system.  I'll focus only on the
>>> case where mq-bfq-tput achieves its worst relative performance
>>> w.r.t. cfq, which happens with 64 clients.  Still, also in this case
>>> mq-bfq is better than cfq in all average values, but Flush.  I don't
>>> know which are the best/right values to look at, so, here's the final
>>> report for both schedulers:
>>> 
>>> CFQ
>>> 
>>> Operation                Count    AvgLat    MaxLat
>>> --------------------------------------------------
>>> Flush                    13120    20.069   348.594
>>> Close                   133696     0.008    14.642
>>> LockX                      512     0.009     0.059
>>> Rename                    7552     1.857   415.418
>>> ReadX                   270720     0.141   535.632
>>> WriteX                   89591   421.961  6363.271
>>> Unlink                   34048     1.281   662.467
>>> UnlockX                    512     0.007     0.057
>>> FIND_FIRST               62016     0.086    25.060
>>> SET_FILE_INFORMATION     15616     0.995   176.621
>>> QUERY_FILE_INFORMATION   28734     0.004     1.372
>>> QUERY_PATH_INFORMATION  170240     0.163   820.292
>>> QUERY_FS_INFORMATION     28736     0.017     4.110
>>> NTCreateX               178688     0.437   905.567
>>> 
>>> MQ-BFQ-TPUT
>>> 
>>> Operation                Count    AvgLat    MaxLat
>>> --------------------------------------------------
>>> Flush                    13504    75.828 11196.035
>>> Close                   136896     0.004     3.855
>>> LockX                      640     0.005     0.031
>>> Rename                    8064     1.020   288.989
>>> ReadX                   297600     0.081   685.850
>>> WriteX                   93515   391.637 12681.517
>>> Unlink                   34880     0.500   146.928
>>> UnlockX                    640     0.004     0.032
>>> FIND_FIRST               63680     0.045   222.491
>>> SET_FILE_INFORMATION     16000     0.436   686.115
>>> QUERY_FILE_INFORMATION   30464     0.003     0.773
>>> QUERY_PATH_INFORMATION  175552     0.044   148.449
>>> QUERY_FS_INFORMATION     29888     0.009     1.984
>>> NTCreateX               183152     0.289   300.867
>>> 
>>> Are these results in line with yours for this test?
>>> 
>>> Anyway, to investigate this regression more in depth, I took two
>>> further steps.  First, I repeated the same test with bfq-sq, my
>>> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
>>> from the changes needed for bfq to live in blk-mq).  I got:
>>> 
>>> BFQ-SQ-TPUT
>>> 
>>> Operation                Count    AvgLat    MaxLat
>>> --------------------------------------------------
>>> Flush                    12618    30.212   484.099
>>> Close                   123884     0.008    10.477
>>> LockX                      512     0.010     0.170
>>> Rename                    7296     2.032   426.409
>>> ReadX                   262179     0.251   985.478
>>> WriteX                   84072   461.398  7283.003
>>> Unlink                   33076     1.685   848.734
>>> UnlockX                    512     0.007     0.036
>>> FIND_FIRST               58690     0.096   220.720
>>> SET_FILE_INFORMATION     14976     1.792   466.435
>>> QUERY_FILE_INFORMATION   26575     0.004     2.194
>>> QUERY_PATH_INFORMATION  158125     0.112   614.063
>>> QUERY_FS_INFORMATION     28224     0.017     1.385
>>> NTCreateX               167877     0.827   945.644
>>> 
>>> So, the worst-case regression is now around 15%.  This made me suspect
>>> that blk-mq influences results a lot for this test.  To crosscheck, I
>>> compared legacy-deadline and mq-deadline too.
>>> 
>> 
>> Ok, found the problem for the 15% loss in bfq-sq.  bfq-sq gets
>> occasionally confused by the workload, and grants device idling to
>> processes that, for this specific workload, would be better to
>> de-schedule immediately.  If we set slice_idle to 0, then bfq-sq
>> becomes more or less equivalent to cfq (for some operations apparently
>> even much better):
>> 
>> bfq-sq-tput-0idle
>> 
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    13013    17.888   280.517
>> Close                   133004     0.008    20.698
>> LockX                      512     0.008     0.088
>> Rename                    7427     2.041   193.232
>> ReadX                   270534     0.138   408.534
>> WriteX                   88598   429.615  6272.212
>> Unlink                   33734     1.205   559.152
>> UnlockX                    512     0.011     1.808
>> FIND_FIRST               61762     0.087    23.012
>> SET_FILE_INFORMATION     15337     1.322   220.155
>> QUERY_FILE_INFORMATION   28415     0.004     0.559
>> QUERY_PATH_INFORMATION  169423     0.150   580.570
>> QUERY_FS_INFORMATION     28547     0.019    24.466
>> NTCreateX               177618     0.544   681.795
>>
>> I'll try soon with mq-bfq too, for which I expect however a deeper
>> investigation to be needed.
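
For reference, the slice_idle experiment quoted above is just a sysfs
tweak; a minimal sketch, assuming the disk is /dev/sda and the scheduler
is exposed under the name bfq (the out-of-tree legacy build may use a
different name), would be:

    # select the scheduler for the device (name depends on the build)
    echo bfq > /sys/block/sda/queue/scheduler
    # disable device idling for the throughput test
    echo 0 > /sys/block/sda/queue/iosched/slice_idle
    # restore the usual default afterwards
    echo 8 > /sys/block/sda/queue/iosched/slice_idle
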
>>
>
> Hi,
> to test mq-bfq (with both slice_idle==0 and slice_idle>0), I have also
> applied Ming patches, and Ah, victory!
>
> Regardless of the value of slice idle:
>
> mq-bfq-tput
>
> Operation                Count    AvgLat    MaxLat
> --------------------------------------------------
> Flush                    13183    70.381  1025.407
> Close                   134539     0.004     1.011
> LockX                      512     0.005     0.025
> Rename                    7721     0.740   404.979
> ReadX                   274422     0.126   873.364
> WriteX                   90535   408.371  7400.585
> Unlink                   34276     0.634   581.067
> UnlockX                    512     0.003     0.029
> FIND_FIRST               62664     0.052   321.027
> SET_FILE_INFORMATION     15981     0.234   124.739
> QUERY_FILE_INFORMATION   29042     0.003     1.731
> QUERY_PATH_INFORMATION  171769     0.032   522.415
> QUERY_FS_INFORMATION     28958     0.009     3.043
> NTCreateX               179643     0.298   687.466
>
> Throughput 9.11183 MB/sec  64 clients  64 procs  max_latency=7400.588 ms
>
> Differently from bfq-sq, setting slice_idle to 0 doesn't provide any
> benefit, which lets me suspect that there is some other issue in
> blk-mq (only a suspect).  I think I may have already understood how to
> guarantee that bfq almost never idles the device uselessly also for
> this workload.  Yet, since in blk-mq there is no gain even after
> excluding useless idling, I'll wait for at least Ming's patches to be
> merged before possibly proposing this contribution.  Maybe some other
> little issue related to this lack of gain in blk-mq will be found and
> solved in the meantime.
>
> Moving to the read-write unfairness problem.
>

I've reproduced the unfairness issue (rand reader throttled by heavy
writers) with bfq, using
configs/config-global-dhp__io-fio-randread-sync-heavywrite, but with
an important side problem: cfq suffers from exactly the same
unfairness (785kB/s writers, 13.4kB/s reader).  Of course, this
happens in my system, with a HITACHI HTS727550A9E364.
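
For anyone wanting to reproduce this, the config named above is an
mmtests configuration; a typical invocation (the run name below is only
a placeholder, and details of the local setup are left out) looks
roughly like:

    # from an mmtests checkout, with the target filesystem prepared as
    # the config expects
    ./run-mmtests.sh --config \
        configs/config-global-dhp__io-fio-randread-sync-heavywrite \
        bfq-heavywrite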

This discrepancy with your results makes it a little bit harder for me
to understand how to better proceed, as I see no regression.  Anyway,
since this reader-throttling issue seems relevant, I have investigated
it a little more in depth.  The cause of the throttling is that the
fdatasync frequently performed by the writers in this test turns the
I/O of the writers into a 100% sync I/O.  And neither bfq nor cfq
differentiates bandwidth between sync reads and sync writes.  Basically
both cfq and bfq are willing to dispatch the I/O requests of each
writer for a time slot equal to that devoted to the reader.  But write
requests, after reaching the device, use the latter for much more time
than reads.  This delays the completion of the requests of the reader,
and, being the I/O sync, the issuing of the next I/O requests by the
reader.  The final result is that the device spends most of the time
serving write requests, while the reader issues its read requests very
slowly.
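
The effect should be reproducible even outside mmtests with a
hand-rolled fio job; here is a minimal sketch (directory, sizes, block
size and job count are arbitrary choices, not the values used by the
mmtests config):

    # a few buffered writers, each fdatasync-ing after every write,
    # against one sync random reader that keeps a single request in flight
    fio --directory=/mnt/test --size=1G --bs=4k --ioengine=psync \
        --runtime=60 --time_based \
        --name=writers --rw=write --fdatasync=1 --numjobs=4 \
        --name=reader --rw=randread --numjobs=1

On a rotational disk this should show the pattern described above: the
writers keep the device busy with sync writes while the reader's
bandwidth collapses to a few kB/s.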

It might not be so difficult to balance this unfairness, although I'm
a little worried about changing bfq without being able to see the
regression you report.  In case I give it a try, could I then count on
some testing on your machines?

Thanks,
Paolo

> Thanks,
> Paolo
>
>> Thanks,
>> Paolo
>>
>>> LEGACY-DEADLINE
>>>
>>> Operation                Count    AvgLat    MaxLat
>>> --------------------------------------------------
>>> Flush                    13267     9.622   298.206
>>> Close                   135692     0.007    10.627
>>> LockX                      640     0.008     0.066
>>> Rename                    7827     0.544   481.123
>>> ReadX                   285929     0.220  2698.442
>>> WriteX                   92309   430.867  5191.608
>>> Unlink                   34534     1.133   619.235
>>> UnlockX                    640     0.008     0.724
>>> FIND_FIRST               63289     0.086    56.851
>>> SET_FILE_INFORMATION     16000     1.254   844.065
>>> QUERY_FILE_INFORMATION   29883     0.004     0.618
>>> QUERY_PATH_INFORMATION  173232     0.089  1295.651
>>> QUERY_FS_INFORMATION     29632     0.017     4.813
>>> NTCreateX               181464     0.479  2214.343
>>>
>>>
>>> MQ-DEADLINE
>>>
>>> Operation                Count    AvgLat    MaxLat
>>> --------------------------------------------------
>>> Flush                    13760    90.542 13221.495
>>> Close                   137654     0.008    27.133
>>> LockX                      640     0.009     0.115
>>> Rename                    8064     1.062   246.759
>>> ReadX                   297956     0.051   347.018
>>> WriteX                   94698   425.636 15090.020
>>> Unlink                   35077     0.580   208.462
>>> UnlockX                    640     0.007     0.291
>>> FIND_FIRST               66630     0.566   530.339
>>> SET_FILE_INFORMATION     16000     1.419   811.494
>>> QUERY_FILE_INFORMATION   30717     0.004     1.108
>>> QUERY_PATH_INFORMATION  176153     0.182   517.419
>>> QUERY_FS_INFORMATION     30857     0.018    18.562
>>> NTCreateX               184145     0.281   582.076
>>>
>>> So, with both bfq and deadline there seems to be a serious regression,
>>> especially on MaxLat, when moving from legacy block to blk-mq.  The
>>> regression is much worse with deadline, as legacy-deadline has the
>>> lowest max latency among all the schedulers, whereas mq-deadline has
>>> the highest one.
>>>
>>> Regardless of the actual culprit of this regression, I would like to
>>> investigate further this issue.  In this respect, I would like to ask
>>> for a little help.  I would like to isolate the workloads generating
>>> the highest latencies.  To this purpose, I had a look at the loadfile
>>> client-tiny.txt, and I still have a doubt: is every item in the
>>> loadfile executed somehow several times (for each value of the number
>>> of clients), or is it executed only once?  More precisely, IIUC, for
>>> each operation reported in the above results, there are several items
>>> (lines) in the loadfile.  So, is each of these items executed only
>>> once?
>>>
>>> I'm asking because, if it is executed only once, then I guess I can
>>> find the critical tasks more easily.  Finally, if it is actually
>>> executed only once, is it expected that the latency for such a task is
>>> one order of magnitude higher than that of the average latency for
>>> that group of tasks?  I mean, is such a task intrinsically much
>>> heavier, and then expectedly much longer, or is the fact that latency
>>> is much higher for this task a sign that something in the kernel
>>> misbehaves for that task?
>>>
>>> While waiting for some feedback, I'm going to execute your test
>>> showing great unfairness between writes and reads, and to also check
>>> whether responsiveness does worsen if the write workload for that test
>>> is being executed in the background.
>>>
>>> Thanks,
>>> Paolo
>>>
>>>> ...
>>>>> -- 
>>>>> Mel Gorman
>>>>> SUSE Labs
>>>>> Mel Gorman
>>>>> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-08 17:33                   ` Paolo Valente
  (?)
@ 2017-08-08 18:27                   ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2017-08-08 18:27 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block

On Tue, Aug 08, 2017 at 07:33:37PM +0200, Paolo Valente wrote:
> > Differently from bfq-sq, setting slice_idle to 0 doesn't provide any
> > benefit, which lets me suspect that there is some other issue in
> > blk-mq (only a suspect).  I think I may have already understood how to
> > guarantee that bfq almost never idles the device uselessly also for
> > this workload.  Yet, since in blk-mq there is no gain even after
> > excluding useless idling, I'll wait for at least Ming's patches to be
> > merged before possibly proposing this contribution.  Maybe some other
> > little issue related to this lack of gain in blk-mq will be found and
> > solved in the meantime.
> > 
> > Moving to the read-write unfairness problem.
> > 
> 
> I've reproduced the unfairness issue (rand reader throttled by heavy
> writers) with bfq, using
> configs/config-global-dhp__io-fio-randread-sync-heavywrite, but with
> an important side problem: cfq suffers from exactly the same
> unfairness (785kB/s writers, 13.4kB/s reader).  Of course, this
> happens in my system, with a HITACHI HTS727550A9E364.
> 

It's interesting that CFQ suffers the same on your system. It's possible
that this is down to luck and the results depend not only on the disk but
the number of CPUs. At absolute minimum we saw different latency figures
from dbench even if the only observation is "different machines behave
differently, news at 11". If the results are inconsistent, then the value of
the benchmark can be dropped as a basis of comparison between IO schedulers
(although I'll be keeping it for detecting regressions between releases).

When the v4 results from Ming's patches complete, I'll double check the
results from this config.

> This discrepancy with your results makes it a little bit harder for me
> to understand how to better proceed, as I see no regression.  Anyway,
> since this reader-throttling issue seems relevant, I have investigated
> it a little more in depth.  The cause of the throttling is that the
> fdatasync frequently performed by the writers in this test turns the
> I/O of the writers into a 100% sync I/O.  And neither bfq nor cfq
> differentiates bandwidth between sync reads and sync writes.  Basically
> both cfq and bfq are willing to dispatch the I/O requests of each
> writer for a time slot equal to that devoted to the reader.  But write
> requests, after reaching the device, use the latter for much more time
> than reads.  This delays the completion of the requests of the reader,
> and, being the I/O sync, the issuing of the next I/O requests by the
> reader.  The final result is that the device spends most of the time
> serving write requests, while the reader issues its read requests very
> slowly.
> 

That is certainly plausible and implies that the actual results depend
too heavily on random timing factors and disk model to be really useful.

> It might not be so difficult to balance this unfairness, although I'm
> a little worried about changing bfq without being able to see the
> regression you report.  In case I give it a try, could I then count on
> some testing on your machines?
> 

Yes with the caveat that results take a variable amount of time depending
on how many problems I'm juggling in the air and how many of them are
occupying time on the machines.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
@ 2017-08-09 21:49                     ` Paolo Valente
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Valente @ 2017-08-09 21:49 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block


> Il giorno 08 ago 2017, alle ore 19:33, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
>> 
>> Il giorno 08 ago 2017, alle ore 10:06, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>> 
>>> 
>>> Il giorno 07 ago 2017, alle ore 20:42, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>> 
>>>> 
>>>> Il giorno 07 ago 2017, alle ore 19:32, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>>> 
>>>>> 
>>>>> Il giorno 05 ago 2017, alle ore 00:05, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>>>> 
>>>>>> 
>>>>>> Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <mgorman@techsingularity.net> ha scritto:
>>>>>> 
>>>>>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>>>>>>> I took that into account BFQ with low-latency was also tested and the
>>>>>>>> impact was not a universal improvement although it can be a noticable
>>>>>>>> improvement. From the same machine;
>>>>>>>> 
>>>>>>>> dbench4 Loadfile Execution Time
>>>>>>>>                       4.12.0                 4.12.0                 4.12.0
>>>>>>>>                   legacy-cfq                 mq-bfq            mq-bfq-tput
>>>>>>>> Amean     1        80.67 (   0.00%)       83.68 (  -3.74%)       84.70 (  -5.00%)
>>>>>>>> Amean     2        92.87 (   0.00%)      121.63 ( -30.96%)       88.74 (   4.45%)
>>>>>>>> Amean     4       102.72 (   0.00%)      474.33 (-361.77%)      113.97 ( -10.95%)
>>>>>>>> Amean     32     2543.93 (   0.00%)     1927.65 (  24.23%)     2038.74 (  19.86%)
>>>>>>>> 
>>>>>>> 
>>>>>>> Thanks for trying with low_latency disabled.  If I read numbers
>>>>>>> correctly, we move from a worst case of 361% higher execution time to
>>>>>>> a worst case of 11%.  With a best case of 20% of lower execution time.
>>>>>>> 
>>>>>> 
>>>>>> Yes.
>>>>>> 
>>>>>>> I asked you about none and mq-deadline in a previous email, because
>>>>>>> actually we have a double change here: change of the I/O stack, and
>>>>>>> change of the scheduler, with the first change probably not irrelevant
>>>>>>> with respect to the second one.
>>>>>>> 
>>>>>> 
>>>>>> True. However, the difference between legacy-deadline mq-deadline is
>>>>>> roughly around the 5-10% mark across workloads for SSD. It's not
>>>>>> universally true but the impact is not as severe. While this is not
>>>>>> proof that the stack change is the sole root cause, it makes it less
>>>>>> likely.
>>>>>> 
>>>>> 
>>>>> I'm getting a little lost here.  If I'm not mistaken, you are saying,
>>>>> since the difference between two virtually identical schedulers
>>>>> (legacy-deadline and mq-deadline) is only around 5-10%, while the
>>>>> difference between cfq and mq-bfq-tput is higher, then in the latter
>>>>> case it is not the stack's fault.  Yet the loss of mq-bfq-tput in the
>>>>> above test is exactly in the 5-10% range?  What am I missing?  Other
>>>>> tests with mq-bfq-tput not yet reported?
>>>>> 
>>>>>>> By chance, according to what you have measured so far, is there any
>>>>>>> test where, instead, you expect or have seen bfq-mq-tput to always
>>>>>>> lose?  I could start from there.
>>>>>>> 
>>>>>> 
>>>>>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
>>>>>> it could be the stack change.
>>>>>> 
>>>>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>>>>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>>>>>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>>>>>> matters.
>>>>>> 
>>>>> 
>>>>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
>>>>> soon as I can, thanks.
>>>>> 
>>>>> 
>>>> 
>>>> I've run this test and tried to further investigate this regression.
>>>> For the moment, the gist seems to be that blk-mq plays an important
>>>> role, not only with bfq (unless I'm considering the wrong numbers).
>>>> Even if your main purpose in this thread was just to give a heads-up,
>>>> I guess it may be useful to share what I have found out.  In addition,
>>>> I want to ask for some help, to try to get closer to the possible
>>>> causes of at least this regression.  If you think it would be better
>>>> to open a new thread on this stuff, I'll do it.
>>>> 
>>>> First, I got mixed results on my system.  I'll focus only on the the
>>>> case where mq-bfq-tput achieves its worst relative performance w.r.t.
>>>> to cfq, which happens with 64 clients.  Still, also in this case
>>>> mq-bfq is better than cfq in all average values, but Flush.  I don't
>>>> know which are the best/right values to look at, so, here's the final
>>>> report for both schedulers:
>>>> 
>>>> CFQ
>>>> 
>>>> Operation                Count    AvgLat    MaxLat
>>>> --------------------------------------------------
>>>> Flush                    13120    20.069   348.594
>>>> Close                   133696     0.008    14.642
>>>> LockX                      512     0.009     0.059
>>>> Rename                    7552     1.857   415.418
>>>> ReadX                   270720     0.141   535.632
>>>> WriteX                   89591   421.961  6363.271
>>>> Unlink                   34048     1.281   662.467
>>>> UnlockX                    512     0.007     0.057
>>>> FIND_FIRST               62016     0.086    25.060
>>>> SET_FILE_INFORMATION     15616     0.995   176.621
>>>> QUERY_FILE_INFORMATION   28734     0.004     1.372
>>>> QUERY_PATH_INFORMATION  170240     0.163   820.292
>>>> QUERY_FS_INFORMATION     28736     0.017     4.110
>>>> NTCreateX               178688     0.437   905.567
>>>> 
>>>> MQ-BFQ-TPUT
>>>> 
>>>> Operation                Count    AvgLat    MaxLat
>>>> --------------------------------------------------
>>>> Flush                    13504    75.828 11196.035
>>>> Close                   136896     0.004     3.855
>>>> LockX                      640     0.005     0.031
>>>> Rename                    8064     1.020   288.989
>>>> ReadX                   297600     0.081   685.850
>>>> WriteX                   93515   391.637 12681.517
>>>> Unlink                   34880     0.500   146.928
>>>> UnlockX                    640     0.004     0.032
>>>> FIND_FIRST               63680     0.045   222.491
>>>> SET_FILE_INFORMATION     16000     0.436   686.115
>>>> QUERY_FILE_INFORMATION   30464     0.003     0.773
>>>> QUERY_PATH_INFORMATION  175552     0.044   148.449
>>>> QUERY_FS_INFORMATION     29888     0.009     1.984
>>>> NTCreateX               183152     0.289   300.867
>>>> 
>>>> Are these results in line with yours for this test?
>>>> 
>>>> Anyway, to investigate this regression more in depth, I took two
>>>> further steps.  First, I repeated the same test with bfq-sq, my
>>>> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
>>>> from the changes needed for bfq to live in blk-mq).  I got:
>>>> 
>>>> BFQ-SQ-TPUT
>>>> 
>>>> Operation                Count    AvgLat    MaxLat
>>>> --------------------------------------------------
>>>> Flush                    12618    30.212   484.099
>>>> Close                   123884     0.008    10.477
>>>> LockX                      512     0.010     0.170
>>>> Rename                    7296     2.032   426.409
>>>> ReadX                   262179     0.251   985.478
>>>> WriteX                   84072   461.398  7283.003
>>>> Unlink                   33076     1.685   848.734
>>>> UnlockX                    512     0.007     0.036
>>>> FIND_FIRST               58690     0.096   220.720
>>>> SET_FILE_INFORMATION     14976     1.792   466.435
>>>> QUERY_FILE_INFORMATION   26575     0.004     2.194
>>>> QUERY_PATH_INFORMATION  158125     0.112   614.063
>>>> QUERY_FS_INFORMATION     28224     0.017     1.385
>>>> NTCreateX               167877     0.827   945.644
>>>> 
>>>> So, the worst-case regression is now around 15%.  This made me suspect
>>>> that blk-mq influences results a lot for this test.  To crosscheck, I
>>>> compared legacy-deadline and mq-deadline too.
>>>> 
>>> 
>>> Ok, found the problem for the 15% loss in bfq-sq.  bfq-sq gets
>>> occasionally confused by the workload, and grants device idling to
>>> processes that, for this specific workload, would be better to
>>> de-schedule immediately.  If we set slice_idle to 0, then bfq-sq
>>> becomes more or less equivalent to cfq (for some operations apparently
>>> even much better):
>>> 
>>> bfq-sq-tput-0idle
>>> 
>>> Operation                Count    AvgLat    MaxLat
>>> --------------------------------------------------
>>> Flush                    13013    17.888   280.517
>>> Close                   133004     0.008    20.698
>>> LockX                      512     0.008     0.088
>>> Rename                    7427     2.041   193.232
>>> ReadX                   270534     0.138   408.534
>>> WriteX                   88598   429.615  6272.212
>>> Unlink                   33734     1.205   559.152
>>> UnlockX                    512     0.011     1.808
>>> FIND_FIRST               61762     0.087    23.012
>>> SET_FILE_INFORMATION     15337     1.322   220.155
>>> QUERY_FILE_INFORMATION   28415     0.004     0.559
>>> QUERY_PATH_INFORMATION  169423     0.150   580.570
>>> QUERY_FS_INFORMATION     28547     0.019    24.466
>>> NTCreateX               177618     0.544   681.795
>>> 
>>> I'll try soon with mq-bfq too, for which I expect however a deeper
>>> investigation to be needed.
>>> 
>> 
>> Hi,
>> to test mq-bfq (with both slice_idle==0 and slice_idle>0), I have also
>> applied Ming patches, and Ah, victory!
>> 
>> Regardless of the value of slice idle:
>> 
>> mq-bfq-tput
>> 
>> Operation                Count    AvgLat    MaxLat
>> --------------------------------------------------
>> Flush                    13183    70.381  1025.407
>> Close                   134539     0.004     1.011
>> LockX                      512     0.005     0.025
>> Rename                    7721     0.740   404.979
>> ReadX                   274422     0.126   873.364
>> WriteX                   90535   408.371  7400.585
>> Unlink                   34276     0.634   581.067
>> UnlockX                    512     0.003     0.029
>> FIND_FIRST               62664     0.052   321.027
>> SET_FILE_INFORMATION     15981     0.234   124.739
>> QUERY_FILE_INFORMATION   29042     0.003     1.731
>> QUERY_PATH_INFORMATION  171769     0.032   522.415
>> QUERY_FS_INFORMATION     28958     0.009     3.043
>> NTCreateX               179643     0.298   687.466
>> 
>> Throughput 9.11183 MB/sec  64 clients  64 procs  max_latency=7400.588 ms
>> 
>> Unlike with bfq-sq, setting slice_idle to 0 doesn't provide any
>> benefit, which makes me suspect that there is some other issue in
>> blk-mq (only a suspicion).  I think I may have already understood how
>> to guarantee that bfq almost never idles the device uselessly for this
>> workload too.  Yet, since in blk-mq there is no gain even after
>> excluding useless idling, I'll wait for at least Ming's patches to be
>> merged before possibly proposing this contribution.  Maybe some other
>> little issue related to this lack of gain in blk-mq will be found and
>> solved in the meantime.
>> 
>> Moving to the read-write unfairness problem.
>> 
> 
> I've reproduced the unfairness issue (rand reader throttled by heavy
> writers) with bfq, using
> configs/config-global-dhp__io-fio-randread-sync-heavywrite, but with
> an important side problem: cfq suffers from exactly the same
> unfairness (785kB/s writers, 13.4kB/s reader).  Of course, this
> happens on my system, with a HITACHI HTS727550A9E364.
> 
> This discrepancy with your results makes it a little bit harder for me
> to decide how best to proceed, as I see no regression.  Anyway, since
> this reader-throttling issue seems relevant, I have investigated it a
> little more in depth.  The cause of the throttling is that the
> fdatasync frequently performed by the writers in this test turns the
> I/O of the writers into 100% sync I/O.  And neither bfq nor cfq
> differentiates bandwidth between sync reads and sync writes.  Basically
> both cfq and bfq are willing to dispatch the I/O requests of each
> writer for a time slot equal to that devoted to the reader.  But write
> requests, after reaching the device, occupy it for much more time than
> reads do.  This delays the completion of the reader's requests and,
> since the I/O is sync, also the issuing of the reader's next requests.
> The final result is that the device spends most of the time serving
> write requests, while the reader issues its read requests very slowly.
> 
> It might not be so difficult to balance this unfairness, although I'm
> a little worried about changing bfq without being able to see the
> regression you report.  If I give it a try, could I then count on
> some testing on your machines?
> 

Hi Mel,
I've investigated this test case a little bit more, and the outcome is
unfortunately rather drastic, unless I'm missing some important point.
It is impossible to control the rate of the reader with the exact
configuration of this test.  In fact, since iodepth is equal to 1, the
reader issues one I/O request at a time.  When one such request is
dispatched, after some write requests have already been dispatched
(and then queued in the device), the time to serve the request is
controlled only by the device.  The longer the device makes the read
request wait before being served, the later the reader will see the
completion of its request, and then the later the reader will issue a
new request, and so on.  So, for this test, it is mainly the device
controller that decides the rate of the reader.
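
Just to make the scenario concrete, this is, roughly, the kind of fio
workload I am referring to (a minimal sketch of mine: path, sizes and
number of writers are only assumptions, not the exact parameters of
your configuration):

  # four buffered writers that fdatasync after every write, so their
  # I/O becomes effectively 100% sync, plus one sync random reader
  # that, with iodepth equal to 1, is paced by the device alone
  fio --directory=/mnt/test --size=1g --runtime=120 --time_based \
      --bs=4k --ioengine=psync \
      --name=heavy-writer --numjobs=4 --rw=write --fdatasync=1 \
      --name=rand-reader --numjobs=1 --rw=randread --iodepth=1

With psync the reader issues its next request only after the previous
one completes, which is exactly the dependency described above.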

On the other hand, the scheduler can regain control of the reader's
bandwidth if the reader issues more than one request at a time.
Anyway, before analyzing this second, controllable case, I wanted to
test responsiveness with this heavy write workload in the background.
And it was very bad!  After a few hours of mild panic, I found out
that this failure depends on a bug in bfq, a bug that, luckily,
happens to be triggered by these heavy writes as a background
workload ...

I've already found and am testing a fix for this bug. Yet, it will
probably take me a few weeks to submit this fix, because I'm finally
going on vacation.

Thanks,
Paolo

> Thanks,
> Paolo
> 
>> Thanks,
>> Paolo
>> 
>>> Thanks,
>>> Paolo
>>> 
>>>> LEGACY-DEADLINE
>>>> 
>>>> Operation                Count    AvgLat    MaxLat
>>>> --------------------------------------------------
>>>> Flush                    13267     9.622   298.206
>>>> Close                   135692     0.007    10.627
>>>> LockX                      640     0.008     0.066
>>>> Rename                    7827     0.544   481.123
>>>> ReadX                   285929     0.220  2698.442
>>>> WriteX                   92309   430.867  5191.608
>>>> Unlink                   34534     1.133   619.235
>>>> UnlockX                    640     0.008     0.724
>>>> FIND_FIRST               63289     0.086    56.851
>>>> SET_FILE_INFORMATION     16000     1.254   844.065
>>>> QUERY_FILE_INFORMATION   29883     0.004     0.618
>>>> QUERY_PATH_INFORMATION  173232     0.089  1295.651
>>>> QUERY_FS_INFORMATION     29632     0.017     4.813
>>>> NTCreateX               181464     0.479  2214.343
>>>> 
>>>> 
>>>> MQ-DEADLINE
>>>> 
>>>> Operation                Count    AvgLat    MaxLat
>>>> --------------------------------------------------
>>>> Flush                    13760    90.542 13221.495
>>>> Close                   137654     0.008    27.133
>>>> LockX                      640     0.009     0.115
>>>> Rename                    8064     1.062   246.759
>>>> ReadX                   297956     0.051   347.018
>>>> WriteX                   94698   425.636 15090.020
>>>> Unlink                   35077     0.580   208.462
>>>> UnlockX                    640     0.007     0.291
>>>> FIND_FIRST               66630     0.566   530.339
>>>> SET_FILE_INFORMATION     16000     1.419   811.494
>>>> QUERY_FILE_INFORMATION   30717     0.004     1.108
>>>> QUERY_PATH_INFORMATION  176153     0.182   517.419
>>>> QUERY_FS_INFORMATION     30857     0.018    18.562
>>>> NTCreateX               184145     0.281   582.076
>>>> 
>>>> So, with both bfq and deadline there seems to be a serious regression,
>>>> especially on MaxLat, when moving from legacy block to blk-mq.  The
>>>> regression is much worse with deadline, as legacy-deadline has the
>>>> lowest max latency among all the schedulers, whereas mq-deadline has
>>>> the highest one.
>>>> 
>>>> Regardless of the actual culprit of this regression, I would like to
>>>> investigate this issue further.  In this respect, I would like to ask
>>>> for a little help.  I would like to isolate the workloads generating
>>>> the highest latencies.  For this purpose, I had a look at the loadfile
>>>> client-tiny.txt, and I still have a question: is every item in the
>>>> loadfile somehow executed several times (once for each value of the
>>>> number of clients), or is it executed only once?  More precisely,
>>>> IIUC, for each operation reported in the above results, there are
>>>> several items (lines) in the loadfile.  So, is each of these items
>>>> executed only once?
>>>> 
>>>> I'm asking because, if it is executed only once, then I guess I can
>>>> find the critical tasks more easily.  Finally, if it is actually
>>>> executed only once, is it expected that the latency for such a task
>>>> is an order of magnitude higher than the average latency for that
>>>> group of tasks?  I mean, is such a task intrinsically much heavier,
>>>> and therefore expected to take much longer, or is the fact that its
>>>> latency is much higher a sign that something in the kernel misbehaves
>>>> for that task?
>>>> 
>>>> While waiting for some feedback, I'm going to run your test showing
>>>> great unfairness between writes and reads, and also check whether
>>>> responsiveness worsens if the write workload for that test is run in
>>>> the background.
>>>> 
>>>> Thanks,
>>>> Paolo
>>>> 
>>>>> ...
>>>>>> -- 
>>>>>> Mel Gorman
>>>>>> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Switching to MQ by default may generate some bug reports
  2017-08-09 21:49                     ` Paolo Valente
  (?)
@ 2017-08-10  8:44                     ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2017-08-10  8:44 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-block

On Wed, Aug 09, 2017 at 11:49:17PM +0200, Paolo Valente wrote:
> > This discrepancy with your results makes it a little bit harder for me
> > to decide how best to proceed, as I see no regression.  Anyway, since
> > this reader-throttling issue seems relevant, I have investigated it a
> > little more in depth.  The cause of the throttling is that the
> > fdatasync frequently performed by the writers in this test turns the
> > I/O of the writers into 100% sync I/O.  And neither bfq nor cfq
> > differentiates bandwidth between sync reads and sync writes.  Basically
> > both cfq and bfq are willing to dispatch the I/O requests of each
> > writer for a time slot equal to that devoted to the reader.  But write
> > requests, after reaching the device, occupy it for much more time than
> > reads do.  This delays the completion of the reader's requests and,
> > since the I/O is sync, also the issuing of the reader's next requests.
> > The final result is that the device spends most of the time serving
> > write requests, while the reader issues its read requests very slowly.
> > 
> > It might not be so difficult to balance this unfairness, although I'm
> > a little worried about changing bfq without being able to see the
> > regression you report.  If I give it a try, could I then count on
> > some testing on your machines?
> > 
> 
> Hi Mel,
> I've investigated this test case a little bit more, and the outcome is
> unfortunately rather drastic, unless I'm missing some important point.
> It is impossible to control the rate of the reader with the exact
> configuration of this test. 

Correct, both are simply competing for access to IO. Very broadly speaking,
it's only checking for loose (but not perfect) fairness with different IO
patterns.  While it's not a recent problem, historically (2+ years ago) we
had problems whereby a heavy reader or writer could starve IO completely. It
had odd effects like some multi-threaded benchmarks being artificially good
simply because one thread would dominate and artificially complete faster and
exit prematurely. "Fixing" it had a tendency to help real workloads while
hurting some benchmarks, so it's not straightforward to control for properly.
Bottom line, I'm not necessarily worried if a particular benchmark shows
an apparent regression once I understand why and can convince myself that a
"real" workload benefits from it (preferably proving it).

> In fact, since iodepth is equal to 1, the
> reader issues one I/O request at a time.  When one such request is
> dispatched, after some write requests have already been dispatched
> (and then queued in the device), the time to serve the request is
> controlled only by the device.  The longer the device makes the read
> request wait before being served, the later the reader will see the
> completion of its request, and then the later the reader will issue a
> new request, and so on.  So, for this test, it is mainly the device
> controller that decides the rate of the reader.
> 

Understood. It's less than ideal but not a completely silly test either.
That said, the fio tests are relatively new compared to some of the tests
monitored by mmtests looking for issues. It can take time to finalise a
test configuration before it's giving useful data 100% of the time.

> On the other hand, the scheduler can regain control of the reader's
> bandwidth if the reader issues more than one request at a time.

Ok, I'll take it as a todo item to increase the depth, as a depth of 1 is
not that interesting in itself. It's also on my todo list to add fio
configs that add think time.
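
Roughly, what I have in mind for the reader side is something like the
following (an untested sketch; the path and values are placeholders):

  # async reader with a deeper queue plus some think time between
  # batches of IO; direct=1 so libaio can actually keep the queue full
  fio --name=rand-reader-deep --directory=/mnt/test --size=1g \
      --bs=4k --ioengine=libaio --direct=1 --rw=randread --iodepth=8 \
      --thinktime=1ms --thinktime_blocks=16

With a depth above 1 the scheduler gets back at least some control over
the reader's bandwidth, which should land us in the controllable case
you describe below.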

> Anyway, before analyzing this second, controllable case, I wanted to
> test responsiveness with this heavy write workload in the background.
> And it was very bad!  After a few hours of mild panic, I found out
> that this failure depends on a bug in bfq, a bug that, luckily,
> happens to be triggered by these heavy writes as a background
> workload ...
> 
> I've already found and am testing a fix for this bug. Yet, it will
> probably take me a few weeks to submit this fix, because I'm finally
> going on vacation.
> 

This is obviously both good and bad. Bad in that the bug exists at all,
good in that you detected it and a fix is possible. I don't think you have
to panic considering that some of the pending fixes include Ming's work
which won't be merged for quite some time and tests take a long time anyway.
Whenever you get around to a fix after your vacation, just cc me and I'll
queue it across a range of machines so you have some independent tests.
A review from me would not be worth much as I haven't spent the time to
fully understand BFQ yet.

If the fixes do not hit until the next merge window or the window after that
then someone who cares enough can do a performance-based -stable backport. If
there are any bugs in the meantime (e.g. after 4.13 comes out) then there
will be a series for the reporter to test. I think it's still reasonably
positive that issues with MQ being enabled by default were detected within
weeks, with potential fixes in the pipeline. It's better than months passing
before a distro picks up a suitable kernel and enough time passes for a
coherent bug report to show up that's better than "my computer is slow".

Thanks for the hard work and prompt research. 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2017-08-10  8:44 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-03  8:51 Switching to MQ by default may generate some bug reports Mel Gorman
2017-08-03  9:17 ` Ming Lei
2017-08-03  9:32   ` Ming Lei
2017-08-03  9:42   ` Mel Gorman
2017-08-03  9:44     ` Paolo Valente
2017-08-03  9:44       ` Paolo Valente
2017-08-03 10:46       ` Mel Gorman
2017-08-03  9:57     ` Ming Lei
2017-08-03 10:47       ` Mel Gorman
2017-08-03 11:48         ` Ming Lei
2017-08-03  9:21 ` Paolo Valente
2017-08-03  9:21   ` Paolo Valente
2017-08-03 11:01   ` Mel Gorman
2017-08-04  7:26     ` Paolo Valente
2017-08-04  7:26       ` Paolo Valente
2017-08-04 11:01       ` Mel Gorman
2017-08-04 22:05         ` Paolo Valente
2017-08-04 22:05           ` Paolo Valente
2017-08-05 11:54           ` Mel Gorman
2017-08-07 17:35             ` Paolo Valente
2017-08-07 17:35               ` Paolo Valente
2017-08-07 17:32           ` Paolo Valente
2017-08-07 17:32             ` Paolo Valente
2017-08-07 18:42             ` Paolo Valente
2017-08-07 18:42               ` Paolo Valente
2017-08-08  8:06               ` Paolo Valente
2017-08-08  8:06                 ` Paolo Valente
2017-08-08 17:33                 ` Paolo Valente
2017-08-08 17:33                   ` Paolo Valente
2017-08-08 18:27                   ` Mel Gorman
2017-08-09 21:49                   ` Paolo Valente
2017-08-09 21:49                     ` Paolo Valente
2017-08-10  8:44                     ` Mel Gorman
2017-08-08 10:30             ` Mel Gorman
2017-08-08 10:43               ` Ming Lei
2017-08-08 11:27                 ` Mel Gorman
2017-08-08 11:49                   ` Ming Lei
2017-08-08 11:55                     ` Mel Gorman
2017-08-08 17:16               ` Paolo Valente
2017-08-08 17:16                 ` Paolo Valente
