linux-kernel.vger.kernel.org archive mirror
* CFQ read performance regression
@ 2010-04-16 12:27 Miklos Szeredi
  2010-04-16 17:06 ` Chris
  2010-04-17 12:46 ` Corrado Zoccolo
  0 siblings, 2 replies; 21+ messages in thread
From: Miklos Szeredi @ 2010-04-16 12:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Jan Kara, Suresh Jayaraman

Hi Jens,

I'm chasing a performance bottleneck identified by tiobench that seems
to be caused by CFQ.  On a SLES10-SP3 kernel (2.6.16, with some patches
moving cfq closer to 2.6.17) tiobench with 8 threads gets about 260MB/s
sequential read throughput.  On recent kernels (including vanilla
2.6.34-rc) it makes about 145MB/s, a regression of 45%.  The queue and
readahead parameters are the same.

This goes back some time; 2.6.27 already shows the poor
performance.

Changing the scheduler to noop brings the throughput back into
the 260MB/s range, so this is not a driver issue.

Increasing quantum *and* readahead also improves the throughput,
but not by as much.  Both noop and these tweaks somewhat decrease the
write throughput, however...
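
For reference, both the scheduler switch and these tweaks are made at
runtime through sysfs; a sketch, with sdX standing in for the actual
device and the values merely illustrative:

  # switch the elevator to noop
  echo noop > /sys/block/sdX/queue/scheduler
  # or keep cfq and raise the dispatch quantum and readahead instead
  echo 16   > /sys/block/sdX/queue/iosched/quantum
  echo 1024 > /sys/block/sdX/queue/read_ahead_kb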

Apparently on recent kernels the number of dispatched requests stays
mostly at or below 4 and the dispatched sector count at or below 2000,
which is not enough to fill the bandwidth on this setup.

On 2.6.16 the number of dispatched requests hovers around 22 and the
sector count around 16000.
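
The in-flight count can also be watched live via /proc/diskstats (the
12th field of a device's line is the number of I/Os currently in
progress); sdX is a placeholder:

  while sleep 1; do awk '$3 == "sdX" { print $12 }' /proc/diskstats; done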

I uploaded blktraces for the read part of the tiobench runs for both
2.6.16 and 2.6.32:

 http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/
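
The traces were collected with the standard blktrace tooling, roughly
like this (device name is a placeholder):

  blktrace -d /dev/sdX -o trace    # record while the read phase runs
  blkparse -i trace -d trace.bin   # merge the per-CPU traces into one binary
  btt -i trace.bin                 # produce the timing summary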

Do you have any idea about the cause of this regression?

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-16 12:27 CFQ read performance regression Miklos Szeredi
@ 2010-04-16 17:06 ` Chris
  2010-04-17 12:46 ` Corrado Zoccolo
  1 sibling, 0 replies; 21+ messages in thread
From: Chris @ 2010-04-16 17:06 UTC (permalink / raw)
  To: linux-kernel

On Fri, Apr 16, 2010 at 02:27:58PM +0200, Miklos Szeredi wrote:
> Hi Jens,
> 
> I'm chasing a performance bottleneck identified by tiobench that seems
> to be caused by CFQ.  On a SLES10-SP3 kernel (2.6.16, with some patches
> moving cfq closer to 2.6.17) tiobench with 8 threads gets about 260MB/s
> sequential read throughput.  On recent kernels (including vanilla
> 2.6.34-rc) it makes about 145MB/s, a regression of 45%.  The queue and
> readahead parameters are the same.

I've also just noticed this using the most recent Red Hat kernels.  Writes
don't seem to be affected at all.  In case there is something in common, I
might as well show you what I've got.

./disktest  -B 4k -h 1 -I BD -K 32 -p l -P T -T 300  -r /dev/sdf

With cfq we get this:
  STAT  | 17260 | v1.4.2 | /dev/sdf | Heartbeat read throughput: 15032320.0B/s (14.34MB/s), IOPS 3670.0/s
And with noop we get this:
  STAT  | 17260 | v1.4.2 | /dev/sdf | Heartbeat read throughput: 111759360.0B/s (106.58MB/s), IOPS 27285.0/s

Setting some very large and busy web servers to noop, just out of curiosity,
also reduced the average I/O time and dropped the load.

Chris

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-16 12:27 CFQ read performance regression Miklos Szeredi
  2010-04-16 17:06 ` Chris
@ 2010-04-17 12:46 ` Corrado Zoccolo
  2010-04-19 11:46   ` Miklos Szeredi
  1 sibling, 1 reply; 21+ messages in thread
From: Corrado Zoccolo @ 2010-04-17 12:46 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

[-- Attachment #1: Type: text/plain, Size: 3149 bytes --]

Hi Miklos,
I don't think this is related to CFQ. I've made a graph of the
accessed (read) sectors (see attached).
You can see that the green cloud (2.6.16) is much more concentrated,
while the red one (2.6.32) is split in two, and you can better
recognize the different lines.
This means that the FS put more distance between the blocks of the
files written by the tio threads, and the read time is therefore
impacted, since the disk head has to perform longer seeks. On the
other hand, if you read those files sequentially with a single thread,
the performance may be better with the new layout, so YMMV. When
testing 2.6.32 and up, you should also consider testing with the
low_latency setting disabled, since tuning for latency can negatively
affect throughput.

Thanks,
Corrado

On Fri, Apr 16, 2010 at 2:27 PM, Miklos Szeredi <mszeredi@suse.cz> wrote:
> Hi Jens,
>
> I'm chasing a performance bottleneck identified by tiobench that seems
> to be caused by CFQ.  On a SLES10-SP3 kernel (2.6.16, with some patches
> moving cfq closer to 2.6.17) tiobench with 8 threads gets about 260MB/s
> sequential read throughput.  On recent kernels (including vanilla
> 2.6.34-rc) it makes about 145MB/s, a regression of 45%.  The queue and
> readahead parameters are the same.
>
> This goes back some time; 2.6.27 already shows the poor
> performance.
>
> Changing the scheduler to noop brings the throughput back into
> the 260MB/s range, so this is not a driver issue.
>
> Increasing quantum *and* readahead also improves the throughput,
> but not by as much.  Both noop and these tweaks somewhat decrease the
> write throughput, however...
>
> Apparently on recent kernels the number of dispatched requests stays
> mostly at or below 4 and the dispatched sector count at or below 2000,
> which is not enough to fill the bandwidth on this setup.
>
> On 2.6.16 the number of dispatched requests hovers around 22 and the
> sector count around 16000.
>
> I uploaded blktraces for the read part of the tiobench runs for both
> 2.6.16 and 2.6.32:
>
>  http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/
>
> Do you have any idea about the cause of this regression?
>
> Thanks,
> Miklos
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

[-- Attachment #2: access.png --]
[-- Type: image/png, Size: 12253 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-17 12:46 ` Corrado Zoccolo
@ 2010-04-19 11:46   ` Miklos Szeredi
  2010-04-20 20:50     ` Corrado Zoccolo
  0 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2010-04-19 11:46 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

Hi Corrado,

On Sat, 2010-04-17 at 14:46 +0200, Corrado Zoccolo wrote:
> I don't think this is related to CFQ. I've made a graph of the
> accessed (read) sectors (see attached).
> You can see that the green cloud (2.6.16) is much more concentrated,
> while the red one (2.6.32) is split in two, and you can better
> recognize the different lines.
> This means that the FS put more distance between the blocks of the
> files written by the tio threads, and the read time is therefore
> impacted, since the disk head has to perform longer seeks. On the
> other hand, if you read those files sequentially with a single thread,
> the performance may be better with the new layout, so YMMV. When
> testing 2.6.32 and up, you should also consider testing with the
> low_latency setting disabled, since tuning for latency can negatively
> affect throughput.

low_latency is set to zero in all tests.

The layout difference doesn't explain why setting the scheduler to
"noop" consistently speeds up read throughput in 8-thread tiobench to
almost twice.  This fact alone pretty clearly indicates that the I/O
scheduler is the culprit.

There are other indications; see the attached btt output for both
traces.  From there it appears that 2.6.16 does more and longer seeks,
yet it gets overall better performance.

I've also tested with plain "dd" instead of tiobench where the
filesystem layout stayed exactly the same between tests.  Still the
speed difference is there.

Thanks,
Miklos

************************************************************************
btt output for 2.6.16:
==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000047   0.000854417   1.003550405       67465
Q2G               0.000000458   0.000001211   0.000123527       46062
G2I               0.000000123   0.000001815   0.000494517       46074
Q2M               0.000000186   0.000001798   0.000010296       21404
I2D               0.000000162   0.000158001   0.040794333       46062
M2D               0.000000878   0.000133130   0.040585566       21404
D2C               0.000053870   0.023778266   0.234154543       67466
Q2C               0.000056746   0.023931014   0.234176000       67466

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 64) |   0.0035%   0.0052%   0.0024%   0.4508%  99.3617%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.0035%   0.0052%   0.0024%   0.4508%  99.3617%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax    Total
---------- | -------- -------- ------- | -------- -------- -------- --------
 (  8, 64) |    67466    46062     1.5 |        8      597     1024 27543688

==================== Device Q2Q Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE           
---------- | --------------- --------------- --------------- | ---------------
 (  8, 64) |           67466        866834.0               0 | 0(34558)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE           
   Average |           67466        866834.0               0 | 0(34558)

==================== Device D2D Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE           
---------- | --------------- --------------- --------------- | ---------------
 (  8, 64) |           46062       1265503.9               0 | 0(13242)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE           
   Average |           46062       1265503.9               0 | 0(13242)

==================== Plug Information ====================

       DEV |    # Plugs # Timer Us  | % Time Q Plugged
---------- | ---------- ----------  | ----------------
 (  8, 64) |      29271(       533) |   3.878105328%

       DEV |    IOs/Unp   IOs/Unp(to)
---------- | ----------   ----------
 (  8, 64) |       19.2         19.7
---------- | ----------   ----------
   Overall |    IOs/Unp   IOs/Unp(to)
   Average |       19.2         19.7

==================== Active Requests At Q Information ====================

       DEV |  Avg Reqs @ Q
---------- | -------------
 (  8, 64) |           0.8

==================== I/O Active Period Information ====================

       DEV |     # Live      Avg. Act     Avg. !Act % Live
---------- | ---------- ------------- ------------- ------
 (  8, 64) |        545   0.100012237   0.005766640  94.56
---------- | ---------- ------------- ------------- ------
 Total Sys |        545   0.100012237   0.005766640  94.56

************************************************************************
btt output for 2.6.32:

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000279   0.001710581   1.803934429       69429
Q2G               0.000000908   0.001798735  23.144764798       54940
S2G               0.022460311   6.581680621  23.144763751          15
G2I               0.000000628   0.000001576   0.000120409       54942
Q2M               0.000000628   0.000001611   0.000013201       14490
I2D               0.000000768   0.289812035  86.820205789       54940
M2D               0.000005518   0.098208187   0.794441158       14490
D2C               0.000173141   0.008056256   0.219516385       69430
Q2C               0.000179077   0.259305605  86.820559403       69430

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 64) |   0.5489%   0.0005%   0.0001%  88.4394%   3.1069%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.5489%   0.0005%   0.0001%  88.4394%   3.1069%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax    Total
---------- | -------- -------- ------- | -------- -------- -------- --------
 (  8, 64) |    69430    54955     1.3 |        8      520     2048 28614984

==================== Device Q2Q Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE           
---------- | --------------- --------------- --------------- | ---------------
 (  8, 64) |           69430        546377.3               0 | 0(50235)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE           
   Average |           69430        546377.3               0 | 0(50235)

==================== Device D2D Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE           
---------- | --------------- --------------- --------------- | ---------------
 (  8, 64) |           54955        565286.3               0 | 0(37535)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE           
   Average |           54955        565286.3               0 | 0(37535)

==================== Plug Information ====================

       DEV |    # Plugs # Timer Us  | % Time Q Plugged
---------- | ---------- ----------  | ----------------
 (  8, 64) |       2310(         0) |   0.049353257%

       DEV |    IOs/Unp   IOs/Unp(to)
---------- | ----------   ----------
 (  8, 64) |        7.3          0.0
---------- | ----------   ----------
   Overall |    IOs/Unp   IOs/Unp(to)
   Average |        7.3          0.0

==================== Active Requests At Q Information ====================

       DEV |  Avg Reqs @ Q
---------- | -------------
 (  8, 64) |         132.8

==================== I/O Active Period Information ====================

       DEV |     # Live      Avg. Act     Avg. !Act % Live
---------- | ---------- ------------- ------------- ------
 (  8, 64) |       4835   0.023848998   0.000714665  97.09
---------- | ---------- ------------- ------------- ------
 Total Sys |       4835   0.023848998   0.000714665  97.09




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-19 11:46   ` Miklos Szeredi
@ 2010-04-20 20:50     ` Corrado Zoccolo
  2010-04-21 13:25       ` Miklos Szeredi
  0 siblings, 1 reply; 21+ messages in thread
From: Corrado Zoccolo @ 2010-04-20 20:50 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 10741 bytes --]

On Mon, Apr 19, 2010 at 1:46 PM, Miklos Szeredi <mszeredi@suse.cz> wrote:
> Hi Corrado,
>
> On Sat, 2010-04-17 at 14:46 +0200, Corrado Zoccolo wrote:
>> I don't think this is related to CFQ. I've made a graph of the
>> accessed (read) sectors (see attached).
>> You can see that the green cloud (2.6.16) is much more concentrated,
>> while the red one (2.6.32) is split in two, and you can better
>> recognize the different lines.
>> This means that the FS put more distance between the blocks of the
>> files written by the tio threads, and the read time is therefore
>> impacted, since the disk head has to perform longer seeks. On the
>> other hand, if you read those files sequentially with a single thread,
>> the performance may be better with the new layout, so YMMV. When
>> testing 2.6.32 and up, you should also consider testing with the
>> low_latency setting disabled, since tuning for latency can negatively
>> affect throughput.

Hi Miklos,
can you give more information about the setup?
How much memory do you have, what is the disk configuration (is this a
hw raid?) and so on.

> low_latency is set to zero in all tests.
>
> The layout difference doesn't explain why setting the scheduler to
> "noop" consistently speeds up read throughput in 8-thread tiobench by
> almost a factor of two.  This fact alone pretty clearly indicates that
> the I/O scheduler is the culprit.

From the attached btt output, I see that a lot of time is spent
waiting to allocate new request structures.

> S2G               0.022460311   6.581680621  23.144763751          15

Since noop doesn't attach fancy data to each request, it can save
those allocations, thus resulting in no sleeps.
The delays in allocation, though, may not be completely imputable to
the I/O scheduler, and working in constrained memory conditions will
negatively affect it.

> There are other indications; see the attached btt output for both
> traces.  From there it appears that 2.6.16 does more and longer seeks,
> yet it gets overall better performance.

I see fewer seeks for 2.6.16, but longer ones on average.
It seems that 2.6.16 allows more requests from the same process to be
streamed to disk before switching to another process.
Since the timeslice is the same, it might be that we are limiting the
number of requests per queue due to memory congestion.

> I've also tested with plain "dd" instead of tiobench where the
> filesystem layout stayed exactly the same between tests.  Still the
> speed difference is there.

Does dropping caches before the read test change the situation?

Thanks,
Corrado

[quoted btt output snipped; it is identical to the tables in the
parent message]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-20 20:50     ` Corrado Zoccolo
@ 2010-04-21 13:25       ` Miklos Szeredi
  2010-04-21 16:05         ` Miklos Szeredi
  0 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2010-04-21 13:25 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

Corrado,

On Tue, 2010-04-20 at 22:50 +0200, Corrado Zoccolo wrote:
> can you give more information about the setup?
> How much memory do you have, what is the disk configuration (is this a
> hw raid?) and so on.

8G of memory, an 8-way Xeon CPU, and a Fibre Channel attached storage
array (HP HSV200).  I don't know the configuration of the array.

> > low_latency is set to zero in all tests.
> >
> > The layout difference doesn't explain why setting the scheduler to
> > "noop" consistently speeds up read throughput in 8-thread tiobench to
> > almost twice.  This fact alone pretty clearly indicates that the I/O
> > scheduler is the culprit.
> From the attached btt output, I see that a lot of time is spent
> waiting to allocate new request structures.
> > S2G               0.022460311   6.581680621  23.144763751          15
> Since noop doesn't attach fancy data to each request, it can save
> those allocations, thus resulting in no sleeps.
> The delays in allocation, though, may not be completely imputable to
> the I/O scheduler, and working in constrained memory conditions will
> negatively affect it.

I verified with the simple dd test that this happens even when there's
no memory pressure from the cache, by dd-ing only 5G of files after
clearing the cache.  This way ~2G of memory is completely free
throughout the test.

> > I've also tested with plain "dd" instead of tiobench where the
> > filesystem layout stayed exactly the same between tests.  Still the
> > speed difference is there.
> Does dropping caches before the read test change the situation?

In all my tests I drop caches before running it.
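
That is, something like:

  sync
  echo 3 > /proc/sys/vm/drop_caches   # free page cache, dentries and inodes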

Please let me know if you need more information.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-21 13:25       ` Miklos Szeredi
@ 2010-04-21 16:05         ` Miklos Szeredi
  2010-04-22  7:59           ` Corrado Zoccolo
  0 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2010-04-21 16:05 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

Jens, Corrado,

Here's a graph showing the number of issued but not yet completed
requests versus time for CFQ and NOOP schedulers running the tiobench
benchmark with 8 threads:

http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg

It shows pretty clearly that the performance problem is due to CFQ not
issuing enough requests to fill the bandwidth.

Is this the correct behavior of CFQ or is this a bug?

This is on a vanilla 2.6.34-rc4 kernel with two tunables modified:

read_ahead_kb=512
low_latency=0 (for CFQ)
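
These were set via sysfs, i.e. something like (sdX standing in for the
device under test):

  echo 512 > /sys/block/sdX/queue/read_ahead_kb
  echo 0   > /sys/block/sdX/queue/iosched/low_latency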

Thanks,
Miklos



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-21 16:05         ` Miklos Szeredi
@ 2010-04-22  7:59           ` Corrado Zoccolo
  2010-04-22 10:23             ` Miklos Szeredi
  2010-04-22 20:31             ` Vivek Goyal
  0 siblings, 2 replies; 21+ messages in thread
From: Corrado Zoccolo @ 2010-04-22  7:59 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

Hi Miklos,
On Wed, Apr 21, 2010 at 6:05 PM, Miklos Szeredi <mszeredi@suse.cz> wrote:
> Jens, Corrado,
>
> Here's a graph showing the number of issued but not yet completed
> requests versus time for CFQ and NOOP schedulers running the tiobench
> benchmark with 8 threads:
>
> http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg
>
> It shows pretty clearly that the performance problem is due to CFQ not
> issuing enough requests to fill the bandwidth.
>
> Is this the correct behavior of CFQ or is this a bug?
This is the expected behavior from CFQ, even if it is not optimal,
since we aren't able to identify multi-spindle disks yet.  Can you
post the result of "grep -r . ." in your /sys/block/*/queue directory,
to see if we can find some parameter that helps identify your
hardware as a multi-spindle disk?
>
> This is on a vanilla 2.6.34-rc4 kernel with two tunables modified:
>
> read_ahead_kb=512
> low_latency=0 (for CFQ)
You should get much better throughput by setting
/sys/block/_your_disk_/queue/iosched/slice_idle to 0, or
/sys/block/_your_disk_/queue/rotational to 0.
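
That is (sdX standing in for your disk):

  echo 0 > /sys/block/sdX/queue/iosched/slice_idle   # never idle on a queue
  echo 0 > /sys/block/sdX/queue/rotational           # treat as non-rotational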

Thanks,
Corrado
>
> Thanks,
> Miklos
>
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-22  7:59           ` Corrado Zoccolo
@ 2010-04-22 10:23             ` Miklos Szeredi
  2010-04-22 15:53               ` Jan Kara
  2010-04-22 20:31             ` Vivek Goyal
  1 sibling, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2010-04-22 10:23 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

On Thu, 2010-04-22 at 09:59 +0200, Corrado Zoccolo wrote:
> Hi Miklos,
> On Wed, Apr 21, 2010 at 6:05 PM, Miklos Szeredi <mszeredi@suse.cz> wrote:
> > Jens, Corrado,
> >
> > Here's a graph showing the number of issued but not yet completed
> > requests versus time for CFQ and NOOP schedulers running the tiobench
> > benchmark with 8 threads:
> >
> > http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg
> >
> > It shows pretty clearly that the performance problem is due to CFQ not
> > issuing enough requests to fill the bandwidth.
> >
> > Is this the correct behavior of CFQ or is this a bug?
> This is the expected behavior from CFQ, even if it is not optimal,
> since we aren't able to identify multi-spindle disks yet.  Can you
> post the result of "grep -r . ." in your /sys/block/*/queue directory,
> to see if we can find some parameter that helps identify your
> hardware as a multi-spindle disk?

./iosched/quantum:8
./iosched/fifo_expire_sync:124
./iosched/fifo_expire_async:248
./iosched/back_seek_max:16384
./iosched/back_seek_penalty:2
./iosched/slice_sync:100
./iosched/slice_async:40
./iosched/slice_async_rq:2
./iosched/slice_idle:8
./iosched/low_latency:0
./iosched/group_isolation:0
./nr_requests:128
./read_ahead_kb:512
./max_hw_sectors_kb:32767
./max_sectors_kb:512
./max_segments:64
./max_segment_size:65536
./scheduler:noop deadline [cfq]
./hw_sector_size:512
./logical_block_size:512
./physical_block_size:512
./minimum_io_size:512
./optimal_io_size:0
./discard_granularity:0
./discard_max_bytes:0
./discard_zeroes_data:0
./rotational:1
./nomerges:0
./rq_affinity:1

> >
> > This is on a vanilla 2.6.34-rc4 kernel with two tunables modified:
> >
> > read_ahead_kb=512
> > low_latency=0 (for CFQ)
> You should get much better throughput by setting
> /sys/block/_your_disk_/queue/iosched/slice_idle to 0, or
> /sys/block/_your_disk_/queue/rotational to 0.

slice_idle=0 definitely helps.  rotational=0 seems to help on 2.6.34-rc
but not on 2.6.32.

As far as I understand setting slice_idle to zero is just a workaround
to make cfq look at all the other queues instead of serving one
exclusively for a long time.

I have very little understanding of I/O scheduling but my idea of what's
really needed here is to realize that one queue is not able to saturate
the device and there's a large backlog of requests on other queues that
are waiting to be served.  Is something like that implementable?

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-22 10:23             ` Miklos Szeredi
@ 2010-04-22 15:53               ` Jan Kara
  2010-04-23 10:48                 ` Miklos Szeredi
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2010-04-22 15:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Corrado Zoccolo, Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

On Thu 22-04-10 12:23:29, Miklos Szeredi wrote:
> > >
> > > This is on a vanilla 2.6.34-rc4 kernel with two tunables modified:
> > >
> > > read_ahead_kb=512
> > > low_latency=0 (for CFQ)
> > You should get much better throughput by setting
> > /sys/block/_your_disk_/queue/iosched/slice_idle to 0, or
> > /sys/block/_your_disk_/queue/rotational to 0.
> 
> slice_idle=0 definitely helps.  rotational=0 seems to help on 2.6.34-rc
> but not on 2.6.32.
> 
> As far as I understand setting slice_idle to zero is just a workaround
> to make cfq look at all the other queues instead of serving one
> exclusively for a long time.
  Yes, basically it disables idling (i.e., waiting to see whether a thread
sends more IO so that we can get better IO locality).

> I have very little understanding of I/O scheduling but my idea of what's
> really needed here is to realize that one queue is not able to saturate
> the device and there's a large backlog of requests on other queues that
> are waiting to be served.  Is something like that implementable?
  I see a problem with defining "saturate the device" - but maybe we could
measure something like "completed requests / sec" and try autotuning
slice_idle to maximize this value (hopefully the utility function should
be concave so we can just use "local optimization").
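  A rough userspace approximation of the idea, just to illustrate it
(hypothetical and untested; sdX is a placeholder and the numbers are
arbitrary):

  #!/bin/sh
  # Crude hill-climbing on slice_idle: measure completed reads/sec from
  # /proc/diskstats (field 4 = reads completed) and keep moving
  # slice_idle in whichever direction improves that rate.
  DEV=sdX
  KNOB=/sys/block/$DEV/queue/iosched/slice_idle
  idle=8 step=-2
  rate() {
      r0=$(awk -v d=$DEV '$3 == d { print $4 }' /proc/diskstats)
      sleep 10
      r1=$(awk -v d=$DEV '$3 == d { print $4 }' /proc/diskstats)
      echo $(( (r1 - r0) / 10 ))
  }
  best=$(rate)
  while true; do
      new=$(( idle + step ))
      [ $new -lt 0 ] && new=0
      echo $new > $KNOB
      cur=$(rate)
      if [ $cur -gt $best ]; then
          best=$cur idle=$new
      else
          step=$(( -step ))      # reverse direction
          echo $idle > $KNOB     # restore the better value
      fi
  done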

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-22  7:59           ` Corrado Zoccolo
  2010-04-22 10:23             ` Miklos Szeredi
@ 2010-04-22 20:31             ` Vivek Goyal
  2010-04-23 10:57               ` Miklos Szeredi
  1 sibling, 1 reply; 21+ messages in thread
From: Vivek Goyal @ 2010-04-22 20:31 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Miklos Szeredi, Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

On Thu, Apr 22, 2010 at 09:59:14AM +0200, Corrado Zoccolo wrote:
> Hi Miklos,
> On Wed, Apr 21, 2010 at 6:05 PM, Miklos Szeredi <mszeredi@suse.cz> wrote:
> > Jens, Corrado,
> >
> > Here's a graph showing the number of issued but not yet completed
> > requests versus time for CFQ and NOOP schedulers running the tiobench
> > benchmark with 8 threads:
> >
> > http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg
> >
> > It shows pretty clearly that the performance problem is due to CFQ not
> > issuing enough requests to fill the bandwidth.
> >
> > Is this the correct behavior of CFQ or is this a bug?
> This is the expected behavior from CFQ, even if it is not optimal,
> since we aren't able to identify multi-spindle disks yet.

In the past we were of the opinion that for sequential workloads multi-spindle
disks would not matter much, as readahead logic (in the OS and possibly in
the hardware too) would help.  For random workloads we don't idle on the
single cfqq anyway, so that is fine.  But my tests now seem to be telling a
different story.

I also have one FC link to an HP EVA and I am running an increasing
number of sequential readers to see if throughput goes up as the number of
readers goes up.  The results are with noop and cfq.  I do flush OS caches
across the runs, but I have no control over caching on the HP EVA.
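
The "bsr" jobs are buffered sequential readers; a roughly equivalent
fio invocation (reconstructed for illustration, not the actual job
file), with --numjobs swept over 1, 2, 4, 8 and 16, would be:

  fio --name=bsr --directory=/mnt/iostestmnt/fio --rw=read --bs=4k \
      --size=2G --numjobs=8 --group_reporting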

Kernel=2.6.34-rc5 
DIR=/mnt/iostestmnt/fio        DEV=/dev/mapper/mpathe        
Workload=bsr      iosched=cfq     Filesz=2G   bs=4K   
=========================================================================
job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)    
---       --- --  ------------   -----------    -------------  -----------    
bsr       1   1   135366         59024          0              0              
bsr       1   2   124256         126808         0              0              
bsr       1   4   132921         341436         0              0              
bsr       1   8   129807         392904         0              0              
bsr       1   16  129988         773991         0              0              

Kernel=2.6.34-rc5             
DIR=/mnt/iostestmnt/fio        DEV=/dev/mapper/mpathe        
Workload=bsr      iosched=noop    Filesz=2G   bs=4K   
=========================================================================
job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)    
---       --- --  ------------   -----------    -------------  -----------    
bsr       1   1   126187         95272          0              0              
bsr       1   2   185154         72908          0              0              
bsr       1   4   224622         88037          0              0              
bsr       1   8   285416         115592         0              0              
bsr       1   16  348564         156846         0              0              

So in the case of NOOP, throughput shot up to 348MB/s but CFQ remains more or
less constant, at about 130MB/s.

So at least in this case, a single sequential CFQ queue is not keeping the
disk busy enough.

I am wondering why my test results were different in the past.  Maybe
it was a different piece of hardware, and the behavior varies across hardware?

Anyway, if that's the case, then we probably need to allow IO from
multiple sequential readers and keep a watch on throughput.  If throughput
drops, then reduce the number of parallel sequential readers.  I'm not sure how
much code that is, but with multiple cfqqs going in parallel, the ioprio
logic will more or less stop working in CFQ (on multi-spindle hardware).

FWIW, I also ran tiobench on the same HP EVA with NOOP and CFQ.  And
indeed, read throughput is bad with CFQ.

With NOOP
=========
# /usr/bin/tiotest -t 8 -f 2000 -r 4000 -b 4096 -d /mnt/mpathe
Tiotest results for 8 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write       16000 MBs |   44.1 s | 362.410 MB/s |  25.3 %  | 1239.4 % |
| Random Write  125 MBs |    0.8 s | 156.182 MB/s |  19.7 %  | 484.8 % |
| Read        16000 MBs |   59.9 s | 267.008 MB/s |  12.4 %  | 197.1 % |
| Random Read   125 MBs |   16.7 s |   7.478 MB/s |   1.0 %  |  23.7 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.083 ms |      834.092 ms |  0.00000 |   0.00000 |
| Random Write |        0.021 ms |       21.024 ms |  0.00000 |   0.00000 |
| Read         |        0.115 ms |      105.830 ms |  0.00000 |   0.00000 |
| Random Read  |        4.088 ms |      295.605 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.114 ms |      834.092 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

With CFQ
========
# /usr/bin/tiotest -t 8 -f 2000 -r 4000 -b 4096 -d /mnt/mpathe
Tiotest results for 8 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write       16000 MBs |   49.5 s | 323.086 MB/s |  21.7 %  | 1175.6 % |
| Random Write  125 MBs |    2.2 s |  57.148 MB/s |   5.0 %  | 188.1 % |
| Read        16000 MBs |  162.7 s |  98.311 MB/s |   4.7 %  |  71.0 % |
| Random Read   125 MBs |   17.0 s |   7.344 MB/s |   0.8 %  |  26.5 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.093 ms |      832.680 ms |  0.00000 |   0.00000 |
| Random Write |        0.017 ms |       12.031 ms |  0.00000 |   0.00000 |
| Read         |        0.316 ms |      561.623 ms |  0.00000 |   0.00000 |
| Random Read  |        4.126 ms |      273.156 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.219 ms |      832.680 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

Thanks
Vivek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-22 15:53               ` Jan Kara
@ 2010-04-23 10:48                 ` Miklos Szeredi
  0 siblings, 0 replies; 21+ messages in thread
From: Miklos Szeredi @ 2010-04-23 10:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Corrado Zoccolo, Jens Axboe, linux-kernel, Suresh Jayaraman

On Thu, 2010-04-22 at 17:53 +0200, Jan Kara wrote:
> On Thu 22-04-10 12:23:29, Miklos Szeredi wrote:
> > I have very little understanding of I/O scheduling but my idea of what's
> > really needed here is to realize that one queue is not able to saturate
> > the device and there's a large backlog of requests on other queues that
> > are waiting to be served.  Is something like that implementable?
>   I see a problem with defining "saturate the device" - but maybe we could
> measure something like "completed requests / sec" and try autotuning
> slice_idle to maximize this value (hopefully the utility function should
> be concave so we can just use "local optimization").

Yeah, detecting saturation may be difficult.

I guess that function depends on a lot of other things as well,
including seek times, etc.  Not easy to optimize.

I'm still wondering what makes such a difference between CFQ on 2.6.16
and CFQ on 2.6.27-34: why does the one in older kernels perform so much
better in this situation?

What should we tell our customers?  The default settings should at least
handle these systems a bit better.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-22 20:31             ` Vivek Goyal
@ 2010-04-23 10:57               ` Miklos Szeredi
  2010-04-24 20:36                 ` Corrado Zoccolo
  0 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2010-04-23 10:57 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Corrado Zoccolo, Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

On Thu, 2010-04-22 at 16:31 -0400, Vivek Goyal wrote:
> On Thu, Apr 22, 2010 at 09:59:14AM +0200, Corrado Zoccolo wrote:
> > Hi Miklos,
> > On Wed, Apr 21, 2010 at 6:05 PM, Miklos Szeredi <mszeredi@suse.cz> wrote:
> > > Jens, Corrado,
> > >
> > > Here's a graph showing the number of issued but not yet completed
> > > requests versus time for CFQ and NOOP schedulers running the tiobench
> > > benchmark with 8 threads:
> > >
> > > http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg
> > >
> > > It shows pretty clearly that the performance problem is due to CFQ not
> > > issuing enough requests to fill the bandwidth.
> > >
> > > Is this the correct behavior of CFQ or is this a bug?
> >  This is the expected behavior from CFQ, even if it is not optimal,
> > since we aren't able to identify multi-splindle disks yet.
> 
> In the past we were of the opinion that for sequential workloads multi-spindle
> disks would not matter much, as readahead logic (in the OS and possibly in
> the hardware too) would help.  For random workloads we don't idle on the
> single cfqq anyway, so that is fine.  But my tests now seem to be telling a
> different story.
> 
> I also have one FC link to an HP EVA and I am running an increasing
> number of sequential readers to see if throughput goes up as the number of
> readers goes up.  The results are with noop and cfq.  I do flush OS caches
> across the runs, but I have no control over caching on the HP EVA.
> 
> Kernel=2.6.34-rc5 
> DIR=/mnt/iostestmnt/fio        DEV=/dev/mapper/mpathe        
> Workload=bsr      iosched=cfq     Filesz=2G   bs=4K   
> =========================================================================
> job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)    
> ---       --- --  ------------   -----------    -------------  -----------    
> bsr       1   1   135366         59024          0              0              
> bsr       1   2   124256         126808         0              0              
> bsr       1   4   132921         341436         0              0              
> bsr       1   8   129807         392904         0              0              
> bsr       1   16  129988         773991         0              0              
> 
> Kernel=2.6.34-rc5             
> DIR=/mnt/iostestmnt/fio        DEV=/dev/mapper/mpathe        
> Workload=bsr      iosched=noop    Filesz=2G   bs=4K   
> =========================================================================
> job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)    
> ---       --- --  ------------   -----------    -------------  -----------    
> bsr       1   1   126187         95272          0              0              
> bsr       1   2   185154         72908          0              0              
> bsr       1   4   224622         88037          0              0              
> bsr       1   8   285416         115592         0              0              
> bsr       1   16  348564         156846         0              0              
> 

These numbers are very similar to what I got.

> So in the case of NOOP, throughput shot up to 348MB/s but CFQ remains more or
> less constant, at about 130MB/s.
> 
> So at least in this case, a single sequential CFQ queue is not keeping the
> disk busy enough.
> 
> I am wondering why my test results were different in the past.  Maybe
> it was a different piece of hardware, and the behavior varies across hardware?

Probably.  I haven't seen this type of behavior on other hardware.

> Anyway, if that's the case, then we probably need to allow IO from
> multiple sequential readers and keep a watch on throughput.  If throughput
> drops, then reduce the number of parallel sequential readers.  I'm not sure how
> much code that is, but with multiple cfqqs going in parallel, the ioprio
> logic will more or less stop working in CFQ (on multi-spindle hardware).

Have you tested on older kernels?  Around 2.6.16 it seemed to allow more
parallel reads, but that might have been just accidental (due to I/O
being submitted in a different pattern).

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-23 10:57               ` Miklos Szeredi
@ 2010-04-24 20:36                 ` Corrado Zoccolo
  2010-04-26 13:50                   ` Vivek Goyal
  2010-04-26 19:14                   ` Vivek Goyal
  0 siblings, 2 replies; 21+ messages in thread
From: Corrado Zoccolo @ 2010-04-24 20:36 UTC (permalink / raw)
  To: Miklos Szeredi, Vivek Goyal
  Cc: Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

[-- Attachment #1: Type: text/plain, Size: 5651 bytes --]

On Fri, Apr 23, 2010 at 12:57 PM, Miklos Szeredi <mszeredi@suse.cz> wrote:
> On Thu, 2010-04-22 at 16:31 -0400, Vivek Goyal wrote:
>> On Thu, Apr 22, 2010 at 09:59:14AM +0200, Corrado Zoccolo wrote:
>> > Hi Miklos,
>> > On Wed, Apr 21, 2010 at 6:05 PM, Miklos Szeredi <mszeredi@suse.cz> wrote:
>> > > Jens, Corrado,
>> > >
>> > > Here's a graph showing the number of issued but not yet completed
>> > > requests versus time for CFQ and NOOP schedulers running the tiobench
>> > > benchmark with 8 threads:
>> > >
>> > > http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg
>> > >
>> > > It shows pretty clearly that the performance problem is due to CFQ not
>> > > issuing enough requests to fill the bandwidth.
>> > >
>> > > Is this the correct behavior of CFQ or is this a bug?
>> > This is the expected behavior from CFQ, even if it is not optimal,
>> > since we aren't able to identify multi-spindle disks yet.
>>
>> In the past we were of the opinion that for sequential workloads multi-spindle
>> disks would not matter much, as readahead logic (in the OS and possibly in
>> the hardware too) would help.  For random workloads we don't idle on the
>> single cfqq anyway, so that is fine.  But my tests now seem to be telling a
>> different story.
>>
>> I also have one FC link to an HP EVA and I am running an increasing
>> number of sequential readers to see if throughput goes up as the number of
>> readers goes up.  The results are with noop and cfq.  I do flush OS caches
>> across the runs, but I have no control over caching on the HP EVA.
>>
>> Kernel=2.6.34-rc5
>> DIR=/mnt/iostestmnt/fio        DEV=/dev/mapper/mpathe
>> Workload=bsr      iosched=cfq     Filesz=2G   bs=4K
>> =========================================================================
>> job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
>> ---       --- --  ------------   -----------    -------------  -----------
>> bsr       1   1   135366         59024          0              0
>> bsr       1   2   124256         126808         0              0
>> bsr       1   4   132921         341436         0              0
>> bsr       1   8   129807         392904         0              0
>> bsr       1   16  129988         773991         0              0
>>
>> Kernel=2.6.34-rc5
>> DIR=/mnt/iostestmnt/fio        DEV=/dev/mapper/mpathe
>> Workload=bsr      iosched=noop    Filesz=2G   bs=4K
>> =========================================================================
>> job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
>> ---       --- --  ------------   -----------    -------------  -----------
>> bsr       1   1   126187         95272          0              0
>> bsr       1   2   185154         72908          0              0
>> bsr       1   4   224622         88037          0              0
>> bsr       1   8   285416         115592         0              0
>> bsr       1   16  348564         156846         0              0
>>
>
> These numbers are very similar to what I got.
>
>> So in the case of NOOP, throughput shot up to 348MB/s but CFQ remains more or
>> less constant, at about 130MB/s.
>>
>> So at least in this case, a single sequential CFQ queue is not keeping the
>> disk busy enough.
>>
>> I am wondering why my test results were different in the past.  Maybe
>> it was a different piece of hardware, and the behavior varies across hardware?
>
> Probably.  I haven't seen this type of behavior on other hardware.
>
>> Anyway, if that's the case, then we probably need to allow IO from
>> multiple sequential readers and keep a watch on throughput.  If throughput
>> drops, then reduce the number of parallel sequential readers.  I'm not sure how
>> much code that is, but with multiple cfqqs going in parallel, the ioprio
>> logic will more or less stop working in CFQ (on multi-spindle hardware).
Hi Vivek,
I tried to implement exactly what you are proposing, see the attached patches.
I leverage the queue merging features to let multiple cfqqs share the
disk in the same timeslice.
I changed the queue split code to trigger on throughput drop instead
of on seeky pattern, so diverging queues can remain merged if they
have good throughput. Moreover, I measure the max bandwidth reached by
single queues and merged queues (you can see the values in the
bandwidth sysfs file).
If merged queues can outperform non-merged ones, the queue merging
code will try to opportunistically merge together queues that cannot
submit enough requests to fill half of the NCQ slots. I'd like to know
if you can see any improvements out of this on your hardware. There
are some magic numbers in the code, you may want to try tuning them.
Note that, since the opportunistic queue merging will start happening
only after merged queues have been shown to reach higher bandwidth than
non-merged queues, you should use the disk for a while before trying
the test (and you can check sysfs), or the merging will not happen.
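
For reference, the new sysfs file reports the last measured bandwidth
followed by the maxima reached by non-merged and by merged queues:

  cat /sys/block/sdX/queue/iosched/bandwidth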

>
> Have you tested on older kernels?  Around 2.6.16 it seemed to allow more
> parallel reads, but that might have been just accidental (due to I/O
> being submitted in a different pattern).
Is the BW for a single reader also better on 2.6.16, or is the
improvement only seen with more concurrent readers?

Thanks,
Corrado
>
> Thanks,
> Miklos
>
>

[-- Attachment #2: 0001-cfq-iosched-introduce-bandwidth-measurement.patch --]
[-- Type: application/octet-stream, Size: 3522 bytes --]

From 1d42bea919090c89b016e30482260475ea53a724 Mon Sep 17 00:00:00 2001
From: Corrado Zoccolo <czoccolo@gmail.com>
Date: Fri, 23 Apr 2010 22:21:30 +0200
Subject: [PATCH 1/2] cfq-iosched: introduce bandwidth measurement

---
 block/cfq-iosched.c |   30 ++++++++++++++++++++++++++++--
 1 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 838834b..6551726 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -54,7 +54,7 @@ static const int cfq_hist_divisor = 4;
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq)		((struct cfq_queue *) ((rq)->elevator_private2))
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -145,6 +145,7 @@ struct cfq_queue {
 	struct cfq_group *orig_cfqg;
 	/* Sectors dispatched in current dispatch round */
 	unsigned long nr_sectors;
+	unsigned transferred;
 };
 
 /*
@@ -241,6 +242,12 @@ struct cfq_data {
 	 */
 	int hw_tag_est_depth;
 	unsigned int hw_tag_samples;
+	/*
+	 * performance measurements
+	 * max_bw is indexed by coop flag.
+	 */
+	unsigned max_bw[2];
+	unsigned cur_bw;
 
 	/*
 	 * idle window management
@@ -1422,6 +1429,7 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 						cfqd->rq_in_driver);
 
 	cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
+	RQ_CFQQ(rq)->transferred += blk_rq_sectors(rq);
 }
 
 static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
@@ -1430,6 +1438,7 @@ static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
 
 	WARN_ON(!cfqd->rq_in_driver);
 	cfqd->rq_in_driver--;
+	RQ_CFQQ(rq)->transferred -= blk_rq_sectors(rq);
 	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
 						cfqd->rq_in_driver);
 }
@@ -1552,6 +1561,8 @@ static void
 __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		    bool timed_out)
 {
+	unsigned used_slice = jiffies - cfqq->slice_start;
+	unsigned bw = cfqd->max_bw[1];
 	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
 
 	if (cfq_cfqq_wait_request(cfqq))
@@ -1560,13 +1571,20 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	cfq_clear_cfqq_wait_request(cfqq);
 	cfq_clear_cfqq_wait_busy(cfqq);
 
+	if (used_slice > HZ / 100 && used_slice > 2) {
+		bw = cfqq->transferred / used_slice;
+		cfqd->cur_bw = bw;
+		cfqd->max_bw[cfq_cfqq_coop(cfqq)] =
+			max(cfqd->max_bw[cfq_cfqq_coop(cfqq)], bw);
+	}
+	cfqq->transferred = 0;
 	/*
 	 * If this cfqq is shared between multiple processes, check to
 	 * make sure that those processes are still issuing I/Os within
 	 * the mean seek distance.  If not, it may be time to break the
 	 * queues apart again.
 	 */
-	if (cfq_cfqq_coop(cfqq) && CFQQ_SEEKY(cfqq))
+	if (cfq_cfqq_coop(cfqq) && bw <= cfqd->max_bw[1] * 9/10)
 		cfq_mark_cfqq_split_coop(cfqq);
 
 	/*
@@ -3776,6 +3794,13 @@ fail:
 /*
  * sysfs parts below -->
  */
+static ssize_t bandwidth_show(struct elevator_queue *e, char *page)
+{
+	struct cfq_data *cfqd = e->elevator_data;
+	return sprintf(page, "%d\t%d\t%d\n", cfqd->cur_bw,
+		       cfqd->max_bw[0], cfqd->max_bw[1]);
+}
+
 static ssize_t
 cfq_var_show(unsigned int var, char *page)
 {
@@ -3861,6 +3886,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(slice_idle),
 	CFQ_ATTR(low_latency),
 	CFQ_ATTR(group_isolation),
+	__ATTR_RO(bandwidth),
 	__ATTR_NULL
 };
 
-- 
1.6.2.5


[-- Attachment #3: 0002-cfq-iosched-optimistic-queue-merging.patch --]
[-- Type: application/octet-stream, Size: 3158 bytes --]

From 19435f618c1072a48a66531c17218a3fbc0a41cd Mon Sep 17 00:00:00 2001
From: Corrado Zoccolo <czoccolo@gmail.com>
Date: Sat, 24 Apr 2010 15:39:31 +0200
Subject: [PATCH 2/2] cfq-iosched: optimistic queue merging.

---
 block/cfq-iosched.c |   41 +++++++++++++++++++++++++++++++++--------
 1 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6551726..4e9e015 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -146,6 +146,7 @@ struct cfq_queue {
 	/* Sectors dispatched in current dispatch round */
 	unsigned long nr_sectors;
 	unsigned transferred;
+	unsigned last_bw;
 };
 
 /*
@@ -1574,6 +1575,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	if (used_slice > HZ / 100 && used_slice > 2) {
 		bw = cfqq->transferred / used_slice;
 		cfqd->cur_bw = bw;
+		cfqq->last_bw = bw;
 		cfqd->max_bw[cfq_cfqq_coop(cfqq)] =
 			max(cfqd->max_bw[cfq_cfqq_coop(cfqq)], bw);
 	}
@@ -1696,8 +1698,9 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
 {
 	struct rb_root *root = &cfqd->prio_trees[cur_cfqq->org_ioprio];
 	struct rb_node *parent, *node;
-	struct cfq_queue *__cfqq;
+	struct cfq_queue *__cfqq, *__cfqq1;
 	sector_t sector = cfqd->last_position;
+	unsigned rs;
 
 	if (RB_EMPTY_ROOT(root))
 		return NULL;
@@ -1715,7 +1718,7 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
 	 * will contain the closest sector.
 	 */
 	__cfqq = rb_entry(parent, struct cfq_queue, p_node);
-	if (cfq_rq_close(cfqd, cur_cfqq, __cfqq->next_rq))
+	if (!CFQQ_SEEKY(__cfqq) && cfq_rq_close(cfqd, cur_cfqq, __cfqq->next_rq))
 		return __cfqq;
 
 	if (blk_rq_pos(__cfqq->next_rq) < sector)
@@ -1723,12 +1726,34 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
 	else
 		node = rb_prev(&__cfqq->p_node);
 	if (!node)
-		return NULL;
-
-	__cfqq = rb_entry(node, struct cfq_queue, p_node);
-	if (cfq_rq_close(cfqd, cur_cfqq, __cfqq->next_rq))
-		return __cfqq;
-
+		__cfqq1 = __cfqq;
+	else
+		__cfqq1 = rb_entry(node, struct cfq_queue, p_node);
+
+	if (!CFQQ_SEEKY(__cfqq1) && cfq_rq_close(cfqd, cur_cfqq, __cfqq1->next_rq))
+		return __cfqq1;
+
+	// Opportunistic queue merging could be beneficial even on far queues
+	// We enable it only on NCQ disks, if we observed that merged queues
+	// can reach higher bandwidth than single queues.
+	rs = cur_cfqq->allocated[READ] + cur_cfqq->allocated[WRITE];
+	if (cfqd->hw_tag && cfqd->max_bw[1] > cfqd->max_bw[0] &&
+	    // Do not overload the device queue
+	    rs < cfqd->hw_tag_est_depth / 2) {
+		unsigned r1 = __cfqq->allocated[READ] + __cfqq->allocated[WRITE];
+		unsigned r2 = __cfqq1->allocated[READ] + __cfqq1->allocated[WRITE];
+		// Prefer merging with a queue that has fewer pending requests
+		if (r1 > r2 && !CFQQ_SEEKY(__cfqq1)) {
+			__cfqq = __cfqq1;
+			r1 = r2;
+		}
+		// Do not merge if the merged queue would have too many requests
+		if (r1 + rs > cfqd->hw_tag_est_depth / 2)
+			return NULL;
+		// Merge only if the BW of the two queues is comparable
+		if (abs(__cfqq->last_bw - cur_cfqq->last_bw) * 4 < cfqd->max_bw[0])
+			return __cfqq;
+	}
 	return NULL;
 }
 
-- 
1.6.2.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-24 20:36                 ` Corrado Zoccolo
@ 2010-04-26 13:50                   ` Vivek Goyal
  2010-04-26 19:14                   ` Vivek Goyal
  1 sibling, 0 replies; 21+ messages in thread
From: Vivek Goyal @ 2010-04-26 13:50 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Miklos Szeredi, Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

On Sat, Apr 24, 2010 at 10:36:48PM +0200, Corrado Zoccolo wrote:

[..]
> >
> >> Anyway, if that's the case, then we probably need to allow IO from
> >> multiple sequential readers and keep a watch on throughput. If throughput
> >> drops then reduce the number of parallel sequential readers. Not sure how
> >> much code that is, but with multiple cfqqs going in parallel, ioprio
> >> logic will more or less stop working in CFQ (on multi-spindle hardware).
> Hi Vivek,
> I tried to implement exactly what you are proposing, see the attached patches.
> I leverage the queue merging features to let multiple cfqqs share the
> disk in the same timeslice.
> I changed the queue split code to trigger on throughput drop instead
> of on seeky pattern, so diverging queues can remain merged if they
> have good throughput. Moreover, I measure the max bandwidth reached by
> single queues and merged queues (you can see the values in the
> bandwidth sysfs file).
> If merged queues can outperform non-merged ones, the queue merging
> code will try to opportunistically merge together queues that cannot
> submit enough requests to fill half of the NCQ slots. I'd like to know
> if you can see any improvements out of this on your hardware. There
> are some magic numbers in the code, you may want to try tuning them.
> Note that, since the opportunistic queue merging will start happening
> only after merged queues have been shown to reach higher bandwidth than
> non-merged queues, you should use the disk for a while before trying
> the test (and you can check sysfs), or the merging will not happen.

Thanks, Corrado. Using the queue split code sounds like the right place to do it.
I will test it and report back my results.

> 
> >
> > Have you tested on older kernels?  Around 2.6.16 it seemed to allow more
> > parallel reads, but that might have been just accidental (due to I/O
> > being submitted in a different pattern).
> Is the BW for a single reader also better on 2.6.16, or is the
> improvement only seen with more concurrent readers?

I will also test 2.6.16. I am curious anyway how 2.6.16 managed to
perform better while dispatching requests from multiple cfqqs and
driving deeper queue depths. To me it is fundamental CFQ design that
only one queue gets to use the disk at a time (at least on the
sync-idle tree). So something must have been different in 2.6.16.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-24 20:36                 ` Corrado Zoccolo
  2010-04-26 13:50                   ` Vivek Goyal
@ 2010-04-26 19:14                   ` Vivek Goyal
  2010-04-27 17:25                     ` Corrado Zoccolo
  1 sibling, 1 reply; 21+ messages in thread
From: Vivek Goyal @ 2010-04-26 19:14 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Miklos Szeredi, Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

On Sat, Apr 24, 2010 at 10:36:48PM +0200, Corrado Zoccolo wrote:

[..]
> >> Anyway, if that's the case, then we probably need to allow IO from
> >> multiple sequential readers and keep a watch on throughput. If throughput
> >> drops then reduce the number of parallel sequential readers. Not sure how
> >> much code that is, but with multiple cfqqs going in parallel, ioprio
> >> logic will more or less stop working in CFQ (on multi-spindle hardware).
> Hi Vivek,
> I tried to implement exactly what you are proposing, see the attached patches.
> I leverage the queue merging features to let multiple cfqqs share the
> disk in the same timeslice.
> I changed the queue split code to trigger on throughput drop instead
> of on seeky pattern, so diverging queues can remain merged if they
> have good throughput. Moreover, I measure the max bandwidth reached by
> single queues and merged queues (you can see the values in the
> bandwidth sysfs file).
> If merged queues can outperform non-merged ones, the queue merging
> code will try to opportunistically merge together queues that cannot
> submit enough requests to fill half of the NCQ slots. I'd like to know
> if you can see any improvements out of this on your hardware. There
> are some magic numbers in the code, you may want to try tuning them.
> Note that, since the opportunistic queue merging will start happening
> only after merged queues have been shown to reach higher bandwidth than
> non-merged queues, you should use the disk for a while before trying
> the test (and you can check sysfs), or the merging will not happen.

Hi Corrado,

I ran these patches and I did not see any improvement. I think the reason
is that no cooperative queue merging took place, so we did not have
any data for throughput with the coop flag on (the third column of the
bandwidth file, max_bw[1] for coop queues, is still 0).

#cat /sys/block/dm-3/queue/iosched/bandwidth
230	753	0

I think we need to implement something similar to the hw_tag detection
logic, where we allow dispatches from multiple sync-idle queues at a time
and try to observe the BW. Once we have observed it over a certain
window, we can set the system behavior accordingly.
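
Roughly along these lines (a compilable userspace sketch of the idea
only; the struct, the window length and the sample numbers are all made
up, none of this is existing CFQ code):

	#include <stdio.h>

	#define BW_PROBE_SAMPLES 16	/* assumed observation window */

	struct bw_probe {
		unsigned samples;	/* slices seen with parallel dispatch */
		unsigned best_multi;	/* best BW with multiple sync-idle queues */
		unsigned best_single;	/* best BW with one queue at a time */
		int multi_enabled;	/* the decision once the window is over */
	};

	/* called at each slice expiry with the measured bandwidth; returns 1
	 * while still probing, i.e. while dispatch from multiple sync-idle
	 * queues should be forced just to gather data */
	static int bw_probe_update(struct bw_probe *p, unsigned bw, int was_multi)
	{
		if (was_multi) {
			if (bw > p->best_multi)
				p->best_multi = bw;
			p->samples++;
		} else if (bw > p->best_single) {
			p->best_single = bw;
		}
		if (p->samples < BW_PROBE_SAMPLES)
			return 1;
		p->multi_enabled = p->best_multi > p->best_single;
		return 0;
	}

	int main(void)
	{
		struct bw_probe p = { 0, 0, 0, 0 };
		/* fake samples where parallel dispatch reaches higher BW */
		for (int i = 0; i < 40; i++)
			bw_probe_update(&p, (i % 2 ? 700 : 400) + i, i % 2);
		printf("multi-queue dispatch enabled: %d\n", p.multi_enabled);
		return 0;
	}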

Kernel=2.6.34-rc5-corrado-multicfq
DIR= /mnt/iostmnt/fio          DEV= /dev/mapper/mpathe       
Workload=bsr       iosched=cfq      Filesz=2G    bs=4K   
==========================================================================
job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)    
---       --- --  ------------   -----------    -------------  -----------    
bsr       1   1   126590         61448          0              0              
bsr       1   2   127849         242843         0              0              
bsr       1   4   131886         508021         0              0              
bsr       1   8   131890         398241         0              0              
bsr       1   16  129167         454244         0              0              

Thanks
Vivek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-26 19:14                   ` Vivek Goyal
@ 2010-04-27 17:25                     ` Corrado Zoccolo
  2010-04-28 20:02                       ` Vivek Goyal
  0 siblings, 1 reply; 21+ messages in thread
From: Corrado Zoccolo @ 2010-04-27 17:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Miklos Szeredi, Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

On Mon, Apr 26, 2010 at 9:14 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sat, Apr 24, 2010 at 10:36:48PM +0200, Corrado Zoccolo wrote:
>
> [..]
>> >> Anyway, if that's the case, then we probably need to allow IO from
>> >> multiple sequential readers and keep a watch on throughput. If throughput
>> >> drops then reduce the number of parallel sequential readers. Not sure how
>> >> much of code that is but with multiple cfqq going in parallel, ioprio
>> >> logic will more or less stop working in CFQ (on multi-spindle hardware).
>> Hi Vivek,
>> I tried to implement exactly what you are proposing, see the attached patches.
>> I leverage the queue merging features to let multiple cfqqs share the
>> disk in the same timeslice.
>> I changed the queue split code to trigger on throughput drop instead
>> of on seeky pattern, so diverging queues can remain merged if they
>> have good throughput. Moreover, I measure the max bandwidth reached by
>> single queues and merged queues (you can see the values in the
>> bandwidth sysfs file).
>> If merged queues can outperform non-merged ones, the queue merging
>> code will try to opportunistically merge together queues that cannot
>> submit enough requests to fill half of the NCQ slots. I'd like to know
>> if you can see any improvements out of this on your hardware. There
>> are some magic numbers in the code, you may want to try tuning them.
>> Note that, since the opportunistic queue merging will start happening
>> only after merged queues have been shown to reach higher bandwidth than
>> non-merged queues, you should use the disk for a while before trying
>> the test (and you can check sysfs), or the merging will not happen.
>
> Hi Corrado,
>
> I ran these patches and I did not see any improvement. I think the reason
> is that no cooperative queue merging took place, so we did not have
> any data for throughput with the coop flag on.
>
> #cat /sys/block/dm-3/queue/iosched/bandwidth
> 230     753     0
>
> I think we need to implement something similar to the hw_tag detection
> logic, where we allow dispatches from multiple sync-idle queues at a time
> and try to observe the BW. Once we have observed it over a certain
> window, we can set the system behavior accordingly.
Hi Vivek,
thanks for testing. Can you try changing the condition to enable the
queue merging to also consider the case in which max_bw[1] == 0 &&
max_bw[0] > 100MB/s (notice that max_bw is measured in
sectors/jiffy).
This should rule out low end disks, and enable merging where it can be
beneficial.
If the results are good, but we find this enabling condition
unreliable, then we can think of a better way, but I'm curious to see
if the results are promising before thinking about the details.
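
Concretely, the enabling test in cfqq_close() could become something
like the following (an untested sketch on top of the second patch; 204
is just an example cutoff: at HZ=1000 with 512-byte sectors, 204
sectors/jiffy = 204 * 512 * 1000 B/s ~= 104 MB/s, i.e. about 100 MiB/s):

	rs = cur_cfqq->allocated[READ] + cur_cfqq->allocated[WRITE];
	if (cfqd->hw_tag &&
	    (cfqd->max_bw[1] > cfqd->max_bw[0] ||
	     (cfqd->max_bw[1] == 0 && cfqd->max_bw[0] > 204)) &&
	    /* do not overload the device queue */
	    rs < cfqd->hw_tag_est_depth / 2) {
		/* existing opportunistic merge logic */
	}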

Thanks,
Corrado

>
> Kernel=2.6.34-rc5-corrado-multicfq
> DIR= /mnt/iostmnt/fio          DEV= /dev/mapper/mpathe
> Workload=bsr       iosched=cfq      Filesz=2G    bs=4K
> ==========================================================================
> job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
> ---       --- --  ------------   -----------    -------------  -----------
> bsr       1   1   126590         61448          0              0
> bsr       1   2   127849         242843         0              0
> bsr       1   4   131886         508021         0              0
> bsr       1   8   131890         398241         0              0
> bsr       1   16  129167         454244         0              0
>
> Thanks
> Vivek
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-27 17:25                     ` Corrado Zoccolo
@ 2010-04-28 20:02                       ` Vivek Goyal
  2010-05-01 12:13                         ` Corrado Zoccolo
  0 siblings, 1 reply; 21+ messages in thread
From: Vivek Goyal @ 2010-04-28 20:02 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Miklos Szeredi, Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

On Tue, Apr 27, 2010 at 07:25:14PM +0200, Corrado Zoccolo wrote:
> On Mon, Apr 26, 2010 at 9:14 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sat, Apr 24, 2010 at 10:36:48PM +0200, Corrado Zoccolo wrote:
> >
> > [..]
> >> >> Anyway, if that's the case, then we probably need to allow IO from
> >> >> multiple sequential readers and keep a watch on throughput. If throughput
> >> >> drops then reduce the number of parallel sequential readers. Not sure how
> >> >> much code that is, but with multiple cfqqs going in parallel, ioprio
> >> >> logic will more or less stop working in CFQ (on multi-spindle hardware).
> >> Hi Vivek,
> >> I tried to implement exactly what you are proposing, see the attached patches.
> >> I leverage the queue merging features to let multiple cfqqs share the
> >> disk in the same timeslice.
> >> I changed the queue split code to trigger on throughput drop instead
> >> of on seeky pattern, so diverging queues can remain merged if they
> >> have good throughput. Moreover, I measure the max bandwidth reached by
> >> single queues and merged queues (you can see the values in the
> >> bandwidth sysfs file).
> >> If merged queues can outperform non-merged ones, the queue merging
> >> code will try to opportunistically merge together queues that cannot
> >> submit enough requests to fill half of the NCQ slots. I'd like to know
> >> if you can see any improvements out of this on your hardware. There
> >> are some magic numbers in the code, you may want to try tuning them.
> >> Note that, since the opportunistic queue merging will start happening
> >> only after merged queues have been shown to reach higher bandwidth than
> >> non-merged queues, you should use the disk for a while before trying
> >> the test (and you can check sysfs), or the merging will not happen.
> >
> > Hi Corrado,
> >
> > I ran these patches and I did not see any improvement. I think the reason
> > is that no cooperative queue merging took place, so we did not have
> > any data for throughput with the coop flag on.
> >
> > #cat /sys/block/dm-3/queue/iosched/bandwidth
> > 230     753     0
> >
> > I think we need to implement something similar to the hw_tag detection
> > logic, where we allow dispatches from multiple sync-idle queues at a time
> > and try to observe the BW. Once we have observed it over a certain
> > window, we can set the system behavior accordingly.
> Hi Vivek,
> thanks for testing. Can you try changing the condition to enable the
> queue merging to also consider the case in which max_bw[1] == 0 &&
> max_bw[0] > 100MB/s (notice that max_bw is measured in
> sectors/jiffy).
> This should rule out low end disks, and enable merging where it can be
> beneficial.
> If the results are good, but we find this enabling condition
> unreliable, then we can think of a better way, but I'm curious to see
> if the results are promising before thinking about the details.

Ok, I made some changes and now some queue merging seems to be happening
and I am getting a little better BW. This will require more debugging. I
will try to spare some time later.

Kernel=2.6.34-rc5-corrado-multicfq
DIR= /mnt/iostmnt/fio          DEV= /dev/mapper/mpathe       
Workload=bsr       iosched=cfq      Filesz=1G    bs=16K  
==========================================================================
job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)    
---       --- --  ------------   -----------    -------------  -----------    
bsr       1   1   136457         67353          0              0              
bsr       1   2   148008         192415         0              0              
bsr       1   4   180223         535205         0              0              
bsr       1   8   166983         542326         0              0              
bsr       1   16  176617         832188         0              0              

Output of iosched/bandwidth (cur_bw, max_bw[0], max_bw[1]):

0	546	740

I did the following changes on top of your patch.

Vivek

---
 block/cfq-iosched.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4e9e015..7589c44 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -243,6 +243,7 @@ struct cfq_data {
 	 */
 	int hw_tag_est_depth;
 	unsigned int hw_tag_samples;
+	unsigned int cfqq_merged_samples;
 	/*
 	 * performance measurements
 	 * max_bw is indexed by coop flag.
@@ -1736,10 +1737,14 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
 	// Opportunistic queue merging could be beneficial even on far queues
 	// We enable it only on NCQ disks, if we observed that merged queues
 	// can reach higher bandwidth than single queues.
+	// 204 sectors per jiffy is equivalent to 100MB/s on 1000 HZ conf.
+	// Allow merge if we don't have sufficient merged cfqq samples.
 	rs = cur_cfqq->allocated[READ] + cur_cfqq->allocated[WRITE];
-	if (cfqd->hw_tag && cfqd->max_bw[1] > cfqd->max_bw[0] &&
+	if (cfqd->hw_tag
+	   && (cfqd->max_bw[1] > cfqd->max_bw[0]
+	       || (cfqd->max_bw[0] > 204 && !sample_valid(cfqd->cfqq_merged_samples)))
 	    // Do not overload the device queue
-	    rs < cfqd->hw_tag_est_depth / 2) {
+	    && rs < cfqd->hw_tag_est_depth / 2) {
 		unsigned r1 = __cfqq->allocated[READ] + __cfqq->allocated[WRITE];
 		unsigned r2 = __cfqq1->allocated[READ] + __cfqq1->allocated[WRITE];
 		// Prefer merging with a queue that has fewer pending requests
@@ -1750,6 +1755,8 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
 		// Do not merge if the merged queue would have too many requests
 		if (r1 + rs > cfqd->hw_tag_est_depth / 2)
 			return NULL;
+		cfqd->cfqq_merged_samples++;
+
 		// Merge only if the BW of the two queues is comparable
 		if (abs(__cfqq->last_bw - cur_cfqq->last_bw) * 4 < cfqd->max_bw[0])
 			return __cfqq;
-- 
1.6.2.5

> 
> Thanks,
> Corrado
> 
> >
> > Kernel=2.6.34-rc5-corrado-multicfq
> > DIR= /mnt/iostmnt/fio          DEV= /dev/mapper/mpathe
> > Workload=bsr       iosched=cfq      Filesz=2G    bs=4K
> > ==========================================================================
> > job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
> > ---       --- --  ------------   -----------    -------------  -----------
> > bsr       1   1   126590         61448          0              0
> > bsr       1   2   127849         242843         0              0
> > bsr       1   4   131886         508021         0              0
> > bsr       1   8   131890         398241         0              0
> > bsr       1   16  129167         454244         0              0
> >
> > Thanks
> > Vivek
> >
> 
> 
> 
> -- 
> __________________________________________________________________________
> 
> dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------
> The self-confidence of a warrior is not the self-confidence of the average
> man. The average man seeks certainty in the eyes of the onlooker and calls
> that self-confidence. The warrior seeks impeccability in his own eyes and
> calls that humbleness.
>                                Tales of Power - C. Castaneda

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-04-28 20:02                       ` Vivek Goyal
@ 2010-05-01 12:13                         ` Corrado Zoccolo
  2010-06-14 17:59                           ` Miklos Szeredi
  0 siblings, 1 reply; 21+ messages in thread
From: Corrado Zoccolo @ 2010-05-01 12:13 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Miklos Szeredi, Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

On Wed, Apr 28, 2010 at 10:02 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Tue, Apr 27, 2010 at 07:25:14PM +0200, Corrado Zoccolo wrote:
>> On Mon, Apr 26, 2010 at 9:14 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Sat, Apr 24, 2010 at 10:36:48PM +0200, Corrado Zoccolo wrote:
>> >
>> > [..]
>> >> >> Anyway, if that's the case, then we probably need to allow IO from
>> >> >> multiple sequential readers and keep a watch on throughput. If throughput
>> >> >> drops then reduce the number of parallel sequential readers. Not sure how
>> >> >> much code that is, but with multiple cfqqs going in parallel, ioprio
>> >> >> logic will more or less stop working in CFQ (on multi-spindle hardware).
>> >> Hi Vivek,
>> >> I tried to implement exactly what you are proposing, see the attached patches.
>> >> I leverage the queue merging features to let multiple cfqqs share the
>> >> disk in the same timeslice.
>> >> I changed the queue split code to trigger on throughput drop instead
>> >> of on seeky pattern, so diverging queues can remain merged if they
>> >> have good throughput. Moreover, I measure the max bandwidth reached by
>> >> single queues and merged queues (you can see the values in the
>> >> bandwidth sysfs file).
>> >> If merged queues can outperform non-merged ones, the queue merging
>> >> code will try to opportunistically merge together queues that cannot
>> >> submit enough requests to fill half of the NCQ slots. I'd like to know
>> >> if you can see any improvements out of this on your hardware. There
>> >> are some magic numbers in the code, you may want to try tuning them.
>> >> Note that, since the opportunistic queue merging will start happening
>> >> only after merged queues have been shown to reach higher bandwidth than
>> >> non-merged queues, you should use the disk for a while before trying
>> >> the test (and you can check sysfs), or the merging will not happen.
>> >
>> > Hi Corrado,
>> >
>> > I ran these patches and I did not see any improvement. I think the reason
>> > is that no cooperative queue merging took place, so we did not have
>> > any data for throughput with the coop flag on.
>> >
>> > #cat /sys/block/dm-3/queue/iosched/bandwidth
>> > 230     753     0
>> >
>> > I think we need to implement something similar to the hw_tag detection
>> > logic, where we allow dispatches from multiple sync-idle queues at a time
>> > and try to observe the BW. Once we have observed it over a certain
>> > window, we can set the system behavior accordingly.
>> Hi Vivek,
>> thanks for testing. Can you try changing the condition to enable the
>> queue merging to also consider the case in which max_bw[1] == 0 &&
>> max_bw[0] > 100MB/s (notice that max_bw is measured in
>> sectors/jiffy).
>> This should rule out low end disks, and enable merging where it can be
>> beneficial.
>> If the results are good, but we find this enabling condition
>> unreliable, then we can think of a better way, but I'm curious to see
>> if the results are promising before thinking about the details.
>
> Ok, I made some changes and now some queue merging seems to be happening
> and I am getting a little better BW. This will require more debugging. I
> will try to spare some time later.
>
> Kernel=2.6.34-rc5-corrado-multicfq
> DIR= /mnt/iostmnt/fio          DEV= /dev/mapper/mpathe
> Workload=bsr       iosched=cfq      Filesz=1G    bs=16K
> ==========================================================================
> job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
> ---       --- --  ------------   -----------    -------------  -----------
> bsr       1   1   136457         67353          0              0
> bsr       1   2   148008         192415         0              0
> bsr       1   4   180223         535205         0              0
> bsr       1   8   166983         542326         0              0
> bsr       1   16  176617         832188         0              0
>
> Output of iosched/bandwidth
>
> 0       546     740
>
> I did the following changes on top of your patch.
This is becoming interesting. I think a major limitation of the
current approach is that it is too easy for a merged queue to be
separated again.
My code:
   if (cfq_cfqq_coop(cfqq) && bw <= cfqd->max_bw[1] * 9/10)
                cfq_mark_cfqq_split_coop(cfqq);
will immediately split any merged queue as soon as max_bw[1] grows too
much, so it should be based on max_bw[0].
Moreover, this code will likely split off all cics from the merged
queue, while it would be much better to split off only the cics that
are receiving less than their fair share of the BW (this will also
improve the fairness of the scheduler when queues are merged).
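
That is, the split check should rather read something like (untested):

	if (cfq_cfqq_coop(cfqq) && bw <= cfqd->max_bw[0] * 9 / 10)
		cfq_mark_cfqq_split_coop(cfqq);

so a merged queue is split only when it falls below ~90% of the best
bandwidth observed for single queues, instead of being compared against
its own ever-growing max_bw[1].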

Corrado
>
> Vivek
>
> ---
>  block/cfq-iosched.c |   11 +++++++++--
>  1 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 4e9e015..7589c44 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -243,6 +243,7 @@ struct cfq_data {
>         */
>        int hw_tag_est_depth;
>        unsigned int hw_tag_samples;
> +       unsigned int cfqq_merged_samples;
>        /*
>         * performance measurements
>         * max_bw is indexed by coop flag.
> @@ -1736,10 +1737,14 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
>        // Opportunistic queue merging could be beneficial even on far queues
>        // We enable it only on NCQ disks, if we observed that merged queues
>        // can reach higher bandwidth than single queues.
> +       // 204 sectors per jiffy is equivalent to 100MB/s on 1000 HZ conf.
> +       // Allow merge if we don't have sufficient merged cfqq samples.
>        rs = cur_cfqq->allocated[READ] + cur_cfqq->allocated[WRITE];
> -       if (cfqd->hw_tag && cfqd->max_bw[1] > cfqd->max_bw[0] &&
> +       if (cfqd->hw_tag
> +          && (cfqd->max_bw[1] > cfqd->max_bw[0]
> +              || (cfqd->max_bw[0] > 204 && !sample_valid(cfqd->cfqq_merged_samples)))
>            // Do not overload the device queue
> -           rs < cfqd->hw_tag_est_depth / 2) {
> +           && rs < cfqd->hw_tag_est_depth / 2) {
>                unsigned r1 = __cfqq->allocated[READ] + __cfqq->allocated[WRITE];
>                unsigned r2 = __cfqq1->allocated[READ] + __cfqq1->allocated[WRITE];
>                // Prefer merging with a queue that has fewer pending requests
> @@ -1750,6 +1755,8 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
>                // Do not merge if the merged queue would have too many requests
>                if (r1 + rs > cfqd->hw_tag_est_depth / 2)
>                        return NULL;
> +               cfqd->cfqq_merged_samples++;
> +
>                // Merge only if the BW of the two queues is comparable
>                if (abs(__cfqq->last_bw - cur_cfqq->last_bw) * 4 < cfqd->max_bw[0])
>                        return __cfqq;
> --
> 1.6.2.5
>
>>
>> Thanks,
>> Corrado
>>
>> >
>> > Kernel=2.6.34-rc5-corrado-multicfq
>> > DIR= /mnt/iostmnt/fio          DEV= /dev/mapper/mpathe
>> > Workload=bsr       iosched=cfq      Filesz=2G    bs=4K
>> > ==========================================================================
>> > job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
>> > ---       --- --  ------------   -----------    -------------  -----------
>> > bsr       1   1   126590         61448          0              0
>> > bsr       1   2   127849         242843         0              0
>> > bsr       1   4   131886         508021         0              0
>> > bsr       1   8   131890         398241         0              0
>> > bsr       1   16  129167         454244         0              0
>> >
>> > Thanks
>> > Vivek
>> >
>>
>>
>>
>> --
>> __________________________________________________________________________
>>
>> dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
>> PhD - Department of Computer Science - University of Pisa, Italy
>> --------------------------------------------------------------------------
>> The self-confidence of a warrior is not the self-confidence of the average
>> man. The average man seeks certainty in the eyes of the onlooker and calls
>> that self-confidence. The warrior seeks impeccability in his own eyes and
>> calls that humbleness.
>>                                Tales of Power - C. Castaneda
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-05-01 12:13                         ` Corrado Zoccolo
@ 2010-06-14 17:59                           ` Miklos Szeredi
  2010-06-14 18:06                             ` Vivek Goyal
  0 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2010-06-14 17:59 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Vivek Goyal, Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

On Sat, 2010-05-01 at 14:13 +0200, Corrado Zoccolo wrote:
> On Wed, Apr 28, 2010 at 10:02 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Tue, Apr 27, 2010 at 07:25:14PM +0200, Corrado Zoccolo wrote:
> >> On Mon, Apr 26, 2010 at 9:14 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> > On Sat, Apr 24, 2010 at 10:36:48PM +0200, Corrado Zoccolo wrote:
> >> >
> >> > [..]
> >> >> >> Anyway, if that's the case, then we probably need to allow IO from
> >> >> >> multiple sequential readers and keep a watch on throughput. If throughput
> >> >> >> drops then reduce the number of parallel sequential readers. Not sure how
> >> >> >> much code that is, but with multiple cfqqs going in parallel, ioprio
> >> >> >> logic will more or less stop working in CFQ (on multi-spindle hardware).
> >> >> Hi Vivek,
> >> >> I tried to implement exactly what you are proposing, see the attached patches.
> >> >> I leverage the queue merging features to let multiple cfqqs share the
> >> >> disk in the same timeslice.
> >> >> I changed the queue split code to trigger on throughput drop instead
> >> >> of on seeky pattern, so diverging queues can remain merged if they
> >> >> have good throughput. Moreover, I measure the max bandwidth reached by
> >> >> single queues and merged queues (you can see the values in the
> >> >> bandwidth sysfs file).
> >> >> If merged queues can outperform non-merged ones, the queue merging
> >> >> code will try to opportunistically merge together queues that cannot
> >> >> submit enough requests to fill half of the NCQ slots. I'd like to know
> >> >> if you can see any improvements out of this on your hardware. There
> >> >> are some magic numbers in the code, you may want to try tuning them.
> >> >> Note that, since the opportunistic queue merging will start happening
> >> >> only after merged queues have been shown to reach higher bandwidth than
> >> >> non-merged queues, you should use the disk for a while before trying
> >> >> the test (and you can check sysfs), or the merging will not happen.
> >> >
> >> > Hi Corrado,
> >> >
> >> > I ran these patches and I did not see any improvement. I think the reason
> >> > is that no cooperative queue merging took place, so we did not have
> >> > any data for throughput with the coop flag on.
> >> >
> >> > #cat /sys/block/dm-3/queue/iosched/bandwidth
> >> > 230     753     0
> >> >
> >> > I think we need to implement something similar to the hw_tag detection
> >> > logic, where we allow dispatches from multiple sync-idle queues at a time
> >> > and try to observe the BW. Once we have observed it over a certain
> >> > window, we can set the system behavior accordingly.
> >> Hi Vivek,
> >> thanks for testing. Can you try changing the condition to enable the
> >> queue merging to also consider the case in which max_bw[1] == 0 &&
> >> max_bw[0] > 100MB/s (notice that max_bw is measured in
> >> sectors/jiffy).
> >> This should rule out low end disks, and enable merging where it can be
> >> beneficial.
> >> If the results are good, but we find this enabling condition
> >> unreliable, then we can think of a better way, but I'm curious to see
> >> if the results are promising before thinking about the details.
> >
> > Ok, I made some changes and now some queue merging seems to be happening
> > and I am getting a little better BW. This will require more debugging. I
> > will try to spare some time later.
> >
> > Kernel=2.6.34-rc5-corrado-multicfq
> > DIR= /mnt/iostmnt/fio          DEV= /dev/mapper/mpathe
> > Workload=bsr       iosched=cfq      Filesz=1G    bs=16K
> > ==========================================================================
> > job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
> > ---       --- --  ------------   -----------    -------------  -----------
> > bsr       1   1   136457         67353          0              0
> > bsr       1   2   148008         192415         0              0
> > bsr       1   4   180223         535205         0              0
> > bsr       1   8   166983         542326         0              0
> > bsr       1   16  176617         832188         0              0
> >
> > Output of iosched/bandwidth
> >
> > 0       546     740
> >
> > I did the following changes on top of your patch.
> This is becoming interesting. I think a major limitation of the
> current approach is that it is too easy for a merged queue to be
> separated again.
> My code:
>    if (cfq_cfqq_coop(cfqq) && bw <= cfqd->max_bw[1] * 9/10)
>                 cfq_mark_cfqq_split_coop(cfqq);
> will immediately split any merged queue as soon as max_bw[1] grows too
> much, so it should be based on max_bw[0].
> Moreover, this code will likely split off all cics from the merged
> queue, while it would be much better to split off only the cics that
> are receiving less than their fair share of the BW (this will also
> improve the fairness of the scheduler when queues are merged).

Is there any update on the status of this issue?

Thanks,
Miklos


> Corrado
> >
> > Vivek
> >
> > ---
> >  block/cfq-iosched.c |   11 +++++++++--
> >  1 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index 4e9e015..7589c44 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -243,6 +243,7 @@ struct cfq_data {
> >         */
> >        int hw_tag_est_depth;
> >        unsigned int hw_tag_samples;
> > +       unsigned int cfqq_merged_samples;
> >        /*
> >         * performance measurements
> >         * max_bw is indexed by coop flag.
> > @@ -1736,10 +1737,14 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
> >        // Opportunistic queue merging could be beneficial even on far queues
> >        // We enable it only on NCQ disks, if we observed that merged queues
> >        // can reach higher bandwidth than single queues.
> > +       // 204 sectors per jiffy is equivalent to 100MB/s on 1000 HZ conf.
> > +       // Allow merge if we don't have sufficient merged cfqq samples.
> >        rs = cur_cfqq->allocated[READ] + cur_cfqq->allocated[WRITE];
> > -       if (cfqd->hw_tag && cfqd->max_bw[1] > cfqd->max_bw[0] &&
> > +       if (cfqd->hw_tag
> > +          && (cfqd->max_bw[1] > cfqd->max_bw[0]
> > +              || (cfqd->max_bw[0] > 204 && !sample_valid(cfqd->cfqq_merged_samples)))
> >            // Do not overload the device queue
> > -           rs < cfqd->hw_tag_est_depth / 2) {
> > +           && rs < cfqd->hw_tag_est_depth / 2) {
> >                unsigned r1 = __cfqq->allocated[READ] + __cfqq->allocated[WRITE];
> >                unsigned r2 = __cfqq1->allocated[READ] + __cfqq1->allocated[WRITE];
> >                // Prefer merging with a queue that has fewer pending requests
> > @@ -1750,6 +1755,8 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
> >                // Do not merge if the merged queue would have too many requests
> >                if (r1 + rs > cfqd->hw_tag_est_depth / 2)
> >                        return NULL;
> > +               cfqd->cfqq_merged_samples++;
> > +
> >                // Merge only if the BW of the two queues is comparable
> >                if (abs(__cfqq->last_bw - cur_cfqq->last_bw) * 4 < cfqd->max_bw[0])
> >                        return __cfqq;
> > --
> > 1.6.2.5
> >
> >>
> >> Thanks,
> >> Corrado
> >>
> >> >
> >> > Kernel=2.6.34-rc5-corrado-multicfq
> >> > DIR= /mnt/iostmnt/fio          DEV= /dev/mapper/mpathe
> >> > Workload=bsr       iosched=cfq      Filesz=2G    bs=4K
> >> > ==========================================================================
> >> > job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
> >> > ---       --- --  ------------   -----------    -------------  -----------
> >> > bsr       1   1   126590         61448          0              0
> >> > bsr       1   2   127849         242843         0              0
> >> > bsr       1   4   131886         508021         0              0
> >> > bsr       1   8   131890         398241         0              0
> >> > bsr       1   16  129167         454244         0              0
> >> >
> >> > Thanks
> >> > Vivek
> >> >
> >>
> >>
> >>
> >> --
> >> __________________________________________________________________________
> >>
> >> dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
> >> PhD - Department of Computer Science - University of Pisa, Italy
> >> --------------------------------------------------------------------------
> >> The self-confidence of a warrior is not the self-confidence of the average
> >> man. The average man seeks certainty in the eyes of the onlooker and calls
> >> that self-confidence. The warrior seeks impeccability in his own eyes and
> >> calls that humbleness.
> >>                                Tales of Power - C. Castaneda
> >
> 
> 
> 



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: CFQ read performance regression
  2010-06-14 17:59                           ` Miklos Szeredi
@ 2010-06-14 18:06                             ` Vivek Goyal
  0 siblings, 0 replies; 21+ messages in thread
From: Vivek Goyal @ 2010-06-14 18:06 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Corrado Zoccolo, Jens Axboe, linux-kernel, Jan Kara, Suresh Jayaraman

On Mon, Jun 14, 2010 at 07:59:13PM +0200, Miklos Szeredi wrote:
> On Sat, 2010-05-01 at 14:13 +0200, Corrado Zoccolo wrote:
> > On Wed, Apr 28, 2010 at 10:02 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > On Tue, Apr 27, 2010 at 07:25:14PM +0200, Corrado Zoccolo wrote:
> > >> On Mon, Apr 26, 2010 at 9:14 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > >> > On Sat, Apr 24, 2010 at 10:36:48PM +0200, Corrado Zoccolo wrote:
> > >> >
> > >> > [..]
> > >> >> >> Anyway, if that's the case, then we probably need to allow IO from
> > >> >> >> multiple sequential readers and keep a watch on throughput. If throughput
> > >> >> >> drops then reduce the number of parallel sequential readers. Not sure how
> > >> >> >> much code that is, but with multiple cfqqs going in parallel, ioprio
> > >> >> >> logic will more or less stop working in CFQ (on multi-spindle hardware).
> > >> >> Hi Vivek,
> > >> >> I tried to implement exactly what you are proposing, see the attached patches.
> > >> >> I leverage the queue merging features to let multiple cfqqs share the
> > >> >> disk in the same timeslice.
> > >> >> I changed the queue split code to trigger on throughput drop instead
> > >> >> of on seeky pattern, so diverging queues can remain merged if they
> > >> >> have good throughput. Moreover, I measure the max bandwidth reached by
> > >> >> single queues and merged queues (you can see the values in the
> > >> >> bandwidth sysfs file).
> > >> >> If merged queues can outperform non-merged ones, the queue merging
> > >> >> code will try to opportunistically merge together queues that cannot
> > >> >> submit enough requests to fill half of the NCQ slots. I'd like to know
> > >> >> if you can see any improvements out of this on your hardware. There
> > >> >> are some magic numbers in the code, you may want to try tuning them.
> > >> >> Note that, since the opportunistic queue merging will start happening
> > >> >> only after merged queues have been shown to reach higher bandwidth than
> > >> >> non-merged queues, you should use the disk for a while before trying
> > >> >> the test (and you can check sysfs), or the merging will not happen.
> > >> >
> > >> > Hi Corrado,
> > >> >
> > >> > I ran these patches and I did not see any improvement. I think the reason
> > >> > is that no cooperative queue merging took place, so we did not have
> > >> > any data for throughput with the coop flag on.
> > >> >
> > >> > #cat /sys/block/dm-3/queue/iosched/bandwidth
> > >> > 230     753     0
> > >> >
> > >> > I think we need to implement something similar to the hw_tag detection
> > >> > logic, where we allow dispatches from multiple sync-idle queues at a time
> > >> > and try to observe the BW. Once we have observed it over a certain
> > >> > window, we can set the system behavior accordingly.
> > >> Hi Vivek,
> > >> thanks for testing. Can you try changing the condition to enable the
> > >> queue merging to also consider the case in which max_bw[1] == 0 &&
> > >> max_bw[0] > 100MB/s (notice that max_bw is measured in
> > >> sectors/jiffy).
> > >> This should rule out low end disks, and enable merging where it can be
> > >> beneficial.
> > >> If the results are good, but we find this enabling condition
> > >> unreliable, then we can think of a better way, but I'm curious to see
> > >> if the results are promising before thinking about the details.
> > >
> > > Ok, I made some changes and now some queue merging seems to be happening
> > > and I am getting a little better BW. This will require more debugging. I
> > > will try to spare some time later.
> > >
> > > Kernel=2.6.34-rc5-corrado-multicfq
> > > DIR= /mnt/iostmnt/fio          DEV= /dev/mapper/mpathe
> > > Workload=bsr       iosched=cfq      Filesz=1G    bs=16K
> > > ==========================================================================
> > > job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
> > > ---       --- --  ------------   -----------    -------------  -----------
> > > bsr       1   1   136457         67353          0              0
> > > bsr       1   2   148008         192415         0              0
> > > bsr       1   4   180223         535205         0              0
> > > bsr       1   8   166983         542326         0              0
> > > bsr       1   16  176617         832188         0              0
> > >
> > > Output of iosched/bandwidth
> > >
> > > 0       546     740
> > >
> > > I did the following changes on top of your patch.
> > This is becoming interesting. I think a major limitation of the
> > current approach is that it is too easy for a merged queue to be
> > separated again.
> > My code:
> >    if (cfq_cfqq_coop(cfqq) && bw <= cfqd->max_bw[1] * 9/10)
> >                 cfq_mark_cfqq_split_coop(cfqq);
> > will immediately split any merged queue as soon as max_bw[1] grows too
> > much, so it should be based on max_bw[0].
> > Moreover, this code will likely split off all cics from the merged
> > queue, while it would be much better to split off only the cics that
> > are receiving less than their fair share of the BW (this will also
> > improve the fairness of the scheduler when queues are merged).
> 
> Is there any update on the status of this issue?

How about running cfq with slice_idle=0 on high-end storage? This should make
it very close to deadline behavior.
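
That is just a sysfs tweak, e.g. (with sdX standing for the actual
device):

	echo 0 > /sys/block/sdX/queue/iosched/slice_idle

With slice_idle set to 0, cfq no longer idles waiting for more IO from
the current queue, so it keeps dispatching from multiple queues and
drives deeper queue depths, much like deadline does.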

There has not been any further progress on my end on merging more
sequential queues to achieve better throughput.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2010-06-14 18:07 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-16 12:27 CFQ read performance regression Miklos Szeredi
2010-04-16 17:06 ` Chris
2010-04-17 12:46 ` Corrado Zoccolo
2010-04-19 11:46   ` Miklos Szeredi
2010-04-20 20:50     ` Corrado Zoccolo
2010-04-21 13:25       ` Miklos Szeredi
2010-04-21 16:05         ` Miklos Szeredi
2010-04-22  7:59           ` Corrado Zoccolo
2010-04-22 10:23             ` Miklos Szeredi
2010-04-22 15:53               ` Jan Kara
2010-04-23 10:48                 ` Miklos Szeredi
2010-04-22 20:31             ` Vivek Goyal
2010-04-23 10:57               ` Miklos Szeredi
2010-04-24 20:36                 ` Corrado Zoccolo
2010-04-26 13:50                   ` Vivek Goyal
2010-04-26 19:14                   ` Vivek Goyal
2010-04-27 17:25                     ` Corrado Zoccolo
2010-04-28 20:02                       ` Vivek Goyal
2010-05-01 12:13                         ` Corrado Zoccolo
2010-06-14 17:59                           ` Miklos Szeredi
2010-06-14 18:06                             ` Vivek Goyal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).