* Benchmarking results: DSS elapsed time values w/ rq_affinity=0/1 - Jens' for-2.6.28 tree
From: Alan D. Brunelle @ 2008-09-05 16:19 UTC
  To: linux-kernel; +Cc: Jens Axboe

Some DSS results from a 32-way ia64 machine set up to analyze Oracle
OLTP & DSS loads (128 GB RAM, >200 disks). The data collected are the
elapsed times for DSS runs w/ 128 MBRs and 128 readers, running on a
kernel built from Jens Axboe's origin/for-2.6.28 tree. I alternated
runs, setting rq_affinity to 0 or 1 for all disks at the beginning of
each run.
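
Something along these lines could drive that per-run toggling; this is
only a minimal sketch assuming the standard
/sys/block/<dev>/queue/rq_affinity knob, not the actual script used for
these runs:

#!/usr/bin/env python
# Minimal sketch: set rq_affinity for every block device that exposes the
# sysfs knob before starting a run. Device selection is illustrative, not
# the actual ~200-disk test configuration.
import glob
import sys

def set_rq_affinity(value):
    for path in glob.glob("/sys/block/*/queue/rq_affinity"):
        try:
            with open(path, "w") as f:
                f.write(str(value))
        except IOError:
            # Some devices may reject the write; just skip them.
            pass

if __name__ == "__main__":
    set_rq_affinity(int(sys.argv[1]))  # invoke with 0 or 1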

There are a total of 68 data points for each alternative, and the
overall results show a decided improvement for this type of load with
rq_affinity set to 1:

rq=0: min=27.440000 avg=27.980500 max=28.500000 sdev=0.296827
rq=1: min=26.900000 avg=27.071500 max=27.480000 sdev=0.125169
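
(These summary figures are plain per-sample statistics; the sketch below
shows how they could be recomputed from the raw elapsed times. The sample
list is a placeholder, not the 68 actual measurements.)

# Sketch: min/avg/max/sdev for a list of elapsed-time samples. Placeholder
# data only; sdev here is the sample standard deviation, which may differ
# slightly from whatever the reporting tool computed.
import math

def summarize(samples):
    n = len(samples)
    avg = sum(samples) / float(n)
    sdev = math.sqrt(sum((x - avg) ** 2 for x in samples) / (n - 1))
    return min(samples), avg, max(samples), sdev

rq0 = [27.44, 28.50, 27.98, 28.10]  # placeholder rq_affinity=0 elapsed times
print("min=%f avg=%f max=%f sdev=%f" % summarize(rq0))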

Not only is the average elapsed time reduced by about 3.25%, the
run-to-run deviations are also much smaller. For a pictorial
representation, see the graph at

http://free.linux.hp.com/~adb/jens/08-09-05/dss.png

The red and green areas illustrate the delta from the average for all
the data points with that rq_affinity setting. (Red being rq_affinity=0,
green being rq_affinity=1.)

I collected some vmstat & iostat data and will be evaluating that as
well; time permitting, I'll also look into lockstat & profiling data.

The system has been set up as part of a collaboration between HP & Red
Hat's Linux performance teams, and we've been using it to analyze
performance characteristics of Oracle loads on large-ish systems, as
well as for evaluating potential code changes.

Alan D. Brunelle
HP Linux Kernel Technology Team


* Re: Benchmarking results: DSS elapsed time values w/ rq_affinity=0/1 - Jens' for-2.6.28 tree
From: Alan D. Brunelle @ 2008-09-06 23:21 UTC
  To: linux-kernel; +Cc: Jens Axboe

Here are some results from runs where we varied the number of
readers & the multi-block read counts. Five runs were done per
rq_affinity setting, and the averages are plotted at:

http://free.linux.hp.com/~adb/jens/08-09-05/by_mbr.jpeg

In all cases we see a noticeable reduction in the elapsed time to
perform the tasks, and again the deviations are much tighter with
rq_affinity set to 1:

                          Min   Avg   Max   Std Dev
                         ----- ----- -----  --------
mbrs= 32 nrdrs= 64 rq=0: 30.57 31.50 34.06  1.456108
mbrs= 32 nrdrs= 64 rq=1: 29.27 29.59 29.96  0.325469   6.05% improvement

mbrs= 32 nrdrs=128 rq=0: 28.14 28.48 29.32  0.480208
mbrs= 32 nrdrs=128 rq=1: 27.54 27.88 28.48  0.359194   2.11% improvement

mbrs= 32 nrdrs=256 rq=0: 33.05 33.70 34.30  0.548151
mbrs= 32 nrdrs=256 rq=1: 33.10 33.36 33.74  0.257158   1.01% improvement

mbrs= 64 nrdrs= 64 rq=0: 30.53 30.74 31.10  0.255441
mbrs= 64 nrdrs= 64 rq=1: 29.40 29.65 29.91  0.187216   3.55% improvement

mbrs= 64 nrdrs=128 rq=0: 28.09 28.79 29.23  0.484149
mbrs= 64 nrdrs=128 rq=1: 27.73 27.96 28.33  0.226429   2.89% improvement

mbrs= 64 nrdrs=256 rq=0: 33.35 34.04 34.76  0.518816
mbrs= 64 nrdrs=256 rq=1: 33.02 33.13 33.25  0.088034   2.67% improvement

mbrs=128 nrdrs= 64 rq=0: 30.37 30.75 31.23  0.329439
mbrs=128 nrdrs= 64 rq=1: 29.20 29.49 29.82  0.221179   4.08% improvement

mbrs=128 nrdrs=128 rq=0: 28.04 28.54 29.00  0.392785
mbrs=128 nrdrs=128 rq=1: 27.76 28.08 28.26  0.190840   1.63% improvement

mbrs=128 nrdrs=256 rq=0: 33.37 33.89 34.30  0.448297
mbrs=128 nrdrs=256 rq=1: 33.04 33.30 33.56  0.203175   1.76% improvement

mbrs=256 nrdrs= 64 rq=0: 30.55 30.80 30.94  0.167392
mbrs=256 nrdrs= 64 rq=1: 29.23 29.57 29.91  0.305156   3.99% improvement

mbrs=256 nrdrs=128 rq=0: 28.38 28.82 29.20  0.305172
mbrs=256 nrdrs=128 rq=1: 27.78 27.92 28.11  0.142583   3.12% improvement

mbrs=256 nrdrs=256 rq=0: 33.25 34.21 34.88  0.598398
mbrs=256 nrdrs=256 rq=1: 33.11 33.23 33.48  0.154499   2.88% improvement
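
(The "% improvement" column is simply the relative drop in the average
elapsed time going from rq_affinity=0 to rq_affinity=1; a tiny sketch of
the calculation, using the first row as an example:)

# Sketch: the "% improvement" column as the relative reduction in average
# elapsed time between rq_affinity=0 and rq_affinity=1.
def improvement(avg_rq0, avg_rq1):
    return 100.0 * (avg_rq0 - avg_rq1) / avg_rq0

# First row (mbrs=32, nrdrs=64): prints ~6.06%; the quoted 6.05% was
# presumably computed from unrounded averages.
print("%.2f%%" % improvement(31.50, 29.59))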



* Re: Benchmarking results: DSS elapsed time values w/ rq_affinity=0/1  - Jens' for-2.6.28 tree
From: Jens Axboe @ 2008-09-08 18:10 UTC
  To: Alan D. Brunelle; +Cc: linux-kernel

On Sat, Sep 06 2008, Alan D. Brunelle wrote:
> Here are some results from runs where we varied the number of
> readers & the multi-block read counts. Five runs were done per
> rq_affinity setting, and the averages are plotted at:
> 
> http://free.linux.hp.com/~adb/jens/08-09-05/by_mbr.jpeg
> 
> In all cases we see a noticeable reduction in the elapsed time to
> perform the tasks, and again the deviations are much tighter with
> rq_affinity set to 1:
> 
>                           Min   Avg   Max   Std Dev
>                          ----- ----- -----  --------
> mbrs= 32 nrdrs= 64 rq=0: 30.57 31.50 34.06  1.456108
> mbrs= 32 nrdrs= 64 rq=1: 29.27 29.59 29.96  0.325469   6.05% improvement
> 
> mbrs= 32 nrdrs=128 rq=0: 28.14 28.48 29.32  0.480208
> mbrs= 32 nrdrs=128 rq=1: 27.54 27.88 28.48  0.359194   2.11% improvement
> 
> mbrs= 32 nrdrs=256 rq=0: 33.05 33.70 34.30  0.548151
> mbrs= 32 nrdrs=256 rq=1: 33.10 33.36 33.74  0.257158   1.01% improvement
> 
> mbrs= 64 nrdrs= 64 rq=0: 30.53 30.74 31.10  0.255441
> mbrs= 64 nrdrs= 64 rq=1: 29.40 29.65 29.91  0.187216   3.55% improvement
> 
> mbrs= 64 nrdrs=128 rq=0: 28.09 28.79 29.23  0.484149
> mbrs= 64 nrdrs=128 rq=1: 27.73 27.96 28.33  0.226429   2.89% improvement
> 
> mbrs= 64 nrdrs=256 rq=0: 33.35 34.04 34.76  0.518816
> mbrs= 64 nrdrs=256 rq=1: 33.02 33.13 33.25  0.088034   2.67% improvement
> 
> mbrs=128 nrdrs= 64 rq=0: 30.37 30.75 31.23  0.329439
> mbrs=128 nrdrs= 64 rq=1: 29.20 29.49 29.82  0.221179   4.08% improvement
> 
> mbrs=128 nrdrs=128 rq=0: 28.04 28.54 29.00  0.392785
> mbrs=128 nrdrs=128 rq=1: 27.76 28.08 28.26  0.190840   1.63% improvement
> 
> mbrs=128 nrdrs=256 rq=0: 33.37 33.89 34.30  0.448297
> mbrs=128 nrdrs=256 rq=1: 33.04 33.30 33.56  0.203175   1.76% improvement
> 
> mbrs=256 nrdrs= 64 rq=0: 30.55 30.80 30.94  0.167392
> mbrs=256 nrdrs= 64 rq=1: 29.23 29.57 29.91  0.305156   3.99% improvement
> 
> mbrs=256 nrdrs=128 rq=0: 28.38 28.82 29.20  0.305172
> mbrs=256 nrdrs=128 rq=1: 27.78 27.92 28.11  0.142583   3.12% improvement
> 
> mbrs=256 nrdrs=256 rq=0: 33.25 34.21 34.88  0.598398
> mbrs=256 nrdrs=256 rq=1: 33.11 33.23 33.48  0.154499   2.88% improvement

Thanks a lot for these numbers Alan, it definitely looks like a clear
win (and a pretty big one) for all of the above and the previous mail.
It would be interesting to see sys and usr times separately, as well as
to compare profiles of the two runs. In the testing I did on a 4-way
ppc box, lock contention and bouncing were way down with XFS and
btrfs. I didn't test other file systems yet. I saw mean acquisition and
hold time reductions in the 20-30% range and wait time reductions of
over 40% in just simple metadata-intensive fs testing.

For the casual reader, let me try to explain what the rq_affinity
toggle does in the 2.6.28 block tree. On the process side, the
scheduler will move processes around between cores in the system as it
deems most beneficial, or you can tie a process down to one or more
CPUs explicitly. On the IRQ side things are a little more rigid: you can
set IRQ affinity statically for a device (controller, typically) and
that's about it. With the new IO CPU affinity feature in the block
layer, you have full control over where you get your hardware completion
event. When you set rq_affinity=1, the block layer will route the
completion of a request to the CPU that submitted it. It does that by
utilizing the new generic SMP IPI code that was included in 2.6.26.
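
As a concrete illustration of the two knobs being contrasted here, below
is a minimal sketch that reads the static per-IRQ CPU mask and the
per-queue rq_affinity setting; the IRQ number and device name are
placeholders, not taken from any particular test setup:

# Sketch: the two affinity controls discussed above.
#  - /proc/irq/<N>/smp_affinity: static CPU mask for a device's interrupt.
#  - /sys/block/<dev>/queue/rq_affinity: when set to 1, the block layer
#    routes a request's completion back to the CPU that submitted it
#    (via the generic SMP IPI code).
# The IRQ number and device name below are placeholders.
IRQ = 77
DEV = "sdc"

with open("/proc/irq/%d/smp_affinity" % IRQ) as f:
    print("IRQ %d CPU mask: %s" % (IRQ, f.read().strip()))

with open("/sys/block/%s/queue/rq_affinity" % DEV) as f:
    print("%s rq_affinity: %s" % (DEV, f.read().strip()))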

There's also the option of letting the submitter specify a CPU. I
haven't played much with that yet, but the intention is to have the file
system pass down such a hint to keep lock contention down. So when
the file system knows that it will update file system structures (and
thus grab one or more locks) on IO completion, it can ask for completion
on a limited range of CPUs instead of having those locks bounced around
between all cores. Right now it supports just a single CPU, like the
rq_affinity switch, but it can easily be extended to support a mask of
cores instead. It used to include that for the case where you would
statically set a range of completion CPUs, but I removed it since the
option made little sense.

-- 
Jens Axboe



* Re: Benchmarking results: DSS elapsed time values w/ rq_affinity=0/1 - Jens' for-2.6.28 tree
From: Alan D. Brunelle @ 2008-09-08 22:30 UTC
  To: Jens Axboe; +Cc: linux-kernel

Jens Axboe wrote:
> On Sat, Sep 06 2008, Alan D. Brunelle wrote:
>> Here are some results from runs where we varied the number of
>> readers & the multi-block read counts. Five runs were done per
>> rq_affinity setting, and the averages are plotted at:
>>
>> http://free.linux.hp.com/~adb/jens/08-09-05/by_mbr.jpeg
>>
> Thanks a lot for these numbers Alan, it definitely looks like a clear
> win (and a pretty big one) for all of the above and the previous mail.
> It would be interesting to see sys and usr times separately, as well as
> to compare profiles of the two runs. In the testing I did on a 4-way
> ppc box, lock contention and bouncing were way down with XFS and
> btrfs. I didn't test other file systems yet. I saw mean acquisition and
> hold time reductions in the 20-30% range and wait time reductions of
> over 40% in just simple metadata-intensive fs testing.

Jens:

The graph at:

http://free.linux.hp.com/~adb/jens/09-08-05/p_stats2.png

may or may not help clarify some things (the p_stats2.agr file in the
same directory can be fed into xmgrace; it may display better than the
rendered .png file).

The bottom graph shows reads (as measured by iostat); above that are
the %user, %system and (%user+%system) values (also from iostat).
Black lines are for rq_affinity=0 and red lines are for rq_affinity=1.

/All/ values presented are averaged out over the 68 runs I did.

When rq_affinity=1, it appears that we attain the peak performance
/much/ quicker, and then we plateau out (gated by SOMETHING...). You'll
note that the red lines "terminate" quicker, as the work is more
front-loaded.

I don't see a large delta in %system between the two - and what is there
appears to be proportional to the increased I/O bandwidth. The increase
in %user also seems to be in proportion to the I/O (which is in
proportion to the DSS load capable of being performed).

I'm not sure if this helps much, but I think it may help answer part of
your question regarding %user + %sys. I'll work on some lock & profile
stuff on Tuesday (9/9/08).

Alan


* Re: Benchmarking results: DSS elapsed time values w/ rq_affinity=0/1 - Jens' for-2.6.28 tree
From: Alan D. Brunelle @ 2008-09-09 16:54 UTC
  To: Jens Axboe; +Cc: linux-kernel

Jens Axboe wrote:
> Thanks a lot for these numbers Alan, it definitely looks like a clear
> win (and a pretty big one) for all of the above and the previous mail.
> It would be interesting to see sys and usr times separately, as well as
> to compare profiles of the two runs. In the testing I did on a 4-way
> ppc box, lock contention and bouncing were way down with XFS and
> btrfs. I didn't test other file systems yet. I saw mean acquisition and
> hold time reductions in the 20-30% range and wait time reductions of
> over 40% in just simple metadata-intensive fs testing.

Unfortunately, ia64 does /not/ currently support the standard lockstat
reporting interface. However, I was able to utilize Caliper (an HP ia64
profiler similar to Oprofile) and gathered a couple of interesting
values. These are averaged over 10 runs each (10 w/ rq_affinity=0 and 10
w/ rq_affinity=1, alternating between the two).

First, for the overall system we can gauge how efficient the instruction
stream is by looking at un-stalled instructions: w/ rq_affinity set to 0
we see 24.001% of the instructions un-stalled, whilst w/ rq_affinity set
to 1 we see 24.469% un-stalled (1.95% better w/ rq_affinity set to 1).

Looking at the gross percentage of cycles attributed to the application
(Oracle): w/ rq_affinity set to 0 about 70.123% of the cycles are
attributed to Oracle, whilst w/ rq_affinity set to 1 we see 71.520% of
the cycles used by Oracle (1.99% better w/ rq_affinity set to 1).

Overall stats for the two results posted above:

Unstalled values:
rq=0: min=22.180 avg=24.001 max=27.220 sdev=1.881 range=5.040
rq=1: min=23.580 avg=24.469 max=24.930 sdev=0.486 range=1.350

%Cycles attributed to Oracle:
rq=0: min=69.890 avg=70.231 max=71.170 sdev=0.379 range=1.280
rq=1: min=71.420 avg=71.520 max=71.670 sdev=0.080 range=0.250
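
(A quick check of the quoted relative gains against these averages; a
short sketch:)

# Sketch: reproduce the quoted relative gains from the averaged values.
# Note: the prose above uses 70.123 for the rq=0 Oracle-cycle figure while
# the summary line shows 70.231; the quoted 1.99% follows from the former.
for name, rq0_avg, rq1_avg in [("un-stalled instructions", 24.001, 24.469),
                               ("cycles in Oracle", 70.123, 71.520)]:
    print("%s: +%.2f%%" % (name, 100.0 * (rq1_avg - rq0_avg) / rq0_avg))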

Again, in both sets we see a /much/ smaller deviation from the average
amongst all the run results when rq_affinity is set to 1.

I'm going to see if an older lock stat mechanism still works w/
2.6.27-rc5-ish kernels on ia64, and try that.

Alan

