* IO queueing and complete affinity w/ threads: Some results
From: Alan D. Brunelle @ 2008-02-11 20:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jens Axboe, npiggin, dgc, arjan

The test case chosen may not be a very good starting point, but here are some initial results with the "nasty arch bits". This was performed on a 32-way ia64 box with 1 terabyte of RAM and 144 FC disks (contained in 24 HP MSA1000 RAID controllers attached to 12 dual-port adapters). Each test case was run for 3 minutes, with one application per device performing a large amount of direct/asynchronous large reads. Here's the table of results, with an explanation below (results cover all 144 devices, either accumulated (MBPS) or averaged (other columns)):

A Q C |  MBPS   Avg Lat StdDev |  Q-local Q-remote | C-local C-remote
----- | ------ -------- ------ | -------- -------- | ------- --------
X X X | 3859.9 1.190067 0.0502 |      0.0  19484.7 |     0.0   9758.8
X X A | 3856.3 1.191220 0.0490 |      0.0  19467.2 |     0.0   9750.1
X X I | 3850.3 1.192992 0.0508 |      0.0  19437.3 |  9735.1      0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
X A X | 3853.9 1.191891 0.0503 |  19455.4      0.0 |     0.0   9744.2
X A A | 3853.5 1.191935 0.0507 |  19453.2      0.0 |     0.0   9743.1
X A I | 3856.6 1.191043 0.0512 |  19468.7      0.0 |  9750.8      0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
X I X | 3854.7 1.191674 0.0491 |      0.0  19459.8 |     0.0   9746.4
X I A | 3855.3 1.191434 0.0501 |      0.0  19461.9 |     0.0   9747.4
X I I | 3856.2 1.191128 0.0506 |      0.0  19466.6 |  9749.8      0.0
===== | ====== ======== ====== | ======== ======== | ======= ========
I X X | 3857.0 1.190987 0.0500 |      0.0  19471.9 |     0.0   9752.5
I X A | 3856.5 1.191082 0.0496 |      0.0  19469.4 |  9751.2      0.0
I X I | 3853.7 1.191938 0.0500 |      0.0  19456.2 |  9744.6      0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
I A X | 3854.8 1.191675 0.0502 |  19461.5      0.0 |     0.0   9747.2
I A A | 3855.1 1.191464 0.0503 |  19464.0      0.0 |  9748.5      0.0
I A I | 3854.9 1.191627 0.0483 |  19461.7      0.0 |  9747.4      0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
I I X | 3853.4 1.192070 0.0484 |  19454.8      0.0 |     0.0   9743.9
I I A | 3852.2 1.192403 0.0502 |  19448.5      0.0 |  9740.8      0.0
I I I | 3854.0 1.191822 0.0499 |  19457.9      0.0 |  9745.5      0.0
===== | ====== ======== ====== | ======== ======== | ======= ========
rq=0  | 3854.8 1.191680 0.0480 |  19459.7      0.0 |   202.9   9543.5
rq=1  | 3854.0 1.191965 0.0483 |  19457.0      0.0 |   403.1   9341.9
----- | ------ -------- ------ | -------- -------- | ------- --------

The variables being played with:

'A' - When set to 'X', the application was placed on a CPU other than the one handling IRQs for the device (in another cell); a sketch of this kind of CPU pinning follows this list.

'Q' - When set to 'X', the queue affinity was placed in a cell other than the one containing the application, the completion, or the IRQ; when set to 'A', it was pegged to the same CPU as the application; when set to 'I', it was set to the CPU managing the IRQ for the device.

'C' - Likewise for the completion affinity: 'X' means a cell other than the one containing the application, the queueing, or the IRQ-handling CPU; 'A' means the same CPU as the application; and 'I' means the same CPU as the IRQ handler.

o  For the last two rows, we set Q == C == -1 and let the application go to any CPU (as dictated by the scheduler), with 'rq_affinity' set to 0 or 1 respectively.
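
For reference, the 'A'/'Q'/'C' placements above boil down to pinning a task onto a chosen CPU. A minimal sketch of that kind of pinning (hypothetical - this is not the harness actually used for these runs, just an illustration of the mechanism):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Hypothetical helper - pin the calling task to a single CPU. */
static int pin_to_cpu(int cpu)
{
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        return sched_setaffinity(0, sizeof(mask), &mask);  /* pid 0 == calling task */
}

int main(void)
{
        if (pin_to_cpu(1))          /* CPU number is illustrative only */
                perror("sched_setaffinity");
        /* ... the pinned task then drives the IO for its device ... */
        return 0;
}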

The resulting columns include:

MBPS - Total megabytes per second (so we're seeing about 3.8 gigabytes per second across the system)
Avg Lat - Average measured per-IO latency in seconds (note: I had upwards of 128 x 256KB IOs in flight per device across the system)
StdDev - Average standard deviation of the latencies across the devices

Q-local & Q-remote refer to the average number of queue operations handled locally and remotely, respectively. (Average per device)
C-local & C-remote refer to the average number of completion operations handled locally and remotely, respectively. (Average per device)

As noted above, I'm not sure this is the best test case - it's rather artificial. I was hoping to see some differences based on affinitization, but while there appear to be some trends, the results are so close (0.2% difference from best to worst case MBPS, and the standard deviations on the latencies overlap within the groups) that I doubt there is anything definitive here. Unfortunately, most of the disks are being used for real data right now, so I can't perform significant write tests (with file systems in place, say), which would be more realistic. I do have access to about 24 of the disks, so I will try to put file systems on those and run some tests. [I won't be able to use XFS without going through some hoops - it's a Red Hat installation right now, and they don't support XFS out of the box...]

BTW: The Q/C local/remote columns were put in place to make sure that I had things set up right, and for the first 18 cases I think they look correct. For the rq cases at the end, I /think/ what is happening is that on occasion the application ends up on the CPU handling the IRQ, which makes some operations local - but most of the time (due to the pseudo-random nature of the initial process placement) we end up away from the IRQ-handling CPU, and thus the queue/completion handling is remote. The disparity between the Q and C counts is due to merging - we issue (and hence complete) fewer IOs than are submitted to the block IO layer (here it looks to be about 2-to-1).

Alan



* Re: IO queueing and complete affinity w/ threads: Some results
From: Alan D. Brunelle @ 2008-02-12 20:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jens Axboe, npiggin, dgc, arjan

While running a series of file-system-related loads on our 32-way*, I dropped down to a 16-way with only 24 disks and ran two kernels: the original set of Jens' patches, and then his subsequent kthreads-based set. Here are the results:

Original:
A Q C |  MBPS   Avg Lat StdDev |  Q-local Q-remote | C-local C-remote
----- | ------ -------- ------ | -------- -------- | ------- --------
X X X | 1850.4 0.413880 0.0109 |      0.0  55860.8 |     0.0  27946.9
X X A | 1850.6 0.413848 0.0106 |      0.0  55859.2 |     0.0  27946.1
X X I | 1850.6 0.413830 0.0107 |      0.0  55858.5 | 27945.8      0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
X A X | 1850.0 0.413949 0.0106 |  55843.7      0.0 |     0.0  27938.3
X A A | 1850.2 0.413931 0.0107 |  55844.2      0.0 |     0.0  27938.6
X A I | 1850.4 0.413862 0.0107 |  55854.3      0.0 | 27943.7      0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
X I X | 1850.9 0.413764 0.0107 |      0.0  55866.2 |     0.0  27949.6
X I A | 1850.5 0.413854 0.0108 |      0.0  55855.0 |     0.0  27944.0
X I I | 1850.4 0.413848 0.0105 |      0.0  55854.6 | 27943.8      0.0
===== | ====== ======== ====== | ======== ======== | ======= ========
I X X | 1570.7 0.487686 0.0142 |      0.0  47406.1 |     0.0  23719.5
I X A | 1570.8 0.487666 0.0143 |      0.0  47409.3 | 23721.2      0.0
I X I | 1570.8 0.487664 0.0142 |      0.0  47410.7 | 23721.8      0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
I A X | 1570.9 0.487642 0.0144 |  47412.2      0.0 |     0.0  23722.6
I A A | 1570.8 0.487647 0.0141 |  47411.2      0.0 | 23722.1      0.0
I A I | 1570.8 0.487651 0.0143 |  47410.8      0.0 | 23721.9      0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
I I X | 1570.8 0.487683 0.0142 |  47410.2      0.0 |     0.0  23721.6
I I A | 1571.1 0.487591 0.0146 |  47415.0      0.0 | 23724.0      0.0
I I I | 1571.0 0.487623 0.0143 |  47412.5      0.0 | 23722.8      0.0
===== | ====== ======== ====== | ======== ======== | ======= ========
rq=0  | 1726.7 0.443562 0.0120 |  52118.6      0.0 |  2138.6  23937.2
rq=1  | 1820.5 0.420729 0.0110 |  54938.2      0.0 |     0.0  27485.6
----- | ------ -------- ------ | -------- -------- | ------- --------


kthreads-based:
A Q C |  MBPS   Avg Lat StdDev |  Q-local Q-remote | C-local C-remote
----- | ------ -------- ------ | -------- -------- | ------- --------
X X X | 1850.5 0.413867 0.0107 |      0.0  55854.7 |     0.0  27943.8
X X A | 1850.9 0.413763 0.0107 |      0.0  55867.0 |     0.0  27950.0
X X I | 1850.3 0.413911 0.0109 |      0.0  55849.0 | 27941.0      0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
X A X | 1851.0 0.413730 0.0107 |  55871.4      0.0 |     0.0  27952.2
X A A | 1850.1 0.413919 0.0107 |  55845.5      0.0 |     0.0  27939.2
X A I | 1850.8 0.413789 0.0108 |  55864.8      0.0 | 27948.9      0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
X I X | 1850.5 0.413849 0.0107 |      0.0  55856.5 |     0.0  27944.8
X I A | 1850.6 0.413818 0.0108 |      0.0  55860.2 |     0.0  27946.6
X I I | 1850.8 0.413764 0.0108 |      0.0  55866.7 | 27949.8      0.0
===== | ====== ======== ====== | ======== ======== | ======= ========
I X X | 1570.9 0.487662 0.0145 |      0.0  47410.1 |     0.0  23721.6
I X A | 1570.7 0.487691 0.0142 |      0.0  47406.9 | 23720.0      0.0
I X I | 1570.7 0.487688 0.0141 |      0.0  47406.5 | 23719.8      0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
I A X | 1570.9 0.487661 0.0144 |  47415.4      0.0 |     0.0  23724.2
I A A | 1570.8 0.487648 0.0141 |  47409.1      0.0 | 23721.0      0.0
I A I | 1570.7 0.487667 0.0141 |  47406.1      0.0 | 23719.5      0.0
----- | ------ -------- ------ | -------- -------- | ------- --------
I I X | 1570.8 0.487691 0.0142 |  47409.3      0.0 |     0.0  23721.2
I I A | 1570.9 0.487644 0.0142 |  47408.8      0.0 | 23720.9      0.0
I I I | 1570.6 0.487671 0.0141 |  47412.5      0.0 | 23722.8      0.0
===== | ====== ======== ====== | ======== ======== | ======= ========
rq=0  | 1742.1 0.439676 0.0118 |  52578.1      0.0 |  3602.6  22703.0
rq=1  | 1745.0 0.438918 0.0115 |  52666.3      0.0 |  3473.0  22876.6
----- | ------ -------- ------ | -------- -------- | ------- --------

For the first 18 sets the results are very similar on both kernels; the last two (rq=0/1) sets are perturbed too much by application placement, I would guess. I have to think about that some more.

Alan
* What I'm doing on the 32-way is comparing and contrasting mkfs, untar, kernel make & kernel clean times with different combinations of Q, C and RQ. [[This is currently with the "Jens original" patch; if things go well, I can do an overnight run with the kthreads-based patch.]]


* Re: IO queueing and complete affinity w/ threads: Some results
From: Alan D. Brunelle @ 2008-02-12 22:08 UTC (permalink / raw)
  To: Alan D. Brunelle; +Cc: linux-kernel, Jens Axboe, npiggin, dgc, arjan

Back on the 32-way: in this set of tests we're running 12 disks spread across the 8 cells of the 32-way. Each disk has an Ext2 FS placed on it, a clean Linux kernel source tree untarred onto it, then a full make (-j4) and a make clean performed. The 12 series are run in parallel - so each disk has:

mkfs
tar x
make
make clean

performed. This was repeated ten times, and the overall averages are presented below - note this is Jens' original patch sequence, NOT the kthreads one (those results available tomorrow, hopefully).

mkfs        Min     Avg     Max   Std Dev 
--------- ------- ------- ------- -------
q0.c0.rq0  17.814  30.322  33.263   4.551 
q0.c0.rq1  17.540  30.058  32.885   4.321 
q0.c1.rq0  17.770  31.328  32.958   3.121 
q1.c0.rq0  17.907  31.032  32.767   3.515 
q1.c1.rq0  16.891  30.319  33.097   4.624 

untar       Min     Avg     Max   Std Dev 
--------- ------- ------- ------- -------
q0.c0.rq0  19.747  21.971  26.292   1.215 
q0.c0.rq1  19.680  22.365  36.395   2.010 
q0.c1.rq0  18.823  21.390  24.455   0.976 
q1.c0.rq0  18.433  21.500  23.371   1.009 
q1.c1.rq0  19.414  21.761  34.115   1.378 

make        Min     Avg     Max   Std Dev 
--------- ------- ------- ------- -------
q0.c0.rq0 527.418 543.296 552.030   5.384 
q0.c0.rq1 526.265 542.312 549.477   5.467 
q0.c1.rq0 528.935 544.940 553.823   4.746 
q1.c0.rq0 529.432 544.399 553.212   5.166 
q1.c1.rq0 527.638 543.577 551.323   5.478 

clean       Min     Avg     Max   Std Dev 
--------- ------- ------- ------- -------
q0.c0.rq0  16.962  20.308  33.775   3.179 
q0.c0.rq1  17.436  20.156  29.370   3.097 
q0.c1.rq0  17.061  20.111  31.504   2.791 
q1.c0.rq0  16.745  20.247  29.327   2.953 
q1.c1.rq0  17.346  20.316  31.178   3.283 

Hopefully the first column is self-explanatory - these are the settings applied to the queue_affinity, completion_affinity and rq_affinity tunables. Because the standard deviations are so large and the average results so close, I'm not seeing anything in this set of tests to favor any of the combinations...

As noted, I will have the machine run the kthreads variant of the patch stream tonight, and then I have to go back and run an unpatched kernel to see if there are any /regressions/.

Alan



* Re: IO queueing and complete affinity w/ threads: Some results
From: Alan D. Brunelle @ 2008-02-12 22:26 UTC (permalink / raw)
  To: Alan D. Brunelle; +Cc: linux-kernel, Jens Axboe, npiggin, dgc, arjan

Alan D. Brunelle wrote:

> 
> Hopefully, the first column is self-explanatory - these are the settings applied to the queue_affinity, completion_affinity and rq_affinity tunables. Due to the fact that the standard deviations are so large coupled with the very close average results, I'm not seeing anything in this set of tests to favor any of the combinations...
> 

Not quite:

Q or C = 0 really means Q or C set to -1 (the default); Q or C = 1 means placing that thread on the CPU managing the IRQ. Sorry...

<sigh>
Alan


* Re: IO queueing and complete affinity w/ threads: Some results
From: Alan D. Brunelle @ 2008-02-13 15:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jens Axboe, npiggin, dgc, arjan

Comparative results between the original affinity patch and the kthreads-based patch on the 32-way running the kernel make sequence. 

It may be easier to compare/contrast with the graphs provided at http://free.linux.hp.com/~adb/jens/kernmk.png (kernmk.agr also provided, if you want to run xmgrace by hand). 

Tests are:

1. Make Ext2 FS on each of 12 64GB devices in parallel, times include: mkfs, mount & unmount
2. Untar a full Linux source code tree onto the devices in parallel, times include: mount, untar, unmount
3. Make (-j4) of the full source code tree, times include: mount, make -j4, unmount
4. Clean full source code tree, times include: mount, make clean, unmount

The results are so close amongst all the runs (given the large-ish standard deviations) that we probably can't deduce much from this. A bit of a concern on the top two graphs - mkfs & untar - is that the kthreads version certainly appears to be a little slower (about a 2.9% difference across the values for the mkfs runs, and 3.5% for the untar operations). On the make runs, however, we saw hardly any difference between the runs at all...

We are trying to set up some AIM7 tests on a different system over the weekend (15 February - 18 February 2008); I'll post those results on the 18th or 19th if we can pull it off. [I'll also try to steal time on the 32-way to run a straight 2.6.24 kernel, do these runs again, and post those results.]

For the tables below:

 q0 == queue_affinity set to -1
 q1 == queue_affinity set to the CPU managing the IRQ for each device
 c0 == completion_affinity set to -1
 c1 == completion_affinity set to the CPU managing the IRQ for each device
rq0 == rq_affinity set to 0
rq1 == rq_affinity set to 1
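
For reference, switching a single device between these settings amounts to writing the values into per-queue sysfs attributes. A minimal sketch follows - note that these attributes come from Jens' patches (they are not in mainline 2.6.24), and the exact attribute names and paths are my assumption based on the tunable names above:

#include <stdio.h>

/* Hypothetical helper - write one value into a block-queue sysfs attribute.
 * The paths/names are assumed from the tunable names, not verified. */
static int set_queue_attr(const char *dev, const char *attr, int val)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", val);
        return fclose(f);
}

int main(void)
{
        /* e.g. the q1.c0.rq0 configuration for one (placeholder) device */
        set_queue_attr("sdX", "queue_affinity", 2);        /* CPU managing this device's IRQ (illustrative) */
        set_queue_attr("sdX", "completion_affinity", -1);  /* default */
        set_queue_attr("sdX", "rq_affinity", 0);
        return 0;
}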

This 4-test sequence was run 10 times (for each kernel), and the results averaged. As posted yesterday, here are the results for the original patch sequence:

mkfs        Min     Avg     Max   Std Dev 
--------- ------- ------- ------- -------
q0.c0.rq0  17.814  30.322  33.263   4.551 
q0.c0.rq1  17.540  30.058  32.885   4.321 
q0.c1.rq0  17.770  31.328  32.958   3.121 
q1.c0.rq0  17.907  31.032  32.767   3.515 
q1.c1.rq0  16.891  30.319  33.097   4.624 

untar       Min     Avg     Max   Std Dev 
--------- ------- ------- ------- -------
q0.c0.rq0  19.747  21.971  26.292   1.215 
q0.c0.rq1  19.680  22.365  36.395   2.010 
q0.c1.rq0  18.823  21.390  24.455   0.976 
q1.c0.rq0  18.433  21.500  23.371   1.009 
q1.c1.rq0  19.414  21.761  34.115   1.378 

make        Min     Avg     Max   Std Dev 
--------- ------- ------- ------- -------
q0.c0.rq0 527.418 543.296 552.030   5.384 
q0.c0.rq1 526.265 542.312 549.477   5.467 
q0.c1.rq0 528.935 544.940 553.823   4.746 
q1.c0.rq0 529.432 544.399 553.212   5.166 
q1.c1.rq0 527.638 543.577 551.323   5.478 

clean       Min     Avg     Max   Std Dev 
--------- ------- ------- ------- -------
q0.c0.rq0  16.962  20.308  33.775   3.179 
q0.c0.rq1  17.436  20.156  29.370   3.097 
q0.c1.rq0  17.061  20.111  31.504   2.791 
q1.c0.rq0  16.745  20.247  29.327   2.953 
q1.c1.rq0  17.346  20.316  31.178   3.283 

And for the kthreads-based kernel:

mkfs        Min     Avg     Max   Std Dev 
--------- ------- ------- ------- -------
q0.c0.rq0  16.686  31.069  33.361   3.452 
q0.c0.rq1  16.976  31.719  32.869   2.395 
q0.c1.rq0  16.857  31.345  33.410   3.209 
q1.c0.rq0  17.317  31.997  34.444   3.099 
q1.c1.rq0  16.791  32.266  33.378   2.035 

untar       Min     Avg     Max   Std Dev 
--------- ------- ------- ------- -------
q0.c0.rq0  19.769  22.398  25.196   1.076 
q0.c0.rq1  19.742  22.517  38.498   1.733 
q0.c1.rq0  20.071  22.698  36.160   2.259 
q1.c0.rq0  19.910  22.377  35.640   1.528 
q1.c1.rq0  19.448  22.339  24.887   0.926 

make        Min     Avg     Max   Std Dev 
--------- ------- ------- ------- -------
q0.c0.rq0 526.971 542.820 550.591   4.607 
q0.c0.rq1 527.320 544.422 550.504   3.798 
q0.c1.rq0 527.367 543.856 550.331   4.152 
q1.c0.rq0 527.406 543.636 552.947   4.315 
q1.c1.rq0 528.921 544.594 550.832   3.786 

clean       Min     Avg     Max   Std Dev 
--------- ------- ------- ------- -------
q0.c0.rq0  16.644  20.242  29.524   2.991 
q0.c0.rq1  16.942  20.008  29.729   2.845 
q0.c1.rq0  17.205  20.117  29.851   2.661 
q1.c0.rq0  17.400  20.147  32.581   2.862 
q1.c1.rq0  16.799  20.072  31.883   2.872 




* Re: IO queueing and complete affinity w/ threads: Some results
From: Alan D. Brunelle @ 2008-02-14 15:36 UTC (permalink / raw)
  To: Alan D. Brunelle; +Cc: linux-kernel, Jens Axboe, npiggin, dgc, arjan

Taking a step back, I went to a very simple test environment:

o  4-way IA64
o  2 disks (on separate RAID controllers, handled by separate ports on the same FC HBA - generating different IRQs).
o  Using write-cached tests (keeping all IOs inside the RAID controller's cache, so no perturbations due to platter accesses)

Basically:

o  CPU 0 handled IRQs for /dev/sds
o  CPU 2 handled IRQs for /dev/sdaa

We placed an IO generator on CPU 1 (for /dev/sds) and CPU 3 (for /dev/sdaa). The IO generator performed 4KiB sequential direct AIOs in a very small range (2MB - well within the controller cache on the external storage device). We have found that this is a simple way to maximize throughput, and thus watch the system for effects without worrying about seek and other platter-induced issues. Each test took about 6 minutes to run (a fixed amount of IO was performed, so we could compare & contrast system measurements).
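
A simplified stand-in for that generator is sketched below. It only illustrates the access pattern (synchronous 4KiB O_DIRECT accesses walking a 2MB window); the real harness used asynchronous IO with a deep queue, and the runs above were cached writes rather than the non-destructive reads shown here:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK    4096
#define RANGE  (2 * 1024 * 1024)        /* stay well inside the controller cache */

int main(int argc, char **argv)
{
        void *buf;
        off_t off = 0;
        long i, count = 1000000;        /* arbitrary; the real runs did a fixed IO total */
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <device>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0 || posix_memalign(&buf, BLK, BLK)) {
                perror("setup");
                return 1;
        }
        for (i = 0; i < count; i++) {
                if (pread(fd, buf, BLK, off) != BLK) {
                        perror("pread");
                        break;
                }
                off = (off + BLK) % RANGE;      /* sequential, wrapping in the 2MB window */
        }
        close(fd);
        return 0;
}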

First: overall performance

2.6.24 (no patches)              : 106.90 MB/sec

2.6.24 + original patches + rq=0 : 103.09 MB/sec
                            rq=1 :  98.81 MB/sec

2.6.24 + kthreads patches + rq=0 : 106.85 MB/sec
                            rq=1 : 107.16 MB/sec

So the kthreads patches work much better here - on par with or better than straight 2.6.24. I also ran Caliper (akin to OProfile, but proprietary and ia64-specific, sorry) and looked at the cycles used. On ia64, back-end bubbles are deadly, and can be caused by cache misses &c. Looking at the gross data:

Kernel                                CPU_CYCLES       BACK END BUBBLES  100.0 * (BEB/CC)
--------------------------------   -----------------  -----------------  ----------------
2.6.24 (no patches)              : 2,357,215,454,852    231,547,237,267   9.8%

2.6.24 + original patches + rq=0 : 2,444,895,579,790    242,719,920,828   9.9%
                            rq=1 : 2,551,175,203,455    148,586,145,513   5.8%

2.6.24 + kthreads patches + rq=0 : 2,359,376,156,043    255,563,975,526  10.8%
                            rq=1 : 2,350,539,631,362    208,888,961,094   8.9%

For both the original & kthreads patches we see a /significant/ drop in bubbles when setting rq=1 over rq=0. This shows up as extra CPU cycles available (not spent in %system) - a graph is provided at http://free.linux.hp.com/~adb/jens/cached_mps.png showing the stats extracted from running mpstat in conjunction with the IO runs.

Combining %sys & %soft IRQ (the '% sys' column below), we see:

Kernel                              % user     % sys   % iowait   % idle
--------------------------------   --------  --------  --------  --------
2.6.24 (no patches)              :   0.141%   10.088%   43.949%   45.819%

2.6.24 + original patches + rq=0 :   0.123%   11.361%   43.507%   45.008%
                            rq=1 :   0.156%    6.030%   44.021%   49.794%

2.6.24 + kthreads patches + rq=0 :   0.163%   10.402%   43.744%   45.686%
                            rq=1 :   0.156%    8.160%   41.880%   49.804%

The good news (I think) is that even with rq=0 the kthreads patches give on-par performance with 2.6.24, so the default case should be OK...

I've only done a few runs by hand with this - these results are from one representative run out of the bunch - but at least this (I believe) shows what this patch stream is intending to do: optimize placement of IO completion handling to minimize cache & TLB disruptions. Freeing up cycles in the kernel is always helpful! :-)

I'm going to try similar runs on an AMD64 box with OProfile and see what results I get there... (BTW: I'll be dropping testing of the original patch sequence; the kthreads patches look better in general - both in terms of code & results, coincidence?)

Alan



* Re: IO queueing and complete affinity w/ threads: Some results
From: Jens Axboe @ 2008-02-18 12:37 UTC (permalink / raw)
  To: Alan D. Brunelle; +Cc: linux-kernel, npiggin, dgc, arjan

On Thu, Feb 14 2008, Alan D. Brunelle wrote:
> Taking a step back, I went to a very simple test environment:
>
> [... full setup, Caliper and mpstat details quoted from the previous message ...]
>
> First: overall performance
>
> 2.6.24 (no patches)              : 106.90 MB/sec
>
> 2.6.24 + original patches + rq=0 : 103.09 MB/sec
>                             rq=1 :  98.81 MB/sec
>
> 2.6.24 + kthreads patches + rq=0 : 106.85 MB/sec
>                             rq=1 : 107.16 MB/sec
>
> [...]
>
> I'm going to try similar runs on an AMD64 box with OProfile and see what results I get there... (BTW: I'll be dropping testing of the original patch sequence; the kthreads patches look better in general - both in terms of code & results, coincidence?)

Alan, thanks for your very nice testing efforts on this! It's very
encouraging to see that the kthread based approach is even faster than
the softirq one, since the code is indeed much simpler and doesn't
require any arch modifications. So I'd agree that just testing the
kthread approach is the best way forward, and that scrapping the remote
softirq trigger stuff is sanest.

My main worry with the current code is the ->lock in the per-cpu
completion structure. If we do a lot of migrations to other CPUs, then
that cacheline will be bounced around. But we'll be dirtying the list of
that CPU structure anyway, so playing games to make that part lockless
is probably pretty pointless. So if you get around to testing on bigger
SMP boxes, it'd be interesting to look out for. So far it looks like it's a
net win with more idle time; the benefit of keeping the rq completion
queue local must be outweighing the cost of diddling with the per-cpu
data.
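
For reference, the per-cpu completion structure being discussed has roughly this shape (an illustrative sketch only, not lifted from the actual patches):

struct completion_queue {
        spinlock_t              lock;   /* the ->lock in question */
        struct list_head        list;   /* completed requests are parked here */
        struct task_struct      *task;  /* per-cpu completion kthread */
};
static DEFINE_PER_CPU(struct completion_queue, completion_queues);

Conceptually, the completing (IRQ) CPU looks up the chosen CPU's structure, takes its lock, queues the request and wakes that CPU's kthread - so when the chosen CPU is remote, it's that lock/list cacheline that bounces.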

-- 
Jens Axboe



* Re: IO queueing and complete affinity w/ threads: Some results
From: Andi Kleen @ 2008-02-18 13:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Alan D. Brunelle, linux-kernel, npiggin, dgc, arjan

Jens Axboe <jens.axboe@oracle.com> writes:

> and that scrapping the remote
> softirq trigger stuff is sanest.

I actually liked Nick's queued smp_call_function_single() patch. So even
if it was not used for block I would still like to see it being merged 
in some form to speed up all the other IPI users.

-Andi


* Re: IO queueing and complete affinity w/ threads: Some results
From: Jens Axboe @ 2008-02-18 14:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Alan D. Brunelle, linux-kernel, npiggin, dgc, arjan

On Mon, Feb 18 2008, Andi Kleen wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > and that scrapping the remote
> > softirq trigger stuff is sanest.
> 
> > I actually liked Nick's queued smp_call_function_single() patch. So even
> if it was not used for block I would still like to see it being merged 
> in some form to speed up all the other IPI users.

Sure, Nick's patch was generically usable; my IPI stuff was just a hack
made to go as fast as possible for a single use. The current
call-on-other-CPU path is not exactly scalable...

-- 
Jens Axboe



* Re: IO queueing and complete affinity w/ threads: Some results
From: Nick Piggin @ 2008-02-19  1:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jens Axboe, Alan D. Brunelle, linux-kernel, dgc, arjan

On Mon, Feb 18, 2008 at 02:33:17PM +0100, Andi Kleen wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > and that scrapping the remote
> > softirq trigger stuff is sanest.
> 
> > I actually liked Nick's queued smp_call_function_single() patch. So even
> if it was not used for block I would still like to see it being merged 
> in some form to speed up all the other IPI users.

Yeah, that hasn't been forgotten (nor have your comments about folding
my special function into smp_call_function_single).

The call function path is terribly unscalable at the moment on a lot
of architectures, and it also isn't allowed to be used with interrupts
off due to deadlock (which the queued version can allow, provided
that wait=0).

I will get around to sending that upstream soon.


* Re: IO queueing and complete affinity w/ threads: Some results
From: Paul Jackson @ 2008-02-19 21:14 UTC (permalink / raw)
  To: Jens Axboe, Mike Travis; +Cc: Alan.Brunelle, linux-kernel, npiggin, dgc, arjan

Jens wrote:
> My main worry with the current code is the ->lock in the per-cpu
> completion structure.

Drive-by comment here: does the patch posted later this same day by Mike Travis:

  [PATCH 0/2] percpu: Optimize percpu accesses v3

help with this lock issue any?  (I have no real clue here -- just connecting
up the pretty colored dots ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214


* Re: IO queueing and complete affinity w/ threads: Some results
From: Mike Travis @ 2008-02-19 21:31 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Jens Axboe, Alan.Brunelle, linux-kernel, npiggin, dgc, arjan

Paul Jackson wrote:
> Jens wrote:
>> My main worry with the current code is the ->lock in the per-cpu
>> completion structure.
> 
> Drive-by-comment here:  Does the patch posted later this same day by Mike Travis:
> 
>   [PATCH 0/2] percpu: Optimize percpu accesses v3
> 
> help with this lock issue any?  (I have no real clue here -- just connecting
> up the pretty colored dots ;).
> 

I'm not sure of the context here, but a big motivation for doing the
zero-based per_cpu variables was to optimize access to the local
per-cpu variables down to one instruction, reducing the need for locks.

-Mike


* Re: IO queueing and complete affinity w/ threads: Some results
From: Jens Axboe @ 2008-02-20  8:08 UTC (permalink / raw)
  To: Mike Travis
  Cc: Paul Jackson, Alan.Brunelle, linux-kernel, npiggin, dgc, arjan

On Tue, Feb 19 2008, Mike Travis wrote:
> Paul Jackson wrote:
> > Jens wrote:
> >> My main worry with the current code is the ->lock in the per-cpu
> >> completion structure.
> > 
> > Drive-by-comment here:  Does the patch posted later this same day by Mike Travis:
> > 
> >   [PATCH 0/2] percpu: Optimize percpu accesses v3
> > 
> > help with this lock issue any?  (I have no real clue here -- just connecting
> > up the pretty colored dots ;).
> > 
> 
> I'm not sure of the context here but a big motivation for doing the
> zero-based per_cpu variables was to optimize access to the local
> per cpu variables to one instruction, reducing the need for locks.

I'm afraid the two things aren't related, although faster access to
per-cpu data is of course a benefit for this as well. My expressed concern
was the:

        spin_lock(&bc->lock);
        was_empty = list_empty(&bc->list);
        list_add_tail(&req->donelist, &bc->list);
        spin_unlock(&bc->lock);

where 'bc' may be per-cpu data of another CPU

-- 
Jens Axboe

