* NUMA schedulers tests
From: Erich Focht @ 2002-10-13 11:17 UTC
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Andrew Theurer, Luck, Tony, Ingo Molnar, linux-kernel

Hi,

here is a set of measurements for three benchmarks and five schedulers.
They were performed on an NEC Azusa, a 16-CPU Itanium NUMA machine with
4 nodes of 4 CPUs and 8GB of memory each.

The benchmarks are:
- hackbench: a variant of the well-known chatroom benchmark (a minimal
  sketch of the pattern it exercises follows this list)
- numa_test: runs a given number of memory-bandwidth-hungry and
  latency-sensitive processes in parallel
- kernbench: parallel kernel compile (5 runs of make -j24, building an
  ia64 2.4.18 kernel with gcc 2.96; as gcc's speed depends heavily on the
  number of registers of the target architecture, the results are not
  comparable to i386 compiles)
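
The sketch below is my own illustration, not the actual benchmark (the
real one runs whole groups of senders and receivers): pairs of processes
exchange small messages over socketpairs, so the run time is dominated
by scheduler wakeups and context switches.

/*
 * Minimal hackbench-style load generator (illustrative only).
 * Usage: ./chatpair [ngroups]
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/wait.h>

#define MSG_SIZE 100
#define NUM_MSGS 1000

int main(int argc, char **argv)
{
	int groups = (argc > 1) ? atoi(argv[1]) : 10;
	char buf[MSG_SIZE];

	memset(buf, 'x', sizeof(buf));
	for (int g = 0; g < groups; g++) {
		int sv[2];

		if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
			perror("socketpair");
			exit(1);
		}
		if (fork() == 0) {	/* sender */
			for (int i = 0; i < NUM_MSGS; i++)
				write(sv[0], buf, sizeof(buf));
			exit(0);
		}
		if (fork() == 0) {	/* receiver: drain all bytes */
			long left = (long)MSG_SIZE * NUM_MSGS;
			while (left > 0) {
				ssize_t n = read(sv[1], buf, sizeof(buf));
				if (n <= 0)
					break;
				left -= n;
			}
			exit(0);
		}
		close(sv[0]);
		close(sv[1]);
	}
	while (wait(NULL) > 0)	/* reap all children */
		;
	return 0;
}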

Schedulers tested (all on top of discontigmem for ia64):

A: vanilla O(1) scheduler from 2.5.39
B: Michael Hohnbaum's simple NUMA scheduler (latest published rev 2)
C: pooling NUMA scheduler with initial load balancing (patches 01+02
   from the 5 patches sent out)
D: node affine NUMA scheduler (patches 01+02+03)
E: node affine NUMA scheduler with dynamic homenode (patches 01+02+03+05)
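
The common idea behind B-E, reduced to a few lines of grossly
simplified, purely illustrative userspace C (hypothetical names, not
the actual patch code): choose the least loaded node as a task's
homenode at exec() time, and let the balancer pull a task off its
homenode only reluctantly.

/* Conceptual sketch only -- compiles and runs as a plain program. */
#include <stdio.h>

#define NR_NODES      4
#define CPUS_PER_NODE 4

static int node_nr_running[NR_NODES] = { 6, 2, 5, 3 };

/* Initial load balancing: least loaded node becomes the homenode. */
static int sched_best_node(void)
{
	int node, best = 0;

	for (node = 1; node < NR_NODES; node++)
		if (node_nr_running[node] < node_nr_running[best])
			best = node;
	return best;
}

/* Node affinity: steal a task across nodes only if it is already off
 * its homenode, or the imbalance is severe. */
static int may_steal_across_nodes(int homenode, int curr_node, int imbalance)
{
	return homenode != curr_node || imbalance > CPUS_PER_NODE;
}

int main(void)
{
	int home = sched_best_node();

	printf("exec'd task gets homenode %d\n", home);
	printf("steal from homenode, imbalance 2: %d\n",
	       may_steal_across_nodes(home, home, 2));
	printf("steal from homenode, imbalance 6: %d\n",
	       may_steal_across_nodes(home, home, 6));
	return 0;
}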

All results are averages over several measurements; the numbers in
parentheses are the standard deviation, the "error bars".
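
In case anyone wants to reproduce the statistics: it's the usual
standard deviation over the repeated runs, e.g. (made-up times, and
assuming the sample n-1 form; at these run counts the denominator
hardly matters):

#include <math.h>
#include <stdio.h>

static double stddev(const double *x, int n)
{
	double mean = 0.0, ss = 0.0;
	int i;

	for (i = 0; i < n; i++)
		mean += x[i];
	mean /= n;
	for (i = 0; i < n; i++)
		ss += (x[i] - mean) * (x[i] - mean);
	return sqrt(ss / (n - 1));	/* sample standard deviation */
}

int main(void)
{
	double runs[4] = { 5.44, 5.61, 5.49, 5.62 };	/* made up */

	printf("stddev over 4 runs = %.2f\n", stddev(runs, 4));
	return 0;
}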

There's a lot to say about the results, e.g. that E is the best because
it has the most features. But the only comment I really want to make now
is: numa_test is not good enough for testing node affinity, as
"hackbench 20" doesn't seem to disturb the initial balance much. I'll
improve it to make it more realistic. Otherwise it's good enough to show
the advantage of not moving tasks around across the nodes.

Best regards,
Erich

Hackbench: (averages over 4 runs, execution time in seconds)
-------------------------------------------------------------------
N:	10		25		50		100
A	5.54(0.10)	15.28(0.03)	32.65(0.35)	67.67(0.69)
B	5.37(0.06)	15.13(0.44)	31.96(0.49)	67.14(0.43)
C	1.31(0.08)	 2.91(0.12)	 5.95(0.13)	13.37(0.11)
D	2.45(0.44)	 8.89(1.47)	21.98(1.08)	30.57(17.10)
E	1.39(0.08)	 2.92(0.09)	 6.20(0.34)	14.30(0.28)


numa_test: N=4
----------------------------------------------------------------------------
	Elapsed		AvgUserTime	TotUserTime	TotSysTime	Runs
A	42.95(3.23)	30.40(1.16)	121.65(4.62)			10
B	35.08(2.24)	28.51(0.77)	114.10(3.07)	0.19(0.03)	20
C	27.44(0.06)	27.14(0.04)	108.62(0.15)	0.16(0.01)	20
D	27.42(0.08)	27.10(0.04)	108.48(0.14)	0.15(0.01)	20
E	27.50(0.23)	27.10(0.03)	108.48(0.13)	0.15(0.01)	20


numa_test: N=8
----------------------------------------------------------------------------
	Elapsed		AvgUserTime	TotUserTime	TotSysTime	Runs
A	42.75(2.31)	31.34(0.93)	250.82(7.46)			10
B	34.82(2.74)	29.76(0.67)	238.19(5.34)	0.44(0.04)	20
C	29.29(0.47)	28.80(0.14)	230.55(1.13)	0.35(0.02)	20
D	29.23(0.40)	28.76(0.12)	230.20(0.92)	0.34(0.02)	20
E	29.08(0.06)	28.76(0.03)	230.22(0.27)	0.34(0.01)	20


numa_test: N=16
----------------------------------------------------------------------------
	Elapsed		AvgUserTime	TotUserTime	TotSysTime	Runs
A	47.02(0.88)	33.80(0.90)	541.15(14.47)			10
B	41.82(8.26)	32.04(0.47)	512.91(7.57)	1.08(0.05)	20
C	41.30(5.50)	31.80(0.18)	509.11(2.84)	1.00(0.05)	20
D	39.89(6.42)	31.80(0.10)	509.13(1.67)	0.95(0.03)	20
E	41.61(5.42)	31.76(0.27)	508.44(4.32)	0.96(0.04)	20


numa_test: N=32
----------------------------------------------------------------------------
	Elapsed		AvgUserTime	TotUserTime	TotSysTime	Runs
A	83.52(2.25)	37.44(1.16)	1198.57(37.18)			10
B	80.33(4.35)	33.86(0.60)	1083.94(19.07)	2.13(0.06)	10
C	77.84(4.10)	33.58(0.36)	1074.82(11.47)			4
D	73.59(6.31)	33.20(0.11)	1062.88(3.38)			4
E	69.86(0.41)	33.26(0.06)	1064.72(1.88)	2.01(0.03)	10


numa_test: N=64
----------------------------------------------------------------------------
	Elapsed		AvgUserTime	TotUserTime	TotSysTime	Runs
A	164.15(3.40)	38.73(0.77)	2479.61(49.41)			10
B	149.63(2.89)	34.61(0.40)	2215.80(25.42)	4.28(0.06)	10
C	147.22(1.82)	34.41(0.13)	2202.88(8.39)			4
D	139.32(5.26)	33.37(0.05)	2136.66(3.59)			4
E	137.10(2.15)	33.40(0.02)	2138.13(1.37)	4.08(0.05)	10


Kernbench: (averages over 5 compiles)
-----------------------------------------------
	Elapsed		UserTime	SysTime
A	93.78(0.43)	1321.05(1.20)	49.31(0.14)
B	93.47(1.57)	1303.11(0.53)	52.27(0.46)
C	92.67(1.08)	1304.09(0.74)	52.79(0.32)
D	93.19(1.45)	1300.30(0.89)	52.46(0.20)
E	93.63(1.00)	1299.44(0.67)	52.81(0.19)



* Re: NUMA schedulers tests
From: Andrew Theurer @ 2002-10-14 15:57 UTC
  To: Erich Focht, Martin J. Bligh, Michael Hohnbaum
  Cc: Luck, Tony, Ingo Molnar, linux-kernel

On Sunday 13 October 2002 6:17 am, Erich Focht wrote:
> Hi,
>
<snip>
>
> Schedulers tested (all on top of discontigmem for ia64):
>
> A: vanilla O(1) scheduler from 2.5.39
> B: Michael Hohnbaum's simple NUMA scheduler (latest published rev 2)
> C: pooling NUMA scheduler with initial load balancing (patches 01+02
>    from the 5 patches sent out)
> D: node affine NUMA scheduler (patches 01+02+03)
> E: node affine NUMA scheduler with dynamic homenode (patches 01+02+03+05)
>
> All results are averages over several measurements, the numbers in braces
> are the standard deviation or "error bars".
>
> There's a lot to say about the results, e.g. that E is the best because
> it has the most features. But the only comment I really want to make now
> is: numa_test is not good enough for testing node affinity, as
> "hackbench 20" doesn't seem to disturb the initial balance much. I'll
> improve it to make it more realistic. Otherwise it's good enough to show
> the advantage of not moving tasks around across the nodes.

Thanks very much for doing this, Erich.  I have run the SPEC SDET
benchmark on 2.5.41-mm3 with your "E" scheduler and with Michael's rev 1
and rev 2 schedulers.  At this time I don't think I can publish actual
throughput results, only % differences, to abide by SPEC rules.  These
results are not SPEC compliant and are used only to compare the relative
performance of these scheduler implementations.

SDET simulates a multiuser software development environment and may be
representative of settings like universities.  More information can be
found at: http://www.spec.org/osg/sdm91/

SDET was configured to simulate 1, 2, 4, 8, 16, 32, 64, and 128 users.
SDET appears to benefit from spreading out exec'd processes on initial
placement, but there are also scenarios where exec'd tasks should reside
on the same node.  Ideally, each of the scripts run in parallel would
benefit from being initially spread out across the nodes, while the
processes exec'd _within_ each script would benefit from being placed
within the node.  I don't expect any scheduler to identify and optimize
for that today, but maybe that's something to work on in the future, who
knows?  (A rough sketch of the idea follows.)
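
Purely illustrative -- no scheduler implements this today, and the
names are hypothetical: the first exec() in a new session (the script
itself) would be spread to the least loaded node, while later exec()s
from within that session would stay on the session's node.

#include <stdio.h>

#define NR_NODES 4

static int node_nr_running[NR_NODES] = { 3, 1, 4, 2 };

struct task {
	int session_leader;	/* 1 if this task starts a new script */
	int session_node;	/* node the script was placed on */
};

static int least_loaded_node(void)
{
	int node, best = 0;

	for (node = 1; node < NR_NODES; node++)
		if (node_nr_running[node] < node_nr_running[best])
			best = node;
	return best;
}

static int place_exec(struct task *t)
{
	if (t->session_leader) {
		/* top-level script: spread across the nodes */
		t->session_node = least_loaded_node();
		t->session_leader = 0;
	}
	/* execs from within the script keep node locality */
	return t->session_node;
}

int main(void)
{
	struct task script = { 1, -1 };

	printf("script placed on node %d\n", place_exec(&script));
	printf("child exec stays on node %d\n", place_exec(&script));
	return 0;
}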

Anyway, on to the results.  Almost forgot, this is on a 16-way NUMA-Q
box.  These are not raw numbers; each result is normalized to the
vanilla 2.5.41-mm3 throughput for 1 user (= 100):

users:      1     2     4     8     16    32    64    128
vanilla     100   191   288   426   462   438   420   413
erich-E      62   112   213   336   412   425   410   397
michael-r1  103   180   306   405   464   436   430   413
michael-r2   65   116   208   333   427   332   432   398


In Michael's rev 1, a newly exec'd task is placed on the same CPU if
nr_running <= 2 (sched_best_cpu), which I believe is why the lightly
loaded tests perform better: the tasks within a script get scheduled on
the same CPU/node and benefit from locality.  Erich's scheduler and
Michael's rev 2 place the newly exec'd task on the least loaded CPU,
which tends to spread the tasks exec'd within a script across nodes and
hurts throughput in those cases.  In the higher load cases, most of the
results appear similar.  (A sketch of the rev 1 heuristic follows.)
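
The heuristic reduced to a few lines of illustrative userspace C
(simplified from memory, hypothetical names -- see Michael's patches
for the real sched_best_cpu):

#include <stdio.h>

#define NR_CPUS 16

static int cpu_nr_running[NR_CPUS];

static int sched_best_cpu(int this_cpu)
{
	int cpu, best;

	/* rev 1 style: lightly loaded, stay put so the children of one
	 * script keep cache/node locality */
	if (cpu_nr_running[this_cpu] <= 2)
		return this_cpu;

	/* otherwise (and always, in rev 2 style): least loaded CPU,
	 * which spreads tasks out across the nodes */
	best = 0;
	for (cpu = 1; cpu < NR_CPUS; cpu++)
		if (cpu_nr_running[cpu] < cpu_nr_running[best])
			best = cpu;
	return best;
}

int main(void)
{
	cpu_nr_running[3] = 2;
	printf("exec on cpu 3 (load 2) -> cpu %d\n", sched_best_cpu(3));
	cpu_nr_running[3] = 4;
	printf("exec on cpu 3 (load 4) -> cpu %d\n", sched_best_cpu(3));
	return 0;
}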

Anyway, I used SDET to give yet another view of how these schedulers
can affect performance.  At this point I personally do not have a
preference for which scheduler to use.  I'd like to hear from the people
who have made BIG changes to the scheduler (hint, hint) what they think
of these implementations, and where we should go from here.  We should
probably decide which implementation to use, and what we need to do to
make it ready for inclusion, right?

Andrew Theurer



* Re: NUMA schedulers tests
From: Michael Hohnbaum @ 2002-10-15  0:01 UTC
  To: Erich Focht
  Cc: Martin J. Bligh, Andrew Theurer, Luck, Tony, Ingo Molnar, linux-kernel

Erich,

Thanks for the numbers.  These are quite interesting.  Your scheduler,
which goes to great effort to spread the load evenly across all of the
nodes, appears to do much better on small loads, but as the system
becomes loaded the difference gets smaller.  Still, your best
combination is the clear winner on numa_test.  On kernbench, though,
all the runs look quite close.

While numa_test is interesting and might be quite relevant to
the HPC loads that you focus on, I'm not sure it should be the
defining test for NUMA scheduler performance.  The main problem
is that we have differing workloads that we care about.  

I would be happy to see your NUMA scheduler get accepted into
mainline, as I think there are enough knobs that it can be tuned for
varying workloads.  It will just take some time and testing to
establish the right settings.

On the other hand, my NUMA scheduler is a definite improvement
for NUMA boxes, so if your scheduler cannot be included in
mainline, getting mine in would be of benefit.  

Ingo, do you have any thoughts on this you would be willing to
pass along?

                Michael



On Sun, 2002-10-13 at 04:17, Erich Focht wrote:
> Hi,
>
> here is a set of measurements for three benchmarks and five schedulers.
> They were performed on an NEC Azusa, a 16-CPU Itanium NUMA machine with
> 4 nodes of 4 CPUs and 8GB of memory each.
>
<snip>
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486

