* HT schedulers' performance on single HT processor
@ 2003-12-12 14:57 Con Kolivas
2003-12-14 19:49 ` Nathan Fredrickson
2004-01-03 17:56 ` Bill Davidsen
0 siblings, 2 replies; 9+ messages in thread
From: Con Kolivas @ 2003-12-12 14:57 UTC (permalink / raw)
To: linux kernel mailing list; +Cc: Nick Piggin, Ingo Molnar
I set out to find how the hyper-thread schedulers would affect the all-important
kernel compile benchmark on machines that most of us are likely to
encounter soon: the single-processor HT machine.
Usual benchmark precautions taken; best of five runs (curiously, the fastest
was almost always the second run). For confirmation I ran the whole set twice.
Tested a kernel compile with make vmlinux, make -j2 and make -j8.
make vmlinux - tests to ensure the sequential single threaded make doesn't
suffer as a result of these tweaks
make -j2 vmlinux - tests to see how well wasted idle time is avoided
make -j8 vmlinux - maximum throughput test (4x nr_cpus seems to be the ceiling
for this).
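For reference, the procedure above (best wall-clock time of five runs) can be sketched as a small script. This is only a sketch: CMD defaults to a no-op placeholder here and would be replaced by the actual build command, with a clean tree between runs.

```shell
# Best-of-five timing harness (a sketch). CMD is a placeholder: substitute
# e.g. 'make clean >/dev/null; make vmlinux' for the real benchmark.
CMD=${CMD:-true}
best=
for run in 1 2 3 4 5; do
	start=$(date +%s)
	sh -c "$CMD" >/dev/null 2>&1
	end=$(date +%s)
	t=$((end - start))
	# keep only the fastest run
	if [ -z "$best" ] || [ "$t" -lt "$best" ]; then
		best=$t
	fi
done
echo "best of 5 runs: ${best}s"
```

With second-granularity timestamps this is only as good as the coarse `date +%s` clock, which is fine for minute-scale compiles.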
Hardware: P4 HT 3.066
Legend:
UP - Uniprocessor 2.6.0-test11 kernel
SMP - SMP kernel
C1 - With Ingo's C1 hyperthread patch
w26 - With Nick's w26 sched-rollup (hyperthread included)
make vmlinux
kernel time
UP 65.96
SMP 65.80
C1 66.54
w26 66.25
I was concerned this might happen and indeed the sequential single threaded
compile is slightly worse on both HT schedulers. (1)
make -j2 vmlinux
kernel time
UP 65.17
SMP 57.77
C1 66.01
w26 57.94
This shows the SMP kernel nicely utilises HT whereas the UP kernel doesn't. The C1
result was very repeatable and I was unable to get it lower than this. (2)
make -j8 vmlinux
kernel time
UP 65.00
SMP 57.85
C1 58.25
w26 57.94
Results are not obviously better (3), but C1 is still a little slower (2).
Ok so what happened as I see it?
(1) My concern with the HT patches and single compiles was that in an effort
to keep both logical cores busy, the next task would bounce to the other
logical core. While very cheap on HT it's still more expensive than staying
on the same core. I can't prove that happened.
(2) We know the C1 patch has trouble booting on some hardware so maybe there's
a bug in there affecting performance too.
(3) There is a very real performance advantage in this benchmark to enabling
SMP on a HT cpu. However, in the best case it only amounts to 11%. This means
that if a specialised HT scheduler patch gained say 10% it would only amount
to 1% overall - hardly an exciting amount. 1% should have been on the edge of
statistical significance, but I haven't even been able to show any difference
at all. This does _not_ mean there aren't performance benefits elsewhere, but
they obviously need evidence.
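A quick back-of-envelope check of that arithmetic, using the best UP and SMP times from the tables above (awk used purely as a calculator):

```shell
awk 'BEGIN {
	up = 65.00; smp = 57.77            # best UP and best SMP times above
	gain = (up - smp) / up             # fraction saved by enabling SMP
	printf "SMP gain: %.0f%%\n", gain * 100
	printf "10%% of that gain: %.1f%% overall\n", gain * 10
}'
```

That is, a hypothetical 10% improvement on the HT benefit works out to roughly 1% of total compile time.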
Conclusion?
If you run nothing but kernel compiles all day on a P4 HT, make sure you
compile it for SMP ;-)
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: HT schedulers' performance on single HT processor
2003-12-12 14:57 HT schedulers' performance on single HT processor Con Kolivas
@ 2003-12-14 19:49 ` Nathan Fredrickson
2003-12-14 20:35 ` Adam Kropelin
2003-12-15 10:11 ` Con Kolivas
2004-01-03 17:56 ` Bill Davidsen
1 sibling, 2 replies; 9+ messages in thread
From: Nathan Fredrickson @ 2003-12-14 19:49 UTC (permalink / raw)
To: Con Kolivas; +Cc: Linux Kernel Mailing List, Nick Piggin, Ingo Molnar
On Fri, 2003-12-12 at 09:57, Con Kolivas wrote:
> I set out to find how the hyper-thread schedulers would affect the all
> important kernel compile benchmark on machines that most of us are likely to
> encounter soon. The single processor HT machine.
I ran some further tests since I have access to some SMP systems with HT
(1, 2 and 4 physical processors).
Tested a kernel compile with make -jX vmlinux, where X = 1...16.
Results are the best real time out of five runs.
Hardware: Xeon HT 2GHz
Test cases:
1phys (uniproc) - UP test11 kernel with HT disabled in the BIOS
1phys w/HT - SMP test11 kernel on 1 physical proc with HT enabled
1phys w/HT (w26) - same as above with Nick's w26 sched-rollup patch
1phys w/HT (C1) - same as above with Ingo's C1 patch
2phys - SMP test11 kernel on 2 physical proc with HT disabled
2phys w/HT - SMP test11 kernel on 2 physical proc with HT enabled
2phys w/HT (w26) - same as above with Nick's w26 sched-rollup patch
2phys w/HT (C1) - same as above with Ingo's C1 patch
I can also run the same on four physical processors if there is
interest.
Here are some of the results. The units are time in seconds so lower is
better. The complete results and some graphs are available at:
http://nrf.sortof.com/kbench/test11-kbench.html
j = 1 2 3 4 8
1phys (uniproc) 305.86 306.07 306.47 306.63 306.69
1phys w/HT 311.70 311.01 267.05 267.16 267.62
1phys w/HT (w26) 311.85 311.58 267.20 267.53 267.76
1phys w/HT (C1) 313.72 312.89 268.16 269.17 268.67
2phys 306.00 305.00 161.15 161.31 161.51
2phys w/HT 309.02 308.36 196.91 151.70 145.80
2phys w/HT (w26) 310.65 309.34 167.16 151.37 145.22
2phys w/HT (C1) 310.86 307.90 162.05 152.16 145.82
Same table as above normalized to the j=1 uniproc case to make
comparisons easier. Lower is still better.
j = 1 2 3 4 8
1phys (uniproc) 1.00 1.00 1.00 1.00 1.00
1phys w/HT 1.02 1.02 0.87 0.87 0.87
1phys w/HT (w26) 1.02 1.02 0.87 0.87 0.88
1phys w/HT (C1) 1.03 1.02 0.88 0.88 0.88
2phys 1.00 1.00 0.53 0.53 0.53
2phys w/HT 1.01 1.01 0.64 0.50 0.48
2phys w/HT (w26) 1.02 1.01 0.55 0.49 0.47
2phys w/HT (C1) 1.02 1.01 0.53 0.50 0.48
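For reference, the normalization is just each time divided by the 305.86s j=1 uniproc baseline, e.g. for the first two entries of the 1phys w/HT row (awk as a calculator):

```shell
awk 'BEGIN {
	base = 305.86                      # 1phys (uniproc), j=1
	printf "%.2f %.2f\n", 311.70/base, 267.05/base
}'
# prints "1.02 0.87"
```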
Con Kolivas wrote:
> I was concerned this might happen and indeed the sequential single threaded
> compile is slightly worse on both HT schedulers. (1)
My test showed the same (assuming -j1 is the same as omitting the
option). The slowdown of the -j1 case with HT is 1-3%.
There was not much benefit from either HT or SMP with j=2. Maximum
speedup was not realized until j=3 for one physical processor and j=5
for 2 physical processors. This suggests that j should be set to at
least the number of logical processors + 1.
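That rule of thumb is easy to script; on Linux, counting "processor" entries in /proc/cpuinfo gives the logical CPU count (a sketch, Linux-specific):

```shell
# logical processors = lines starting with "processor" in /proc/cpuinfo
cpus=$(grep -c '^processor' /proc/cpuinfo 2>/dev/null)
[ "$cpus" -ge 1 ] 2>/dev/null || cpus=1   # defensive fallback
j=$((cpus + 1))
echo "make -j${j} vmlinux"
```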
> (3) There is a very real performance advantage in this benchmark to enabling
> SMP on a HT cpu. However, in the best case it only amounts to 11%. This means
> that if a specialised HT scheduler patch gained say 10% it would only amount
> to 1% overall - hardly an exciting amount.
Agree, there is certainly an advantage to using HT as long as there are
enough runnable processes (j>=3). Running additional processes in
parallel (j=16) does not increase performance any further, nor does it
decrease it. My best case speedup amounts to 15%, which is right in the
middle of the 10-20% range that Intel talks about.
> Conclusion?
> If you run nothing but kernel compiles all day on a P4 HT, make sure you
> compile it for SMP ;-)
And make sure you compile with the -jX option with X >= logical_procs+1
Nathan
* Re: HT schedulers' performance on single HT processor
2003-12-14 19:49 ` Nathan Fredrickson
@ 2003-12-14 20:35 ` Adam Kropelin
2003-12-14 21:15 ` Nathan Fredrickson
2003-12-15 10:11 ` Con Kolivas
1 sibling, 1 reply; 9+ messages in thread
From: Adam Kropelin @ 2003-12-14 20:35 UTC (permalink / raw)
To: Nathan Fredrickson
Cc: Con Kolivas, Linux Kernel Mailing List, Nick Piggin, Ingo Molnar, sam
On Sun, Dec 14, 2003 at 02:49:24PM -0500, Nathan Fredrickson wrote:
> Same table as above normalized to the j=1 uniproc case to make
> comparisons easier. Lower is still better.
>
> j = 1 2 3 4 8
> 1phys (uniproc) 1.00 1.00 1.00 1.00 1.00
> 1phys w/HT 1.02 1.02 0.87 0.87 0.87
> 1phys w/HT (w26) 1.02 1.02 0.87 0.87 0.88
> 1phys w/HT (C1) 1.03 1.02 0.88 0.88 0.88
> 2phys 1.00 1.00 0.53 0.53 0.53
^^^^^ ^^^^
Ummm...
> 2phys w/HT 1.01 1.01 0.64 0.50 0.48
> 2phys w/HT (w26) 1.02 1.01 0.55 0.49 0.47
> 2phys w/HT (C1) 1.02 1.01 0.53 0.50 0.48
> There was not much benefit from either HT or SMP with j=2. Maximum
> speedup was not realized until j=3 for one physical processor and j=5
> for 2 physical processors.
This is mighty suspicious. With -j2 did you check to see that there
were indeed two parallel gcc's running? Since -test6 I've found that
-j2 only results in a single gcc instance. I've seen this on both an
old hacked-up RH 7.3 installation and a brand new RH 9 + updates
installation.
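One way to check this (a sketch): snapshot the process list mid-build and count cc1 processes, since cc1 is the compiler proper that gcc forks. The canned listing below just demonstrates the counting; a real check would pipe `ps -e -o comm=` through the same filter while make runs:

```shell
# count compiler processes in a "ps -e -o comm=" style listing
count_compilers() {
	grep -c '^cc1'
}
# demo on a canned listing with two compilers running; live use would be:
#   ps -e -o comm= | count_compilers
printf 'make\ncc1\ncc1\nas\nld\n' | count_compilers   # prints 2
```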
> This suggests that j should be set to at least the number of logical
> processors + 1.
Since -test6 I've found this to be the case for kernel builds, yes. But
I don't think it has anything to do with the scheduler or HT vs SMP
platforms.
--Adam
* Re: HT schedulers' performance on single HT processor
2003-12-14 20:35 ` Adam Kropelin
@ 2003-12-14 21:15 ` Nathan Fredrickson
0 siblings, 0 replies; 9+ messages in thread
From: Nathan Fredrickson @ 2003-12-14 21:15 UTC (permalink / raw)
To: Adam Kropelin
Cc: Con Kolivas, Linux Kernel Mailing List, Nick Piggin, Ingo Molnar, sam
On Sun, 2003-12-14 at 15:35, Adam Kropelin wrote:
> On Sun, Dec 14, 2003 at 02:49:24PM -0500, Nathan Fredrickson wrote:
> > Same table as above normalized to the j=1 uniproc case to make
> > comparisons easier. Lower is still better.
> >
> > j = 1 2 3 4 8
> > 1phys (uniproc) 1.00 1.00 1.00 1.00 1.00
> > 1phys w/HT 1.02 1.02 0.87 0.87 0.87
> > 1phys w/HT (w26) 1.02 1.02 0.87 0.87 0.88
> > 1phys w/HT (C1) 1.03 1.02 0.88 0.88 0.88
> > 2phys 1.00 1.00 0.53 0.53 0.53
> ^^^^^ ^^^^
>
> Ummm...
>
> This is mighty suspicious. With -j2 did you check to see that there
> were indeed two parallel gcc's running? Since -test6 I've found that
> -j2 only results in a single gcc instance. I've seen this on both an
> old hacked-up RH 7.3 installation and a brand new RH 9 + updates
> installation.
I just checked and you're right, the number of compilers that actually
run is j-1, for all j>1. I assume this is a problem with the parallel
build process, but it does not invalidate these results for comparing
the scheduler performance with different patches.
>
> > This suggests that j should be set to at least the number of logical
> > processors + 1.
>
> Since -test6 I've found this to be the case for kernel builds, yes. But
> I don't think it has anything to do with the scheduler or HT vs SMP
> platforms.
The 1-3% performance loss when HT is enabled for -j1 is still very real.
Nathan
* Re: HT schedulers' performance on single HT processor
2003-12-14 19:49 ` Nathan Fredrickson
2003-12-14 20:35 ` Adam Kropelin
@ 2003-12-15 10:11 ` Con Kolivas
2003-12-16 0:16 ` Nathan Fredrickson
1 sibling, 1 reply; 9+ messages in thread
From: Con Kolivas @ 2003-12-15 10:11 UTC (permalink / raw)
To: Nathan Fredrickson
Cc: Linux Kernel Mailing List, Nick Piggin, Ingo Molnar, Adam Kropelin
On Mon, 15 Dec 2003 06:49, Nathan Fredrickson wrote:
> On Fri, 2003-12-12 at 09:57, Con Kolivas wrote:
> > I set out to find how the hyper-thread schedulers would affect the all
> > important kernel compile benchmark on machines that most of us are likely
> > to encounter soon. The single processor HT machine.
>
> I ran some further tests since I have access to some SMP systems with HT
> (1, 2 and 4 physical processors).
> I can also run the same on four physical processors if there is
> interest.
> j = 1 2 3 4 8
> 1phys (uniproc) 1.00 1.00 1.00 1.00 1.00
> 1phys w/HT 1.02 1.02 0.87 0.87 0.87
> 1phys w/HT (w26) 1.02 1.02 0.87 0.87 0.88
> 1phys w/HT (C1) 1.03 1.02 0.88 0.88 0.88
> 2phys 1.00 1.00 0.53 0.53 0.53
> 2phys w/HT 1.01 1.01 0.64 0.50 0.48
> 2phys w/HT (w26) 1.02 1.01 0.55 0.49 0.47
> 2phys w/HT (C1) 1.02 1.01 0.53 0.50 0.48
The specific HT scheduler benefits only start appearing with more physical
cpus, which is to be expected. Just for demonstration the four processor run
would be nice (and would obviously take you less time to do ;). I think it will
demonstrate it even more. It would be nice to help the most common case of
one HT cpu, though, instead of hindering it.
Adam already pointed out that your -j2 didn't really get you 2 jobs. I was
using a 2.4 kernel tree for the benchmarks and -j2 was giving me two jobs,
though perhaps something about the C1 patch was preventing the second job
from ever taking off, which is why the result is the same as one job in my
benches. Curious.
> > Conclusion?
> > If you run nothing but kernel compiles all day on a P4 HT, make sure you
> > compile it for SMP ;-)
>
> And make sure you compile with the -jX option with X >= logical_procs+1
Of course. For now, on a uniprocessor HT setup I'd recommend the unmodified
scheduler in SMP mode.
Con
* Re: HT schedulers' performance on single HT processor
2003-12-15 10:11 ` Con Kolivas
@ 2003-12-16 0:16 ` Nathan Fredrickson
2003-12-16 0:55 ` Con Kolivas
0 siblings, 1 reply; 9+ messages in thread
From: Nathan Fredrickson @ 2003-12-16 0:16 UTC (permalink / raw)
To: Con Kolivas
Cc: Linux Kernel Mailing List, Nick Piggin, Ingo Molnar, Adam Kropelin
On Mon, 2003-12-15 at 05:11, Con Kolivas wrote:
> On Mon, 15 Dec 2003 06:49, Nathan Fredrickson wrote:
> > I can also run the same on four physical processors if there is
> > interest.
>
> The specific HT scheduler benefits only start appearing with more physical
> cpus which is to be expected. Just for demonstration the four processor run
> would be nice (and obviously take you less time to do ;). I think it will
> demonstrate it even more. It would be nice to help the most common case of
> one HT cpu, though, instead of hindering it.
Here are some results on four physical processors. Unfortunately my
quad systems are a different speed than the dual systems used for the
previous tests so the results are not directly comparable.
Same test as before, a 2.6.0 kernel compile with make -jX vmlinux.
Results are the best real time out of five runs.
Hardware: Xeon HT 1.4GHz
Test cases:
1phys UP - UP test11 kernel with HT disabled in the BIOS
4phys SMP - SMP test11 kernel on 4 physical procs with HT disabled
4phys HT - SMP test11 kernel on 4 physical procs with HT enabled
4phys HT (w26)- same as above with Nick's w26 sched-rollup patch
4phys HT (C1) - same as above with Ingo's C1 patch
Here are the results normalized to the X=1 UP case to make comparisons
easier. Lower is better.
X = 1 2 3 4 5 6 7 8 9 16
1phys UP 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
4phys SMP 1.00 0.99 0.51 0.35 0.27 0.27 0.27 0.27 0.27 0.27
4phys HT 1.01 1.00 0.55 0.40 0.33 0.29 0.27 0.26 0.25 0.26
4phys HT(w26) 1.01 1.01 0.54 0.37 0.31 0.27 0.26 0.26 0.26 0.26
4phys HT(C1) 1.01 1.00 0.52 0.36 0.29 0.28 0.27 0.26 0.25 0.26
Interesting that the overhead due to HT in the X=1 column is only 1%
with 4 physical processors. It was 1-3% before with 1 or 2 physical
processors.
In the partial load columns, where there are fewer compiler processes than
logical CPUs (X=3,4,5,6,7), it appears that both patches are doing a
better job scheduling than the standard scheduler. At full load (X>=8)
all three HT test cases perform about equally and beat standard SMP by
1-2%.
Hope these results are helpful. I'd be happy to run more cases and/or
other patches.
Nathan
* Re: HT schedulers' performance on single HT processor
2003-12-16 0:16 ` Nathan Fredrickson
@ 2003-12-16 0:55 ` Con Kolivas
2003-12-16 3:57 ` Nathan Fredrickson
0 siblings, 1 reply; 9+ messages in thread
From: Con Kolivas @ 2003-12-16 0:55 UTC (permalink / raw)
To: Nathan Fredrickson; +Cc: Linux Kernel Mailing List
[-- Attachment #1: Type: text/plain, Size: 1362 bytes --]
Quoting Nathan Fredrickson <8nrf@qlink.queensu.ca>:
> X = 1 2 3 4 5 6 7 8 9 16
> 1phys UP 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
> 4phys SMP 1.00 0.99 0.51 0.35 0.27 0.27 0.27 0.27 0.27 0.27
> 4phys HT 1.01 1.00 0.55 0.40 0.33 0.29 0.27 0.26 0.25 0.26
> 4phys HT(w26) 1.01 1.01 0.54 0.37 0.31 0.27 0.26 0.26 0.26 0.26
> 4phys HT(C1) 1.01 1.00 0.52 0.36 0.29 0.28 0.27 0.26 0.25 0.26
>
> Interesting that the overhead due to HT in the X=1 column is only 1%
> with 4 physical processors. It was 1-3% before with 1 or 2 physical
> processors.
>
> In the partial load columns, where there are fewer compiler processes than
> logical CPUs (X=3,4,5,6,7), it appears that both patches are doing a
> better job scheduling than the standard scheduler. At full load (X>=8)
> all three HT test cases perform about equally and beat standard SMP by
> 1-2%.
>
> Hope these results are helpful. I'd be happy to run more cases and/or
> other patches.
(cc list stripped)
Well since you asked... I've been looking for someone with more HT cpus to give
a much simpler approach a try. Here's a sample patch for vanilla test11 with
HT. This one actually helps UP HT performance ever so slightly and I'd be
curious to see if it does anything on more cpus.
Con
[-- Attachment #2: patch-test11-ht-3 --]
[-- Type: application/octet-stream, Size: 2612 bytes --]
--- linux-2.6.0-test11-base/kernel/sched.c	2003-11-24 22:18:56.000000000 +1100
+++ linux-2.6.0-test11-ht3/kernel/sched.c	2003-12-15 23:38:33.250059542 +1100
@@ -204,6 +204,7 @@ struct runqueue {
 	struct mm_struct *prev_mm;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_cpu_load[NR_CPUS];
+	unsigned long cpu;
 #ifdef CONFIG_NUMA
 	atomic_t *node_nr_running;
 	int prev_node_load[MAX_NUMNODES];
@@ -221,6 +222,10 @@ static DEFINE_PER_CPU(struct runqueue, r
 #define task_rq(p)		cpu_rq(task_cpu(p))
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 
+#define ht_active		(cpu_has_ht && smp_num_siblings > 1)
+#define ht_siblings(cpu1, cpu2)	(ht_active && \
+	cpu_sibling_map[(cpu1)] == (cpu2))
+
 /*
  * Default context-switch locking:
  */
@@ -1157,8 +1162,9 @@ can_migrate_task(task_t *tsk, runqueue_t
 {
 	unsigned long delta = sched_clock() - tsk->timestamp;
 
-	if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks)))
-		return 0;
+	if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks)) &&
+	    !ht_siblings(this_cpu, task_cpu(tsk)))
+		return 0;
 	if (task_running(rq, tsk))
 		return 0;
 	if (!cpu_isset(this_cpu, tsk->cpus_allowed))
@@ -1193,15 +1199,23 @@ static void load_balance(runqueue_t *thi
 	imbalance /= 2;
 
 	/*
+	 * For hyperthread siblings take tasks from the active array
+	 * to get cache-warm tasks since they share caches.
+	 */
+	if (ht_siblings(this_cpu, busiest->cpu))
+		array = busiest->active;
+	/*
 	 * We first consider expired tasks. Those will likely not be
 	 * executed in the near future, and they are most likely to
 	 * be cache-cold, thus switching CPUs has the least effect
 	 * on them.
 	 */
-	if (busiest->expired->nr_active)
-		array = busiest->expired;
-	else
-		array = busiest->active;
+	else {
+		if (busiest->expired->nr_active)
+			array = busiest->expired;
+		else
+			array = busiest->active;
+	}
 
 new_array:
 	/* Start searching at priority 0: */
@@ -1212,9 +1226,16 @@ skip_bitmap:
 	else
 		idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
 	if (idx >= MAX_PRIO) {
-		if (array == busiest->expired) {
-			array = busiest->active;
-			goto new_array;
+		if (ht_siblings(this_cpu, busiest->cpu)) {
+			if (array == busiest->active) {
+				array = busiest->expired;
+				goto new_array;
+			}
+		} else {
+			if (array == busiest->expired) {
+				array = busiest->active;
+				goto new_array;
+			}
 		}
 		goto out_unlock;
 	}
@@ -2812,6 +2833,7 @@ void __init sched_init(void)
 		prio_array_t *array;
 
 		rq = cpu_rq(i);
+		rq->cpu = (unsigned long)(i);
 		rq->active = rq->arrays;
 		rq->expired = rq->arrays + 1;
 		spin_lock_init(&rq->lock);
* Re: HT schedulers' performance on single HT processor
2003-12-16 0:55 ` Con Kolivas
@ 2003-12-16 3:57 ` Nathan Fredrickson
0 siblings, 0 replies; 9+ messages in thread
From: Nathan Fredrickson @ 2003-12-16 3:57 UTC (permalink / raw)
To: Con Kolivas; +Cc: Linux Kernel Mailing List
On Mon, 2003-12-15 at 19:55, Con Kolivas wrote:
> Well since you asked... I've been looking for someone with more HT cpus to give
> a much simpler approach a try. Here's a sample patch for vanilla test11 with
> HT. This one actually helps UP HT performance ever so slightly and I'd be
> curious to see if it does anything on more cpus.
Not much change with this patch. The new result is most similar to
vanilla test11 with HT. Both perform worse than no-HT under partial
load. Here are the results from earlier with the new test case
appended:
X = 1 2 3 4 5 6 7 8 9 16
1phys UP 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
4phys SMP 1.00 0.99 0.51 0.35 0.27 0.27 0.27 0.27 0.27 0.27
4phys HT 1.01 1.00 0.55 0.40 0.33 0.29 0.27 0.26 0.25 0.26
4phys HT(w26) 1.01 1.01 0.54 0.37 0.31 0.27 0.26 0.26 0.26 0.26
4phys HT(C1) 1.01 1.00 0.52 0.36 0.29 0.28 0.27 0.26 0.25 0.26
4phys HT(ht3) 1.01 1.00 0.53 0.39 0.33 0.29 0.27 0.26 0.26 0.26
Nathan
* Re: HT schedulers' performance on single HT processor
2003-12-12 14:57 HT schedulers' performance on single HT processor Con Kolivas
2003-12-14 19:49 ` Nathan Fredrickson
@ 2004-01-03 17:56 ` Bill Davidsen
1 sibling, 0 replies; 9+ messages in thread
From: Bill Davidsen @ 2004-01-03 17:56 UTC (permalink / raw)
To: Con Kolivas; +Cc: linux kernel mailing list, Nick Piggin, Ingo Molnar
Con Kolivas wrote:
> I set out to find how the hyper-thread schedulers would affect the all
> important kernel compile benchmark on machines that most of us are likely to
> encounter soon. The single processor HT machine.
>
> Usual benchmark precautions taken; best of five runs (curiously the fastest
> was almost always the second run). Although for confirmation I really did
> this twice.
>
> Tested a kernel compile with make vmlinux, make -j2 and make -j8.
>
> make vmlinux - tests to ensure the sequential single threaded make doesn't
> suffer as a result of these tweaks
>
> make -j2 vmlinux - tests to see how well wasted idle time is avoided
>
> make -j8 vmlinux - maximum throughput test (4x nr_cpus seems to be ceiling for
> this).
>
> Hardware: P4 HT 3.066
>
> Legend:
> UP - Uniprocessor 2.6.0-test11 kernel
> SMP - SMP kernel
> C1 - With Ingo's C1 hyperthread patch
> w26 - With Nick's w26 sched-rollup (hyperthread included)
>
> make vmlinux
> kernel time
> UP 65.96
> SMP 65.80
> C1 66.54
> w26 66.25
>
> I was concerned this might happen and indeed the sequential single threaded
> compile is slightly worse on both HT schedulers. (1)
>
> make -j2 vmlinux
> kernel time
> UP 65.17
> SMP 57.77
> C1 66.01
> w26 57.94
>
> Shows the smp kernel nicely utilises HT whereas the UP kernel doesn't. The C1
> result was very repeatable and I was unable to get it lower than this.(2)
>
> make -j8 vmlinux
> kernel time
> UP 65.00
> SMP 57.85
> C1 58.25
> w26 57.94
If you could run one more test, do the compile with -pipe set in the
top-level Makefile. I don't have play access to a HT uni; the only
machines available to me at the moment are SMP and production at that.
I did try it just for grins on a non-HT uni and saw this:
opt real user sys idle
-j1 406.2 308.1 19.0 79.1
-j1 -pipe 398.6 308.2 19.0 71.4
-j3 391.6 308.3 19.0 64.3
-j3 -pipe 388.7 308.4 19.0 61.3
P4-2.4GHz, 256MB, compiling 2.5.47-ac6 with just "make." Using -pipe
*may* allow both siblings to cooperate better.
I assume that CPU affinity should apply to all siblings in a package?
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979