linux-kernel.vger.kernel.org archive mirror
* HT schedulers' performance on single HT processor
@ 2003-12-12 14:57 Con Kolivas
  2003-12-14 19:49 ` Nathan Fredrickson
  2004-01-03 17:56 ` Bill Davidsen
  0 siblings, 2 replies; 9+ messages in thread
From: Con Kolivas @ 2003-12-12 14:57 UTC (permalink / raw)
  To: linux kernel mailing list; +Cc: Nick Piggin, Ingo Molnar

I set out to find how the hyper-threading schedulers would affect the 
all-important kernel compile benchmark on the machine most of us are likely 
to encounter soon: the single-processor HT machine.

Usual benchmark precautions taken: best of five runs (curiously, the fastest 
was almost always the second run). For confirmation I did the whole exercise 
twice.

Tested a kernel compile with make vmlinux, make -j2 and make -j8. 

make vmlinux - checks that the sequential, single-threaded make doesn't 
suffer as a result of these tweaks

make -j2 vmlinux - tests how well wasted idle time is avoided

make -j8 vmlinux - maximum throughput test (4x nr_cpus seems to be the 
ceiling for this)
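
For reference, a rough best-of-five timing harness along these lines does the 
job; this is only a sketch of the methodology described above (the file name, 
the command string and the make clean between runs are my assumptions, not 
the script actually used).

/* bench.c: crude best-of-five wall-clock timer for a build command. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double now(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
	const char *cmd = "make -j2 vmlinux > /dev/null 2>&1";
	double best = 1e12;
	int i;

	for (i = 0; i < 5; i++) {
		double t0, elapsed;

		/* start from a clean tree so every run does the same work */
		if (system("make clean > /dev/null 2>&1") != 0)
			return 1;
		t0 = now();
		if (system(cmd) != 0)
			return 1;
		elapsed = now() - t0;
		printf("run %d: %.2fs\n", i + 1, elapsed);
		if (elapsed < best)
			best = elapsed;
	}
	printf("best of five: %.2fs\n", best);
	return 0;
}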

Hardware: P4 HT 3.066GHz

Legend:
UP - Uniprocessor 2.6.0-test11 kernel
SMP - SMP kernel
C1 - With Ingo's C1 hyperthread patch
w26 - With Nick's w26 sched-rollup (hyperthread included)

make vmlinux
kernel	time (s)
UP	65.96
SMP	65.80
C1	66.54
w26	66.25

I was concerned this might happen, and indeed the sequential single-threaded 
compile is slightly worse on both HT schedulers (1).

make -j2 vmlinux
kernel	time (s)
UP	65.17
SMP	57.77
C1	66.01
w26	57.94

This shows the SMP kernel nicely utilises HT, whereas the UP kernel doesn't. 
The C1 result was very repeatable and I was unable to get it lower than 
this (2).

make -j8 vmlinux
kernel	time (s)
UP	65.00
SMP	57.85
C1	58.25
w26	57.94

Results are not obviously better (3), but C1 is still a little slower (2).

OK, so what happened? As I see it:

(1) My concern with the HT patches and single compiles was that, in an effort 
to keep both logical cores busy, the next task would bounce to the other 
logical core. While such a bounce is very cheap on HT, it's still more 
expensive than staying on the same core. I can't prove that this happened.

(2) We know the C1 patch has trouble booting on some hardware, so maybe 
there's a bug in there affecting performance too.

(3) There is a very real performance advantage in this benchmark to enabling 
SMP on an HT CPU. However, in the best case it only amounts to 11%. This means 
that if a specialised HT scheduler patch gained, say, 10% on top of that, it 
would only amount to 1% overall - hardly an exciting amount. 1% should have 
been at the edge of statistical significance, but I haven't been able to show 
any difference at all. This does _not_ mean there aren't performance benefits 
elsewhere, but they obviously need evidence.
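
To make the arithmetic in (3) explicit, here is a tiny check using the -j2 
figures from the tables above; it is only an illustration of the estimate, 
not an extra measurement.

/* speedup.c: back-of-the-envelope check of the numbers in (3).
 * The times are the -j2 results quoted above, in seconds.
 */
#include <stdio.h>

int main(void)
{
	double up = 65.17, smp = 57.77;

	/* Benefit of the SMP kernel (and thus HT) at -j2: roughly 11%. */
	double ht_benefit = (up - smp) / up;

	/* If an HT-aware scheduler improved on that benefit by 10%,
	 * the overall compile time would shrink by only ~1%. */
	double overall_gain = 0.10 * ht_benefit;

	printf("HT benefit:   %.1f%%\n", 100.0 * ht_benefit);   /* ~11.4% */
	printf("overall gain: %.1f%%\n", 100.0 * overall_gain); /* ~1.1% */
	return 0;
}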

Conclusion?
If you run nothing but kernel compiles all day on a P4 HT, make sure you 
compile it for SMP ;-)



* Re: HT schedulers' performance on single HT processor
  2003-12-12 14:57 HT schedulers' performance on single HT processor Con Kolivas
@ 2003-12-14 19:49 ` Nathan Fredrickson
  2003-12-14 20:35   ` Adam Kropelin
  2003-12-15 10:11   ` Con Kolivas
  2004-01-03 17:56 ` Bill Davidsen
  1 sibling, 2 replies; 9+ messages in thread
From: Nathan Fredrickson @ 2003-12-14 19:49 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Linux Kernel Mailing List, Nick Piggin, Ingo Molnar

On Fri, 2003-12-12 at 09:57, Con Kolivas wrote:
> I set out to find how the hyper-thread schedulers would affect the all 
> important kernel compile benchmark on machines that most of us are likely to 
> encounter soon. The single processor HT machine.

I ran some further tests since I have access to some SMP systems with HT
(1, 2 and 4 physical processors).

Tested a kernel compile with make -jX vmlinux, where X = 1...16. 
Results are the best real time out of five runs.

Hardware: Xeon HT 2GHz

Test cases:
1phys (uniproc)  - UP test11 kernel with HT disabled in the BIOS
1phys w/HT       - SMP test11 kernel on 1 physical proc with HT enabled
1phys w/HT (w26) - same as above with Nick's w26 sched-rollup patch
1phys w/HT (C1)  - same as above with Ingo's C1 patch
2phys            - SMP test11 kernel on 2 physical proc with HT disabled
2phys w/HT       - SMP test11 kernel on 2 physical proc with HT enabled
2phys w/HT (w26) - same as above with Nick's w26 sched-rollup patch
2phys w/HT (C1)  - same as above with Ingo's C1 patch

I can also run the same on four physical processors if there is
interest.

Here are some of the results.  The units are time in seconds so lower is
better.  The complete results and some graphs are available at:
http://nrf.sortof.com/kbench/test11-kbench.html

             j =   1       2       3       4       8
1phys (uniproc)  305.86  306.07  306.47  306.63  306.69
1phys w/HT       311.70  311.01  267.05  267.16  267.62
1phys w/HT (w26) 311.85  311.58  267.20  267.53  267.76
1phys w/HT (C1)  313.72  312.89  268.16  269.17  268.67
2phys            306.00  305.00  161.15  161.31  161.51
2phys w/HT       309.02  308.36  196.91  151.70  145.80
2phys w/HT (w26) 310.65  309.34  167.16  151.37  145.22
2phys w/HT (C1)  310.86  307.90  162.05  152.16  145.82

Same table as above normalized to the j=1 uniproc case to make
comparisons easier.  Lower is still better.

             j =  1     2     3     4     8
1phys (uniproc)  1.00  1.00  1.00  1.00  1.00
1phys w/HT       1.02  1.02  0.87  0.87  0.87
1phys w/HT (w26) 1.02  1.02  0.87  0.87  0.88
1phys w/HT (C1)  1.03  1.02  0.88  0.88  0.88
2phys            1.00  1.00  0.53  0.53  0.53
2phys w/HT       1.01  1.01  0.64  0.50  0.48
2phys w/HT (w26) 1.02  1.01  0.55  0.49  0.47
2phys w/HT (C1)  1.02  1.01  0.53  0.50  0.48

Con Kolivas wrote:
> I was concerned this might happen and indeed the sequential single threaded 
> compile is slightly worse on both HT schedulers. (1)

My test showed the same (assuming -j1 is the same as omitting the 
option).  The slowdown of the -j1 case with HT is 1-3%.

There was not much benefit from either HT or SMP with j=2.  Maximum
speedup was not realized until j=3 for one physical processor and j=5
for 2 physical processors.  This suggests that j should be set to at
least the number of logical processors + 1.
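
A small sketch of picking that value programmatically, assuming a Linux/glibc 
system where sysconf(_SC_NPROCESSORS_ONLN) reports the number of online 
logical CPUs:

/* jobs.c: print a make -j value of "logical CPUs + 1". */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long cpus = sysconf(_SC_NPROCESSORS_ONLN);

	if (cpus < 1)
		cpus = 1;	/* fall back if the count is unavailable */
	printf("make -j%ld vmlinux\n", cpus + 1);
	return 0;
}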

> (3) There is a very real performance advantage in this benchmark to enabling 
> SMP on a HT cpu. However, in the best case it only amounts to 11%. This means 
> that if a specialised HT scheduler patch gained say 10% it would only amount 
> to 1% overall - hardly an exciting amount. 

Agreed, there is certainly an advantage to using HT as long as there are
enough runnable processes (j>=3).  Running additional processes in
parallel (j=16) does not increase performance any further, nor does it
decrease it.  My best-case speedup amounts to 15%, which is right in the
middle of the 10-20% range that Intel talks about.

> Conclusion?
> If you run nothing but kernel compiles all day on a P4 HT, make sure you 
> compile it for SMP ;-)

And make sure you compile with the -jX option, with X >= logical_procs+1.

Nathan



* Re: HT schedulers' performance on single HT processor
  2003-12-14 19:49 ` Nathan Fredrickson
@ 2003-12-14 20:35   ` Adam Kropelin
  2003-12-14 21:15     ` Nathan Fredrickson
  2003-12-15 10:11   ` Con Kolivas
  1 sibling, 1 reply; 9+ messages in thread
From: Adam Kropelin @ 2003-12-14 20:35 UTC (permalink / raw)
  To: Nathan Fredrickson
  Cc: Con Kolivas, Linux Kernel Mailing List, Nick Piggin, Ingo Molnar, sam

On Sun, Dec 14, 2003 at 02:49:24PM -0500, Nathan Fredrickson wrote:
> Same table as above normalized to the j=1 uniproc case to make
> comparisons easier.  Lower is still better.
> 
>              j =  1     2     3     4     8
> 1phys (uniproc)  1.00  1.00  1.00  1.00  1.00
> 1phys w/HT       1.02  1.02  0.87  0.87  0.87
> 1phys w/HT (w26) 1.02  1.02  0.87  0.87  0.88
> 1phys w/HT (C1)  1.03  1.02  0.88  0.88  0.88
> 2phys            1.00  1.00  0.53  0.53  0.53
  ^^^^^                  ^^^^

Ummm...

> 2phys w/HT       1.01  1.01  0.64  0.50  0.48
> 2phys w/HT (w26) 1.02  1.01  0.55  0.49  0.47
> 2phys w/HT (C1)  1.02  1.01  0.53  0.50  0.48

> There was not much benefit from either HT or SMP with j=2.  Maximum
> speedup was not realized until j=3 for one physical processor and j=5
> for 2 physical processors.

This is mighty suspicious. With -j2 did you check to see that there
were indeed two parallel gcc's running? Since -test6 I've found that 
-j2 only results in a single gcc instance. I've seen this on both an
old hacked-up RH 7.3 installation and a brand new RH 9 + updates
installation.
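
One rough way to check is to count the live compiler processes while the 
build runs.  The sketch below just walks /proc; the process name to match 
(cc1 here) is an assumption about how the compile shows up in the process 
list, so adjust it as needed.

/* count-procs.c: count running processes whose command name matches argv[1]. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	const char *name = argc > 1 ? argv[1] : "cc1";
	DIR *proc = opendir("/proc");
	struct dirent *de;
	int count = 0;

	if (!proc) {
		perror("/proc");
		return 1;
	}
	while ((de = readdir(proc)) != NULL) {
		char path[64], comm[64];
		FILE *f;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;	/* not a PID directory */
		snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;	/* process already exited */
		/* field 2 of /proc/PID/stat is the command name in parens */
		if (fscanf(f, "%*d (%63[^)])", comm) == 1 &&
		    strcmp(comm, name) == 0)
			count++;
		fclose(f);
	}
	closedir(proc);
	printf("%d %s process(es) running\n", count, name);
	return 0;
}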

> This suggests that j should be set to at least the number of logical
> processors + 1.

Since -test6 I've found this to be the case for kernel builds, yes. But
I don't think it has anything to do with the scheduler or HT vs SMP
platforms.

--Adam



* Re: HT schedulers' performance on single HT processor
  2003-12-14 20:35   ` Adam Kropelin
@ 2003-12-14 21:15     ` Nathan Fredrickson
  0 siblings, 0 replies; 9+ messages in thread
From: Nathan Fredrickson @ 2003-12-14 21:15 UTC (permalink / raw)
  To: Adam Kropelin
  Cc: Con Kolivas, Linux Kernel Mailing List, Nick Piggin, Ingo Molnar, sam

On Sun, 2003-12-14 at 15:35, Adam Kropelin wrote:
> On Sun, Dec 14, 2003 at 02:49:24PM -0500, Nathan Fredrickson wrote:
> > Same table as above normalized to the j=1 uniproc case to make
> > comparisons easier.  Lower is still better.
> > 
> >              j =  1     2     3     4     8
> > 1phys (uniproc)  1.00  1.00  1.00  1.00  1.00
> > 1phys w/HT       1.02  1.02  0.87  0.87  0.87
> > 1phys w/HT (w26) 1.02  1.02  0.87  0.87  0.88
> > 1phys w/HT (C1)  1.03  1.02  0.88  0.88  0.88
> > 2phys            1.00  1.00  0.53  0.53  0.53
>   ^^^^^                  ^^^^
> 
> Ummm...
> 
> This is mighty suspicious. With -j2 did you check to see that there
> were indeed two parallel gcc's running? Since -test6 I've found that 
> -j2 only results in a single gcc instance. I've seen this on both an
> old hacked-up RH 7.3 installation and a brand new RH 9 + updates
> installation.

I just checked and you're right, the number of compilers that actually
run is j-1, for all j>1.  I assume this is a problem with the parallel
build process, but it does not invalidate these results for comparing
the scheduler performance with different patches.
> 
> > This suggests that j should be set to at least the number of logical
> > processors + 1.
> 
> Since -test6 I've found this to be the case for kernel builds, yes. But
> I don't think it has anything to do with the scheduler or HT vs SMP
> platforms.

The 1-3% performance loss when HT is enabled for -j1 is still very real.

Nathan





* Re: HT schedulers' performance on single HT processor
  2003-12-14 19:49 ` Nathan Fredrickson
  2003-12-14 20:35   ` Adam Kropelin
@ 2003-12-15 10:11   ` Con Kolivas
  2003-12-16  0:16     ` Nathan Fredrickson
  1 sibling, 1 reply; 9+ messages in thread
From: Con Kolivas @ 2003-12-15 10:11 UTC (permalink / raw)
  To: Nathan Fredrickson
  Cc: Linux Kernel Mailing List, Nick Piggin, Ingo Molnar, Adam Kropelin

On Mon, 15 Dec 2003 06:49, Nathan Fredrickson wrote:
> On Fri, 2003-12-12 at 09:57, Con Kolivas wrote:
> > I set out to find how the hyper-thread schedulers would affect the all
> > important kernel compile benchmark on machines that most of us are likely
> > to encounter soon. The single processor HT machine.
>
> I ran some further tests since I have access to some SMP systems with HT
> (1, 2 and 4 physical processors).

> I can also run the same on four physical processors if there is
> interest.

>              j =  1     2     3     4     8
> 1phys (uniproc)  1.00  1.00  1.00  1.00  1.00
> 1phys w/HT       1.02  1.02  0.87  0.87  0.87
> 1phys w/HT (w26) 1.02  1.02  0.87  0.87  0.88
> 1phys w/HT (C1)  1.03  1.02  0.88  0.88  0.88
> 2phys            1.00  1.00  0.53  0.53  0.53
> 2phys w/HT       1.01  1.01  0.64  0.50  0.48
> 2phys w/HT (w26) 1.02  1.01  0.55  0.49  0.47
> 2phys w/HT (C1)  1.02  1.01  0.53  0.50  0.48

The specific HT scheduler benefits only start appearing with more physical 
CPUs, which is to be expected. Just for demonstration, the four-processor run 
would be nice (and will obviously take you less time to do ;). I think it 
will demonstrate the benefit even more. It would be nice to help the most 
common case of one HT CPU, though, instead of hindering it.

Adam already pointed out that your -j2 didn't really get you two jobs. I was 
using a 2.4 kernel tree for the benchmarks and -j2 was giving me two jobs, 
although perhaps something about the C1 patch was preventing the second job 
from ever taking off, which is why the result is the same as one job in my 
benches. Curious.

> > Conclusion?
> > If you run nothing but kernel compiles all day on a P4 HT, make sure you
> > compile it for SMP ;-)
>
> And make sure you compile with the -jX option with X >= logical_procs+1

Of course. For now, on a uniprocessor HT setup, I'd recommend the unmodified 
scheduler in SMP mode.

Con



* Re: HT schedulers' performance on single HT processor
  2003-12-15 10:11   ` Con Kolivas
@ 2003-12-16  0:16     ` Nathan Fredrickson
  2003-12-16  0:55       ` Con Kolivas
  0 siblings, 1 reply; 9+ messages in thread
From: Nathan Fredrickson @ 2003-12-16  0:16 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Linux Kernel Mailing List, Nick Piggin, Ingo Molnar, Adam Kropelin

On Mon, 2003-12-15 at 05:11, Con Kolivas wrote:
> On Mon, 15 Dec 2003 06:49, Nathan Fredrickson wrote:
> > I can also run the same on four physical processors if there is
> > interest.
> 
> The specific HT scheduler benefits only start appearing with more physical 
> cpus which is to be expected. Just for demonstration the four processor run 
> would be nice (and obviously take you less time to do ;). I think it will 
> demonstrate it even more. It would be nice to help the most common case of 
> one HT cpu, though, instead of hindering it.

Here are some results on four physical processors.  Unfortunately my
quad systems are a different speed from the dual systems used for the
previous tests, so the results are not directly comparable.

Same test as before, a 2.6.0 kernel compile with make -jX vmlinux. 
Results are the best real time out of five runs.
Hardware: Xeon HT 1.4GHz

Test cases:
1phys UP      - UP test11 kernel with HT disabled in the BIOS
4phys SMP     - SMP test11 kernel on 4 physical procs with HT disabled
4phys HT      - SMP test11 kernel on 4 physical procs with HT enabled
4phys HT (w26)- same as above with Nick's w26 sched-rollup patch
4phys HT (C1) - same as above with Ingo's C1 patch

Here are the results normalized to the X=1 UP case to make comparisons
easier.  Lower is better.

          X =  1     2     3     4     5     6     7     8     9    16
1phys UP      1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
4phys SMP     1.00  0.99  0.51  0.35  0.27  0.27  0.27  0.27  0.27  0.27
4phys HT      1.01  1.00  0.55  0.40  0.33  0.29  0.27  0.26  0.25  0.26
4phys HT(w26) 1.01  1.01  0.54  0.37  0.31  0.27  0.26  0.26  0.26  0.26
4phys HT(C1)  1.01  1.00  0.52  0.36  0.29  0.28  0.27  0.26  0.25  0.26

Interesting that the overhead due to HT in the X=1 column is only 1%
with 4 physical processors.  It was 1-3% before with 1 or 2 physical
processors.

In the partial-load columns, where there are fewer compiler processes than
logical CPUs (X=3,4,5,6,7), it appears that both patches are doing a
better job of scheduling than the standard scheduler.  At full load (X>=8)
all three HT test cases perform about equally and beat standard SMP by
1-2%.

Hope these results are helpful.  I'd be happy to run more cases and/or
other patches.

Nathan



* Re: HT schedulers' performance on single HT processor
  2003-12-16  0:16     ` Nathan Fredrickson
@ 2003-12-16  0:55       ` Con Kolivas
  2003-12-16  3:57         ` Nathan Fredrickson
  0 siblings, 1 reply; 9+ messages in thread
From: Con Kolivas @ 2003-12-16  0:55 UTC (permalink / raw)
  To: Nathan Fredrickson; +Cc: Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1362 bytes --]

Quoting Nathan Fredrickson <8nrf@qlink.queensu.ca>:
>           X =  1     2     3     4     5     6     7     8     9    16
> 1phys UP      1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
> 4phys SMP     1.00  0.99  0.51  0.35  0.27  0.27  0.27  0.27  0.27  0.27
> 4phys HT      1.01  1.00  0.55  0.40  0.33  0.29  0.27  0.26  0.25  0.26
> 4phys HT(w26) 1.01  1.01  0.54  0.37  0.31  0.27  0.26  0.26  0.26  0.26
> 4phys HT(C1)  1.01  1.00  0.52  0.36  0.29  0.28  0.27  0.26  0.25  0.26
> 
> Interesting that the overhead due to HT in the X=1 column is only 1%
> with 4 physical processors.  It was 1-3% before with 1 or 2 physical
> processors.
> 
> In the partial load columns where there are less compiler processes than
> logical CPUs (X=3,4,5,6,7), it appears that both patches are doing a
> better job scheduling than the standard scheduler.  At full load (X=>8)
> all three HT test cases perform about equally and beat standard SMP by
> 1-2%.
> 
> Hope these results are helpful.  I'd be happy to run more cases and/or
> other patches.

(cc list stripped)

Well, since you asked... I've been looking for someone with more HT CPUs to 
give a much simpler approach a try. Here's a sample patch for vanilla test11 
with HT. This one actually helps UP HT performance ever so slightly, and I'd 
be curious to see if it does anything on more CPUs.

Con

[-- Attachment #2: patch-test11-ht-3 --]
[-- Type: application/octet-stream, Size: 2612 bytes --]

--- linux-2.6.0-test11-base/kernel/sched.c	2003-11-24 22:18:56.000000000 +1100
+++ linux-2.6.0-test11-ht3/kernel/sched.c	2003-12-15 23:38:33.250059542 +1100
@@ -204,6 +204,7 @@ struct runqueue {
 	struct mm_struct *prev_mm;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_cpu_load[NR_CPUS];
+	unsigned long cpu;
 #ifdef CONFIG_NUMA
 	atomic_t *node_nr_running;
 	int prev_node_load[MAX_NUMNODES];
@@ -221,6 +222,10 @@ static DEFINE_PER_CPU(struct runqueue, r
 #define task_rq(p)		cpu_rq(task_cpu(p))
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 
+#define ht_active		(cpu_has_ht && smp_num_siblings > 1)
+#define ht_siblings(cpu1, cpu2)	(ht_active && \
+	cpu_sibling_map[(cpu1)] == (cpu2))
+
 /*
  * Default context-switch locking:
  */
@@ -1157,8 +1162,9 @@ can_migrate_task(task_t *tsk, runqueue_t
 {
 	unsigned long delta = sched_clock() - tsk->timestamp;
 
-	if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks)))
-		return 0;
+	if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks)) &&
+		!ht_siblings(this_cpu, task_cpu(tsk)))
+			return 0;
 	if (task_running(rq, tsk))
 		return 0;
 	if (!cpu_isset(this_cpu, tsk->cpus_allowed))
@@ -1193,15 +1199,23 @@ static void load_balance(runqueue_t *thi
 	imbalance /= 2;
 
 	/*
+	 * For hyperthread siblings take tasks from the active array
+	 * to get cache-warm tasks since they share caches.
+	 */
+	if (ht_siblings(this_cpu, busiest->cpu))
+		array = busiest->active;
+	/*
 	 * We first consider expired tasks. Those will likely not be
 	 * executed in the near future, and they are most likely to
 	 * be cache-cold, thus switching CPUs has the least effect
 	 * on them.
 	 */
-	if (busiest->expired->nr_active)
-		array = busiest->expired;
-	else
-		array = busiest->active;
+	else {
+		if (busiest->expired->nr_active)
+			array = busiest->expired;
+		else
+			array = busiest->active;
+	}
 
 new_array:
 	/* Start searching at priority 0: */
@@ -1212,9 +1226,16 @@ skip_bitmap:
 	else
 		idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
 	if (idx >= MAX_PRIO) {
-		if (array == busiest->expired) {
-			array = busiest->active;
-			goto new_array;
+		if (ht_siblings(this_cpu, busiest->cpu)){
+			if (array == busiest->active) {
+				array = busiest->expired;
+				goto new_array;
+			}
+		} else {
+			if (array == busiest->expired) {
+				array = busiest->active;
+				goto new_array;
+			}
 		}
 		goto out_unlock;
 	}
@@ -2812,6 +2833,7 @@ void __init sched_init(void)
 		prio_array_t *array;
 
 		rq = cpu_rq(i);
+		rq->cpu = (unsigned long)(i);
 		rq->active = rq->arrays;
 		rq->expired = rq->arrays + 1;
 		spin_lock_init(&rq->lock);


* Re: HT schedulers' performance on single HT processor
  2003-12-16  0:55       ` Con Kolivas
@ 2003-12-16  3:57         ` Nathan Fredrickson
  0 siblings, 0 replies; 9+ messages in thread
From: Nathan Fredrickson @ 2003-12-16  3:57 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Linux Kernel Mailing List

On Mon, 2003-12-15 at 19:55, Con Kolivas wrote:
> Well since you asked... I've been looking for someone with more HT cpus to give
> a much simpler approach a try. Here's a sample patch for vanilla test11 with
> HT. This one actually helps UP HT performance ever so slightly and I'd be
> curious to see if it does anything on more cpus.

Not much change with this patch.  The new result is most similar to
vanilla test11 with HT.  Both perform worse than no-HT under partial
load.  Here are the results from earlier with the new test case
appended:

          X =  1     2     3     4     5     6     7     8     9    16
1phys UP      1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
4phys SMP     1.00  0.99  0.51  0.35  0.27  0.27  0.27  0.27  0.27  0.27
4phys HT      1.01  1.00  0.55  0.40  0.33  0.29  0.27  0.26  0.25  0.26
4phys HT(w26) 1.01  1.01  0.54  0.37  0.31  0.27  0.26  0.26  0.26  0.26
4phys HT(C1)  1.01  1.00  0.52  0.36  0.29  0.28  0.27  0.26  0.25  0.26
4phys HT(ht3) 1.01  1.00  0.53  0.39  0.33  0.29  0.27  0.26  0.26  0.26

Nathan



* Re: HT schedulers' performance on single HT processor
  2003-12-12 14:57 HT schedulers' performance on single HT processor Con Kolivas
  2003-12-14 19:49 ` Nathan Fredrickson
@ 2004-01-03 17:56 ` Bill Davidsen
  1 sibling, 0 replies; 9+ messages in thread
From: Bill Davidsen @ 2004-01-03 17:56 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux kernel mailing list, Nick Piggin, Ingo Molnar

Con Kolivas wrote:
> I set out to find how the hyper-thread schedulers would affect the all 
> important kernel compile benchmark on machines that most of us are likely to 
> encounter soon. The single processor HT machine.
> 
> Usual benchmark precautions taken; best of five runs (curiously the fastest 
> was almost always the second run). Although for confirmation I really did 
> this twice.
> 
> Tested a kernel compile with make vmlinux, make -j2 and make -j8. 
> 
> make vmlinux - tests to ensure the sequential single threaded make doesn't 
> suffer as a result of these tweaks
> 
> make -j2 vmlinux - tests to see how well wasted idle time is avoided
> 
> make -j8 vmlinux - maximum throughput test (4x nr_cpus seems to be ceiling for 
> this).
> 
> Hardware: P4 HT 3.066
> 
> Legend:
> UP - Uniprocessor 2.6.0-test11 kernel
> SMP - SMP kernel
> C1 - With Ingo's C1 hyperthread patch
> w26 - With Nick's w26 sched-rollup (hyperthread included)
> 
> make vmlinux
> kernel	time
> UP	65.96
> SMP	65.80
> C1	66.54
> w26	66.25
> 
> I was concerned this might happen and indeed the sequential single threaded 
> compile is slightly worse on both HT schedulers. (1)
> 
> make -j2 vmlinux
> kernel	time
> UP	65.17
> SMP	57.77
> C1	66.01
> w26	57.94
> 
> Shows the smp kernel nicely utilises HT whereas the UP kernel doesn't. The C1 
> result was very repeatable and I was unable to get it lower than this.(2)
> 
> make -j8 vmlinux
> kernel	time
> UP	65.00
> SMP	57.85
> C1	58.25
> w26	57.94

If you could run one more test, do the compile with -pipe set in the 
top-level Makefile. I don't have play access to an HT uniprocessor; the only 
machines available to me at the moment are SMP, and production at that.

I did try it just for grins on a non-HT uni and saw this:

opt		real	user	sys	idle
-j1		406.2	308.1	19.0	79.1
-j1 -pipe	398.6	308.2	19.0	71.4
-j3		391.6	308.3	19.0	64.3
-j3 -pipe	388.7	308.4	19.0	61.3

P4 2.4GHz, 256MB, compiling 2.5.47-ac6 with just "make". Using -pipe 
*may* allow both siblings to cooperate better.

I assume that CPU affinity should apply to all siblings in a package?
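
For anyone experimenting with that, here is a minimal sketch of pinning a 
process to an explicit set of logical CPUs with sched_setaffinity(2).  Which 
CPU numbers are HT siblings of the same package is machine-specific, so the 
choice of CPUs 0 and 1 below is purely an assumption for illustration; check 
/proc/cpuinfo first.

/* pin.c: pin the calling process (and its future children) to CPUs 0 and 1. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);	/* first sibling */
	CPU_SET(1, &mask);	/* second sibling (assumed same package) */

	if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
		perror("sched_setaffinity");
		return 1;
	}
	/* ... exec the workload here; children inherit the mask ... */
	return 0;
}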

-- 
bill davidsen <davidsen@tmr.com>
   CTO TMR Associates, Inc
   Doing interesting things with small computers since 1979

