linux-kernel.vger.kernel.org archive mirror
* RE: [PATCH][RFC] make cpu_sibling_map a cpumask_t
@ 2003-12-09  4:54 Nakajima, Jun
  2003-12-09 19:42 ` William Lee Irwin III
  0 siblings, 1 reply; 16+ messages in thread
From: Nakajima, Jun @ 2003-12-09  4:54 UTC (permalink / raw)
  To: Zwane Mwaikambo, Nick Piggin
  Cc: Ingo Molnar, Anton Blanchard, Linus Torvalds, Andrew Morton,
	Rusty Russell, linux-kernel

> > It turns the cpu_sibling_map array from an array of cpus to an array
> > of cpumasks. This allows handling of more than 2 siblings, not that
> > I know of any immediate need.
We need this sooner or later.

	Jun
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Zwane Mwaikambo
> Sent: Monday, December 08, 2003 12:51 PM
> To: Nick Piggin
> Cc: Ingo Molnar; Anton Blanchard; Linus Torvalds; Andrew Morton; Rusty
> Russell; linux-kernel
> Subject: Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
> 
> On Mon, 8 Dec 2003, Nick Piggin wrote:
> 
> > It turns the cpu_sibling_map array from an array of cpus to an array
> > of cpumasks. This allows handling of more than 2 siblings, not that
> > I know of any immediate need.
> >
> > I think it generalises cpu_sibling_map sufficiently that it can become
> > generic code. This would allow architecture specific code to build the
> > sibling map, and then Ingo's or my HT implementations to build their
> > scheduling descriptions in generic code.
> >
> > I'm not aware of any reason why the kernel should not become generally
> > SMT aware. It is sufficiently different to SMP that it is worth
> > specialising it, although I am only aware of P4 and POWER5
> > implementations.
> 
> Generally I agree, it's fairly unintrusive and appears to clean up some
> things. Perhaps Andrew will take it a bit later after release.
> 
> > P.S.
> > I have an alternative to Ingo's HT scheduler which basically does
> > the same thing. It is showing a 20% elapsed time improvement with a
> > make -j3 on a 2xP4 Xeon (4 logical CPUs).
> 
> -j3 is an odd number, what does -j4, -j8, -j16 look like?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-09  4:54 [PATCH][RFC] make cpu_sibling_map a cpumask_t Nakajima, Jun
@ 2003-12-09 19:42 ` William Lee Irwin III
  0 siblings, 0 replies; 16+ messages in thread
From: William Lee Irwin III @ 2003-12-09 19:42 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: Zwane Mwaikambo, Nick Piggin, Ingo Molnar, Anton Blanchard,
	Linus Torvalds, Andrew Morton, Rusty Russell, linux-kernel

On Mon, Dec 08, 2003 at 08:54:31PM -0800, Nakajima, Jun wrote:
> We need this sooner or later.

I somehow knew you were going to say that. =)


-- wli


* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-09 13:36   ` Nick Piggin
@ 2003-12-11 21:41     ` bill davidsen
  0 siblings, 0 replies; 16+ messages in thread
From: bill davidsen @ 2003-12-11 21:41 UTC (permalink / raw)
  To: linux-kernel

In article <3FD5CFE1.8080800@cyberone.com.au>,
Nick Piggin  <piggin@cyberone.com.au> wrote:

| Well if (something like) cpu_sibling_map is to become
| architecture-independent code, it should probably get something like
| cpu_to_package, package_to_cpumask sort of things like NUMA has.
| 
| I don't know if SMT should become just another level of NUMA because
| it's not NUMA of course. For the scheduler it seems to make sense because
| SMT/SMP/NUMA basically want slight variations of the same thing.
| 
| Possibly as you say slab cache could be SMTed, but I'm not convinced
| that SMT and NUMA should become inseparable. Anyway I doubt 2.6 will
| see a general multi-level NUMA framework.

It would seem that what concerns all of these levels is the process
migration cost, so perhaps it will be possible in 2.6 after all. Right
now a lot of clever people are actively working on scheduling, so now
is the time for a breakthrough which gives a big boost. Absent that I
predict all of the clever folks and most of us "build a bunch of
kernels and play" types will conclude the low-hanging fruit has all
been picked and move on. Scheduling will be stable again.

Before that perhaps one of you will suddenly see some way to make this
all very simple and elegant, so everyone can look and say "of course"
and everything will work optimally.

What we have in all of these patches is a big gain over base, so the
efforts have produced visible gains.
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-08 23:46 ` Rusty Russell
@ 2003-12-09 13:36   ` Nick Piggin
  2003-12-11 21:41     ` bill davidsen
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2003-12-09 13:36 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Ingo Molnar, Anton Blanchard, Linus Torvalds, Andrew Morton,
	Zwane Mwaikambo, linux-kernel



Rusty Russell wrote:

>In message <3FD3FD52.7020001@cyberone.com.au> you write:
>
>>I'm not aware of any reason why the kernel should not become generally
>>SMT aware. It is sufficiently different to SMP that it is worth
>>specialising it, although I am only aware of P4 and POWER5 implementations.
>>
>
>To do it properly, it should be done within the NUMA framework.  That
>would allow generic slab cache optimizations, etc.  We'd really need a
>multi-level NUMA framework for this though.
>
>But patch looks fine.
>

Well if (something like) cpu_sibling_map is to become
architecture-independent code, it should probably get something like
cpu_to_package, package_to_cpumask sort of things like NUMA has.

I don't know if SMT should become just another level of NUMA because
it's not NUMA of course. For the scheduler it seems to make sense because
SMT/SMP/NUMA basically want slight variations of the same thing.

Possibly as you say slab cache could be SMTed, but I'm not convinced
that SMT and NUMA should become inseparable. Anyway I doubt 2.6 will
see a general multi-level NUMA framework.
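Once the sibling map is a mask, the cpu_to_package / package_to_cpumask helpers suggested above could reduce to little more than first-bit lookups. A hypothetical userspace sketch (the helper names come from the suggestion above and are not an existing kernel API; a uint64_t stands in for cpumask_t and the 2x2 topology is invented):

```c
#include <assert.h>
#include <stdint.h>

/* Sibling mask per logical CPU: bit i is set when CPU i shares the
 * physical package. Invented topology: packages {0,1} and {2,3}. */
static const uint64_t cpu_sibling_mask[4] = { 0x3, 0x3, 0xc, 0xc };

/* first_cpu(): index of the lowest set bit (mask must be nonzero). */
static int first_cpu(uint64_t mask)
{
	int i = 0;
	while (!(mask & 1)) {
		mask >>= 1;
		i++;
	}
	return i;
}

/* Package of a CPU, identified by its lowest-numbered sibling -- the
 * same convention the patch adopts for CPU_TO_PACKAGEINDEX. */
int cpu_to_package(int cpu)
{
	return first_cpu(cpu_sibling_mask[cpu]);
}

/* All CPUs in a package: the sibling mask of its first CPU. */
uint64_t package_to_cpumask(int package)
{
	return cpu_sibling_mask[package];
}
```

With this convention a package needs no separate id table; the sibling masks alone carry the topology.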

>
>>I have an alternative to Ingo's HT scheduler which basically does
>>the same thing. It is showing a 20% elapsed time improvement with a
>>make -j3 on a 2xP4 Xeon (4 logical CPUs).
>>
>
>Me too.
>
>My main argument with Ingo's patch (last I looked) was technical: the
>code becomes clearer if the structures are explicitly split into the
>"per-runqueue stuff" and the "per-cpu stuff" (containing a my_runqueue
>pointer).
>

Ah, so yours isn't an earlier/later version of the same implementation.

>
>I'd be very interested in your patch though, Nick.
>

I've just had a hard-to-trace bug report with it, but there is some
interest in it now, so I'll work on sorting it out and getting
something out for testing and review. Sorry for the delay.




* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-08 23:08   ` Nick Piggin
@ 2003-12-09  0:14     ` Anton Blanchard
  0 siblings, 0 replies; 16+ messages in thread
From: Anton Blanchard @ 2003-12-09  0:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Rusty Russell,
	Zwane Mwaikambo, linux-kernel


> I don't have an architecture-independent way to build SMT scheduling
> descriptions, although with the cpu_sibling_map change, you can copy
> and paste the code for the P4 if you are able to build a cpu_sibling_map.

I can build a suitable table; I did that for Ingo's HT scheduler.

> I'll just have to add some bits so SMT and NUMA work together, which
> I will be unable to test. I'll get back to you with some code.

Thanks.

Anton


* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-08 23:36       ` Ingo Molnar
@ 2003-12-08 23:58         ` Nick Piggin
  0 siblings, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2003-12-08 23:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zwane Mwaikambo, Anton Blanchard, Linus Torvalds, Andrew Morton,
	Rusty Russell, linux-kernel



Ingo Molnar wrote:

>the thing that makes balancing-only driven SMT possible with current 2.6.0
>is the overbalancing we do (to have cross-CPU fairness). Previous
>iterations of the O(1) scheduler (all the 2.4 backports) didn't do this, so
>all the add-on SMT schedulers tended to have a problem achieving good SMT
>distribution. Now that we balance more aggressively, this isn't such a big
>issue anymore.
>

I'm glad you like my idea. I do like your shared runqueue approach;
it's conceptually very elegant IMO, but the implementation looks difficult.

You'll have to have a look at my patch; it is the "scheduling domains"
thing. Basically there are no ifdefs for NUMA or SMT in the scheduler;
the balancing depends on SMP and behaves according to a structure
describing the desired properties.

I have to document it a little more though.
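Extrapolating from the arch_init_sched_domains() hunk in the patch (which shows span, flags with SD_FLAG_WAKE | SD_FLAG_IDLE, cache_hot_time, and cache_nice_tries), such a descriptor might look roughly like the sketch below. This is a guess at the shape, not the actual patch: the flag values, the parent pointer, and init_domains() are invented, and a uint64_t stands in for cpumask_t.

```c
#include <assert.h>
#include <stdint.h>

#define SD_FLAG_WAKE 0x1 /* flag values invented for the sketch */
#define SD_FLAG_IDLE 0x2

/* One balancing level. The balancer walks these generically instead
 * of carrying SMT/NUMA special cases in the scheduler itself. */
struct sched_domain {
	uint64_t span;               /* CPUs this level balances across */
	unsigned int flags;          /* e.g. SD_FLAG_WAKE | SD_FLAG_IDLE */
	unsigned long cache_hot_time;
	int cache_nice_tries;
	struct sched_domain *parent; /* next-wider level, NULL at the top */
};

/* Build a per-CPU SMT level under a single SMP level: each CPU's
 * bottom domain spans its sibling mask, as in the patched code. */
void init_domains(struct sched_domain *cpu_dom, struct sched_domain *smp_dom,
		  const uint64_t *sibling_mask, int ncpus)
{
	smp_dom->span = (ncpus >= 64) ? ~UINT64_C(0)
				      : (UINT64_C(1) << ncpus) - 1;
	smp_dom->flags = 0;
	smp_dom->cache_hot_time = 0;
	smp_dom->cache_nice_tries = 0;
	smp_dom->parent = 0;

	for (int i = 0; i < ncpus; i++) {
		cpu_dom[i].span = sibling_mask[i];
		cpu_dom[i].flags = SD_FLAG_WAKE | SD_FLAG_IDLE;
		cpu_dom[i].cache_hot_time = 0;
		cpu_dom[i].cache_nice_tries = 0;
		cpu_dom[i].parent = smp_dom;
	}
}
```

The point of the shape is that a NUMA machine would simply add another level above the SMP one, with no scheduler ifdefs.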

>
>So I tend to lean towards your SMT patch; it's much easier to maintain
>than my runqueue-sharing approach. The performance is equivalent as far as
>I can see (around 20%, and a stabilization of the distribution of
>runtimes) - but please also boot my patch and repeat the exact same
>measurements you did.
>

I will. I might not get time today, but I'll test various old
"favourites" like kernbench, hackbench, [dt]bench with both versions.




* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-08  4:25 Nick Piggin
                   ` (3 preceding siblings ...)
  2003-12-08 20:51 ` Zwane Mwaikambo
@ 2003-12-08 23:46 ` Rusty Russell
  2003-12-09 13:36   ` Nick Piggin
  4 siblings, 1 reply; 16+ messages in thread
From: Rusty Russell @ 2003-12-08 23:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Anton Blanchard, Linus Torvalds, Andrew Morton,
	Zwane Mwaikambo, linux-kernel

In message <3FD3FD52.7020001@cyberone.com.au> you write:
> I'm not aware of any reason why the kernel should not become generally
> SMT aware. It is sufficiently different to SMP that it is worth
> specialising it, although I am only aware of P4 and POWER5 implementations.

To do it properly, it should be done within the NUMA framework.  That
would allow generic slab cache optimizations, etc.  We'd really need a
multi-level NUMA framework for this though.

But patch looks fine.

> I have an alternative to Ingo's HT scheduler which basically does
> the same thing. It is showing a 20% elapsed time improvement with a
> make -j3 on a 2xP4 Xeon (4 logical CPUs).

Me too.

My main argument with Ingo's patch (last I looked) was technical: the
code becomes clearer if the structures are explicitly split into the
"per-runqueue stuff" and the "per-cpu stuff" (containing a my_runqueue
pointer).

I'd be very interested in your patch though, Nick.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-08 23:17     ` Nick Piggin
@ 2003-12-08 23:36       ` Ingo Molnar
  2003-12-08 23:58         ` Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2003-12-08 23:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Zwane Mwaikambo, Anton Blanchard, Linus Torvalds, Andrew Morton,
	Rusty Russell, linux-kernel


the thing that makes balancing-only driven SMT possible with current 2.6.0
is the overbalancing we do (to have cross-CPU fairness). Previous
iterations of the O(1) scheduler (all the 2.4 backports) didn't do this, so
all the add-on SMT schedulers tended to have a problem achieving good SMT
distribution. Now that we balance more aggressively, this isn't such a big
issue anymore.

So I tend to lean towards your SMT patch; it's much easier to maintain
than my runqueue-sharing approach. The performance is equivalent as far as
I can see (around 20%, and a stabilization of the distribution of
runtimes) - but please also boot my patch and repeat the exact same
measurements you did.

	Ingo


* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-08 20:55   ` Ingo Molnar
@ 2003-12-08 23:17     ` Nick Piggin
  2003-12-08 23:36       ` Ingo Molnar
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2003-12-08 23:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zwane Mwaikambo, Anton Blanchard, Linus Torvalds, Andrew Morton,
	Rusty Russell, linux-kernel



Ingo Molnar wrote:

>On Mon, 8 Dec 2003, Zwane Mwaikambo wrote:
>
>
>>>P.S.
>>>I have an alternative to Ingo's HT scheduler which basically does
>>>the same thing. It is showing a 20% elapsed time improvement with a
>>>make -j3 on a 2xP4 Xeon (4 logical CPUs).
>>>
>>-j3 is an odd number, what does -j4, -j8, -j16 look like?
>>
>
>let me guess: -j3 is the one that gives the highest performance boost on a 
>2-way/4-logical P4 box?
>
Yep! Note: -j3 now gives you 2 compute "threads" in the build. Basically
(as you know, Ingo), you'll see the best improvement when there is the
most opportunity for one physical CPU to be saturated and the other idle.

[piggin@ragnarok piggin]$ grep user compile-test11
319.56user 30.04system 2:53.61elapsed 201%CPU
320.50user 30.33system 2:54.04elapsed 201%CPU
318.44user 30.94system 2:53.44elapsed 201%CPU
[piggin@ragnarok piggin]$ grep user compile-w26p20
268.04user 27.48system 2:26.94elapsed 201%CPU
267.36user 27.69system 2:26.10elapsed 201%CPU
266.66user 27.63system 2:25.68elapsed 202%CPU

Note here that both sets of results saturate 2 CPUs. My changes (and
Ingo's too, of course) try to ensure that those 2 CPUs are different
physical ones, which is why they are getting more work done.
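The placement goal described here can be modeled crudely in userspace: given the sibling masks and a mask of busy CPUs, prefer an idle CPU whose whole physical package is idle. This sketch is invented for illustration (it is not either scheduler's actual code; a uint64_t stands in for cpumask_t):

```c
#include <assert.h>
#include <stdint.h>

/* Pick an idle CPU, preferring one whose entire physical package is
 * idle -- the placement both HT schedulers aim for with 2 runnable
 * tasks on a 2x2 box. Returns -1 when no CPU is idle at all. */
int pick_cpu(const uint64_t *sibling_mask, uint64_t busy, int ncpus)
{
	int any_idle = -1;

	for (int cpu = 0; cpu < ncpus; cpu++) {
		if (busy & (UINT64_C(1) << cpu))
			continue;               /* this CPU is busy */
		if (!(busy & sibling_mask[cpu]))
			return cpu;             /* whole package idle: best */
		if (any_idle < 0)
			any_idle = cpu;         /* idle thread, busy package */
	}
	return any_idle;
}
```

With packages {0,1} and {2,3} and CPU 0 busy, the model places the second task on the other package rather than on CPU 0's sibling, which is exactly why -j3 shows the largest win.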

>
>for the SMT/HT scheduler to give any benefits there has to be idle time -
>so that SMT decisions actually make a difference.
>
>the higher -j values are only useful to make sure that SMT scheduling does
>not introduce regressions - performance should be the same as the pure
>SMP/NUMA scheduler's performance.
>

I haven't run your HT patch, Ingo, but of course I'll have to get more
comprehensive benchmarks before we go anywhere with this.




* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-08 15:59 ` Anton Blanchard
@ 2003-12-08 23:08   ` Nick Piggin
  2003-12-09  0:14     ` Anton Blanchard
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2003-12-08 23:08 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Rusty Russell,
	Zwane Mwaikambo, linux-kernel



Anton Blanchard wrote:

> 
>
>>I'm not aware of any reason why the kernel should not become generally
>>SMT aware. It is sufficiently different to SMP that it is worth
>>specialising it, although I am only aware of P4 and POWER5 implementations.
>>
>
>I agree, SMT is likely to become more popular in the coming years.
>
>
>>I have an alternative to Ingo's HT scheduler which basically does
>>the same thing. It is showing a 20% elapsed time improvement with a
>>make -j3 on a 2xP4 Xeon (4 logical CPUs).
>>
>>Before Ingo's is merged, I would like to discuss the pros and cons of
>>both approaches with those interested. If Ingo's is accepted I should
>>still be able to port my other SMP/NUMA improvements on top of it.
>>
>
>Sounds good, have you got anything to test? I can throw it on a POWER5.
>

It would be great to get some testing on another architecture.

I don't have an architecture-independent way to build SMT scheduling
descriptions, although with the cpu_sibling_map change, you can copy
and paste the code for the P4 if you are able to build a cpu_sibling_map.

I'll just have to add some bits so SMT and NUMA work together, which
I will be unable to test. I'll get back to you with some code.




* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-08 20:51 ` Zwane Mwaikambo
@ 2003-12-08 20:55   ` Ingo Molnar
  2003-12-08 23:17     ` Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2003-12-08 20:55 UTC (permalink / raw)
  To: Zwane Mwaikambo
  Cc: Nick Piggin, Anton Blanchard, Linus Torvalds, Andrew Morton,
	Rusty Russell, linux-kernel


On Mon, 8 Dec 2003, Zwane Mwaikambo wrote:

> > P.S.
> > I have an alternative to Ingo's HT scheduler which basically does
> > the same thing. It is showing a 20% elapsed time improvement with a
> > make -j3 on a 2xP4 Xeon (4 logical CPUs).
> 
> -j3 is an odd number, what does -j4, -j8, -j16 look like?

let me guess: -j3 is the one that gives the highest performance boost on a 
2-way/4-logical P4 box?

for the SMT/HT scheduler to give any benefits there has to be idle time -
so that SMT decisions actually make a difference.

the higher -j values are only useful to make sure that SMT scheduling does
not introduce regressions - performance should be the same as the pure
SMP/NUMA scheduler's performance.

	Ingo


* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-08  4:25 Nick Piggin
                   ` (2 preceding siblings ...)
  2003-12-08 20:38 ` Ingo Molnar
@ 2003-12-08 20:51 ` Zwane Mwaikambo
  2003-12-08 20:55   ` Ingo Molnar
  2003-12-08 23:46 ` Rusty Russell
  4 siblings, 1 reply; 16+ messages in thread
From: Zwane Mwaikambo @ 2003-12-08 20:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Anton Blanchard, Linus Torvalds, Andrew Morton,
	Rusty Russell, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1078 bytes --]

On Mon, 8 Dec 2003, Nick Piggin wrote:

> It turns the cpu_sibling_map array from an array of cpus to an array
> of cpumasks. This allows handling of more than 2 siblings, not that
> I know of any immediate need.
>
> I think it generalises cpu_sibling_map sufficiently that it can become
> generic code. This would allow architecture specific code to build the
> sibling map, and then Ingo's or my HT implementations to build their
> scheduling descriptions in generic code.
>
> I'm not aware of any reason why the kernel should not become generally
> SMT aware. It is sufficiently different to SMP that it is worth
> specialising it, although I am only aware of P4 and POWER5 implementations.

Generally I agree, it's fairly unintrusive and appears to clean up some
things. Perhaps Andrew will take it a bit later after release.

> P.S.
> I have an alternative to Ingo's HT scheduler which basically does
> the same thing. It is showing a 20% elapsed time improvement with a
> make -j3 on a 2xP4 Xeon (4 logical CPUs).

-j3 is an odd number, what does -j4, -j8, -j16 look like?

[-- Attachment #2: Type: TEXT/PLAIN, Size: 7229 bytes --]

 linux-2.6-npiggin/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c |    9 --
 linux-2.6-npiggin/arch/i386/kernel/io_apic.c                 |   19 ++--
 linux-2.6-npiggin/arch/i386/kernel/smpboot.c                 |   46 +++++------
 linux-2.6-npiggin/arch/i386/oprofile/op_model_p4.c           |    7 -
 linux-2.6-npiggin/include/asm-i386/smp.h                     |    2 
 5 files changed, 37 insertions(+), 46 deletions(-)

diff -puN include/asm-i386/smp.h~sibling-map-to-cpumask include/asm-i386/smp.h
--- linux-2.6/include/asm-i386/smp.h~sibling-map-to-cpumask	2003-12-08 14:42:28.000000000 +1100
+++ linux-2.6-npiggin/include/asm-i386/smp.h	2003-12-08 14:42:28.000000000 +1100
@@ -34,7 +34,7 @@
 extern void smp_alloc_memory(void);
 extern int pic_mode;
 extern int smp_num_siblings;
-extern int cpu_sibling_map[];
+extern cpumask_t cpu_sibling_map[];
 
 extern void smp_flush_tlb(void);
 extern void smp_message_irq(int cpl, void *dev_id, struct pt_regs *regs);
diff -puN arch/i386/kernel/smpboot.c~sibling-map-to-cpumask arch/i386/kernel/smpboot.c
--- linux-2.6/arch/i386/kernel/smpboot.c~sibling-map-to-cpumask	2003-12-08 14:42:28.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/kernel/smpboot.c	2003-12-08 15:07:48.000000000 +1100
@@ -933,7 +933,7 @@ static int boot_cpu_logical_apicid;
 /* Where the IO area was mapped on multiquad, always 0 otherwise */
 void *xquad_portio;
 
-int cpu_sibling_map[NR_CPUS] __cacheline_aligned;
+cpumask_t cpu_sibling_map[NR_CPUS] __cacheline_aligned;
 
 static void __init smp_boot_cpus(unsigned int max_cpus)
 {
@@ -1079,32 +1079,28 @@ static void __init smp_boot_cpus(unsigne
 	Dprintk("Boot done.\n");
 
 	/*
-	 * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so
-	 * that we can tell the sibling CPU efficiently.
+	 * construct cpu_sibling_map[], so that we can tell sibling CPUs
+	 * efficiently.
 	 */
-	if (cpu_has_ht && smp_num_siblings > 1) {
-		for (cpu = 0; cpu < NR_CPUS; cpu++)
-			cpu_sibling_map[cpu] = NO_PROC_ID;
-
-		for (cpu = 0; cpu < NR_CPUS; cpu++) {
-			int 	i;
-			if (!cpu_isset(cpu, cpu_callout_map))
-				continue;
+	for (cpu = 0; cpu < NR_CPUS; cpu++)
+		cpu_sibling_map[cpu] = CPU_MASK_NONE;
 
-			for (i = 0; i < NR_CPUS; i++) {
-				if (i == cpu || !cpu_isset(i, cpu_callout_map))
-					continue;
-				if (phys_proc_id[cpu] == phys_proc_id[i]) {
-					cpu_sibling_map[cpu] = i;
-					printk("cpu_sibling_map[%d] = %d\n", cpu, cpu_sibling_map[cpu]);
-					break;
-				}
-			}
-			if (cpu_sibling_map[cpu] == NO_PROC_ID) {
-				smp_num_siblings = 1;
-				printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu);
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		int siblings = 0;
+		int 	i;
+		if (!cpu_isset(cpu, cpu_callout_map))
+			continue;
+
+		for (i = 0; i < NR_CPUS; i++) {
+			if (!cpu_isset(i, cpu_callout_map))
+				continue;
+			if (phys_proc_id[cpu] == phys_proc_id[i]) {
+				siblings++;
+				cpu_set(i, cpu_sibling_map[cpu]);
 			}
 		}
+		if (siblings != smp_num_siblings)
+			printk(KERN_WARNING "WARNING: %d siblings found for CPU%d, should be %d\n", siblings, cpu, smp_num_siblings);
 	}
 
 	smpboot_setup_io_apic();
@@ -1143,8 +1139,8 @@ __init void arch_init_sched_domains(void
 
 		*cpu_domain = SD_CPU_INIT;
 		cpu_domain->span = CPU_MASK_NONE;
-		cpu_set(i, cpu_domain->span);
-		cpu_set(cpu_sibling_map[i], cpu_domain->span);
+		cpu_domain->span = cpu_sibling_map[i];
+		WARN_ON(!cpu_isset(i, cpu_sibling_map[i]));
 		cpu_domain->flags |= SD_FLAG_WAKE | SD_FLAG_IDLE;
 		cpu_domain->cache_hot_time = 0;
 		cpu_domain->cache_nice_tries = 0;
diff -puN arch/i386/kernel/io_apic.c~sibling-map-to-cpumask arch/i386/kernel/io_apic.c
--- linux-2.6/arch/i386/kernel/io_apic.c~sibling-map-to-cpumask	2003-12-08 14:42:28.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/kernel/io_apic.c	2003-12-08 14:42:28.000000000 +1100
@@ -309,8 +309,7 @@ struct irq_cpu_info {
 
 #define IRQ_ALLOWED(cpu, allowed_mask)	cpu_isset(cpu, allowed_mask)
 
-#define CPU_TO_PACKAGEINDEX(i) \
-		((physical_balance && i > cpu_sibling_map[i]) ? cpu_sibling_map[i] : i)
+#define CPU_TO_PACKAGEINDEX(i) (first_cpu(cpu_sibling_map[i]))
 
 #define MAX_BALANCED_IRQ_INTERVAL	(5*HZ)
 #define MIN_BALANCED_IRQ_INTERVAL	(HZ/2)
@@ -393,6 +392,7 @@ static void do_irq_balance(void)
 	unsigned long max_cpu_irq = 0, min_cpu_irq = (~0);
 	unsigned long move_this_load = 0;
 	int max_loaded = 0, min_loaded = 0;
+	int load;
 	unsigned long useful_load_threshold = balanced_irq_interval + 10;
 	int selected_irq;
 	int tmp_loaded, first_attempt = 1;
@@ -444,7 +444,7 @@ static void do_irq_balance(void)
 	for (i = 0; i < NR_CPUS; i++) {
 		if (!cpu_online(i))
 			continue;
-		if (physical_balance && i > cpu_sibling_map[i])
+		if (i != CPU_TO_PACKAGEINDEX(i))
 			continue;
 		if (min_cpu_irq > CPU_IRQ(i)) {
 			min_cpu_irq = CPU_IRQ(i);
@@ -463,7 +463,7 @@ tryanothercpu:
 	for (i = 0; i < NR_CPUS; i++) {
 		if (!cpu_online(i))
 			continue;
-		if (physical_balance && i > cpu_sibling_map[i])
+		if (i != CPU_TO_PACKAGEINDEX(i))
 			continue;
 		if (max_cpu_irq <= CPU_IRQ(i)) 
 			continue;
@@ -543,9 +543,14 @@ tryanotherirq:
 	 * We seek the least loaded sibling by making the comparison
 	 * (A+B)/2 vs B
 	 */
-	if (physical_balance && (CPU_IRQ(min_loaded) >> 1) >
-					CPU_IRQ(cpu_sibling_map[min_loaded]))
-		min_loaded = cpu_sibling_map[min_loaded];
+	load = CPU_IRQ(min_loaded) >> 1;
+	for_each_cpu(j, cpu_sibling_map[min_loaded]) {
+		if (load > CPU_IRQ(j)) {
+			/* This won't change cpu_sibling_map[min_loaded] */
+			load = CPU_IRQ(j);
+			min_loaded = j;
+		}
+	}
 
 	cpus_and(allowed_mask, cpu_online_map, irq_affinity[selected_irq]);
 	target_cpu_mask = cpumask_of_cpu(min_loaded);
diff -puN arch/i386/kernel/cpu/cpufreq/p4-clockmod.c~sibling-map-to-cpumask arch/i386/kernel/cpu/cpufreq/p4-clockmod.c
--- linux-2.6/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c~sibling-map-to-cpumask	2003-12-08 14:42:28.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c	2003-12-08 14:42:28.000000000 +1100
@@ -67,14 +67,7 @@ static int cpufreq_p4_setdc(unsigned int
 	cpus_allowed = current->cpus_allowed;
 
 	/* only run on CPU to be set, or on its sibling */
-       affected_cpu_map = cpumask_of_cpu(cpu);
-#ifdef CONFIG_X86_HT
-	hyperthreading = ((cpu_has_ht) && (smp_num_siblings == 2));
-	if (hyperthreading) {
-		sibling = cpu_sibling_map[cpu];
-                cpu_set(sibling, affected_cpu_map);
-	}
-#endif
+	affected_cpu_map = cpu_sibling_map[cpu];
 	set_cpus_allowed(current, affected_cpu_map);
         BUG_ON(!cpu_isset(smp_processor_id(), affected_cpu_map));
 
diff -puN arch/i386/oprofile/op_model_p4.c~sibling-map-to-cpumask arch/i386/oprofile/op_model_p4.c
--- linux-2.6/arch/i386/oprofile/op_model_p4.c~sibling-map-to-cpumask	2003-12-08 14:42:28.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/oprofile/op_model_p4.c	2003-12-08 14:42:28.000000000 +1100
@@ -382,11 +382,8 @@ static struct p4_event_binding p4_events
 static unsigned int get_stagger(void)
 {
 #ifdef CONFIG_SMP
-	int cpu;
-	if (smp_num_siblings > 1) {
-		cpu = smp_processor_id();
-		return (cpu_sibling_map[cpu] > cpu) ? 0 : 1;
-	}
+	int cpu = smp_processor_id();
+	return (cpu != first_cpu(cpu_sibling_map[cpu]));
 #endif	
 	return 0;
 }

_
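The least-loaded-sibling loop in the io_apic.c hunk above generalises the old two-sibling comparison to any number of siblings. A hypothetical userspace model of that loop (function name, load numbers, and a uint64_t mask in place of cpumask_t are all invented for the example):

```c
#include <assert.h>
#include <stdint.h>

/* Pick the sibling of `min_loaded` with the lightest IRQ load,
 * mirroring the for_each_cpu() loop the patch adds to
 * do_irq_balance(). The old code could only compare one pair. */
int least_loaded_sibling(int min_loaded,
			 const uint64_t *sibling_mask,
			 const unsigned long *cpu_irq)
{
	/* Siblings share one mask, so hoisting it is safe even though
	 * min_loaded changes below (the patch notes the same thing). */
	const uint64_t mask = sibling_mask[min_loaded];
	/* Halving biases toward staying put: the (A+B)/2 vs B test. */
	unsigned long load = cpu_irq[min_loaded] >> 1;

	for (int j = 0; j < 64; j++) {
		if (!(mask & (UINT64_C(1) << j)))
			continue;
		if (load > cpu_irq[j]) {
			load = cpu_irq[j];
			min_loaded = j;
		}
	}
	return min_loaded;
}
```

The halved starting load means a sibling only wins when it carries less than half the candidate's IRQ load, preserving the old code's hysteresis.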


* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-08  4:25 Nick Piggin
  2003-12-08 15:59 ` Anton Blanchard
  2003-12-08 19:44 ` James Cleverdon
@ 2003-12-08 20:38 ` Ingo Molnar
  2003-12-08 20:51 ` Zwane Mwaikambo
  2003-12-08 23:46 ` Rusty Russell
  4 siblings, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2003-12-08 20:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Anton Blanchard, Linus Torvalds, Andrew Morton,
	Rusty Russell, Zwane Mwaikambo, linux-kernel


On Mon, 8 Dec 2003, Nick Piggin wrote:

> P.S.
> I have an alternative to Ingo's HT scheduler which basically does the
> same thing. It is showing a 20% elapsed time improvement with a make -j3
> on a 2xP4 Xeon (4 logical CPUs).

if it gets performance close/equivalent (or better!) to the SMT-aware
scheduler I've posted, then I'd prefer your patch because yours is
fundamentally simpler. (Btw., there exist a number of similar,
balancing-based HT patches - they are all quite simple.)

	Ingo


* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-08  4:25 Nick Piggin
  2003-12-08 15:59 ` Anton Blanchard
@ 2003-12-08 19:44 ` James Cleverdon
  2003-12-08 20:38 ` Ingo Molnar
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: James Cleverdon @ 2003-12-08 19:44 UTC (permalink / raw)
  To: Nick Piggin, Ingo Molnar, Anton Blanchard, Linus Torvalds,
	Andrew Morton, Rusty Russell, Zwane Mwaikambo
  Cc: linux-kernel

On Sunday 07 December 2003 8:25 pm, Nick Piggin wrote:
> Hi guys,
> This is rather a trivial change but is not for 2.6.0.
> The patch is actually on top of some of my HT stuff so one part is
> slightly different, but its purpose is to gather feedback.
>
> It turns the cpu_sibling_map array from an array of cpus to an array
> of cpumasks. This allows handling of more than 2 siblings, not that
> I know of any immediate need.

There will probably be a need, although I don't know how soon.  Some of the 
Intel folks have not been hinting that there will be more than two siblings 
in a future CPU.  No, they have not been hinting at all.   ;^)


> I think it generalises cpu_sibling_map sufficiently that it can become
> generic code. This would allow architecture specific code to build the
> sibling map, and then Ingo's or my HT implementations to build their
> scheduling descriptions in generic code.
>
> I'm not aware of any reason why the kernel should not become generally
> SMT aware. It is sufficiently different to SMP that it is worth
> specialising it, although I am only aware of P4 and POWER5 implementations.
>
> Best regards
> Nick
>
> P.S.
> I have an alternative to Ingo's HT scheduler which basically does
> the same thing. It is showing a 20% elapsed time improvement with a
> make -j3 on a 2xP4 Xeon (4 logical CPUs).
>
> Before Ingo's is merged, I would like to discuss the pros and cons of
> both approaches with those interested. If Ingo's is accepted I should
> still be able to port my other SMP/NUMA improvements on top of it.

-- 
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot comm


* Re: [PATCH][RFC] make cpu_sibling_map a cpumask_t
  2003-12-08  4:25 Nick Piggin
@ 2003-12-08 15:59 ` Anton Blanchard
  2003-12-08 23:08   ` Nick Piggin
  2003-12-08 19:44 ` James Cleverdon
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Anton Blanchard @ 2003-12-08 15:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, Rusty Russell,
	Zwane Mwaikambo, linux-kernel

 
> I'm not aware of any reason why the kernel should not become generally
> SMT aware. It is sufficiently different to SMP that it is worth
> specialising it, although I am only aware of P4 and POWER5 implementations.

I agree, SMT is likely to become more popular in the coming years.

> I have an alternative to Ingo's HT scheduler which basically does
> the same thing. It is showing a 20% elapsed time improvement with a
> make -j3 on a 2xP4 Xeon (4 logical CPUs).
> 
> Before Ingo's is merged, I would like to discuss the pros and cons of
> both approaches with those interested. If Ingo's is accepted I should
> still be able to port my other SMP/NUMA improvements on top of it.

Sounds good, have you got anything to test? I can throw it on a POWER5.

Anton


* [PATCH][RFC] make cpu_sibling_map a cpumask_t
@ 2003-12-08  4:25 Nick Piggin
  2003-12-08 15:59 ` Anton Blanchard
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Nick Piggin @ 2003-12-08  4:25 UTC (permalink / raw)
  To: Ingo Molnar, Anton Blanchard, Linus Torvalds, Andrew Morton,
	Rusty Russell, Zwane Mwaikambo
  Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1232 bytes --]

Hi guys,
This is rather a trivial change but is not for 2.6.0.
The patch is actually on top of some of my HT stuff so one part is
slightly different, but its purpose is to gather feedback.

It turns the cpu_sibling_map array from an array of cpus to an array
of cpumasks. This allows handling of more than 2 siblings, not that
I know of any immediate need.
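The representational change can be illustrated outside the kernel with a plain 64-bit mask standing in for cpumask_t (a hypothetical userspace model; the array names, NR_CPUS value, and topology are invented, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 64

/* Old representation: each CPU names exactly one sibling (or
 * NO_PROC_ID), so at most 2 threads per package can be described. */
int cpu_sibling_id[NR_CPUS];

/* New representation: each CPU carries a mask of all its siblings,
 * itself included, so a package may hold any number of threads. */
uint64_t cpu_sibling_mask[NR_CPUS];

/* Build sibling masks from a physical-package id table, mirroring
 * the loop the patch puts in smp_boot_cpus(). */
void build_sibling_masks(const int *phys_proc_id, int ncpus)
{
	for (int cpu = 0; cpu < ncpus; cpu++) {
		cpu_sibling_mask[cpu] = 0;
		for (int i = 0; i < ncpus; i++)
			if (phys_proc_id[i] == phys_proc_id[cpu])
				cpu_sibling_mask[cpu] |= UINT64_C(1) << i;
	}
}
```

With, say, four threads per package the mask form works unchanged, where the old int array could name only a single sibling.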

I think it generalises cpu_sibling_map sufficiently that it can become
generic code. This would allow architecture specific code to build the
sibling map, and then Ingo's or my HT implementations to build their
scheduling descriptions in generic code.

I'm not aware of any reason why the kernel should not become generally
SMT aware. It is sufficiently different to SMP that it is worth
specialising it, although I am only aware of P4 and POWER5 implementations.

Best regards
Nick

P.S.
I have an alternative to Ingo's HT scheduler which basically does
the same thing. It is showing a 20% elapsed time improvement with a
make -j3 on a 2xP4 Xeon (4 logical CPUs).

Before Ingo's is merged, I would like to discuss the pros and cons of
both approaches with those interested. If Ingo's is accepted I should
still be able to port my other SMP/NUMA improvements on top of it.


[-- Attachment #2: sibling-map-to-cpumask.patch --]
[-- Type: text/plain, Size: 7229 bytes --]

 linux-2.6-npiggin/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c |    9 --
 linux-2.6-npiggin/arch/i386/kernel/io_apic.c                 |   19 ++--
 linux-2.6-npiggin/arch/i386/kernel/smpboot.c                 |   46 +++++------
 linux-2.6-npiggin/arch/i386/oprofile/op_model_p4.c           |    7 -
 linux-2.6-npiggin/include/asm-i386/smp.h                     |    2 
 5 files changed, 37 insertions(+), 46 deletions(-)

diff -puN include/asm-i386/smp.h~sibling-map-to-cpumask include/asm-i386/smp.h
--- linux-2.6/include/asm-i386/smp.h~sibling-map-to-cpumask	2003-12-08 14:42:28.000000000 +1100
+++ linux-2.6-npiggin/include/asm-i386/smp.h	2003-12-08 14:42:28.000000000 +1100
@@ -34,7 +34,7 @@
 extern void smp_alloc_memory(void);
 extern int pic_mode;
 extern int smp_num_siblings;
-extern int cpu_sibling_map[];
+extern cpumask_t cpu_sibling_map[];
 
 extern void smp_flush_tlb(void);
 extern void smp_message_irq(int cpl, void *dev_id, struct pt_regs *regs);
diff -puN arch/i386/kernel/smpboot.c~sibling-map-to-cpumask arch/i386/kernel/smpboot.c
--- linux-2.6/arch/i386/kernel/smpboot.c~sibling-map-to-cpumask	2003-12-08 14:42:28.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/kernel/smpboot.c	2003-12-08 15:07:48.000000000 +1100
@@ -933,7 +933,7 @@ static int boot_cpu_logical_apicid;
 /* Where the IO area was mapped on multiquad, always 0 otherwise */
 void *xquad_portio;
 
-int cpu_sibling_map[NR_CPUS] __cacheline_aligned;
+cpumask_t cpu_sibling_map[NR_CPUS] __cacheline_aligned;
 
 static void __init smp_boot_cpus(unsigned int max_cpus)
 {
@@ -1079,32 +1079,28 @@ static void __init smp_boot_cpus(unsigne
 	Dprintk("Boot done.\n");
 
 	/*
-	 * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so
-	 * that we can tell the sibling CPU efficiently.
+	 * construct cpu_sibling_map[], so that we can tell sibling CPUs
+	 * efficiently.
 	 */
-	if (cpu_has_ht && smp_num_siblings > 1) {
-		for (cpu = 0; cpu < NR_CPUS; cpu++)
-			cpu_sibling_map[cpu] = NO_PROC_ID;
-
-		for (cpu = 0; cpu < NR_CPUS; cpu++) {
-			int 	i;
-			if (!cpu_isset(cpu, cpu_callout_map))
-				continue;
+	for (cpu = 0; cpu < NR_CPUS; cpu++)
+		cpu_sibling_map[cpu] = CPU_MASK_NONE;
 
-			for (i = 0; i < NR_CPUS; i++) {
-				if (i == cpu || !cpu_isset(i, cpu_callout_map))
-					continue;
-				if (phys_proc_id[cpu] == phys_proc_id[i]) {
-					cpu_sibling_map[cpu] = i;
-					printk("cpu_sibling_map[%d] = %d\n", cpu, cpu_sibling_map[cpu]);
-					break;
-				}
-			}
-			if (cpu_sibling_map[cpu] == NO_PROC_ID) {
-				smp_num_siblings = 1;
-				printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu);
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		int siblings = 0;
+		int 	i;
+		if (!cpu_isset(cpu, cpu_callout_map))
+			continue;
+
+		for (i = 0; i < NR_CPUS; i++) {
+			if (!cpu_isset(i, cpu_callout_map))
+				continue;
+			if (phys_proc_id[cpu] == phys_proc_id[i]) {
+				siblings++;
+				cpu_set(i, cpu_sibling_map[cpu]);
 			}
 		}
+		if (siblings != smp_num_siblings)
+			printk(KERN_WARNING "WARNING: %d siblings found for CPU%d, should be %d\n", siblings, cpu, smp_num_siblings);
 	}
 
 	smpboot_setup_io_apic();
@@ -1143,8 +1139,8 @@ __init void arch_init_sched_domains(void
 
 		*cpu_domain = SD_CPU_INIT;
 		cpu_domain->span = CPU_MASK_NONE;
-		cpu_set(i, cpu_domain->span);
-		cpu_set(cpu_sibling_map[i], cpu_domain->span);
+		cpu_domain->span = cpu_sibling_map[i];
+		WARN_ON(!cpu_isset(i, cpu_sibling_map[i]));
 		cpu_domain->flags |= SD_FLAG_WAKE | SD_FLAG_IDLE;
 		cpu_domain->cache_hot_time = 0;
 		cpu_domain->cache_nice_tries = 0;
diff -puN arch/i386/kernel/io_apic.c~sibling-map-to-cpumask arch/i386/kernel/io_apic.c
--- linux-2.6/arch/i386/kernel/io_apic.c~sibling-map-to-cpumask	2003-12-08 14:42:28.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/kernel/io_apic.c	2003-12-08 14:42:28.000000000 +1100
@@ -309,8 +309,7 @@ struct irq_cpu_info {
 
 #define IRQ_ALLOWED(cpu, allowed_mask)	cpu_isset(cpu, allowed_mask)
 
-#define CPU_TO_PACKAGEINDEX(i) \
-		((physical_balance && i > cpu_sibling_map[i]) ? cpu_sibling_map[i] : i)
+#define CPU_TO_PACKAGEINDEX(i) (first_cpu(cpu_sibling_map[i]))
 
 #define MAX_BALANCED_IRQ_INTERVAL	(5*HZ)
 #define MIN_BALANCED_IRQ_INTERVAL	(HZ/2)
@@ -393,6 +392,7 @@ static void do_irq_balance(void)
 	unsigned long max_cpu_irq = 0, min_cpu_irq = (~0);
 	unsigned long move_this_load = 0;
 	int max_loaded = 0, min_loaded = 0;
+	int load;
 	unsigned long useful_load_threshold = balanced_irq_interval + 10;
 	int selected_irq;
 	int tmp_loaded, first_attempt = 1;
@@ -444,7 +444,7 @@ static void do_irq_balance(void)
 	for (i = 0; i < NR_CPUS; i++) {
 		if (!cpu_online(i))
 			continue;
-		if (physical_balance && i > cpu_sibling_map[i])
+		if (i != CPU_TO_PACKAGEINDEX(i))
 			continue;
 		if (min_cpu_irq > CPU_IRQ(i)) {
 			min_cpu_irq = CPU_IRQ(i);
@@ -463,7 +463,7 @@ tryanothercpu:
 	for (i = 0; i < NR_CPUS; i++) {
 		if (!cpu_online(i))
 			continue;
-		if (physical_balance && i > cpu_sibling_map[i])
+		if (i != CPU_TO_PACKAGEINDEX(i))
 			continue;
 		if (max_cpu_irq <= CPU_IRQ(i)) 
 			continue;
@@ -543,9 +543,14 @@ tryanotherirq:
 	 * We seek the least loaded sibling by making the comparison
 	 * (A+B)/2 vs B
 	 */
-	if (physical_balance && (CPU_IRQ(min_loaded) >> 1) >
-					CPU_IRQ(cpu_sibling_map[min_loaded]))
-		min_loaded = cpu_sibling_map[min_loaded];
+	load = CPU_IRQ(min_loaded) >> 1;
+	for_each_cpu(j, cpu_sibling_map[min_loaded]) {
+		if (load > CPU_IRQ(j)) {
+			/* This won't change cpu_sibling_map[min_loaded] */
+			load = CPU_IRQ(j);
+			min_loaded = j;
+		}
+	}
 
 	cpus_and(allowed_mask, cpu_online_map, irq_affinity[selected_irq]);
 	target_cpu_mask = cpumask_of_cpu(min_loaded);
diff -puN arch/i386/kernel/cpu/cpufreq/p4-clockmod.c~sibling-map-to-cpumask arch/i386/kernel/cpu/cpufreq/p4-clockmod.c
--- linux-2.6/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c~sibling-map-to-cpumask	2003-12-08 14:42:28.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c	2003-12-08 14:42:28.000000000 +1100
@@ -67,14 +67,7 @@ static int cpufreq_p4_setdc(unsigned int
 	cpus_allowed = current->cpus_allowed;
 
 	/* only run on CPU to be set, or on its sibling */
-       affected_cpu_map = cpumask_of_cpu(cpu);
-#ifdef CONFIG_X86_HT
-	hyperthreading = ((cpu_has_ht) && (smp_num_siblings == 2));
-	if (hyperthreading) {
-		sibling = cpu_sibling_map[cpu];
-                cpu_set(sibling, affected_cpu_map);
-	}
-#endif
+	affected_cpu_map = cpu_sibling_map[cpu];
 	set_cpus_allowed(current, affected_cpu_map);
         BUG_ON(!cpu_isset(smp_processor_id(), affected_cpu_map));
 
diff -puN arch/i386/oprofile/op_model_p4.c~sibling-map-to-cpumask arch/i386/oprofile/op_model_p4.c
--- linux-2.6/arch/i386/oprofile/op_model_p4.c~sibling-map-to-cpumask	2003-12-08 14:42:28.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/oprofile/op_model_p4.c	2003-12-08 14:42:28.000000000 +1100
@@ -382,11 +382,8 @@ static struct p4_event_binding p4_events
 static unsigned int get_stagger(void)
 {
 #ifdef CONFIG_SMP
-	int cpu;
-	if (smp_num_siblings > 1) {
-		cpu = smp_processor_id();
-		return (cpu_sibling_map[cpu] > cpu) ? 0 : 1;
-	}
+	int cpu = smp_processor_id();
+	return (cpu != first_cpu(cpu_sibling_map[cpu]));
 #endif	
 	return 0;
 }

_


end of thread, other threads:[~2003-12-11 21:53 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-12-09  4:54 [PATCH][RFC] make cpu_sibling_map a cpumask_t Nakajima, Jun
2003-12-09 19:42 ` William Lee Irwin III
  -- strict thread matches above, loose matches on Subject: below --
2003-12-08  4:25 Nick Piggin
2003-12-08 15:59 ` Anton Blanchard
2003-12-08 23:08   ` Nick Piggin
2003-12-09  0:14     ` Anton Blanchard
2003-12-08 19:44 ` James Cleverdon
2003-12-08 20:38 ` Ingo Molnar
2003-12-08 20:51 ` Zwane Mwaikambo
2003-12-08 20:55   ` Ingo Molnar
2003-12-08 23:17     ` Nick Piggin
2003-12-08 23:36       ` Ingo Molnar
2003-12-08 23:58         ` Nick Piggin
2003-12-08 23:46 ` Rusty Russell
2003-12-09 13:36   ` Nick Piggin
2003-12-11 21:41     ` bill davidsen
