* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
       [not found] <200611032205.kA3M5wmJ003178@shell0.pdx.osdl.net>
@ 2006-11-07  7:32 ` Ingo Molnar
  2006-11-07 17:44   ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Ingo Molnar @ 2006-11-07  7:32 UTC (permalink / raw)
  To: akpm; +Cc: mm-commits, clameter, nickpiggin, suresh.b.siddha, linux-kernel


* akpm@osdl.org <akpm@osdl.org> wrote:

> +DECLARE_TASKLET(rebalance, &rebalance_domains, 0L);

i'm not sure i get the point of this whole do-rebalance-in-tasklet idea. 
A tasklet is global to the system. The rebalance tick was per-CPU. This 
is not an equivalent change at all. What am i missing?

	Ingo


* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07 17:44   ` Christoph Lameter
@ 2006-11-07 17:31     ` Siddha, Suresh B
  2006-11-07 17:55       ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Siddha, Suresh B @ 2006-11-07 17:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, akpm, mm-commits, nickpiggin, suresh.b.siddha, linux-kernel

On Tue, Nov 07, 2006 at 09:44:11AM -0800, Christoph Lameter wrote:
> On Tue, 7 Nov 2006, Ingo Molnar wrote:
> 
> > i'm not sure i get the point of this whole do-rebalance-in-tasklet idea. 
> > A tasklet is global to the system. The rebalance tick was per-CPU. This 
> > is not an equivalent change at all. What am i missing?
> 
> A tasklet runs per cpu. In many ways it is equivalent to interrupt 
> context, except that interrupts are enabled.

Christoph, the DECLARE_TASKLET that you had at least needs to be per cpu.
Not sure if there are any other concerns.

thanks,
suresh


* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07  7:32 ` + sched-use-tasklet-to-call-balancing.patch added to -mm tree Ingo Molnar
@ 2006-11-07 17:44   ` Christoph Lameter
  2006-11-07 17:31     ` Siddha, Suresh B
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2006-11-07 17:44 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, mm-commits, nickpiggin, suresh.b.siddha, linux-kernel

On Tue, 7 Nov 2006, Ingo Molnar wrote:

> i'm not sure i get the point of this whole do-rebalance-in-tasklet idea. 
> A tasklet is global to the system. The rebalance tick was per-CPU. This 
> is not an equivalent change at all. What am i missing?

A tasklet runs per cpu. In many ways it is equivalent to interrupt 
context, except that interrupts are enabled.




* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07 17:55       ` Christoph Lameter
@ 2006-11-07 17:50         ` Siddha, Suresh B
  2006-11-07 18:18           ` Christoph Lameter
                             ` (2 more replies)
  2006-11-07 20:29         ` Ingo Molnar
  1 sibling, 3 replies; 28+ messages in thread
From: Siddha, Suresh B @ 2006-11-07 17:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Siddha, Suresh B, Ingo Molnar, akpm, mm-commits, nickpiggin,
	linux-kernel

On Tue, Nov 07, 2006 at 09:55:54AM -0800, Christoph Lameter wrote:
> On Tue, 7 Nov 2006, Siddha, Suresh B wrote:
> 
> > Christoph, the DECLARE_TASKLET that you had at least needs to be per cpu.
> > Not sure if there are any other concerns.
> 
> Nope. Tasklets are scheduled and executed per cpu. These are the former bottom 
> halves. See tasklet_schedule() in kernel/softirq.c.

You must be referring to __tasklet_schedule() ?

tasklet_schedule doesn't schedule if there is already one scheduled.

no?

thanks,
suresh


* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07 17:31     ` Siddha, Suresh B
@ 2006-11-07 17:55       ` Christoph Lameter
  2006-11-07 17:50         ` Siddha, Suresh B
  2006-11-07 20:29         ` Ingo Molnar
  0 siblings, 2 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-11-07 17:55 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Ingo Molnar, akpm, mm-commits, nickpiggin, linux-kernel

On Tue, 7 Nov 2006, Siddha, Suresh B wrote:

> Christoph, the DECLARE_TASKLET that you had at least needs to be per cpu.
> Not sure if there are any other concerns.

Nope. Tasklets are scheduled and executed per cpu. These are the former bottom 
halves. See tasklet_schedule() in kernel/softirq.c.




* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07 17:50         ` Siddha, Suresh B
@ 2006-11-07 18:18           ` Christoph Lameter
  2006-11-07 19:17           ` Christoph Lameter
  2006-11-07 19:22           ` Christoph Lameter
  2 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-11-07 18:18 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Ingo Molnar, akpm, mm-commits, nickpiggin, linux-kernel

On Tue, 7 Nov 2006, Siddha, Suresh B wrote:

> tasklet_schedule doesn't schedule if there is already one scheduled.

Right. 

/* Tasklets --- multithreaded analogue of BHs.

   Main feature differing them of generic softirqs: tasklet
   is running only on one CPU simultaneously.

   Main feature differing them of BHs: different tasklets
   may be run simultaneously on different CPUs.

   Properties:
   * If tasklet_schedule() is called, then tasklet is guaranteed
     to be executed on some cpu at least once after this.

^^^^ Crap. Not equivalent. Must be guaranteed to run on the same cpu.

   * If the tasklet is already scheduled, but its excecution is still not
     started, it will be executed only once.
   * If this tasklet is already running on another CPU (or schedule is called
     from tasklet itself), it is rescheduled for later.
   * Tasklet is strictly serialized wrt itself, but not
     wrt another tasklets. If client needs some intertask synchronization,
     he makes it with spinlocks.
 */



* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07 17:50         ` Siddha, Suresh B
  2006-11-07 18:18           ` Christoph Lameter
@ 2006-11-07 19:17           ` Christoph Lameter
  2006-11-07 20:31             ` Ingo Molnar
  2006-11-07 19:22           ` Christoph Lameter
  2 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2006-11-07 19:17 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Ingo Molnar, akpm, mm-commits, nickpiggin, linux-kernel

Tasklets are scheduled on the same cpu that triggered the tasklet. They 
are just moved to other processors if the processor goes down. So that 
aspect is fine. We just need a tasklet struct per cpu.



Use a per cpu tasklet to schedule rebalancing

Turns out that tasklets have a flag that allows only one instance to run
across all processors. So we need a tasklet structure for each processor.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.19-rc4-mm2/kernel/sched.c
===================================================================
--- linux-2.6.19-rc4-mm2.orig/kernel/sched.c	2006-11-06 13:58:38.000000000 -0600
+++ linux-2.6.19-rc4-mm2/kernel/sched.c	2006-11-07 13:05:56.236343144 -0600
@@ -2918,7 +2918,8 @@ static void rebalance_domains(unsigned l
 	this_rq->next_balance = next_balance;
 }
 
-DECLARE_TASKLET(rebalance, &rebalance_domains, 0L);
+static DEFINE_PER_CPU(struct tasklet_struct, rebalance) =
+		{ NULL, 0, ATOMIC_INIT(0), rebalance_domains, 0L };
 #else
 /*
  * on UP we do not need to balance between CPUs:
@@ -3171,7 +3172,7 @@ void scheduler_tick(void)
 #ifdef CONFIG_SMP
 	update_load(rq);
 	if (time_after_eq(jiffies, rq->next_balance))
-		tasklet_schedule(&rebalance);
+		tasklet_schedule(&__get_cpu_var(rebalance));
 #endif
 }
 



* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07 17:50         ` Siddha, Suresh B
  2006-11-07 18:18           ` Christoph Lameter
  2006-11-07 19:17           ` Christoph Lameter
@ 2006-11-07 19:22           ` Christoph Lameter
  2 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-11-07 19:22 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Ingo Molnar, akpm, mm-commits, nickpiggin, linux-kernel

On Tue, 7 Nov 2006, Siddha, Suresh B wrote:

> tasklet_schedule doesn't schedule if there is already one scheduled.

Correct. The effect of this is to only allow load balancing on one cpu at a 
time. Subsequent attempts to schedule load balancing on other cpus are 
ignored until the cpu that is load balancing has finished. Then the 
others (hopefully) get a turn.

A pretty interesting unintended effect. It certainly solves the concurrent 
load balancing scalability issues and would avoid the need to stagger 
load balancing. Wonder how fair it would be?



* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07 17:55       ` Christoph Lameter
  2006-11-07 17:50         ` Siddha, Suresh B
@ 2006-11-07 20:29         ` Ingo Molnar
  1 sibling, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2006-11-07 20:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Siddha, Suresh B, akpm, mm-commits, nickpiggin, linux-kernel


* Christoph Lameter <clameter@sgi.com> wrote:

> On Tue, 7 Nov 2006, Siddha, Suresh B wrote:
> 
> > Christoph, the DECLARE_TASKLET that you had at least needs to be per 
> > cpu. Not sure if there are any other concerns.
> 
> Nope. Tasklets are scheduled and executed per cpu. These are the former 
> bottom halves. See tasklet_schedule() in kernel/softirq.c.

no, they are "task"-lets. Softirqs are the equivalent of former 
bottom-halves.

a tasklet may run on any CPU but its execution is globally serialized - 
it's not what we want to do in the scheduler.

	Ingo


* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07 19:17           ` Christoph Lameter
@ 2006-11-07 20:31             ` Ingo Molnar
  2006-11-07 20:59               ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Ingo Molnar @ 2006-11-07 20:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Siddha, Suresh B, akpm, mm-commits, nickpiggin, linux-kernel


* Christoph Lameter <clameter@sgi.com> wrote:

> Tasklets are scheduled on the same cpu that triggered the tasklet. 
> They are just moved to other processors if the processor goes down. So 
> that aspect is fine. We just need a tasklet struct per cpu.
> 
> Use a per cpu tasklet to schedule rebalancing
> 
> Turns out that tasklets have a flag that allows only one instance to 
> run across all processors. So we need a tasklet structure for each 
> processor.

Per-CPU tasklets are equivalent to softirqs, with extra complexity and 
overhead on top of it :-)

so please just introduce a rebalance softirq and attach the scheduling 
rebalance tick to it. But i'd suggest re-testing on the 4096-CPU box; 
maybe what 'fixed' your workload was the global serialization of the 
tasklet. With a per-CPU softirq approach we are, i think, back to the same 
situation that broke your system before.

	Ingo
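
For concreteness, a minimal sketch of what such a rebalance softirq could
look like (illustrative only, not from this thread: SCHED_SOFTIRQ would have
to be added to the softirq enum, run_rebalance_domains() is an assumed name,
and the three-argument open_softirq() matches the 2.6.19-era API):

static void run_rebalance_domains(struct softirq_action *h)
{
	/* runs on the CPU that raised it, with interrupts enabled */
	rebalance_domains(0);
}

/* registration, e.g. from sched_init() */
open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);

/* in scheduler_tick(), instead of tasklet_schedule(): */
if (time_after_eq(jiffies, rq->next_balance))
	raise_softirq(SCHED_SOFTIRQ);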


* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07 20:31             ` Ingo Molnar
@ 2006-11-07 20:59               ` Christoph Lameter
  2006-11-07 21:35                 ` Chen, Kenneth W
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2006-11-07 20:59 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Siddha, Suresh B, akpm, mm-commits, nickpiggin, linux-kernel

On Tue, 7 Nov 2006, Ingo Molnar wrote:

> Per-CPU tasklets are equivalent to softirqs, with extra complexity and 
> overhead on top of it :-)
> 
> so please just introduce a rebalance softirq and attach the scheduling 
> rebalance tick to it. But i'd suggest re-testing on the 4096-CPU box; 
> maybe what 'fixed' your workload was the global serialization of the 
> tasklet. With a per-CPU softirq approach we are, i think, back to the same 
> situation that broke your system before.

What broke the system was the disabling of interrupts over long time 
periods during load balancing.



* RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07 20:59               ` Christoph Lameter
@ 2006-11-07 21:35                 ` Chen, Kenneth W
  2006-11-07 21:49                   ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Chen, Kenneth W @ 2006-11-07 21:35 UTC (permalink / raw)
  To: 'Christoph Lameter', Ingo Molnar
  Cc: Siddha, Suresh B, akpm, mm-commits, nickpiggin, linux-kernel

Christoph Lameter wrote on Tuesday, November 07, 2006 12:59 PM
> On Tue, 7 Nov 2006, Ingo Molnar wrote:
> 
> > Per-CPU tasklets are equivalent to softirqs, with extra complexity and 
> > overhead on top of it :-)
> > 
> > so please just introduce a rebalance softirq and attach the scheduling 
> > rebalance tick to it. But i'd suggest re-testing on the 4096-CPU box; 
> > maybe what 'fixed' your workload was the global serialization of the 
> > tasklet. With a per-CPU softirq approach we are, i think, back to the same 
> > situation that broke your system before.
> 
> What broke the system was the disabling of interrupts over long time 
> periods during load balancing.


The previous global load balancing tasklet could be an interesting data point.
Do you see a lot of imbalance in the system with the global tasklet?  Does it
take a prolonged interval to reach a balanced system from imbalance?

Conceptually, it doesn't make a lot of sense to serialize load balancing in the
system.  But in practice, maybe it is good enough to cycle through each cpu
(I suppose it all depends on the user environment).  This exercise certainly
makes me think about what the best algorithm is in terms of efficiency and
minimal latency to achieve maximum system throughput.  Does the kernel really
have to do load balancing so aggressively in polling mode?  Perhaps an
event-driven mechanism is a better solution.

- Ken



* RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07 21:35                 ` Chen, Kenneth W
@ 2006-11-07 21:49                   ` Christoph Lameter
  2006-11-10  6:18                     ` Chen, Kenneth W
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2006-11-07 21:49 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: Ingo Molnar, Siddha, Suresh B, akpm, mm-commits, nickpiggin,
	linux-kernel

On Tue, 7 Nov 2006, Chen, Kenneth W wrote:

> > What broke the system was the disabling of interrupts over long time 
> > periods during load balancing.
> The previous global load balancing tasklet could be an interesting data point.

Yup, seems very interesting to me too. We could, for example, drop the 
staggering code if we leave the patch as is. Maybe there are other ways to 
optimize the code because we know that there are no concurrent 
balance_tick() functions running.

> Do you see a lot of imbalance in the system with the global tasklet?  Does it
> take a prolonged interval to reach a balanced system from imbalance?

I am rather surprised that I did not see any problems, but I think we would 
need some more testing. It seems that having only one load balance 
running at a time speeds up load balancing in general since there is 
less lock contention.

Timer interrupts are staggered already. If the tasklet finishes soon then 
we will not have a problem and load balancing runs as before. If the 
tasklet performance gets throttled by something then it is good to 
serialize load balancing since we already have memory contention.  
Performing concurrent scans of runqueues (as is now) would increase the 
problem.

Before, we saw multiple load balance actions ongoing at the same time, which 
caused node overload because multiple cache line refetches from remote 
nodes occurred. I think it was mostly node 0 that got overloaded, so that 
load balancing performance dropped significantly.

> Conceptually, it doesn't make a lot of sense to serialize load balancing in the
> system.  But in practice, maybe it is good enough to cycle through each cpu
> (I suppose it all depends on the user environment).  This exercise certainly
> makes me think about what the best algorithm is in terms of efficiency and
> minimal latency to achieve maximum system throughput.  Does the kernel really
> have to do load balancing so aggressively in polling mode?  Perhaps an
> event-driven mechanism is a better solution.

There is no other way to do load balancing since we have to calculate the 
load over many processors and then figure out the one with the most load.


* RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-07 21:49                   ` Christoph Lameter
@ 2006-11-10  6:18                     ` Chen, Kenneth W
  2006-11-10 18:50                       ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Chen, Kenneth W @ 2006-11-10  6:18 UTC (permalink / raw)
  To: 'Christoph Lameter'
  Cc: Ingo Molnar, Siddha, Suresh B, akpm, mm-commits, nickpiggin,
	linux-kernel

Christoph Lameter wrote on Tuesday, November 07, 2006 1:50 PM
> On Tue, 7 Nov 2006, Chen, Kenneth W wrote:
> > > What broke the system was the disabling of interrupts over long time 
> > > periods during load balancing.
> > The previous global load balancing tasklet could be an interesting data point.
> 
> Yup, seems very interesting to me too. We could, for example, drop the 
> staggering code if we leave the patch as is. Maybe there are other ways to 
> optimize the code because we know that there are no concurrent 
> balance_tick() functions running.
> 
> > Do you see a lot of imbalance in the system with the global tasklet?  Does it
> > take a prolonged interval to reach a balanced system from imbalance?
> 
> I am rather surprised that I did not see any problems, but I think we would 
> need some more testing. It seems that having only one load balance 
> running at a time speeds up load balancing in general since there is 
> less lock contention.


I ran the majority of the micro-benchmarks from the LKP project with the
global load balance tasklet. (http://kernel-perf.sourceforge.net)

Results are here:
http://kernel-perf.sourceforge.net/sched/global-load-bal.txt

All results are within the noise range.  The global tasklet does a fairly
good job, especially on context-switch-intensive workloads like aim7,
volanomark, tbench, etc.  Note all machines are non-NUMA platforms.

Based on the data, I think we should make the load balance tasklet one per numa
node instead of one per CPU.

- Ken
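
For concreteness, a minimal sketch of the one-tasklet-per-node idea
(illustrative only, not a posted patch; node_rebalance[] and
init_node_rebalance() are assumed names, and rebalance_domains() keeps the
unsigned-long tasklet signature from the -mm patch):

static struct tasklet_struct node_rebalance[MAX_NUMNODES];

static void __init init_node_rebalance(void)
{
	int n;

	for (n = 0; n < MAX_NUMNODES; n++)
		tasklet_init(&node_rebalance[n], rebalance_domains, 0L);
}

/* in scheduler_tick(): at most one rebalance runs per node at a time */
if (time_after_eq(jiffies, rq->next_balance))
	tasklet_schedule(&node_rebalance[cpu_to_node(smp_processor_id())]);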


* RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-10  6:18                     ` Chen, Kenneth W
@ 2006-11-10 18:50                       ` Christoph Lameter
  2006-11-10 21:38                         ` Ingo Molnar
                                           ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-11-10 18:50 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: Ingo Molnar, Siddha, Suresh B, akpm, mm-commits, nickpiggin,
	linux-kernel

On Thu, 9 Nov 2006, Chen, Kenneth W wrote:

> All results are within the noise range.  The global tasklet does a fairly 
> good job, especially on context-switch-intensive workloads like aim7, 
> volanomark, tbench, etc.  Note all machines are non-NUMA platforms.

> Based on the data, I think we should make the load balance tasklet one per numa
> node instead of one per CPU.

I have done some testing on NUMA with 8p / 4n and 256 p / 128n which seems 
to indicate that this is doing very well. Having a global tasklet avoids 
concurrent load balancing from multiple nodes which smoothes out the load 
balancing load over all nodes in a NUMA system. It fixes the issues that 
we saw with contention during concurrent load balancing.

On an 8p system:

I) Percent of ticks where load balancing was found to be required

II) Percent of ticks where we attempted load balancing
   but we found that we need to try again due to load balancing
   in progress elsewhere (This increases (I) since we found that
   load balancing was required but we decided to defer. Tasklet
   was not invoked).

    	I)	II) 
Boot:	70%	~1%
AIM7:	30%	2%
Idle:	50%	 <0.5%

256p:
	I)	II)
Boot:	80%	30%
AIM7:	90%	30%
Idle: 	95%	30%

So on larger systems we have more attempts to concurrently execute the 
tasklet, which also inflates (I). I would expect that rate to be raised 
further on even larger systems until we likely reach a point of continuous 
load balancing somewhere in the system at 1024p (it will take some time to 
get to such a system). It is likely preferable at that scale to still not have 
concurrency. Concurrency may mean that we concurrently balance huge sched 
domains, which causes long delays on multiple processors and may cause 
contention between those load balancing threads that attempt to move tasks 
from the same busy processor.

Suresh noted at the Intel OS forum yesterday that we may be able to avoid 
running load balancing of the larger sched domains from all processors. 
That would reduce the overhead of scheduling and reduce the percentage of 
time that load balancing is active even on very large systems.




* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-10 18:50                       ` Christoph Lameter
@ 2006-11-10 21:38                         ` Ingo Molnar
  2006-11-11  1:01                           ` Christoph Lameter
  2006-11-10 21:42                         ` + sched-use-tasklet-to-call-balancing.patch added to -mm tree Ingo Molnar
  2006-11-10 22:32                         ` Siddha, Suresh B
  2 siblings, 1 reply; 28+ messages in thread
From: Ingo Molnar @ 2006-11-10 21:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chen, Kenneth W, Siddha, Suresh B, akpm, mm-commits, nickpiggin,
	linux-kernel


* Christoph Lameter <clameter@sgi.com> wrote:

> I have done some testing on NUMA with 8p / 4n and 256 p / 128n which 
> seems to indicate that this is doing very well. Having a global 
> tasklet avoids concurrent load balancing from multiple nodes which 
> smoothes out the load balancing load over all nodes in a NUMA system. 
> It fixes the issues that we saw with contention during concurrent load 
> balancing.

ok, that's what i suspected - what made the difference wasn't the fact 
that it was moved out of the irqs-off section, but that it was running 
globally, instead of in parallel on every cpu. I have no conceptual 
problem with single-threading the more invasive load-balancing bits 
(since it has to touch every runqueue anyway there's probably little 
parallelism possible). But it's a scary change nevertheless; it 
materially affects every SMP system's balancing characteristics.

Also, tasklets are a pretty bad choice for scalable stuff - a tasklet is 
basically a shared resource between all CPUs and the tasklet-running 
cacheline will bounce between all the CPUs. Also, a tasklet can be 
'reactivated', which means that a CPU will do more rebalancing even 
though it's /another/ CPU that wanted it.

So could you try another implementation perhaps, which would match the 
'single thread of rebalancing' model better? For example, try a global 
spinlock that is trylocked upon rebalance. If the trylock succeeds then 
do the rebalance and unlock. If the trylock fails, just skip the 
rebalance. (this roughly matches the tasklet logic but has lower 
overhead and doesn't cause false, repeated rebalancing)

	Ingo
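
For concreteness, a minimal sketch of the trylock idea (illustrative only,
not a posted patch; the lock name and the abbreviated rebalance path are
assumptions):

static DEFINE_SPINLOCK(balancing);

static void rebalance_tick(int this_cpu, struct rq *this_rq, enum idle_type idle)
{
	/* if another CPU is already rebalancing, just skip this tick */
	if (!spin_trylock(&balancing))
		return;

	/* ... walk the sched domains and call load_balance() as before ... */

	spin_unlock(&balancing);
}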


* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-10 18:50                       ` Christoph Lameter
  2006-11-10 21:38                         ` Ingo Molnar
@ 2006-11-10 21:42                         ` Ingo Molnar
  2006-11-10 22:32                         ` Siddha, Suresh B
  2 siblings, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2006-11-10 21:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chen, Kenneth W, Siddha, Suresh B, akpm, mm-commits, nickpiggin,
	linux-kernel


* Christoph Lameter <clameter@sgi.com> wrote:

> On an 8p system:
> 
> I) Percent of ticks where load balancing was found to be required
> 
> II) Percent of ticks where we attempted load balancing
>    but we found that we need to try again due to load balancing
>    in progress elsewhere (This increases (I) since we found that
>    load balancing was required but we decided to defer. Tasklet
>    was not invoked).
> 
>     	I)	II) 
> Boot:	70%	~1%
> AIM7:	30%	2%
> Idle:	50%	 <0.5%
> 
> 256p:
> 	I)	II)
> Boot:	80%	30%
> AIM7:	90%	30%
> Idle: 	95%	30%

nice measurements and interesting results.

note that with a tasklet a 'retry' will often be done on the /same/ CPU 
that was running the tasklet when we tried to schedule it. I.e. such a 
'collision' will not only result in the 'loss' of the local rebalance 
event, but also cause /another/ rebalance event on the remote CPU.

so a better model would be the trylock model i suggested in the previous 
mail: just lose the rebalance events upon collision and do not cause 
extra work on the remote CPU. I'd also suggest keeping the rebalancing 
code under the irqs-off section, like it is currently - only doing it 
conditionally on trylock success. Do you think that would work?

	Ingo


* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-10 18:50                       ` Christoph Lameter
  2006-11-10 21:38                         ` Ingo Molnar
  2006-11-10 21:42                         ` + sched-use-tasklet-to-call-balancing.patch added to -mm tree Ingo Molnar
@ 2006-11-10 22:32                         ` Siddha, Suresh B
  2006-11-11  1:14                           ` Christoph Lameter
  2 siblings, 1 reply; 28+ messages in thread
From: Siddha, Suresh B @ 2006-11-10 22:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chen, Kenneth W, Ingo Molnar, Siddha, Suresh B, akpm, mm-commits,
	nickpiggin, linux-kernel

On Fri, Nov 10, 2006 at 10:50:13AM -0800, Christoph Lameter wrote:
> Suresh noted at the Intel OS forum yesterday that we may be able to avoid 
> running load balancing of the larger sched domains from all processors. 
> That would reduce the overhead of scheduling and reduce the percentage of 
> time that load balancing is active even on very large systems.

I brought this up at OLS with Nick too. What happens today is, at a particular
domain, each cpu in the sched group will do a load balance at the frequency
of balance_interval. The more cores and threads there are, the more cpus there
will be in each sched group at the SMP and NUMA domains. And we end up spending
quite a bit of time doing load balancing in those domains.

One idea (old patch appended) is to make only one cpu in the sched group do
the load balance at that particular sched domain, and this load will slowly
percolate down to the other cpus within that group (when they do load balancing
at lower domains).

Nick was suggesting another alternative: that we should increase the max
interval and probably make it dependent on the number of cpus that
share the group. Today for the NUMA domain, we already set max_interval
based on online cpus. Perhaps we need to do that for the other domains (SMP, MC)
too and increase the max interval on NUMA for big systems?

---

When busy, only the first cpu in the sched group will do the load balance.

--- linux-2.6.9/kernel/sched.c.1	2006-07-11 16:17:37.000000000 -0700
+++ linux-2.6.9/kernel/sched.c	2006-07-12 10:48:48.000000000 -0700
@@ -2329,12 +2329,15 @@
 			interval = 1;
 
 		if (j - sd->last_balance >= interval) {
-			sd->last_balance += interval;
+			sd->last_balance = j;
 			if (load_balance(this_cpu, this_rq, sd, idle)) {
 				/* We've pulled tasks over so no longer idle */
 				idle = NOT_IDLE;
 			}
 		}
+
+		if (idle != IDLE && this_cpu != first_cpu(sd->span))
+			break;
 	}
 }
 #else


* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-10 21:38                         ` Ingo Molnar
@ 2006-11-11  1:01                           ` Christoph Lameter
  2006-11-11  2:05                             ` Chen, Kenneth W
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2006-11-11  1:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chen, Kenneth W, Siddha, Suresh B, akpm, mm-commits, nickpiggin,
	linux-kernel

On Fri, 10 Nov 2006, Ingo Molnar wrote:

> ok, that's what i suspected - what made the difference wasn't the fact 
> that it was moved out of the irqs-off section, but that it was running 
> globally, instead of in parallel on every cpu. I have no conceptual 
> problem with single-threading the more invasive load-balancing bits 
> (since it has to touch every runqueue anyway there's probably little 
> parallelism possible). But it's a scary change nevertheless; it 
> materially affects every SMP system's balancing characteristics.

We saw multiple issues. The first we saw was interrupt-holdoff related, 
since IPIs took a long time to complete. The other was that multiple 
load balance actions on multiple CPUs seemed to serialize on the locks, 
each trying to move tasks off the same busy processor. So both had better 
be addressed.

Load balancing for small domains is running faster so there is less chance 
of parallelism. It seems that the staggering of the timer interrupt is 
sufficient on smaller systems to avoid concurrent load balancing 
operations.



* Re: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-10 22:32                         ` Siddha, Suresh B
@ 2006-11-11  1:14                           ` Christoph Lameter
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-11-11  1:14 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Chen, Kenneth W, Ingo Molnar, akpm, mm-commits, nickpiggin, linux-kernel

On Fri, 10 Nov 2006, Siddha, Suresh B wrote:

> Nick was suggesting another alternative: that we should increase the max
> interval and probably make it dependent on the number of cpus that
> share the group. Today for the NUMA domain, we already set max_interval
> based on online cpus. Perhaps we need to do that for the other domains (SMP, MC)
> too and increase the max interval on NUMA for big systems?

I have been trying to run with this patch. We increase the min interval so 
that we can at least put one tick between each of the load balance actions 
of each processor. The cpu_offset can be modified to:

/* Don't have all balancing operations going off at once: */
static inline unsigned long cpu_offset(int cpu)
{
	if (NR_CPUS < HZ)
    		return jiffies + cpu * HZ / NR_CPUS;
	else
		return jiffies + cpu;
}

But exclusion in load balancing would take care of the above.




Index: linux-2.6.19-rc4-mm2/include/linux/topology.h
===================================================================
--- linux-2.6.19-rc4-mm2.orig/include/linux/topology.h	2006-11-07 16:43:43.084665449 -0600
+++ linux-2.6.19-rc4-mm2/include/linux/topology.h	2006-11-07 16:55:23.027245226 -0600
@@ -182,7 +182,7 @@
 	.parent			= NULL,			\
 	.child			= NULL,			\
 	.groups			= NULL,			\
-	.min_interval		= 64,			\
+	.min_interval		= max(64, jiffies_to_msec(num_online_cpus())),\
 	.max_interval		= 64*num_online_cpus(),	\
 	.busy_factor		= 128,			\
 	.imbalance_pct		= 133,			\


* RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-11  1:01                           ` Christoph Lameter
@ 2006-11-11  2:05                             ` Chen, Kenneth W
  2006-11-11  2:51                               ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Chen, Kenneth W @ 2006-11-11  2:05 UTC (permalink / raw)
  To: 'Christoph Lameter', Ingo Molnar
  Cc: Siddha, Suresh B, akpm, mm-commits, nickpiggin, linux-kernel

Christoph Lameter wrote on Friday, November 10, 2006 5:01 PM
> On Fri, 10 Nov 2006, Ingo Molnar wrote:
> 
> > ok, that's what i suspected - what made the difference wasn't the fact 
> > that it was moved out of the irqs-off section, but that it was running 
> > globally, instead of in parallel on every cpu. I have no conceptual 
> > problem with single-threading the more invasive load-balancing bits 
> > (since it has to touch every runqueue anyway there's probably little 
> > parallelism possible). But it's a scary change nevertheless; it 
> > materially affects every SMP system's balancing characteristics.
> 
> We saw multiple issues. The first we saw was interrupt-holdoff related, 
> since IPIs took a long time to complete. The other was that multiple 
> load balance actions on multiple CPUs seemed to serialize on the locks, 
> each trying to move tasks off the same busy processor. So both had better 
> be addressed.

So designating only one CPU within a domain to do load balance between groups
for that specific sched domain should in theory fix the 2nd problem you
identified.  Did you get a chance to look at the patch Suresh posted?



* RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-11  2:05                             ` Chen, Kenneth W
@ 2006-11-11  2:51                               ` Christoph Lameter
  2006-11-13  4:03                                 ` Chen, Kenneth W
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2006-11-11  2:51 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: Ingo Molnar, Siddha, Suresh B, akpm, mm-commits, nickpiggin,
	linux-kernel

On Fri, 10 Nov 2006, Chen, Kenneth W wrote:

> So designating only one CPU within a domain to do load balance between groups
> for that specific sched domain should in theory fix the 2nd problem you
> identified.  Did you get a chance to look at the patch Suresh posted?

Yes, I am still thinking about how this would work. This would mean that 
the first cpu on a system would move busy processes to itself from all 
1023 other processors? That does not sound appealing.

Processes in the allnodes domain would move to processor #0. The next 
lower layer is the node groups. Not all of the groups at that layer 
contain processor #0. So we would never move processes into those sched 
domains?



* RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-11  2:51                               ` Christoph Lameter
@ 2006-11-13  4:03                                 ` Chen, Kenneth W
  2006-11-13  5:44                                   ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Chen, Kenneth W @ 2006-11-13  4:03 UTC (permalink / raw)
  To: 'Christoph Lameter'
  Cc: Ingo Molnar, Siddha, Suresh B, akpm, mm-commits, nickpiggin,
	linux-kernel

Christoph Lameter wrote on Friday, November 10, 2006 6:51 PM
> On Fri, 10 Nov 2006, Chen, Kenneth W wrote:
> 
> > So designating only one CPU within a domain to do load balance between groups
> > for that specific sched domain should in theory fix the 2nd problem you
> > identified.  Did you get a chance to look at the patch Suresh posted?
> 
> Yes, I am still thinking about how this would work. This would mean that 
> the first cpu on a system would move busy processes to itself from all 
> 1023 other processors? That does not sound appealing.
> 
> Processes in the allnodes domain would move to processor #0. The next 
> lower layer is the node groups. Not all of the groups at that layer 
> contain processor #0. So we would never move processes into those sched 
> domains?

Not really, someone needs to put a bloody comment in that patch ;-)

The key is that load balance scans from the lowest SMT domain and traverses
upwards to the numa allnodes domain.  The CPU check and break is placed at
the end of the for loop, so it effectively shorts out the i+1 iteration.
And the sd->span in iteration(i) is the same as the sd->groups in
iteration(i+1).  So the patch effectively designates the first CPU (hard-coded)
in a group within a domain to perform load balance in that domain.  All
other groups within a domain will still have at least one CPU assigned to
perform load balance.

Though the original patch is not perfect:

(1) we should extend the logic to all rebalance ticks, not just busy ticks.
(2) we should initiate load balance within a domain only from the least
    loaded group.
(3) the load scanning should be done only once per interval per domain.
    Currently, it is scanning load for each CPU within a domain.  Even with
    the patch, the scanning is cut down to one per group. That is still
    too much.  Large systems that have hundreds of groups in numa allnodes
    will end up scanning / calculating load over and over again. That should
    be cut down as well.

Part of all this problem probably stems from "load balance" being incapable
of performing l-d between an arbitrary pair of CPUs, and from tightly tying
the load scan to the actual l-d action.  And on top of that, l-d is really a
pull operation to the currently running CPU. All these limitations dictate
that every CPU somehow has to scan and pull.  It is extremely inefficient on
large systems.

IMO, we should decouple the load scan from the actual l-d action and, if
possible, make l-d capable of operating on an arbitrary pair of CPUs.



* RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-13  4:03                                 ` Chen, Kenneth W
@ 2006-11-13  5:44                                   ` Christoph Lameter
  2006-11-13  6:40                                     ` Chen, Kenneth W
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2006-11-13  5:44 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: Ingo Molnar, Siddha, Suresh B, akpm, mm-commits, nickpiggin,
	linux-kernel

On Sun, 12 Nov 2006, Chen, Kenneth W wrote:

> The key is that load balance scans from the lowest SMT domain and traverses
> upwards to the numa allnodes domain.  The CPU check and break is placed at
> the end of the for loop, so it effectively shorts out the i+1 iteration.

It shorts out if the current cpu is not the first cpu of the span. But 
at that point it has already done load balancing for the cpu that is not 
the first cpu of the span!

> (1) we should extend the logic to all rebalance ticks, not just busy ticks.

Ok.

> (2) we should initiate load balance within a domain only from the least
>     loaded group.

This would mean we would have to determine the least loaded group first.

> (3) the load scanning should be done only once per interval per domain.
>     Currently, it is scanning load for each CPU within a domain.  Even with
>     the patch, the scanning is cut down to one per group. That is still
>     too much.  Large systems that have hundreds of groups in numa allnodes
>     will end up scanning / calculating load over and over again. That should
>     be cut down as well.

Yes we want that. Maybe we could remember the load calculated in 
sched_group and use that for a certain time period? Load would only be 
calculated once and then all other processors make their decisions based 
on the cached load?
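
For concreteness, a minimal sketch of that cached-load idea (illustrative
only; the added fields and the compute_group_load() helper are assumptions,
not existing code):

struct sched_group {
	/* ... existing fields ... */
	unsigned long cached_load;	/* last load computed for this group */
	unsigned long load_stamp;	/* jiffies when cached_load was updated */
};

static unsigned long group_load(struct sched_group *group, unsigned long ttl)
{
	/* reuse a recently computed value instead of rescanning every cpu */
	if (time_before(jiffies, group->load_stamp + ttl))
		return group->cached_load;

	group->cached_load = compute_group_load(group);	/* assumed helper */
	group->load_stamp = jiffies;
	return group->cached_load;
}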

> Part of all this problem probably stems from "load balance" being incapable
> of performing l-d between an arbitrary pair of CPUs, and from tightly tying
> the load scan to the actual l-d action.  And on top of that, l-d is really a
> pull operation to the currently running CPU. All these limitations dictate
> that every CPU somehow has to scan and pull.  It is extremely inefficient on
> large systems.

Right. However, if we follow this line of thought then we will be 
redesigning the load balancing logic.



* RE: + sched-use-tasklet-to-call-balancing.patch added to -mm tree
  2006-11-13  5:44                                   ` Christoph Lameter
@ 2006-11-13  6:40                                     ` Chen, Kenneth W
  2006-11-14  1:06                                       ` [patch] sched domain: move sched group allocations to percpu area Siddha, Suresh B
  0 siblings, 1 reply; 28+ messages in thread
From: Chen, Kenneth W @ 2006-11-13  6:40 UTC (permalink / raw)
  To: 'Christoph Lameter'
  Cc: Ingo Molnar, Siddha, Suresh B, akpm, mm-commits, nickpiggin,
	linux-kernel

Christoph Lameter wrote on Sunday, November 12, 2006 9:45 PM
> > (2) we should initiate load balance within a domain only from the least
> >     loaded group.
> 
> This would mean we would have to determine the least loaded group first.

Well, find_busiest_group() scans every single bloody CPU in the system at
the highest sched_domain level.  In fact, this function is capable of finding
the busiest group within a domain; it should be capable of determining the
least loaded group for free because it has already scanned every group within
the domain.


> > Part of all this problem probably stems from "load balance" being incapable
> > of performing l-d between an arbitrary pair of CPUs, and from tightly tying
> > the load scan to the actual l-d action.  And on top of that, l-d is really a
> > pull operation to the currently running CPU. All these limitations dictate
> > that every CPU somehow has to scan and pull.  It is extremely inefficient on
> > large systems.
> 
> Right. However, if we follow this line of thought then we will be 
> redesigning the load balancing logic.

It won't be a bad idea to redesign it ;-)

There are a number of other oddities besides what was identified in its design:

(1) several sched_groups are statically declared and they will reside in
    the boot node. I would expect cross-node memory access to be expensive.
    Every cpu will access these data structures repeatedly.

    static struct sched_group sched_group_cpus[NR_CPUS];
    static struct sched_group sched_group_core[NR_CPUS];
    static struct sched_group sched_group_phys[NR_CPUS];

(2) load balance staggering. A number of people have pointed out that it is
    overdone.

(3) The for_each_domain() loop in rebalance_tick() looks different from
    idle_balance() in that it will traverse the entire sched domain hierarchy
    even if a lower-level domain succeeded in moving some tasks.  I would
    expect we would either break out of the for loop like idle_balance()
    (a sketch of that option follows below), or somehow update the load for
    the current CPU so it gets an accurate load value when doing l-d at the
    next level. Currently, it is doing neither.
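
For concreteness, a minimal sketch of the break-out option in (3)
(illustrative only, not a posted patch; the rebalance_tick() body is
abbreviated):

static void rebalance_tick(int this_cpu, struct rq *this_rq, enum idle_type idle)
{
	struct sched_domain *sd;

	for_each_domain(this_cpu, sd) {
		/* ... existing interval checks ... */
		if (load_balance(this_cpu, this_rq, sd, idle))
			/* tasks were pulled; leave higher domains for a later tick */
			break;
	}
}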


* [patch] sched domain: move sched group allocations to percpu area
  2006-11-13  6:40                                     ` Chen, Kenneth W
@ 2006-11-14  1:06                                       ` Siddha, Suresh B
  2006-11-14  8:23                                         ` Ingo Molnar
  2006-11-14 16:57                                         ` Christoph Lameter
  0 siblings, 2 replies; 28+ messages in thread
From: Siddha, Suresh B @ 2006-11-14  1:06 UTC (permalink / raw)
  To: clameter; +Cc: kenneth.w.chen, Ingo Molnar, akpm, nickpiggin, linux-kernel

Christoph, can you please test this on your big systems and ACK it? Thanks.
---

Move the sched group allocations to the percpu area. This will minimize
cross-node memory references and also clean up the sched group allocation
for the allnodes sched domain.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 3399701..088a63f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5511,28 +5511,27 @@ static int __init isolated_cpu_setup(cha
 __setup ("isolcpus=", isolated_cpu_setup);
 
 /*
- * init_sched_build_groups takes an array of groups, the cpumask we wish
- * to span, and a pointer to a function which identifies what group a CPU
- * belongs to. The return value of group_fn must be a valid index into the
- * groups[] array, and must be >= 0 and < NR_CPUS (due to the fact that we
- * keep track of groups covered with a cpumask_t).
+ * init_sched_build_groups takes the cpumask we wish to span, and a pointer
+ * to a function which identifies what group(along with sched group) a CPU
+ * belongs to. The return value of group_fn must be a >= 0 and < NR_CPUS
+ * (due to the fact that we keep track of groups covered with a cpumask_t).
  *
  * init_sched_build_groups will build a circular linked list of the groups
  * covered by the given span, and will set each group's ->cpumask correctly,
  * and ->cpu_power to 0.
  */
 static void
-init_sched_build_groups(struct sched_group groups[], cpumask_t span,
-			const cpumask_t *cpu_map,
-			int (*group_fn)(int cpu, const cpumask_t *cpu_map))
+init_sched_build_groups(cpumask_t span, const cpumask_t *cpu_map,
+			int (*group_fn)(int cpu, const cpumask_t *cpu_map,
+					struct sched_group **sg))
 {
 	struct sched_group *first = NULL, *last = NULL;
 	cpumask_t covered = CPU_MASK_NONE;
 	int i;
 
 	for_each_cpu_mask(i, span) {
-		int group = group_fn(i, cpu_map);
-		struct sched_group *sg = &groups[group];
+		struct sched_group *sg;
+		int group = group_fn(i, cpu_map, &sg);
 		int j;
 
 		if (cpu_isset(i, covered))
@@ -5542,7 +5541,7 @@ init_sched_build_groups(struct sched_gro
 		sg->cpu_power = 0;
 
 		for_each_cpu_mask(j, span) {
-			if (group_fn(j, cpu_map) != group)
+			if (group_fn(j, cpu_map, NULL) != group)
 				continue;
 
 			cpu_set(j, covered);
@@ -6118,10 +6117,13 @@ int sched_smt_power_savings = 0, sched_m
  */
 #ifdef CONFIG_SCHED_SMT
 static DEFINE_PER_CPU(struct sched_domain, cpu_domains);
-static struct sched_group sched_group_cpus[NR_CPUS];
+static DEFINE_PER_CPU(struct sched_group, sched_group_cpus);
 
-static int cpu_to_cpu_group(int cpu, const cpumask_t *cpu_map)
+static int cpu_to_cpu_group(int cpu, const cpumask_t *cpu_map,
+			    struct sched_group **sg)
 {
+	if (sg)
+		*sg = &per_cpu(sched_group_cpus, cpu);
 	return cpu;
 }
 #endif
@@ -6131,39 +6133,52 @@ #endif
  */
 #ifdef CONFIG_SCHED_MC
 static DEFINE_PER_CPU(struct sched_domain, core_domains);
-static struct sched_group sched_group_core[NR_CPUS];
+static DEFINE_PER_CPU(struct sched_group, sched_group_core);
 #endif
 
 #if defined(CONFIG_SCHED_MC) && defined(CONFIG_SCHED_SMT)
-static int cpu_to_core_group(int cpu, const cpumask_t *cpu_map)
+static int cpu_to_core_group(int cpu, const cpumask_t *cpu_map,
+			     struct sched_group **sg)
 {
+	int group;
 	cpumask_t mask = cpu_sibling_map[cpu];
 	cpus_and(mask, mask, *cpu_map);
-	return first_cpu(mask);
+	group = first_cpu(mask);
+	if (sg)
+		*sg = &per_cpu(sched_group_core, group);
+	return group;
 }
 #elif defined(CONFIG_SCHED_MC)
-static int cpu_to_core_group(int cpu, const cpumask_t *cpu_map)
+static int cpu_to_core_group(int cpu, const cpumask_t *cpu_map,
+			     struct sched_group **sg)
 {
+	if (sg)
+		*sg = &per_cpu(sched_group_core, cpu);
 	return cpu;
 }
 #endif
 
 static DEFINE_PER_CPU(struct sched_domain, phys_domains);
-static struct sched_group sched_group_phys[NR_CPUS];
+static DEFINE_PER_CPU(struct sched_group, sched_group_phys);
 
-static int cpu_to_phys_group(int cpu, const cpumask_t *cpu_map)
+static int cpu_to_phys_group(int cpu, const cpumask_t *cpu_map,
+			     struct sched_group **sg)
 {
+	int group;
 #ifdef CONFIG_SCHED_MC
 	cpumask_t mask = cpu_coregroup_map(cpu);
 	cpus_and(mask, mask, *cpu_map);
-	return first_cpu(mask);
+	group = first_cpu(mask);
 #elif defined(CONFIG_SCHED_SMT)
 	cpumask_t mask = cpu_sibling_map[cpu];
 	cpus_and(mask, mask, *cpu_map);
-	return first_cpu(mask);
+	group = first_cpu(mask);
 #else
-	return cpu;
+	group = cpu;
 #endif
+	if (sg)
+		*sg = &per_cpu(sched_group_phys, group);
+	return group;
 }
 
 #ifdef CONFIG_NUMA
@@ -6176,12 +6191,22 @@ static DEFINE_PER_CPU(struct sched_domai
 static struct sched_group **sched_group_nodes_bycpu[NR_CPUS];
 
 static DEFINE_PER_CPU(struct sched_domain, allnodes_domains);
-static struct sched_group *sched_group_allnodes_bycpu[NR_CPUS];
+static DEFINE_PER_CPU(struct sched_group, sched_group_allnodes);
 
-static int cpu_to_allnodes_group(int cpu, const cpumask_t *cpu_map)
+static int cpu_to_allnodes_group(int cpu, const cpumask_t *cpu_map,
+				 struct sched_group **sg)
 {
-	return cpu_to_node(cpu);
+	cpumask_t nodemask = node_to_cpumask(cpu_to_node(cpu));
+	int group;
+
+	cpus_and(nodemask, nodemask, *cpu_map);
+	group = first_cpu(nodemask);
+
+	if (sg)
+		*sg = &per_cpu(sched_group_allnodes, group);
+	return group;
 }
+
 static void init_numa_sched_groups_power(struct sched_group *group_head)
 {
 	struct sched_group *sg = group_head;
@@ -6217,16 +6242,9 @@ static void free_sched_groups(const cpum
 	int cpu, i;
 
 	for_each_cpu_mask(cpu, *cpu_map) {
-		struct sched_group *sched_group_allnodes
-			= sched_group_allnodes_bycpu[cpu];
 		struct sched_group **sched_group_nodes
 			= sched_group_nodes_bycpu[cpu];
 
-		if (sched_group_allnodes) {
-			kfree(sched_group_allnodes);
-			sched_group_allnodes_bycpu[cpu] = NULL;
-		}
-
 		if (!sched_group_nodes)
 			continue;
 
@@ -6320,7 +6338,7 @@ static int build_sched_domains(const cpu
 	struct sched_domain *sd;
 #ifdef CONFIG_NUMA
 	struct sched_group **sched_group_nodes = NULL;
-	struct sched_group *sched_group_allnodes = NULL;
+	int sd_allnodes = 0;
 
 	/*
 	 * Allocate the per-node list of sched groups
@@ -6338,7 +6356,6 @@ #endif
 	 * Set up domains for cpus specified by the cpu_map.
 	 */
 	for_each_cpu_mask(i, *cpu_map) {
-		int group;
 		struct sched_domain *sd = NULL, *p;
 		cpumask_t nodemask = node_to_cpumask(cpu_to_node(i));
 
@@ -6347,26 +6364,12 @@ #endif
 #ifdef CONFIG_NUMA
 		if (cpus_weight(*cpu_map)
 				> SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
-			if (!sched_group_allnodes) {
-				sched_group_allnodes
-					= kmalloc_node(sizeof(struct sched_group)
-						  	* MAX_NUMNODES,
-						  GFP_KERNEL,
-						  cpu_to_node(i));
-				if (!sched_group_allnodes) {
-					printk(KERN_WARNING
-					"Can not alloc allnodes sched group\n");
-					goto error;
-				}
-				sched_group_allnodes_bycpu[i]
-						= sched_group_allnodes;
-			}
 			sd = &per_cpu(allnodes_domains, i);
 			*sd = SD_ALLNODES_INIT;
 			sd->span = *cpu_map;
-			group = cpu_to_allnodes_group(i, cpu_map);
-			sd->groups = &sched_group_allnodes[group];
+			cpu_to_allnodes_group(i, cpu_map, &sd->groups);
 			p = sd;
+			sd_allnodes = 1;
 		} else
 			p = NULL;
 
@@ -6381,36 +6384,33 @@ #endif
 
 		p = sd;
 		sd = &per_cpu(phys_domains, i);
-		group = cpu_to_phys_group(i, cpu_map);
 		*sd = SD_CPU_INIT;
 		sd->span = nodemask;
 		sd->parent = p;
 		if (p)
 			p->child = sd;
-		sd->groups = &sched_group_phys[group];
+		cpu_to_phys_group(i, cpu_map, &sd->groups);
 
 #ifdef CONFIG_SCHED_MC
 		p = sd;
 		sd = &per_cpu(core_domains, i);
-		group = cpu_to_core_group(i, cpu_map);
 		*sd = SD_MC_INIT;
 		sd->span = cpu_coregroup_map(i);
 		cpus_and(sd->span, sd->span, *cpu_map);
 		sd->parent = p;
 		p->child = sd;
-		sd->groups = &sched_group_core[group];
+		cpu_to_core_group(i, cpu_map, &sd->groups);
 #endif
 
 #ifdef CONFIG_SCHED_SMT
 		p = sd;
 		sd = &per_cpu(cpu_domains, i);
-		group = cpu_to_cpu_group(i, cpu_map);
 		*sd = SD_SIBLING_INIT;
 		sd->span = cpu_sibling_map[i];
 		cpus_and(sd->span, sd->span, *cpu_map);
 		sd->parent = p;
 		p->child = sd;
-		sd->groups = &sched_group_cpus[group];
+		cpu_to_cpu_group(i, cpu_map, &sd->groups);
 #endif
 	}
 
@@ -6422,8 +6422,7 @@ #ifdef CONFIG_SCHED_SMT
 		if (i != first_cpu(this_sibling_map))
 			continue;
 
-		init_sched_build_groups(sched_group_cpus, this_sibling_map,
-					cpu_map, &cpu_to_cpu_group);
+		init_sched_build_groups(this_sibling_map, cpu_map, &cpu_to_cpu_group);
 	}
 #endif
 
@@ -6434,8 +6433,7 @@ #ifdef CONFIG_SCHED_MC
 		cpus_and(this_core_map, this_core_map, *cpu_map);
 		if (i != first_cpu(this_core_map))
 			continue;
-		init_sched_build_groups(sched_group_core, this_core_map,
-					cpu_map, &cpu_to_core_group);
+		init_sched_build_groups(this_core_map, cpu_map, &cpu_to_core_group);
 	}
 #endif
 
@@ -6448,15 +6446,13 @@ #endif
 		if (cpus_empty(nodemask))
 			continue;
 
-		init_sched_build_groups(sched_group_phys, nodemask,
-					cpu_map, &cpu_to_phys_group);
+		init_sched_build_groups(nodemask, cpu_map, &cpu_to_phys_group);
 	}
 
 #ifdef CONFIG_NUMA
 	/* Set up node groups */
-	if (sched_group_allnodes)
-		init_sched_build_groups(sched_group_allnodes, *cpu_map,
-					cpu_map, &cpu_to_allnodes_group);
+	if (sd_allnodes)
+		init_sched_build_groups(*cpu_map, cpu_map, &cpu_to_allnodes_group);
 
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		/* Set up node groups */
@@ -6548,10 +6544,10 @@ #ifdef CONFIG_NUMA
 	for (i = 0; i < MAX_NUMNODES; i++)
 		init_numa_sched_groups_power(sched_group_nodes[i]);
 
-	if (sched_group_allnodes) {
-		int group = cpu_to_allnodes_group(first_cpu(*cpu_map), cpu_map);
-		struct sched_group *sg = &sched_group_allnodes[group];
+	if (sd_allnodes) {
+		struct sched_group *sg;
 
+		cpu_to_allnodes_group(first_cpu(*cpu_map), cpu_map, &sg);
 		init_numa_sched_groups_power(sg);
 	}
 #endif


* Re: [patch] sched domain: move sched group allocations to percpu area
  2006-11-14  1:06                                       ` [patch] sched domain: move sched group allocations to percpu area Siddha, Suresh B
@ 2006-11-14  8:23                                         ` Ingo Molnar
  2006-11-14 16:57                                         ` Christoph Lameter
  1 sibling, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2006-11-14  8:23 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: clameter, kenneth.w.chen, akpm, nickpiggin, linux-kernel


* Siddha, Suresh B <suresh.b.siddha@intel.com> wrote:

> Move the sched group allocations to the percpu area. This will minimize 
> cross-node memory references and also clean up the sched group 
> allocation for the allnodes sched domain.
> 
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

yeah, makes sense.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo


* Re: [patch] sched domain: move sched group allocations to percpu area
  2006-11-14  1:06                                       ` [patch] sched domain: move sched group allocations to percpu area Siddha, Suresh B
  2006-11-14  8:23                                         ` Ingo Molnar
@ 2006-11-14 16:57                                         ` Christoph Lameter
  1 sibling, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-11-14 16:57 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: kenneth.w.chen, Ingo Molnar, akpm, nickpiggin, linux-kernel

On Mon, 13 Nov 2006, Siddha, Suresh B wrote:

> Christoph, Can you please test this on your big systems and ACK it? Thanks.

We also thought about fixing it this way. Thanks!

Acked-by: Christoph Lameter <clameter@sgi.com>

