* Miniature NUMA scheduler
@ 2003-01-09 23:54 Martin J. Bligh
  2003-01-10  5:36 ` [Lse-tech] " Michael Hohnbaum
From: Martin J. Bligh @ 2003-01-09 23:54 UTC (permalink / raw)
  To: Erich Focht, Michael Hohnbaum, Robert Love, Ingo Molnar
  Cc: linux-kernel, lse-tech

I tried a small experiment today - did a simple restriction of
the O(1) scheduler to only balance inside a node. Coupled with
the small initial load balancing patch floating around, this
covers 95% of cases, is a trivial change (3 lines), performs 
just as well as Erich's patch on a kernel compile, and actually
better on schedbench.

This is NOT meant to be a replacement for the code Erich wrote,
it's meant to be a simple way to get integration and acceptance.
Code that just forks and never execs will stay on one node - but
we can take the code Erich wrote, and put it in a separate rebalancer
that fires much less often to do a cross-node rebalance. All that
would be under #ifdef CONFIG_NUMA, the only thing that would touch
mainline is these three lines of change, and it's trivial to see
they're completely equivalent to the current code on non-NUMA systems.

I also believe that this is the more correct design approach; it
should result in much less cross-node migration of tasks, and less
scanning of remote runqueues.

Opinions / comments?

M.

Kernbench:
                                   Elapsed        User      System         CPU
                   2.5.54-mjb3      19.41s     186.38s     39.624s     1191.4%
          2.5.54-mjb3-mjbsched     19.508s    186.356s     39.888s     1164.6%

Schedbench 4:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                   2.5.54-mjb3        0.00       35.14       88.82        0.64
          2.5.54-mjb3-mjbsched        0.00       31.84       88.91        0.49

Schedbench 8:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                   2.5.54-mjb3        0.00       47.55      269.36        1.48
          2.5.54-mjb3-mjbsched        0.00       41.01      252.34        1.07

Schedbench 16:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                   2.5.54-mjb3        0.00       76.53      957.48        4.17
          2.5.54-mjb3-mjbsched        0.00       69.01      792.71        2.74

Schedbench 32:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                   2.5.54-mjb3        0.00      145.20     1993.97       11.05
          2.5.54-mjb3-mjbsched        0.00      117.47     1798.93        5.95

Schedbench 64:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                   2.5.54-mjb3        0.00      307.80     4643.55       20.36
          2.5.54-mjb3-mjbsched        0.00      241.04     3589.55       12.74

-----------------------------------------

diff -purN -X /home/mbligh/.diff.exclude virgin/kernel/sched.c mjbsched/kernel/sched.c
--- virgin/kernel/sched.c	Mon Dec  9 18:46:15 2002
+++ mjbsched/kernel/sched.c	Thu Jan  9 14:09:17 2003
@@ -654,7 +654,7 @@ static inline unsigned int double_lock_b
 /*
  * find_busiest_queue - find the busiest runqueue.
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)
 {
 	int nr_running, load, max_load, i;
 	runqueue_t *busiest, *rq_src;
@@ -689,7 +689,7 @@ static inline runqueue_t *find_busiest_q
 	busiest = NULL;
 	max_load = 1;
 	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
+		if (!cpu_online(i) || !((1UL << i) & cpumask))
 			continue;
 
 		rq_src = cpu_rq(i);
@@ -764,7 +764,8 @@ static void load_balance(runqueue_t *thi
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, 
+				__node_to_cpu_mask(__cpu_to_node(this_cpu)) );
 	if (!busiest)
 		goto out;
 

---------------------------------------------------

A tiny change in the current ilb patch is also needed to stop it
using a macro from the first patch:

diff -purN -X /home/mbligh/.diff.exclude ilbold/kernel/sched.c ilbnew/kernel/sched.c
--- ilbold/kernel/sched.c	Thu Jan  9 15:20:53 2003
+++ ilbnew/kernel/sched.c	Thu Jan  9 15:27:49 2003
@@ -2213,6 +2213,7 @@ static void sched_migrate_task(task_t *p
 static int sched_best_cpu(struct task_struct *p)
 {
 	int i, minload, load, best_cpu, node = 0;
+	unsigned long cpumask;
 
 	best_cpu = task_cpu(p);
 	if (cpu_rq(best_cpu)->nr_running <= 2)
@@ -2226,9 +2227,11 @@ static int sched_best_cpu(struct task_st
 			node = i;
 		}
 	}
+
 	minload = 10000000;
-	loop_over_node(i,node) {
-		if (!cpu_online(i))
+	cpumask = __node_to_cpu_mask(node);
+	for (i = 0; i < NR_CPUS; ++i) {
+		if (!(cpumask & (1UL << i)))
 			continue;
 		if (cpu_rq(i)->nr_running < minload) {
 			best_cpu = i;





* Re: [Lse-tech] Miniature NUMA scheduler
  2003-01-09 23:54 Miniature NUMA scheduler Martin J. Bligh
@ 2003-01-10  5:36 ` Michael Hohnbaum
  2003-01-10 16:34   ` Erich Focht
From: Michael Hohnbaum @ 2003-01-10  5:36 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Erich Focht, Robert Love, Ingo Molnar, linux-kernel, lse-tech

On Thu, 2003-01-09 at 15:54, Martin J. Bligh wrote:
> I tried a small experiment today - did a simple restriction of
> the O(1) scheduler to only balance inside a node. Coupled with
> the small initial load balancing patch floating around, this
> covers 95% of cases, is a trivial change (3 lines), performs 
> just as well as Erich's patch on a kernel compile, and actually
> better on schedbench.
> 
> This is NOT meant to be a replacement for the code Erich wrote,
> it's meant to be a simple way to get integration and acceptance.
> Code that just forks and never execs will stay on one node - but
> we can take the code Erich wrote, and put it in a separate rebalancer
> that fires much less often to do a cross-node rebalance. 

I tried this on my 4 node NUMAQ (16 procs, 16GB memory) and got
similar results.  Also, added in the cputime_stats patch and am 
attaching the matrix results from the 32 process run.  Basically,
all runs show that the initial load balancer is able to place the
tasks evenly across the nodes, and the better overall times show
that not migrating to keep the nodes balanced over time results
in better performance - at least on these boxes.

Obviously, there can be situations where load balancing across
nodes is necessary, but these results point to less load balancing
being better, at least on these machines.  It will be interesting
to repeat this on boxes with other interconnects.

$ reportbench hacksched54 sched54 stock54
Kernbench:
                        Elapsed       User     System        CPU
         hacksched54    29.406s    282.18s    81.716s    1236.8%
             sched54    29.112s   283.888s     82.84s    1259.4%
             stock54    31.348s   303.134s    87.824s    1247.2%

Schedbench 4:
                        AvgUser    Elapsed  TotalUser   TotalSys
         hacksched54      21.94      31.93      87.81       0.53
             sched54      22.03      34.90      88.15       0.75
             stock54      49.35      57.53     197.45       0.86

Schedbench 8:
                        AvgUser    Elapsed  TotalUser   TotalSys
         hacksched54      28.23      31.62     225.87       1.11
             sched54      27.95      37.12     223.66       1.50
             stock54      43.14      62.97     345.18       2.12

Schedbench 16:
                        AvgUser    Elapsed  TotalUser   TotalSys
         hacksched54      49.29      71.31     788.83       2.88
             sched54      55.37      69.58     886.10       3.79
             stock54      66.00      81.25    1056.25       7.12

Schedbench 32:
                        AvgUser    Elapsed  TotalUser   TotalSys
         hacksched54      56.41     117.98    1805.35       5.90
             sched54      57.93     132.11    1854.01      10.74
             stock54      77.81     173.26    2490.31      12.37

Schedbench 64:
                        AvgUser    Elapsed  TotalUser   TotalSys
         hacksched54      56.62     231.93    3624.42      13.32
             sched54      72.91     308.87    4667.03      21.06
             stock54      86.68     368.55    5548.57      25.73

hacksched54 = 2.5.54 + Martin's tiny NUMA patch +          
              03-cputimes_stat-2.5.53.patch +
              02-numa-sched-ilb-2.5.53-21.patch
sched54 = 2.5.54 + 01-numa-sched-core-2.5.53-24.patch + 
          02-ilb-2.5.53-21.patch02 +
          03-cputimes_stat-2.5.53.patch
stock54 = 2.5.54 + 03-cputimes_stat-2.5.53.patch

Detailed results from running numa_test 32:

Executing 32 times: ./randupdt 1000000 
Running 'hackbench 20' in the background: Time: 4.557
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1   100.0    0.0    0.0    0.0 |    0     0    |  57.12
  2     0.0  100.0    0.0    0.0 |    1     1    |  55.89
  3   100.0    0.0    0.0    0.0 |    0     0    |  55.39
  4     0.0    0.0  100.0    0.0 |    2     2    |  56.67
  5     0.0    0.0    0.0  100.0 |    3     3    |  57.08
  6     0.0  100.0    0.0    0.0 |    1     1    |  56.31
  7   100.0    0.0    0.0    0.0 |    0     0    |  57.11
  8     0.0    0.0    0.0  100.0 |    3     3    |  56.66
  9     0.0  100.0    0.0    0.0 |    1     1    |  55.87
 10     0.0    0.0  100.0    0.0 |    2     2    |  55.83
 11     0.0    0.0    0.0  100.0 |    3     3    |  56.01
 12     0.0  100.0    0.0    0.0 |    1     1    |  56.56
 13     0.0    0.0  100.0    0.0 |    2     2    |  56.31
 14     0.0    0.0    0.0  100.0 |    3     3    |  56.40
 15   100.0    0.0    0.0    0.0 |    0     0    |  55.82
 16     0.0  100.0    0.0    0.0 |    1     1    |  57.47
 17     0.0    0.0  100.0    0.0 |    2     2    |  56.76
 18     0.0    0.0    0.0  100.0 |    3     3    |  57.10
 19     0.0  100.0    0.0    0.0 |    1     1    |  57.26
 20     0.0    0.0  100.0    0.0 |    2     2    |  56.48
 21     0.0    0.0    0.0  100.0 |    3     3    |  56.65
 22     0.0  100.0    0.0    0.0 |    1     1    |  55.81
 23     0.0    0.0  100.0    0.0 |    2     2    |  55.77
 24     0.0    0.0    0.0  100.0 |    3     3    |  56.91
 25     0.0  100.0    0.0    0.0 |    1     1    |  56.86
 26     0.0    0.0  100.0    0.0 |    2     2    |  56.62
 27     0.0    0.0    0.0  100.0 |    3     3    |  57.16
 28     0.0    0.0  100.0    0.0 |    2     2    |  56.36
 29   100.0    0.0    0.0    0.0 |    0     0    |  55.72
 30   100.0    0.0    0.0    0.0 |    0     0    |  56.00
 31   100.0    0.0    0.0    0.0 |    0     0    |  55.48
 32   100.0    0.0    0.0    0.0 |    0     0    |  55.59
AverageUserTime 56.41 seconds
ElapsedTime     117.98
TotalUserTime   1805.35
TotalSysTime    5.90

Runs with 4, 8, 16, and 64 processes all showed the same even
distribution across all nodes and 100% time on node for each
process.
-- 
Michael Hohnbaum            503-578-5486
hohnbaum@us.ibm.com         T/L 775-5486



* Re: [Lse-tech] Miniature NUMA scheduler
  2003-01-10  5:36 ` [Lse-tech] " Michael Hohnbaum
@ 2003-01-10 16:34   ` Erich Focht
  2003-01-10 16:57     ` Martin J. Bligh
  2003-01-11 14:43     ` [Lse-tech] Miniature NUMA scheduler Bill Davidsen
From: Erich Focht @ 2003-01-10 16:34 UTC (permalink / raw)
  To: Michael Hohnbaum, Martin J. Bligh
  Cc: Robert Love, Ingo Molnar, linux-kernel, lse-tech

Hi Martin & Michael,

indeed, restricting a process to the node on which it was started
helps, as the memory will always be local. The NUMA scheduler allows a
process to move away from its node if the load conditions require
it, but in the current form the process will not come back to its
home node. That's what the "node affine scheduler" tried to realise.

The miniature NUMA scheduler relies on the quality of the initial load
balancer, and that one seems to be good enough. As you mentioned,
multithreaded jobs are disadvantaged as their threads have to stick on
the originating node.

Having some sort of automatic node affinity of processes and equal
node loads in mind (as design targets), we could:
 - take the minimal NUMA scheduler
 - if the normal (node-restricted) find_busiest_queue() fails and
 certain conditions are fulfilled (tried to balance inside own node
 for a while and didn't succeed, own CPU idle, etc... ???) rebalance
 over node boundaries (eg. using my load balancer)
This actually resembles the original design of the node affine
scheduler, having the cross-node balancing separate is ok and might
make the ideas clearer.

I made some measurements, too, and found basically what I
expected. The numbers are from a machine with 4 nodes of 2 CPUs
each. It's on ia64, so 2.5.52 based.

As the minsched cannot make mistakes (by moving tasks away from their
home node), it leads to the best results with numa_test. OTOH hackbench
suffers a lot from the limitation to one node. The hackbench tasks are
not latency/bandwidth limited; the faster they spread over the whole
machine, the quicker the job finishes. That's why NUMA-sched is
slightly worse than a stock kernel. But minsched loses >50%. Funnily,
on my machine kernbench is slightly faster with the normal NUMA
scheduler.

Regards,
Erich

Results on an 8 CPU machine with 4 nodes (2 CPUs per node).

kernbench:
                elapsed       user          system
      stock52   134.52(0.84)  951.64(0.97)  20.72(0.22)
      sched52   133.19(1.49)  944.24(0.50)  21.36(0.24)
   minsched52   135.47(0.47)  937.61(0.20)  21.30(0.14)

schedbench/hackbench: time(s)
               10         25         50         100
      stock52  0.81(0.04) 2.06(0.07) 4.09(0.13) 7.89(0.25)
      sched52  0.81(0.04) 2.03(0.07) 4.14(0.20) 8.61(0.35)
   minsched52  1.28(0.05) 3.19(0.06) 6.59(0.13) 13.56(0.27)

numabench/numa_test 4
               AvgUserTime ElapsedTime TotUserTime TotSysTime
      stock52  ---         27.23(0.52) 89.30(4.18) 0.09(0.01)
      sched52  22.32(1.00) 27.39(0.42) 89.29(4.02) 0.10(0.01)
   minsched52  20.01(0.01) 23.40(0.13) 80.05(0.02) 0.08(0.01)

numabench/numa_test 8
               AvgUserTime ElapsedTime TotUserTime  TotSysTime
      stock52  ---         27.50(2.58) 174.74(6.66) 0.18(0.01)
      sched52  21.73(1.00) 33.70(1.82) 173.87(7.96) 0.18(0.01)
   minsched52  20.31(0.00) 23.50(0.12) 162.47(0.04) 0.16(0.01)

numabench/numa_test 16
               AvgUserTime ElapsedTime TotUserTime   TotSysTime
      stock52  ---         52.68(1.51) 390.03(15.10) 0.34(0.01)
      sched52  21.51(0.80) 47.18(3.24) 344.29(12.78) 0.36(0.01)
   minsched52  20.50(0.03) 43.82(0.08) 328.05(0.45)  0.34(0.01)

numabench/numa_test 32
               AvgUserTime ElapsedTime  TotUserTime   TotSysTime
      stock52  ---         102.60(3.89) 794.57(31.72) 0.65(0.01)
      sched52  21.93(0.57) 92.46(1.10)  701.75(18.38) 0.67(0.02)
   minsched52  20.64(0.10) 89.95(3.16)  660.72(3.13)  0.68(0.07)



On Friday 10 January 2003 06:36, Michael Hohnbaum wrote:
> On Thu, 2003-01-09 at 15:54, Martin J. Bligh wrote:
> > I tried a small experiment today - did a simple restriction of
> > the O(1) scheduler to only balance inside a node. Coupled with
> > the small initial load balancing patch floating around, this
> > covers 95% of cases, is a trivial change (3 lines), performs
> > just as well as Erich's patch on a kernel compile, and actually
> > better on schedbench.
> >
> > This is NOT meant to be a replacement for the code Erich wrote,
> > it's meant to be a simple way to get integration and acceptance.
> > Code that just forks and never execs will stay on one node - but
> > we can take the code Erich wrote, and put it in a separate rebalancer
> > that fires much less often to do a cross-node rebalance.
>
> I tried this on my 4 node NUMAQ (16 procs, 16GB memory) and got
> similar results.  Also, added in the cputime_stats patch and am
> attaching the matrix results from the 32 process run.  Basically,
> all runs show that the initial load balancer is able to place the
> tasks evenly across the nodes, and the better overall times show
> that not migrating to keep the nodes balanced over time results
> in better performance - at least on these boxes.
>
> Obviously, there can be situations where load balancing across
> nodes is necessary, but these results point to less load balancing
> being better, at least on these machines.  It will be interesting
> to repeat this on boxes with other interconnects.
>
> $ reportbench hacksched54 sched54 stock54
> Kernbench:
>                         Elapsed       User     System        CPU
>          hacksched54    29.406s    282.18s    81.716s    1236.8%
>              sched54    29.112s   283.888s     82.84s    1259.4%
>              stock54    31.348s   303.134s    87.824s    1247.2%
>
> Schedbench 4:
>                         AvgUser    Elapsed  TotalUser   TotalSys
>          hacksched54      21.94      31.93      87.81       0.53
>              sched54      22.03      34.90      88.15       0.75
>              stock54      49.35      57.53     197.45       0.86
>
> Schedbench 8:
>                         AvgUser    Elapsed  TotalUser   TotalSys
>          hacksched54      28.23      31.62     225.87       1.11
>              sched54      27.95      37.12     223.66       1.50
>              stock54      43.14      62.97     345.18       2.12
>
> Schedbench 16:
>                         AvgUser    Elapsed  TotalUser   TotalSys
>          hacksched54      49.29      71.31     788.83       2.88
>              sched54      55.37      69.58     886.10       3.79
>              stock54      66.00      81.25    1056.25       7.12
>
> Schedbench 32:
>                         AvgUser    Elapsed  TotalUser   TotalSys
>          hacksched54      56.41     117.98    1805.35       5.90
>              sched54      57.93     132.11    1854.01      10.74
>              stock54      77.81     173.26    2490.31      12.37
>
> Schedbench 64:
>                         AvgUser    Elapsed  TotalUser   TotalSys
>          hacksched54      56.62     231.93    3624.42      13.32
>              sched54      72.91     308.87    4667.03      21.06
>              stock54      86.68     368.55    5548.57      25.73
>
> hacksched54 = 2.5.54 + Martin's tiny NUMA patch +
>               03-cputimes_stat-2.5.53.patch +
>               02-numa-sched-ilb-2.5.53-21.patch
> sched54 = 2.5.54 + 01-numa-sched-core-2.5.53-24.patch +
>           02-ilb-2.5.53-21.patch02 +
>           03-cputimes_stat-2.5.53.patch
> stock54 = 2.5.54 + 03-cputimes_stat-2.5.53.patch



* Re: [Lse-tech] Miniature NUMA scheduler
  2003-01-10 16:34   ` Erich Focht
@ 2003-01-10 16:57     ` Martin J. Bligh
  2003-01-12 23:35       ` Erich Focht
  2003-01-12 23:55       ` NUMA scheduler 2nd approach Erich Focht
  2003-01-11 14:43     ` [Lse-tech] Miniature NUMA scheduler Bill Davidsen
From: Martin J. Bligh @ 2003-01-10 16:57 UTC (permalink / raw)
  To: Erich Focht, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, linux-kernel, lse-tech

> Having some sort of automatic node affinity of processes and equal
> node loads in mind (as design targets), we could:
>  - take the minimal NUMA scheduler
>  - if the normal (node-restricted) find_busiest_queue() fails and
>  certain conditions are fulfilled (tried to balance inside own node
>  for a while and didn't succeed, own CPU idle, etc... ???) rebalance
>  over node boundaries (eg. using my load balancer)
> This actually resembles the original design of the node affine
> scheduler, having the cross-node balancing separate is ok and might
> make the ideas clearer.

This seems like the right approach to me, apart from the trigger to
do the cross-node rebalance. I don't believe that has anything to do
with whether we're internally balanced within a node or not; it's
whether the nodes are balanced relative to each other. I think we should
just check that every N ticks, looking at node load averages, and do
a cross-node rebalance if they're "significantly out".

The definitions of "N ticks" and "significantly out" would be tunable
numbers, defined by each platform; roughly speaking, the lower the NUMA
ratio, the lower these numbers would be. That also allows us to wedge
all sorts of smarts into the NUMA rebalance part of the scheduler, such
as moving the tasks with the smallest RSS off node. The NUMA rebalancer
is obviously completely missing from the current implementation, and
I expect we'd use mainly Erich's current code to implement that.
However, it's surprising how well we do with no rebalancer at all,
apart from the exec-time initial load balance code.
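
For illustration, a rough sketch of such a tick-driven check (not code
from any of the posted patches: NODE_REBALANCE_TICKS, NODE_IMBALANCE_PCT
and the per-runqueue node_lb_ticks counter are made-up stand-ins for the
per-platform tunables, while node_nr_running[] is the per-node counter
from the initial load balancing patch):

#define NODE_REBALANCE_TICKS	100	/* platform tunable: how often to check */
#define NODE_IMBALANCE_PCT	125	/* platform tunable: what counts as
					 * "significantly out", in percent */

static void node_rebalance_tick(runqueue_t *this_rq, int this_cpu)
{
	int i, this_load, max_load = 0, busiest_node = -1;
	int this_node = __cpu_to_node(this_cpu);

	if (++this_rq->node_lb_ticks < NODE_REBALANCE_TICKS)
		return;
	this_rq->node_lb_ticks = 0;

	this_load = atomic_read(&node_nr_running[this_node]);
	for (i = 0; i < numnodes; i++) {
		int load = atomic_read(&node_nr_running[i]);
		if (i != this_node && load > max_load) {
			max_load = load;
			busiest_node = i;
		}
	}

	/* only rebalance across nodes if some node is significantly
	 * more loaded than this one */
	if (busiest_node >= 0 && max_load * 100 > this_load * NODE_IMBALANCE_PCT)
		; /* ... pull from busiest_node, e.g. with Erich's
		   * cross-node balancer ... */
}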

Another big advantage of this approach is that it *obviously* changes
nothing at all for standard SMP systems (whereas your current patch does),
so it should be much easier to get it accepted into mainline ....

M.


* Re: [Lse-tech] Miniature NUMA scheduler
  2003-01-10 16:34   ` Erich Focht
  2003-01-10 16:57     ` Martin J. Bligh
@ 2003-01-11 14:43     ` Bill Davidsen
  2003-01-12 23:24       ` Erich Focht
From: Bill Davidsen @ 2003-01-11 14:43 UTC (permalink / raw)
  To: Erich Focht; +Cc: linux-kernel, lse-tech

On Fri, 10 Jan 2003, Erich Focht wrote:

> Having some sort of automatic node affinity of processes and equal
> node loads in mind (as design targets), we could:
>  - take the minimal NUMA scheduler
>  - if the normal (node-restricted) find_busiest_queue() fails and
>  certain conditions are fulfilled (tried to balance inside own node
>  for a while and didn't succeed, own CPU idle, etc... ???) rebalance
>  over node boundaries (eg. using my load balancer)
> This actually resembles the original design of the node affine
> scheduler, having the cross-node balancing separate is ok and might
> make the ideas clearer.

Agreed, but honestly just this explanation would make it easier to
understand! I'm not sure you have the "balance of nodes" trigger defined
quite right, but I'm assuming if this gets implemented as described that
some long-term imbalance detector mechanism will be run occasionally.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.



* Re: [Lse-tech] Miniature NUMA scheduler
  2003-01-11 14:43     ` [Lse-tech] Miniature NUMA scheduler Bill Davidsen
@ 2003-01-12 23:24       ` Erich Focht
From: Erich Focht @ 2003-01-12 23:24 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel, lse-tech

On Saturday 11 January 2003 15:43, Bill Davidsen wrote:
> Agreed, but honestly just this explanation would make it easier to
> understand! I'm not sure you have the "balance of nodes" trigger defined
> quite right, but I'm assuming if this gets implemented as described that
> some long-term imbalance detector mechanism will be run occasionally.

Yes, the current plan is to extend the miniature NUMA scheduler with an
inter-node balancer which is called less frequently.

Regards,
Erich




* Re: [Lse-tech] Miniature NUMA scheduler
  2003-01-10 16:57     ` Martin J. Bligh
@ 2003-01-12 23:35       ` Erich Focht
  2003-01-12 23:55       ` NUMA scheduler 2nd approach Erich Focht
From: Erich Focht @ 2003-01-12 23:35 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, linux-kernel, lse-tech

On Friday 10 January 2003 17:57, Martin J. Bligh wrote:
> This seems like the right approach to me, apart from the trigger to
> do the cross-node rebalance. I don't believe that has anything to do
> with whether we're internally balanced within a node or not; it's
> whether the nodes are balanced relative to each other. I think we should
> just check that every N ticks, looking at node load averages, and do
> a cross-node rebalance if they're "significantly out".

OK, I changed my mind about the trigger and made some experiments with
the cross-node balancer called after every N calls of
load_balance. If we make N tunable, we can even have some dynamics in
case the nodes are unbalanced: make N small if the current node is
less loaded than the average node load, make N large if the node has
roughly average load. I'll send the patches in a separate email.

> The NUMA rebalancer
> is obviously completely missing from the current implementation, and
> I expect we'd use mainly Erich's current code to implement that.
> However, it's surprising how well we do with no rebalancer at all,
> apart from the exec-time initial load balance code.

The fact that we're doing fine on kernbench and numa_test/schedbench
is (I think) understandable. In both benchmarks a process cannot
change its node during its lifetime and therefore has minimal memory
latency. In numa_test the "disturbing" hackbench just cannot move away
any of the tasks from their originating node. Therefore the results
are the best possible.

Regards,
Erich



* NUMA scheduler 2nd approach
  2003-01-10 16:57     ` Martin J. Bligh
  2003-01-12 23:35       ` Erich Focht
@ 2003-01-12 23:55       ` Erich Focht
  2003-01-13  8:02         ` Christoph Hellwig
  2003-01-14  1:23         ` Michael Hohnbaum
From: Erich Focht @ 2003-01-12 23:55 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, linux-kernel, lse-tech

[-- Attachment #1: Type: text/plain, Size: 1655 bytes --]

Hi Martin & Michael,

as discussed on the LSE call I played around with a cross-node
balancer approach put on top of the miniature NUMA scheduler. The
patches are attached and it seems to be clear that we can regain the
good performance for hackbench by adding a cross-node balancer.

The patches are:

01-minisched-2.5.55.patch : the miniature scheduler from
Martin. Balances strictly within a node. Added an inlined function to
make the integration of the cross-node balancer easier. The added code
is optimised away by the compiler.

02-initial-lb-2.5.55.patch : Michael's initial load balancer at
exec().

03-internode-lb-2.5.55.patch : internode load balancer core. The
cross-node balancing kicks in every INTERNODE_LB-th call of the
node-local load balancer: the most loaded node's cpu-mask is ORed into
the own node's cpu-mask (a remote node is only considered if its load
exceeds the local node's by more than 25%) and find_busiest_in_mask()
finds the most loaded CPU in this set.

04-smooth-node-load-2.5.55.patch : The node load measure is smoothed
by adding half of the previous node load (and 1/4 of the one before,
etc..., as discussed in the LSE call); see the short illustration
after this list. This should improve the behavior a bit in case of
short-lived load peaks and avoid bouncing tasks between nodes.

05-var-intnode-lb2-2.5.55.patch : Replaces the fixed INTERNODE_LB
interval (between cross-node balancer calls) by a variable
node-specific interval. Currently only two values are used. Certainly
needs some tweaking and tuning.
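
As a short illustration of the smoothing in 04 (sketch only; the patch
open-codes this inside find_busiest_node(), and smoothed_node_load is
just an illustrative helper name): each call halves the previously
stored value and adds the instantaneous node load, so a node that
briefly spikes to 8 tasks and then empties again is seen as 8, 4, 2,
1, ... on the following calls instead of dropping straight back to 0.

static inline int smoothed_node_load(int prev_load, int nr_node_running)
{
	/* load_t = load_{t-1}/2 + nr_node_running_t */
	return (prev_load >> 1) + nr_node_running;
}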

01, 02, 03 are enough to produce a working NUMA scheduler; when
including all patches the behavior should be better. I made some
measurements which I'd like to post in a separate email tomorrow.

Comments? Ideas?

Regards,
Erich

[-- Attachment #2: 01-minisched-2.5.55.patch --]
[-- Type: text/x-diff, Size: 1449 bytes --]

diff -urNp linux-2.5.55/kernel/sched.c linux-2.5.55-ms/kernel/sched.c
--- linux-2.5.55/kernel/sched.c	2003-01-09 05:04:22.000000000 +0100
+++ linux-2.5.55-ms/kernel/sched.c	2003-01-11 15:46:10.000000000 +0100
@@ -652,9 +652,9 @@ static inline unsigned int double_lock_b
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_in_mask - find the busiest runqueue among the cpus in cpumask
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_in_mask(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)
 {
 	int nr_running, load, max_load, i;
 	runqueue_t *busiest, *rq_src;
@@ -689,7 +689,7 @@ static inline runqueue_t *find_busiest_q
 	busiest = NULL;
 	max_load = 1;
 	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
+		if (!cpu_online(i) || !((1UL << i) & cpumask) )
 			continue;
 
 		rq_src = cpu_rq(i);
@@ -730,6 +730,16 @@ out:
 }
 
 /*
+ * find_busiest_queue - find the busiest runqueue.
+ */
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+{
+	unsigned long cpumask = __node_to_cpu_mask(__cpu_to_node(this_cpu));
+	return find_busiest_in_mask(this_rq, this_cpu, idle, imbalance,
+				    cpumask);
+}
+
+/*
  * pull_task - move a task from a remote runqueue to the local runqueue.
  * Both runqueues must be locked.
  */

[-- Attachment #3: 02-initial-lb-2.5.55.patch --]
[-- Type: text/x-diff, Size: 5614 bytes --]

diff -urNp linux-2.5.55-ms/fs/exec.c linux-2.5.55-ms-ilb/fs/exec.c
--- linux-2.5.55-ms/fs/exec.c	2003-01-09 05:04:00.000000000 +0100
+++ linux-2.5.55-ms-ilb/fs/exec.c	2003-01-11 01:09:25.000000000 +0100
@@ -1027,6 +1027,8 @@ int do_execve(char * filename, char ** a
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -urNp linux-2.5.55-ms/include/linux/sched.h linux-2.5.55-ms-ilb/include/linux/sched.h
--- linux-2.5.55-ms/include/linux/sched.h	2003-01-09 05:03:53.000000000 +0100
+++ linux-2.5.55-ms-ilb/include/linux/sched.h	2003-01-11 01:10:31.000000000 +0100
@@ -160,7 +160,6 @@ extern void update_one_process(struct ta
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
 
-
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
 asmlinkage void schedule(void);
@@ -444,6 +443,20 @@ extern void set_cpus_allowed(task_t *p, 
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+	rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+	rq->nr_running--
+#else
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
+#endif
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -urNp linux-2.5.55-ms/init/main.c linux-2.5.55-ms-ilb/init/main.c
--- linux-2.5.55-ms/init/main.c	2003-01-09 05:03:55.000000000 +0100
+++ linux-2.5.55-ms-ilb/init/main.c	2003-01-11 01:09:25.000000000 +0100
@@ -495,6 +495,7 @@ static void do_pre_smp_initcalls(void)
 
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
 
diff -urNp linux-2.5.55-ms/kernel/sched.c linux-2.5.55-ms-ilb/kernel/sched.c
--- linux-2.5.55-ms/kernel/sched.c	2003-01-10 23:01:02.000000000 +0100
+++ linux-2.5.55-ms-ilb/kernel/sched.c	2003-01-11 01:12:43.000000000 +0100
@@ -153,6 +153,7 @@ struct runqueue {
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
+	atomic_t * node_ptr;
 
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -294,7 +295,7 @@ static inline void activate_task(task_t 
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -302,7 +303,7 @@ static inline void activate_task(task_t 
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -736,9 +737,9 @@ out:
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -2171,6 +2172,86 @@ __init int migration_init(void)
 
 #endif
 
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+	}
+	return;
+}
+
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+	unsigned long cpumask;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+
+	minload = 10000000;
+	cpumask = __node_to_cpu_mask(node);
+	for (i = 0; i < NR_CPUS; ++i) {
+		if (!(cpumask & (1UL << i)))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+#endif /* CONFIG_NUMA */
+
 #if CONFIG_SMP || CONFIG_PREEMPT
 /*
  * The 'big kernel lock'
@@ -2232,6 +2313,10 @@ void __init sched_init(void)
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+		rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;

[-- Attachment #4: 03-internode-lb-2.5.55.patch --]
[-- Type: text/x-diff, Size: 2304 bytes --]

diff -urNp linux-2.5.55-ms-ilb/include/linux/sched.h linux-2.5.55-ms-ilb-nb/include/linux/sched.h
--- linux-2.5.55-ms-ilb/include/linux/sched.h	2003-01-12 18:03:07.000000000 +0100
+++ linux-2.5.55-ms-ilb-nb/include/linux/sched.h	2003-01-11 17:19:49.000000000 +0100
@@ -450,11 +450,13 @@ extern void node_nr_running_init(void);
 	rq->nr_running++
 #define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
 	rq->nr_running--
+#define decl_numa_int(ctr) int ctr
 #else
 #define sched_balance_exec() {}
 #define node_nr_running_init() {}
 #define nr_running_inc(rq) rq->nr_running++
 #define nr_running_dec(rq) rq->nr_running--
+#define decl_numa_int(ctr) {}
 #endif
 
 extern void set_user_nice(task_t *p, long nice);
diff -urNp linux-2.5.55-ms-ilb/kernel/sched.c linux-2.5.55-ms-ilb-nb/kernel/sched.c
--- linux-2.5.55-ms-ilb/kernel/sched.c	2003-01-12 18:03:07.000000000 +0100
+++ linux-2.5.55-ms-ilb-nb/kernel/sched.c	2003-01-12 18:12:27.000000000 +0100
@@ -154,6 +154,7 @@ struct runqueue {
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
 	atomic_t * node_ptr;
+	decl_numa_int(lb_cntr);
 
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -703,6 +704,23 @@ void sched_balance_exec(void)
 			sched_migrate_task(current, new_cpu);
 	}
 }
+
+static int find_busiest_node(int this_node)
+{
+	int i, node = this_node, load, this_load, maxload;
+	
+	this_load = maxload = atomic_read(&node_nr_running[this_node]);
+	for (i = 0; i < numnodes; i++) {
+		if (i == this_node)
+			continue;
+		load = atomic_read(&node_nr_running[i]);
+		if (load > maxload && (4*load > ((5*4*this_load)/4))) {
+			maxload = load;
+			node = i;
+		}
+	}
+	return node;
+}
 #endif /* CONFIG_NUMA */
 
 #if CONFIG_SMP
@@ -816,6 +834,16 @@ out:
 static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
 {
 	unsigned long cpumask = __node_to_cpu_mask(__cpu_to_node(this_cpu));
+#if CONFIG_NUMA
+	int node;
+#       define INTERNODE_LB 10
+
+	/* rebalance across nodes only every INTERNODE_LB call */
+	if (!(++(this_rq->lb_cntr) % INTERNODE_LB)) {
+		node = find_busiest_node(__cpu_to_node(this_cpu));
+		cpumask |= __node_to_cpu_mask(node);
+	}
+#endif
 	return find_busiest_in_mask(this_rq, this_cpu, idle, imbalance,
 				    cpumask);
 }

[-- Attachment #5: 04-smooth-node-load-2.5.55.patch --]
[-- Type: text/x-diff, Size: 2230 bytes --]

diff -urNp linux-2.5.55-ms-ilb-nb/include/linux/sched.h linux-2.5.55-ms-ilb-nba/include/linux/sched.h
--- linux-2.5.55-ms-ilb-nb/include/linux/sched.h	2003-01-11 17:19:49.000000000 +0100
+++ linux-2.5.55-ms-ilb-nba/include/linux/sched.h	2003-01-11 17:17:53.000000000 +0100
@@ -451,12 +451,14 @@ extern void node_nr_running_init(void);
 #define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
 	rq->nr_running--
 #define decl_numa_int(ctr) int ctr
+#define decl_numa_nodeint(v) int v[MAX_NUMNODES]
 #else
 #define sched_balance_exec() {}
 #define node_nr_running_init() {}
 #define nr_running_inc(rq) rq->nr_running++
 #define nr_running_dec(rq) rq->nr_running--
 #define decl_numa_int(ctr) {}
+#define decl_numa_nodeint(v) {}
 #endif
 
 extern void set_user_nice(task_t *p, long nice);
diff -urNp linux-2.5.55-ms-ilb-nb/kernel/sched.c linux-2.5.55-ms-ilb-nba/kernel/sched.c
--- linux-2.5.55-ms-ilb-nb/kernel/sched.c	2003-01-12 18:18:16.000000000 +0100
+++ linux-2.5.55-ms-ilb-nba/kernel/sched.c	2003-01-12 18:15:34.000000000 +0100
@@ -155,6 +155,7 @@ struct runqueue {
 	int prev_nr_running[NR_CPUS];
 	atomic_t * node_ptr;
 	decl_numa_int(lb_cntr);
+	decl_numa_nodeint(prev_node_load);
 
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -705,15 +706,25 @@ void sched_balance_exec(void)
 	}
 }
 
+/*
+ * Find the busiest node. All previous node loads contribute with a geometrically
+ * decaying weight to the load measure:
+ *      load_{t} = load_{t-1}/2 + nr_node_running_{t}
+ * This way sudden load peaks are flattened out a bit.
+ */
 static int find_busiest_node(int this_node)
 {
 	int i, node = this_node, load, this_load, maxload;
 	
-	this_load = maxload = atomic_read(&node_nr_running[this_node]);
+	this_load = maxload = (this_rq()->prev_node_load[this_node] >> 1)
+		+ atomic_read(&node_nr_running[this_node]);
+	this_rq()->prev_node_load[this_node] = this_load;
 	for (i = 0; i < numnodes; i++) {
 		if (i == this_node)
 			continue;
-		load = atomic_read(&node_nr_running[i]);
+		load = (this_rq()->prev_node_load[i] >> 1)
+			+ atomic_read(&node_nr_running[i]);
+		this_rq()->prev_node_load[i] = load;
 		if (load > maxload && (4*load > ((5*4*this_load)/4))) {
 			maxload = load;
 			node = i;

[-- Attachment #6: 05-var-intnode-lb2-2.5.55.patch --]
[-- Type: text/x-diff, Size: 2247 bytes --]

diff -urNp linux-2.5.55-ms-ilb-nba/kernel/sched.c linux-2.5.55-ms-ilb-nbav/kernel/sched.c
--- linux-2.5.55-ms-ilb-nba/kernel/sched.c	2003-01-12 18:15:34.000000000 +0100
+++ linux-2.5.55-ms-ilb-nbav/kernel/sched.c	2003-01-12 23:16:25.000000000 +0100
@@ -629,6 +629,9 @@ static inline void double_rq_unlock(runq
 
 #if CONFIG_NUMA
 static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+#define MAX_INTERNODE_LB 40
+#define MIN_INTERNODE_LB 4
+static int internode_lb[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = MAX_INTERNODE_LB};
 
 __init void node_nr_running_init(void)
 {
@@ -715,21 +718,31 @@ void sched_balance_exec(void)
 static int find_busiest_node(int this_node)
 {
 	int i, node = this_node, load, this_load, maxload;
+	int avg_load;
 	
 	this_load = maxload = (this_rq()->prev_node_load[this_node] >> 1)
 		+ atomic_read(&node_nr_running[this_node]);
 	this_rq()->prev_node_load[this_node] = this_load;
+	avg_load = this_load;
 	for (i = 0; i < numnodes; i++) {
 		if (i == this_node)
 			continue;
 		load = (this_rq()->prev_node_load[i] >> 1)
 			+ atomic_read(&node_nr_running[i]);
+		avg_load += load;
 		this_rq()->prev_node_load[i] = load;
 		if (load > maxload && (4*load > ((5*4*this_load)/4))) {
 			maxload = load;
 			node = i;
 		}
 	}
+	avg_load = avg_load / (numnodes ? numnodes : 1);
+	if (this_load < (avg_load / 2)) {
+		if (internode_lb[this_node] == MAX_INTERNODE_LB)
+			internode_lb[this_node] = MIN_INTERNODE_LB;
+	} else
+		if (internode_lb[this_node] == MIN_INTERNODE_LB)
+			internode_lb[this_node] = MAX_INTERNODE_LB;
 	return node;
 }
 #endif /* CONFIG_NUMA */
@@ -846,11 +859,10 @@ static inline runqueue_t *find_busiest_q
 {
 	unsigned long cpumask = __node_to_cpu_mask(__cpu_to_node(this_cpu));
 #if CONFIG_NUMA
-	int node;
-#       define INTERNODE_LB 10
+	int node, this_node = __cpu_to_node(this_cpu);
 
-	/* rebalance across nodes only every INTERNODE_LB call */
-	if (!(++(this_rq->lb_cntr) % INTERNODE_LB)) {
+	/* rarely rebalance across nodes */
+	if (!(++(this_rq->lb_cntr) % internode_lb[this_node])) {
 		node = find_busiest_node(__cpu_to_node(this_cpu));
 		cpumask |= __node_to_cpu_mask(node);
 	}


* Re: NUMA scheduler 2nd approach
  2003-01-12 23:55       ` NUMA scheduler 2nd approach Erich Focht
@ 2003-01-13  8:02         ` Christoph Hellwig
  2003-01-13 11:32           ` Erich Focht
  2003-01-14  1:23         ` Michael Hohnbaum
From: Christoph Hellwig @ 2003-01-13  8:02 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Michael Hohnbaum, Robert Love, Ingo Molnar,
	linux-kernel, lse-tech

On Mon, Jan 13, 2003 at 12:55:28AM +0100, Erich Focht wrote:
> Hi Martin & Michael,
> 
> as discussed on the LSE call I played around with a cross-node
> balancer approach put on top of the miniature NUMA scheduler. The
> patches are attached and it seems to be clear that we can regain the
> good performance for hackbench by adding a cross-node balancer.

The changes look fine to me.  But I think there are some coding style
issues that need cleaning up (see below).  Also, is there a reason
patches 2/3 and 4/5 aren't merged into one patch each?

 /*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_in_mask - find the busiest runqueue among the cpus in cpumask
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_in_mask(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)


	find_busiest_queue has just one caller in 2.5.56; I'd suggest just
	changing the prototype and updating that single caller to pass in
	the cpumask open-coded.


@@ -160,7 +160,6 @@ extern void update_one_process(struct ta
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
 
-
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
 asmlinkage void schedule(void);

	I don't think you need this spurious whitespace change :)
 
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+	rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+	rq->nr_running--

	static inline void nr_running_inc(runqueue_t *rq)
	{
		atomic_inc(rq->node_ptr);
		rq->nr_running++;
	}

	etc.. would look a bit nicer.

diff -urNp linux-2.5.55-ms/kernel/sched.c linux-2.5.55-ms-ilb/kernel/sched.c
--- linux-2.5.55-ms/kernel/sched.c	2003-01-10 23:01:02.000000000 +0100
+++ linux-2.5.55-ms-ilb/kernel/sched.c	2003-01-11 01:12:43.000000000 +0100
@@ -153,6 +153,7 @@ struct runqueue {
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
+	atomic_t * node_ptr;

	atomic_t *node_ptr; would match the style above.

+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};

	Maybe wants some linewrapping after 80 chars?


+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+	}
+	return;
+}

	The braces and the return are superfluous.  Also kernel/sched.c (or
	mingo code in general) seems to prefer array + i instead of &array[i]
	(not that I have a general preference, but you should try to match
	the surrounding code)

+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);

	This double set_cpus_allowed doesn't look nice to me.  I don't
	have a better suggestion off-hand, though :(

+#define decl_numa_int(ctr) int ctr

	This is ugly as hell.  I'd prefer wasting one int in each runqueue
	or even an ifdef in the struct declaration over this obfuscation
	all the time.

@@ -816,6 +834,16 @@ out:
 static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
 {
 	unsigned long cpumask = __node_to_cpu_mask(__cpu_to_node(this_cpu));
+#if CONFIG_NUMA
+	int node;
+#       define INTERNODE_LB 10

	This wants to be put with the other magic constants in the scheduler
	and needs an explanation there.


 #define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
 	rq->nr_running--
 #define decl_numa_int(ctr) int ctr
+#define decl_numa_nodeint(v) int v[MAX_NUMNODES]

	Another one of those..  You should really just stick the CONFIG_NUMA
	ifdef into the actual struct declaration.

+/*
+ * Find the busiest node. All previous node loads contribute with a geometrically
+ * decaying weight to the load measure:

	Linewrapping again..




* Re: NUMA scheduler 2nd approach
  2003-01-13  8:02         ` Christoph Hellwig
@ 2003-01-13 11:32           ` Erich Focht
  2003-01-13 15:26             ` [Lse-tech] " Christoph Hellwig
  2003-01-13 19:03             ` Michael Hohnbaum
From: Erich Focht @ 2003-01-13 11:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin J. Bligh, Michael Hohnbaum, Robert Love, Ingo Molnar,
	linux-kernel, lse-tech

Hi Christoph,

thanks for reviewing the code and the helpful comments!

On Monday 13 January 2003 09:02, Christoph Hellwig wrote:
> On Mon, Jan 13, 2003 at 12:55:28AM +0100, Erich Focht wrote:
> > as discussed on the LSE call I played around with a cross-node
> > balancer approach put on top of the miniature NUMA scheduler. The
> > patches are attached and it seems to be clear that we can regain the
> > good performance for hackbench by adding a cross-node balancer.
>
> The changes look fine to me.  But I think there are some coding style
> issues that need cleaning up (see below).  Also, is there a reason
> patches 2/3 and 4/5 aren't merged into one patch each?

The patches are separated by their functionality. Patch 2 comes from
Michael Hohnbaum, so I kept that separate for that reason. Right now
we can swap out the individual components, but once we decide that they are
doing the job, I'd also prefer to have just one patch.

>  /*
> - * find_busiest_queue - find the busiest runqueue.
> + * find_busiest_in_mask - find the busiest runqueue among the cpus in cpumask
>   */
> -static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
> +static inline runqueue_t *find_busiest_in_mask(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)
>
>
> 	find_busiest_queue has just one caller in 2.5.56; I'd suggest just
> 	changing the prototype and updating that single caller to pass in
> 	the cpumask open-coded.

Having find_busiest_queue() and find_busiest_in_mask() as separate
functions makes it simpler to merge in the cross-node balancer (patch
3). Otherwise we'd have to add two #ifdef CONFIG_NUMA blocks into
load_balance() (one for new variable declarations, the other one for
selecting the target node mask). We might have several calls to
find_busiest_in_mask() later, if we decide to add multi-level node
hierarchy support...


> 	I don't think you need this spurious whitespace change :)

:-) slipped in somehow.


> +#ifdef CONFIG_NUMA
> +extern void sched_balance_exec(void);
> +extern void node_nr_running_init(void);
> +#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
> +	rq->nr_running++
> +#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
> +	rq->nr_running--
>
> 	static inline void nr_running_inc(runqueue_t *rq)
> 	{
> 		atomic_inc(rq->node_ptr);
> 		rq->nr_running++;
> 	}
>
> 	etc.. would look a bit nicer.

We can change this. Michael, ok with you?


> +#if CONFIG_NUMA
> +static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
>
> 	Maybe wants some linewrapping after 80 chars?

Yes.


> +	for (i = 0; i < NR_CPUS; i++) {
> +		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
> +	}
> +	return;
> +}
>
> 	The braces and the return are superflous.  Also kernel/sched.c (or
> 	mingo codein general) seems to prefer array + i instead of &array[i]
> 	(not that I have a general preferences, but you should try to match
> 	the surrounding code)

Will change the braces and remove the return. I personally find
&array[i] more readable.

> +static void sched_migrate_task(task_t *p, int dest_cpu)
> +{
> +	unsigned long old_mask;
> +
> +	old_mask = p->cpus_allowed;
> +	if (!(old_mask & (1UL << dest_cpu)))
> +		return;
> +	/* force the process onto the specified CPU */
> +	set_cpus_allowed(p, 1UL << dest_cpu);
> +
> +	/* restore the cpus allowed mask */
> +	set_cpus_allowed(p, old_mask);
>
> 	This double set_cpus_allowed doesn't look nice to me.  I don't
> 	have a better suggestion of-hand, though :(

This is not that bad. It involves only a single wakeup of the
migration thread, and that's more important. Doing it another way
would mean replicating the set_cpus_allowed() code.

> +#define decl_numa_int(ctr) int ctr
>
> 	This is ugly as hell.  I'd prefer wasting one int in each runqueue
> 	or even an ifdef in the struct declaration over this obsfucation
> 	all the time.

Agreed :-) Just trying to avoid #ifdefs in sched.c as much as
possible. Somehow had the feeling Linus doesn't like that. On the
other hand: CONFIG_NUMA is a special case of CONFIG_SMP and nobody has
anything against CONFIG_SMP in sched.c...


> @@ -816,6 +834,16 @@ out:
>  static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
>  {
>  	unsigned long cpumask = __node_to_cpu_mask(__cpu_to_node(this_cpu));
> +#if CONFIG_NUMA
> +	int node;
> +#       define INTERNODE_LB 10
>
> 	This wants to be put to the other magic constants in the scheduler
> 	and needs an explanation there.

We actually get rid of this in patch #5 (variable internode_lb,
depending on the load of the current node). But ok, I'll move the
MIN_INTERNODE_LB and MAX_INTERNODE_LB variables to the magic
constants. They'll be outside the #ifdef CONFIG_NUMA block...

Thanks,

best regards,
Erich



* Re: [Lse-tech] Re: NUMA scheduler 2nd approach
  2003-01-13 11:32           ` Erich Focht
@ 2003-01-13 15:26             ` Christoph Hellwig
  2003-01-13 15:46               ` Erich Focht
  2003-01-13 19:03             ` Michael Hohnbaum
From: Christoph Hellwig @ 2003-01-13 15:26 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Michael Hohnbaum, Robert Love, Ingo Molnar,
	linux-kernel, lse-tech

Anyone interested in this cleaned-up minimal NUMA scheduler?  This
is basically Erich's patches 1-3 with my suggestions applied.

This does not mean I don't like 4 & 5, but I'd rather get a small,
non-intrusive patch into Linus' tree now and do the fine-tuning later.


--- 1.62/fs/exec.c	Fri Jan 10 08:21:00 2003
+++ edited/fs/exec.c	Mon Jan 13 15:33:32 2003
@@ -1031,6 +1031,8 @@
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
--- 1.119/include/linux/sched.h	Sat Jan 11 07:44:15 2003
+++ edited/include/linux/sched.h	Mon Jan 13 15:58:11 2003
@@ -444,6 +444,14 @@
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#else
+# define sched_balance_exec()	do { } while (0)
+# define node_nr_running_init()	do { } while (0)
+#endif
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
--- 1.91/init/main.c	Mon Jan  6 04:08:49 2003
+++ edited/init/main.c	Mon Jan 13 15:33:33 2003
@@ -495,6 +495,7 @@
 
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
 
--- 1.148/kernel/sched.c	Sat Jan 11 07:44:22 2003
+++ edited/kernel/sched.c	Mon Jan 13 16:17:34 2003
@@ -67,6 +67,7 @@
 #define INTERACTIVE_DELTA	2
 #define MAX_SLEEP_AVG		(2*HZ)
 #define STARVATION_LIMIT	(2*HZ)
+#define NODE_BALANCE_RATIO	10
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -154,6 +155,11 @@
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
 
+#ifdef CONFIG_NUMA
+	atomic_t *node_nr_running;
+	int nr_balanced;
+#endif
+
 	task_t *migration_thread;
 	struct list_head migration_queue;
 
@@ -178,6 +184,38 @@
 #endif
 
 /*
+ * Keep track of running tasks.
+ */
+#if CONFIG_NUMA
+
+/* XXX(hch): this should go into a struct sched_node_data */
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+static inline void nr_running_init(struct runqueue *rq)
+{
+	rq->node_nr_running = &node_nr_running[0];
+}
+
+static inline void nr_running_inc(struct runqueue *rq)
+{
+	atomic_inc(rq->node_nr_running);
+	rq->nr_running++;
+}
+
+static inline void nr_running_dec(struct runqueue *rq)
+{
+	atomic_dec(rq->node_nr_running);
+	rq->nr_running--;
+}
+
+#else
+# define nr_running_init(rq)	do { } while (0)
+# define nr_running_inc(rq)	do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq)	do { (rq)->nr_running--; } while (0)
+#endif /* CONFIG_NUMA */
+
+/*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
  * explicitly disabling preemption.
@@ -294,7 +332,7 @@
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -302,7 +340,7 @@
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -624,9 +662,108 @@
 		spin_unlock(&rq2->lock);
 }
 
-#if CONFIG_SMP
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ *
+ * Note:  This isn't actually numa-specific, but just not used otherwise.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask = p->cpus_allowed;
+
+	if (old_mask & (1UL << dest_cpu)) {
+		unsigned long flags;
+		struct runqueue *rq;
+
+		/* force the process onto the specified CPU */
+		set_cpus_allowed(p, 1UL << dest_cpu);
+
+		/* restore the cpus allowed mask */
+		rq = task_rq_lock(p, &flags);
+		p->cpus_allowed = old_mask;
+		task_rq_unlock(rq, &flags);
+	}
+}
 
 /*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+	unsigned long cpumask;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+
+	minload = 10000000;
+	cpumask = __node_to_cpu_mask(node);
+	for (i = 0; i < NR_CPUS; ++i) {
+		if (!(cpumask & (1UL << i)))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+
+static int find_busiest_node(int this_node)
+{
+	int i, node = this_node, load, this_load, maxload;       
+
+	this_load = maxload = atomic_read(&node_nr_running[this_node]);
+	for (i = 0; i < numnodes; i++) {
+		if (i == this_node)
+			continue;
+		load = atomic_read(&node_nr_running[i]);
+		if (load > maxload && (4*load > ((5*4*this_load)/4))) {
+			maxload = load;
+			node = i;
+		}
+	}
+
+	return node;
+}
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++)
+		cpu_rq(i)->node_nr_running = node_nr_running + __cpu_to_node(i);
+}
+#endif /* CONFIG_NUMA */
+
+#if CONFIG_SMP
+/*
  * double_lock_balance - lock the busiest runqueue
  *
  * this_rq is locked already. Recalculate nr_running if we have to
@@ -652,9 +789,10 @@
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_queue - find the busiest runqueue among the cpus in cpumask
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu,
+		int idle, int *imbalance, unsigned long cpumask)
 {
 	int nr_running, load, max_load, i;
 	runqueue_t *busiest, *rq_src;
@@ -689,7 +827,7 @@
 	busiest = NULL;
 	max_load = 1;
 	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
+		if (!cpu_online(i) || !((1UL << i) & cpumask))
 			continue;
 
 		rq_src = cpu_rq(i);
@@ -736,9 +874,9 @@
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -758,13 +896,27 @@
  */
 static void load_balance(runqueue_t *this_rq, int idle)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
+	int imbalance, idx, this_cpu, this_node;
+	unsigned long cpumask;
 	runqueue_t *busiest;
 	prio_array_t *array;
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+	this_cpu = smp_processor_id();
+	this_node = __cpu_to_node(this_cpu);
+	cpumask = __node_to_cpu_mask(this_node);
+
+#if CONFIG_NUMA
+	/*
+	 * Avoid rebalancing between nodes too often.
+	 */
+	if (!(++this_rq->nr_balanced % NODE_BALANCE_RATIO))
+		cpumask |= __node_to_cpu_mask(find_busiest_node(this_node));
+#endif
+
+	busiest = find_busiest_queue(this_rq, this_cpu, idle,
+			&imbalance, cpumask);
 	if (!busiest)
 		goto out;
 
@@ -2231,6 +2383,7 @@
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+		nr_running_init(rq);
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;


* Re: [Lse-tech] Re: NUMA scheduler 2nd approach
  2003-01-13 15:26             ` [Lse-tech] " Christoph Hellwig
@ 2003-01-13 15:46               ` Erich Focht
  0 siblings, 0 replies; 96+ messages in thread
From: Erich Focht @ 2003-01-13 15:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin J. Bligh, Michael Hohnbaum, Robert Love, Ingo Molnar,
	linux-kernel, lse-tech

Hi Christoph,

I just finished some experiments which show that the fine-tuning can
really be left for later. So this approach is ok for me. I hope we can
get enough support for integrating this tiny numa scheduler. 

I didn't do all possible measurements; the interesting ones are with
patches 1-4 (nb-smooth) and 1-5 (nb-sm-var1, nb-sm-var2) applied. They
show pretty consistent results (within error bars). The fine-tuning in
patch #5 doesn't buy us much right now (on my platform), so we can
leave it out.

Here's the data:

Results on an 8 CPU ia64 machine with 4 nodes (2 CPUs per node).

kernbench:
                elapsed       user          system
      stock52   134.52(0.84)  951.64(0.97)  20.72(0.22)
      sched52   133.19(1.49)  944.24(0.50)  21.36(0.24)
   minsched52   135.47(0.47)  937.61(0.20)  21.30(0.14)
    nb-smooth   133.61(0.71)  944.71(0.35)  21.22(0.22)
   nb-sm-var1   135.23(2.07)  943.78(0.54)  21.54(0.17)
   nb-sm-var2   133.87(0.61)  944.18(0.62)  21.32(0.13)

schedbench/hackbench: time(s)
               10         25         50         100
      stock52  0.81(0.04) 2.06(0.07) 4.09(0.13) 7.89(0.25)
      sched52  0.81(0.04) 2.03(0.07) 4.14(0.20) 8.61(0.35)
   minsched52  1.28(0.05) 3.19(0.06) 6.59(0.13) 13.56(0.27)
    nb-smooth  0.77(0.03) 1.94(0.04) 3.80(0.06) 7.97(0.35)
   nb-sm-var1  0.81(0.05) 2.01(0.09) 3.89(0.21) 8.20(0.34)
   nb-sm-var2  0.82(0.04) 2.10(0.09) 4.19(0.14) 8.15(0.24)

numabench/numa_test 4
               AvgUserTime ElapsedTime TotUserTime TotSysTime
      stock52  ---         27.23(0.52) 89.30(4.18) 0.09(0.01)
      sched52  22.32(1.00) 27.39(0.42) 89.29(4.02) 0.10(0.01)
   minsched52  20.01(0.01) 23.40(0.13) 80.05(0.02) 0.08(0.01)
    nb-smooth  21.01(0.79) 24.70(2.75) 84.04(3.15) 0.09(0.01)
   nb-sm-var1  21.39(0.83) 26.03(2.15) 85.56(3.31) 0.09(0.01)
   nb-sm-var2  22.18(0.74) 27.36(0.42) 88.72(2.94) 0.09(0.01)

numabench/numa_test 8
               AvgUserTime ElapsedTime TotUserTime  TotSysTime
      stock52  ---         27.50(2.58) 174.74(6.66) 0.18(0.01)
      sched52  21.73(1.00) 33.70(1.82) 173.87(7.96) 0.18(0.01)
   minsched52  20.31(0.00) 23.50(0.12) 162.47(0.04) 0.16(0.01)
    nb-smooth  20.46(0.44) 24.21(1.95) 163.68(3.56) 0.16(0.01)
   nb-sm-var1  20.45(0.44) 23.95(1.73) 163.62(3.49) 0.17(0.01)
   nb-sm-var2  20.71(0.82) 23.78(2.42) 165.74(6.58) 0.17(0.01)

numabench/numa_test 16
               AvgUserTime ElapsedTime TotUserTime   TotSysTime
      stock52  ---         52.68(1.51) 390.03(15.10) 0.34(0.01)
      sched52  21.51(0.80) 47.18(3.24) 344.29(12.78) 0.36(0.01)
   minsched52  20.50(0.03) 43.82(0.08) 328.05(0.45)  0.34(0.01)
    nb-smooth  21.12(0.69) 47.42(4.02) 337.99(10.99) 0.34(0.01)
   nb-sm-var1  21.18(0.77) 48.19(5.05) 338.94(12.38) 0.34(0.01)
   nb-sm-var2  21.69(0.91) 49.05(4.36) 347.03(14.49) 0.34(0.01)

numabench/numa_test 32
               AvgUserTime ElapsedTime  TotUserTime   TotSysTime
      stock52  ---         102.60(3.89) 794.57(31.72) 0.65(0.01)
      sched52  21.93(0.57) 92.46(1.10)  701.75(18.38) 0.67(0.02)
   minsched52  20.64(0.10) 89.95(3.16)  660.72(3.13)  0.68(0.07)
    nb-smooth  20.95(0.19) 86.63(1.74)  670.56(6.02)  0.66(0.02)
   nb-sm-var1  21.47(0.54) 90.95(3.28)  687.12(17.42) 0.67(0.02)
   nb-sm-var2  21.45(0.67) 89.91(3.80)  686.47(21.37) 0.68(0.02)


The kernels used:
  - stock52 : 2.5.52 + ia64 patch
  - sched52 : stock52 + old numa scheduler
  - minsched52 : stock52 + miniature NUMA scheduler (cannot load
  balance across nodes)
  - nb-smooth : minisched52 + node balancer + smooth node load patch
  - nb-sm-var1 : nb-smooth + variable internode_lb, (MIN,MAX) = (4,40)
  - nb-sm-var2 : nb-smooth + variable internode_lb, (MIN,MAX) = (1,16)

Best regards,
Erich


On Monday 13 January 2003 16:26, Christoph Hellwig wrote:
> Anyone interested in this cleaned up minimal numa scheduler?  This
> is basically Erich's patches 1-3 with my suggestions applied.
>
> This does not mean I don't like 4 & 5, but I'd rather get a small,
> non-intrusive patch into Linus' tree now and do the fine-tuning later.
>
[...]



* Re: NUMA scheduler 2nd approach
  2003-01-13 11:32           ` Erich Focht
  2003-01-13 15:26             ` [Lse-tech] " Christoph Hellwig
@ 2003-01-13 19:03             ` Michael Hohnbaum
  1 sibling, 0 replies; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-13 19:03 UTC (permalink / raw)
  To: Erich Focht
  Cc: Christoph Hellwig, Martin J. Bligh, Robert Love, Ingo Molnar,
	linux-kernel, lse-tech

On Mon, 2003-01-13 at 03:32, Erich Focht wrote:
> Hi Christoph,
> 
> thanks for reviewing the code and the helpful comments!
> 
> On Monday 13 January 2003 09:02, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2003 at 12:55:28AM +0100, Erich Focht wrote:
> > > as discussed on the LSE call I played around with a cross-node
> > > balancer approach put on top of the miniature NUMA scheduler. The
> > > patches are attached and it seems to be clear that we can regain the
> > > good performance for hackbench by adding a cross-node balancer.
> >
> > The changes look fine to me.  But I think there are some coding style
> > issues that need cleaning up (see below).  Also is there a reason
> > patches 2/3 and 4/5 aren't merged into one patch each?
> 
> The patches are separated by their functionality. Patch 2 comes from
> Michael Hohnbaum, so I kept that separate for that reason. Right now
> we can exchange the single components but when we decide that they are
> doing the job, I'd also prefer to have just one patch.

I'm not opposed to combining the patches, but isn't it typically
preferred to have patches as small as possible?  The initial load
balance patch (patch 2) applies and functions fairly independent
of the the other patches, so has been easy to maintain as a separate
patch.
> 
> 
> > +#ifdef CONFIG_NUMA
> > +extern void sched_balance_exec(void);
> > +extern void node_nr_running_init(void);
> > +#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
> > +	rq->nr_running++
> > +#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
> > +	rq->nr_running--
> >
> > 	static inline void nr_running_inc(runqueue_t *rq)
> > 	{
> > 		atomic_inc(rq->node_ptr);
> > 		rq->nr_running++
> > 	}
> >
> > 	etc.. would look a bit nicer.
> 
> We can change this. Michael, ok with you?

Fine with me.
> 
> 
> > +#if CONFIG_NUMA
> > +static atomic_t node_nr_running[MAX_NUMNODES]
> > ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
> >
> > 	Maybe wants some linewrapping after 80 chars?
> 
> Yes.
> 
> 
> > +	for (i = 0; i < NR_CPUS; i++) {
> > +		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
> > +	}
> > +	return;
> > +}
> >
> > 	The braces and the return are superfluous.  Also kernel/sched.c (or
> > 	mingo code in general) seems to prefer array + i instead of &array[i]
> > 	(not that I have a general preference, but you should try to match
> > 	the surrounding code)
> 
> Will change the braces and remove the return. I personally find
> &array[i] more readable.

I agree with Erich here - &array[i] is clearer.
> 
> > +static void sched_migrate_task(task_t *p, int dest_cpu)
> > +{
> > +	unsigned long old_mask;
> > +
> > +	old_mask = p->cpus_allowed;
> > +	if (!(old_mask & (1UL << dest_cpu)))
> > +		return;
> > +	/* force the process onto the specified CPU */
> > +	set_cpus_allowed(p, 1UL << dest_cpu);
> > +
> > +	/* restore the cpus allowed mask */
> > +	set_cpus_allowed(p, old_mask);
> >
> > 	This double set_cpus_allowed doesn't look nice to me.  I don't
> > 	have a better suggestion off-hand, though :(
> 
> This is not that bad. It involves only one single wakeup of the
> migration thread, and that's more important. Doing it another way
> would mean to replicate the set_cpus_allowed() code.

While this shows up in the patch credited to me, it was actually
Erich's idea, and I think a quite good one.  My initial impression
of this was that it was ugly, but upon closer examination (and trying
to implement this some other way) I realized that Erich's implementation,
which I borrowed from, was quite good.

Good work Erich in getting the internode balancer written and
working.  I'm currently building a kernel with these patches and
will post results later in the day.  Thanks.
 
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486



* Re: NUMA scheduler 2nd approach
  2003-01-12 23:55       ` NUMA scheduler 2nd approach Erich Focht
  2003-01-13  8:02         ` Christoph Hellwig
@ 2003-01-14  1:23         ` Michael Hohnbaum
  2003-01-14  4:45           ` [Lse-tech] " Andrew Theurer
  2003-01-14 10:56           ` Erich Focht
  1 sibling, 2 replies; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-14  1:23 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Robert Love, Ingo Molnar, linux-kernel, lse-tech

On Sun, 2003-01-12 at 15:55, Erich Focht wrote:
> Hi Martin & Michael,
> 
> as discussed on the LSE call I played around with a cross-node
> balancer approach put on top of the miniature NUMA scheduler. The
> patches are attached and it seems to be clear that we can regain the
> good performance for hackbench by adding a cross-node balancer.

Erich,

I played with this today on my 4 node (16 CPU) NUMAQ.  Spent most
of the time working with the first three patches.  What I found was
that rebalancing was happening too much between nodes.  I tried a
few things to change this, but have not yet settled on the best
approach.  A key item to work with is the check in find_busiest_node
to determine if the found node is sufficiently busier to warrant stealing
from it.  Currently the check is that the node has 125% of the load
of the current node.  I think that, for my system at least, we need
to add in a constant to this equation.  I tried using 4 and that
helped a little.  Finally I added in the 04 patch, and that helped
a lot.  Still, there is too much process movement between nodes.

Tomorrow, I will continue experiments, but base them on the first
4 patches.  Two suggestions for minor changes:

* Make the check in find_busiest_node into a macro that is defined
  in the arch specific topology header file.  Then different NUMA
  architectures can tune this appropriately.

* In find_busiest_queue change: 

	cpumask |= __node_to_cpu_mask(node); 
  to:
	cpumask = __node_to_cpu_mask(node) | (1UL << (this_cpu));
	

  There is no reason to iterate over the runqueues on the current 
  node, which is what the code currently does.

Some numbers for anyone interested:

All numbers are based on a 2.5.55 kernel with the cputime stats patch:
  * stock55 = no additional patches
  * mini+rebal-55 = patches 01, 02, and 03
  * rebal+4+fix = patches 01, 02, 03, and the cpumask change described
    above, and a +4 constant added to the check in find_busiest_node
  * rebal+4+fix+04 = above with the 04 patch added

Kernbench:
                        Elapsed       User     System        CPU
   rebal+4+fix+04-55    29.302s   285.136s    82.106s      1253%
      rebal+4+fix-55    30.498s   286.586s    88.176s    1228.6%
       mini+rebal-55    30.756s   287.646s    85.512s    1212.8%
             stock55    31.018s   303.084s    86.626s    1256.2%

Schedbench 4:
                        AvgUser    Elapsed  TotalUser   TotalSys
   rebal+4+fix+04-55      27.34      40.49     109.39       0.88
      rebal+4+fix-55      24.73      38.50      98.94       0.84
       mini+rebal-55      25.18      43.23     100.76       0.68
             stock55      31.38      41.55     125.54       1.24

Schedbench 8:
                        AvgUser    Elapsed  TotalUser   TotalSys
   rebal+4+fix+04-55      30.05      44.15     240.48       2.50
      rebal+4+fix-55      34.33      46.40     274.73       2.31
       mini+rebal-55      32.99      52.42     264.00       2.08
             stock55      44.63      61.28     357.11       2.22

Schedbench 16:
                        AvgUser    Elapsed  TotalUser   TotalSys
   rebal+4+fix+04-55      52.13      57.68     834.23       3.55
      rebal+4+fix-55      52.72      65.16     843.70       4.55
       mini+rebal-55      57.29      71.51     916.84       5.10
             stock55      66.91      85.08    1070.72       6.05

Schedbench 32:
                        AvgUser    Elapsed  TotalUser   TotalSys
   rebal+4+fix+04-55      56.38     124.09    1804.67       7.71
      rebal+4+fix-55      55.13     115.18    1764.46       8.86
       mini+rebal-55      57.83     125.80    1850.84      10.19
             stock55      80.38     181.80    2572.70      13.22

Schedbench 64:
                        AvgUser    Elapsed  TotalUser   TotalSys
   rebal+4+fix+04-55      57.42     238.32    3675.77      17.68
      rebal+4+fix-55      60.06     252.96    3844.62      18.88
       mini+rebal-55      58.15     245.30    3722.38      19.64
             stock55     123.96     513.66    7934.07      26.39


And here are the results from running numa_test 32 on rebal+4+fix+04:

Executing 32 times: ./randupdt 1000000 
Running 'hackbench 20' in the background: Time: 8.383
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1   100.0    0.0    0.0    0.0 |    0     0    |  56.19
  2   100.0    0.0    0.0    0.0 |    0     0    |  53.80
  3     0.0    0.0  100.0    0.0 |    2     2    |  55.61
  4   100.0    0.0    0.0    0.0 |    0     0    |  54.13
  5     3.7    0.0    0.0   96.3 |    3     3    |  56.48
  6     0.0    0.0  100.0    0.0 |    2     2    |  55.11
  7     0.0    0.0  100.0    0.0 |    2     2    |  55.94
  8     0.0    0.0  100.0    0.0 |    2     2    |  55.69
  9    80.6   19.4    0.0    0.0 |    1     0   *|  56.53
 10     0.0    0.0    0.0  100.0 |    3     3    |  53.00
 11     0.0   99.2    0.0    0.8 |    1     1    |  56.72
 12     0.0    0.0    0.0  100.0 |    3     3    |  54.58
 13     0.0  100.0    0.0    0.0 |    1     1    |  59.38
 14     0.0   55.6    0.0   44.4 |    3     1   *|  63.06
 15     0.0  100.0    0.0    0.0 |    1     1    |  56.02
 16     0.0   19.2    0.0   80.8 |    1     3   *|  58.07
 17     0.0  100.0    0.0    0.0 |    1     1    |  53.78
 18     0.0    0.0  100.0    0.0 |    2     2    |  55.28
 19     0.0   78.6    0.0   21.4 |    3     1   *|  63.20
 20     0.0  100.0    0.0    0.0 |    1     1    |  53.27
 21     0.0    0.0  100.0    0.0 |    2     2    |  55.79
 22     0.0    0.0    0.0  100.0 |    3     3    |  57.23
 23    12.4   19.1    0.0   68.5 |    1     3   *|  61.05
 24     0.0    0.0  100.0    0.0 |    2     2    |  54.50
 25     0.0    0.0    0.0  100.0 |    3     3    |  56.82
 26     0.0    0.0  100.0    0.0 |    2     2    |  56.28
 27    15.3    0.0    0.0   84.7 |    3     3    |  57.12
 28   100.0    0.0    0.0    0.0 |    0     0    |  53.85
 29    32.7   67.2    0.0    0.0 |    0     1   *|  62.66
 30   100.0    0.0    0.0    0.0 |    0     0    |  53.86
 31   100.0    0.0    0.0    0.0 |    0     0    |  53.94
 32   100.0    0.0    0.0    0.0 |    0     0    |  55.36
AverageUserTime 56.38 seconds
ElapsedTime     124.09
TotalUserTime   1804.67
TotalSysTime    7.71

Ideally, there would be nothing but 100.0 in all non-zero entries.
I'll try adding in the 05 patch, and if that does not help, will
try a few other adjustments.


Thanks for the quick effort on putting together the node rebalance
code.  I'll also get some hackbench numbers soon.
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486



* Re: [Lse-tech] Re: NUMA scheduler 2nd approach
  2003-01-14  1:23         ` Michael Hohnbaum
@ 2003-01-14  4:45           ` Andrew Theurer
  2003-01-14  4:56             ` Martin J. Bligh
  2003-01-14  5:50             ` [Lse-tech] Re: NUMA scheduler 2nd approach Michael Hohnbaum
  2003-01-14 10:56           ` Erich Focht
  1 sibling, 2 replies; 96+ messages in thread
From: Andrew Theurer @ 2003-01-14  4:45 UTC (permalink / raw)
  To: Michael Hohnbaum, Erich Focht
  Cc: Martin J. Bligh, Robert Love, Ingo Molnar, linux-kernel, lse-tech

> Erich,
> 
> I played with this today on my 4 node (16 CPU) NUMAQ.  Spent most
> of the time working with the first three patches.  What I found was
> that rebalancing was happening too much between nodes.  I tried a
> few things to change this, but have not yet settled on the best
> approach.  A key item to work with is the check in find_busiest_node
> to determine if the found node is busier enough to warrant stealing
> from it.  Currently the check is that the node has 125% of the load
> of the current node.  I think that, for my system at least, we need
> to add in a constant to this equation.  I tried using 4 and that
> helped a little.  

Michael,

in:

+static int find_busiest_node(int this_node)
+{
+ int i, node = this_node, load, this_load, maxload;
+ 
+ this_load = maxload = atomic_read(&node_nr_running[this_node]);
+ for (i = 0; i < numnodes; i++) {
+  if (i == this_node)
+   continue;
+  load = atomic_read(&node_nr_running[i]);
+  if (load > maxload && (4*load > ((5*4*this_load)/4))) {
+   maxload = load;
+   node = i;
+  }
+ }
+ return node;
+}

You changed ((5*4*this_load)/4) to:
  (5*4*(this_load+4)/4)
or
  (4+(5*4*(this_load)/4))  ?

We def need some constant to avoid low load ping pong, right?

> Finally I added in the 04 patch, and that helped
> a lot.  Still, there is too much process movement between nodes.

perhaps increase INTERNODE_LB?

-Andrew Theurer




* Re: [Lse-tech] Re: NUMA scheduler 2nd approach
  2003-01-14  4:45           ` [Lse-tech] " Andrew Theurer
@ 2003-01-14  4:56             ` Martin J. Bligh
  2003-01-14 11:14               ` Erich Focht
  2003-01-14  5:50             ` [Lse-tech] Re: NUMA scheduler 2nd approach Michael Hohnbaum
  1 sibling, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-14  4:56 UTC (permalink / raw)
  To: Andrew Theurer, Michael Hohnbaum, Erich Focht
  Cc: Robert Love, Ingo Molnar, linux-kernel, lse-tech

>> I played with this today on my 4 node (16 CPU) NUMAQ.  Spent most
>> of the time working with the first three patches.  What I found was
>> that rebalancing was happening too much between nodes.  I tried a
>> few things to change this, but have not yet settled on the best
>> approach.  A key item to work with is the check in find_busiest_node
>> to determine if the found node is busier enough to warrant stealing
>> from it.  Currently the check is that the node has 125% of the load
>> of the current node.  I think that, for my system at least, we need
>> to add in a constant to this equation.  I tried using 4 and that
>> helped a little.  
> 
> Michael,
> 
> in:
> 
> +static int find_busiest_node(int this_node)
> +{
> + int i, node = this_node, load, this_load, maxload;
> + 
> + this_load = maxload = atomic_read(&node_nr_running[this_node]);
> + for (i = 0; i < numnodes; i++) {
> +  if (i == this_node)
> +   continue;
> +  load = atomic_read(&node_nr_running[i]);
> +  if (load > maxload && (4*load > ((5*4*this_load)/4))) {
> +   maxload = load;
> +   node = i;
> +  }
> + }
> + return node;
> +}
> 
> You changed ((5*4*this_load)/4) to:
>   (5*4*(this_load+4)/4)
> or
>   (4+(5*4*(this_load)/4))  ?
> 
> We def need some constant to avoid low load ping pong, right?
> 
>> Finally I added in the 04 patch, and that helped
>> a lot.  Still, there is too much process movement between nodes.
> 
> perhaps increase INTERNODE_LB?

Before we tweak this too much, how about using the global load 
average for this? I can envisage a situation where we have two
nodes with 8 tasks per node, one with 12 tasks, and one with four.
You really don't want the ones with 8 tasks pulling stuff from
the 12 ... only for the least loaded node to start pulling stuff
later.

What about if we take the global load average, and multiply by
num_cpus_on_this_node / num_cpus_globally ... that'll give us
roughly what we should have on this node. If we're significantly
underloaded compared to that, we start pulling stuff from
the busiest node? And you get the damping over time for free.
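
For illustration only, here is a minimal sketch of that calculation
(global_nr_running(), nr_cpus_on_node() and nr_online_cpus_total() are
assumed helpers for the sketch, not existing interfaces):

/*
 * Sketch: pull across nodes only when this node is clearly below its
 * fair share of the global load.  All three helpers are assumptions.
 */
static int node_underloaded(int this_node)
{
	int fair_share, this_load;

	/* this node's share of all running tasks in the system */
	fair_share = global_nr_running() * nr_cpus_on_node(this_node)
			/ nr_online_cpus_total();

	this_load = atomic_read(&node_nr_running[this_node]);

	/* pull only when at least ~25% below the fair share */
	return 4 * this_load < 3 * fair_share;
}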

I think it'd be best if we stay fairly conservative for this, all
we're trying to catch is the corner case where a bunch of stuff 
has forked, but not execed, and we have a long term imbalance.
Aggressive rebalancing might win a few benchmarks, but you'll just
cause inter-node task bounce on others. 

M.



* Re: [Lse-tech] Re: NUMA scheduler 2nd approach
  2003-01-14  4:45           ` [Lse-tech] " Andrew Theurer
  2003-01-14  4:56             ` Martin J. Bligh
@ 2003-01-14  5:50             ` Michael Hohnbaum
  2003-01-14 16:52               ` Andrew Theurer
  1 sibling, 1 reply; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-14  5:50 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Erich Focht, Martin J. Bligh, Robert Love, Ingo Molnar,
	linux-kernel, lse-tech

On Mon, 2003-01-13 at 20:45, Andrew Theurer wrote:
> > Erich,
> > 
> > I played with this today on my 4 node (16 CPU) NUMAQ.  Spent most
> > of the time working with the first three patches.  What I found was
> > that rebalancing was happening too much between nodes.  I tried a
> > few things to change this, but have not yet settled on the best
> > approach.  A key item to work with is the check in find_busiest_node
> > to determine if the found node is busier enough to warrant stealing
> > from it.  Currently the check is that the node has 125% of the load
> > of the current node.  I think that, for my system at least, we need
> > to add in a constant to this equation.  I tried using 4 and that
> > helped a little.  
> 
> Michael,
> 
> in:
> 
> +static int find_busiest_node(int this_node)
> +{
> + int i, node = this_node, load, this_load, maxload;
> + 
> + this_load = maxload = atomic_read(&node_nr_running[this_node]);
> + for (i = 0; i < numnodes; i++) {
> +  if (i == this_node)
> +   continue;
> +  load = atomic_read(&node_nr_running[i]);
> +  if (load > maxload && (4*load > ((5*4*this_load)/4))) {
> +   maxload = load;
> +   node = i;
> +  }
> + }
> + return node;
> +}
> 
> You changed ((5*4*this_load)/4) to:
>   (5*4*(this_load+4)/4)
> or
>   (4+(5*4*(this_load)/4))  ?

I suppose I should not have been so dang lazy and cut-n-pasted
the line I changed.  The change was (((5*4*this_load)/4) + 4)
which should be the same as your second choice.
> 
> We def need some constant to avoid low load ping pong, right?

Yep.  Without the constant, one could have 6 processes on node
A and 4 on node B, and node B would end up stealing.  While that
makes for a perfect balance, the expense of the off-node traffic does
not justify it.  At least on the NUMAQ box.  It might be justified
for a different NUMA architecture, which is why I propose putting
this check in a macro that can be defined in topology.h for each
architecture.
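
A sketch of what such an arch-specific knob could look like (the name
NODE_BALANCE_OFFSET and its placement in asm/topology.h are made up
here for illustration):

/* per-arch offset, e.g. in asm/topology.h; 4 is what I tried on NUMAQ */
#define NODE_BALANCE_OFFSET	4

/* remote node qualifies only if its load clears 125% of ours
 * plus the arch-defined offset */
#define node_load_exceeds(load, this_load) \
	(4*(load) > ((5*4*(this_load))/4) + NODE_BALANCE_OFFSET)

With the 6-vs-4 example above, seen from node B: without the offset,
4*6 = 24 > 5*4 = 20, so B would steal from A; with the +4 offset,
24 > 24 is false, so B leaves the imbalance alone.
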
> 
> > Finally I added in the 04 patch, and that helped
> > a lot.  Still, there is too much process movement between nodes.
> 
> perhaps increase INTERNODE_LB?

That is on the list to try.  Martin was mumbling something about
using the system-wide load average to help make the inter-node
balance decision.  I'd like to give that a try before tweaking
INTERNODE_LB.
> 
> -Andrew Theurer
> 
            Michael Hohnbaum
            hohnbaum@us.ibm.com



* Re: NUMA scheduler 2nd approach
  2003-01-14  1:23         ` Michael Hohnbaum
  2003-01-14  4:45           ` [Lse-tech] " Andrew Theurer
@ 2003-01-14 10:56           ` Erich Focht
  1 sibling, 0 replies; 96+ messages in thread
From: Erich Focht @ 2003-01-14 10:56 UTC (permalink / raw)
  To: Michael Hohnbaum
  Cc: Martin J. Bligh, Robert Love, Ingo Molnar, linux-kernel, lse-tech

Hi Michael,

thanks for the many experiments; they give some more
insight. It's quite clear that we'll have to tune the knobs a bit to
make this fit well.

At first a comment on the "target" of the tuning:

>  26     0.0    0.0  100.0    0.0 |    2     2    |  56.28
>  27    15.3    0.0    0.0   84.7 |    3     3    |  57.12
>  28   100.0    0.0    0.0    0.0 |    0     0    |  53.85
>  29    32.7   67.2    0.0    0.0 |    0     1   *|  62.66
>  30   100.0    0.0    0.0    0.0 |    0     0    |  53.86
>  31   100.0    0.0    0.0    0.0 |    0     0    |  53.94
>  32   100.0    0.0    0.0    0.0 |    0     0    |  55.36
> AverageUserTime 56.38 seconds
> ElapsedTime     124.09
> TotalUserTime   1804.67
> TotalSysTime    7.71
>
> Ideally, there would be nothing but 100.0 in all non-zero entries.
> I'll try adding in the 05 patch, and if that does not help, will
> try a few other adjustments.

The numa_test is written such that you MUST get numbers below 100%
with this kind of scheduler! There is a "disturbing" process started 3
seconds after the parallel tasks are up and running which has the aim
to "shake" the system a bit and the scheduler ends up with  some of
the parallel tasks moved away from their homenode. This is a
reallistic situation for long running parallel background jobs. In the
min-sched case this cannot happen, therefore you see the 100% all
over!

Increasing the barrier for moving a task away from its node more and
more won't necessarily help the scheduler, as we end up with the
min-sched limitations. The aim will be to introduce node affinity for
processes such that they can return to their homenode when the
exceptional load situation is gone. We should not try to reach the
min-sched numbers without having node-affine tasks.


> * Make the check in find_busiest_node into a macro that is defined
>   in the arch specific topology header file.  Then different NUMA
>   architectures can tune this appropriately.

I still think that the load balancing condition should be clear
and the same for every architecture. But tunable. I understand that
the constant added helps but I don't have a clear motivation for it.
We could keep the imbalance ratio as a condition for searching the
maximum:
> +  if (load > maxload && (4*load > ((5*4*this_load)/4))) {
(with tunable barrier height: say replace 5/4 by 11/8 or similar, i.e.
  (8*load > ((11*8*this_load)/8))
)
and rule out any special unwanted cases after the loop. These would be
cases where it is obvious that we don't have to steal anything, like:
load of remote node is <= number of cpus in remote node.

For making this tunable I suggest something like:
#define CROSS_NODE_BAL 125
   (100*load > ((CROSS_NODE_BAL*100*this_load)/100))
and let platforms define their own value. Using 100 makes it easy to
understand.
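
Putting the two suggestions together (a percentage threshold that
platforms can override, plus a sanity check after the loop), a sketch
could look like the following; nr_cpus_on_node() is an assumed helper
here, topology.h does not currently provide one:

/* platforms may override this in their topology.h */
#ifndef CROSS_NODE_BAL
#define CROSS_NODE_BAL 125	/* remote load must exceed ours by 25% */
#endif

static int find_busiest_node(int this_node)
{
	int i, node = -1, load, this_load, maxload;

	this_load = maxload = atomic_read(&node_nr_running[this_node]);
	for (i = 0; i < numnodes; i++) {
		if (i == this_node)
			continue;
		load = atomic_read(&node_nr_running[i]);
		if (load > maxload &&
		    (100*load > ((CROSS_NODE_BAL*100*this_load)/100))) {
			maxload = load;
			node = i;
		}
	}
	/* rule out the obvious non-cases after the loop, e.g. a node
	 * that has no more runnable tasks than it has cpus */
	if (node >= 0 && maxload <= nr_cpus_on_node(node))
		node = -1;
	return node;
}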

> * In find_busiest_queue change:
>
> 	cpumask |= __node_to_cpu_mask(node);
>   to:
> 	cpumask = __node_to_cpu_mask(node) | (1UL << (this_cpu));
>
>
>   There is no reason to iterate over the runqueues on the current
>   node, which is what the code currently does.

Here I thought I'd give the load balancer the chance to balance
internally, just in case our own node's load jumped up in the
meantime. But this should ideally be caught by find_busiest_node(),
therefore I agree with your change.


> Some numbers for anyone interested:Kernbench:
>
> All numbers are based on a 2.5.55 kernel with the cputime stats patch:
>   * stock55 = no additional patches
>   * mini+rebal-55 = patches 01, 02, and 03
>   * rebal+4+fix = patches 01, 02, 03, and the cpumask change described
>     above, and a +4 constant added to the check in find_busiest_node
>   * rebal+4+fix+04 = above with the 04 patch added
>
>                         Elapsed       User     System        CPU
>    rebal+4+fix+04-55    29.302s   285.136s    82.106s      1253%
>       rebal+4+fix-55    30.498s   286.586s    88.176s    1228.6%
>        mini+rebal-55    30.756s   287.646s    85.512s    1212.8%
>              stock55    31.018s   303.084s    86.626s    1256.2%

The first line is VERY good!

The results below (for Schedbench) are not really meaningful without
statistics... But the combination rebal+4+fix+04-55 is the clear
winner for me. Still, I'd be curious to know the rebal+fix+04-55
numbers ;-)

> Schedbench 4:
>                         AvgUser    Elapsed  TotalUser   TotalSys
>    rebal+4+fix+04-55      27.34      40.49     109.39       0.88
>       rebal+4+fix-55      24.73      38.50      98.94       0.84
>        mini+rebal-55      25.18      43.23     100.76       0.68
>              stock55      31.38      41.55     125.54       1.24
>
> Schedbench 8:
>                         AvgUser    Elapsed  TotalUser   TotalSys
>    rebal+4+fix+04-55      30.05      44.15     240.48       2.50
>       rebal+4+fix-55      34.33      46.40     274.73       2.31
>        mini+rebal-55      32.99      52.42     264.00       2.08
>              stock55      44.63      61.28     357.11       2.22
>
> Schedbench 16:
>                         AvgUser    Elapsed  TotalUser   TotalSys
>    rebal+4+fix+04-55      52.13      57.68     834.23       3.55
>       rebal+4+fix-55      52.72      65.16     843.70       4.55
>        mini+rebal-55      57.29      71.51     916.84       5.10
>              stock55      66.91      85.08    1070.72       6.05
>
> Schedbench 32:
>                         AvgUser    Elapsed  TotalUser   TotalSys
>    rebal+4+fix+04-55      56.38     124.09    1804.67       7.71
>       rebal+4+fix-55      55.13     115.18    1764.46       8.86
>        mini+rebal-55      57.83     125.80    1850.84      10.19
>              stock55      80.38     181.80    2572.70      13.22
>
> Schedbench 64:
>                         AvgUser    Elapsed  TotalUser   TotalSys
>    rebal+4+fix+04-55      57.42     238.32    3675.77      17.68
>       rebal+4+fix-55      60.06     252.96    3844.62      18.88
>        mini+rebal-55      58.15     245.30    3722.38      19.64
>              stock55     123.96     513.66    7934.07      26.39
>
>
> And here is the results from running numa_test 32 on rebal+4+fix+04:
>
> Executing 32 times: ./randupdt 1000000
> Running 'hackbench 20' in the background: Time: 8.383
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1   100.0    0.0    0.0    0.0 |    0     0    |  56.19
>   2   100.0    0.0    0.0    0.0 |    0     0    |  53.80
>   3     0.0    0.0  100.0    0.0 |    2     2    |  55.61
>   4   100.0    0.0    0.0    0.0 |    0     0    |  54.13
>   5     3.7    0.0    0.0   96.3 |    3     3    |  56.48
>   6     0.0    0.0  100.0    0.0 |    2     2    |  55.11
>   7     0.0    0.0  100.0    0.0 |    2     2    |  55.94
>   8     0.0    0.0  100.0    0.0 |    2     2    |  55.69
>   9    80.6   19.4    0.0    0.0 |    1     0   *|  56.53
>  10     0.0    0.0    0.0  100.0 |    3     3    |  53.00
>  11     0.0   99.2    0.0    0.8 |    1     1    |  56.72
>  12     0.0    0.0    0.0  100.0 |    3     3    |  54.58
>  13     0.0  100.0    0.0    0.0 |    1     1    |  59.38
>  14     0.0   55.6    0.0   44.4 |    3     1   *|  63.06
>  15     0.0  100.0    0.0    0.0 |    1     1    |  56.02
>  16     0.0   19.2    0.0   80.8 |    1     3   *|  58.07
>  17     0.0  100.0    0.0    0.0 |    1     1    |  53.78
>  18     0.0    0.0  100.0    0.0 |    2     2    |  55.28
>  19     0.0   78.6    0.0   21.4 |    3     1   *|  63.20
>  20     0.0  100.0    0.0    0.0 |    1     1    |  53.27
>  21     0.0    0.0  100.0    0.0 |    2     2    |  55.79
>  22     0.0    0.0    0.0  100.0 |    3     3    |  57.23
>  23    12.4   19.1    0.0   68.5 |    1     3   *|  61.05
>  24     0.0    0.0  100.0    0.0 |    2     2    |  54.50
>  25     0.0    0.0    0.0  100.0 |    3     3    |  56.82
>  26     0.0    0.0  100.0    0.0 |    2     2    |  56.28
>  27    15.3    0.0    0.0   84.7 |    3     3    |  57.12
>  28   100.0    0.0    0.0    0.0 |    0     0    |  53.85
>  29    32.7   67.2    0.0    0.0 |    0     1   *|  62.66
>  30   100.0    0.0    0.0    0.0 |    0     0    |  53.86
>  31   100.0    0.0    0.0    0.0 |    0     0    |  53.94
>  32   100.0    0.0    0.0    0.0 |    0     0    |  55.36
> AverageUserTime 56.38 seconds
> ElapsedTime     124.09
> TotalUserTime   1804.67
> TotalSysTime    7.71
>
> Ideally, there would be nothing but 100.0 in all non-zero entries.
> I'll try adding in the 05 patch, and if that does not help, will
> try a few other adjustments.
>
> Thanks for the quick effort on putting together the node rebalance
> code.  I'll also get some hackbench numbers soon.

Best regards,

Erich



* Re: [Lse-tech] Re: NUMA scheduler 2nd approach
  2003-01-14  4:56             ` Martin J. Bligh
@ 2003-01-14 11:14               ` Erich Focht
  2003-01-14 15:55                 ` [PATCH 2.5.58] new NUMA scheduler Erich Focht
  0 siblings, 1 reply; 96+ messages in thread
From: Erich Focht @ 2003-01-14 11:14 UTC (permalink / raw)
  To: Martin J. Bligh, Andrew Theurer, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, linux-kernel, lse-tech

Hi Martin,

> Before we tweak this too much, how about using the global load
> average for this? I can envisage a situation where we have two
> nodes with 8 tasks per node, one with 12 tasks, and one with four.
> You really don't want the ones with 8 tasks pulling stuff from
> the 12 ... only for the least loaded node to start pulling stuff
> later.

Hmmm, yet another idea from the old NUMA scheduler coming back,
therefore it has my full support ;-). Though we can't do it as I did
it there: in the old NUMA approach every cross-node steal was delayed,
only 1-2ms if our own node was underloaded, a lot more if our own
node's load was average or above average. We might need to eventually
steal something even if we have "above average" load, because the
cpu mask of the tasks on the overloaded node might only allow them to
migrate to us... But this is also a special case which we should solve
later.


> What about if we take the global load average, and multiply by
> num_cpus_on_this_node / num_cpus_globally ... that'll give us
> roughly what we should have on this node. If we're significantly
> out underloaded compared to that, we start pulling stuff from
> the busiest node? And you get the damping over time for free.

Patch 05 is going in this direction, but the constants I chose
were very aggressive. I'll update the whole set of patches and try to
send out something today.

Best regards,
Erich



* Re: [Lse-tech] Re: NUMA scheduler 2nd approach
  2003-01-14 16:52               ` Andrew Theurer
@ 2003-01-14 15:13                 ` Erich Focht
  0 siblings, 0 replies; 96+ messages in thread
From: Erich Focht @ 2003-01-14 15:13 UTC (permalink / raw)
  To: Andrew Theurer, Michael Hohnbaum
  Cc: Martin J. Bligh, Robert Love, Ingo Molnar, linux-kernel, lse-tech

On Tuesday 14 January 2003 17:52, Andrew Theurer wrote:
> > I suppose I should not have been so dang lazy and cut-n-pasted
> > the line I changed.  The change was (((5*4*this_load)/4) + 4)
> > which should be the same as your second choice.
> >
> > > We def need some constant to avoid low load ping pong, right?
> >
> > Yep.  Without the constant, one could have 6 processes on node
> > A and 4 on node B, and node B would end up stealing.  While making
> > a perfect balance, the expense of the off-node traffic does not
> > justify it.  At least on the NUMAQ box.  It might be justified
> > for a different NUMA architecture, which is why I propose putting
> > this check in a macro that can be defined in topology.h for each
> > architecture.
>
> Yes, I was also concerned about one task in one node and none in the
> others. Without some sort of constant we will ping pong the task on every
> node endlessly, since there is no % threshold that could make any
> difference when the original load value is 0..  Your +4 gets rid of the 1
> task case.

That won't happen because:
 - find_busiest_queue() won't detect a sufficient imbalance
 (imbalance == 0),
 - load_balance() won't be able to steal the task, as it will be
 running (the rq->curr task)

So the worst thing that happens is that load_balance() finds out that
it cannot steal the only task running on the runqueue and
returns. But we won't get there because find_busiest_queue() returns
NULL. It happens even in the bad case:
   node#0 : 0 tasks
   node#1 : 4 tasks (each on its own CPU)
where we would desire to distribute the tasks equally among the
nodes. At least on our IA64 platform...

Regards,
Erich





* [PATCH 2.5.58] new NUMA scheduler
  2003-01-14 11:14               ` Erich Focht
@ 2003-01-14 15:55                 ` Erich Focht
  2003-01-14 16:07                   ` [Lse-tech] " Christoph Hellwig
  2003-01-14 16:23                   ` [PATCH 2.5.58] new NUMA scheduler: fix Erich Focht
  0 siblings, 2 replies; 96+ messages in thread
From: Erich Focht @ 2003-01-14 15:55 UTC (permalink / raw)
  To: Martin J. Bligh, Andrew Theurer, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, linux-kernel, lse-tech

[-- Attachment #1: Type: text/plain, Size: 1993 bytes --]

Here's the new version of the NUMA scheduler built on top of the
miniature scheduler of Martin. I incorporated Michael's ideas and
Christoph's suggestions and rediffed for 2.5.58. 

The whole patch is really tiny: 9.5k. This time I attached the numa
scheduler in the form of two patches:

numa-sched-2.5.58.patch     (7k) : components 01, 02, 03
numa-sched-add-2.5.58.patch (3k) : components 04, 05

The single components are also attached in a small tgz archive:

01-minisched-2.5.58.patch : the miniature scheduler from
Martin. Balances strictly within a node. Removed the
find_busiest_in_mask() function.

02-initial-lb-2.5.58.patch : Michael's initial load balancer at
exec(). Cosmetic corrections suggested by Christoph.

03-internode-lb-2.5.58.patch : internode load balancer core. Invoked
every NODE_BALANCE_RATIO calls of the regular (intra-node) load
balancer. Tunable parameters:
  NODE_BALANCE_RATIO (default: 10)
  NODE_THRESHOLD     (default: 125) : consider only nodes with load
                     above NODE_THRESHOLD/100 * own_node_load
  I added the constant factor of 4 suggested by Michael, but I'm not
  really happy with it. This should be nr_cpus_in_node, but we don't
  have that info in topology.h

04-smooth-node-load-2.5.58.patch : The node load measure is smoothed
by adding half of the previous node load (and 1/4 of the one before,
etc..., as discussed in the LSE call). This should improve the
behavior a bit in case of short-lived load peaks and avoid bouncing
tasks between nodes (a short sketch of this follows after the patch
descriptions).

05-var-intnode-lb-2.5.58.patch : Replaces the fixed NODE_BALANCE_RATIO
interval (between cross-node balancer calls) with a variable
node-specific interval. Currently only two values are used:
   NODE_BALANCE_MIN : 10
   NODE_BALANCE_MAX : 40
If the node load is less than avg_node_load/2, we switch to
NODE_BALANCE_MIN, otherwise we use the large interval.
I also added a function to reduce the number of tasks stolen from
remote nodes.
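
To make the smoothing in patch 04 concrete, this is essentially the
load measure it computes (written as a helper only for this sketch;
prev_node_load is the per-runqueue array the patch adds):

static int smoothed_node_load(runqueue_t *rq, int node)
{
	/* halve the previous measure and add the current node load,
	 * so older samples decay geometrically */
	int load = (rq->prev_node_load[node] >> 1)
			+ atomic_read(&node_nr_running[node]);

	rq->prev_node_load[node] = load;
	return load;
}

A one-tick spike of 8 extra tasks then shows up as 8, 4, 2, 1, 0 over
the following calls instead of vanishing immediately, and a constant
load L settles at roughly 2*L, which is why the add-on patch compares
against 4+2+1 rather than 4.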

Regards,
Erich

[-- Attachment #2: numa-sched-2.5.58.patch --]
[-- Type: text/x-diff, Size: 7770 bytes --]

diff -urNp 2.5.58/fs/exec.c 2.5.58-ms-ilb-nb/fs/exec.c
--- 2.5.58/fs/exec.c	2003-01-14 06:58:33.000000000 +0100
+++ 2.5.58-ms-ilb-nb/fs/exec.c	2003-01-14 16:31:08.000000000 +0100
@@ -1031,6 +1031,8 @@ int do_execve(char * filename, char ** a
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -urNp 2.5.58/include/linux/sched.h 2.5.58-ms-ilb-nb/include/linux/sched.h
--- 2.5.58/include/linux/sched.h	2003-01-14 06:58:06.000000000 +0100
+++ 2.5.58-ms-ilb-nb/include/linux/sched.h	2003-01-14 16:31:08.000000000 +0100
@@ -160,7 +160,6 @@ extern void update_one_process(struct ta
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
 
-
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
 asmlinkage void schedule(void);
@@ -444,6 +443,28 @@ extern void set_cpus_allowed(task_t *p, 
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+
+static inline void nr_running_inc(runqueue_t *rq)
+{
+	atomic_inc(rq->node_ptr);
+	rq->nr_running++;
+}
+
+static inline void nr_running_dec(runqueue_t *rq)
+{
+	atomic_dec(rq->node_ptr);
+	rq->nr_running--;
+}
+#else
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
+#endif
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -urNp 2.5.58/init/main.c 2.5.58-ms-ilb-nb/init/main.c
--- 2.5.58/init/main.c	2003-01-14 06:58:20.000000000 +0100
+++ 2.5.58-ms-ilb-nb/init/main.c	2003-01-14 16:31:08.000000000 +0100
@@ -495,6 +495,7 @@ static void do_pre_smp_initcalls(void)
 
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
 
diff -urNp 2.5.58/kernel/sched.c 2.5.58-ms-ilb-nb/kernel/sched.c
--- 2.5.58/kernel/sched.c	2003-01-14 06:59:11.000000000 +0100
+++ 2.5.58-ms-ilb-nb/kernel/sched.c	2003-01-14 16:31:57.000000000 +0100
@@ -67,6 +67,8 @@
 #define INTERACTIVE_DELTA	2
 #define MAX_SLEEP_AVG		(2*HZ)
 #define STARVATION_LIMIT	(2*HZ)
+#define NODE_BALANCE_RATIO	10
+#define NODE_THRESHOLD          125
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -153,7 +155,10 @@ struct runqueue {
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
-
+#ifdef CONFIG_NUMA
+	atomic_t * node_ptr;
+	unsigned int nr_balanced;
+#endif
 	task_t *migration_thread;
 	struct list_head migration_queue;
 
@@ -294,7 +299,7 @@ static inline void activate_task(task_t 
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -302,7 +307,7 @@ static inline void activate_task(task_t 
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -624,6 +629,105 @@ static inline void double_rq_unlock(runq
 		spin_unlock(&rq2->lock);
 }
 
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++)
+		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+}
+
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+	unsigned long cpumask;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+
+	minload = 10000000;
+	cpumask = __node_to_cpu_mask(node);
+	for (i = 0; i < NR_CPUS; ++i) {
+		if (!(cpumask & (1UL << i)))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+
+static int find_busiest_node(int this_node)
+{
+	int i, node = -1, load, this_load, maxload;
+	
+	this_load = maxload = atomic_read(&node_nr_running[this_node]);
+	for (i = 0; i < numnodes; i++) {
+		if (i == this_node)
+			continue;
+		load = atomic_read(&node_nr_running[i]);
+		if (load > maxload &&
+		    (100*load > ((NODE_THRESHOLD*100*this_load)/100))) {
+			maxload = load;
+			node = i;
+		}
+	}
+	if (maxload <= 4)
+		node = -1;
+	return node;
+}
+#endif /* CONFIG_NUMA */
+
 #if CONFIG_SMP
 
 /*
@@ -652,9 +756,9 @@ static inline unsigned int double_lock_b
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_queue - find the busiest runqueue among the cpus in cpumask
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)
 {
 	int nr_running, load, max_load, i;
 	runqueue_t *busiest, *rq_src;
@@ -689,7 +793,7 @@ static inline runqueue_t *find_busiest_q
 	busiest = NULL;
 	max_load = 1;
 	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
+		if (!cpu_online(i) || !((1UL << i) & cpumask) )
 			continue;
 
 		rq_src = cpu_rq(i);
@@ -736,9 +840,9 @@ out:
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -763,8 +867,21 @@ static void load_balance(runqueue_t *thi
 	prio_array_t *array;
 	struct list_head *head, *curr;
 	task_t *tmp;
+	int this_node = __cpu_to_node(this_cpu);
+	unsigned long cpumask = __node_to_cpu_mask(this_node);
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+#if CONFIG_NUMA
+	/*
+	 * Avoid rebalancing between nodes too often.
+	 */
+	if (!(++(this_rq->nr_balanced) % NODE_BALANCE_RATIO)) {
+		int node = find_busiest_node(this_node);
+		if (node >= 0)
+			cpumask = __node_to_cpu_mask(node) | (1UL << this_cpu);
+	}
+#endif
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
+				    cpumask);
 	if (!busiest)
 		goto out;
 
@@ -2231,6 +2348,10 @@ void __init sched_init(void)
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+		rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;

[-- Attachment #3: numa-sched-add-2.5.58.patch --]
[-- Type: text/x-diff, Size: 3104 bytes --]

diff -urNp 2.5.58-ms-ilb-nb/kernel/sched.c 2.5.58-ms-ilb-nb-sm-var/kernel/sched.c
--- 2.5.58-ms-ilb-nb/kernel/sched.c	2003-01-14 16:31:57.000000000 +0100
+++ 2.5.58-ms-ilb-nb-sm-var/kernel/sched.c	2003-01-14 16:33:53.000000000 +0100
@@ -67,8 +67,9 @@
 #define INTERACTIVE_DELTA	2
 #define MAX_SLEEP_AVG		(2*HZ)
 #define STARVATION_LIMIT	(2*HZ)
-#define NODE_BALANCE_RATIO	10
 #define NODE_THRESHOLD          125
+#define NODE_BALANCE_MIN	10
+#define NODE_BALANCE_MAX	40
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -158,6 +159,7 @@ struct runqueue {
 #ifdef CONFIG_NUMA
 	atomic_t * node_ptr;
 	unsigned int nr_balanced;
+	int prev_node_load[MAX_NUMNODES];
 #endif
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -632,6 +634,8 @@ static inline void double_rq_unlock(runq
 #if CONFIG_NUMA
 static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
 	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+static int internode_lb[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = NODE_BALANCE_MAX};
 
 __init void node_nr_running_init(void)
 {
@@ -707,25 +711,54 @@ void sched_balance_exec(void)
 	}
 }
 
+/*
+ * Find the busiest node. All previous node loads contribute with a 
+ * geometrically decaying weight to the load measure:
+ *      load_{t} = load_{t-1}/2 + nr_node_running_{t}
+ * This way sudden load peaks are flattened out a bit.
+ */
 static int find_busiest_node(int this_node)
 {
 	int i, node = -1, load, this_load, maxload;
+	int avg_load;
 	
-	this_load = maxload = atomic_read(&node_nr_running[this_node]);
+	this_load = maxload = (this_rq()->prev_node_load[this_node] >> 1)
+		+ atomic_read(&node_nr_running[this_node]);
+	this_rq()->prev_node_load[this_node] = this_load;
+	avg_load = this_load;
 	for (i = 0; i < numnodes; i++) {
 		if (i == this_node)
 			continue;
-		load = atomic_read(&node_nr_running[i]);
+		load = (this_rq()->prev_node_load[i] >> 1)
+			+ atomic_read(&node_nr_running[i]);
+		avg_load += load;
+		this_rq()->prev_node_load[i] = load;
 		if (load > maxload &&
 		    (100*load > ((NODE_THRESHOLD*100*this_load)/100))) {
 			maxload = load;
 			node = i;
 		}
 	}
-	if (maxload <= 4)
+	avg_load = avg_load / (numnodes ? numnodes : 1);
+	if (this_load < (avg_load / 2)) {
+		if (internode_lb[this_node] == MAX_INTERNODE_LB)
+			internode_lb[this_node] = MIN_INTERNODE_LB;
+	} else
+		if (internode_lb[this_node] == MIN_INTERNODE_LB)
+			internode_lb[this_node] = MAX_INTERNODE_LB;
+	if (maxload <= 4+2+1 || this_load >= avg_load)
 		node = -1;
 	return node;
 }
+
+static inline int remote_steal_factor(runqueue_t *rq)
+{
+	int cpu = __cpu_to_node(task_cpu(rq->curr));
+
+	return (cpu == __cpu_to_node(smp_processor_id())) ? 1 : 2;
+}
+#else
+#define remote_steal_factor(rq) 1
 #endif /* CONFIG_NUMA */
 
 #if CONFIG_SMP
@@ -938,7 +971,7 @@ skip_queue:
 		goto skip_bitmap;
 	}
 	pull_task(busiest, array, tmp, this_rq, this_cpu);
-	if (!idle && --imbalance) {
+	if (!idle && ((--imbalance)/remote_steal_factor(busiest))) {
 		if (curr != head)
 			goto skip_queue;
 		idx++;

[-- Attachment #4: numa-sched-patches.tgz --]
[-- Type: application/x-tgz, Size: 3870 bytes --]


* Re: [Lse-tech] [PATCH 2.5.58] new NUMA scheduler
  2003-01-14 15:55                 ` [PATCH 2.5.58] new NUMA scheduler Erich Focht
@ 2003-01-14 16:07                   ` Christoph Hellwig
  2003-01-14 16:23                   ` [PATCH 2.5.58] new NUMA scheduler: fix Erich Focht
  1 sibling, 0 replies; 96+ messages in thread
From: Christoph Hellwig @ 2003-01-14 16:07 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Andrew Theurer, Michael Hohnbaum, Robert Love,
	Ingo Molnar, linux-kernel, lse-tech

On Tue, Jan 14, 2003 at 04:55:06PM +0100, Erich Focht wrote:
> Here's the new version of the NUMA scheduler built on top of the
> miniature scheduler of Martin. I incorporated Michael's ideas and
> Christoph's suggestions and rediffed for 2.5.58. 

This one looks a lot nicer.  You might also want to take the different
nr_running stuff from the patch I posted; I think it reads a lot nicer.

The patch (not updated yet) is below again for reference.



--- 1.62/fs/exec.c	Fri Jan 10 08:21:00 2003
+++ edited/fs/exec.c	Mon Jan 13 15:33:32 2003
@@ -1031,6 +1031,8 @@
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
--- 1.119/include/linux/sched.h	Sat Jan 11 07:44:15 2003
+++ edited/include/linux/sched.h	Mon Jan 13 15:58:11 2003
@@ -444,6 +444,14 @@
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#else
+# define sched_balance_exec()	do { } while (0)
+# define node_nr_running_init()	do { } while (0)
+#endif
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
--- 1.91/init/main.c	Mon Jan  6 04:08:49 2003
+++ edited/init/main.c	Mon Jan 13 15:33:33 2003
@@ -495,6 +495,7 @@
 
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
 
--- 1.148/kernel/sched.c	Sat Jan 11 07:44:22 2003
+++ edited/kernel/sched.c	Mon Jan 13 16:17:34 2003
@@ -67,6 +67,7 @@
 #define INTERACTIVE_DELTA	2
 #define MAX_SLEEP_AVG		(2*HZ)
 #define STARVATION_LIMIT	(2*HZ)
+#define NODE_BALANCE_RATIO	10
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -154,6 +155,11 @@
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
 
+#ifdef CONFIG_NUMA
+	atomic_t *node_nr_running;
+	int nr_balanced;
+#endif
+
 	task_t *migration_thread;
 	struct list_head migration_queue;
 
@@ -178,6 +184,38 @@
 #endif
 
 /*
+ * Keep track of running tasks.
+ */
+#if CONFIG_NUMA
+
+/* XXX(hch): this should go into a struct sched_node_data */
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+static inline void nr_running_init(struct runqueue *rq)
+{
+	rq->node_nr_running = &node_nr_running[0];
+}
+
+static inline void nr_running_inc(struct runqueue *rq)
+{
+	atomic_inc(rq->node_nr_running);
+	rq->nr_running++;
+}
+
+static inline void nr_running_dec(struct runqueue *rq)
+{
+	atomic_dec(rq->node_nr_running);
+	rq->nr_running--;
+}
+
+#else
+# define nr_running_init(rq)	do { } while (0)
+# define nr_running_inc(rq)	do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq)	do { (rq)->nr_running--; } while (0)
+#endif /* CONFIG_NUMA */
+
+/*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
  * explicitly disabling preemption.
@@ -294,7 +332,7 @@
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -302,7 +340,7 @@
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -624,9 +662,108 @@
 		spin_unlock(&rq2->lock);
 }
 
-#if CONFIG_SMP
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ *
+ * Note:  This isn't actually numa-specific, but just not used otherwise.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask = p->cpus_allowed;
+
+	if (old_mask & (1UL << dest_cpu)) {
+		unsigned long flags;
+		struct runqueue *rq;
+
+		/* force the process onto the specified CPU */
+		set_cpus_allowed(p, 1UL << dest_cpu);
+
+		/* restore the cpus allowed mask */
+		rq = task_rq_lock(p, &flags);
+		p->cpus_allowed = old_mask;
+		task_rq_unlock(rq, &flags);
+	}
+}
 
 /*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+	unsigned long cpumask;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+
+	minload = 10000000;
+	cpumask = __node_to_cpu_mask(node);
+	for (i = 0; i < NR_CPUS; ++i) {
+		if (!(cpumask & (1UL << i)))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+
+static int find_busiest_node(int this_node)
+{
+	int i, node = this_node, load, this_load, maxload;       
+
+	this_load = maxload = atomic_read(&node_nr_running[this_node]);
+	for (i = 0; i < numnodes; i++) {
+		if (i == this_node)
+			continue;
+		load = atomic_read(&node_nr_running[i]);
+		if (load > maxload && (4*load > ((5*4*this_load)/4))) {
+			maxload = load;
+			node = i;
+		}
+	}
+
+	return node;
+}
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++)
+		cpu_rq(i)->node_nr_running = node_nr_running + __cpu_to_node(i);
+}
+#endif /* CONFIG_NUMA */
+
+#if CONFIG_SMP
+/*
  * double_lock_balance - lock the busiest runqueue
  *
  * this_rq is locked already. Recalculate nr_running if we have to
@@ -652,9 +789,10 @@
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_queue - find the busiest runqueue among the cpus in cpumask
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu,
+		int idle, int *imbalance, unsigned long cpumask)
 {
 	int nr_running, load, max_load, i;
 	runqueue_t *busiest, *rq_src;
@@ -689,7 +827,7 @@
 	busiest = NULL;
 	max_load = 1;
 	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
+		if (!cpu_online(i) || !((1UL << i) & cpumask))
 			continue;
 
 		rq_src = cpu_rq(i);
@@ -736,9 +874,9 @@
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -758,13 +896,27 @@
  */
 static void load_balance(runqueue_t *this_rq, int idle)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
+	int imbalance, idx, this_cpu, this_node;
+	unsigned long cpumask;
 	runqueue_t *busiest;
 	prio_array_t *array;
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+	this_cpu = smp_processor_id();
+	this_node = __cpu_to_node(this_cpu);
+	cpumask = __node_to_cpu_mask(this_node);
+
+#if CONFIG_NUMA
+	/*
+	 * Avoid rebalancing between nodes too often.
+	 */
+	if (!(++this_rq->nr_balanced % NODE_BALANCE_RATIO))
+		cpumask |= __node_to_cpu_mask(find_busiest_node(this_node));
+#endif
+
+	busiest = find_busiest_queue(this_rq, this_cpu, idle,
+			&imbalance, cpumask);
 	if (!busiest)
 		goto out;
 
@@ -2231,6 +2383,7 @@
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+		nr_running_init(rq);
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;


* [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-14 15:55                 ` [PATCH 2.5.58] new NUMA scheduler Erich Focht
  2003-01-14 16:07                   ` [Lse-tech] " Christoph Hellwig
@ 2003-01-14 16:23                   ` Erich Focht
  2003-01-14 16:43                     ` Erich Focht
                                       ` (3 more replies)
  1 sibling, 4 replies; 96+ messages in thread
From: Erich Focht @ 2003-01-14 16:23 UTC (permalink / raw)
  To: Martin J. Bligh, Andrew Theurer, Michael Hohnbaum, Christoph Hellwig
  Cc: Robert Love, Ingo Molnar, linux-kernel, lse-tech

[-- Attachment #1: Type: text/plain, Size: 2459 bytes --]

In the previous email the patch 02-initial-lb-2.5.58.patch had a bug,
which was also present in numa-sched-2.5.58.patch and
numa-sched-add-2.5.58.patch. Please use the patches attached to
this email! Sorry for the silly mistake...

Christoph, I used your way of coding nr_running_inc/dec now.

Regards,
Erich


On Tuesday 14 January 2003 16:55, Erich Focht wrote:
> Here's the new version of the NUMA scheduler built on top of the
> miniature scheduler of Martin. I incorporated Michael's ideas and
> Christoph's suggestions and rediffed for 2.5.58.
>
> The whole patch is really tiny: 9.5k. This time I attached the numa
> scheduler in form of two patches:
>
> numa-sched-2.5.58.patch     (7k) : components 01, 02, 03
> numa-sched-add-2.5.58.patch (3k) : components 04, 05
>
> The single components are also attached in a small tgz archive:
>
> 01-minisched-2.5.58.patch : the miniature scheduler from
> Martin. Balances strictly within a node. Removed the
> find_busiest_in_mask() function.
>
> 02-initial-lb-2.5.58.patch : Michael's initial load balancer at
> exec(). Cosmetic corrections sugegsted by Christoph.
>
> 03-internode-lb-2.5.58.patch : internode load balancer core. Called
> after NODE_BALANCE_RATE calls of the intra-node load balancer. Tunable
> parameters:
>   NODE_BALANCE_RATE  (default: 10)
>   NODE_THRESHOLD     (default: 125) : consider only nodes with load
>                      above NODE_THRESHOLD/100 * own_node_load
>   I added the constant factor of 4 suggested by Michael, but I'm not
>   really happy with it. This should be nr_cpus_in_node, but we don't
>   have that info in topology.h
>
> 04-smooth-node-load-2.5.58.patch : The node load measure is smoothed
> by adding half of the previous node load (and 1/4 of the one before,
> etc..., as discussed in the LSE call). This should improve the
> behavior a bit for short-lived load peaks and avoid bouncing tasks
> between nodes (see the short sketch after this quoted list).
>
> 05-var-intnode-lb-2.5.58.patch : Replaces the fixed NODE_BALANCE_RATE
> interval (between cross-node balancer calls) by a variable
> node-specific interval. Currently only two values used:
>    NODE_BALANCE_MIN : 10
>    NODE_BALANCE_MAX : 40
> If the node load is less than avg_node_load/2, we switch to
> NODE_BALANCE_MIN, otherwise we use the large interval.
> I also added a function to reduce the number of tasks stolen from
> remote nodes.
>
> Regards,
> Erich
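
For readers skimming the thread: the smoothing in patch 04 reduces to the
one-line recurrence load = load/2 + nr_node_running. A minimal user-space
sketch (not part of the posted patches) of how a one-tick spike is damped:

#include <stdio.h>

int main(void)
{
	int nr_running[] = { 4, 4, 20, 4, 4, 4 };	/* a one-tick load spike */
	int load = 0, t;

	for (t = 0; t < 6; t++) {
		/* patch 04's recurrence: load_t = load_{t-1}/2 + nr_node_running_t */
		load = (load >> 1) + nr_running[t];
		printf("t=%d raw=%2d smoothed=%2d\n", t, nr_running[t], load);
	}
	return 0;
}

The geometric series sums to 2, so the smoothed measure settles near twice
the instantaneous value; the spike at t=2 decays back toward that steady
state within a few balancing intervals, which is the flattening the patch
description refers to.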

[-- Attachment #2: numa-sched-2.5.58.patch --]
[-- Type: text/x-diff, Size: 8158 bytes --]

diff -urNp 2.5.58/fs/exec.c 2.5.58-ms-ilb-nb/fs/exec.c
--- 2.5.58/fs/exec.c	2003-01-14 06:58:33.000000000 +0100
+++ 2.5.58-ms-ilb-nb/fs/exec.c	2003-01-14 17:02:08.000000000 +0100
@@ -1031,6 +1031,8 @@ int do_execve(char * filename, char ** a
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -urNp 2.5.58/include/linux/sched.h 2.5.58-ms-ilb-nb/include/linux/sched.h
--- 2.5.58/include/linux/sched.h	2003-01-14 06:58:06.000000000 +0100
+++ 2.5.58-ms-ilb-nb/include/linux/sched.h	2003-01-14 17:06:44.000000000 +0100
@@ -160,7 +160,6 @@ extern void update_one_process(struct ta
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
 
-
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
 asmlinkage void schedule(void);
@@ -444,6 +443,13 @@ extern void set_cpus_allowed(task_t *p, 
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#else
+#define sched_balance_exec() {}
+#endif
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -urNp 2.5.58/init/main.c 2.5.58-ms-ilb-nb/init/main.c
--- 2.5.58/init/main.c	2003-01-14 06:58:20.000000000 +0100
+++ 2.5.58-ms-ilb-nb/init/main.c	2003-01-14 17:02:08.000000000 +0100
@@ -495,6 +495,7 @@ static void do_pre_smp_initcalls(void)
 
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
 
diff -urNp 2.5.58/kernel/sched.c 2.5.58-ms-ilb-nb/kernel/sched.c
--- 2.5.58/kernel/sched.c	2003-01-14 06:59:11.000000000 +0100
+++ 2.5.58-ms-ilb-nb/kernel/sched.c	2003-01-14 17:10:37.000000000 +0100
@@ -67,6 +67,8 @@
 #define INTERACTIVE_DELTA	2
 #define MAX_SLEEP_AVG		(2*HZ)
 #define STARVATION_LIMIT	(2*HZ)
+#define NODE_BALANCE_RATIO	10
+#define NODE_THRESHOLD          125
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -153,7 +155,10 @@ struct runqueue {
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
-
+#ifdef CONFIG_NUMA
+	atomic_t * node_ptr;
+	unsigned int nr_balanced;
+#endif
 	task_t *migration_thread;
 	struct list_head migration_queue;
 
@@ -177,6 +182,35 @@ static struct runqueue runqueues[NR_CPUS
 # define task_running(rq, p)		((rq)->curr == (p))
 #endif
 
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+static inline void nr_running_inc(runqueue_t *rq)
+{
+	atomic_inc(rq->node_ptr);
+	rq->nr_running++;
+}
+
+static inline void nr_running_dec(runqueue_t *rq)
+{
+	atomic_dec(rq->node_ptr);
+	rq->nr_running--;
+}
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++)
+		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+}
+#else
+# define nr_running_init(rq)   do { } while (0)
+# define nr_running_inc(rq)    do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq)    do { (rq)->nr_running--; } while (0)
+#endif /* CONFIG_NUMA */
+
 /*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
@@ -294,7 +328,7 @@ static inline void activate_task(task_t 
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -302,7 +336,7 @@ static inline void activate_task(task_t 
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -624,6 +658,94 @@ static inline void double_rq_unlock(runq
 		spin_unlock(&rq2->lock);
 }
 
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+	unsigned long cpumask;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+
+	minload = 10000000;
+	cpumask = __node_to_cpu_mask(node);
+	for (i = 0; i < NR_CPUS; ++i) {
+		if (!(cpumask & (1UL << i)))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+
+static int find_busiest_node(int this_node)
+{
+	int i, node = -1, load, this_load, maxload;
+	
+	this_load = maxload = atomic_read(&node_nr_running[this_node]);
+	for (i = 0; i < numnodes; i++) {
+		if (i == this_node)
+			continue;
+		load = atomic_read(&node_nr_running[i]);
+		if (load > maxload &&
+		    (100*load > ((NODE_THRESHOLD*100*this_load)/100))) {
+			maxload = load;
+			node = i;
+		}
+	}
+	if (maxload <= 4)
+		node = -1;
+	return node;
+}
+#endif /* CONFIG_NUMA */
+
 #if CONFIG_SMP
 
 /*
@@ -652,9 +774,9 @@ static inline unsigned int double_lock_b
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_queue - find the busiest runqueue among the cpus in cpumask
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)
 {
 	int nr_running, load, max_load, i;
 	runqueue_t *busiest, *rq_src;
@@ -689,7 +811,7 @@ static inline runqueue_t *find_busiest_q
 	busiest = NULL;
 	max_load = 1;
 	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
+		if (!cpu_online(i) || !((1UL << i) & cpumask) )
 			continue;
 
 		rq_src = cpu_rq(i);
@@ -736,9 +858,9 @@ out:
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -763,8 +885,21 @@ static void load_balance(runqueue_t *thi
 	prio_array_t *array;
 	struct list_head *head, *curr;
 	task_t *tmp;
+	int this_node = __cpu_to_node(this_cpu);
+	unsigned long cpumask = __node_to_cpu_mask(this_node);
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+#if CONFIG_NUMA
+	/*
+	 * Avoid rebalancing between nodes too often.
+	 */
+	if (!(++(this_rq->nr_balanced) % NODE_BALANCE_RATIO)) {
+		int node = find_busiest_node(this_node);
+		if (node >= 0)
+			cpumask = __node_to_cpu_mask(node) | (1UL << this_cpu);
+	}
+#endif
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
+				    cpumask);
 	if (!busiest)
 		goto out;
 
@@ -2231,6 +2366,10 @@ void __init sched_init(void)
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+		rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;

[-- Attachment #3: numa-sched-add-2.5.58.patch --]
[-- Type: text/x-diff, Size: 3115 bytes --]

diff -urNp 2.5.58-ms-ilb-nb/kernel/sched.c 2.5.58-ms-ilb-nb-sm-var/kernel/sched.c
--- 2.5.58-ms-ilb-nb/kernel/sched.c	2003-01-14 17:10:37.000000000 +0100
+++ 2.5.58-ms-ilb-nb-sm-var/kernel/sched.c	2003-01-14 17:12:35.000000000 +0100
@@ -67,8 +67,9 @@
 #define INTERACTIVE_DELTA	2
 #define MAX_SLEEP_AVG		(2*HZ)
 #define STARVATION_LIMIT	(2*HZ)
-#define NODE_BALANCE_RATIO	10
 #define NODE_THRESHOLD          125
+#define NODE_BALANCE_MIN	10
+#define NODE_BALANCE_MAX	40
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -158,6 +159,7 @@ struct runqueue {
 #ifdef CONFIG_NUMA
 	atomic_t * node_ptr;
 	unsigned int nr_balanced;
+	int prev_node_load[MAX_NUMNODES];
 #endif
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -185,6 +187,8 @@ static struct runqueue runqueues[NR_CPUS
 #if CONFIG_NUMA
 static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
 	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+static int internode_lb[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = NODE_BALANCE_MAX};
 
 static inline void nr_running_inc(runqueue_t *rq)
 {
@@ -725,25 +729,54 @@ void sched_balance_exec(void)
 	}
 }
 
+/*
+ * Find the busiest node. All previous node loads contribute with a 
+ * geometrically decaying weight to the load measure:
+ *      load_{t} = load_{t-1}/2 + nr_node_running_{t}
+ * This way sudden load peaks are flattened out a bit.
+ */
 static int find_busiest_node(int this_node)
 {
 	int i, node = -1, load, this_load, maxload;
+	int avg_load;
 	
-	this_load = maxload = atomic_read(&node_nr_running[this_node]);
+	this_load = maxload = (this_rq()->prev_node_load[this_node] >> 1)
+		+ atomic_read(&node_nr_running[this_node]);
+	this_rq()->prev_node_load[this_node] = this_load;
+	avg_load = this_load;
 	for (i = 0; i < numnodes; i++) {
 		if (i == this_node)
 			continue;
-		load = atomic_read(&node_nr_running[i]);
+		load = (this_rq()->prev_node_load[i] >> 1)
+			+ atomic_read(&node_nr_running[i]);
+		avg_load += load;
+		this_rq()->prev_node_load[i] = load;
 		if (load > maxload &&
 		    (100*load > ((NODE_THRESHOLD*100*this_load)/100))) {
 			maxload = load;
 			node = i;
 		}
 	}
-	if (maxload <= 4)
+	avg_load = avg_load / (numnodes ? numnodes : 1);
+	if (this_load < (avg_load / 2)) {
+		if (internode_lb[this_node] == MAX_INTERNODE_LB)
+			internode_lb[this_node] = MIN_INTERNODE_LB;
+	} else
+		if (internode_lb[this_node] == MIN_INTERNODE_LB)
+			internode_lb[this_node] = MAX_INTERNODE_LB;
+	if (maxload <= 4+2+1 || this_load >= avg_load)
 		node = -1;
 	return node;
 }
+
+static inline int remote_steal_factor(runqueue_t *rq)
+{
+	int cpu = __cpu_to_node(task_cpu(rq->curr));
+
+	return (cpu == __cpu_to_node(smp_processor_id())) ? 1 : 2;
+}
+#else
+#define remote_steal_factor(rq) 1
 #endif /* CONFIG_NUMA */
 
 #if CONFIG_SMP
@@ -956,7 +989,7 @@ skip_queue:
 		goto skip_bitmap;
 	}
 	pull_task(busiest, array, tmp, this_rq, this_cpu);
-	if (!idle && --imbalance) {
+	if (!idle && ((--imbalance)/remote_steal_factor(busiest))) {
 		if (curr != head)
 			goto skip_queue;
 		idx++;

[-- Attachment #4: numa-sched-patches.tgz --]
[-- Type: application/x-tgz, Size: 4020 bytes --]


* [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-14 16:23                   ` [PATCH 2.5.58] new NUMA scheduler: fix Erich Focht
@ 2003-01-14 16:43                     ` Erich Focht
  2003-01-14 19:02                       ` Michael Hohnbaum
  2003-01-14 16:51                     ` Christoph Hellwig
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 96+ messages in thread
From: Erich Focht @ 2003-01-14 16:43 UTC (permalink / raw)
  To: Martin J. Bligh, Andrew Theurer, Michael Hohnbaum, Christoph Hellwig
  Cc: Robert Love, Ingo Molnar, linux-kernel, lse-tech

[-- Attachment #1: Type: text/plain, Size: 581 bytes --]

Aargh, I should have gone home earlier...
For those who really care about patch 05, it's attached. It's all
untested as I don't have an ia32 NUMA machine running 2.5.58...

Erich


On Tuesday 14 January 2003 17:23, Erich Focht wrote:
> In the previous email the patch 02-initial-lb-2.5.58.patch had a bug
> and this was present in the numa-sched-2.5.58.patch and
> numa-sched-add-2.5.58.patch, too. Please use the patches attached to
> this email! Sorry for the silly mistake...
>
> Christoph, I used your way of coding nr_running_inc/dec now.
>
> Regards,
> Erich

[-- Attachment #2: 05-var-intnode-lb-2.5.58.patch --]
[-- Type: text/x-diff, Size: 2949 bytes --]

diff -urNp 2.5.58-ms-ilb-nb-sm/kernel/sched.c 2.5.58-ms-ilb-nb-sm-var/kernel/sched.c
--- 2.5.58-ms-ilb-nb-sm/kernel/sched.c	2003-01-14 17:11:48.000000000 +0100
+++ 2.5.58-ms-ilb-nb-sm-var/kernel/sched.c	2003-01-14 17:36:26.000000000 +0100
@@ -67,8 +67,9 @@
 #define INTERACTIVE_DELTA	2
 #define MAX_SLEEP_AVG		(2*HZ)
 #define STARVATION_LIMIT	(2*HZ)
-#define NODE_BALANCE_RATIO	10
 #define NODE_THRESHOLD          125
+#define NODE_BALANCE_MIN	10
+#define NODE_BALANCE_MAX	40
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -186,6 +187,8 @@ static struct runqueue runqueues[NR_CPUS
 #if CONFIG_NUMA
 static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
 	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+static int internode_lb[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = NODE_BALANCE_MAX};
 
 static inline void nr_running_inc(runqueue_t *rq)
 {
@@ -735,15 +738,18 @@ void sched_balance_exec(void)
 static int find_busiest_node(int this_node)
 {
 	int i, node = -1, load, this_load, maxload;
+	int avg_load;
 	
 	this_load = maxload = (this_rq()->prev_node_load[this_node] >> 1)
 		+ atomic_read(&node_nr_running[this_node]);
 	this_rq()->prev_node_load[this_node] = this_load;
+	avg_load = this_load;
 	for (i = 0; i < numnodes; i++) {
 		if (i == this_node)
 			continue;
 		load = (this_rq()->prev_node_load[i] >> 1)
 			+ atomic_read(&node_nr_running[i]);
+		avg_load += load;
 		this_rq()->prev_node_load[i] = load;
 		if (load > maxload &&
 		    (100*load > ((NODE_THRESHOLD*100*this_load)/100))) {
@@ -751,10 +757,26 @@ static int find_busiest_node(int this_no
 			node = i;
 		}
 	}
-	if (maxload <= 4+2+1)
+	avg_load = avg_load / (numnodes ? numnodes : 1);
+	if (this_load < (avg_load / 2)) {
+		if (internode_lb[this_node] == NODE_BALANCE_MAX)
+			internode_lb[this_node] = NODE_BALANCE_MIN;
+	} else
+		if (internode_lb[this_node] == NODE_BALANCE_MIN)
+			internode_lb[this_node] = NODE_BALANCE_MAX;
+	if (maxload <= 4+2+1 || this_load >= avg_load)
 		node = -1;
 	return node;
 }
+
+static inline int remote_steal_factor(runqueue_t *rq)
+{
+	int cpu = __cpu_to_node(task_cpu(rq->curr));
+
+	return (cpu == __cpu_to_node(smp_processor_id())) ? 1 : 2;
+}
+#else
+#define remote_steal_factor(rq) 1
 #endif /* CONFIG_NUMA */
 
 #if CONFIG_SMP
@@ -903,7 +925,7 @@ static void load_balance(runqueue_t *thi
 	/*
 	 * Avoid rebalancing between nodes too often.
 	 */
-	if (!(++(this_rq->nr_balanced) % NODE_BALANCE_RATIO)) {
+	if (!(++(this_rq->nr_balanced) % internode_lb[this_node])) {
 		int node = find_busiest_node(this_node);
 		if (node >= 0)
 			cpumask = __node_to_cpu_mask(node) | (1UL << this_cpu);
@@ -967,7 +989,7 @@ skip_queue:
 		goto skip_bitmap;
 	}
 	pull_task(busiest, array, tmp, this_rq, this_cpu);
-	if (!idle && --imbalance) {
+	if (!idle && ((--imbalance)/remote_steal_factor(busiest))) {
 		if (curr != head)
 			goto skip_queue;
 		idx++;


* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-14 16:23                   ` [PATCH 2.5.58] new NUMA scheduler: fix Erich Focht
  2003-01-14 16:43                     ` Erich Focht
@ 2003-01-14 16:51                     ` Christoph Hellwig
  2003-01-15  0:05                     ` Michael Hohnbaum
  2003-01-15  7:47                     ` Martin J. Bligh
  3 siblings, 0 replies; 96+ messages in thread
From: Christoph Hellwig @ 2003-01-14 16:51 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Andrew Theurer, Michael Hohnbaum, Robert Love,
	Ingo Molnar, linux-kernel, lse-tech


+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#else
+#define sched_balance_exec() {}
+#endif

You accidentally (?) removed the stub for node_nr_running_init.
Also sched.h used # define inside ifdefs.
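
Put together, the two points above amount to roughly the following shape for
the sched.h hunk (a sketch only, matching the stubs the very first posting had):

#ifdef CONFIG_NUMA
extern void sched_balance_exec(void);
extern void node_nr_running_init(void);
#else
# define sched_balance_exec()		do { } while (0)
# define node_nr_running_init()		do { } while (0)
#endif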

+#ifdef CONFIG_NUMA
+	atomic_t * node_ptr;

The name is still a bit non-descriptive and the * placed wrong :)
What about atomic_t *nr_running_at_node?


 
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+

I think my two comments here were pretty useful :)

+static inline void nr_running_dec(runqueue_t *rq)
+{
+	atomic_dec(rq->node_ptr);
+	rq->nr_running--;
+}
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++)
+		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+}
+#else
+# define nr_running_init(rq)   do { } while (0)
+# define nr_running_inc(rq)    do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq)    do { (rq)->nr_running--; } while (0)
+#endif /* CONFIG_NUMA */
+
 /*
@@ -689,7 +811,7 @@ static inline runqueue_t *find_busiest_q
 	busiest = NULL;
 	max_load = 1;
 	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
+		if (!cpu_online(i) || !((1UL << i) & cpumask) )

spurious whitespace before the closing brace.

 	prio_array_t *array;
 	struct list_head *head, *curr;
 	task_t *tmp;
+	int this_node = __cpu_to_node(this_cpu);
+	unsigned long cpumask = __node_to_cpu_mask(this_node);

If that's not too much style nitpicking: put this_node on one line with all the
other local ints and initialize all three vars after the declarations (like
in my patch *duck*)
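
In other words, roughly this layout at the top of load_balance() (a sketch of
the suggestion; the later repost ends up looking like this):

	int imbalance, idx, this_cpu, this_node;
	unsigned long cpumask;
	runqueue_t *busiest;
	prio_array_t *array;
	struct list_head *head, *curr;
	task_t *tmp;

	this_cpu = smp_processor_id();
	this_node = __cpu_to_node(this_cpu);
	cpumask = __node_to_cpu_mask(this_node);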

 
+#if CONFIG_NUMA
+		rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */

I had a nr_running_init() abstraction for this, but you only took it
partially. It would be nice to merge the last bit to get rid of this ifdef.
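
A sketch of the merge being asked for: keep the nr_running_init() helper in the
CONFIG_NUMA block as well, so sched_init() can call it unconditionally instead
of open-coding the assignment under #if CONFIG_NUMA:

#if CONFIG_NUMA
static inline void nr_running_init(struct runqueue *rq)
{
	/* boot-time default; node_nr_running_init() repoints it per node */
	rq->node_ptr = &node_nr_running[0];
}
#else
# define nr_running_init(rq)	do { } while (0)
#endif

/* and in sched_init(), with no #if CONFIG_NUMA needed: */
		spin_lock_init(&rq->lock);
		INIT_LIST_HEAD(&rq->migration_queue);
		atomic_set(&rq->nr_iowait, 0);
		nr_running_init(rq);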

Otherwise the patch looks really, really good, and I'm looking forward to seeing
it in mainline real soon!


* Re: [Lse-tech] Re: NUMA scheduler 2nd approach
  2003-01-14  5:50             ` [Lse-tech] Re: NUMA scheduler 2nd approach Michael Hohnbaum
@ 2003-01-14 16:52               ` Andrew Theurer
  2003-01-14 15:13                 ` Erich Focht
  0 siblings, 1 reply; 96+ messages in thread
From: Andrew Theurer @ 2003-01-14 16:52 UTC (permalink / raw)
  To: Michael Hohnbaum
  Cc: Erich Focht, Martin J. Bligh, Robert Love, Ingo Molnar,
	linux-kernel, lse-tech

> I suppose I should not have been so dang lazy and cut-n-pasted
> the line I changed.  The change was (((5*4*this_load)/4) + 4)
> which should be the same as your second choice.
> >
> > We def need some constant to avoid low load ping pong, right?
>
> Yep.  Without the constant, one could have 6 processes on node
> A and 4 on node B, and node B would end up stealing.  While making
> a perfect balance, the expense of the off-node traffic does not
> justify it.  At least on the NUMAQ box.  It might be justified
> for a different NUMA architecture, which is why I propose putting
> this check in a macro that can be defined in topology.h for each
> architecture.

Yes, I was also concerned about one task on one node and none on the others.
Without some sort of constant we will ping-pong the task between nodes
endlessly, since no % threshold can make any difference when the
original load value is 0.  Your +4 gets rid of the 1-task case.
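
For the record, a tiny stand-alone illustration (not kernel code) of the check
quoted above, 4*load > (5*4*this_load)/4 + 4, i.e. the remote node has to beat
125% of the local load plus a small constant:

#include <stdio.h>

/* steal iff 4*load > (5*4*this_load)/4 (+ 4), the check Michael quoted */
static int would_steal(int this_load, int load, int add_constant)
{
	int threshold = (5 * 4 * this_load) / 4;	/* 125% of this_load, in 4x units */

	if (add_constant)
		threshold += 4;
	return 4 * load > threshold;
}

int main(void)
{
	/* 6 tasks on node A, 4 on node B, evaluated from node B */
	printf("6 vs 4: without constant %d, with +4 %d\n",
	       would_steal(4, 6, 0), would_steal(4, 6, 1));
	/* a single task in the system: a pure percentage test always fires */
	printf("1 vs 0: without constant %d, with +4 %d\n",
	       would_steal(0, 1, 0), would_steal(0, 1, 1));
	return 0;
}

With the constant, both the 6-vs-4 case and the lone-task case stay put;
without it, both would trigger a cross-node steal.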

-Andrew



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-14 16:43                     ` Erich Focht
@ 2003-01-14 19:02                       ` Michael Hohnbaum
  2003-01-14 21:56                         ` [Lse-tech] " Michael Hohnbaum
  2003-01-15 15:10                         ` Erich Focht
  0 siblings, 2 replies; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-14 19:02 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Andrew Theurer, Christoph Hellwig, Robert Love,
	Ingo Molnar, linux-kernel, lse-tech

On Tue, 2003-01-14 at 08:43, Erich Focht wrote:
> Aargh, I should have gone home earlier...
> For those who really care about patch 05, it's attached. It's all
> untested as I don't have a ia32 NUMA machine running 2.5.58...

One more minor problem - the first two patches are missing the
following defines, which results in compile issues:

#define MAX_INTERNODE_LB 40
#define MIN_INTERNODE_LB 4
#define NODE_BALANCE_RATIO 10

Looking through previous patches, and the 05 patch, I found
these defines and put them under the #if CONFIG_NUMA in sched.c
that defines node_nr_running and friends.
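
Roughly this placement, for anyone reproducing the build fix (a sketch of what
is described above, next to the node_nr_running declaration in kernel/sched.c):

#if CONFIG_NUMA
/* defines the combined patches expect (taken from the 05 patch) */
#define MAX_INTERNODE_LB	40
#define MIN_INTERNODE_LB	4
#define NODE_BALANCE_RATIO	10

static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
/* rest of the existing CONFIG_NUMA block follows unchanged */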

With these three lines added, I have a kernel built and booted
using the first numa-sched and numa-sched-add patches.

Test results will follow later in the day.

             Michael


> 
> Erich
> 
> 
> On Tuesday 14 January 2003 17:23, Erich Focht wrote:
> > In the previous email the patch 02-initial-lb-2.5.58.patch had a bug
> > and this was present in the numa-sched-2.5.58.patch and
> > numa-sched-add-2.5.58.patch, too. Please use the patches attached to
> > this email! Sorry for the silly mistake...
> >
> > Christoph, I used your way of coding nr_running_inc/dec now.
> >
> > Regards,
> > Erich

-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486



* Re: [Lse-tech] Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-14 19:02                       ` Michael Hohnbaum
@ 2003-01-14 21:56                         ` Michael Hohnbaum
  2003-01-15 15:10                         ` Erich Focht
  1 sibling, 0 replies; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-14 21:56 UTC (permalink / raw)
  To: Michael Hohnbaum
  Cc: Erich Focht, Martin J. Bligh, Andrew Theurer, Christoph Hellwig,
	Robert Love, Ingo Molnar, linux-kernel, lse-tech

On Tue, 2003-01-14 at 11:02, Michael Hohnbaum wrote:
> On Tue, 2003-01-14 at 08:43, Erich Focht wrote:
> > Aargh, I should have gone home earlier...
> > For those who really care about patch 05, it's attached. It's all
> > untested as I don't have a ia32 NUMA machine running 2.5.58...
> 
> One more minor problem - the first two patches are missing the
> following defines, and result in compile issues:
> 
> #define MAX_INTERNODE_LB 40
> #define MIN_INTERNODE_LB 4
> #define NODE_BALANCE_RATIO 10
> 
> Looking through previous patches, and the 05 patch, I found
> these defines and put them under the #if CONFIG_NUMA in sched.c
> that defines node_nr_running and friends.
> 
> With these three lines added, I have a kernel built and booted
> using the first numa-sched and numa-sched-add patches.
> 
> Test results will follow later in the day.

Trying to apply the 05 patch, I discovered that it was already
in there.  Something is messed up with the combined patches, so
I went back to the tgz file you provided and started over.  I'm
not sure what the kernel is that I built and tested earlier today,
but I suspect it was, for the most part, the complete patchset
(i.e., patches 1-5).  Building a kernel with patches 1-4 from
the tgz file does not need the additional defines mentioned in
my previous email.

Testing is starting from scratch with a known patch base.  The
plan is to test with patches 1-4, then add in 05.  I should have
some numbers for you before the end of my day.  btw, the numbers
looked real good for the runs on whatever kernel it was that I
built this morning.
> 
>              Michael
> 
> > Erich
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-14 16:23                   ` [PATCH 2.5.58] new NUMA scheduler: fix Erich Focht
  2003-01-14 16:43                     ` Erich Focht
  2003-01-14 16:51                     ` Christoph Hellwig
@ 2003-01-15  0:05                     ` Michael Hohnbaum
  2003-01-15  7:47                     ` Martin J. Bligh
  3 siblings, 0 replies; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-15  0:05 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Andrew Theurer, Christoph Hellwig, Robert Love,
	Ingo Molnar, linux-kernel, lse-tech

On Tuesday 14 January 2003 16:55, Erich Focht wrote:
> Here's the new version of the NUMA scheduler built on top of the
> miniature scheduler of Martin. I incorporated Michael's ideas and
> Christoph's suggestions and rediffed for 2.5.58.
>
> The whole patch is really tiny: 9.5k. This time I attached the numa
> scheduler in form of two patches:

Ran tests on three different kernels:

stock58 - linux 2.5.58 with the cputime_stats patch
sched1-4-58 - stock58 with the first 4 NUMA scheduler patches
sched1-5-58 - stock58 with all 5 NUMA scheduler patches

Kernbench:
                        Elapsed       User     System        CPU
             stock58    31.656s    305.85s    89.232s    1248.2%
         sched1-4-58    29.886s   287.506s     82.94s      1239%
         sched1-5-58    29.994s   288.796s    84.706s      1245%

Schedbench 4:
                        AvgUser    Elapsed  TotalUser   TotalSys
             stock58      27.73      42.80     110.96       0.85
         sched1-4-58      32.86      46.41     131.47       0.85
         sched1-5-58      32.37      45.34     129.52       0.89

Schedbench 8:
                        AvgUser    Elapsed  TotalUser   TotalSys
             stock58      45.97      61.87     367.81       2.11
         sched1-4-58      31.39      49.18     251.22       2.15
         sched1-5-58      37.52      61.32     300.22       2.06

Schedbench 16:
                        AvgUser    Elapsed  TotalUser   TotalSys
             stock58      60.91      83.63     974.71       6.18
         sched1-4-58      54.31      62.11     869.11       3.84
         sched1-5-58      51.60      59.05     825.72       4.74

Schedbench 32:
                        AvgUser    Elapsed  TotalUser   TotalSys
             stock58      84.26     195.16    2696.65      16.53
         sched1-4-58      61.49     140.51    1968.06       9.57
         sched1-5-58      55.23     117.32    1767.71       7.78

Schedbench 64:
                        AvgUser    Elapsed  TotalUser   TotalSys
             stock58     123.27     511.77    7889.77      27.78
         sched1-4-58      63.39     266.40    4057.92      20.55
         sched1-5-58      59.57     250.25    3813.39      17.05

One anomaly noted was that the kernbench system time went up
about 5% with the 2.5.58 kernel from what it was on the last
version I tested with (2.5.55).  This increase shows up both in the stock
kernel and with the NUMA scheduler, so it is not caused by the NUMA scheduler.

Now that I've got baselines for these, I plan to next look at
tweaking various parameters within the scheduler and seeing what
happens.  Also, I owe Erich numbers running hackbench.  Overall, I am
pleased with these results.

And just for grins, here are the detailed results for running 
numa_test 32

sched1-4-58:

Executing 32 times: ./randupdt 1000000 
Running 'hackbench 20' in the background: Time: 9.044
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0  100.0    0.0    0.0 |    1     1    |  55.04
  2     0.0    0.0    4.8   95.2 |    2     3   *|  70.38
  3     0.0    0.0    3.2   96.8 |    2     3   *|  72.72
  4     0.0   26.4    2.8   70.7 |    2     3   *|  72.99
  5   100.0    0.0    0.0    0.0 |    0     0    |  57.03
  6   100.0    0.0    0.0    0.0 |    0     0    |  55.06
  7   100.0    0.0    0.0    0.0 |    0     0    |  57.18
  8     0.0  100.0    0.0    0.0 |    1     1    |  55.38
  9   100.0    0.0    0.0    0.0 |    0     0    |  54.37
 10     0.0  100.0    0.0    0.0 |    1     1    |  56.06
 11     0.0   13.2    0.0   86.8 |    3     3    |  64.33
 12     0.0    0.0    0.0  100.0 |    3     3    |  62.35
 13     1.7    0.0   98.3    0.0 |    2     2    |  67.47
 14   100.0    0.0    0.0    0.0 |    0     0    |  55.94
 15     0.0   29.4   61.9    8.6 |    3     2   *|  78.76
 16     0.0  100.0    0.0    0.0 |    1     1    |  56.42
 17    18.9    0.0   74.9    6.2 |    3     2   *|  70.57
 18     0.0    0.0  100.0    0.0 |    2     2    |  63.01
 19     0.0  100.0    0.0    0.0 |    1     1    |  55.97
 20     0.0    0.0   92.7    7.3 |    3     2   *|  65.62
 21     0.0    0.0  100.0    0.0 |    2     2    |  62.70
 22     0.0  100.0    0.0    0.0 |    1     1    |  55.53
 23     0.0    1.5    0.0   98.5 |    3     3    |  56.95
 24     0.0  100.0    0.0    0.0 |    1     1    |  55.75
 25     0.0   30.0    2.3   67.7 |    2     3   *|  77.78
 26     0.0    0.0    0.0  100.0 |    3     3    |  57.71
 27    13.6    0.0   86.4    0.0 |    0     2   *|  66.55
 28   100.0    0.0    0.0    0.0 |    0     0    |  55.43
 29     0.0  100.0    0.0    0.0 |    1     1    |  56.12
 30    19.8    0.0   62.5   17.6 |    3     2   *|  66.92
 31   100.0    0.0    0.0    0.0 |    0     0    |  54.90
 32   100.0    0.0    0.0    0.0 |    0     0    |  54.70
AverageUserTime 61.49 seconds
ElapsedTime     140.51
TotalUserTime   1968.06
TotalSysTime    9.57

sched1-5-58:

Executing 32 times: ./randupdt 1000000 
Running 'hackbench 20' in the background: Time: 9.145
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1   100.0    0.0    0.0    0.0 |    0     0    |  54.88
  2     0.0  100.0    0.0    0.0 |    1     1    |  54.08
  3     0.0    0.0    0.0  100.0 |    3     3    |  55.48
  4     0.0    0.0    0.0  100.0 |    3     3    |  55.47
  5   100.0    0.0    0.0    0.0 |    0     0    |  53.84
  6   100.0    0.0    0.0    0.0 |    0     0    |  53.37
  7     0.0    0.0    0.0  100.0 |    3     3    |  55.41
  8    90.9    9.1    0.0    0.0 |    1     0   *|  55.58
  9     0.0  100.0    0.0    0.0 |    1     1    |  55.61
 10     0.0  100.0    0.0    0.0 |    1     1    |  54.56
 11     0.0    0.0   98.1    1.9 |    2     2    |  56.25
 12     0.0    0.0    0.0  100.0 |    3     3    |  55.07
 13     0.0    0.0    0.0  100.0 |    3     3    |  54.92
 14     0.0  100.0    0.0    0.0 |    1     1    |  54.59
 15   100.0    0.0    0.0    0.0 |    0     0    |  55.10
 16     5.0    0.0   95.0    0.0 |    2     2    |  56.97
 17     0.0    0.0  100.0    0.0 |    2     2    |  55.51
 18   100.0    0.0    0.0    0.0 |    0     0    |  53.97
 19     0.0    4.7   95.3    0.0 |    2     2    |  57.21
 20     0.0    0.0  100.0    0.0 |    2     2    |  55.53
 21     0.0    0.0  100.0    0.0 |    2     2    |  56.46
 22     0.0    0.0  100.0    0.0 |    2     2    |  55.48
 23     0.0    0.0    0.0  100.0 |    3     3    |  55.99
 24     0.0  100.0    0.0    0.0 |    1     1    |  55.32
 25     0.0    6.2   93.8    0.0 |    2     2    |  57.66
 26     0.0    0.0    0.0  100.0 |    3     3    |  55.60
 27     0.0  100.0    0.0    0.0 |    1     1    |  54.65
 28     0.0    0.0    0.0  100.0 |    3     3    |  56.39
 29     0.0  100.0    0.0    0.0 |    1     1    |  54.91
 30   100.0    0.0    0.0    0.0 |    0     0    |  53.58
 31   100.0    0.0    0.0    0.0 |    0     0    |  53.53
 32    31.5   68.4    0.0    0.0 |    0     1   *|  54.41
AverageUserTime 55.23 seconds
ElapsedTime     117.32
TotalUserTime   1767.71
TotalSysTime    7.78

-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-14 16:23                   ` [PATCH 2.5.58] new NUMA scheduler: fix Erich Focht
                                       ` (2 preceding siblings ...)
  2003-01-15  0:05                     ` Michael Hohnbaum
@ 2003-01-15  7:47                     ` Martin J. Bligh
  3 siblings, 0 replies; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-15  7:47 UTC (permalink / raw)
  To: Erich Focht, Andrew Theurer, Michael Hohnbaum, Christoph Hellwig
  Cc: Robert Love, Ingo Molnar, linux-kernel, lse-tech

OK, ran some tests on incremental versions of the stack.
newsched3 = patches 1+2+3 ... etc.
oldsched = Erich's old code + Michael's ilb.
newsched4-tune = patches 1,2,3,4 + tuning patch below:

Tuning seems to help quite a bit ... need to stick this into arch topo code.
Sleep time now ;-)

Kernbench:
                                   Elapsed        User      System         CPU
                   2.5.58-mjb1     19.522s    186.566s     41.516s     1167.8%
          2.5.58-mjb1-oldsched     19.488s     186.73s     42.382s     1175.6%
         2.5.58-mjb1-newsched2     19.286s    186.418s     40.998s     1178.8%
         2.5.58-mjb1-newsched3      19.58s    187.658s     43.694s     1181.2%
         2.5.58-mjb1-newsched4     19.266s    187.772s     42.984s     1197.4%
    2.5.58-mjb1-newsched4-tune     19.424s    186.664s     41.422s     1173.6%
         2.5.58-mjb1-newsched5     19.462s    187.692s      43.02s       1185%

NUMA schedbench 4:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                   2.5.58-mjb1                                                
          2.5.58-mjb1-oldsched        0.00       35.16       88.55        0.68
         2.5.58-mjb1-newsched2        0.00       19.12       63.71        0.48
         2.5.58-mjb1-newsched3        0.00       35.73       88.26        0.58
         2.5.58-mjb1-newsched4        0.00       35.64       88.46        0.60
    2.5.58-mjb1-newsched4-tune        0.00       37.10       91.99        0.58
         2.5.58-mjb1-newsched5        0.00       35.34       88.60        0.64

NUMA schedbench 8:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                   2.5.58-mjb1        0.00       35.34       88.60        0.64
          2.5.58-mjb1-oldsched        0.00       64.01      338.77        1.50
         2.5.58-mjb1-newsched2        0.00       31.56      227.72        1.03
         2.5.58-mjb1-newsched3        0.00       35.44      220.63        1.36
         2.5.58-mjb1-newsched4        0.00       35.47      223.86        1.33
    2.5.58-mjb1-newsched4-tune        0.00       37.04      232.92        1.14
         2.5.58-mjb1-newsched5        0.00       36.11      223.14        1.39

NUMA schedbench 16:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                   2.5.58-mjb1        0.00       36.11      223.14        1.39
          2.5.58-mjb1-oldsched        0.00       62.60      834.67        4.85
         2.5.58-mjb1-newsched2        0.00       57.24      850.12        2.64
         2.5.58-mjb1-newsched3        0.00       64.15      870.25        3.18
         2.5.58-mjb1-newsched4        0.00       64.01      875.17        3.10
    2.5.58-mjb1-newsched4-tune        0.00       57.84      841.48        2.96
         2.5.58-mjb1-newsched5        0.00       61.87      828.37        3.47

NUMA schedbench 32:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                   2.5.58-mjb1        0.00       61.87      828.37        3.47
          2.5.58-mjb1-oldsched        0.00      154.30     2031.93        9.35
         2.5.58-mjb1-newsched2        0.00      117.75     1798.53        5.52
         2.5.58-mjb1-newsched3        0.00      122.87     1771.71        8.33
         2.5.58-mjb1-newsched4        0.00      134.86     1863.51        8.27
    2.5.58-mjb1-newsched4-tune        0.00      118.18     1809.38        6.58
         2.5.58-mjb1-newsched5        0.00      134.36     1853.94        8.33

NUMA schedbench 64:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                   2.5.58-mjb1        0.00      134.36     1853.94        8.33
          2.5.58-mjb1-oldsched        0.00      318.68     4852.81       21.47
         2.5.58-mjb1-newsched2        0.00      241.11     3603.29       12.70
         2.5.58-mjb1-newsched3        0.00      258.72     3977.50       16.88
         2.5.58-mjb1-newsched4        0.00      252.87     3850.55       18.51
    2.5.58-mjb1-newsched4-tune        0.00      235.43     3627.28       15.90
         2.5.58-mjb1-newsched5        0.00      265.09     3967.70       18.81

--- sched.c.premjb4     2003-01-14 22:12:36.000000000 -0800
+++ sched.c     2003-01-14 22:20:19.000000000 -0800
@@ -85,7 +85,7 @@
 #define NODE_THRESHOLD          125
 #define MAX_INTERNODE_LB 40
 #define MIN_INTERNODE_LB 4
-#define NODE_BALANCE_RATIO 10
+#define NODE_BALANCE_RATIO 250
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -763,7 +763,8 @@
                        + atomic_read(&node_nr_running[i]);
                this_rq()->prev_node_load[i] = load;
                if (load > maxload &&
-                   (100*load > ((NODE_THRESHOLD*100*this_load)/100))) {
+                   (100*load > (NODE_THRESHOLD*this_load))
+                   && load > this_load + 4) {
                        maxload = load;
                        node = i;
                }




* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-14 19:02                       ` Michael Hohnbaum
  2003-01-14 21:56                         ` [Lse-tech] " Michael Hohnbaum
@ 2003-01-15 15:10                         ` Erich Focht
  2003-01-16  0:14                           ` Michael Hohnbaum
  2003-01-16  6:05                           ` Martin J. Bligh
  1 sibling, 2 replies; 96+ messages in thread
From: Erich Focht @ 2003-01-15 15:10 UTC (permalink / raw)
  To: Michael Hohnbaum, Martin J. Bligh, Christoph Hellwig
  Cc: Andrew Theurer, Robert Love, Ingo Molnar, linux-kernel, lse-tech

[-- Attachment #1: Type: text/plain, Size: 1419 bytes --]

Thanks for your patience with the problems in yesterday's patches. I'm
resending the patches in the same form, with the following changes:

- moved NODE_BALANCE_{RATE,MIN,MAX} to topology.h
- removed the divide in the find_busiest_node() loop (thanks, Martin)
- removed the modulo (%) in in the cross-node balancing trigger
- re-added node_nr_running_init() stub, nr_running_init() and comments
  from Christoph
- removed the constant factor 4 in find_busiest_node. The
  find_busiest_queue routine will take care of the case where the
  busiest_node is running only few processes (at most one per CPU) and
  return busiest=NULL .

I hope we can start tuning the parameters now. In the basic NUMA
scheduler part these are:
  NODE_THRESHOLD     : minimum percentage of node overload to
                       trigger cross-node balancing
  NODE_BALANCE_RATE  : arch specific, cross-node balancing is called
                       after this many intra-node load balance calls

In the extended NUMA scheduler the fixed value of NODE_BALANCE_RATE is
replaced by a variable rate set to:
  NODE_BALANCE_MIN : if own node load is less than avg_load/2
  NODE_BALANCE_MAX : if load is larger than avg_load/2
Together with the reduced number of steals across nodes this might
help us in achieving equal load among nodes. I'm not aware of any
simple benchmark which can demonstrate this...
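
As a compact illustration of that selection rule (a sketch only; the real
switching lives in the cross-node balancing path of the attached patch):

#include <stdio.h>

#define NODE_BALANCE_MIN	10
#define NODE_BALANCE_MAX	40

/* underloaded nodes look across nodes more often, loaded ones back off */
static int balance_interval(int this_load, int avg_load)
{
	return (this_load < avg_load / 2) ? NODE_BALANCE_MIN : NODE_BALANCE_MAX;
}

int main(void)
{
	printf("node at load 2, average 8: every %d local balances\n",
	       balance_interval(2, 8));
	printf("node at load 8, average 8: every %d local balances\n",
	       balance_interval(8, 8));
	return 0;
}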

Regards,
Erich

[-- Attachment #2: numa-sched-2.5.58.patch --]
[-- Type: text/x-diff, Size: 10483 bytes --]

diff -urNp 2.5.58/fs/exec.c 2.5.58-ms-ilb-nb/fs/exec.c
--- 2.5.58/fs/exec.c	2003-01-14 06:58:33.000000000 +0100
+++ 2.5.58-ms-ilb-nb/fs/exec.c	2003-01-15 11:47:32.000000000 +0100
@@ -1031,6 +1031,8 @@ int do_execve(char * filename, char ** a
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -urNp 2.5.58/include/asm-generic/topology.h 2.5.58-ms-ilb-nb/include/asm-generic/topology.h
--- 2.5.58/include/asm-generic/topology.h	2003-01-14 06:58:06.000000000 +0100
+++ 2.5.58-ms-ilb-nb/include/asm-generic/topology.h	2003-01-15 15:24:45.000000000 +0100
@@ -48,4 +48,9 @@
 #define __node_to_memblk(node)		(0)
 #endif
 
+/* Cross-node load balancing interval. */
+#ifndef NODE_BALANCE_RATE
+#define NODE_BALANCE_RATE 10
+#endif
+
 #endif /* _ASM_GENERIC_TOPOLOGY_H */
diff -urNp 2.5.58/include/asm-i386/topology.h 2.5.58-ms-ilb-nb/include/asm-i386/topology.h
--- 2.5.58/include/asm-i386/topology.h	2003-01-14 06:58:20.000000000 +0100
+++ 2.5.58-ms-ilb-nb/include/asm-i386/topology.h	2003-01-15 15:25:11.000000000 +0100
@@ -61,6 +61,9 @@ static inline int __node_to_first_cpu(in
 /* Returns the number of the first MemBlk on Node 'node' */
 #define __node_to_memblk(node) (node)
 
+/* Cross-node load balancing interval. */
+#define NODE_BALANCE_RATE 100
+
 #else /* !CONFIG_NUMA */
 /*
  * Other i386 platforms should define their own version of the 
diff -urNp 2.5.58/include/asm-ia64/topology.h 2.5.58-ms-ilb-nb/include/asm-ia64/topology.h
--- 2.5.58/include/asm-ia64/topology.h	2003-01-14 06:58:03.000000000 +0100
+++ 2.5.58-ms-ilb-nb/include/asm-ia64/topology.h	2003-01-15 15:25:28.000000000 +0100
@@ -60,4 +60,7 @@
  */
 #define __node_to_memblk(node) (node)
 
+/* Cross-node load balancing interval. */
+#define NODE_BALANCE_RATE 10
+
 #endif /* _ASM_IA64_TOPOLOGY_H */
diff -urNp 2.5.58/include/asm-ppc64/topology.h 2.5.58-ms-ilb-nb/include/asm-ppc64/topology.h
--- 2.5.58/include/asm-ppc64/topology.h	2003-01-14 06:59:29.000000000 +0100
+++ 2.5.58-ms-ilb-nb/include/asm-ppc64/topology.h	2003-01-15 15:24:15.000000000 +0100
@@ -46,6 +46,9 @@ static inline unsigned long __node_to_cp
 	return mask;
 }
 
+/* Cross-node load balancing interval. */
+#define NODE_BALANCE_RATE 10
+
 #else /* !CONFIG_NUMA */
 
 #define __cpu_to_node(cpu)		(0)
diff -urNp 2.5.58/include/linux/sched.h 2.5.58-ms-ilb-nb/include/linux/sched.h
--- 2.5.58/include/linux/sched.h	2003-01-14 06:58:06.000000000 +0100
+++ 2.5.58-ms-ilb-nb/include/linux/sched.h	2003-01-15 11:47:32.000000000 +0100
@@ -160,7 +160,6 @@ extern void update_one_process(struct ta
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
 
-
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
 asmlinkage void schedule(void);
@@ -444,6 +443,14 @@ extern void set_cpus_allowed(task_t *p, 
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#else
+#define sched_balance_exec()   {}
+#define node_nr_running_init() {}
+#endif
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -urNp 2.5.58/init/main.c 2.5.58-ms-ilb-nb/init/main.c
--- 2.5.58/init/main.c	2003-01-14 06:58:20.000000000 +0100
+++ 2.5.58-ms-ilb-nb/init/main.c	2003-01-15 11:47:32.000000000 +0100
@@ -495,6 +495,7 @@ static void do_pre_smp_initcalls(void)
 
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
 
diff -urNp 2.5.58/kernel/sched.c 2.5.58-ms-ilb-nb/kernel/sched.c
--- 2.5.58/kernel/sched.c	2003-01-14 06:59:11.000000000 +0100
+++ 2.5.58-ms-ilb-nb/kernel/sched.c	2003-01-15 15:26:29.000000000 +0100
@@ -67,6 +67,7 @@
 #define INTERACTIVE_DELTA	2
 #define MAX_SLEEP_AVG		(2*HZ)
 #define STARVATION_LIMIT	(2*HZ)
+#define NODE_THRESHOLD          125
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -153,7 +154,10 @@ struct runqueue {
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
-
+#ifdef CONFIG_NUMA
+	atomic_t *node_nr_running;
+	unsigned int nr_balanced;
+#endif
 	task_t *migration_thread;
 	struct list_head migration_queue;
 
@@ -178,6 +182,44 @@ static struct runqueue runqueues[NR_CPUS
 #endif
 
 /*
+ * Keep track of running tasks.
+ */
+#if CONFIG_NUMA
+/* XXX(hch): this should go into a struct sched_node_data */
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+static inline void nr_running_init(struct runqueue *rq)
+{
+	rq->node_nr_running = &node_nr_running[0];
+}
+
+static inline void nr_running_inc(runqueue_t *rq)
+{
+	atomic_inc(rq->node_nr_running);
+	rq->nr_running++;
+}
+
+static inline void nr_running_dec(runqueue_t *rq)
+{
+	atomic_dec(rq->node_nr_running);
+	rq->nr_running--;
+}
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++)
+		cpu_rq(i)->node_nr_running = &node_nr_running[__cpu_to_node(i)];
+}
+#else
+# define nr_running_init(rq)   do { } while (0)
+# define nr_running_inc(rq)    do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq)    do { (rq)->nr_running--; } while (0)
+#endif /* CONFIG_NUMA */
+
+/*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
  * explicitly disabling preemption.
@@ -294,7 +336,7 @@ static inline void activate_task(task_t 
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -302,7 +344,7 @@ static inline void activate_task(task_t 
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -624,6 +666,91 @@ static inline void double_rq_unlock(runq
 		spin_unlock(&rq2->lock);
 }
 
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+	unsigned long cpumask;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+
+	minload = 10000000;
+	cpumask = __node_to_cpu_mask(node);
+	for (i = 0; i < NR_CPUS; ++i) {
+		if (!(cpumask & (1UL << i)))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+
+static int find_busiest_node(int this_node)
+{
+	int i, node = -1, load, this_load, maxload;
+	
+	this_load = maxload = atomic_read(&node_nr_running[this_node]);
+	for (i = 0; i < numnodes; i++) {
+		if (i == this_node)
+			continue;
+		load = atomic_read(&node_nr_running[i]);
+		if (load > maxload && (100*load > NODE_THRESHOLD*this_load)) {
+			maxload = load;
+			node = i;
+		}
+	}
+	return node;
+}
+#endif /* CONFIG_NUMA */
+
 #if CONFIG_SMP
 
 /*
@@ -652,9 +779,9 @@ static inline unsigned int double_lock_b
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_queue - find the busiest runqueue among the cpus in cpumask
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)
 {
 	int nr_running, load, max_load, i;
 	runqueue_t *busiest, *rq_src;
@@ -689,7 +816,7 @@ static inline runqueue_t *find_busiest_q
 	busiest = NULL;
 	max_load = 1;
 	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
+		if (!cpu_online(i) || !((1UL << i) & cpumask))
 			continue;
 
 		rq_src = cpu_rq(i);
@@ -736,9 +863,9 @@ out:
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -758,13 +885,30 @@ static inline void pull_task(runqueue_t 
  */
 static void load_balance(runqueue_t *this_rq, int idle)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
+	int imbalance, idx, this_cpu, this_node;
+	unsigned long cpumask;
 	runqueue_t *busiest;
 	prio_array_t *array;
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+	this_cpu = smp_processor_id();
+	this_node = __cpu_to_node(this_cpu);
+	cpumask = __node_to_cpu_mask(this_node);
+
+#if CONFIG_NUMA
+	/*
+	 * Avoid rebalancing between nodes too often.
+	 */
+	if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
+		int node = find_busiest_node(this_node);
+		this_rq->nr_balanced = 0;
+		if (node >= 0)
+			cpumask = __node_to_cpu_mask(node) | (1UL << this_cpu);
+	}
+#endif
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
+				    cpumask);
 	if (!busiest)
 		goto out;
 
@@ -2231,6 +2375,7 @@ void __init sched_init(void)
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+		nr_running_init(rq);
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;

[-- Attachment #3: numa-sched-add-2.5.58.patch --]
[-- Type: text/x-diff, Size: 5303 bytes --]

diff -urNp 2.5.58-ms-ilb-nb/include/asm-generic/topology.h 2.5.58-ms-ilb-nb-sm-var/include/asm-generic/topology.h
--- 2.5.58-ms-ilb-nb/include/asm-generic/topology.h	2003-01-15 15:24:45.000000000 +0100
+++ 2.5.58-ms-ilb-nb-sm-var/include/asm-generic/topology.h	2003-01-15 15:32:39.000000000 +0100
@@ -49,8 +49,11 @@
 #endif
 
 /* Cross-node load balancing interval. */
-#ifndef NODE_BALANCE_RATE
-#define NODE_BALANCE_RATE 10
+#ifndef NODE_BALANCE_MIN
+#define NODE_BALANCE_MIN 10
+#endif
+#ifndef NODE_BALANCE_MAX
+#define NODE_BALANCE_MAX 50
 #endif
 
 #endif /* _ASM_GENERIC_TOPOLOGY_H */
diff -urNp 2.5.58-ms-ilb-nb/include/asm-i386/topology.h 2.5.58-ms-ilb-nb-sm-var/include/asm-i386/topology.h
--- 2.5.58-ms-ilb-nb/include/asm-i386/topology.h	2003-01-15 15:25:11.000000000 +0100
+++ 2.5.58-ms-ilb-nb-sm-var/include/asm-i386/topology.h	2003-01-15 15:33:50.000000000 +0100
@@ -62,7 +62,8 @@ static inline int __node_to_first_cpu(in
 #define __node_to_memblk(node) (node)
 
 /* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 100
+#define NODE_BALANCE_MIN 10
+#define NODE_BALANCE_MAX 250
 
 #else /* !CONFIG_NUMA */
 /*
diff -urNp 2.5.58-ms-ilb-nb/include/asm-ia64/topology.h 2.5.58-ms-ilb-nb-sm-var/include/asm-ia64/topology.h
--- 2.5.58-ms-ilb-nb/include/asm-ia64/topology.h	2003-01-15 15:25:28.000000000 +0100
+++ 2.5.58-ms-ilb-nb-sm-var/include/asm-ia64/topology.h	2003-01-15 15:34:11.000000000 +0100
@@ -61,6 +61,7 @@
 #define __node_to_memblk(node) (node)
 
 /* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
+#define NODE_BALANCE_MIN 10
+#define NODE_BALANCE_MAX 50
 
 #endif /* _ASM_IA64_TOPOLOGY_H */
diff -urNp 2.5.58-ms-ilb-nb/include/asm-ppc64/topology.h 2.5.58-ms-ilb-nb-sm-var/include/asm-ppc64/topology.h
--- 2.5.58-ms-ilb-nb/include/asm-ppc64/topology.h	2003-01-15 15:24:15.000000000 +0100
+++ 2.5.58-ms-ilb-nb-sm-var/include/asm-ppc64/topology.h	2003-01-15 15:34:36.000000000 +0100
@@ -47,7 +47,8 @@ static inline unsigned long __node_to_cp
 }
 
 /* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
+#define NODE_BALANCE_MIN 10
+#define NODE_BALANCE_MAX 50
 
 #else /* !CONFIG_NUMA */
 
diff -urNp 2.5.58-ms-ilb-nb/kernel/sched.c 2.5.58-ms-ilb-nb-sm-var/kernel/sched.c
--- 2.5.58-ms-ilb-nb/kernel/sched.c	2003-01-15 15:26:29.000000000 +0100
+++ 2.5.58-ms-ilb-nb-sm-var/kernel/sched.c	2003-01-15 15:31:35.000000000 +0100
@@ -157,6 +157,7 @@ struct runqueue {
 #ifdef CONFIG_NUMA
 	atomic_t *node_nr_running;
 	unsigned int nr_balanced;
+	int prev_node_load[MAX_NUMNODES];
 #endif
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -188,6 +189,8 @@ static struct runqueue runqueues[NR_CPUS
 /* XXX(hch): this should go into a struct sched_node_data */
 static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
 	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+static int internode_lb[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = NODE_BALANCE_MAX};
 
 static inline void nr_running_init(struct runqueue *rq)
 {
@@ -733,22 +736,53 @@ void sched_balance_exec(void)
 	}
 }
 
+/*
+ * Find the busiest node. All previous node loads contribute with a 
+ * geometrically decaying weight to the load measure:
+ *      load_{t} = load_{t-1}/2 + nr_node_running_{t}
+ * This way sudden load peaks are flattened out a bit.
+ */
 static int find_busiest_node(int this_node)
 {
 	int i, node = -1, load, this_load, maxload;
+	int avg_load;
 	
-	this_load = maxload = atomic_read(&node_nr_running[this_node]);
+	this_load = maxload = (this_rq()->prev_node_load[this_node] >> 1)
+		+ atomic_read(&node_nr_running[this_node]);
+	this_rq()->prev_node_load[this_node] = this_load;
+	avg_load = this_load;
 	for (i = 0; i < numnodes; i++) {
 		if (i == this_node)
 			continue;
-		load = atomic_read(&node_nr_running[i]);
+		load = (this_rq()->prev_node_load[i] >> 1)
+			+ atomic_read(&node_nr_running[i]);
+		avg_load += load;
+		this_rq()->prev_node_load[i] = load;
 		if (load > maxload && (100*load > NODE_THRESHOLD*this_load)) {
 			maxload = load;
 			node = i;
 		}
 	}
+	avg_load = avg_load / (numnodes ? numnodes : 1);
+	if (this_load < (avg_load / 2)) {
+		if (internode_lb[this_node] == NODE_BALANCE_MAX)
+			internode_lb[this_node] = NODE_BALANCE_MIN;
+	} else
+		if (internode_lb[this_node] == NODE_BALANCE_MIN)
+			internode_lb[this_node] = NODE_BALANCE_MAX;
+	if (this_load >= avg_load)
+  		node = -1;
 	return node;
 }
+
+static inline int remote_steal_factor(runqueue_t *rq)
+{
+	int cpu = __cpu_to_node(task_cpu(rq->curr));
+
+	return (cpu == __cpu_to_node(smp_processor_id())) ? 1 : 2;
+}
+#else
+# define remote_steal_factor(rq) 1
 #endif /* CONFIG_NUMA */
 
 #if CONFIG_SMP
@@ -900,7 +934,7 @@ static void load_balance(runqueue_t *thi
 	/*
 	 * Avoid rebalancing between nodes too often.
 	 */
-	if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
+	if (++(this_rq->nr_balanced) == internode_lb[this_node]) {
 		int node = find_busiest_node(this_node);
 		this_rq->nr_balanced = 0;
 		if (node >= 0)
@@ -965,7 +999,7 @@ skip_queue:
 		goto skip_bitmap;
 	}
 	pull_task(busiest, array, tmp, this_rq, this_cpu);
-	if (!idle && --imbalance) {
+	if (!idle && ((--imbalance)/remote_steal_factor(busiest))) {
 		if (curr != head)
 			goto skip_queue;
 		idx++;

[-- Attachment #4: numa-sched-patches.tgz --]
[-- Type: application/x-tgz, Size: 4730 bytes --]


* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-15 15:10                         ` Erich Focht
@ 2003-01-16  0:14                           ` Michael Hohnbaum
  2003-01-16  6:05                           ` Martin J. Bligh
  1 sibling, 0 replies; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-16  0:14 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Christoph Hellwig, Andrew Theurer, Robert Love,
	Ingo Molnar, linux-kernel, lse-tech

On Wed, 2003-01-15 at 07:10, Erich Focht wrote:
> Thanks for the patience with the problems in the yesterday patches. I
> resend the patches in the same form. Made following changes:
> 
> - moved NODE_BALANCE_{RATE,MIN,MAX} to topology.h
> - removed the divide in the find_busiest_node() loop (thanks, Martin)
> - removed the modulo (%) in in the cross-node balancing trigger
> - re-added node_nr_running_init() stub, nr_running_init() and comments
>   from Christoph
> - removed the constant factor 4 in find_busiest_node. The
>   find_busiest_queue routine will take care of the case where the
>   busiest_node is running only a few processes (at most one per CPU) and
>   return busiest=NULL.

These applied clean and looked good.  I ran numbers for kernels
with patches 1-4 and patches 1-5.  Results below.
> 
> I hope we can start tuning the parameters now. In the basic NUMA
> scheduler part these are:
>   NODE_THRESHOLD     : minimum percentage of node overload to
>                        trigger cross-node balancing
>   NODE_BALANCE_RATE  : arch specific, cross-node balancing is called
>                        after this many intra-node load balance calls

I need to spend some time staring at this code and putting together
a series of tests to try.  I think the basic mechanism looks good, so
hopefully we are down to finding the best numbers for the architecture.
> 
> In the extended NUMA scheduler the fixed value of NODE_BALANCE_RATE is
> replaced by a variable rate set to:
>   NODE_BALANCE_MIN : if own node load is less than avg_load/2
>   NODE_BALANCE_MAX : if load is larger than avg_load/2
> Together with the reduced number of steals across nodes this might
> help us in achieving equal load among nodes. I'm not aware of any
> simple benchmark which can demonstrate this...
> 
> Regards,
> Erich
> ----
>
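
For reference, the variable cross-node rate described above boils down
to the following little standalone model (NODE_BALANCE_MIN/MAX are the
names from patch 5; the decayed node loads are simplified to plain task
counts here, so treat this as an illustration, not the kernel code):

/* gcc -o rate rate.c && ./rate */
#include <stdio.h>

#define NODE_BALANCE_MIN 10	/* cross-node balance often (underloaded node) */
#define NODE_BALANCE_MAX 50	/* cross-node balance rarely (well loaded node) */

/* How many intra-node balance calls to wait before trying cross-node. */
static int internode_balance_rate(const int node_load[], int numnodes,
				  int this_node)
{
	int i, avg_load = 0;

	for (i = 0; i < numnodes; i++)
		avg_load += node_load[i];
	avg_load /= numnodes;

	/* a node well below average checks the other nodes more often */
	return (node_load[this_node] < avg_load / 2) ?
		NODE_BALANCE_MIN : NODE_BALANCE_MAX;
}

int main(void)
{
	int load[4] = { 2, 9, 8, 9 };	/* running tasks per node */

	printf("node 0: cross-node balance every %d calls\n",
	       internode_balance_rate(load, 4, 0));
	printf("node 1: cross-node balance every %d calls\n",
	       internode_balance_rate(load, 4, 1));
	return 0;
}

With the loads above, the underloaded node 0 drops to the fast rate
while the loaded nodes stay at the slow one.
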
results:

stock58h: linux 2.5.58 with cputime stats patch
sched14r2-58: stock58h with latest NUMA sched patches 1 through 4
sched15r2-58: stock58h with latest NUMA sched patches 1 through 5
sched1-4-58: stock58h with previous NUMA sched patches 1 through 4

Kernbench:
                        Elapsed       User     System        CPU
        sched14r2-58    29.488s    284.02s    82.132s    1241.8%
        sched15r2-58    29.778s   282.792s    83.478s    1229.8%
            stock58h    31.656s    305.85s    89.232s    1248.2%
         sched1-4-58    29.886s   287.506s     82.94s      1239%

Schedbench 4:
                        AvgUser    Elapsed  TotalUser   TotalSys
        sched14r2-58      22.50      37.20      90.03       0.94
        sched15r2-58      16.63      24.23      66.56       0.69
            stock58h      27.73      42.80     110.96       0.85
         sched1-4-58      32.86      46.41     131.47       0.85

Schedbench 8:
                        AvgUser    Elapsed  TotalUser   TotalSys
        sched14r2-58      30.27      43.22     242.23       1.75
        sched15r2-58      30.90      42.46     247.28       1.48
            stock58h      45.97      61.87     367.81       2.11
         sched1-4-58      31.39      49.18     251.22       2.15

Schedbench 16:
                        AvgUser    Elapsed  TotalUser   TotalSys
        sched14r2-58      52.78      57.28     844.61       3.70
        sched15r2-58      48.44      65.31     775.25       3.30
            stock58h      60.91      83.63     974.71       6.18
         sched1-4-58      54.31      62.11     869.11       3.84

Schedbench 32:
                        AvgUser    Elapsed  TotalUser   TotalSys
        sched14r2-58      56.60     116.99    1811.56       5.94
        sched15r2-58      56.42     116.75    1805.82       6.45
            stock58h      84.26     195.16    2696.65      16.53
         sched1-4-58      61.49     140.51    1968.06       9.57

Schedbench 64:
                        AvgUser    Elapsed  TotalUser   TotalSys
        sched14r2-58      56.48     232.63    3615.55      16.02
        sched15r2-58      56.38     236.25    3609.03      15.41
            stock58h     123.27     511.77    7889.77      27.78
         sched1-4-58      63.39     266.40    4057.92      20.55

 
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-15 15:10                         ` Erich Focht
  2003-01-16  0:14                           ` Michael Hohnbaum
@ 2003-01-16  6:05                           ` Martin J. Bligh
  2003-01-16 16:47                             ` Erich Focht
  1 sibling, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-16  6:05 UTC (permalink / raw)
  To: Erich Focht, Michael Hohnbaum, Christoph Hellwig
  Cc: Andrew Theurer, Robert Love, Ingo Molnar, linux-kernel, lse-tech

[-- Attachment #1: Type: text/plain, Size: 579 bytes --]

I'm not keen on the way the minisched patch got reformatted. I changed
it into a separate function, which I think is much cleaner by the time
you've added the third patch - no #ifdef CONFIG_NUMA in load_balance.

Rejigged patches attached, no functional changes.

Anyway, I perf tested this, and it comes out more or less the same as
the tuned version I was poking at last night (ie best of the bunch).
Looks pretty good to me.

M.

PS. The fourth patch was so small, and touching the same stuff as 3
that I rolled it into the third one here. Seems like a universal 
benefit ;-)

[-- Attachment #2: numasched1 --]
[-- Type: application/octet-stream, Size: 1653 bytes --]

diff -urpN -X /home/fletch/.diff.exclude virgin/kernel/sched.c numasched1/kernel/sched.c
--- virgin/kernel/sched.c	Mon Jan 13 21:09:28 2003
+++ numasched1/kernel/sched.c	Wed Jan 15 19:52:07 2003
@@ -624,6 +624,22 @@ static inline void double_rq_unlock(runq
 		spin_unlock(&rq2->lock);
 }
 
+#ifdef CONFIG_NUMA
+
+static inline unsigned long cpus_to_balance(int this_cpu)
+{
+	return __node_to_cpu_mask(__cpu_to_node(this_cpu));
+}
+
+#else /* !CONFIG_NUMA */
+
+static inline unsigned long cpus_to_balance(int this_cpu)
+{
+	return cpu_online_map;
+}
+
+#endif /* CONFIG_NUMA */
+
 #if CONFIG_SMP
 
 /*
@@ -652,9 +668,9 @@ static inline unsigned int double_lock_b
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_queue - find the busiest runqueue among the cpus in cpumask.
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)
 {
 	int nr_running, load, max_load, i;
 	runqueue_t *busiest, *rq_src;
@@ -689,7 +705,7 @@ static inline runqueue_t *find_busiest_q
 	busiest = NULL;
 	max_load = 1;
 	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
+		if (!((1UL << i) & cpumask))
 			continue;
 
 		rq_src = cpu_rq(i);
@@ -764,7 +780,8 @@ static void load_balance(runqueue_t *thi
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
+					cpus_to_balance(this_cpu));
 	if (!busiest)
 		goto out;
 

[-- Attachment #3: numasched2 --]
[-- Type: application/octet-stream, Size: 5907 bytes --]

diff -urpN -X /home/fletch/.diff.exclude numasched1/fs/exec.c numasched2/fs/exec.c
--- numasched1/fs/exec.c	Mon Jan 13 21:09:13 2003
+++ numasched2/fs/exec.c	Wed Jan 15 19:10:04 2003
@@ -1031,6 +1031,8 @@ int do_execve(char * filename, char ** a
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -urpN -X /home/fletch/.diff.exclude numasched1/include/linux/sched.h numasched2/include/linux/sched.h
--- numasched1/include/linux/sched.h	Mon Jan 13 21:09:28 2003
+++ numasched2/include/linux/sched.h	Wed Jan 15 19:10:04 2003
@@ -444,6 +444,14 @@ extern void set_cpus_allowed(task_t *p, 
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#else
+#define sched_balance_exec()   {}
+#define node_nr_running_init() {}
+#endif
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -urpN -X /home/fletch/.diff.exclude numasched1/init/main.c numasched2/init/main.c
--- numasched1/init/main.c	Thu Jan  9 19:16:15 2003
+++ numasched2/init/main.c	Wed Jan 15 19:10:04 2003
@@ -495,6 +495,7 @@ static void do_pre_smp_initcalls(void)
 
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
 
diff -urpN -X /home/fletch/.diff.exclude numasched1/kernel/sched.c numasched2/kernel/sched.c
--- numasched1/kernel/sched.c	Wed Jan 15 19:52:07 2003
+++ numasched2/kernel/sched.c	Wed Jan 15 19:56:42 2003
@@ -153,7 +153,9 @@ struct runqueue {
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
-
+#ifdef CONFIG_NUMA
+	atomic_t *node_nr_running;
+#endif
 	task_t *migration_thread;
 	struct list_head migration_queue;
 
@@ -177,6 +179,48 @@ static struct runqueue runqueues[NR_CPUS
 # define task_running(rq, p)		((rq)->curr == (p))
 #endif
 
+#ifdef CONFIG_NUMA
+
+/*
+ * Keep track of running tasks.
+ */
+
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+static inline void nr_running_init(struct runqueue *rq)
+{
+	rq->node_nr_running = &node_nr_running[0];
+}
+
+static inline void nr_running_inc(runqueue_t *rq)
+{
+	atomic_inc(rq->node_nr_running);
+	rq->nr_running++;
+}
+
+static inline void nr_running_dec(runqueue_t *rq)
+{
+	atomic_dec(rq->node_nr_running);
+	rq->nr_running--;
+}
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++)
+		cpu_rq(i)->node_nr_running = &node_nr_running[__cpu_to_node(i)];
+}
+
+#else /* !CONFIG_NUMA */
+
+# define nr_running_init(rq)   do { } while (0)
+# define nr_running_inc(rq)    do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq)    do { (rq)->nr_running--; } while (0)
+
+#endif /* CONFIG_NUMA */
+
 /*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
@@ -294,7 +338,7 @@ static inline void activate_task(task_t 
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -302,7 +346,7 @@ static inline void activate_task(task_t 
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -624,7 +668,72 @@ static inline void double_rq_unlock(runq
 		spin_unlock(&rq2->lock);
 }
 
-#ifdef CONFIG_NUMA
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpus_allowed mask to only
+ * allow dest_cpu, which will force the task onto dest_cpu.  Then
+ * the cpus_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+	unsigned long cpumask;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+
+	minload = 10000000;
+	cpumask = __node_to_cpu_mask(node);
+	for (i = 0; i < NR_CPUS; ++i) {
+		if (!(cpumask & (1UL << i)))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
 
 static inline unsigned long cpus_to_balance(int this_cpu)
 {
@@ -752,9 +861,9 @@ out:
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -2248,6 +2357,7 @@ void __init sched_init(void)
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+		nr_running_init(rq);
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;

[-- Attachment #4: numasched3 --]
[-- Type: application/octet-stream, Size: 4620 bytes --]

diff -urpN -X /home/fletch/.diff.exclude numasched2/include/asm-generic/topology.h numasched3/include/asm-generic/topology.h
--- numasched2/include/asm-generic/topology.h	Sun Nov 17 20:29:22 2002
+++ numasched3/include/asm-generic/topology.h	Wed Jan 15 19:26:01 2003
@@ -48,4 +48,9 @@
 #define __node_to_memblk(node)		(0)
 #endif
 
+/* Cross-node load balancing interval. */
+#ifndef NODE_BALANCE_RATE
+#define NODE_BALANCE_RATE 10
+#endif
+
 #endif /* _ASM_GENERIC_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude numasched2/include/asm-i386/topology.h numasched3/include/asm-i386/topology.h
--- numasched2/include/asm-i386/topology.h	Thu Jan  9 19:16:11 2003
+++ numasched3/include/asm-i386/topology.h	Wed Jan 15 19:26:01 2003
@@ -61,6 +61,9 @@ static inline int __node_to_first_cpu(in
 /* Returns the number of the first MemBlk on Node 'node' */
 #define __node_to_memblk(node) (node)
 
+/* Cross-node load balancing interval. */
+#define NODE_BALANCE_RATE 100
+
 #else /* !CONFIG_NUMA */
 /*
  * Other i386 platforms should define their own version of the 
diff -urpN -X /home/fletch/.diff.exclude numasched2/include/asm-ia64/topology.h numasched3/include/asm-ia64/topology.h
--- numasched2/include/asm-ia64/topology.h	Sun Nov 17 20:29:20 2002
+++ numasched3/include/asm-ia64/topology.h	Wed Jan 15 19:26:01 2003
@@ -60,4 +60,7 @@
  */
 #define __node_to_memblk(node) (node)
 
+/* Cross-node load balancing interval. */
+#define NODE_BALANCE_RATE 10
+
 #endif /* _ASM_IA64_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude numasched2/include/asm-ppc64/topology.h numasched3/include/asm-ppc64/topology.h
--- numasched2/include/asm-ppc64/topology.h	Thu Jan  9 19:16:12 2003
+++ numasched3/include/asm-ppc64/topology.h	Wed Jan 15 19:26:01 2003
@@ -46,6 +46,9 @@ static inline unsigned long __node_to_cp
 	return mask;
 }
 
+/* Cross-node load balancing interval. */
+#define NODE_BALANCE_RATE 10
+
 #else /* !CONFIG_NUMA */
 
 #define __cpu_to_node(cpu)		(0)
diff -urpN -X /home/fletch/.diff.exclude numasched2/kernel/sched.c numasched3/kernel/sched.c
--- numasched2/kernel/sched.c	Wed Jan 15 19:56:42 2003
+++ numasched3/kernel/sched.c	Wed Jan 15 20:01:12 2003
@@ -67,6 +67,7 @@
 #define INTERACTIVE_DELTA	2
 #define MAX_SLEEP_AVG		(2*HZ)
 #define STARVATION_LIMIT	(2*HZ)
+#define NODE_THRESHOLD          125
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -155,6 +156,8 @@ struct runqueue {
 	int prev_nr_running[NR_CPUS];
 #ifdef CONFIG_NUMA
 	atomic_t *node_nr_running;
+	unsigned int nr_balanced;
+	int prev_node_load[MAX_NUMNODES];
 #endif
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -735,14 +738,52 @@ void sched_balance_exec(void)
 	}
 }
 
-static inline unsigned long cpus_to_balance(int this_cpu)
+/*
+ * Find the busiest node. All previous node loads contribute with a 
+ * geometrically decaying weight to the load measure:
+ *      load_{t} = load_{t-1}/2 + nr_node_running_{t}
+ * This way sudden load peaks are flattened out a bit.
+ */
+static int find_busiest_node(int this_node)
+{
+	int i, node = -1, load, this_load, maxload;
+	
+	this_load = maxload = (this_rq()->prev_node_load[this_node] >> 1)
+		+ atomic_read(&node_nr_running[this_node]);
+	this_rq()->prev_node_load[this_node] = this_load;
+	for (i = 0; i < numnodes; i++) {
+		if (i == this_node)
+			continue;
+		load = (this_rq()->prev_node_load[i] >> 1)
+			+ atomic_read(&node_nr_running[i]);
+		this_rq()->prev_node_load[i] = load;
+		if (load > maxload && (100*load > NODE_THRESHOLD*this_load)) {
+			maxload = load;
+			node = i;
+		}
+	}
+	return node;
+}
+
+static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
 {
-	return __node_to_cpu_mask(__cpu_to_node(this_cpu));
+	int this_node = __cpu_to_node(this_cpu);
+	/*
+	 * Avoid rebalancing between nodes too often.
+	 * We rebalance globally once every NODE_BALANCE_RATE load balances.
+	 */
+	if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
+		int node = find_busiest_node(this_node);
+		this_rq->nr_balanced = 0;
+		if (node >= 0)
+			return (__node_to_cpu_mask(node) | (1UL << this_cpu));
+	}
+	return __node_to_cpu_mask(this_node);
 }
 
 #else /* !CONFIG_NUMA */
 
-static inline unsigned long cpus_to_balance(int this_cpu)
+static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
 {
 	return cpu_online_map;
 }
@@ -890,7 +931,7 @@ static void load_balance(runqueue_t *thi
 	task_t *tmp;
 
 	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
-					cpus_to_balance(this_cpu));
+					cpus_to_balance(this_cpu, this_rq));
 	if (!busiest)
 		goto out;
 


* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16  6:05                           ` Martin J. Bligh
@ 2003-01-16 16:47                             ` Erich Focht
  2003-01-16 18:07                               ` Robert Love
  0 siblings, 1 reply; 96+ messages in thread
From: Erich Focht @ 2003-01-16 16:47 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Andrew Theurer, Robert Love, Ingo Molnar, linux-kernel, lse-tech,
	Christoph Hellwig

Hi Martin and Michael,

thanks for testing again!

On Thursday 16 January 2003 07:05, Martin J. Bligh wrote:
> I'm not keen on the way the minisched patch got reformatted. I changed
> it into a seperate function, which I think is much cleaner by the time
> you've added the third patch - no #ifdef CONFIG_NUMA in load_balance.

Fine. This form is also nearer to the codingstyle rule: "functions
should do only one thing" (I'm reading those more carefully now ;-)

> Anyway, I perf tested this, and it comes out more or less the same as
> the tuned version I was poking at last night (ie best of the bunch).
> Looks pretty good to me.

Great!

> PS. The fourth patch was so small, and touching the same stuff as 3
> that I rolled it into the third one here. Seems like a universal
> benefit ;-)

Yes, it's a much smaller step than patch #5. It would make sense to
have this included right from the start.

Regards,
Erich



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 16:47                             ` Erich Focht
@ 2003-01-16 18:07                               ` Robert Love
  2003-01-16 18:48                                 ` Martin J. Bligh
  2003-01-16 19:07                                 ` Ingo Molnar
  0 siblings, 2 replies; 96+ messages in thread
From: Robert Love @ 2003-01-16 18:07 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Michael Hohnbaum, Andrew Theurer, Ingo Molnar,
	linux-kernel, lse-tech, Christoph Hellwig

On Thu, 2003-01-16 at 11:47, Erich Focht wrote:

> Fine. This form is also nearer to the codingstyle rule: "functions
> should do only one thing" (I'm reading those more carefully now ;-)

Good ;)

This is looking good.  Thanks hch for going over it with your fine tooth
comb.

Erich and Martin, what more needs to be done prior to inclusion?  Do you
still want an exec balancer in place?

	Robert Love



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 18:07                               ` Robert Love
@ 2003-01-16 18:48                                 ` Martin J. Bligh
  2003-01-16 19:07                                 ` Ingo Molnar
  1 sibling, 0 replies; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-16 18:48 UTC (permalink / raw)
  To: Robert Love, Erich Focht
  Cc: Michael Hohnbaum, Andrew Theurer, Ingo Molnar, linux-kernel,
	lse-tech, Christoph Hellwig

> On Thu, 2003-01-16 at 11:47, Erich Focht wrote:
> 
>> Fine. This form is also nearer to the codingstyle rule: "functions
>> should do only one thing" (I'm reading those more carefully now ;-)
> 
> Good ;)
> 
> This is looking good.  Thanks hch for going over it with your fine tooth
> comb.
> 
> Erich and Martin, what more needs to be done prior to inclusion?  Do you
> still want an exec balancer in place?

Yup, that's in patch 2 already. I just pushed it ... will see what
happens.

M.



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 19:07                                 ` Ingo Molnar
@ 2003-01-16 18:59                                   ` Martin J. Bligh
  2003-01-16 19:10                                   ` Christoph Hellwig
  1 sibling, 0 replies; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-16 18:59 UTC (permalink / raw)
  To: Ingo Molnar, Robert Love
  Cc: Erich Focht, Michael Hohnbaum, Andrew Theurer, linux-kernel,
	lse-tech, Christoph Hellwig

> well, it needs to settle down a bit more, we are technically in a
> codefreeze :-)

The current codeset is *completely* non-invasive to non-NUMA systems.
It can't break anything.

M.



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 18:07                               ` Robert Love
  2003-01-16 18:48                                 ` Martin J. Bligh
@ 2003-01-16 19:07                                 ` Ingo Molnar
  2003-01-16 18:59                                   ` Martin J. Bligh
  2003-01-16 19:10                                   ` Christoph Hellwig
  1 sibling, 2 replies; 96+ messages in thread
From: Ingo Molnar @ 2003-01-16 19:07 UTC (permalink / raw)
  To: Robert Love
  Cc: Erich Focht, Martin J. Bligh, Michael Hohnbaum, Andrew Theurer,
	linux-kernel, lse-tech, Christoph Hellwig


On 16 Jan 2003, Robert Love wrote:

> > Fine. This form is also nearer to the codingstyle rule: "functions
> > should do only one thing" (I'm reading those more carefully now ;-)
> 
> Good ;)
> 
> This is looking good.  Thanks hch for going over it with your fine tooth
> comb.
> 
> Erich and Martin, what more needs to be done prior to inclusion?  Do you
> still want an exec balancer in place?

well, it needs to settle down a bit more, we are technically in a
codefreeze :-)

	Ingo



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 19:07                                 ` Ingo Molnar
  2003-01-16 18:59                                   ` Martin J. Bligh
@ 2003-01-16 19:10                                   ` Christoph Hellwig
  2003-01-16 19:44                                     ` Ingo Molnar
  1 sibling, 1 reply; 96+ messages in thread
From: Christoph Hellwig @ 2003-01-16 19:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Robert Love, Erich Focht, Martin J. Bligh, Michael Hohnbaum,
	Andrew Theurer, linux-kernel, lse-tech

On Thu, Jan 16, 2003 at 08:07:09PM +0100, Ingo Molnar wrote:
> well, it needs to settle down a bit more, we are technically in a
> codefreeze :-)

We're in feature freeze.  Not sure whether fixing the scheduler for
one type of hardware supported by Linux is a feature 8)

Anyway, patch 1 should certainly be merged ASAP; for the others we can wait
a bit more to settle, but I don't think it's really worth the wait.



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 19:44                                     ` Ingo Molnar
@ 2003-01-16 19:43                                       ` Martin J. Bligh
  2003-01-16 20:19                                         ` Ingo Molnar
  2003-01-16 19:44                                       ` John Bradford
  1 sibling, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-16 19:43 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: Robert Love, Erich Focht, Michael Hohnbaum, Andrew Theurer,
	linux-kernel, lse-tech

> complex. It's the one that is aware of the global scheduling picture. For
> NUMA i'd suggest two asynchronous frequencies: one intra-node frequency,
> and an inter-node frequency - configured by the architecture and roughly
> in the same proportion to each other as cachemiss latencies.

That's exactly what's in the latest set of patches - admittedly it's a
multiplier of when we run load_balance, not the tick multiplier, but 
that's very easy to fix. Can you check out the stuff I posted last night?
I think it's somewhat cleaner ...

M.



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 19:10                                   ` Christoph Hellwig
@ 2003-01-16 19:44                                     ` Ingo Molnar
  2003-01-16 19:43                                       ` Martin J. Bligh
  2003-01-16 19:44                                       ` John Bradford
  0 siblings, 2 replies; 96+ messages in thread
From: Ingo Molnar @ 2003-01-16 19:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Robert Love, Erich Focht, Martin J. Bligh, Michael Hohnbaum,
	Andrew Theurer, linux-kernel, lse-tech


On Thu, 16 Jan 2003, Christoph Hellwig wrote:

> > well, it needs to settle down a bit more, we are technically in a
> > codefreeze :-)
> 
> We're in feature freeze.  Not sure whether fixing the scheduler for one
> type of hardware supported by Linux is a feature 8)
> 
> Anyway, patch 1 should certainly merged ASAP, for the other we can wait
> a bit more to settle, but I don't think it's really worth the wait.

agreed, the patch is unintrusive, but by settling down i mean things like
this:

+/* XXX(hch): this should go into a struct sched_node_data */

should be decided one way or another.

i'm also not quite happy about the conceptual background of
rq->nr_balanced. This load-balancing rate-limit is arbitrary and not
neutral at all. The way this should be done is to move the inter-node
balancing conditional out of load_balance(), and only trigger it from the
timer interrupt, with a given rate. On basically all NUMA hardware i
suspect it's much better to do inter-node balancing only on a very slow
scale. Making it dependent on an arbitrary portion of the idle-CPU
rebalancing act makes the frequency of inter-node rebalancing almost
arbitrarily high.

ie. there are two basic types of rebalancing acts in multiprocessor
environments: 'synchronous balancing' and 'asynchronous balancing'.  
Synchronous balancing is done whenever a CPU runs idle - this can happen
at a very high rate, so it needs to be low overhead and unintrusive. This
was already so when i did the SMP balancer. The asynchronous balancing
component (currently directly triggered from every CPU's own timer
interrupt), has a fixed frequency, and thus can be almost arbitrarily
complex. It's the one that is aware of the global scheduling picture. For
NUMA i'd suggest two asynchronous frequencies: one intra-node frequency,
and an inter-node frequency - configured by the architecture and roughly
in the same proportion to each other as cachemiss latencies.

(this all means that unless there's empirical data showing the opposite,
->nr_balanced can be removed completely, and fixed frequency balancing can
be done from the timer tick. This should further simplify the patch.)

	Ingo



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 19:44                                     ` Ingo Molnar
  2003-01-16 19:43                                       ` Martin J. Bligh
@ 2003-01-16 19:44                                       ` John Bradford
  1 sibling, 0 replies; 96+ messages in thread
From: John Bradford @ 2003-01-16 19:44 UTC (permalink / raw)
  To: mingo
  Cc: hch, rml, efocht, mbligh, hohnbaum, habanero, linux-kernel, lse-tech

> > > well, it needs to settle down a bit more, we are technically in a
> > > codefreeze :-)
> > 
> > We're in feature freeze.  Not sure whether fixing the scheduler for one
> > type of hardware supported by Linux is a feature 8)

Yes, we are definitely _not_ in a code freeze yet, and I doubt that we
will be for at least a few months.

John.


* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 19:43                                       ` Martin J. Bligh
@ 2003-01-16 20:19                                         ` Ingo Molnar
  2003-01-16 20:29                                           ` [Lse-tech] " Rick Lindsley
                                                             ` (3 more replies)
  0 siblings, 4 replies; 96+ messages in thread
From: Ingo Molnar @ 2003-01-16 20:19 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Christoph Hellwig, Robert Love, Erich Focht, Michael Hohnbaum,
	Andrew Theurer, linux-kernel, lse-tech


On Thu, 16 Jan 2003, Martin J. Bligh wrote:

> > complex. It's the one that is aware of the global scheduling picture. For
> > NUMA i'd suggest two asynchronous frequencies: one intra-node frequency,
> > and an inter-node frequency - configured by the architecture and roughly
> > in the same proportion to each other as cachemiss latencies.
> 
> That's exactly what's in the latest set of patches - admittedly it's a
> multiplier of when we run load_balance, not the tick multiplier, but
> that's very easy to fix. Can you check out the stuff I posted last
> night? I think it's somewhat cleaner ...

yes, i saw it, it has the same tying between idle-CPU-rebalance and
inter-node rebalance, as Erich's patch. You've put it into
cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
coupled balancing act. There are two synchronous balancing acts currently:
the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
must involve any 'heavy' balancing, only local balancing. The inter-node
balancing (which is heavier than even the global SMP balancer), should
never be triggered from the high-frequency path. [whether it's high
frequency or not depends on the actual workload, but it can be potentially
_very_ high frequency, easily on the order of 1 million times a second -
then you'll call the inter-node balancer 100K times a second.]
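
To put rough numbers on that coupling (the per-second rates and the
fixed 400 ms period below are illustrative values, not measurements):

/* gcc -o rates rates.c && ./rates */
#include <stdio.h>

int main(void)
{
	/* counter-coupled: cross-node once per NODE_BALANCE_RATE local calls */
	double local_calls[] = { 100, 10000, 1000000 };	/* load_balance()/sec */
	int node_balance_rate = 10;	/* the generic default in the patch */
	int i;

	for (i = 0; i < 3; i++)
		printf("%8.0f local calls/s -> %6.0f cross-node calls/s\n",
		       local_calls[i], local_calls[i] / node_balance_rate);

	/* timer-driven: fixed period, independent of how often CPUs go idle */
	printf("fixed 400 ms period      -> %6.1f cross-node calls/s\n",
	       1000.0 / 400);
	return 0;
}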

I'd strongly suggest to decouple the heavy NUMA load-balancing code from
the fastpath and re-check the benchmark numbers.

	Ingo

(*) whether sched_balance_exec() is a high-frequency path or not is up to
debate. Right now it's not possible to get much more than a couple of
thousand exec()'s per second on fast CPUs. Hopefully that will change in
the future though, so exec() events could become really fast. So i'd
suggest to only do local (ie. SMP-alike) balancing in the exec() path, and
only do NUMA cross-node balancing with a fixed frequency, from the timer
tick. But exec()-time is really special, since the user task usually has
zero cached state at this point, so we _can_ do cheap cross-node balancing
as well. So it's a boundary thing - probably doing the full-blown
balancing is the right thing.



* Re: [Lse-tech] Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 20:19                                         ` Ingo Molnar
@ 2003-01-16 20:29                                           ` Rick Lindsley
  2003-01-16 23:31                                           ` Martin J. Bligh
                                                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 96+ messages in thread
From: Rick Lindsley @ 2003-01-16 20:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin J. Bligh, Christoph Hellwig, Robert Love, Erich Focht,
	Michael Hohnbaum, Andrew Theurer, linux-kernel, lse-tech

    [whether it's high
    frequency or not depends on the actual workload, but it can be potentially
    _very_ high frequency, easily on the order of 1 million times a second -
    then you'll call the inter-node balancer 100K times a second.]

If this is due to thread creation/death, though, you might want this level
of inter-node balancing (or at least checking).  It could represent a lot
of fork/execs that are now overloading one or more nodes. Is it reasonable
to expect this sort of load on a relatively proc/thread-stable machine?

Rick


* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 20:19                                         ` Ingo Molnar
  2003-01-16 20:29                                           ` [Lse-tech] " Rick Lindsley
@ 2003-01-16 23:31                                           ` Martin J. Bligh
  2003-01-17  7:23                                             ` Ingo Molnar
  2003-01-17  8:47                                             ` [patch] sched-2.5.59-A2 Ingo Molnar
  2003-01-16 23:45                                           ` [PATCH 2.5.58] new NUMA scheduler: fix Michael Hohnbaum
  2003-01-17 11:10                                           ` Erich Focht
  3 siblings, 2 replies; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-16 23:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Robert Love, Erich Focht, Michael Hohnbaum,
	Andrew Theurer, linux-kernel, lse-tech

> yes, i saw it, it has the same tying between idle-CPU-rebalance and
> inter-node rebalance, as Erich's patch. You've put it into
> cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
> coupled balancing act. There are two synchronous balancing acts currently:
> the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
> must involve any 'heavy' balancing, only local balancing. The inter-node

If I understand that correctly (and I'm not sure I do), you're saying
you don't think the exec time balance should go global? That would 
break most of the concept ... *something* has to distribute stuff around
nodes, and the exec point is the cheapest time to do that (least "weight"
to move). I'd like to base it off cached load-averages, rather than going
sniffing every runqueue, but if you mean we should only exec-time
balance inside a node, I disagree. Am I just misunderstanding you?

At the moment, the high-freq balancer is only inside a node. Exec balancing
is global, and the "low-frequency" balancer is global. WRT the idle-time
balancing, I agree with what I *think* you're saying ... this shouldn't
clock up the rq->nr_balanced counter ... this encourages too much cross-node
stealing. I'll hack that change out and see what it does to the numbers.

Would appreciate more feedback on the first paragraph. Thanks,

M.





* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 20:19                                         ` Ingo Molnar
  2003-01-16 20:29                                           ` [Lse-tech] " Rick Lindsley
  2003-01-16 23:31                                           ` Martin J. Bligh
@ 2003-01-16 23:45                                           ` Michael Hohnbaum
  2003-01-17 11:10                                           ` Erich Focht
  3 siblings, 0 replies; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-16 23:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin J. Bligh, Christoph Hellwig, Robert Love, Erich Focht,
	Andrew Theurer, linux-kernel, lse-tech

On Thu, 2003-01-16 at 12:19, Ingo Molnar wrote:
> 
> 	Ingo
> 
> (*) whether sched_balance_exec() is a high-frequency path or not is up to
> debate. Right now it's not possible to get much more than a couple of
> thousand exec()'s per second on fast CPUs. Hopefully that will change in
> the future though, so exec() events could become really fast. So i'd
> suggest to only do local (ie. SMP-alike) balancing in the exec() path, and
> only do NUMA cross-node balancing with a fixed frequency, from the timer
> tick. But exec()-time is really special, since the user task usually has
> zero cached state at this point, so we _can_ do cheap cross-node balancing
> as well. So it's a boundary thing - probably doing the full-blown
> balancing is the right thing.
> 
The reason for doing load balancing on every exec is that, as you say,
it is cheap to do the balancing at this point - no cache state, minimal
memory allocations.  If we did not balance at this point and relied on
balancing from the timer tick, there would be much more movement of 
established processes between nodes, which is expensive.  Ideally, the
exec balancing is good enough so that on a well functioning system there
is no inter-node balancing taking place at the timer tick.  Our testing
has shown that the exec load balance code does a very good job of
spreading processes across nodes, and thus, very little timer tick
balancing.  The timer tick internode balancing is there as a safety
valve for those cases where exec balancing is not adequate.  Workloads
with long running processes, and workloads with processes that do lots
of forks but not execs, are examples.

An earlier version of Erich's initial load balancer provided for
the option of balancing at either fork or exec, and the capability
of selecting on a per-process basis which to use.  That could be
used if workloads that do forks with no execs become a problem on
NUMA boxes.

-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 23:31                                           ` Martin J. Bligh
@ 2003-01-17  7:23                                             ` Ingo Molnar
  2003-01-17  8:47                                             ` [patch] sched-2.5.59-A2 Ingo Molnar
  1 sibling, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2003-01-17  7:23 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Christoph Hellwig, Robert Love, Erich Focht, Michael Hohnbaum,
	Andrew Theurer, linux-kernel, lse-tech


On Thu, 16 Jan 2003, Martin J. Bligh wrote:

> If I understand that correctly (and I'm not sure I do), you're saying
> you don't think the exec time balance should go global? That would break
> most of the concept ... *something* has to distribute stuff around
> nodes, and the exec point is the cheapest time to do that (least
> "weight" to move. [...]

the exec()-time balancing is special, since it only moves the task in
question - so the 'push' should indeed be a global decision. _But_, exec()  
is also a natural balancing point for the local node (we potentially just
got rid of a task, which might create imbalance within the node), so it
might make sense to do a 'local' balancing run as well, if the exec()-ing
task was indeed pushed to another node.

> At the moment, the high-freq balancer is only inside a node. Exec
> balancing is global, and the "low-frequency" balancer is global. WRT the
> idle-time balancing, I agree with what I *think* you're saying ... this
> shouldn't clock up the rq->nr_balanced counter ... this encourages too
> much cross-node stealing. I'll hack that change out and see what it does
> to the numbers.

yes, this should also further unify the SMP and NUMA balancing code.

	Ingo



* [patch] sched-2.5.59-A2
  2003-01-16 23:31                                           ` Martin J. Bligh
  2003-01-17  7:23                                             ` Ingo Molnar
@ 2003-01-17  8:47                                             ` Ingo Molnar
  2003-01-17 14:35                                               ` Erich Focht
  1 sibling, 1 reply; 96+ messages in thread
From: Ingo Molnar @ 2003-01-17  8:47 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Christoph Hellwig, Robert Love, Erich Focht, Michael Hohnbaum,
	Andrew Theurer, Linus Torvalds, linux-kernel, lse-tech


Martin, Erich,

could you give the attached patch a go, it implements my
cleanup/reorganization suggestions ontop of 2.5.59:

 - decouple the 'slow' rebalancer from the 'fast' rebalancer and attach it 
   to the scheduler tick. Remove rq->nr_balanced.

 - do idle-rebalancing every 1 msec, intra-node rebalancing every 200 
   msecs and inter-node rebalancing every 400 msecs.

 - move the tick-based rebalancing logic into rebalance_tick(), it looks
   more organized this way and we have all related code in one spot.

 - clean up the topology.h include file structure. Since generic kernel
   code uses all the defines already, there's no reason to keep them in
   asm-generic.h. I've created a linux/topology.h file that includes
   asm/topology.h and takes care of the default and generic definitions.
   Moved over a generic topology definition from mmzone.h.

 - renamed rq->prev_nr_running[] to rq->prev_cpu_load[] - this further
   unifies the SMP and NUMA balancers and is more in line with the
   prev_node_load name.

If performance drops due to this patch then the benchmarking goal would be
to tune the following frequencies:

 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
 #define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
 #define NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)

In theory NODE_REBALANCE_TICK could be defined by the NUMA platform,
although in the past such per-platform scheduler tunings used to end
'orphaned' after some time. 400 msecs is pretty conservative at the
moment, it could be made more frequent if benchmark results support it.
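
As a quick sanity check of what those intervals mean in real time,
here's a tiny userspace simulation of one second of scheduler ticks
(HZ=1000 is only assumed for the example; the real rebalance_tick() in
the patch does the balancing work instead of counting events):

/* gcc -o ticks ticks.c && ./ticks   (the ?: form is a GNU extension) */
#include <stdio.h>

#define HZ 1000
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)		/* ~1 msec   */
#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)			/* ~200 msec */
#define NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)	/* ~400 msec */

int main(void)
{
	unsigned long j;
	int idle = 0, busy = 0, node = 0;

	for (j = 1; j <= HZ; j++) {		/* one second of ticks */
		if (!(j % IDLE_REBALANCE_TICK))
			idle++;			/* would rebalance an idle CPU */
		if (!(j % BUSY_REBALANCE_TICK))
			busy++;			/* intra-node rebalance */
		if (!(j % NODE_REBALANCE_TICK))
			node++;			/* inter-node rebalance */
	}
	printf("per second: idle=%d intra-node=%d inter-node=%d\n",
	       idle, busy, node);
	return 0;
}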

the patch compiles and boots on UP and SMP, it compiles on x86-NUMA.

	Ingo

--- linux/drivers/base/cpu.c.orig	2003-01-17 10:02:19.000000000 +0100
+++ linux/drivers/base/cpu.c	2003-01-17 10:02:25.000000000 +0100
@@ -6,8 +6,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/cpu.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 
 static int cpu_add_device(struct device * dev)
--- linux/drivers/base/node.c.orig	2003-01-17 10:02:50.000000000 +0100
+++ linux/drivers/base/node.c	2003-01-17 10:03:03.000000000 +0100
@@ -7,8 +7,7 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 
 static int node_add_device(struct device * dev)
--- linux/drivers/base/memblk.c.orig	2003-01-17 10:02:33.000000000 +0100
+++ linux/drivers/base/memblk.c	2003-01-17 10:02:38.000000000 +0100
@@ -7,8 +7,7 @@
 #include <linux/init.h>
 #include <linux/memblk.h>
 #include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 
 static int memblk_add_device(struct device * dev)
--- linux/include/asm-generic/topology.h.orig	2003-01-17 09:49:38.000000000 +0100
+++ linux/include/asm-generic/topology.h	2003-01-17 10:02:08.000000000 +0100
@@ -1,56 +0,0 @@
-/*
- * linux/include/asm-generic/topology.h
- *
- * Written by: Matthew Dobson, IBM Corporation
- *
- * Copyright (C) 2002, IBM Corp.
- *
- * All rights reserved.          
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT.  See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to <colpatch@us.ibm.com>
- */
-#ifndef _ASM_GENERIC_TOPOLOGY_H
-#define _ASM_GENERIC_TOPOLOGY_H
-
-/* Other architectures wishing to use this simple topology API should fill
-   in the below functions as appropriate in their own <asm/topology.h> file. */
-#ifndef __cpu_to_node
-#define __cpu_to_node(cpu)		(0)
-#endif
-#ifndef __memblk_to_node
-#define __memblk_to_node(memblk)	(0)
-#endif
-#ifndef __parent_node
-#define __parent_node(node)		(0)
-#endif
-#ifndef __node_to_first_cpu
-#define __node_to_first_cpu(node)	(0)
-#endif
-#ifndef __node_to_cpu_mask
-#define __node_to_cpu_mask(node)	(cpu_online_map)
-#endif
-#ifndef __node_to_memblk
-#define __node_to_memblk(node)		(0)
-#endif
-
-/* Cross-node load balancing interval. */
-#ifndef NODE_BALANCE_RATE
-#define NODE_BALANCE_RATE 10
-#endif
-
-#endif /* _ASM_GENERIC_TOPOLOGY_H */
--- linux/include/asm-ppc64/topology.h.orig	2003-01-17 09:54:46.000000000 +0100
+++ linux/include/asm-ppc64/topology.h	2003-01-17 09:55:18.000000000 +0100
@@ -46,18 +46,6 @@
 	return mask;
 }
 
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
-#else /* !CONFIG_NUMA */
-
-#define __cpu_to_node(cpu)		(0)
-#define __memblk_to_node(memblk)	(0)
-#define __parent_node(nid)		(0)
-#define __node_to_first_cpu(node)	(0)
-#define __node_to_cpu_mask(node)	(cpu_online_map)
-#define __node_to_memblk(node)		(0)
-
 #endif /* CONFIG_NUMA */
 
 #endif /* _ASM_PPC64_TOPOLOGY_H */
--- linux/include/linux/topology.h.orig	2003-01-17 09:57:20.000000000 +0100
+++ linux/include/linux/topology.h	2003-01-17 10:09:38.000000000 +0100
@@ -0,0 +1,43 @@
+/*
+ * linux/include/linux/topology.h
+ */
+#ifndef _LINUX_TOPOLOGY_H
+#define _LINUX_TOPOLOGY_H
+
+#include <asm/topology.h>
+
+/*
+ * The default (non-NUMA) topology definitions:
+ */
+#ifndef __cpu_to_node
+#define __cpu_to_node(cpu)		(0)
+#endif
+#ifndef __memblk_to_node
+#define __memblk_to_node(memblk)	(0)
+#endif
+#ifndef __parent_node
+#define __parent_node(node)		(0)
+#endif
+#ifndef __node_to_first_cpu
+#define __node_to_first_cpu(node)	(0)
+#endif
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu)		(cpu_online_map)
+#endif
+#ifndef __node_to_cpu_mask
+#define __node_to_cpu_mask(node)		(cpu_online_map)
+#endif
+#ifndef __node_to_memblk
+#define __node_to_memblk(node)		(0)
+#endif
+
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu)		__node_to_cpu_mask(__cpu_to_node(cpu))
+#endif
+
+/*
+ * Generic definitions:
+ */
+#define numa_node_id()			(__cpu_to_node(smp_processor_id()))
+
+#endif /* _LINUX_TOPOLOGY_H */
--- linux/include/linux/mmzone.h.orig	2003-01-17 09:58:20.000000000 +0100
+++ linux/include/linux/mmzone.h	2003-01-17 10:01:17.000000000 +0100
@@ -255,9 +255,7 @@
 #define MAX_NR_MEMBLKS	1
 #endif /* CONFIG_NUMA */
 
-#include <asm/topology.h>
-/* Returns the number of the current Node. */
-#define numa_node_id()		(__cpu_to_node(smp_processor_id()))
+#include <linux/topology.h>
 
 #ifndef CONFIG_DISCONTIGMEM
 extern struct pglist_data contig_page_data;
--- linux/include/asm-ia64/topology.h.orig	2003-01-17 09:54:33.000000000 +0100
+++ linux/include/asm-ia64/topology.h	2003-01-17 09:54:38.000000000 +0100
@@ -60,7 +60,4 @@
  */
 #define __node_to_memblk(node) (node)
 
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
 #endif /* _ASM_IA64_TOPOLOGY_H */
--- linux/include/asm-i386/topology.h.orig	2003-01-17 09:55:28.000000000 +0100
+++ linux/include/asm-i386/topology.h	2003-01-17 09:56:27.000000000 +0100
@@ -61,17 +61,6 @@
 /* Returns the number of the first MemBlk on Node 'node' */
 #define __node_to_memblk(node) (node)
 
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 100
-
-#else /* !CONFIG_NUMA */
-/*
- * Other i386 platforms should define their own version of the 
- * above macros here.
- */
-
-#include <asm-generic/topology.h>
-
 #endif /* CONFIG_NUMA */
 
 #endif /* _ASM_I386_TOPOLOGY_H */
--- linux/include/asm-i386/cpu.h.orig	2003-01-17 10:03:22.000000000 +0100
+++ linux/include/asm-i386/cpu.h	2003-01-17 10:03:31.000000000 +0100
@@ -3,8 +3,8 @@
 
 #include <linux/device.h>
 #include <linux/cpu.h>
+#include <linux/topology.h>
 
-#include <asm/topology.h>
 #include <asm/node.h>
 
 struct i386_cpu {
--- linux/include/asm-i386/node.h.orig	2003-01-17 10:04:02.000000000 +0100
+++ linux/include/asm-i386/node.h	2003-01-17 10:04:08.000000000 +0100
@@ -4,8 +4,7 @@
 #include <linux/device.h>
 #include <linux/mmzone.h>
 #include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 struct i386_node {
 	struct node node;
--- linux/include/asm-i386/memblk.h.orig	2003-01-17 10:03:51.000000000 +0100
+++ linux/include/asm-i386/memblk.h	2003-01-17 10:03:56.000000000 +0100
@@ -4,8 +4,8 @@
 #include <linux/device.h>
 #include <linux/mmzone.h>
 #include <linux/memblk.h>
+#include <linux/topology.h>
 
-#include <asm/topology.h>
 #include <asm/node.h>
 
 struct i386_memblk {
--- linux/kernel/sched.c.orig	2003-01-17 09:22:24.000000000 +0100
+++ linux/kernel/sched.c	2003-01-17 10:29:53.000000000 +0100
@@ -153,10 +153,9 @@
 			nr_uninterruptible;
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
-	int prev_nr_running[NR_CPUS];
+	int prev_cpu_load[NR_CPUS];
 #ifdef CONFIG_NUMA
 	atomic_t *node_nr_running;
-	unsigned int nr_balanced;
 	int prev_node_load[MAX_NUMNODES];
 #endif
 	task_t *migration_thread;
@@ -765,29 +764,6 @@
 	return node;
 }
 
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
-	int this_node = __cpu_to_node(this_cpu);
-	/*
-	 * Avoid rebalancing between nodes too often.
-	 * We rebalance globally once every NODE_BALANCE_RATE load balances.
-	 */
-	if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
-		int node = find_busiest_node(this_node);
-		this_rq->nr_balanced = 0;
-		if (node >= 0)
-			return (__node_to_cpu_mask(node) | (1UL << this_cpu));
-	}
-	return __node_to_cpu_mask(this_node);
-}
-
-#else /* !CONFIG_NUMA */
-
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
-	return cpu_online_map;
-}
-
 #endif /* CONFIG_NUMA */
 
 #if CONFIG_SMP
@@ -807,10 +783,10 @@
 			spin_lock(&busiest->lock);
 			spin_lock(&this_rq->lock);
 			/* Need to recalculate nr_running */
-			if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+			if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
 				nr_running = this_rq->nr_running;
 			else
-				nr_running = this_rq->prev_nr_running[this_cpu];
+				nr_running = this_rq->prev_cpu_load[this_cpu];
 		} else
 			spin_lock(&busiest->lock);
 	}
@@ -847,10 +823,10 @@
 	 * that case we are less picky about moving a task across CPUs and
 	 * take what can be taken.
 	 */
-	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+	if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
 		nr_running = this_rq->nr_running;
 	else
-		nr_running = this_rq->prev_nr_running[this_cpu];
+		nr_running = this_rq->prev_cpu_load[this_cpu];
 
 	busiest = NULL;
 	max_load = 1;
@@ -859,11 +835,11 @@
 			continue;
 
 		rq_src = cpu_rq(i);
-		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+		if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
 			load = rq_src->nr_running;
 		else
-			load = this_rq->prev_nr_running[i];
-		this_rq->prev_nr_running[i] = rq_src->nr_running;
+			load = this_rq->prev_cpu_load[i];
+		this_rq->prev_cpu_load[i] = rq_src->nr_running;
 
 		if ((load > max_load) && (rq_src != this_rq)) {
 			busiest = rq_src;
@@ -922,7 +898,7 @@
  * We call this with the current runqueue locked,
  * irqs disabled.
  */
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int idle, unsigned long cpumask)
 {
 	int imbalance, idx, this_cpu = smp_processor_id();
 	runqueue_t *busiest;
@@ -930,8 +906,7 @@
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
-					cpus_to_balance(this_cpu, this_rq));
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
 	if (!busiest)
 		goto out;
 
@@ -1006,21 +981,61 @@
  * frequency and balancing agressivity depends on whether the CPU is
  * idle or not.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
+ *
+ * On NUMA, do a node-rebalance every 400 msecs.
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
 
-static inline void idle_tick(runqueue_t *rq)
+static void rebalance_tick(runqueue_t *this_rq, int idle)
 {
-	if (jiffies % IDLE_REBALANCE_TICK)
+	int this_cpu = smp_processor_id();
+	unsigned long j = jiffies, cpumask, this_cpumask = 1UL << this_cpu;
+
+	if (idle) {
+		if (!(j % IDLE_REBALANCE_TICK)) {
+			spin_lock(&this_rq->lock);
+			load_balance(this_rq, 1, this_cpumask);
+			spin_unlock(&this_rq->lock);
+		}
 		return;
-	spin_lock(&rq->lock);
-	load_balance(rq, 1);
-	spin_unlock(&rq->lock);
+	}
+	/*
+	 * First do inter-node rebalancing, then intra-node rebalancing,
+	 * if both events happen in the same tick. The inter-node
+	 * rebalancing does not necessarily have to create a perfect
+	 * balance within the node, since we load-balance the most loaded
+	 * node with the current CPU. (ie. other CPUs in the local node
+	 * are not balanced.)
+	 */
+#if CONFIG_NUMA
+	if (!(j % NODE_REBALANCE_TICK)) {
+		int node = find_busiest_node(__cpu_to_node(this_cpu));
+		if (node >= 0) {
+			cpumask = __node_to_cpu_mask(node) | this_cpumask;
+			spin_lock(&this_rq->lock);
+			load_balance(this_rq, 1, cpumask);
+			spin_unlock(&this_rq->lock);
+		}
+	}
+#endif
+	if (!(j % BUSY_REBALANCE_TICK)) {
+		cpumask = __cpu_to_node_mask(this_cpu);
+		spin_lock(&this_rq->lock);
+		load_balance(this_rq, 1, cpumask);
+		spin_unlock(&this_rq->lock);
+	}
+}
+#else
+/*
+ * on UP we do not need to balance between CPUs:
+ */
+static inline void rebalance_tick(runqueue_t *this_rq, int idle)
+{
 }
-
 #endif
 
 DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
@@ -1063,9 +1078,7 @@
 			kstat_cpu(cpu).cpustat.iowait += sys_ticks;
 		else
 			kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
-		idle_tick(rq);
-#endif
+		rebalance_tick(rq, 1);
 		return;
 	}
 	if (TASK_NICE(p) > 0)
@@ -1121,11 +1134,8 @@
 			enqueue_task(p, rq->active);
 	}
 out:
-#if CONFIG_SMP
-	if (!(jiffies % BUSY_REBALANCE_TICK))
-		load_balance(rq, 0);
-#endif
 	spin_unlock(&rq->lock);
+	rebalance_tick(rq, 0);
 }
 
 void scheduling_functions_start_here(void) { }
@@ -1184,7 +1194,7 @@
 pick_next_task:
 	if (unlikely(!rq->nr_running)) {
 #if CONFIG_SMP
-		load_balance(rq, 1);
+		load_balance(rq, 1, __cpu_to_node_mask(smp_processor_id()));
 		if (rq->nr_running)
 			goto pick_next_task;
 #endif
--- linux/mm/page_alloc.c.orig	2003-01-17 10:01:29.000000000 +0100
+++ linux/mm/page_alloc.c	2003-01-17 10:01:35.000000000 +0100
@@ -28,8 +28,7 @@
 #include <linux/blkdev.h>
 #include <linux/slab.h>
 #include <linux/notifier.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 DECLARE_BITMAP(node_online_map, MAX_NUMNODES);
 DECLARE_BITMAP(memblk_online_map, MAX_NR_MEMBLKS);
--- linux/mm/vmscan.c.orig	2003-01-17 10:01:44.000000000 +0100
+++ linux/mm/vmscan.c	2003-01-17 10:01:52.000000000 +0100
@@ -27,10 +27,10 @@
 #include <linux/pagevec.h>
 #include <linux/backing-dev.h>
 #include <linux/rmap-locking.h>
+#include <linux/topology.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
-#include <asm/topology.h>
 #include <asm/div64.h>
 
 #include <linux/swapops.h>



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-16 20:19                                         ` Ingo Molnar
                                                             ` (2 preceding siblings ...)
  2003-01-16 23:45                                           ` [PATCH 2.5.58] new NUMA scheduler: fix Michael Hohnbaum
@ 2003-01-17 11:10                                           ` Erich Focht
  2003-01-17 14:07                                             ` Ingo Molnar
  3 siblings, 1 reply; 96+ messages in thread
From: Erich Focht @ 2003-01-17 11:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin J. Bligh, Christoph Hellwig, Robert Love,
	Michael Hohnbaum, Andrew Theurer, linux-kernel, lse-tech,
	Linus Torvalds

Hi Ingo,

On Thursday 16 January 2003 21:19, Ingo Molnar wrote:
> On Thu, 16 Jan 2003, Martin J. Bligh wrote:
> > > complex. It's the one that is aware of the global scheduling picture.
> > > For NUMA i'd suggest two asynchronous frequencies: one intra-node
> > > frequency, and an inter-node frequency - configured by the architecture
> > > and roughly in the same proportion to each other as cachemiss
> > > latencies.
> >
> > That's exactly what's in the latest set of patches - admittedly it's a
> > multiplier of when we run load_balance, not the tick multiplier, but
> > that's very easy to fix. Can you check out the stuff I posted last
> > night? I think it's somewhat cleaner ...
>
> yes, i saw it, it has the same tying between idle-CPU-rebalance and
> inter-node rebalance, as Erich's patch. You've put it into
> cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
> coupled balancing act. There are two synchronous balancing acts currently:
> the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
> must involve any 'heavy' balancing, only local balancing.


I prefer a single point of entry called load_balance() to multiple
functionally different balancers. The reason is that the latter choice
might lead to balancers competing or working against each other - not
now, but the design could invite such developments later.

But the other main reason for calling the cross-node balancer after
NODE_BALANCE_RATE calls to the intra-node balancer (what you call
synchronous balancing) is performance:

Davide Libenzi showed quite a while ago that one benefits a lot if
idle CPUs stay idle for only a rather short time. IIRC, his conclusion
for the multi-queue scheduler was that something on the order of 10ms
is long enough: below that you start feeling the balancing overhead,
above it you waste useful cycles.

On a NUMA system this is even more important: the longer you leave
fresh tasks on an overloaded node, the more probable it is that they
allocate their memory there. And then they will run with poor
performance on the node which stayed idle for 200-400ms before
stealing them. So one wastes 200-400ms on each CPU of the idle node
and at the end gets tasks which perform poorly anyway. If the tasks
are "old", at least we didn't waste too much time being idle. The
long-term target should be that tasks remember where their
memory is and return to that node.

> The inter-node
> balancing (which is heavier than even the global SMP balancer), should
> never be triggered from the high-frequency path.

Hmmm, we made it really slim. Actually the cross-node balancing might
even be cheaper than the global SMP balancer:
- it first loops over the nodes (loop length 4 on a 16 CPU NUMA-Q & Azusa)
- then it loops over the cpumask of the most loaded node + the current
CPU (loop length 5 on a NUMA-Q & Azusa).
This has to be compared with the loop length of 16 when doing the
global SMP rebalance. The additional work done for averaging is
minimal. The more nodes, the cheaper the NUMA cross-node balancing
compared to the global SMP balancing.
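
To put rough numbers on that comparison, here is a toy, userspace-only
model of the two scan lengths (assuming the 4-node, 16-CPU NUMA-Q/Azusa
layout mentioned above; this is only an illustration, not kernel code):

#include <stdio.h>

/* Model of the two scan lengths discussed above. */
int main(void)
{
        int nodes = 4, cpus_per_node = 4;
        int total_cpus = nodes * cpus_per_node;

        /* cross-node: first scan the per-node load array ... */
        int node_scan = nodes;
        /* ... then scan the busiest node's cpumask plus the current CPU */
        int queue_scan = cpus_per_node + 1;

        printf("NUMA cross-node balance: %d + %d = %d loop iterations\n",
               node_scan, queue_scan, node_scan + queue_scan);
        printf("global SMP balance:      %d loop iterations\n", total_cpus);
        return 0;
}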

Besides: the CPU is idle anyway! So who cares whether it just
unsuccessfully scans its own empty node or looks at the other nodes
from time to time? It does this locklessly and doesn't modify any
variables in other runqueues, so it doesn't create cache misses for
other CPUs.

> [whether it's high
> frequency or not depends on the actual workload, but it can be potentially
> _very_ high frequency, easily on the order of 1 million times a second -
> then you'll call the inter-node balancer 100K times a second.]

You mean because cpu_idle() loops over schedule()? The code is:
        while (1) {
                void (*idle)(void) = pm_idle;
                if (!idle)
                        idle = default_idle;
                irq_stat[smp_processor_id()].idle_timestamp = jiffies;
                while (!need_resched())
                        idle();
                schedule();
        }
So if the CPU is idle, it won't go through schedule(), except we get
an interrupt from time to time... And then, it doesn't really
matter. Or do you want to keep idle CPUs free for serving interrupts?
That could be legitimate, but it is not the typical load I had in
mind and is an issue not related to the NUMA scheduler. But maybe you
have something else in mind that I haven't considered yet.

Under normal conditions the rebalancing I thought about would work the
following way:

Busy CPU:
 - intra-node rebalance every 200ms (interval timer controlled)
 - cross-node rebalance every NODE_BALANCE_RATE*200ms (2s)
 - when about to go idle, rebalance internally or across nodes, 10
 times more often within the node

Idle CPU:
 - intra-node rebalance every 1ms
 - cross-node rebalance every NODE_REBALANCE_RATE * 1ms (10ms)
   This doesn't appear too frequent to me... after all the CPU is
   idle and couldn't steal anything from its own node. (A rough
   sketch of this tick scheme follows below.)
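
A rough, userspace-only sketch of how such a tick scheme could look
(assuming HZ=1000; the macro names and printf() calls below are made up
for illustration, they are not the ones used in the actual patches):

#include <stdio.h>

#define HZ                1000
#define IDLE_LOCAL_TICK   (HZ / 1000)             /* 1 ms   */
#define IDLE_CROSS_TICK   (IDLE_LOCAL_TICK * 10)  /* 10 ms  */
#define BUSY_LOCAL_TICK   (HZ / 5)                /* 200 ms */
#define BUSY_CROSS_TICK   (BUSY_LOCAL_TICK * 10)  /* 2 s    */

/* Decide which rebalance steps would run at a given jiffies value. */
static void rebalance_tick(unsigned long j, int idle)
{
        if (idle) {
                if (!(j % IDLE_CROSS_TICK))
                        printf("%6lu: idle cross-node rebalance\n", j);
                if (!(j % IDLE_LOCAL_TICK))
                        printf("%6lu: idle intra-node rebalance\n", j);
                return;
        }
        if (!(j % BUSY_CROSS_TICK))
                printf("%6lu: busy cross-node rebalance\n", j);
        if (!(j % BUSY_LOCAL_TICK))
                printf("%6lu: busy intra-node rebalance\n", j);
}

int main(void)
{
        unsigned long j;

        /* simulate 3 seconds worth of timer ticks on a busy CPU */
        for (j = 1; j <= 3 * HZ; j++)
                rebalance_tick(j, 0);
        return 0;
}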

I don't insist too much on this design, but I can't see any serious
reasons against it. Of course, the performance should decide.

I'm about to test the two versions under discussion on an NEC Asama
(small configuration with 4 nodes, good memory latency ratio between
nodes (1.6), no node-level cache).

Best regards,
Erich


On Thursday 16 January 2003 21:19, Ingo Molnar wrote:
> On Thu, 16 Jan 2003, Martin J. Bligh wrote:
> > > complex. It's the one that is aware of the global scheduling picture.
> > > For NUMA i'd suggest two asynchronous frequencies: one intra-node
> > > frequency, and an inter-node frequency - configured by the architecture
> > > and roughly in the same proportion to each other as cachemiss
> > > latencies.
> >
> > That's exactly what's in the latest set of patches - admittedly it's a
> > multiplier of when we run load_balance, not the tick multiplier, but
> > that's very easy to fix. Can you check out the stuff I posted last
> > night? I think it's somewhat cleaner ...
>
> yes, i saw it, it has the same tying between idle-CPU-rebalance and
> inter-node rebalance, as Erich's patch. You've put it into
> cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
> coupled balancing act. There are two synchronous balancing acts currently:
> the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
> must involve any 'heavy' balancing, only local balancing. The inter-node
> balancing (which is heavier than even the global SMP balancer), should
> never be triggered from the high-frequency path. [whether it's high
> frequency or not depends on the actual workload, but it can be potentially
> _very_ high frequency, easily on the order of 1 million times a second -
> then you'll call the inter-node balancer 100K times a second.]
>
> I'd strongly suggest to decouple the heavy NUMA load-balancing code from
> the fastpath and re-check the benchmark numbers.
>
> 	Ingo
>
> (*) whether sched_balance_exec() is a high-frequency path or not is up to
> debate. Right now it's not possible to get much more than a couple of
> thousand exec()'s per second on fast CPUs. Hopefully that will change in
> the future though, so exec() events could become really fast. So i'd
> suggest to only do local (ie. SMP-alike) balancing in the exec() path, and
> only do NUMA cross-node balancing with a fixed frequency, from the timer
> tick. But exec()-time is really special, since the user task usually has
> zero cached state at this point, so we _can_ do cheap cross-node balancing
> as well. So it's a boundary thing - probably doing the full-blown
> balancing is the right thing.



* Re: [PATCH 2.5.58] new NUMA scheduler: fix
  2003-01-17 11:10                                           ` Erich Focht
@ 2003-01-17 14:07                                             ` Ingo Molnar
  0 siblings, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2003-01-17 14:07 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Christoph Hellwig, Robert Love,
	Michael Hohnbaum, Andrew Theurer, linux-kernel, lse-tech,
	Linus Torvalds


On Fri, 17 Jan 2003, Erich Focht wrote:

> I prefer a single point of entry called load_balance() to multiple
> functionally different balancers. [...]

agreed - my cleanup patch keeps that property.

> [...] IIRC, his conclusion for the multi-queue scheduler was that
> something on the order of 10ms is long enough: below that you start
> feeling the balancing overhead, above it you waste useful cycles.

this is one reason why we do the idle rebalancing at 1 msec granularity
right now.

> On a NUMA system this is even more important: the longer you leave fresh
> tasks on an overloaded node, the more probable it is that they allocate
> their memory there. And then they will run with poor performance on the
> node which stayed idle for 200-400ms before stealing them. So one wastes
> 200-400ms on each CPU of the idle node and at the end gets tasks which
> perform poorly anyway. If the tasks are "old", at least we didn't waste
> too much time being idle. The long-term target should be that tasks
> remember where their memory is and return to that node.

i'd much rather vote for fork() and exec() time 'pre-balancing' and then
making it quite hard to move a task across nodes.

> > The inter-node balancing (which is heavier than even the global SMP
> > balancer), should never be triggered from the high-frequency path.
> 
> Hmmm, we made it really slim. [...]

this is a misunderstanding. I'm not worried about the algorithmic overhead
_at all_, i'm worried about the effect of too frequent balancing - tasks
being moved between runqueues too often. That has been shown to be a
problem on SMP. The prev_load type of statistic relies on a constant
sampling frequency - it can lead to over-balancing if it's called too often.
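
One way to picture that remark: the prev_cpu_load logic only believes a
remote queue's load if it was also seen at the previous balance call, so
the filter only rejects load spikes shorter than the balancing interval.
A toy, userspace-only model (the min-of-two-samples filter mirrors the
prev_cpu_load hunk in the patch; the load trace is made up):

#include <stdio.h>

/* load = min(current, previous), previous updated on every call */
static int filtered_load(int cur, int *prev)
{
        int load = (cur < *prev) ? cur : *prev;
        *prev = cur;
        return load;
}

int main(void)
{
        /* made-up trace, one entry per millisecond: a 2 ms spike of
         * 4 runnable tasks on an otherwise single-task runqueue */
        int trace[20] = { 1,1,1,1,1, 4,4,1,1,1, 1,1,1,1,1, 1,1,1,1,1 };
        int period, t;

        for (period = 1; period <= 8; period *= 2) {
                int prev = 0, max_seen = 0;

                for (t = 0; t < 20; t += period) {
                        int load = filtered_load(trace[t], &prev);
                        if (load > max_seen)
                                max_seen = load;
                }
                printf("sampled every %d ms: filtered load peaks at %d\n",
                       period, max_seen);
        }
        return 0;
}

Sampled every 1 ms the 2 ms spike passes the filter and would trigger
balancing; at the slower intended rates it is filtered out.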

> So if the CPU is idle, it won't go through schedule(), except we get an
> interrupt from time to time... [...]

(no, it's even better than that, we never leave the idle loop except when
we _know_ that there is scheduling work to be done. Hence the
need_resched() test. But i'm not worried about balancing overhead at all.)

	Ingo



* Re: [patch] sched-2.5.59-A2
  2003-01-17  8:47                                             ` [patch] sched-2.5.59-A2 Ingo Molnar
@ 2003-01-17 14:35                                               ` Erich Focht
  2003-01-17 15:11                                                 ` Ingo Molnar
                                                                   ` (3 more replies)
  0 siblings, 4 replies; 96+ messages in thread
From: Erich Focht @ 2003-01-17 14:35 UTC (permalink / raw)
  To: Ingo Molnar, Martin J. Bligh
  Cc: Christoph Hellwig, Robert Love, Michael Hohnbaum, Andrew Theurer,
	Linus Torvalds, linux-kernel, lse-tech

Hi Ingo,

I tried your patch on the small NEC TX7 I can access easily (8
Itanium2 CPUs on 4 nodes). I actually used your patch on top of the
2.5.52-ia64 kernel, but that shouldn't matter.

I like the cleanup of the topology.h. Also the renaming to
prev_cpu_load. There was a mistake (I think) in the call to
load_balance() in the idle path, guess you wanted to have:
+           load_balance(this_rq, 1, __node_to_cpu_mask(this_node));
instead of
+           load_balance(this_rq, 1, this_cpumask);
otherwise you won't load balance at all for idle cpus.

Here are the results:

kernbench (average of 5 kernel compiles) (standard error in brackets)
---------
           Elapsed       UserTime      SysTime
orig       134.43(1.79)  944.79(0.43)  21.41(0.28)
ingo       136.74(1.58)  951.55(0.73)  21.16(0.32)
ingofix    135.22(0.59)  952.17(0.78)  21.16(0.19)


hackbench (chat benchmark alike) (elapsed time for N groups of 20
---------           senders & receivers, stats from 10 measurements)

          N=10        N=25        N=50        N=100
orig      0.77(0.03)  1.91(0.06)  3.77(0.06)  7.78(0.21)
ingo      1.70(0.35)  3.11(0.47)  4.85(0.55)  8.80(0.98)
ingofix   1.16(0.14)  2.67(0.53)  5.05(0.26)  9.99(0.13)


numabench (N memory intensive tasks running in parallel, disturbed for
---------  a short time by a "hackbench 10" call)


numa_test N=4   ElapsedTime  TotalUserTime  TotalSysTime
orig:           26.13(2.54)  86.10(4.47)    0.09(0.01)
ingo:           27.60(2.16)  88.06(4.58)    0.11(0.01)
ingofix:        25.51(3.05)  83.55(2.78)    0.10(0.01)

numa_test N=8   ElapsedTime  TotalUserTime  TotalSysTime
orig:           24.81(2.71)  164.94(4.82)   0.17(0.01)
ingo:           27.38(3.01)  170.06(5.60)   0.30(0.03)
ingofix:        29.08(2.79)  172.10(4.48)   0.32(0.03)

numa_test N=16  ElapsedTime  TotalUserTime  TotalSysTime
orig:           45.19(3.42)  332.07(5.89)   0.32(0.01)
ingo:           50.18(0.38)  359.46(9.31)   0.46(0.04)
ingofix:        50.30(0.42)  357.38(9.12)   0.46(0.01)

numa_test N=32  ElapsedTime  TotalUserTime  TotalSysTime
orig:           86.84(1.83)  671.99(9.98)   0.65(0.02)
ingo:           93.44(2.13)  704.90(16.91)  0.82(0.06)
ingofix:        93.92(1.28)  727.58(9.26)   0.77(0.03)


From these results I would prefer to either leave the numa scheduler
as it is or to introduce an IDLE_NODEBALANCE_TICK and a
BUSY_NODEBALANCE_TICK instead of just having one NODE_REBALANCE_TICK
which balances very rarely. In that case it would make sense to keep
the cpus_to_balance() function.

Best regards,
Erich


On Friday 17 January 2003 09:47, Ingo Molnar wrote:
> Martin, Erich,
>
> could you give the attached patch a go, it implements my
> cleanup/reorganization suggestions ontop of 2.5.59:
>
>  - decouple the 'slow' rebalancer from the 'fast' rebalancer and attach it
>    to the scheduler tick. Remove rq->nr_balanced.
>
>  - do idle-rebalancing every 1 msec, intra-node rebalancing every 200
>    msecs and inter-node rebalancing every 400 msecs.
>
>  - move the tick-based rebalancing logic into rebalance_tick(), it looks
>    more organized this way and we have all related code in one spot.
>
>  - clean up the topology.h include file structure. Since generic kernel
>    code uses all the defines already, there's no reason to keep them in
>    asm-generic.h. I've created a linux/topology.h file that includes
>    asm/topology.h and takes care of the default and generic definitions.
>    Moved over a generic topology definition from mmzone.h.
>
>  - renamed rq->prev_nr_running[] to rq->prev_cpu_load[] - this further
>    unifies the SMP and NUMA balancers and is more in line with the
>    prev_node_load name.
>
> If performance drops due to this patch then the benchmarking goal would be
> to tune the following frequencies:
>
>  #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
>  #define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
>  #define NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
>
> In theory NODE_REBALANCE_TICK could be defined by the NUMA platform,
> although in the past such per-platform scheduler tunings used to end
> 'orphaned' after some time. 400 msecs is pretty conservative at the
> moment, it could be made more frequent if benchmark results support it.
>
> the patch compiles and boots on UP and SMP, it compiles on x86-NUMA.
>
> 	Ingo



* Re: [patch] sched-2.5.59-A2
  2003-01-17 14:35                                               ` Erich Focht
@ 2003-01-17 15:11                                                 ` Ingo Molnar
  2003-01-17 15:30                                                   ` Erich Focht
                                                                     ` (3 more replies)
  2003-01-17 17:21                                                 ` Martin J. Bligh
                                                                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 96+ messages in thread
From: Ingo Molnar @ 2003-01-17 15:11 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Christoph Hellwig, Robert Love,
	Michael Hohnbaum, Andrew Theurer, Linus Torvalds, linux-kernel,
	lse-tech


On Fri, 17 Jan 2003, Erich Focht wrote:

> I like the cleanup of the topology.h. Also the renaming to
> prev_cpu_load. There was a mistake (I think) in the call to
> load_balance() in the idle path, guess you wanted to have:
> +           load_balance(this_rq, 1, __node_to_cpu_mask(this_node));
> instead of
> +           load_balance(this_rq, 1, this_cpumask);
> otherwise you won't load balance at all for idle cpus.

indeed - there was another bug as well, the 'idle' parameter to
load_balance() was 1 even in the busy branch, causing too slow balancing.

> From these results I would prefer to either leave the numa scheduler as
> it is or to introduce an IDLE_NODEBALANCE_TICK and a
> BUSY_NODEBALANCE_TICK instead of just having one NODE_REBALANCE_TICK
> which balances very rarely.

agreed, i've attached the -B0 patch that does this. The balancing rates
are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
busy-global).

	Ingo

--- linux/drivers/base/cpu.c.orig	2003-01-17 10:02:19.000000000 +0100
+++ linux/drivers/base/cpu.c	2003-01-17 10:02:25.000000000 +0100
@@ -6,8 +6,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/cpu.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 
 static int cpu_add_device(struct device * dev)
--- linux/drivers/base/node.c.orig	2003-01-17 10:02:50.000000000 +0100
+++ linux/drivers/base/node.c	2003-01-17 10:03:03.000000000 +0100
@@ -7,8 +7,7 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 
 static int node_add_device(struct device * dev)
--- linux/drivers/base/memblk.c.orig	2003-01-17 10:02:33.000000000 +0100
+++ linux/drivers/base/memblk.c	2003-01-17 10:02:38.000000000 +0100
@@ -7,8 +7,7 @@
 #include <linux/init.h>
 #include <linux/memblk.h>
 #include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 
 static int memblk_add_device(struct device * dev)
--- linux/include/asm-generic/topology.h.orig	2003-01-17 09:49:38.000000000 +0100
+++ linux/include/asm-generic/topology.h	2003-01-17 10:02:08.000000000 +0100
@@ -1,56 +0,0 @@
-/*
- * linux/include/asm-generic/topology.h
- *
- * Written by: Matthew Dobson, IBM Corporation
- *
- * Copyright (C) 2002, IBM Corp.
- *
- * All rights reserved.          
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT.  See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to <colpatch@us.ibm.com>
- */
-#ifndef _ASM_GENERIC_TOPOLOGY_H
-#define _ASM_GENERIC_TOPOLOGY_H
-
-/* Other architectures wishing to use this simple topology API should fill
-   in the below functions as appropriate in their own <asm/topology.h> file. */
-#ifndef __cpu_to_node
-#define __cpu_to_node(cpu)		(0)
-#endif
-#ifndef __memblk_to_node
-#define __memblk_to_node(memblk)	(0)
-#endif
-#ifndef __parent_node
-#define __parent_node(node)		(0)
-#endif
-#ifndef __node_to_first_cpu
-#define __node_to_first_cpu(node)	(0)
-#endif
-#ifndef __node_to_cpu_mask
-#define __node_to_cpu_mask(node)	(cpu_online_map)
-#endif
-#ifndef __node_to_memblk
-#define __node_to_memblk(node)		(0)
-#endif
-
-/* Cross-node load balancing interval. */
-#ifndef NODE_BALANCE_RATE
-#define NODE_BALANCE_RATE 10
-#endif
-
-#endif /* _ASM_GENERIC_TOPOLOGY_H */
--- linux/include/asm-ppc64/topology.h.orig	2003-01-17 09:54:46.000000000 +0100
+++ linux/include/asm-ppc64/topology.h	2003-01-17 09:55:18.000000000 +0100
@@ -46,18 +46,6 @@
 	return mask;
 }
 
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
-#else /* !CONFIG_NUMA */
-
-#define __cpu_to_node(cpu)		(0)
-#define __memblk_to_node(memblk)	(0)
-#define __parent_node(nid)		(0)
-#define __node_to_first_cpu(node)	(0)
-#define __node_to_cpu_mask(node)	(cpu_online_map)
-#define __node_to_memblk(node)		(0)
-
 #endif /* CONFIG_NUMA */
 
 #endif /* _ASM_PPC64_TOPOLOGY_H */
--- linux/include/linux/topology.h.orig	2003-01-17 09:57:20.000000000 +0100
+++ linux/include/linux/topology.h	2003-01-17 10:09:38.000000000 +0100
@@ -0,0 +1,43 @@
+/*
+ * linux/include/linux/topology.h
+ */
+#ifndef _LINUX_TOPOLOGY_H
+#define _LINUX_TOPOLOGY_H
+
+#include <asm/topology.h>
+
+/*
+ * The default (non-NUMA) topology definitions:
+ */
+#ifndef __cpu_to_node
+#define __cpu_to_node(cpu)		(0)
+#endif
+#ifndef __memblk_to_node
+#define __memblk_to_node(memblk)	(0)
+#endif
+#ifndef __parent_node
+#define __parent_node(node)		(0)
+#endif
+#ifndef __node_to_first_cpu
+#define __node_to_first_cpu(node)	(0)
+#endif
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu)		(cpu_online_map)
+#endif
+#ifndef __node_to_cpu_mask
+#define __node_to_cpu_mask(node)		(cpu_online_map)
+#endif
+#ifndef __node_to_memblk
+#define __node_to_memblk(node)		(0)
+#endif
+
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu)		__node_to_cpu_mask(__cpu_to_node(cpu))
+#endif
+
+/*
+ * Generic defintions:
+ */
+#define numa_node_id()			(__cpu_to_node(smp_processor_id()))
+
+#endif /* _LINUX_TOPOLOGY_H */
--- linux/include/linux/mmzone.h.orig	2003-01-17 09:58:20.000000000 +0100
+++ linux/include/linux/mmzone.h	2003-01-17 10:01:17.000000000 +0100
@@ -255,9 +255,7 @@
 #define MAX_NR_MEMBLKS	1
 #endif /* CONFIG_NUMA */
 
-#include <asm/topology.h>
-/* Returns the number of the current Node. */
-#define numa_node_id()		(__cpu_to_node(smp_processor_id()))
+#include <linux/topology.h>
 
 #ifndef CONFIG_DISCONTIGMEM
 extern struct pglist_data contig_page_data;
--- linux/include/asm-ia64/topology.h.orig	2003-01-17 09:54:33.000000000 +0100
+++ linux/include/asm-ia64/topology.h	2003-01-17 09:54:38.000000000 +0100
@@ -60,7 +60,4 @@
  */
 #define __node_to_memblk(node) (node)
 
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
 #endif /* _ASM_IA64_TOPOLOGY_H */
--- linux/include/asm-i386/topology.h.orig	2003-01-17 09:55:28.000000000 +0100
+++ linux/include/asm-i386/topology.h	2003-01-17 09:56:27.000000000 +0100
@@ -61,17 +61,6 @@
 /* Returns the number of the first MemBlk on Node 'node' */
 #define __node_to_memblk(node) (node)
 
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 100
-
-#else /* !CONFIG_NUMA */
-/*
- * Other i386 platforms should define their own version of the 
- * above macros here.
- */
-
-#include <asm-generic/topology.h>
-
 #endif /* CONFIG_NUMA */
 
 #endif /* _ASM_I386_TOPOLOGY_H */
--- linux/include/asm-i386/cpu.h.orig	2003-01-17 10:03:22.000000000 +0100
+++ linux/include/asm-i386/cpu.h	2003-01-17 10:03:31.000000000 +0100
@@ -3,8 +3,8 @@
 
 #include <linux/device.h>
 #include <linux/cpu.h>
+#include <linux/topology.h>
 
-#include <asm/topology.h>
 #include <asm/node.h>
 
 struct i386_cpu {
--- linux/include/asm-i386/node.h.orig	2003-01-17 10:04:02.000000000 +0100
+++ linux/include/asm-i386/node.h	2003-01-17 10:04:08.000000000 +0100
@@ -4,8 +4,7 @@
 #include <linux/device.h>
 #include <linux/mmzone.h>
 #include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 struct i386_node {
 	struct node node;
--- linux/include/asm-i386/memblk.h.orig	2003-01-17 10:03:51.000000000 +0100
+++ linux/include/asm-i386/memblk.h	2003-01-17 10:03:56.000000000 +0100
@@ -4,8 +4,8 @@
 #include <linux/device.h>
 #include <linux/mmzone.h>
 #include <linux/memblk.h>
+#include <linux/topology.h>
 
-#include <asm/topology.h>
 #include <asm/node.h>
 
 struct i386_memblk {
--- linux/kernel/sched.c.orig	2003-01-17 09:22:24.000000000 +0100
+++ linux/kernel/sched.c	2003-01-17 17:01:47.000000000 +0100
@@ -153,10 +153,9 @@
 			nr_uninterruptible;
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
-	int prev_nr_running[NR_CPUS];
+	int prev_cpu_load[NR_CPUS];
 #ifdef CONFIG_NUMA
 	atomic_t *node_nr_running;
-	unsigned int nr_balanced;
 	int prev_node_load[MAX_NUMNODES];
 #endif
 	task_t *migration_thread;
@@ -765,29 +764,6 @@
 	return node;
 }
 
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
-	int this_node = __cpu_to_node(this_cpu);
-	/*
-	 * Avoid rebalancing between nodes too often.
-	 * We rebalance globally once every NODE_BALANCE_RATE load balances.
-	 */
-	if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
-		int node = find_busiest_node(this_node);
-		this_rq->nr_balanced = 0;
-		if (node >= 0)
-			return (__node_to_cpu_mask(node) | (1UL << this_cpu));
-	}
-	return __node_to_cpu_mask(this_node);
-}
-
-#else /* !CONFIG_NUMA */
-
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
-	return cpu_online_map;
-}
-
 #endif /* CONFIG_NUMA */
 
 #if CONFIG_SMP
@@ -807,10 +783,10 @@
 			spin_lock(&busiest->lock);
 			spin_lock(&this_rq->lock);
 			/* Need to recalculate nr_running */
-			if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+			if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
 				nr_running = this_rq->nr_running;
 			else
-				nr_running = this_rq->prev_nr_running[this_cpu];
+				nr_running = this_rq->prev_cpu_load[this_cpu];
 		} else
 			spin_lock(&busiest->lock);
 	}
@@ -847,10 +823,10 @@
 	 * that case we are less picky about moving a task across CPUs and
 	 * take what can be taken.
 	 */
-	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+	if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
 		nr_running = this_rq->nr_running;
 	else
-		nr_running = this_rq->prev_nr_running[this_cpu];
+		nr_running = this_rq->prev_cpu_load[this_cpu];
 
 	busiest = NULL;
 	max_load = 1;
@@ -859,11 +835,11 @@
 			continue;
 
 		rq_src = cpu_rq(i);
-		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+		if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
 			load = rq_src->nr_running;
 		else
-			load = this_rq->prev_nr_running[i];
-		this_rq->prev_nr_running[i] = rq_src->nr_running;
+			load = this_rq->prev_cpu_load[i];
+		this_rq->prev_cpu_load[i] = rq_src->nr_running;
 
 		if ((load > max_load) && (rq_src != this_rq)) {
 			busiest = rq_src;
@@ -922,7 +898,7 @@
  * We call this with the current runqueue locked,
  * irqs disabled.
  */
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int idle, unsigned long cpumask)
 {
 	int imbalance, idx, this_cpu = smp_processor_id();
 	runqueue_t *busiest;
@@ -930,8 +906,7 @@
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
-					cpus_to_balance(this_cpu, this_rq));
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
 	if (!busiest)
 		goto out;
 
@@ -1006,21 +981,75 @@
  * frequency and balancing agressivity depends on whether the CPU is
  * idle or not.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
+ *
+ * On NUMA, do a node-rebalance every 400 msecs.
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 2)
+#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
 
-static inline void idle_tick(runqueue_t *rq)
+#if CONFIG_NUMA
+static void balance_node(runqueue_t *this_rq, int idle, int this_cpu)
 {
-	if (jiffies % IDLE_REBALANCE_TICK)
-		return;
-	spin_lock(&rq->lock);
-	load_balance(rq, 1);
-	spin_unlock(&rq->lock);
+	int node = find_busiest_node(__cpu_to_node(this_cpu));
+	unsigned long cpumask, this_cpumask = 1UL << this_cpu;
+
+	if (node >= 0) {
+		cpumask = __node_to_cpu_mask(node) | this_cpumask;
+		spin_lock(&this_rq->lock);
+		load_balance(this_rq, idle, cpumask);
+		spin_unlock(&this_rq->lock);
+	}
 }
+#endif
 
+static void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+#if CONFIG_NUMA
+	int this_cpu = smp_processor_id();
+#endif
+	unsigned long j = jiffies;
+
+	/*
+	 * First do inter-node rebalancing, then intra-node rebalancing,
+	 * if both events happen in the same tick. The inter-node
+	 * rebalancing does not necessarily have to create a perfect
+	 * balance within the node, since we load-balance the most loaded
+	 * node with the current CPU. (ie. other CPUs in the local node
+	 * are not balanced.)
+	 */
+	if (idle) {
+#if CONFIG_NUMA
+		if (!(j % IDLE_NODE_REBALANCE_TICK))
+			balance_node(this_rq, idle, this_cpu);
+#endif
+		if (!(j % IDLE_REBALANCE_TICK)) {
+			spin_lock(&this_rq->lock);
+			load_balance(this_rq, 0, __cpu_to_node_mask(this_cpu));
+			spin_unlock(&this_rq->lock);
+		}
+		return;
+	}
+#if CONFIG_NUMA
+	if (!(j % BUSY_NODE_REBALANCE_TICK))
+		balance_node(this_rq, idle, this_cpu);
+#endif
+	if (!(j % BUSY_REBALANCE_TICK)) {
+		spin_lock(&this_rq->lock);
+		load_balance(this_rq, idle, __cpu_to_node_mask(this_cpu));
+		spin_unlock(&this_rq->lock);
+	}
+}
+#else
+/*
+ * on UP we do not need to balance between CPUs:
+ */
+static inline void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+}
 #endif
 
 DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
@@ -1063,9 +1092,7 @@
 			kstat_cpu(cpu).cpustat.iowait += sys_ticks;
 		else
 			kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
-		idle_tick(rq);
-#endif
+		rebalance_tick(rq, 1);
 		return;
 	}
 	if (TASK_NICE(p) > 0)
@@ -1121,11 +1148,8 @@
 			enqueue_task(p, rq->active);
 	}
 out:
-#if CONFIG_SMP
-	if (!(jiffies % BUSY_REBALANCE_TICK))
-		load_balance(rq, 0);
-#endif
 	spin_unlock(&rq->lock);
+	rebalance_tick(rq, 0);
 }
 
 void scheduling_functions_start_here(void) { }
@@ -1184,7 +1208,7 @@
 pick_next_task:
 	if (unlikely(!rq->nr_running)) {
 #if CONFIG_SMP
-		load_balance(rq, 1);
+		load_balance(rq, 1, __cpu_to_node_mask(smp_processor_id()));
 		if (rq->nr_running)
 			goto pick_next_task;
 #endif
--- linux/mm/page_alloc.c.orig	2003-01-17 10:01:29.000000000 +0100
+++ linux/mm/page_alloc.c	2003-01-17 10:01:35.000000000 +0100
@@ -28,8 +28,7 @@
 #include <linux/blkdev.h>
 #include <linux/slab.h>
 #include <linux/notifier.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 DECLARE_BITMAP(node_online_map, MAX_NUMNODES);
 DECLARE_BITMAP(memblk_online_map, MAX_NR_MEMBLKS);
--- linux/mm/vmscan.c.orig	2003-01-17 10:01:44.000000000 +0100
+++ linux/mm/vmscan.c	2003-01-17 10:01:52.000000000 +0100
@@ -27,10 +27,10 @@
 #include <linux/pagevec.h>
 #include <linux/backing-dev.h>
 #include <linux/rmap-locking.h>
+#include <linux/topology.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
-#include <asm/topology.h>
 #include <asm/div64.h>
 
 #include <linux/swapops.h>



* Re: [patch] sched-2.5.59-A2
  2003-01-17 15:11                                                 ` Ingo Molnar
@ 2003-01-17 15:30                                                   ` Erich Focht
  2003-01-17 16:58                                                   ` Martin J. Bligh
                                                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 96+ messages in thread
From: Erich Focht @ 2003-01-17 15:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin J. Bligh, Christoph Hellwig, Robert Love,
	Michael Hohnbaum, Andrew Theurer, Linus Torvalds, linux-kernel,
	lse-tech

On Friday 17 January 2003 16:11, Ingo Molnar wrote:
> On Fri, 17 Jan 2003, Erich Focht wrote:
> > I like the cleanup of the topology.h. Also the renaming to
> > prev_cpu_load. There was a mistake (I think) in the call to
> > load_balance() in the idle path, guess you wanted to have:
> > +           load_balance(this_rq, 1, __node_to_cpu_mask(this_node));
> > instead of
> > +           load_balance(this_rq, 1, this_cpumask);
> > otherwise you won't load balance at all for idle cpus.
>
> indeed - there was another bug as well, the 'idle' parameter to
> load_balance() was 1 even in the busy branch, causing too slow balancing.

I didn't see that, but its impact is only that a busy CPU steals at
most one task from another node; otherwise the idle=1 leads to more
aggressive balancing.

> > From these results I would prefer to either leave the numa scheduler as
> > it is or to introduce an IDLE_NODEBALANCE_TICK and a
> > BUSY_NODEBALANCE_TICK instead of just having one NODE_REBALANCE_TICK
> > which balances very rarely.
>
> agreed, i've attached the -B0 patch that does this. The balancing rates
> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
> busy-global).

This looks good! I'll see if I can rerun the tests today; in any case
I'm more optimistic about this version.

Regards,
Erich



* Re: [patch] sched-2.5.59-A2
  2003-01-17 15:11                                                 ` Ingo Molnar
  2003-01-17 15:30                                                   ` Erich Focht
@ 2003-01-17 16:58                                                   ` Martin J. Bligh
  2003-01-18 20:54                                                     ` NUMA sched -> pooling scheduler (inc HT) Martin J. Bligh
  2003-01-17 18:19                                                   ` [patch] sched-2.5.59-A2 Michael Hohnbaum
  2003-01-18  7:08                                                   ` William Lee Irwin III
  3 siblings, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-17 16:58 UTC (permalink / raw)
  To: Ingo Molnar, Erich Focht
  Cc: Christoph Hellwig, Robert Love, Michael Hohnbaum, Andrew Theurer,
	Linus Torvalds, linux-kernel, lse-tech

> agreed, i've attached the -B0 patch that does this. The balancing rates
> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
> busy-global).

Hmmm... something is drastically wrong here, looks like we're thrashing
tasks between nodes?

Kernbench:
                                   Elapsed        User      System         CPU
                        2.5.59     20.032s     186.66s      47.73s       1170%
                  2.5.59-ingo2     23.578s    198.874s     90.648s     1227.4%

NUMA schedbench 4:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                        2.5.59        0.00       36.38       90.70        0.62
                  2.5.59-ingo2        0.00       47.62      127.13        1.89

NUMA schedbench 8:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                        2.5.59        0.00       42.78      249.77        1.85
                  2.5.59-ingo2        0.00       59.45      358.31        5.23

NUMA schedbench 16:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                        2.5.59        0.00       56.84      848.00        2.78
                  2.5.59-ingo2        0.00      114.70     1430.95       21.21

diffprofile:

44770 total
27840 do_anonymous_page
3180 buffered_rmqueue
1814 default_idle
1534 __copy_from_user_ll
1465 free_hot_cold_page
1172 __copy_to_user_ll
919 page_remove_rmap
881 zap_pte_range
730 do_wp_page
687 do_no_page
601 __alloc_pages
527 vm_enough_memory
432 __set_page_dirty_buffers
426 page_add_rmap
322 release_pages
311 __pagevec_lru_add_active
233 prep_new_page
202 clear_page_tables
181 schedule
132 current_kernel_time
127 __block_prepare_write
103 kmap_atomic
100 __wake_up
97 pte_alloc_one
95 may_open
87 find_get_page
76 dget_locked
72 bad_range
69 pgd_alloc
62 __fput
50 copy_strings
...
-66 open_namei
-68 path_lookup
-115 .text.lock.file_table
-255 .text.lock.dec_and_lock
-469 .text.lock.namei



* Re: [patch] sched-2.5.59-A2
  2003-01-17 14:35                                               ` Erich Focht
  2003-01-17 15:11                                                 ` Ingo Molnar
@ 2003-01-17 17:21                                                 ` Martin J. Bligh
  2003-01-17 17:23                                                 ` Martin J. Bligh
  2003-01-17 18:11                                                 ` Erich Focht
  3 siblings, 0 replies; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-17 17:21 UTC (permalink / raw)
  To: Erich Focht, Ingo Molnar
  Cc: Christoph Hellwig, Robert Love, Michael Hohnbaum, Andrew Theurer,
	Linus Torvalds, linux-kernel, lse-tech

> I like the cleanup of the topology.h. 

Any chance we could keep that broken out as a separate patch?
Topo cleanups below:

diff -urpN -X /home/fletch/.diff.exclude virgin/drivers/base/cpu.c ingo-A/drivers/base/cpu.c
--- virgin/drivers/base/cpu.c	Sun Dec  1 09:59:47 2002
+++ ingo-A/drivers/base/cpu.c	Fri Jan 17 09:19:23 2003
@@ -6,8 +6,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/cpu.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 
 static int cpu_add_device(struct device * dev)
diff -urpN -X /home/fletch/.diff.exclude virgin/drivers/base/memblk.c ingo-A/drivers/base/memblk.c
--- virgin/drivers/base/memblk.c	Mon Dec 16 21:50:42 2002
+++ ingo-A/drivers/base/memblk.c	Fri Jan 17 09:19:23 2003
@@ -7,8 +7,7 @@
 #include <linux/init.h>
 #include <linux/memblk.h>
 #include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 
 static int memblk_add_device(struct device * dev)
diff -urpN -X /home/fletch/.diff.exclude virgin/drivers/base/node.c ingo-A/drivers/base/node.c
--- virgin/drivers/base/node.c	Fri Jan 17 09:18:26 2003
+++ ingo-A/drivers/base/node.c	Fri Jan 17 09:19:23 2003
@@ -7,8 +7,7 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 
 static int node_add_device(struct device * dev)
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-generic/topology.h ingo-A/include/asm-generic/topology.h
--- virgin/include/asm-generic/topology.h	Fri Jan 17 09:18:31 2003
+++ ingo-A/include/asm-generic/topology.h	Fri Jan 17 09:19:23 2003
@@ -1,56 +0,0 @@
-/*
- * linux/include/asm-generic/topology.h
- *
- * Written by: Matthew Dobson, IBM Corporation
- *
- * Copyright (C) 2002, IBM Corp.
- *
- * All rights reserved.          
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT.  See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to <colpatch@us.ibm.com>
- */
-#ifndef _ASM_GENERIC_TOPOLOGY_H
-#define _ASM_GENERIC_TOPOLOGY_H
-
-/* Other architectures wishing to use this simple topology API should fill
-   in the below functions as appropriate in their own <asm/topology.h> file. */
-#ifndef __cpu_to_node
-#define __cpu_to_node(cpu)		(0)
-#endif
-#ifndef __memblk_to_node
-#define __memblk_to_node(memblk)	(0)
-#endif
-#ifndef __parent_node
-#define __parent_node(node)		(0)
-#endif
-#ifndef __node_to_first_cpu
-#define __node_to_first_cpu(node)	(0)
-#endif
-#ifndef __node_to_cpu_mask
-#define __node_to_cpu_mask(node)	(cpu_online_map)
-#endif
-#ifndef __node_to_memblk
-#define __node_to_memblk(node)		(0)
-#endif
-
-/* Cross-node load balancing interval. */
-#ifndef NODE_BALANCE_RATE
-#define NODE_BALANCE_RATE 10
-#endif
-
-#endif /* _ASM_GENERIC_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/cpu.h ingo-A/include/asm-i386/cpu.h
--- virgin/include/asm-i386/cpu.h	Sun Nov 17 20:29:50 2002
+++ ingo-A/include/asm-i386/cpu.h	Fri Jan 17 09:19:23 2003
@@ -3,8 +3,8 @@
 
 #include <linux/device.h>
 #include <linux/cpu.h>
+#include <linux/topology.h>
 
-#include <asm/topology.h>
 #include <asm/node.h>
 
 struct i386_cpu {
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/memblk.h ingo-A/include/asm-i386/memblk.h
--- virgin/include/asm-i386/memblk.h	Sun Nov 17 20:29:47 2002
+++ ingo-A/include/asm-i386/memblk.h	Fri Jan 17 09:19:25 2003
@@ -4,8 +4,8 @@
 #include <linux/device.h>
 #include <linux/mmzone.h>
 #include <linux/memblk.h>
+#include <linux/topology.h>
 
-#include <asm/topology.h>
 #include <asm/node.h>
 
 struct i386_memblk {
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/node.h ingo-A/include/asm-i386/node.h
--- virgin/include/asm-i386/node.h	Sun Nov 17 20:29:29 2002
+++ ingo-A/include/asm-i386/node.h	Fri Jan 17 09:19:24 2003
@@ -4,8 +4,7 @@
 #include <linux/device.h>
 #include <linux/mmzone.h>
 #include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 struct i386_node {
 	struct node node;
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/topology.h ingo-A/include/asm-i386/topology.h
--- virgin/include/asm-i386/topology.h	Fri Jan 17 09:18:31 2003
+++ ingo-A/include/asm-i386/topology.h	Fri Jan 17 09:19:23 2003
@@ -61,17 +61,6 @@ static inline int __node_to_first_cpu(in
 /* Returns the number of the first MemBlk on Node 'node' */
 #define __node_to_memblk(node) (node)
 
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 100
-
-#else /* !CONFIG_NUMA */
-/*
- * Other i386 platforms should define their own version of the 
- * above macros here.
- */
-
-#include <asm-generic/topology.h>
-
 #endif /* CONFIG_NUMA */
 
 #endif /* _ASM_I386_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-ia64/topology.h ingo-A/include/asm-ia64/topology.h
--- virgin/include/asm-ia64/topology.h	Fri Jan 17 09:18:31 2003
+++ ingo-A/include/asm-ia64/topology.h	Fri Jan 17 09:19:23 2003
@@ -60,7 +60,4 @@
  */
 #define __node_to_memblk(node) (node)
 
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
 #endif /* _ASM_IA64_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-ppc64/topology.h ingo-A/include/asm-ppc64/topology.h
--- virgin/include/asm-ppc64/topology.h	Fri Jan 17 09:18:31 2003
+++ ingo-A/include/asm-ppc64/topology.h	Fri Jan 17 09:19:23 2003
@@ -46,18 +46,6 @@ static inline unsigned long __node_to_cp
 	return mask;
 }
 
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
-#else /* !CONFIG_NUMA */
-
-#define __cpu_to_node(cpu)		(0)
-#define __memblk_to_node(memblk)	(0)
-#define __parent_node(nid)		(0)
-#define __node_to_first_cpu(node)	(0)
-#define __node_to_cpu_mask(node)	(cpu_online_map)
-#define __node_to_memblk(node)		(0)
-
 #endif /* CONFIG_NUMA */
 
 #endif /* _ASM_PPC64_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/include/linux/mmzone.h ingo-A/include/linux/mmzone.h
--- virgin/include/linux/mmzone.h	Fri Jan 17 09:18:31 2003
+++ ingo-A/include/linux/mmzone.h	Fri Jan 17 09:19:23 2003
@@ -255,9 +255,7 @@ static inline struct zone *next_zone(str
 #define MAX_NR_MEMBLKS	1
 #endif /* CONFIG_NUMA */
 
-#include <asm/topology.h>
-/* Returns the number of the current Node. */
-#define numa_node_id()		(__cpu_to_node(smp_processor_id()))
+#include <linux/topology.h>
 
 #ifndef CONFIG_DISCONTIGMEM
 extern struct pglist_data contig_page_data;
diff -urpN -X /home/fletch/.diff.exclude virgin/include/linux/topology.h ingo-A/include/linux/topology.h
--- virgin/include/linux/topology.h	Wed Dec 31 16:00:00 1969
+++ ingo-A/include/linux/topology.h	Fri Jan 17 09:19:23 2003
@@ -0,0 +1,43 @@
+/*
+ * linux/include/linux/topology.h
+ */
+#ifndef _LINUX_TOPOLOGY_H
+#define _LINUX_TOPOLOGY_H
+
+#include <asm/topology.h>
+
+/*
+ * The default (non-NUMA) topology definitions:
+ */
+#ifndef __cpu_to_node
+#define __cpu_to_node(cpu)		(0)
+#endif
+#ifndef __memblk_to_node
+#define __memblk_to_node(memblk)	(0)
+#endif
+#ifndef __parent_node
+#define __parent_node(node)		(0)
+#endif
+#ifndef __node_to_first_cpu
+#define __node_to_first_cpu(node)	(0)
+#endif
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu)		(cpu_online_map)
+#endif
+#ifndef __node_to_cpu_mask
+#define __node_to_cpu_mask(node)		(cpu_online_map)
+#endif
+#ifndef __node_to_memblk
+#define __node_to_memblk(node)		(0)
+#endif
+
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu)		__node_to_cpu_mask(__cpu_to_node(cpu))
+#endif
+
+/*
+ * Generic defintions:
+ */
+#define numa_node_id()			(__cpu_to_node(smp_processor_id()))
+
+#endif /* _LINUX_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/mm/page_alloc.c ingo-A/mm/page_alloc.c
--- virgin/mm/page_alloc.c	Fri Jan 17 09:18:32 2003
+++ ingo-A/mm/page_alloc.c	Fri Jan 17 09:19:25 2003
@@ -28,8 +28,7 @@
 #include <linux/blkdev.h>
 #include <linux/slab.h>
 #include <linux/notifier.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
 
 DECLARE_BITMAP(node_online_map, MAX_NUMNODES);
 DECLARE_BITMAP(memblk_online_map, MAX_NR_MEMBLKS);
diff -urpN -X /home/fletch/.diff.exclude virgin/mm/vmscan.c ingo-A/mm/vmscan.c
--- virgin/mm/vmscan.c	Mon Dec 23 23:01:58 2002
+++ ingo-A/mm/vmscan.c	Fri Jan 17 09:19:25 2003
@@ -27,10 +27,10 @@
 #include <linux/pagevec.h>
 #include <linux/backing-dev.h>
 #include <linux/rmap-locking.h>
+#include <linux/topology.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
-#include <asm/topology.h>
 #include <asm/div64.h>
 
 #include <linux/swapops.h>



* Re: [patch] sched-2.5.59-A2
  2003-01-17 14:35                                               ` Erich Focht
  2003-01-17 15:11                                                 ` Ingo Molnar
  2003-01-17 17:21                                                 ` Martin J. Bligh
@ 2003-01-17 17:23                                                 ` Martin J. Bligh
  2003-01-17 18:11                                                 ` Erich Focht
  3 siblings, 0 replies; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-17 17:23 UTC (permalink / raw)
  To: Erich Focht, Ingo Molnar
  Cc: Christoph Hellwig, Robert Love, Michael Hohnbaum, Andrew Theurer,
	Linus Torvalds, linux-kernel, lse-tech

> I like the cleanup of the topology.h. 

And the rest of Ingo's second version:

diff -urpN -X /home/fletch/.diff.exclude ingo-A/kernel/sched.c ingo-B/kernel/sched.c
--- ingo-A/kernel/sched.c	Fri Jan 17 09:18:32 2003
+++ ingo-B/kernel/sched.c	Fri Jan 17 09:19:42 2003
@@ -153,10 +153,9 @@ struct runqueue {
 			nr_uninterruptible;
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
-	int prev_nr_running[NR_CPUS];
+	int prev_cpu_load[NR_CPUS];
 #ifdef CONFIG_NUMA
 	atomic_t *node_nr_running;
-	unsigned int nr_balanced;
 	int prev_node_load[MAX_NUMNODES];
 #endif
 	task_t *migration_thread;
@@ -765,29 +764,6 @@ static int find_busiest_node(int this_no
 	return node;
 }
 
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
-	int this_node = __cpu_to_node(this_cpu);
-	/*
-	 * Avoid rebalancing between nodes too often.
-	 * We rebalance globally once every NODE_BALANCE_RATE load balances.
-	 */
-	if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
-		int node = find_busiest_node(this_node);
-		this_rq->nr_balanced = 0;
-		if (node >= 0)
-			return (__node_to_cpu_mask(node) | (1UL << this_cpu));
-	}
-	return __node_to_cpu_mask(this_node);
-}
-
-#else /* !CONFIG_NUMA */
-
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
-	return cpu_online_map;
-}
-
 #endif /* CONFIG_NUMA */
 
 #if CONFIG_SMP
@@ -807,10 +783,10 @@ static inline unsigned int double_lock_b
 			spin_lock(&busiest->lock);
 			spin_lock(&this_rq->lock);
 			/* Need to recalculate nr_running */
-			if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+			if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
 				nr_running = this_rq->nr_running;
 			else
-				nr_running = this_rq->prev_nr_running[this_cpu];
+				nr_running = this_rq->prev_cpu_load[this_cpu];
 		} else
 			spin_lock(&busiest->lock);
 	}
@@ -847,10 +823,10 @@ static inline runqueue_t *find_busiest_q
 	 * that case we are less picky about moving a task across CPUs and
 	 * take what can be taken.
 	 */
-	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+	if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
 		nr_running = this_rq->nr_running;
 	else
-		nr_running = this_rq->prev_nr_running[this_cpu];
+		nr_running = this_rq->prev_cpu_load[this_cpu];
 
 	busiest = NULL;
 	max_load = 1;
@@ -859,11 +835,11 @@ static inline runqueue_t *find_busiest_q
 			continue;
 
 		rq_src = cpu_rq(i);
-		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+		if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
 			load = rq_src->nr_running;
 		else
-			load = this_rq->prev_nr_running[i];
-		this_rq->prev_nr_running[i] = rq_src->nr_running;
+			load = this_rq->prev_cpu_load[i];
+		this_rq->prev_cpu_load[i] = rq_src->nr_running;
 
 		if ((load > max_load) && (rq_src != this_rq)) {
 			busiest = rq_src;
@@ -922,7 +898,7 @@ static inline void pull_task(runqueue_t 
  * We call this with the current runqueue locked,
  * irqs disabled.
  */
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int idle, unsigned long cpumask)
 {
 	int imbalance, idx, this_cpu = smp_processor_id();
 	runqueue_t *busiest;
@@ -930,8 +906,7 @@ static void load_balance(runqueue_t *thi
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
-					cpus_to_balance(this_cpu, this_rq));
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
 	if (!busiest)
 		goto out;
 
@@ -1006,21 +981,75 @@ out:
  * frequency and balancing agressivity depends on whether the CPU is
  * idle or not.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
+ *
+ * On NUMA, do a node-rebalance every 400 msecs.
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 2)
+#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
 
-static inline void idle_tick(runqueue_t *rq)
+#if CONFIG_NUMA
+static void balance_node(runqueue_t *this_rq, int idle, int this_cpu)
 {
-	if (jiffies % IDLE_REBALANCE_TICK)
-		return;
-	spin_lock(&rq->lock);
-	load_balance(rq, 1);
-	spin_unlock(&rq->lock);
+	int node = find_busiest_node(__cpu_to_node(this_cpu));
+	unsigned long cpumask, this_cpumask = 1UL << this_cpu;
+
+	if (node >= 0) {
+		cpumask = __node_to_cpu_mask(node) | this_cpumask;
+		spin_lock(&this_rq->lock);
+		load_balance(this_rq, idle, cpumask);
+		spin_unlock(&this_rq->lock);
+	}
 }
+#endif
 
+static void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+#if CONFIG_NUMA
+	int this_cpu = smp_processor_id();
+#endif
+	unsigned long j = jiffies;
+
+	/*
+	 * First do inter-node rebalancing, then intra-node rebalancing,
+	 * if both events happen in the same tick. The inter-node
+	 * rebalancing does not necessarily have to create a perfect
+	 * balance within the node, since we load-balance the most loaded
+	 * node with the current CPU. (ie. other CPUs in the local node
+	 * are not balanced.)
+	 */
+	if (idle) {
+#if CONFIG_NUMA
+		if (!(j % IDLE_NODE_REBALANCE_TICK))
+			balance_node(this_rq, idle, this_cpu);
+#endif
+		if (!(j % IDLE_REBALANCE_TICK)) {
+			spin_lock(&this_rq->lock);
+			load_balance(this_rq, 0, __cpu_to_node_mask(this_cpu));
+			spin_unlock(&this_rq->lock);
+		}
+		return;
+	}
+#if CONFIG_NUMA
+	if (!(j % BUSY_NODE_REBALANCE_TICK))
+		balance_node(this_rq, idle, this_cpu);
+#endif
+	if (!(j % BUSY_REBALANCE_TICK)) {
+		spin_lock(&this_rq->lock);
+		load_balance(this_rq, idle, __cpu_to_node_mask(this_cpu));
+		spin_unlock(&this_rq->lock);
+	}
+}
+#else
+/*
+ * on UP we do not need to balance between CPUs:
+ */
+static inline void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+}
 #endif
 
 DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
@@ -1063,9 +1092,7 @@ void scheduler_tick(int user_ticks, int 
 			kstat_cpu(cpu).cpustat.iowait += sys_ticks;
 		else
 			kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
-		idle_tick(rq);
-#endif
+		rebalance_tick(rq, 1);
 		return;
 	}
 	if (TASK_NICE(p) > 0)
@@ -1121,11 +1148,8 @@ void scheduler_tick(int user_ticks, int 
 			enqueue_task(p, rq->active);
 	}
 out:
-#if CONFIG_SMP
-	if (!(jiffies % BUSY_REBALANCE_TICK))
-		load_balance(rq, 0);
-#endif
 	spin_unlock(&rq->lock);
+	rebalance_tick(rq, 0);
 }
 
 void scheduling_functions_start_here(void) { }
@@ -1184,7 +1208,7 @@ need_resched:
 pick_next_task:
 	if (unlikely(!rq->nr_running)) {
 #if CONFIG_SMP
-		load_balance(rq, 1);
+		load_balance(rq, 1, __cpu_to_node_mask(smp_processor_id()));
 		if (rq->nr_running)
 			goto pick_next_task;
 #endif



* Re: [patch] sched-2.5.59-A2
  2003-01-17 14:35                                               ` Erich Focht
                                                                   ` (2 preceding siblings ...)
  2003-01-17 17:23                                                 ` Martin J. Bligh
@ 2003-01-17 18:11                                                 ` Erich Focht
  2003-01-17 19:04                                                   ` Martin J. Bligh
  3 siblings, 1 reply; 96+ messages in thread
From: Erich Focht @ 2003-01-17 18:11 UTC (permalink / raw)
  To: Ingo Molnar, Martin J. Bligh
  Cc: Christoph Hellwig, Robert Love, Michael Hohnbaum, Andrew Theurer,
	Linus Torvalds, linux-kernel, lse-tech

Ingo,

I repeated the tests with your B0 version and it's still not
satisfying. Maybe IDLE_NODE_REBALANCE_TICK is too aggressive, or maybe
the difference is that the other calls of load_balance() never get the
chance to balance across nodes.

Here are the results:

kernbench (average of 5 kernel compiles) (standard error in brackets)
---------
           Elapsed       UserTime      SysTime
orig       134.43(1.79)  944.79(0.43)  21.41(0.28)
ingo       136.74(1.58)  951.55(0.73)  21.16(0.32)
ingofix    135.22(0.59)  952.17(0.78)  21.16(0.19)
ingoB0     134.69(0.51)  951.63(0.81)  21.12(0.15)


hackbench (chat-like benchmark) (elapsed time for N groups of 20
---------           senders & receivers, stats from 10 measurements)

          N=10        N=25        N=50        N=100
orig      0.77(0.03)  1.91(0.06)  3.77(0.06)  7.78(0.21)
ingo      1.70(0.35)  3.11(0.47)  4.85(0.55)  8.80(0.98)
ingofix   1.16(0.14)  2.67(0.53)  5.05(0.26)  9.99(0.13)
ingoB0    0.84(0.03)  2.12(0.12)  4.20(0.22)  8.04(0.16)


numabench (N memory intensive tasks running in parallel, disturbed for
---------  a short time by a "hackbench 10" call)


numa_test N=4   ElapsedTime  TotalUserTime  TotalSysTime
orig:           26.13(2.54)  86.10(4.47)    0.09(0.01)
ingo:           27.60(2.16)  88.06(4.58)    0.11(0.01)
ingofix:        25.51(3.05)  83.55(2.78)    0.10(0.01)
ingoB0:         27.58(0.08)  90.86(4.42)    0.09(0.01)	

numa_test N=8   ElapsedTime  TotalUserTime  TotalSysTime
orig:           24.81(2.71)  164.94(4.82)   0.17(0.01)
ingo:           27.38(3.01)  170.06(5.60)   0.30(0.03)
ingofix:        29.08(2.79)  172.10(4.48)   0.32(0.03)
ingoB0:         26.05(3.28)  171.61(7.76)   0.18(0.01)

numa_test N=16  ElapsedTime  TotalUserTime  TotalSysTime
orig:           45.19(3.42)  332.07(5.89)   0.32(0.01)
ingo:           50.18(0.38)  359.46(9.31)   0.46(0.04)
ingofix:        50.30(0.42)  357.38(9.12)   0.46(0.01)
ingoB0:         50.96(1.33)  371.72(18.58)  0.34(0.01)

numa_test N=32  ElapsedTime  TotalUserTime  TotalSysTime
orig:           86.84(1.83)  671.99(9.98)   0.65(0.02)
ingo:           93.44(2.13)  704.90(16.91)  0.82(0.06)
ingofix:        93.92(1.28)  727.58(9.26)   0.77(0.03)
ingoB0:         99.72(4.13)  759.03(29.41)  0.69(0.01)


The kernbench user time is still too large.
Hackbench improved a lot (understandable, as idle CPUs steal earlier
from remote nodes).
Numa_test didn't improve; on average we get the same results.

Hmmm, now I really tend towards leaving it the way it is in
2.5.59. Except for the topology cleanup and renaming, of course. I have
no more time to test a more conservative setting of
IDLE_NODE_REBALANCE_TICK today, but that could help...

Regards,
Erich



* Re: [patch] sched-2.5.59-A2
  2003-01-17 15:11                                                 ` Ingo Molnar
  2003-01-17 15:30                                                   ` Erich Focht
  2003-01-17 16:58                                                   ` Martin J. Bligh
@ 2003-01-17 18:19                                                   ` Michael Hohnbaum
  2003-01-18  7:08                                                   ` William Lee Irwin III
  3 siblings, 0 replies; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-17 18:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Erich Focht, Martin J. Bligh, Christoph Hellwig, Robert Love,
	Andrew Theurer, Linus Torvalds, linux-kernel, lse-tech

On Fri, 2003-01-17 at 07:11, Ingo Molnar wrote:

> agreed, i've attached the -B0 patch that does this. The balancing rates
> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
> busy-global).
> 
> 	Ingo

Ran this patch on a 4 node (16 CPU, 16 GB memory) NUMAQ.  Results don't
look encouraging.  I would suggest not applying this patch until the
degradation is worked out.

stock59 = linux 2.5.59
ingo59 = linux 2.5.59 with Ingo's B0 patch

Kernbench:
                        Elapsed       User     System        CPU
             stock59    29.668s   283.762s    82.286s      1233%
              ingo59    37.736s   338.162s   153.486s    1302.6%

Schedbench 4:
                        AvgUser    Elapsed  TotalUser   TotalSys
             stock59       0.00      24.44      68.07       0.78
              ingo59       0.00      62.14     163.32       1.93

Schedbench 8:
                        AvgUser    Elapsed  TotalUser   TotalSys
             stock59       0.00      48.26     246.75       1.64
              ingo59       0.00      68.17     376.85       6.42

Schedbench 16:
                        AvgUser    Elapsed  TotalUser   TotalSys
             stock59       0.00      56.51     845.26       2.98
              ingo59       0.00     114.38    1337.65      18.89

Schedbench 32:
                        AvgUser    Elapsed  TotalUser   TotalSys
             stock59       0.00     116.95    1806.33       6.23
              ingo59       0.00     243.46    3515.09      43.92

Schedbench 64:
                        AvgUser    Elapsed  TotalUser   TotalSys
             stock59       0.00     237.29    3634.59      15.71
              ingo59       0.00     688.31   10605.40     102.71

-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486



* Re: [patch] sched-2.5.59-A2
  2003-01-17 18:11                                                 ` Erich Focht
@ 2003-01-17 19:04                                                   ` Martin J. Bligh
  2003-01-17 19:26                                                     ` [Lse-tech] " Martin J. Bligh
  2003-01-17 23:09                                                     ` Matthew Dobson
  0 siblings, 2 replies; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-17 19:04 UTC (permalink / raw)
  To: Erich Focht, Ingo Molnar, colpatch
  Cc: Christoph Hellwig, Robert Love, Michael Hohnbaum, Andrew Theurer,
	Linus Torvalds, linux-kernel, lse-tech

> I repeated the tests with your B0 version and it's still not
> satisfying. Maybe IDLE_NODE_REBALANCE_TICK is too aggressive, or maybe
> the difference is that the other calls of load_balance() never get the
> chance to balance across nodes.

Nope, I found the problem. The topo cleanups are broken - we end up 
taking all mem accesses, etc to node 0.

Use the second half of the patch (the splitup I already posted), 
and fix the obvious compile error. Works fine now ;-)

Matt, you know the topo stuff better than anyone. Can you take a look
at the cleanup Ingo did, and see if it's easily fixable?

M.

PS. Ingo - I love the restructuring of the scheduler bits. 
I think we need a multiplier > 2 though ... I set it to 10 for now.
Tuning will tell ...


* Re: [Lse-tech] Re: [patch] sched-2.5.59-A2
  2003-01-17 19:04                                                   ` Martin J. Bligh
@ 2003-01-17 19:26                                                     ` Martin J. Bligh
  2003-01-18  0:13                                                       ` Michael Hohnbaum
  2003-01-17 23:09                                                     ` Matthew Dobson
  1 sibling, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-17 19:26 UTC (permalink / raw)
  To: Erich Focht, Ingo Molnar, colpatch
  Cc: Christoph Hellwig, Robert Love, Michael Hohnbaum, Andrew Theurer,
	Linus Torvalds, linux-kernel, lse-tech

>> I repeated the tests with your B0 version and it's still not
>> satisfying. Maybe IDLE_NODE_REBALANCE_TICK is too aggressive, or maybe
>> the difference is that the other calls of load_balance() never get the
>> chance to balance across nodes.
> 
> Nope, I found the problem. The topo cleanups are broken - we end up 
> taking all mem accesses, etc to node 0.

Kernbench:
                                   Elapsed        User      System         CPU
                        2.5.59     20.032s     186.66s      47.73s       1170%
               2.5.59-ingo-mjb     19.986s    187.044s     48.592s     1178.8%

NUMA schedbench 4:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                        2.5.59        0.00       36.38       90.70        0.62
               2.5.59-ingo-mjb        0.00       34.70       88.58        0.69

NUMA schedbench 8:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                        2.5.59        0.00       42.78      249.77        1.85
               2.5.59-ingo-mjb        0.00       49.33      256.59        1.69

NUMA schedbench 16:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                        2.5.59        0.00       56.84      848.00        2.78
               2.5.59-ingo-mjb        0.00       65.67      875.05        3.58

NUMA schedbench 32:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                        2.5.59        0.00      116.36     1807.29        5.75
               2.5.59-ingo-mjb        0.00      142.77     2039.47        8.42

NUMA schedbench 64:
                                   AvgUser     Elapsed   TotalUser    TotalSys
                        2.5.59        0.00      240.01     3634.20       14.57
               2.5.59-ingo-mjb        0.00      293.48     4534.99       20.62

System times are a little higher (multipliers are set at busy = 10,
idle = 10) .... I'll try setting the idle multiplier to 100, but
the other thing to do would be to increase the cross-node migration
resistance by setting some minimum imbalance offsets. That'll 
probably have to be node-specific ... something like the number
of cpus per node ... but probably 0 for the simple HT systems.
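
Something along these lines (purely a hypothetical sketch -- cross_node_offset()
is an invented helper, not from any posted patch) could be bolted into the
find_busiest_queue() scan to add that resistance:

	/* extra load a remote runqueue must show before we pull from it;
	 * roughly the CPU count of the local node, 0 on plain HT boxes */
	static inline int cross_node_offset(int this_cpu)
	{
		return hweight32(__cpu_to_node_mask(this_cpu));
	}

	/* ... inside the per-CPU loop, after 'load' is computed for cpu i: */
	if (__cpu_to_node(i) != __cpu_to_node(this_cpu))
		load -= cross_node_offset(this_cpu);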

M.



* Re: [patch] sched-2.5.59-A2
  2003-01-17 19:04                                                   ` Martin J. Bligh
  2003-01-17 19:26                                                     ` [Lse-tech] " Martin J. Bligh
@ 2003-01-17 23:09                                                     ` Matthew Dobson
  1 sibling, 0 replies; 96+ messages in thread
From: Matthew Dobson @ 2003-01-17 23:09 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Erich Focht, Ingo Molnar, Christoph Hellwig, Robert Love,
	Michael Hohnbaum, Andrew Theurer, Linus Torvalds, linux-kernel,
	lse-tech

Martin J. Bligh wrote:
>>I repeated the tests with your B0 version and it's still not
>>satisfying. Maybe IDLE_NODE_REBALANCE_TICK is too aggressive, or maybe
>>the difference is that the other calls of load_balance() never get the
>>chance to balance across nodes.
> 
> 
> Nope, I found the problem. The topo cleanups are broken - we end up 
> taking all mem accesses, etc to node 0.
> 
> Use the second half of the patch (the splitup I already posted), 
> and fix the obvious compile error. Works fine now ;-)
> 
> Matt, you know the topo stuff better than anyone. Can you take a look
> at the cleanup Ingo did, and see if it's easily fixable?

Umm..  most of it looks clean.  I'm not really sure what the 
__cpu_to_node_mask(cpu) macro is supposed to do; it looks to be just an 
alias for the __node_to_cpu_mask() macro, which makes little sense to 
me.  That's the only thing that immediately sticks out.  I'm doubly 
confused as to why it's defined twice in include/linux/topology.h.
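
My guess -- untested, just from reading the patch -- is that the intent was
for only the second, derived fallback to exist:

	#ifndef __cpu_to_node_mask
	#define __cpu_to_node_mask(cpu)	__node_to_cpu_mask(__cpu_to_node(cpu))
	#endif

and for the earlier default that expands to cpu_online_map to go away; as
posted, the first definition always wins, so a NUMA arch that doesn't define
__cpu_to_node_mask itself gets the mask of all online CPUs rather than its
own node's CPUs.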

-Matt

> 
> M.
> 
> PS. Ingo - I love the restructuring of the scheduler bits. 
> I think we need a multiplier > 2 though ... I set it to 10 for now.
> Tuning will tell ...
> 




* Re: [Lse-tech] Re: [patch] sched-2.5.59-A2
  2003-01-17 19:26                                                     ` [Lse-tech] " Martin J. Bligh
@ 2003-01-18  0:13                                                       ` Michael Hohnbaum
  2003-01-18 13:31                                                         ` [patch] tunable rebalance rates for sched-2.5.59-B0 Erich Focht
  2003-01-18 23:09                                                         ` [patch] sched-2.5.59-A2 Erich Focht
  0 siblings, 2 replies; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-18  0:13 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Erich Focht, Ingo Molnar, Matthew Dobson, Christoph Hellwig,
	Robert Love, Andrew Theurer, Linus Torvalds, linux-kernel,
	lse-tech

On Fri, 2003-01-17 at 11:26, Martin J. Bligh wrote:
> 
> System times are a little higher (multipliers are set at busy = 10,
> idle = 10) .... I'll try setting the idle multiplier to 100, but
> the other thing to do would be to increase the cross-node migration
> resistance by setting some minimum imbalance offsets. That'll 
> probably have to be node-specific ... something like the number
> of cpus per node ... but probably 0 for the simple HT systems.
> 
> M.

I tried several values for the multipliers (IDLE_NODE_REBALANCE_TICK
and BUSY_NODE_REBALANCE_TICK) on a 4 node NUMAQ (16 700MHz Pentium III
CPUs, 16 GB memory).  The interesting results follow:

Kernels:

stock59 - linux 2.5.59 with cputime_stats patch
ingoI50B10 - stock59 with Ingo's B0 patch as modified by Martin
             with an IDLE_NODE_REBALANCE_TICK multiplier of 50
             and a BUSY_NODE_REBALANCE_TICK multiplier of 10
ingoI2B2 - stock59 with Ingo's B0 patch with the original multipliers (2)
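
(For reference, assuming the stock i386 HZ=1000 in 2.5.59: IDLE_REBALANCE_TICK
is 1 tick = 1ms and BUSY_REBALANCE_TICK is HZ/5 = 200ms, so 50/10 means an idle
cross-node rebalance roughly every 50ms and a busy one roughly every 2s.)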

Kernbench:
                        Elapsed       User     System        CPU
          ingoI50B10    29.574s   284.922s    84.542s      1249%
             stock59    29.498s   283.976s     83.05s    1243.8%
            ingoI2B2    30.212s   289.718s    84.926s      1240%

Schedbench 4:
                        AvgUser    Elapsed  TotalUser   TotalSys
          ingoI50B10      22.49      37.04      90.00       0.86
             stock59      22.25      35.94      89.06       0.81
            ingoI2B2      24.98      40.71      99.98       1.03

Schedbench 8:
                        AvgUser    Elapsed  TotalUser   TotalSys
          ingoI50B10      31.16      49.63     249.32       1.86
             stock59      28.40      42.25     227.26       1.67
            ingoI2B2      36.81      59.38     294.57       2.02

Schedbench 16:
                        AvgUser    Elapsed  TotalUser   TotalSys
          ingoI50B10      52.73      56.61     843.91       3.56
             stock59      52.97      57.19     847.70       3.29
            ingoI2B2      57.88      71.32     926.29       5.51

Schedbench 32:
                        AvgUser    Elapsed  TotalUser   TotalSys
          ingoI50B10      58.85     139.76    1883.59       8.07
             stock59      56.57     118.05    1810.53       5.97
            ingoI2B2      61.48     137.38    1967.52      10.92

Schedbench 64:
                        AvgUser    Elapsed  TotalUser   TotalSys
          ingoI50B10      70.74     297.25    4528.25      19.09
             stock59      56.75     234.12    3632.72      15.70
            ingoI2B2      70.56     298.45    4516.67      21.41

Martin already posted numbers with 10/10 for the multipliers.

It looks like on the NUMAQ the 50/10 values give the best results,
at least on these tests.  I suspect that on other architectures
the optimum numbers will vary.  Most likely on machines with lower
latency to off node memory, lower multipliers will help.  It will
be interesting to see numbers from Erich from the NEC IA64 boxes.
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486



* Re: [patch] sched-2.5.59-A2
  2003-01-17 15:11                                                 ` Ingo Molnar
                                                                     ` (2 preceding siblings ...)
  2003-01-17 18:19                                                   ` [patch] sched-2.5.59-A2 Michael Hohnbaum
@ 2003-01-18  7:08                                                   ` William Lee Irwin III
  2003-01-18  8:12                                                     ` Martin J. Bligh
  2003-01-19  4:22                                                     ` William Lee Irwin III
  3 siblings, 2 replies; 96+ messages in thread
From: William Lee Irwin III @ 2003-01-18  7:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: akpm, davej, ak, Erich Focht, Martin J. Bligh, Christoph Hellwig,
	Robert Love, Michael Hohnbaum, Andrew Theurer, Linus Torvalds,
	linux-kernel, lse-tech

On Fri, Jan 17, 2003 at 04:11:59PM +0100, Ingo Molnar wrote:
> agreed, i've attached the -B0 patch that does this. The balancing rates
> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
> busy-global).

I suspect some of these results may be off on NUMA-Q (or any PAE box)
if CONFIG_MTRR was enabled. Michael, Martin, please doublecheck
/proc/mtrr and whether CONFIG_MTRR=y. If you didn't enable it, or if
your compile times aren't on the order of 5-10 minutes, you're unaffected.

The severity of the MTRR regression in 2.5.59 is apparent from:
$ cat /proc/mtrr
reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
reg01: base=0x00000000 (   0MB), size=4096MB: write-back, count=1
$ time make -j bzImage > /dev/null
make -j bzImage > /dev/null  8338.52s user 245.73s system 1268% cpu 11:16.56 total

Fixing it up by hand (after dealing with various bits of pain) to:
$ cat /proc/mtrr
reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
reg01: base=0x00000000 (   0MB), size=4096MB: write-back, count=1
reg02: base=0x100000000 (4096MB), size=4096MB: write-back, count=1
reg03: base=0x200000000 (8192MB), size=4096MB: write-back, count=1
reg04: base=0x300000000 (12288MB), size=4096MB: write-back, count=1
reg05: base=0x400000000 (16384MB), size=16384MB: write-back, count=1
reg06: base=0x800000000 (32768MB), size=16384MB: write-back, count=1
reg07: base=0xc00000000 (49152MB), size=16384MB: write-back, count=1

make -j bzImage > /dev/null  361.72s user 546.28s system 2208% cpu 41.109 total
make -j bzImage > /dev/null  364.00s user 575.73s system 2005% cpu 46.858 total
make -j bzImage > /dev/null  366.77s user 568.44s system 2239% cpu 41.765 total

I'll do some bisection search to figure out which patch broke the world.

-- wli


* Re: [patch] sched-2.5.59-A2
  2003-01-18  7:08                                                   ` William Lee Irwin III
@ 2003-01-18  8:12                                                     ` Martin J. Bligh
  2003-01-18  8:16                                                       ` William Lee Irwin III
  2003-01-19  4:22                                                     ` William Lee Irwin III
  1 sibling, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-18  8:12 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel

>> agreed, i've attached the -B0 patch that does this. The balancing rates
>> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
>> busy-global).
> 
> I suspect some of these results may be off on NUMA-Q (or any PAE box)
> if CONFIG_MTRR was enabled. Michael, Martin, please doublecheck
> /proc/mtrr and whether CONFIG_MTRR=y. If you didn't enable it, or if
> your compile times aren't on the order of 5-10 minutes, you're unaffected.
> 
> The severity of the MTRR regression in 2.5.59 is apparent from:
> $ cat /proc/mtrr
> reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
> reg01: base=0x00000000 (   0MB), size=4096MB: write-back, count=1

Works for me, I have MTRR on.

larry:~# cat /proc/mtrr
reg00: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1
reg01: base=0x00000000 (   0MB), size=16384MB: write-back, count=1




* Re: [patch] sched-2.5.59-A2
  2003-01-18  8:12                                                     ` Martin J. Bligh
@ 2003-01-18  8:16                                                       ` William Lee Irwin III
  0 siblings, 0 replies; 96+ messages in thread
From: William Lee Irwin III @ 2003-01-18  8:16 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

At some point in the past, I wrote:
>> I suspect some of these results may be off on NUMA-Q (or any PAE box)
>> if CONFIG_MTRR was enabled. Michael, Martin, please doublecheck
>> /proc/mtrr and whether CONFIG_MTRR=y. If you didn't enable it, or if
>> your compile times aren't on the order of 5-10 minutes, you're unaffected.
>> The severity of the MTRR regression in 2.5.59 is apparent from:
>> $ cat /proc/mtrr
>> reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
>> reg01: base=0x00000000 (   0MB), size=4096MB: write-back, count=1

On Sat, Jan 18, 2003 at 12:12:31AM -0800, Martin J. Bligh wrote:
> Works for me, I have MTRR on.
> larry:~# cat /proc/mtrr
> reg00: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1
> reg01: base=0x00000000 (   0MB), size=16384MB: write-back, count=1

Okay, it sounds like the problem needs some extra RAM to trigger. We
can bounce quads back & forth if need be, but I'll at least take a shot
at finding where it happened before you probably need to look into it.


Thanks,
Bill


* [patch] tunable rebalance rates for sched-2.5.59-B0
  2003-01-18  0:13                                                       ` Michael Hohnbaum
@ 2003-01-18 13:31                                                         ` Erich Focht
  2003-01-18 23:09                                                         ` [patch] sched-2.5.59-A2 Erich Focht
  1 sibling, 0 replies; 96+ messages in thread
From: Erich Focht @ 2003-01-18 13:31 UTC (permalink / raw)
  To: Michael Hohnbaum, Martin J. Bligh, Ingo Molnar
  Cc: Matthew Dobson, Christoph Hellwig, Robert Love, Andrew Theurer,
	Linus Torvalds, linux-kernel, lse-tech

[-- Attachment #1: Type: text/plain, Size: 775 bytes --]

Hi,

I'm currently scanning the parameter space of IDLE_NODE_REBALANCE_TICK
and BUSY_NODE_REBALANCE_TICK with the help of tunable rebalance
rates. The patch basically does:

-#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 2)
-#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
+int idle_nodebalance_rate = 10;
+int busy_nodebalance_rate = 10;
+#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * idle_nodebalance_rate)
+#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * busy_nodebalance_rate)

and makes the variables accessible in /proc/sys/kernel
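(The rates can then be changed at run time, e.g. with
"echo 20 > /proc/sys/kernel/idle_nodebalance_rate" -- the value 20 is just an
arbitrary example.)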

We might want to leave these tunable in case it turns out that
different platforms need significantly different values. Right now
it's just a tool for tuning.

Regards,
Erich


[-- Attachment #2: tunable-balance-rate-2.5.59 --]
[-- Type: text/x-diff, Size: 2456 bytes --]

diff -urNp 2.5.59-B0/include/linux/sysctl.h 2.5.59-B0-tune/include/linux/sysctl.h
--- 2.5.59-B0/include/linux/sysctl.h	2003-01-17 03:22:16.000000000 +0100
+++ 2.5.59-B0-tune/include/linux/sysctl.h	2003-01-18 12:49:17.000000000 +0100
@@ -129,6 +129,8 @@ enum
 	KERN_CADPID=54,		/* int: PID of the process to notify on CAD */
 	KERN_PIDMAX=55,		/* int: PID # limit */
   	KERN_CORE_PATTERN=56,	/* string: pattern for core-file names */
+  	KERN_NODBALI=57,	/* int: idle cross-node balance rate */
+  	KERN_NODBALB=58,	/* int: busy cross-node balance rate */
 };
 
 
diff -urNp 2.5.59-B0/kernel/sched.c 2.5.59-B0-tune/kernel/sched.c
--- 2.5.59-B0/kernel/sched.c	2003-01-18 11:50:23.000000000 +0100
+++ 2.5.59-B0-tune/kernel/sched.c	2003-01-18 12:01:03.000000000 +0100
@@ -984,12 +984,14 @@ out:
  * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
  *
- * On NUMA, do a node-rebalance every 400 msecs.
+ * On NUMA, do a node-rebalance every 10ms (idle) or 2 secs (busy).
  */
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
 #define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
-#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 2)
-#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
+int idle_nodebalance_rate = 10;
+int busy_nodebalance_rate = 10;
+#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * idle_nodebalance_rate)
+#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * busy_nodebalance_rate)
 
 #if CONFIG_NUMA
 static void balance_node(runqueue_t *this_rq, int idle, int this_cpu)
diff -urNp 2.5.59-B0/kernel/sysctl.c 2.5.59-B0-tune/kernel/sysctl.c
--- 2.5.59-B0/kernel/sysctl.c	2003-01-17 03:21:39.000000000 +0100
+++ 2.5.59-B0-tune/kernel/sysctl.c	2003-01-18 11:59:53.000000000 +0100
@@ -55,6 +55,8 @@ extern char core_pattern[];
 extern int cad_pid;
 extern int pid_max;
 extern int sysctl_lower_zone_protection;
+extern int idle_nodebalance_rate;
+extern int busy_nodebalance_rate;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -261,6 +263,10 @@ static ctl_table kern_table[] = {
 #endif
 	{KERN_PIDMAX, "pid_max", &pid_max, sizeof (int),
 	 0600, NULL, &proc_dointvec},
+	{KERN_NODBALI, "idle_nodebalance_rate", &idle_nodebalance_rate,
+	 sizeof (int), 0600, NULL, &proc_dointvec},
+	{KERN_NODBALB, "busy_nodebalance_rate", &busy_nodebalance_rate,
+	 sizeof (int), 0600, NULL, &proc_dointvec},
 	{0}
 };
 


* NUMA sched -> pooling scheduler (inc HT)
  2003-01-17 16:58                                                   ` Martin J. Bligh
@ 2003-01-18 20:54                                                     ` Martin J. Bligh
  2003-01-18 21:34                                                       ` [Lse-tech] " Martin J. Bligh
  0 siblings, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-18 20:54 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Christoph Hellwig, Robert Love, Michael Hohnbaum, Andrew Theurer,
	linux-kernel, lse-tech, Erich Focht, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 984 bytes --]

Andrew, hopefully this'll give you a cleaner integration point to do 
the HT scheduler stuff ... I basically did a rename of "node" to "pool" 
in sched.c (OK, it was a little more complex than that), and provided
some hooks for you to attach to. There's a really hacky version of
the HT stuff in there that I doubt works at all. (sched.h will need
something other than CONFIG_SCHED_NUMA, for starters).

It's not really finished, but I have to go out ... I thought you or 
someone else might like to have a play with it in the meantime. 
It goes on top of the second half of Ingo's stuff from yesterday 
(also attached).

I think this should result in a much cleaner integration between the
HT-aware stuff and the NUMA stuff. Pools are a concept Erich had in his
scheduler a while back, but it got set aside in the paring down for
integration. We should be able to add multiple levels to this fairly
easily at some point (e.g. HT + NUMA), but let's get the basics working
first ;-)

M.

[-- Attachment #2: 01-ingo --]
[-- Type: application/octet-stream, Size: 6829 bytes --]

diff -urpN -X /home/fletch/.diff.exclude 00-virgin/kernel/sched.c 01-ingo/kernel/sched.c
--- 00-virgin/kernel/sched.c	Fri Jan 17 09:18:32 2003
+++ 01-ingo/kernel/sched.c	Sat Jan 18 10:58:57 2003
@@ -153,10 +153,9 @@ struct runqueue {
 			nr_uninterruptible;
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
-	int prev_nr_running[NR_CPUS];
+	int prev_cpu_load[NR_CPUS];
 #ifdef CONFIG_NUMA
 	atomic_t *node_nr_running;
-	unsigned int nr_balanced;
 	int prev_node_load[MAX_NUMNODES];
 #endif
 	task_t *migration_thread;
@@ -765,29 +764,6 @@ static int find_busiest_node(int this_no
 	return node;
 }
 
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
-	int this_node = __cpu_to_node(this_cpu);
-	/*
-	 * Avoid rebalancing between nodes too often.
-	 * We rebalance globally once every NODE_BALANCE_RATE load balances.
-	 */
-	if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
-		int node = find_busiest_node(this_node);
-		this_rq->nr_balanced = 0;
-		if (node >= 0)
-			return (__node_to_cpu_mask(node) | (1UL << this_cpu));
-	}
-	return __node_to_cpu_mask(this_node);
-}
-
-#else /* !CONFIG_NUMA */
-
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
-	return cpu_online_map;
-}
-
 #endif /* CONFIG_NUMA */
 
 #if CONFIG_SMP
@@ -807,10 +783,10 @@ static inline unsigned int double_lock_b
 			spin_lock(&busiest->lock);
 			spin_lock(&this_rq->lock);
 			/* Need to recalculate nr_running */
-			if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+			if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
 				nr_running = this_rq->nr_running;
 			else
-				nr_running = this_rq->prev_nr_running[this_cpu];
+				nr_running = this_rq->prev_cpu_load[this_cpu];
 		} else
 			spin_lock(&busiest->lock);
 	}
@@ -847,10 +823,10 @@ static inline runqueue_t *find_busiest_q
 	 * that case we are less picky about moving a task across CPUs and
 	 * take what can be taken.
 	 */
-	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+	if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
 		nr_running = this_rq->nr_running;
 	else
-		nr_running = this_rq->prev_nr_running[this_cpu];
+		nr_running = this_rq->prev_cpu_load[this_cpu];
 
 	busiest = NULL;
 	max_load = 1;
@@ -859,11 +835,11 @@ static inline runqueue_t *find_busiest_q
 			continue;
 
 		rq_src = cpu_rq(i);
-		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+		if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
 			load = rq_src->nr_running;
 		else
-			load = this_rq->prev_nr_running[i];
-		this_rq->prev_nr_running[i] = rq_src->nr_running;
+			load = this_rq->prev_cpu_load[i];
+		this_rq->prev_cpu_load[i] = rq_src->nr_running;
 
 		if ((load > max_load) && (rq_src != this_rq)) {
 			busiest = rq_src;
@@ -922,7 +898,7 @@ static inline void pull_task(runqueue_t 
  * We call this with the current runqueue locked,
  * irqs disabled.
  */
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int idle, unsigned long cpumask)
 {
 	int imbalance, idx, this_cpu = smp_processor_id();
 	runqueue_t *busiest;
@@ -930,8 +906,7 @@ static void load_balance(runqueue_t *thi
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
-					cpus_to_balance(this_cpu, this_rq));
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
 	if (!busiest)
 		goto out;
 
@@ -1006,21 +981,75 @@ out:
  * frequency and balancing agressivity depends on whether the CPU is
  * idle or not.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
+ *
+ * On NUMA, do a node-rebalance every 400 msecs.
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 2)
+#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
 
-static inline void idle_tick(runqueue_t *rq)
+#if CONFIG_NUMA
+static void balance_node(runqueue_t *this_rq, int idle, int this_cpu)
 {
-	if (jiffies % IDLE_REBALANCE_TICK)
-		return;
-	spin_lock(&rq->lock);
-	load_balance(rq, 1);
-	spin_unlock(&rq->lock);
+	int node = find_busiest_node(__cpu_to_node(this_cpu));
+	unsigned long cpumask, this_cpumask = 1UL << this_cpu;
+
+	if (node >= 0) {
+		cpumask = __node_to_cpu_mask(node) | this_cpumask;
+		spin_lock(&this_rq->lock);
+		load_balance(this_rq, idle, cpumask);
+		spin_unlock(&this_rq->lock);
+	}
 }
+#endif
 
+static void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+#if CONFIG_NUMA
+	int this_cpu = smp_processor_id();
+#endif
+	unsigned long j = jiffies;
+
+	/*
+	 * First do inter-node rebalancing, then intra-node rebalancing,
+	 * if both events happen in the same tick. The inter-node
+	 * rebalancing does not necessarily have to create a perfect
+	 * balance within the node, since we load-balance the most loaded
+	 * node with the current CPU. (ie. other CPUs in the local node
+	 * are not balanced.)
+	 */
+	if (idle) {
+#if CONFIG_NUMA
+		if (!(j % IDLE_NODE_REBALANCE_TICK))
+			balance_node(this_rq, idle, this_cpu);
+#endif
+		if (!(j % IDLE_REBALANCE_TICK)) {
+			spin_lock(&this_rq->lock);
+			load_balance(this_rq, 0, __cpu_to_node_mask(this_cpu));
+			spin_unlock(&this_rq->lock);
+		}
+		return;
+	}
+#if CONFIG_NUMA
+	if (!(j % BUSY_NODE_REBALANCE_TICK))
+		balance_node(this_rq, idle, this_cpu);
+#endif
+	if (!(j % BUSY_REBALANCE_TICK)) {
+		spin_lock(&this_rq->lock);
+		load_balance(this_rq, idle, __cpu_to_node_mask(this_cpu));
+		spin_unlock(&this_rq->lock);
+	}
+}
+#else
+/*
+ * on UP we do not need to balance between CPUs:
+ */
+static inline void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+}
 #endif
 
 DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
@@ -1063,9 +1092,7 @@ void scheduler_tick(int user_ticks, int 
 			kstat_cpu(cpu).cpustat.iowait += sys_ticks;
 		else
 			kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
-		idle_tick(rq);
-#endif
+		rebalance_tick(rq, 1);
 		return;
 	}
 	if (TASK_NICE(p) > 0)
@@ -1121,11 +1148,8 @@ void scheduler_tick(int user_ticks, int 
 			enqueue_task(p, rq->active);
 	}
 out:
-#if CONFIG_SMP
-	if (!(jiffies % BUSY_REBALANCE_TICK))
-		load_balance(rq, 0);
-#endif
 	spin_unlock(&rq->lock);
+	rebalance_tick(rq, 0);
 }
 
 void scheduling_functions_start_here(void) { }
@@ -1184,7 +1208,7 @@ need_resched:
 pick_next_task:
 	if (unlikely(!rq->nr_running)) {
 #if CONFIG_SMP
-		load_balance(rq, 1);
+		load_balance(rq, 1, __cpu_to_node_mask(smp_processor_id()));
 		if (rq->nr_running)
 			goto pick_next_task;
 #endif

[-- Attachment #3: 02-pools --]
[-- Type: application/octet-stream, Size: 12912 bytes --]

diff -urpN -X /home/fletch/.diff.exclude 01-ingo/arch/i386/Kconfig 02-pools/arch/i386/Kconfig
--- 01-ingo/arch/i386/Kconfig	Fri Jan 17 09:18:19 2003
+++ 02-pools/arch/i386/Kconfig	Sat Jan 18 11:59:54 2003
@@ -476,6 +476,11 @@ config NUMA
 	bool "Numa Memory Allocation Support"
 	depends on X86_NUMAQ
 
+config SCHED_NUMA
+	bool "NUMA aware scheduler"
+	depends on NUMA
+	default y
+
 config DISCONTIGMEM
 	bool
 	depends on NUMA
diff -urpN -X /home/fletch/.diff.exclude 01-ingo/arch/ia64/Kconfig 02-pools/arch/ia64/Kconfig
--- 01-ingo/arch/ia64/Kconfig	Thu Jan  9 19:15:56 2003
+++ 02-pools/arch/ia64/Kconfig	Sat Jan 18 12:00:08 2003
@@ -246,6 +246,11 @@ config DISCONTIGMEM
 	  or have huge holes in the physical address space for other reasons.
 	  See <file:Documentation/vm/numa> for more.
 
+config SCHED_NUMA
+	bool "NUMA aware scheduler"
+	depends on NUMA
+	default y
+
 config VIRTUAL_MEM_MAP
 	bool "Enable Virtual Mem Map"
 	depends on !NUMA
diff -urpN -X /home/fletch/.diff.exclude 01-ingo/include/linux/sched.h 02-pools/include/linux/sched.h
--- 01-ingo/include/linux/sched.h	Fri Jan 17 09:18:32 2003
+++ 02-pools/include/linux/sched.h	Sat Jan 18 12:21:09 2003
@@ -447,12 +447,12 @@ extern void set_cpus_allowed(task_t *p, 
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
-#ifdef CONFIG_NUMA
+#ifdef CONFIG_SCHED_NUMA
 extern void sched_balance_exec(void);
-extern void node_nr_running_init(void);
+extern void pool_nr_running_init(void);
 #else
 #define sched_balance_exec()   {}
-#define node_nr_running_init() {}
+#define pool_nr_running_init() {}
 #endif
 
 extern void set_user_nice(task_t *p, long nice);
diff -urpN -X /home/fletch/.diff.exclude 01-ingo/include/linux/sched_topo_ht.h 02-pools/include/linux/sched_topo_ht.h
--- 01-ingo/include/linux/sched_topo_ht.h	Wed Dec 31 16:00:00 1969
+++ 02-pools/include/linux/sched_topo_ht.h	Sat Jan 18 12:20:00 2003
@@ -0,0 +1,17 @@
+#define CONFIG_SCHED_POOLS 1               /* should be a real config option */
+
+/* 
+ * The following is a temporary hack, for which I make no apologies - mbligh
+ * Assumes CPUs are paired together siblings (0,1) (2,3) (4,5) .... etc.
+ * We should probably do this in an arch topo file and use apicids.
+ */
+
+#define MAX_NUMPOOLS NR_CPUS
+#define numpools (num_online_cpus / 2)
+
+#define pool_to_cpu_mask(pool)	( (1UL << (pool*2)) || (1UL << (pool*2+1)) )
+#define cpu_to_pool(cpu)	(cpu / 2)
+#define cpu_to_pool_mask(cpu)	(pool_to_cpu_mask(cpu_to_pool(cpu)))
+
+#define IDLE_REBALANCE_RATIO 2
+#define BUSY_REBALANCE_RATIO 2
diff -urpN -X /home/fletch/.diff.exclude 01-ingo/include/linux/sched_topo_numa.h 02-pools/include/linux/sched_topo_numa.h
--- 01-ingo/include/linux/sched_topo_numa.h	Wed Dec 31 16:00:00 1969
+++ 02-pools/include/linux/sched_topo_numa.h	Sat Jan 18 12:20:05 2003
@@ -0,0 +1,11 @@
+#define CONFIG_SCHED_POOLS 1               /* should be a real config option */
+
+#define MAX_NUMPOOLS MAX_NUMNODES
+#define numpools numnodes
+
+#define pool_to_cpu_mask	__node_to_cpu_mask
+#define cpu_to_pool		__cpu_to_node
+#define cpu_to_pool_mask(cpu)	(__node_to_cpu_mask(__cpu_to_node(cpu)))
+
+#define IDLE_REBALANCE_RATIO 10
+#define BUSY_REBALANCE_RATIO 5
diff -urpN -X /home/fletch/.diff.exclude 01-ingo/include/linux/sched_topology.h 02-pools/include/linux/sched_topology.h
--- 01-ingo/include/linux/sched_topology.h	Wed Dec 31 16:00:00 1969
+++ 02-pools/include/linux/sched_topology.h	Sat Jan 18 11:59:36 2003
@@ -0,0 +1,14 @@
+#ifndef _LINUX_SCHED_TOPOLOGY_H
+#define _LINUX_SCHED_TOPOLOGY_H
+
+#ifdef CONFIG_SCHED_TOPO_ARCH
+#include <asm/sched_topo.h>
+#elif CONFIG_SCHED_NUMA
+#include <linux/sched_topo_numa.h>
+#elif CONFIG_SCHED_TOPO_HT
+#include <linux/sched_topo_ht.h>
+#else
+#include <linux/sched_topo_flat.h>
+#endif
+
+#endif /* _LINUX_SCHED_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude 01-ingo/init/main.c 02-pools/init/main.c
--- 01-ingo/init/main.c	Fri Jan 17 09:18:32 2003
+++ 02-pools/init/main.c	Sat Jan 18 11:48:10 2003
@@ -495,7 +495,7 @@ static void do_pre_smp_initcalls(void)
 
 	migration_init();
 #endif
-	node_nr_running_init();
+	pool_nr_running_init();
 	spawn_ksoftirqd();
 }
 
diff -urpN -X /home/fletch/.diff.exclude 01-ingo/kernel/sched.c 02-pools/kernel/sched.c
--- 01-ingo/kernel/sched.c	Sat Jan 18 10:58:57 2003
+++ 02-pools/kernel/sched.c	Sat Jan 18 11:49:00 2003
@@ -32,6 +32,7 @@
 #include <linux/delay.h>
 #include <linux/timer.h>
 #include <linux/rcupdate.h>
+#include <linux/sched_topology.h>
 
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
@@ -67,7 +68,7 @@
 #define INTERACTIVE_DELTA	2
 #define MAX_SLEEP_AVG		(2*HZ)
 #define STARVATION_LIMIT	(2*HZ)
-#define NODE_THRESHOLD          125
+#define POOL_THRESHOLD          125
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -154,9 +155,9 @@ struct runqueue {
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_cpu_load[NR_CPUS];
-#ifdef CONFIG_NUMA
-	atomic_t *node_nr_running;
-	int prev_node_load[MAX_NUMNODES];
+#ifdef CONFIG_SCHED_POOLS
+	atomic_t *pool_nr_running;
+	int prev_pool_load[MAX_NUMPOOLS];
 #endif
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -181,47 +182,47 @@ static struct runqueue runqueues[NR_CPUS
 # define task_running(rq, p)		((rq)->curr == (p))
 #endif
 
-#ifdef CONFIG_NUMA
+#ifdef CONFIG_SCHED_POOLS
 
 /*
  * Keep track of running tasks.
  */
 
-static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
-	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+static atomic_t pool_nr_running[MAX_NUMPOOLS] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMPOOLS-1] = ATOMIC_INIT(0)};
 
 static inline void nr_running_init(struct runqueue *rq)
 {
-	rq->node_nr_running = &node_nr_running[0];
+	rq->pool_nr_running = &pool_nr_running[0];
 }
 
 static inline void nr_running_inc(runqueue_t *rq)
 {
-	atomic_inc(rq->node_nr_running);
+	atomic_inc(rq->pool_nr_running);
 	rq->nr_running++;
 }
 
 static inline void nr_running_dec(runqueue_t *rq)
 {
-	atomic_dec(rq->node_nr_running);
+	atomic_dec(rq->pool_nr_running);
 	rq->nr_running--;
 }
 
-__init void node_nr_running_init(void)
+__init void pool_nr_running_init(void)
 {
 	int i;
 
 	for (i = 0; i < NR_CPUS; i++)
-		cpu_rq(i)->node_nr_running = &node_nr_running[__cpu_to_node(i)];
+		cpu_rq(i)->pool_nr_running = &pool_nr_running[cpu_to_pool(i)];
 }
 
-#else /* !CONFIG_NUMA */
+#else /* !CONFIG_SCHED_POOLS */
 
 # define nr_running_init(rq)   do { } while (0)
 # define nr_running_inc(rq)    do { (rq)->nr_running++; } while (0)
 # define nr_running_dec(rq)    do { (rq)->nr_running--; } while (0)
 
-#endif /* CONFIG_NUMA */
+#endif /* CONFIG_SCHED_POOLS */
 
 /*
  * task_rq_lock - lock the runqueue a given task resides on and disable
@@ -670,7 +671,7 @@ static inline void double_rq_unlock(runq
 		spin_unlock(&rq2->lock);
 }
 
-#if CONFIG_NUMA
+#if CONFIG_SCHED_POOLS
 /*
  * If dest_cpu is allowed for this process, migrate the task to it.
  * This is accomplished by forcing the cpu_allowed mask to only
@@ -697,7 +698,7 @@ static void sched_migrate_task(task_t *p
  */
 static int sched_best_cpu(struct task_struct *p)
 {
-	int i, minload, load, best_cpu, node = 0;
+	int i, minload, load, best_cpu, pool = 0;
 	unsigned long cpumask;
 
 	best_cpu = task_cpu(p);
@@ -705,16 +706,16 @@ static int sched_best_cpu(struct task_st
 		return best_cpu;
 
 	minload = 10000000;
-	for (i = 0; i < numnodes; i++) {
-		load = atomic_read(&node_nr_running[i]);
+	for (i = 0; i < numpools; i++) {
+		load = atomic_read(&pool_nr_running[i]);
 		if (load < minload) {
 			minload = load;
-			node = i;
+			pool = i;
 		}
 	}
 
 	minload = 10000000;
-	cpumask = __node_to_cpu_mask(node);
+	cpumask = pool_to_cpu_mask(pool);
 	for (i = 0; i < NR_CPUS; ++i) {
 		if (!(cpumask & (1UL << i)))
 			continue;
@@ -730,7 +731,7 @@ void sched_balance_exec(void)
 {
 	int new_cpu;
 
-	if (numnodes > 1) {
+	if (numpools > 1) {
 		new_cpu = sched_best_cpu(current);
 		if (new_cpu != smp_processor_id())
 			sched_migrate_task(current, new_cpu);
@@ -738,33 +739,33 @@ void sched_balance_exec(void)
 }
 
 /*
- * Find the busiest node. All previous node loads contribute with a 
+ * Find the busiest pool. All previous pool loads contribute with a 
  * geometrically deccaying weight to the load measure:
- *      load_{t} = load_{t-1}/2 + nr_node_running_{t}
+ *      load_{t} = load_{t-1}/2 + nr_pool_running_{t}
  * This way sudden load peaks are flattened out a bit.
  */
-static int find_busiest_node(int this_node)
+static int find_busiest_pool(int this_pool)
 {
-	int i, node = -1, load, this_load, maxload;
+	int i, pool = -1, load, this_load, maxload;
 	
-	this_load = maxload = (this_rq()->prev_node_load[this_node] >> 1)
-		+ atomic_read(&node_nr_running[this_node]);
-	this_rq()->prev_node_load[this_node] = this_load;
-	for (i = 0; i < numnodes; i++) {
-		if (i == this_node)
+	this_load = maxload = (this_rq()->prev_pool_load[this_pool] >> 1)
+		+ atomic_read(&pool_nr_running[this_pool]);
+	this_rq()->prev_pool_load[this_pool] = this_load;
+	for (i = 0; i < numpools; i++) {
+		if (i == this_pool)
 			continue;
-		load = (this_rq()->prev_node_load[i] >> 1)
-			+ atomic_read(&node_nr_running[i]);
-		this_rq()->prev_node_load[i] = load;
-		if (load > maxload && (100*load > NODE_THRESHOLD*this_load)) {
+		load = (this_rq()->prev_pool_load[i] >> 1)
+			+ atomic_read(&pool_nr_running[i]);
+		this_rq()->prev_pool_load[i] = load;
+		if (load > maxload && (100*load > POOL_THRESHOLD*this_load)) {
 			maxload = load;
-			node = i;
+			pool = i;
 		}
 	}
-	return node;
+	return pool;
 }
 
-#endif /* CONFIG_NUMA */
+#endif /* CONFIG_SCHED_POOLS */
 
 #if CONFIG_SMP
 
@@ -983,22 +984,20 @@ out:
  *
  * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
- *
- * On NUMA, do a node-rebalance every 400 msecs.
  */
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
 #define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
-#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 2)
-#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
+#define IDLE_POOL_REBALANCE_TICK (IDLE_REBALANCE_TICK * IDLE_REBALANCE_RATIO)
+#define BUSY_POOL_REBALANCE_TICK (BUSY_REBALANCE_TICK * BUSY_REBALANCE_RATIO)
 
-#if CONFIG_NUMA
-static void balance_node(runqueue_t *this_rq, int idle, int this_cpu)
+#if CONFIG_SCHED_POOLS
+static void balance_pool(runqueue_t *this_rq, int idle, int this_cpu)
 {
-	int node = find_busiest_node(__cpu_to_node(this_cpu));
+	int pool = find_busiest_pool(cpu_to_pool(this_cpu));
 	unsigned long cpumask, this_cpumask = 1UL << this_cpu;
 
-	if (node >= 0) {
-		cpumask = __node_to_cpu_mask(node) | this_cpumask;
+	if (pool >= 0) {
+		cpumask = pool_to_cpu_mask(pool) | this_cpumask;
 		spin_lock(&this_rq->lock);
 		load_balance(this_rq, idle, cpumask);
 		spin_unlock(&this_rq->lock);
@@ -1008,38 +1007,38 @@ static void balance_node(runqueue_t *thi
 
 static void rebalance_tick(runqueue_t *this_rq, int idle)
 {
-#if CONFIG_NUMA
+#if CONFIG_SCHED_POOLS
 	int this_cpu = smp_processor_id();
 #endif
 	unsigned long j = jiffies;
 
 	/*
-	 * First do inter-node rebalancing, then intra-node rebalancing,
-	 * if both events happen in the same tick. The inter-node
+	 * First do inter-pool rebalancing, then intra-pool rebalancing,
+	 * if both events happen in the same tick. The inter-pool
 	 * rebalancing does not necessarily have to create a perfect
-	 * balance within the node, since we load-balance the most loaded
-	 * node with the current CPU. (ie. other CPUs in the local node
+	 * balance within the pool, since we load-balance the most loaded
+	 * pool with the current CPU. (ie. other CPUs in the local pool
 	 * are not balanced.)
 	 */
 	if (idle) {
-#if CONFIG_NUMA
-		if (!(j % IDLE_NODE_REBALANCE_TICK))
-			balance_node(this_rq, idle, this_cpu);
+#if CONFIG_SCHED_POOLS
+		if (!(j % IDLE_POOL_REBALANCE_TICK))
+			balance_pool(this_rq, idle, this_cpu);
 #endif
 		if (!(j % IDLE_REBALANCE_TICK)) {
 			spin_lock(&this_rq->lock);
-			load_balance(this_rq, 0, __cpu_to_node_mask(this_cpu));
+			load_balance(this_rq, 0, cpu_to_pool_mask(this_cpu));
 			spin_unlock(&this_rq->lock);
 		}
 		return;
 	}
-#if CONFIG_NUMA
-	if (!(j % BUSY_NODE_REBALANCE_TICK))
-		balance_node(this_rq, idle, this_cpu);
+#if CONFIG_SCHED_POOLS
+	if (!(j % BUSY_POOL_REBALANCE_TICK))
+		balance_pool(this_rq, idle, this_cpu);
 #endif
 	if (!(j % BUSY_REBALANCE_TICK)) {
 		spin_lock(&this_rq->lock);
-		load_balance(this_rq, idle, __cpu_to_node_mask(this_cpu));
+		load_balance(this_rq, idle, cpu_to_pool_mask(this_cpu));
 		spin_unlock(&this_rq->lock);
 	}
 }
@@ -1208,7 +1207,7 @@ need_resched:
 pick_next_task:
 	if (unlikely(!rq->nr_running)) {
 #if CONFIG_SMP
-		load_balance(rq, 1, __cpu_to_node_mask(smp_processor_id()));
+		load_balance(rq, 1, cpu_to_pool_mask(smp_processor_id()));
 		if (rq->nr_running)
 			goto pick_next_task;
 #endif


* Re: [Lse-tech] NUMA sched -> pooling scheduler (inc HT)
  2003-01-18 20:54                                                     ` NUMA sched -> pooling scheduler (inc HT) Martin J. Bligh
@ 2003-01-18 21:34                                                       ` Martin J. Bligh
  2003-01-19  0:13                                                         ` Andrew Theurer
  0 siblings, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-18 21:34 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Christoph Hellwig, Robert Love, Michael Hohnbaum, linux-kernel,
	lse-tech, Erich Focht, Ingo Molnar

Mmm... seems I may have got the ordering of the cpus wrong.
Something like this might work better in sched_topo_ht.h
(yeah, it's ugly. I don't care).

static inline unsigned long pool_to_cpu_mask (int pool)
{
	/* the pool's leading CPU plus its HT sibling */
	return ((1UL << pool) | (1UL << cpu_sibling_map[pool]));
}

static inline int cpu_to_pool (int cpu)
{
	/* the lower-numbered CPU of the sibling pair names the pool */
	return min(cpu, cpu_sibling_map[cpu]);
}

Thanks to Andi, Zwane, and Bill for the corrective baseball bat strike.
I changed the macros to inlines to avoid the risk of double eval.
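(To illustrate the double-eval risk: as a macro, cpu_to_pool(smp_processor_id())
would expand to min(smp_processor_id(), cpu_sibling_map[smp_processor_id()])
and call smp_processor_id() twice; the inline evaluates its argument once.)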

M.



* Re: [patch] sched-2.5.59-A2
  2003-01-18  0:13                                                       ` Michael Hohnbaum
  2003-01-18 13:31                                                         ` [patch] tunable rebalance rates for sched-2.5.59-B0 Erich Focht
@ 2003-01-18 23:09                                                         ` Erich Focht
  2003-01-20  9:28                                                           ` Ingo Molnar
  1 sibling, 1 reply; 96+ messages in thread
From: Erich Focht @ 2003-01-18 23:09 UTC (permalink / raw)
  To: Michael Hohnbaum, Martin J. Bligh, Ingo Molnar
  Cc: Matthew Dobson, Christoph Hellwig, Robert Love, Andrew Theurer,
	Linus Torvalds, linux-kernel, lse-tech

The scan through a piece of the parameter space delivered quite
inconclusive results. I used the IDLE_NODE_REBALANCE_TICK multipliers
2, 5, 10, 20, 50 and the BUSY_NODE_REBALANCE_TICK multipliers 2, 5,
10, 20.

The benchmarks I tried were kernbench (average and error of 5 kernel
compiles) and hackbench (5 runs for each number of chatter groups:
10, 25, 50, 100). The 2.5.59 scheduler result is printed first, then a
matrix with all combinations of idle and busy rebalance
multipliers. Each value is followed by its standard error (coming from
the 5 measurements). I didn't measure numa_bench; those values depend
mostly on the initial load balancing and showed no clear
tendency/difference.

The machine is an NEC TX7 (small version, 8 Itanium2 CPUs in 4 nodes).

The results:
- kernbench UserTime is best for the 2.5.59 scheduler (623s); IngoB0's
best value is 627.33s, at idle=20ms, busy=2000ms.
- hackbench: the 2.5.59 scheduler is significantly better in all
measurements.

I suppose this comes from the fact that in 2.5.59 the load_balance()
call from schedule(), when a CPU goes idle, still goes through
cpus_to_balance() and so gets the chance to steal from a remote node,
while in B0 that call only ever sees the local node's mask. No idea
what other reason it could be... does anybody else have an idea?

Kernbench:
==========
2.5.59  : Elapsed = 86.29(1.24)
ingo B0 : Elapsed
      idle      2            5            10           20           50
busy		     
2       86.25(0.45)  86.62(1.56)  86.29(0.99)  85.99(0.60)  86.91(1.09)
5       86.87(1.12)  86.38(0.82)  86.00(0.69)  86.14(0.39)  86.47(0.68)
10      86.06(0.18)  86.23(0.38)  86.63(0.57)  86.82(0.95)  86.06(0.15)
20      86.64(1.24)  86.43(0.74)  86.15(0.99)  86.76(1.34)  86.70(0.68)

2.5.59  : UserTime = 623.24(0.46)
ingo B0 : UserTime
      idle      2             5             10            20            50
busy
2       629.05(0.32)  628.54(0.53)  628.51(0.32)  628.66(0.23)  628.72(0.20)
5       628.14(0.88)  628.10(0.76)  628.33(0.41)  628.45(0.48)  628.11(0.37)
10      627.97(0.30)  627.77(0.23)  627.75(0.21)  627.33(0.45)  627.63(0.52)
20      627.55(0.36)  627.67(0.58)  627.36(0.67)  627.84(0.28)  627.69(0.59)

2.5.59  : SysTime = 21.83(0.16)
ingo B0 : SysTime
      idle      2            5            10           20           50
busy
2       21.99(0.26)  21.89(0.12)  22.12(0.16)  22.06(0.21)  22.44(0.51)
5       22.07(0.21)  22.29(0.54)  22.15(0.08)  22.09(0.26)  21.90(0.18)
10      22.01(0.20)  22.42(0.42)  22.28(0.23)  22.04(0.37)  22.41(0.26)
20      22.03(0.20)  22.08(0.30)  22.31(0.27)  22.03(0.19)  22.35(0.33)


Hackbench  10
=============
2.5.59 : 0.77(0.03)
ingo B0:
      idle      2           5           10          20          50
busy
2       0.90(0.07)  0.88(0.05)  0.84(0.05)  0.82(0.04)  0.85(0.06)
5       0.87(0.05)  0.90(0.07)  0.88(0.08)  0.89(0.09)  0.86(0.07)
10      0.85(0.06)  0.83(0.05)  0.86(0.08)  0.84(0.06)  0.87(0.06)
20      0.85(0.05)  0.87(0.07)  0.83(0.05)  0.86(0.07)  0.87(0.05)

Hackbench  25
=============
2.5.59 : 1.96(0.05)
ingo B0:
      idle      2           5           10          20          50
busy
2       2.20(0.13)  2.21(0.12)  2.23(0.10)  2.20(0.10)  2.16(0.07)
5       2.13(0.12)  2.17(0.13)  2.18(0.08)  2.10(0.11)  2.16(0.10)
10      2.19(0.08)  2.21(0.12)  2.22(0.09)  2.11(0.10)  2.15(0.10)
20      2.11(0.17)  2.13(0.08)  2.18(0.06)  2.13(0.11)  2.13(0.14)

Hackbench  50
=============
2.5.59 : 3.78(0.10)
ingo B0:
      idle      2           5           10          20          50
busy
2       4.31(0.13)  4.30(0.29)  4.29(0.15)  4.23(0.20)  4.14(0.10)
5       4.35(0.16)  4.34(0.24)  4.24(0.24)  4.09(0.18)  4.12(0.14)
10      4.35(0.23)  4.21(0.14)  4.36(0.24)  4.18(0.12)  4.36(0.21)
20      4.34(0.14)  4.27(0.17)  4.18(0.18)  4.29(0.24)  4.08(0.09)

Hackbench  100
==============
2.5.59 : 7.85(0.37)
ingo B0:
      idle      2           5           10          20          50
busy
2       8.21(0.42)  8.07(0.25)  8.32(0.30)  8.06(0.26)  8.10(0.13)
5       8.13(0.25)  8.06(0.33)  8.14(0.49)  8.24(0.24)  8.04(0.20)
10      8.05(0.17)  8.16(0.13)  8.13(0.16)  8.05(0.24)  8.01(0.30)
20      8.21(0.25)  8.23(0.24)  8.36(0.41)  8.30(0.37)  8.10(0.30)


Regards,
Erich


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [Lse-tech] NUMA sched -> pooling scheduler (inc HT)
  2003-01-18 21:34                                                       ` [Lse-tech] " Martin J. Bligh
@ 2003-01-19  0:13                                                         ` Andrew Theurer
  0 siblings, 0 replies; 96+ messages in thread
From: Andrew Theurer @ 2003-01-19  0:13 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Christoph Hellwig, Robert Love, Michael Hohnbaum, linux-kernel,
	lse-tech, Erich Focht, Ingo Molnar


Thanks Martin, I'll take a look.  One thing I have noticed on B0 numa_sched:
I added node load to /proc/stat and see a problem.  Look at procs_running
compared to the total of the node0..node3 lines below.  They do not agree.
It looks as if the node load and runqueue lengths may have gotten out of
sync.  I could not find any obvious place where this happened.

cpu  65922 0 4321 650621 560
cpu0 5066 0 665 84545 4
cpu1 21965 0 1402 66796 0
cpu2 26187 0 1553 62290 132
cpu3 12535 0 550 77043 34
cpu4 121 0 116 89581 344
cpu5 38 0 16 90071 36
cpu6 6 0 12 90136 7
cpu7 2 0 4 90156 0
intr 909780 902661 16 0 1 1 0 5 1 0 1 1 1 60 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0
0 0 0 0 0 0 0 0 0 0 0 4274 2748 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 32127
btime 1042925625
processes 982
procs_running 1
procs_blocked 0
node0 1
node1 1
node2 0
node3 0




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-18  7:08                                                   ` William Lee Irwin III
  2003-01-18  8:12                                                     ` Martin J. Bligh
@ 2003-01-19  4:22                                                     ` William Lee Irwin III
  1 sibling, 0 replies; 96+ messages in thread
From: William Lee Irwin III @ 2003-01-19  4:22 UTC (permalink / raw)
  To: Ingo Molnar, akpm, davej, ak, Erich Focht, Martin J. Bligh,
	Christoph Hellwig, Robert Love, Michael Hohnbaum, Andrew Theurer,
	Linus Torvalds, linux-kernel, lse-tech

On Fri, Jan 17, 2003 at 11:08:08PM -0800, William Lee Irwin III wrote:
> The severity of the MTRR regression in 2.5.59 is apparent from:

This is not a result of userland initscripts botching the MTRR; not
only are printk's in MTRR-setting routines not visible but it's also
apparent from the fact that highmem mem_map initialization suffers a
similar degradation adding almost a full 20 minutes to boot times.


-- wli

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-18 23:09                                                         ` [patch] sched-2.5.59-A2 Erich Focht
@ 2003-01-20  9:28                                                           ` Ingo Molnar
  2003-01-20 12:07                                                             ` Erich Focht
  2003-01-20 16:23                                                             ` Martin J. Bligh
  0 siblings, 2 replies; 96+ messages in thread
From: Ingo Molnar @ 2003-01-20  9:28 UTC (permalink / raw)
  To: Erich Focht
  Cc: Michael Hohnbaum, Martin J. Bligh, Matthew Dobson,
	Christoph Hellwig, Robert Love, Andrew Theurer, Linus Torvalds,
	linux-kernel, lse-tech


On Sun, 19 Jan 2003, Erich Focht wrote:

> The results:
> - kernbench UserTime is best for the 2.5.59 scheduler (623s). IngoB0
>   best value 627.33s for idle=20ms, busy=2000ms.
> - hackbench: 2.5.59 scheduler is significantly better for all
>   measurements.
> 
> I suppose this comes from the fact that the 2.5.59 version has the
> chance to load_balance across nodes when a cpu goes idle. No idea what
> other reason it could be... Maybe somebody else has an idea?

this shows that aggressive idle-rebalancing is the most important factor. I
think this means that the unification of the balancing code should go in
the other direction: ie. applying the ->nr_balanced logic to the SMP
balancer as well.

kernelbench is the kind of benchmark that is most sensitive to over-eager
global balancing, and since the 2.5.59 ->nr_balanced logic produced the
best results, it clearly shows it's not over-eager. hackbench is one that
is quite sensitive to under-balancing. Ie. trying to maximize both will
lead us to a good balance.

	Ingo


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20  9:28                                                           ` Ingo Molnar
@ 2003-01-20 12:07                                                             ` Erich Focht
  2003-01-20 16:56                                                               ` Ingo Molnar
  2003-01-20 16:23                                                             ` Martin J. Bligh
  1 sibling, 1 reply; 96+ messages in thread
From: Erich Focht @ 2003-01-20 12:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Michael Hohnbaum, Martin J. Bligh, Matthew Dobson,
	Christoph Hellwig, Robert Love, Andrew Theurer, Linus Torvalds,
	linux-kernel, lse-tech

[-- Attachment #1: Type: text/plain, Size: 1892 bytes --]

On Monday 20 January 2003 10:28, Ingo Molnar wrote:
> On Sun, 19 Jan 2003, Erich Focht wrote:
> > The results:
> > - kernbench UserTime is best for the 2.5.59 scheduler (623s). IngoB0
> >   best value 627.33s for idle=20ms, busy=2000ms.
> > - hackbench: 2.5.59 scheduler is significantly better for all
> >   measurements.
> >
> > I suppose this comes from the fact that the 2.5.59 version has the
> > chance to load_balance across nodes when a cpu goes idle. No idea what
> > other reason it could be... Maybe somebody else has an idea?
>
> this shows that aggressive idle-rebalancing is the most important factor. I
> think this means that the unification of the balancing code should go in
> the other direction: ie. applying the ->nr_balanced logic to the SMP
> balancer as well.

Could you please explain your idea? As far as I understand, the SMP
balancer (pre-NUMA) tries a global rebalance at each call. Maybe you
mean something different...

> kernelbench is the kind of benchmark that is most sensitive to over-eager
> global balancing, and since the 2.5.59 ->nr_balanced logic produced the
> best results, it clearly shows it's not over-eager. hackbench is one that
> is quite sensitive to under-balancing. Ie. trying to maximize both will
> lead us to a good balance.

Yes! Actually the currently implemented nr_balanced logic is pretty
dumb: the counter reaches the cross-node balance threshold after a
certain number of calls to intra-node lb, no matter whether these were
successful or not. I'd like to try incrementing the counter only on
unsuccessful load balances; this would give a clear priority to
intra-node balancing and a clear and controllable delay for cross-node
balancing. A tiny patch for this (for 2.5.59) is attached. As the name
nr_balanced would be misleading for this kind of usage, I renamed it
to nr_lb_failed.

Regards,
Erich

[-- Attachment #2: failed-lb-ctr-2.5.59 --]
[-- Type: text/x-diff, Size: 1243 bytes --]

diff -urNp 2.5.59-topo/kernel/sched.c 2.5.59-topo-succ/kernel/sched.c
--- 2.5.59-topo/kernel/sched.c	2003-01-17 03:22:29.000000000 +0100
+++ 2.5.59-topo-succ/kernel/sched.c	2003-01-20 13:06:04.000000000 +0100
@@ -156,7 +156,7 @@ struct runqueue {
 	int prev_nr_running[NR_CPUS];
 #ifdef CONFIG_NUMA
 	atomic_t *node_nr_running;
-	unsigned int nr_balanced;
+	unsigned int nr_lb_failed;
 	int prev_node_load[MAX_NUMNODES];
 #endif
 	task_t *migration_thread;
@@ -770,11 +770,11 @@ static inline unsigned long cpus_to_bala
 	int this_node = __cpu_to_node(this_cpu);
 	/*
 	 * Avoid rebalancing between nodes too often.
-	 * We rebalance globally once every NODE_BALANCE_RATE load balances.
+	 * We rebalance globally after NODE_BALANCE_RATE failed load balances.
 	 */
-	if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
+	if (this_rq->nr_lb_failed == NODE_BALANCE_RATE) {
 		int node = find_busiest_node(this_node);
-		this_rq->nr_balanced = 0;
+		this_rq->nr_lb_failed = 0;
 		if (node >= 0)
 			return (__node_to_cpu_mask(node) | (1UL << this_cpu));
 	}
@@ -892,6 +892,10 @@ static inline runqueue_t *find_busiest_q
 		busiest = NULL;
 	}
 out:
+#ifdef CONFIG_NUMA
+	if (!busiest)
+		this_rq->nr_lb_failed++;
+#endif
 	return busiest;
 }
 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20  9:28                                                           ` Ingo Molnar
  2003-01-20 12:07                                                             ` Erich Focht
@ 2003-01-20 16:23                                                             ` Martin J. Bligh
  2003-01-20 16:59                                                               ` Ingo Molnar
  1 sibling, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-20 16:23 UTC (permalink / raw)
  To: Ingo Molnar, Erich Focht
  Cc: Michael Hohnbaum, Matthew Dobson, Christoph Hellwig, Robert Love,
	Andrew Theurer, Linus Torvalds, linux-kernel, lse-tech

> kernelbench is the kind of benchmark that is most sensitive to over-eager
> global balancing, and since the 2.5.59 ->nr_balanced logic produced the
> best results, it clearly shows it's not over-eager. 

Careful ... what shows well on one machine may not on another - this depends
heavily on the NUMA ratio - for our machine the nr_balanced logic in 59 is
still over-aggressive (20:1 NUMA ratio). For low-ratio machines it may work
fine. It actually works best for us when it's switched off altogether, I
think (which is obviously not a good solution).

But there's more than one dimension to tune here - we can tune both the
frequency and the level of imbalance required. I had good results requiring
a minimum imbalance of > 4 between the current and busiest nodes before
balancing (see the sketch below). Reason (2 nodes, 4 CPUs each): if I have
4 tasks on one node and 8 on another, that's still one or two per cpu
whatever I do (well, provided I'm not stupid enough to make anything idle),
so at that point I just want the least task thrashing.
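
A minimal sketch of that threshold check - node_nr_running[] is the per-node
counter from the NUMA scheduler patches, NODE_IMBALANCE_MIN is a made-up
name for the "> 4" cutoff:

/*
 * Only consider pulling from another node when the busiest node is
 * ahead of this node by more than a fixed number of runnable tasks;
 * smaller imbalances just cause pointless cross-node task thrashing.
 */
#define NODE_IMBALANCE_MIN	4

static inline int node_imbalance_significant(int this_node, int busiest_node)
{
	int diff = atomic_read(&node_nr_running[busiest_node]) -
		   atomic_read(&node_nr_running[this_node]);

	return diff > NODE_IMBALANCE_MIN;
}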

Moving tasks between nodes is really expensive, and shouldn't be done
lightly - the only thing the background busy rebalancer should be fixing
is significant long-term imbalances. It would be nice if we also chose
the task with the smallest RSS to migrate (a rough sketch follows); I
think that's a worthwhile optimisation (we'll need to make sure we use
realistic benchmarks with a mix of different task sizes). Working out
which ones have the smallest "on-node RSS - off-node RSS" is another
step after that ...

> hackbench is one that is quite sensitive to under-balancing. 
> Ie. trying to maximize both will lead us to a good balance.

I'll try to do some hackbench runs on NUMA-Q as well.

Just to add something else to the mix, there's another important factor
as well as the NUMA ratio - the size of the interconnect cache vs the
size of the task migrated. The interconnect cache on the NUMA-Q is 32Mb,
our newer machine has a much lower NUMA ratio, but effectively a much
smaller cache as well. NUMA ratios are often expressed in terms of
latency, but there's a bandwidth consideration too. Hyperthreading will
want something different again.

I think we definitely need to tune this on a per-arch basis. There's no
way that one-size-fits-all is going to fit a situation as complex as this
(though we can definitely learn from each other's analysis).

M.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20 12:07                                                             ` Erich Focht
@ 2003-01-20 16:56                                                               ` Ingo Molnar
  2003-01-20 17:04                                                                 ` Ingo Molnar
                                                                                   ` (2 more replies)
  0 siblings, 3 replies; 96+ messages in thread
From: Ingo Molnar @ 2003-01-20 16:56 UTC (permalink / raw)
  To: Erich Focht
  Cc: Michael Hohnbaum, Martin J. Bligh, Matthew Dobson,
	Christoph Hellwig, Robert Love, Andrew Theurer, Linus Torvalds,
	linux-kernel, lse-tech


On Mon, 20 Jan 2003, Erich Focht wrote:

> Could you please explain your idea? As far as I understand, the SMP
> balancer (pre-NUMA) tries a global rebalance at each call. Maybe you
> mean something different...

yes, but eg. in the idle-rebalance case we are more aggressive at moving
tasks across SMP CPUs. We could perhaps do a similar ->nr_balanced logic
to do this 'aggressive' balancing even if not triggered from the
CPU-will-be-idle path. Ie. _perhaps_ the SMP balancer could become a bit
more aggressive.

ie. SMP is just the first level in the cache-hierarchy, NUMA is the second
level. (let's hope we don't have to deal with a third caching level anytime
soon - although that could as well happen once SMT CPUs start doing NUMA.)  
There's no real reason to do balancing in a different way on each level -
the weight might be different, but the core logic should be synced up.
(one thing that is indeed different for the NUMA step is locality of
uncached memory.)

> > kernelbench is the kind of benchmark that is most sensitive to over-eager
> > global balancing, and since the 2.5.59 ->nr_balanced logic produced the
> > best results, it clearly shows it's not over-eager. hackbench is one that
> > is quite sensitive to under-balancing. Ie. trying to maximize both will
> > lead us to a good balance.
> 
> Yes! Actually the currently implemented nr_balanced logic is pretty
> dumb: the counter reaches the cross-node balance threshold after a
> certain number of calls to intra-node lb, no matter whether these were
> successful or not. I'd like to try incrementing the counter only on
> unsuccessful load balances; this would give a clear priority to
> intra-node balancing and a clear and controllable delay for cross-node
> balancing. A tiny patch for this (for 2.5.59) is attached. As the name
> nr_balanced would be misleading for this kind of usage, I renamed it to
> nr_lb_failed.

indeed this approach makes much more sense than the simple ->nr_balanced
counter. A similar approach makes sense on the SMP level as well: if the
current 'busy' rebalancer fails to get a new task, we can try the current
'idle' rebalancer. Ie. a CPU going idle would do the less intrusive
rebalancing first.
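
A minimal sketch of that escalation, assuming a load_balance() variant that
takes a cpumask and returns the number of tasks it pulled (the 2.5.59
function returns void, so this is not drop-in code):

/*
 * Hypothetical: do the cheap, node-local rebalance first; only when
 * nothing could be pulled, widen the cpumask to the whole machine.
 */
static void load_balance_escalate(runqueue_t *this_rq, int this_cpu, int idle)
{
	if (!load_balance(this_rq, idle, __cpu_to_node_mask(this_cpu)))
		load_balance(this_rq, idle, cpu_online_map);
}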

have you experimented with making the counter limit == 1 actually? Ie.  
immediately trying to do a global balancing once the less intrusive
balancing fails?

	Ingo




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20 16:23                                                             ` Martin J. Bligh
@ 2003-01-20 16:59                                                               ` Ingo Molnar
  0 siblings, 0 replies; 96+ messages in thread
From: Ingo Molnar @ 2003-01-20 16:59 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Erich Focht, Michael Hohnbaum, Matthew Dobson, Christoph Hellwig,
	Robert Love, Andrew Theurer, Linus Torvalds, linux-kernel,
	lse-tech


On Mon, 20 Jan 2003, Martin J. Bligh wrote:

> I think we definitely need to tune this on a per-arch basis. There's no
> way that one-size-fits-all is going to fit a situation as complex as
> this (though we can definitely learn from each other's analysis).

agreed - although the tunable should be constant (if possible, or
boot-established but not /proc exported), and there should be as few
tunables as possible. We already tune some of our scheduler behavior to
the SMP cachesize. (the cache_decay_ticks SMP logic.)

	Ingo


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20 16:56                                                               ` Ingo Molnar
@ 2003-01-20 17:04                                                                 ` Ingo Molnar
  2003-01-20 17:10                                                                   ` Martin J. Bligh
  2003-01-20 17:04                                                                 ` [patch] sched-2.5.59-A2 Martin J. Bligh
  2003-01-21 17:44                                                                 ` Erich Focht
  2 siblings, 1 reply; 96+ messages in thread
From: Ingo Molnar @ 2003-01-20 17:04 UTC (permalink / raw)
  To: Erich Focht
  Cc: Michael Hohnbaum, Martin J. Bligh, Matthew Dobson,
	Christoph Hellwig, Robert Love, Andrew Theurer, Linus Torvalds,
	linux-kernel, lse-tech


On Mon, 20 Jan 2003, Ingo Molnar wrote:

> ie. SMP is just the first level in the cache-hierarchy, NUMA is the
> second level. (let's hope we don't have to deal with a third caching level
> anytime soon - although that could as well happen once SMT CPUs start
> doing NUMA.)

or as Arjan points out, like the IBM x440 boxes ...

i think we want to handle SMT on a different level, ie. via the
shared-runqueue approach, so it's not a genuine new level of caching, it's
rather a new concept of multiple logical cores per physical core. (which
needs its own code in the scheduler.)

	Ingo


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20 16:56                                                               ` Ingo Molnar
  2003-01-20 17:04                                                                 ` Ingo Molnar
@ 2003-01-20 17:04                                                                 ` Martin J. Bligh
  2003-01-21 17:44                                                                 ` Erich Focht
  2 siblings, 0 replies; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-20 17:04 UTC (permalink / raw)
  To: Ingo Molnar, Erich Focht
  Cc: Michael Hohnbaum, Matthew Dobson, Christoph Hellwig, Robert Love,
	Andrew Theurer, Linus Torvalds, linux-kernel, lse-tech

> yes, but eg. in the idle-rebalance case we are more aggressive at moving
> tasks across SMP CPUs. We could perhaps do a similar ->nr_balanced logic
> to do this 'aggressive' balancing even if not triggered from the
> CPU-will-be-idle path. Ie. _perhaps_ the SMP balancer could become a bit
> more aggressive.

Do you think it's worth looking at the initial load-balance code for 
standard SMP?

> ie. SMP is just the first level in the cache-hierarchy, NUMA is the second
> level. (let's hope we don't have to deal with a third caching level anytime
> soon - although that could as well happen once SMT CPUs start doing NUMA.)  

We have those already (IBM x440) ;-) That's one of the reasons why I prefer
the pools concept I posted at the weekend over just "nodes". Also, there
are NUMA machines where nodes are not all equidistant ... that can be
thought of as multi-level too.

> There's no real reason to do balancing in a different way on each level -
> the weight might be different, but the core logic should be synced up.
> (one thing that is indeed different for the NUMA step is locality of
> uncached memory.)

Right, the current model should work fine, it just needs generalising out
a bit.
 
M


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20 17:04                                                                 ` Ingo Molnar
@ 2003-01-20 17:10                                                                   ` Martin J. Bligh
  2003-01-20 17:24                                                                     ` Ingo Molnar
  0 siblings, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-20 17:10 UTC (permalink / raw)
  To: Ingo Molnar, Erich Focht
  Cc: Michael Hohnbaum, Matthew Dobson, Christoph Hellwig, Robert Love,
	Andrew Theurer, Linus Torvalds, linux-kernel, lse-tech, anton

> or as Arjan points out, like the IBM x440 boxes ...

;-)
 
> i think we want to handle SMT on a different level, ie. via the
> shared-runqueue approach, so it's not a genuine new level of caching, it's
> rather a new concept of multiple logical cores per physical core. (which
> needs its own code in the scheduler.)

Do you have that code working already (presumably needs locking changes)? 
I seem to recall something like that existing already, but I don't recall 
if it was ever fully working or not ...

I think the large PPC64 boxes have multilevel NUMA as well - two real
phys cores on one die, sharing some cache (L2 but not L1? Anton?).
And SGI have multilevel nodes too I think ... so we'll still need 
multilevel NUMA at some point ... but maybe not right now.

M.



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20 17:10                                                                   ` Martin J. Bligh
@ 2003-01-20 17:24                                                                     ` Ingo Molnar
  2003-01-20 19:13                                                                       ` Andrew Theurer
  0 siblings, 1 reply; 96+ messages in thread
From: Ingo Molnar @ 2003-01-20 17:24 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Erich Focht, Michael Hohnbaum, Matthew Dobson, Christoph Hellwig,
	Robert Love, Andrew Theurer, Linus Torvalds, linux-kernel,
	lse-tech, Anton Blanchard


On Mon, 20 Jan 2003, Martin J. Bligh wrote:

> Do you have that code working already (presumably needs locking
> changes)?  I seem to recall something like that existing already, but I
> don't recall if it was ever fully working or not ...

yes, i have a HT testbox and working code:

   http://lwn.net/Articles/8553/

the patch is rather old, i'll update it to 2.5.59.

> I think the large PPC64 boxes have multilevel NUMA as well - two real
> phys cores on one die, sharing some cache (L2 but not L1? Anton?). And
> SGI have multilevel nodes too I think ... so we'll still need multilevel
> NUMA at some point ... but maybe not right now.

Intel's HT is the cleanest case: pure logical cores, which clearly need
special handling. Whether the other SMT solutions want to be handled via
the logical-cores code or via another level of NUMA-balancing code,
depends on benchmarking results, i suspect. It will be one more degree of
flexibility for system maintainers; it's all set up via the
sched_map_runqueue(cpu1, cpu2) boot-time call that 'merges' a CPU's
runqueue into another CPU's runqueue. It's basically the 0th level of
balancing, which will be fundamentally different. The other levels of
balancing are (or should be) similar to each other - only differing in
weight of balancing, not differing in the actual algorithm.

	Ingo


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20 17:24                                                                     ` Ingo Molnar
@ 2003-01-20 19:13                                                                       ` Andrew Theurer
  2003-01-20 19:33                                                                         ` Martin J. Bligh
  0 siblings, 1 reply; 96+ messages in thread
From: Andrew Theurer @ 2003-01-20 19:13 UTC (permalink / raw)
  To: Ingo Molnar, Martin J. Bligh
  Cc: Erich Focht, Michael Hohnbaum, Matthew Dobson, Christoph Hellwig,
	Robert Love, Linus Torvalds, linux-kernel, lse-tech,
	Anton Blanchard

[-- Attachment #1: Type: text/plain, Size: 1471 bytes --]

> > I think the large PPC64 boxes have multilevel NUMA as well - two real
> > phys cores on one die, sharing some cache (L2 but not L1? Anton?). And
> > SGI have multilevel nodes too I think ... so we'll still need multilevel
> > NUMA at some point ... but maybe not right now.
>
> Intel's HT is the cleanest case: pure logical cores, which clearly need
> special handling. Whether the other SMT solutions want to be handled via
> the logical-cores code or via another level of NUMA-balancing code,
> depends on benchmarking results, i suspect. It will be one more degree of
> flexibility for system maintainers; it's all set up via the
> sched_map_runqueue(cpu1, cpu2) boot-time call that 'merges' a CPU's
> runqueue into another CPU's runqueue. It's basically the 0th level of
> balancing, which will be fundamentally different. The other levels of
> balancing are (or should be) similar to each other - only differing in
> weight of balancing, not differing in the actual algorithm.

I have included a very rough patch to do ht-numa topology.  It requires
manually defining CONFIG_NUMA and CONFIG_NUMA_SCHED.  It also uses num_cpunodes
instead of numnodes and defines MAX_NUMNODES to 8 if CONFIG_NUMA is defined.

I had to remove the first check in sched_best_cpu() to get decent low-load
performance out of this.  I am still sorting through some things, but I
thought it would be best if I just post what I have now.

-Andrew Theurer


[-- Attachment #2: patch-htnuma-topology --]
[-- Type: text/x-diff, Size: 5429 bytes --]

diff -Naur linux-2.5.59-clean/arch/i386/kernel/smpboot.c linux-2.5.59-A2-HT-clean/arch/i386/kernel/smpboot.c
--- linux-2.5.59-clean/arch/i386/kernel/smpboot.c	2003-01-16 20:22:09.000000000 -0600
+++ linux-2.5.59-A2-HT-clean/arch/i386/kernel/smpboot.c	2003-01-17 15:36:38.000000000 -0600
@@ -61,6 +61,7 @@
 int smp_num_siblings = 1;
 int phys_proc_id[NR_CPUS]; /* Package ID of each logical CPU */
 
+int num_cpunodes = 0;
 /* Bitmask of currently online CPUs */
 unsigned long cpu_online_map;
 
@@ -510,6 +511,10 @@
 static inline void map_cpu_to_node(int cpu, int node)
 {
 	printk("Mapping cpu %d to node %d\n", cpu, node);
+	if (!node_2_cpu_mask[node]) {
+		num_cpunodes++;
+		printk("nodecount is now %i\n", num_cpunodes);
+	}
 	node_2_cpu_mask[node] |= (1 << cpu);
 	cpu_2_node[cpu] = node;
 }
@@ -522,7 +527,11 @@
 	printk("Unmapping cpu %d from all nodes\n", cpu);
 	for (node = 0; node < MAX_NR_NODES; node ++)
 		node_2_cpu_mask[node] &= ~(1 << cpu);
-	cpu_2_node[cpu] = -1;
+	node = cpu_2_node[cpu];
+	if (node_2_cpu_mask[node])
+		num_cpunodes--;
+	cpu_2_node[node] = -1;
+	
 }
 #else /* !CONFIG_NUMA */
 
@@ -540,6 +549,9 @@
 
 	cpu_2_logical_apicid[cpu] = apicid;
 	map_cpu_to_node(cpu, apicid_to_node(apicid));
+	printk("cpu[%i]\tapicid[%i]\tnode[%i]\n", cpu, apicid, 
+			apicid_to_node(apicid));
+	printk("MAXNUMNODES[%i]\n", MAX_NUMNODES);
 }
 
 void unmap_cpu_to_logical_apicid(int cpu)
diff -Naur linux-2.5.59-clean/arch/i386/mach-default/topology.c linux-2.5.59-A2-HT-clean/arch/i386/mach-default/topology.c
--- linux-2.5.59-clean/arch/i386/mach-default/topology.c	2003-01-16 20:22:40.000000000 -0600
+++ linux-2.5.59-A2-HT-clean/arch/i386/mach-default/topology.c	2003-01-17 15:10:41.000000000 -0600
@@ -44,11 +44,11 @@
 	int i;
 
 	for (i = 0; i < num_online_nodes(); i++)
-		arch_register_node(i);
+//		arch_register_node(i);
 	for (i = 0; i < NR_CPUS; i++)
 		if (cpu_possible(i)) arch_register_cpu(i);
 	for (i = 0; i < num_online_memblks(); i++)
-		arch_register_memblk(i);
+//		arch_register_memblk(i);
 	return 0;
 }
 
diff -Naur linux-2.5.59-clean/drivers/base/Makefile linux-2.5.59-A2-HT-clean/drivers/base/Makefile
--- linux-2.5.59-clean/drivers/base/Makefile	2003-01-16 20:22:30.000000000 -0600
+++ linux-2.5.59-A2-HT-clean/drivers/base/Makefile	2003-01-17 15:10:14.000000000 -0600
@@ -2,7 +2,7 @@
 
 obj-y		:= core.o sys.o interface.o power.o bus.o \
 			driver.o class.o intf.o platform.o \
-			cpu.o firmware.o
+			cpu.o firmware.o 
 
 obj-$(CONFIG_NUMA)	+= node.o  memblk.o
 
diff -Naur linux-2.5.59-clean/include/asm-i386/mach-default/mach_apic.h linux-2.5.59-A2-HT-clean/include/asm-i386/mach-default/mach_apic.h
--- linux-2.5.59-clean/include/asm-i386/mach-default/mach_apic.h	2003-01-16 20:22:59.000000000 -0600
+++ linux-2.5.59-A2-HT-clean/include/asm-i386/mach-default/mach_apic.h	2003-01-17 10:43:22.000000000 -0600
@@ -60,7 +60,15 @@
 
 static inline int apicid_to_node(int logical_apicid)
 {
-	return 0;
+	int node = 0;
+
+	logical_apicid >>= 2;
+
+	while(logical_apicid) {
+		logical_apicid >>= 2;
+		node++;
+	}
+	return node;
 }
 
 /* Mapping from cpu number to logical apicid */
diff -Naur linux-2.5.59-clean/include/asm-i386/numnodes.h linux-2.5.59-A2-HT-clean/include/asm-i386/numnodes.h
--- linux-2.5.59-clean/include/asm-i386/numnodes.h	2003-01-16 20:21:41.000000000 -0600
+++ linux-2.5.59-A2-HT-clean/include/asm-i386/numnodes.h	2003-01-17 10:07:38.000000000 -0600
@@ -6,7 +6,7 @@
 #ifdef CONFIG_X86_NUMAQ
 #include <asm/numaq.h>
 #else
-#define MAX_NUMNODES	1
+#define MAX_NUMNODES	8
 #endif /* CONFIG_X86_NUMAQ */
 
 #endif /* _ASM_MAX_NUMNODES_H */
diff -Naur linux-2.5.59-clean/include/linux/mmzone.h linux-2.5.59-A2-HT-clean/include/linux/mmzone.h
--- linux-2.5.59-clean/include/linux/mmzone.h	2003-01-17 09:22:22.000000000 -0600
+++ linux-2.5.59-A2-HT-clean/include/linux/mmzone.h	2003-01-17 15:31:12.000000000 -0600
@@ -11,12 +11,13 @@
 #include <linux/cache.h>
 #include <linux/threads.h>
 #include <asm/atomic.h>
-#ifdef CONFIG_DISCONTIGMEM
+#ifdef CONFIG_NUMA
 #include <asm/numnodes.h>
 #endif
 #ifndef MAX_NUMNODES
 #define MAX_NUMNODES 1
-#endif
+#endif 
+
 
 /* Free memory management - zoned buddy allocator.  */
 #ifndef CONFIG_FORCE_MAX_ZONEORDER
@@ -191,6 +192,7 @@
 } pg_data_t;
 
 extern int numnodes;
+extern int num_cpunodes;
 extern struct pglist_data *pgdat_list;
 
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
diff -Naur linux-2.5.59-clean/kernel/sched.c linux-2.5.59-A2-HT-clean/kernel/sched.c
--- linux-2.5.59-clean/kernel/sched.c	2003-01-17 09:22:22.000000000 -0600
+++ linux-2.5.59-A2-HT-clean/kernel/sched.c	2003-01-17 17:36:18.000000000 -0600
@@ -705,7 +705,7 @@
 		return best_cpu;
 
 	minload = 10000000;
-	for (i = 0; i < numnodes; i++) {
+	for (i = 0; i < num_cpunodes; i++) {
 		load = atomic_read(&node_nr_running[i]);
 		if (load < minload) {
 			minload = load;
@@ -730,7 +730,7 @@
 {
 	int new_cpu;
 
-	if (numnodes > 1) {
+	if (num_cpunodes > 1) {
 		new_cpu = sched_best_cpu(current);
 		if (new_cpu != smp_processor_id())
 			sched_migrate_task(current, new_cpu);
@@ -750,7 +750,7 @@
 	this_load = maxload = (this_rq()->prev_node_load[this_node] >> 1)
 		+ atomic_read(&node_nr_running[this_node]);
 	this_rq()->prev_node_load[this_node] = this_load;
-	for (i = 0; i < numnodes; i++) {
+	for (i = 0; i < num_cpunodes; i++) {
 		if (i == this_node)
 			continue;
 		load = (this_rq()->prev_node_load[i] >> 1)

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20 19:13                                                                       ` Andrew Theurer
@ 2003-01-20 19:33                                                                         ` Martin J. Bligh
  2003-01-20 19:52                                                                           ` Andrew Theurer
  0 siblings, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-20 19:33 UTC (permalink / raw)
  To: Andrew Theurer, Ingo Molnar
  Cc: Erich Focht, Michael Hohnbaum, Matthew Dobson, Christoph Hellwig,
	Robert Love, Linus Torvalds, linux-kernel, lse-tech,
	Anton Blanchard

>> > I think the large PPC64 boxes have multilevel NUMA as well - two real
>> > phys cores on one die, sharing some cache (L2 but not L1? Anton?). And
>> > SGI have multilevel nodes too I think ... so we'll still need multilevel
>> > NUMA at some point ... but maybe not right now.
>> 
>> Intel's HT is the cleanest case: pure logical cores, which clearly need
>> special handling. Whether the other SMT solutions want to be handled via
>> the logical-cores code or via another level of NUMA-balancing code,
>> depends on benchmarking results, i suspect. It will be one more degree of
>> flexibility for system maintainers; it's all set up via the
>> sched_map_runqueue(cpu1, cpu2) boot-time call that 'merges' a CPU's
>> runqueue into another CPU's runqueue. It's basically the 0th level of
>> balancing, which will be fundamentally different. The other levels of
>> balancing are (or should be) similar to each other - only differing in
>> weight of balancing, not differing in the actual algorithm.
> 
> I have included a very rough patch to do ht-numa topology.  It requires
> manually defining CONFIG_NUMA and CONFIG_NUMA_SCHED.  It also uses num_cpunodes
> instead of numnodes and defines MAX_NUMNODES to 8 if CONFIG_NUMA is defined.

Whilst it's fine for benchmarking, I think this kind of overlap is a 
very bad idea long-term - the confusion introduced is just asking for
trouble. And think what's going to happen when you mix HT and NUMA.
If we're going to use this for HT, it needs abstracting out.

M.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20 19:52                                                                           ` Andrew Theurer
@ 2003-01-20 19:52                                                                             ` Martin J. Bligh
  2003-01-20 21:18                                                                               ` [patch] HT scheduler, sched-2.5.59-D7 Ingo Molnar
  0 siblings, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-20 19:52 UTC (permalink / raw)
  To: Andrew Theurer, Ingo Molnar
  Cc: Erich Focht, Michael Hohnbaum, Matthew Dobson, Christoph Hellwig,
	Robert Love, Linus Torvalds, linux-kernel, lse-tech,
	Anton Blanchard

>> > I have included a very rough patch to do ht-numa topology.  It requires
>> > manually defining CONFIG_NUMA and CONFIG_NUMA_SCHED.  It also uses
>> > num_cpunodes instead of numnodes and defines MAX_NUMNODES to 8 if
>> > CONFIG_NUMA is defined.
>> 
>> Whilst it's fine for benchmarking, I think this kind of overlap is a
>> very bad idea long-term - the confusion introduced is just asking for
>> trouble. And think what's going to happen when you mix HT and NUMA.
>> If we're going to use this for HT, it needs abstracting out.
> 
> I have no issues with using HT specific bits instead of NUMA.  Design wise it 
> would be nice if it could all be happy together, but if not, then so be it.  

That's not what I meant - we can share the code, we just need to abstract
it out so you don't have to turn on CONFIG_NUMA. That was the point of
the pooling patch I posted at the weekend. Anyway, let's decide on the
best approach first, we can clean up the code for merging later.

M.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20 19:33                                                                         ` Martin J. Bligh
@ 2003-01-20 19:52                                                                           ` Andrew Theurer
  2003-01-20 19:52                                                                             ` Martin J. Bligh
  0 siblings, 1 reply; 96+ messages in thread
From: Andrew Theurer @ 2003-01-20 19:52 UTC (permalink / raw)
  To: Martin J. Bligh, Ingo Molnar
  Cc: Erich Focht, Michael Hohnbaum, Matthew Dobson, Christoph Hellwig,
	Robert Love, Linus Torvalds, linux-kernel, lse-tech,
	Anton Blanchard


> > I have included a very rough patch to do ht-numa topology.  It requires
> > manually defining CONFIG_NUMA and CONFIG_NUMA_SCHED.  It also uses
> > num_cpunodes instead of numnodes and defines MAX_NUMNODES to 8 if
> > CONFIG_NUMA is defined.
>
> Whilst it's fine for benchmarking, I think this kind of overlap is a
> very bad idea long-term - the confusion introduced is just asking for
> trouble. And think what's going to happen when you mix HT and NUMA.
> If we're going to use this for HT, it needs abstracting out.

I have no issues with using HT specific bits instead of NUMA.  Design wise it 
would be nice if it could all be happy together, but if not, then so be it.  

-Andrew Theurer

^ permalink raw reply	[flat|nested] 96+ messages in thread

* [patch] HT scheduler, sched-2.5.59-D7
  2003-01-20 19:52                                                                             ` Martin J. Bligh
@ 2003-01-20 21:18                                                                               ` Ingo Molnar
  2003-01-20 22:28                                                                                 ` Andrew Morton
                                                                                                   ` (2 more replies)
  0 siblings, 3 replies; 96+ messages in thread
From: Ingo Molnar @ 2003-01-20 21:18 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrew Theurer, Erich Focht, Michael Hohnbaum, Matthew Dobson,
	Christoph Hellwig, Robert Love, Linus Torvalds, linux-kernel,
	lse-tech, Anton Blanchard, Andrea Arcangeli


the attached patch (against 2.5.59) is my current scheduler tree; it
includes two main areas of change:

 - interactivity improvements, mostly reworked bits from Andrea's tree and 
   various tunings.

 - HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
   passive load balancing, active-balancing, HT-aware task pickup,
   HT-aware affinity and HT-aware wakeup.

the sched-2.5.59-D7 patch can also be downloaded from:

	http://redhat.com/~mingo/O(1)-scheduler/

the NUMA code compiles, but the boot code needs to be updated to make use
of sched_map_runqueue(). Usage is very simple; the following call:

	sched_map_runqueue(0, 1);

merges CPU#1's runqueue into CPU#0's runqueue, so both CPU#0 and CPU#1
maps to runqueue#0 - runqueue#1 is inactive from that point on. The NUMA
boot code needs to interpret the topology and merge CPUs that are on the
same physical core, into one runqueue. The scheduler code will take care
of all the rest.

there's a 'test_ht' boot option available on x86, that can be used to
trigger a shared runqueue for the first two CPUs, for those who have no
genuine HT boxes but want to give it a go.

(it complicates things that the interactivity changes to the scheduler are
included here as well, but this was my base tree and i didn't want to go
back.)

	Ingo

--- linux/arch/i386/kernel/cpu/proc.c.orig	2003-01-20 22:25:23.000000000 +0100
+++ linux/arch/i386/kernel/cpu/proc.c	2003-01-20 22:58:35.000000000 +0100
@@ -1,4 +1,5 @@
 #include <linux/smp.h>
+#include <linux/sched.h>
 #include <linux/timex.h>
 #include <linux/string.h>
 #include <asm/semaphore.h>
@@ -101,6 +102,13 @@
 		     fpu_exception ? "yes" : "no",
 		     c->cpuid_level,
 		     c->wp_works_ok ? "yes" : "no");
+#if CONFIG_SHARE_RUNQUEUE
+{
+	extern long __rq_idx[NR_CPUS];
+
+	seq_printf(m, "\nrunqueue\t: %d\n", __rq_idx[n]);
+}
+#endif
 
 	for ( i = 0 ; i < 32*NCAPINTS ; i++ )
 		if ( test_bit(i, c->x86_capability) &&
--- linux/arch/i386/kernel/smpboot.c.orig	2003-01-20 19:29:09.000000000 +0100
+++ linux/arch/i386/kernel/smpboot.c	2003-01-20 22:58:35.000000000 +0100
@@ -38,6 +38,7 @@
 #include <linux/kernel.h>
 
 #include <linux/mm.h>
+#include <linux/sched.h>
 #include <linux/kernel_stat.h>
 #include <linux/smp_lock.h>
 #include <linux/irq.h>
@@ -945,6 +946,16 @@
 
 int cpu_sibling_map[NR_CPUS] __cacheline_aligned;
 
+static int test_ht;
+
+static int __init ht_setup(char *str)
+{
+	test_ht = 1;
+	return 1;
+}
+
+__setup("test_ht", ht_setup);
+
 static void __init smp_boot_cpus(unsigned int max_cpus)
 {
 	int apicid, cpu, bit;
@@ -1087,16 +1098,31 @@
 	Dprintk("Boot done.\n");
 
 	/*
-	 * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so
+	 * Here we can be sure that there is an IO-APIC in the system. Let's
+	 * go and set it up:
+	 */
+	smpboot_setup_io_apic();
+
+	setup_boot_APIC_clock();
+
+	/*
+	 * Synchronize the TSC with the AP
+	 */
+	if (cpu_has_tsc && cpucount)
+		synchronize_tsc_bp();
+	/*
+	 * If Hyper-Threading is available, construct cpu_sibling_map[], so
 	 * that we can tell the sibling CPU efficiently.
 	 */
+printk("cpu_has_ht: %d, smp_num_siblings: %d, num_online_cpus(): %d.\n", cpu_has_ht, smp_num_siblings, num_online_cpus());
 	if (cpu_has_ht && smp_num_siblings > 1) {
 		for (cpu = 0; cpu < NR_CPUS; cpu++)
 			cpu_sibling_map[cpu] = NO_PROC_ID;
 		
 		for (cpu = 0; cpu < NR_CPUS; cpu++) {
 			int 	i;
-			if (!test_bit(cpu, &cpu_callout_map)) continue;
+			if (!test_bit(cpu, &cpu_callout_map))
+				continue;
 
 			for (i = 0; i < NR_CPUS; i++) {
 				if (i == cpu || !test_bit(i, &cpu_callout_map))
@@ -1112,17 +1138,41 @@
 				printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu);
 			}
 		}
-	}
-
-	smpboot_setup_io_apic();
-
-	setup_boot_APIC_clock();
+#if CONFIG_SHARE_RUNQUEUE
+		/*
+		 * At this point APs would have synchronised TSC and
+		 * waiting for smp_commenced, with their APIC timer
+		 * disabled. So BP can go ahead do some initialization
+		 * for Hyper-Threading (if needed).
+		 */
+		for (cpu = 0; cpu < NR_CPUS; cpu++) {
+			int i;
+			if (!test_bit(cpu, &cpu_callout_map))
+				continue;
+			for (i = 0; i < NR_CPUS; i++) {
+				if (i <= cpu)
+					continue;
+				if (!test_bit(i, &cpu_callout_map))
+					continue;
 
-	/*
-	 * Synchronize the TSC with the AP
-	 */
-	if (cpu_has_tsc && cpucount)
-		synchronize_tsc_bp();
+				if (phys_proc_id[cpu] != phys_proc_id[i])
+					continue;
+				/*
+				 * merge runqueues, resulting in one
+				 * runqueue per package:
+				 */
+				sched_map_runqueue(cpu, i);
+				break;
+			}
+		}
+#endif
+	}
+#if CONFIG_SHARE_RUNQUEUE
+	if (smp_num_siblings == 1 && test_ht) {
+		printk("Simulating a 2-sibling 1-phys-CPU HT setup!\n");
+		sched_map_runqueue(0, 1);
+	}
+#endif
 }
 
 /* These are wrappers to interface to the new boot process.  Someone
--- linux/arch/i386/Kconfig.orig	2003-01-20 20:19:23.000000000 +0100
+++ linux/arch/i386/Kconfig	2003-01-20 22:58:35.000000000 +0100
@@ -408,6 +408,24 @@
 
 	  If you don't know what to do here, say N.
 
+choice
+
+	prompt "Hyperthreading Support"
+	depends on SMP
+	default NR_SIBLINGS_0
+
+config NR_SIBLINGS_0
+	bool "off"
+
+config NR_SIBLINGS_2
+	bool "2 siblings"
+
+config NR_SIBLINGS_4
+	bool "4 siblings"
+
+endchoice
+
+
 config PREEMPT
 	bool "Preemptible Kernel"
 	help
--- linux/fs/pipe.c.orig	2003-01-20 19:28:43.000000000 +0100
+++ linux/fs/pipe.c	2003-01-20 22:58:35.000000000 +0100
@@ -117,7 +117,7 @@
 	up(PIPE_SEM(*inode));
 	/* Signal writers asynchronously that there is more room.  */
 	if (do_wakeup) {
-		wake_up_interruptible(PIPE_WAIT(*inode));
+		wake_up_interruptible_sync(PIPE_WAIT(*inode));
 		kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
 	}
 	if (ret > 0)
@@ -205,7 +205,7 @@
 	}
 	up(PIPE_SEM(*inode));
 	if (do_wakeup) {
-		wake_up_interruptible(PIPE_WAIT(*inode));
+		wake_up_interruptible_sync(PIPE_WAIT(*inode));
 		kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
 	}
 	if (ret > 0) {
@@ -275,7 +275,7 @@
 		free_page((unsigned long) info->base);
 		kfree(info);
 	} else {
-		wake_up_interruptible(PIPE_WAIT(*inode));
+		wake_up_interruptible_sync(PIPE_WAIT(*inode));
 		kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
 		kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
 	}
--- linux/include/linux/sched.h.orig	2003-01-20 19:29:09.000000000 +0100
+++ linux/include/linux/sched.h	2003-01-20 22:58:35.000000000 +0100
@@ -147,6 +147,24 @@
 extern void sched_init(void);
 extern void init_idle(task_t *idle, int cpu);
 
+/*
+ * Is there a way to do this via Kconfig?
+ */
+#define CONFIG_NR_SIBLINGS 0
+
+#if CONFIG_NR_SIBLINGS_2
+# define CONFIG_NR_SIBLINGS 2
+#elif CONFIG_NR_SIBLINGS_4
+# define CONFIG_NR_SIBLINGS 4
+#endif
+
+#if CONFIG_NR_SIBLINGS
+# define CONFIG_SHARE_RUNQUEUE 1
+#else
+# define CONFIG_SHARE_RUNQUEUE 0
+#endif
+extern void sched_map_runqueue(int cpu1, int cpu2);
+
 extern void show_state(void);
 extern void show_trace(unsigned long *stack);
 extern void show_stack(unsigned long *stack);
@@ -293,7 +311,7 @@
 	prio_array_t *array;
 
 	unsigned long sleep_avg;
-	unsigned long sleep_timestamp;
+	unsigned long last_run;
 
 	unsigned long policy;
 	unsigned long cpus_allowed;
@@ -605,6 +623,8 @@
 #define remove_parent(p)	list_del_init(&(p)->sibling)
 #define add_parent(p, parent)	list_add_tail(&(p)->sibling,&(parent)->children)
 
+#if 1
+
 #define REMOVE_LINKS(p) do {					\
 	if (thread_group_leader(p))				\
 		list_del_init(&(p)->tasks);			\
@@ -633,6 +653,31 @@
 #define while_each_thread(g, t) \
 	while ((t = next_thread(t)) != g)
 
+#else
+
+#define REMOVE_LINKS(p) do {					\
+	list_del_init(&(p)->tasks);				\
+	remove_parent(p);					\
+	} while (0)
+
+#define SET_LINKS(p) do {					\
+	list_add_tail(&(p)->tasks,&init_task.tasks);		\
+	add_parent(p, (p)->parent);				\
+	} while (0)
+
+#define next_task(p)	list_entry((p)->tasks.next, struct task_struct, tasks)
+#define prev_task(p)	list_entry((p)->tasks.prev, struct task_struct, tasks)
+
+#define for_each_process(p) \
+	for (p = &init_task ; (p = next_task(p)) != &init_task ; )
+
+#define do_each_thread(g, t) \
+	for (t = &init_task ; (t = next_task(t)) != &init_task ; )
+
+#define while_each_thread(g, t)
+
+#endif
+
 extern task_t * FASTCALL(next_thread(task_t *p));
 
 #define thread_group_leader(p)	(p->pid == p->tgid)
--- linux/include/asm-i386/apic.h.orig	2003-01-20 19:28:31.000000000 +0100
+++ linux/include/asm-i386/apic.h	2003-01-20 22:58:35.000000000 +0100
@@ -98,4 +98,6 @@
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
+extern int phys_proc_id[NR_CPUS];
+
 #endif /* __ASM_APIC_H */
--- linux/kernel/fork.c.orig	2003-01-20 19:29:09.000000000 +0100
+++ linux/kernel/fork.c	2003-01-20 22:58:35.000000000 +0100
@@ -876,7 +876,7 @@
 	 */
 	p->first_time_slice = 1;
 	current->time_slice >>= 1;
-	p->sleep_timestamp = jiffies;
+	p->last_run = jiffies;
 	if (!current->time_slice) {
 		/*
 	 	 * This case is rare, it happens when the parent has only
--- linux/kernel/sys.c.orig	2003-01-20 19:28:52.000000000 +0100
+++ linux/kernel/sys.c	2003-01-20 22:58:36.000000000 +0100
@@ -220,7 +220,7 @@
 
 	if (error == -ESRCH)
 		error = 0;
-	if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
+	if (0 && niceval < task_nice(p) && !capable(CAP_SYS_NICE))
 		error = -EACCES;
 	else
 		set_user_nice(p, niceval);
--- linux/kernel/sched.c.orig	2003-01-20 19:29:09.000000000 +0100
+++ linux/kernel/sched.c	2003-01-20 22:58:36.000000000 +0100
@@ -54,20 +54,21 @@
 /*
  * These are the 'tuning knobs' of the scheduler:
  *
- * Minimum timeslice is 10 msecs, default timeslice is 150 msecs,
- * maximum timeslice is 300 msecs. Timeslices get refilled after
+ * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
+ * maximum timeslice is 200 msecs. Timeslices get refilled after
  * they expire.
  */
 #define MIN_TIMESLICE		( 10 * HZ / 1000)
-#define MAX_TIMESLICE		(300 * HZ / 1000)
-#define CHILD_PENALTY		95
+#define MAX_TIMESLICE		(200 * HZ / 1000)
+#define CHILD_PENALTY		50
 #define PARENT_PENALTY		100
 #define EXIT_WEIGHT		3
 #define PRIO_BONUS_RATIO	25
 #define INTERACTIVE_DELTA	2
-#define MAX_SLEEP_AVG		(2*HZ)
-#define STARVATION_LIMIT	(2*HZ)
-#define NODE_THRESHOLD          125
+#define MAX_SLEEP_AVG		(10*HZ)
+#define STARVATION_LIMIT	(30*HZ)
+#define SYNC_WAKEUPS		1
+#define SMART_WAKE_CHILD	1
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -141,6 +142,48 @@
 };
 
 /*
+ * It's possible for two CPUs to share the same runqueue.
+ * This makes sense if they eg. share caches.
+ *
+ * We take the common 1:1 (SMP, UP) case and optimize it,
+ * the rest goes via remapping: rq_idx(cpu) gives the
+ * runqueue on which a particular cpu is on, cpu_idx(cpu)
+ * gives the rq-specific index of the cpu.
+ *
+ * (Note that the generic scheduler code does not impose any
+ *  restrictions on the mappings - there can be 4 CPUs per
+ *  runqueue or even assymetric mappings.)
+ */
+#if CONFIG_SHARE_RUNQUEUE
+# define MAX_NR_SIBLINGS CONFIG_NR_SIBLINGS
+  long __rq_idx[NR_CPUS] __cacheline_aligned;
+  static long __cpu_idx[NR_CPUS] __cacheline_aligned;
+# define rq_idx(cpu) (__rq_idx[(cpu)])
+# define cpu_idx(cpu) (__cpu_idx[(cpu)])
+# define for_each_sibling(idx, rq) \
+		for ((idx) = 0; (idx) < (rq)->nr_cpus; (idx)++)
+# define rq_nr_cpus(rq) ((rq)->nr_cpus)
+# define cpu_active_balance(c) (cpu_rq(c)->cpu[0].active_balance)
+#else
+# define MAX_NR_SIBLINGS 1
+# define rq_idx(cpu) (cpu)
+# define cpu_idx(cpu) 0
+# define for_each_sibling(idx, rq) while (0)
+# define cpu_active_balance(c) 0
+# define do_active_balance(rq, cpu) do { } while (0)
+# define rq_nr_cpus(rq) 1
+  static inline void active_load_balance(runqueue_t *rq, int this_cpu) { }
+#endif
+
+typedef struct cpu_s {
+	task_t *curr, *idle;
+	task_t *migration_thread;
+	struct list_head migration_queue;
+	int active_balance;
+	int cpu;
+} cpu_t;
+
+/*
  * This is the main, per-CPU runqueue data structure.
  *
  * Locking rule: those places that want to lock multiple runqueues
@@ -151,7 +194,7 @@
 	spinlock_t lock;
 	unsigned long nr_running, nr_switches, expired_timestamp,
 			nr_uninterruptible;
-	task_t *curr, *idle;
+	struct mm_struct *prev_mm;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
 #ifdef CONFIG_NUMA
@@ -159,27 +202,39 @@
 	unsigned int nr_balanced;
 	int prev_node_load[MAX_NUMNODES];
 #endif
-	task_t *migration_thread;
-	struct list_head migration_queue;
+	int nr_cpus;
+	cpu_t cpu[MAX_NR_SIBLINGS];
 
 	atomic_t nr_iowait;
 } ____cacheline_aligned;
 
 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
 
-#define cpu_rq(cpu)		(runqueues + (cpu))
+#define cpu_rq(cpu)		(runqueues + (rq_idx(cpu)))
+#define cpu_int(c)		((cpu_rq(c))->cpu + cpu_idx(c))
+#define cpu_curr_ptr(cpu)	(cpu_int(cpu)->curr)
+#define cpu_idle_ptr(cpu)	(cpu_int(cpu)->idle)
+
 #define this_rq()		cpu_rq(smp_processor_id())
 #define task_rq(p)		cpu_rq(task_cpu(p))
-#define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define rt_task(p)		((p)->prio < MAX_RT_PRIO)
 
+#define migration_thread(cpu)	(cpu_int(cpu)->migration_thread)
+#define migration_queue(cpu)	(&cpu_int(cpu)->migration_queue)
+
+#if NR_CPUS > 1
+# define task_allowed(p, cpu)	((p)->cpus_allowed & (1UL << (cpu)))
+#else
+# define task_allowed(p, cpu)	1
+#endif
+
 /*
  * Default context-switch locking:
  */
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(rq, next)	do { } while(0)
 # define finish_arch_switch(rq, next)	spin_unlock_irq(&(rq)->lock)
-# define task_running(rq, p)		((rq)->curr == (p))
+# define task_running(p)		(cpu_curr_ptr(task_cpu(p)) == (p))
 #endif
 
 #ifdef CONFIG_NUMA
@@ -322,16 +377,21 @@
  * Also update all the scheduling statistics stuff. (sleep average
  * calculation, priority modifiers, etc.)
  */
+static inline void __activate_task(task_t *p, runqueue_t *rq)
+{
+	enqueue_task(p, rq->active);
+	nr_running_inc(rq);
+}
+
 static inline void activate_task(task_t *p, runqueue_t *rq)
 {
-	unsigned long sleep_time = jiffies - p->sleep_timestamp;
-	prio_array_t *array = rq->active;
+	unsigned long sleep_time = jiffies - p->last_run;
 
 	if (!rt_task(p) && sleep_time) {
 		/*
 		 * This code gives a bonus to interactive tasks. We update
 		 * an 'average sleep time' value here, based on
-		 * sleep_timestamp. The more time a task spends sleeping,
+		 * ->last_run. The more time a task spends sleeping,
 		 * the higher the average gets - and the higher the priority
 		 * boost gets as well.
 		 */
@@ -340,8 +400,7 @@
 			p->sleep_avg = MAX_SLEEP_AVG;
 		p->prio = effective_prio(p);
 	}
-	enqueue_task(p, array);
-	nr_running_inc(rq);
+	__activate_task(p, rq);
 }
 
 /*
@@ -382,6 +441,11 @@
 #endif
 }
 
+static inline void resched_cpu(int cpu)
+{
+	resched_task(cpu_curr_ptr(cpu));
+}
+
 #ifdef CONFIG_SMP
 
 /*
@@ -398,7 +462,7 @@
 repeat:
 	preempt_disable();
 	rq = task_rq(p);
-	if (unlikely(task_running(rq, p))) {
+	if (unlikely(task_running(p))) {
 		cpu_relax();
 		/*
 		 * enable/disable preemption just to make this
@@ -409,7 +473,7 @@
 		goto repeat;
 	}
 	rq = task_rq_lock(p, &flags);
-	if (unlikely(task_running(rq, p))) {
+	if (unlikely(task_running(p))) {
 		task_rq_unlock(rq, &flags);
 		preempt_enable();
 		goto repeat;
@@ -431,10 +495,39 @@
  */
 void kick_if_running(task_t * p)
 {
-	if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id()))
+	if ((task_running(p)) && (task_cpu(p) != smp_processor_id()))
 		resched_task(p);
 }
 
+static void wake_up_cpu(runqueue_t *rq, int cpu, task_t *p)
+{
+	cpu_t *curr_cpu;
+	task_t *curr;
+	int idx;
+
+	if (idle_cpu(cpu))
+		return resched_cpu(cpu);
+
+	for_each_sibling(idx, rq) {
+		curr_cpu = rq->cpu + idx;
+		if (!task_allowed(p, curr_cpu->cpu))
+			continue;
+		if (curr_cpu->idle == curr_cpu->curr)
+			return resched_cpu(curr_cpu->cpu);
+	}
+
+	if (p->prio < cpu_curr_ptr(cpu)->prio)
+		return resched_task(cpu_curr_ptr(cpu));
+
+	for_each_sibling(idx, rq) {
+		curr_cpu = rq->cpu + idx;
+		if (!task_allowed(p, curr_cpu->cpu))
+			continue;
+		curr = curr_cpu->curr;
+		if (p->prio < curr->prio)
+			return resched_task(curr);
+	}
+}
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
@@ -455,6 +548,7 @@
 	long old_state;
 	runqueue_t *rq;
 
+	sync &= SYNC_WAKEUPS;
 repeat_lock_task:
 	rq = task_rq_lock(p, &flags);
 	old_state = p->state;
@@ -463,7 +557,7 @@
 		 * Fast-migrate the task if it's not running or runnable
 		 * currently. Do not violate hard affinity.
 		 */
-		if (unlikely(sync && !task_running(rq, p) &&
+		if (unlikely(sync && !task_running(p) &&
 			(task_cpu(p) != smp_processor_id()) &&
 			(p->cpus_allowed & (1UL << smp_processor_id())))) {
 
@@ -473,10 +567,12 @@
 		}
 		if (old_state == TASK_UNINTERRUPTIBLE)
 			rq->nr_uninterruptible--;
-		activate_task(p, rq);
-
-		if (p->prio < rq->curr->prio)
-			resched_task(rq->curr);
+		if (sync)
+			__activate_task(p, rq);
+		else {
+			activate_task(p, rq);
+			wake_up_cpu(rq, task_cpu(p), p);
+		}
 		success = 1;
 	}
 	p->state = TASK_RUNNING;
@@ -512,8 +608,19 @@
 		p->prio = effective_prio(p);
 	}
 	set_task_cpu(p, smp_processor_id());
-	activate_task(p, rq);
 
+	if (SMART_WAKE_CHILD) {
+		if (unlikely(!current->array))
+			__activate_task(p, rq);
+		else {
+			p->prio = current->prio;
+			list_add_tail(&p->run_list, &current->run_list);
+			p->array = current->array;
+			p->array->nr_active++;
+			nr_running_inc(rq);
+		}
+	} else
+		activate_task(p, rq);
 	rq_unlock(rq);
 }
 
@@ -561,7 +668,7 @@
  * context_switch - switch to the new MM and the new
  * thread's register state.
  */
-static inline task_t * context_switch(task_t *prev, task_t *next)
+static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next)
 {
 	struct mm_struct *mm = next->mm;
 	struct mm_struct *oldmm = prev->active_mm;
@@ -575,7 +682,7 @@
 
 	if (unlikely(!prev->mm)) {
 		prev->active_mm = NULL;
-		mmdrop(oldmm);
+		rq->prev_mm = oldmm;
 	}
 
 	/* Here we just switch the register state and the stack. */
@@ -596,8 +703,9 @@
 	unsigned long i, sum = 0;
 
 	for (i = 0; i < NR_CPUS; i++)
-		sum += cpu_rq(i)->nr_running;
-
+		/* Shared runqueues are counted only once. */
+		if (!cpu_idx(i))
+			sum += cpu_rq(i)->nr_running;
 	return sum;
 }
 
@@ -608,7 +716,9 @@
 	for (i = 0; i < NR_CPUS; i++) {
 		if (!cpu_online(i))
 			continue;
-		sum += cpu_rq(i)->nr_uninterruptible;
+		/* Shared runqueues are counted only once. */
+		if (!cpu_idx(i))
+			sum += cpu_rq(i)->nr_uninterruptible;
 	}
 	return sum;
 }
@@ -790,7 +900,23 @@
 
 #endif /* CONFIG_NUMA */
 
-#if CONFIG_SMP
+/*
+ * One of the idle_cpu_tick() and busy_cpu_tick() functions will
+ * get called every timer tick, on every CPU. Our balancing action
+ * frequency and balancing agressivity depends on whether the CPU is
+ * idle or not.
+ *
+ * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * systems with HZ=100, every 10 msecs.)
+ */
+#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+
+#if !CONFIG_SMP
+
+static inline void load_balance(runqueue_t *rq, int this_cpu, int idle) { }
+
+#else
 
 /*
  * double_lock_balance - lock the busiest runqueue
@@ -906,12 +1032,7 @@
 	set_task_cpu(p, this_cpu);
 	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
-	/*
-	 * Note that idle threads have a prio of MAX_PRIO, for this test
-	 * to be always true for them.
-	 */
-	if (p->prio < this_rq->curr->prio)
-		set_need_resched();
+	wake_up_cpu(this_rq, this_cpu, p);
 }
 
 /*
@@ -922,9 +1043,9 @@
  * We call this with the current runqueue locked,
  * irqs disabled.
  */
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int this_cpu, int idle)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
+	int imbalance, idx;
 	runqueue_t *busiest;
 	prio_array_t *array;
 	struct list_head *head, *curr;
@@ -972,12 +1093,14 @@
 	 * 1) running (obviously), or
 	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
 	 * 3) are cache-hot on their current CPU.
+	 *
+	 * (except if we are in idle mode which is a more agressive
+	 *  form of rebalancing.)
 	 */
 
-#define CAN_MIGRATE_TASK(p,rq,this_cpu)					\
-	((jiffies - (p)->sleep_timestamp > cache_decay_ticks) &&	\
-		!task_running(rq, p) &&					\
-			((p)->cpus_allowed & (1UL << (this_cpu))))
+#define CAN_MIGRATE_TASK(p,rq,cpu)					\
+	((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
+		!task_running(p) && task_allowed(p, cpu))
 
 	curr = curr->prev;
 
@@ -1000,31 +1123,136 @@
 	;
 }
 
+#if CONFIG_SHARE_RUNQUEUE
+static void active_load_balance(runqueue_t *this_rq, int this_cpu)
+{
+	runqueue_t *rq;
+	int i, idx;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+		rq = cpu_rq(i);
+		if (rq == this_rq)
+			continue;
+		/*
+		 * Any SMT-specific imbalance?
+		 */
+		for_each_sibling(idx, rq)
+			if (rq->cpu[idx].idle == rq->cpu[idx].curr)
+				goto next_cpu;
+
+		/*
+		 * At this point it's sure that we have a SMT
+		 * imbalance: this (physical) CPU is idle but
+		 * another CPU has two tasks running.
+		 *
+		 * We wake up one of the migration threads (it
+		 * doesnt matter which one) and let it fix things up:
+		 */
+		if (!cpu_active_balance(this_cpu)) {
+			cpu_active_balance(this_cpu) = 1;
+			spin_unlock(&this_rq->lock);
+			wake_up_process(rq->cpu[0].migration_thread);
+			spin_lock(&this_rq->lock);
+		}
+next_cpu:
+	}
+}
+
+static void do_active_balance(runqueue_t *this_rq, int this_cpu)
+{
+	runqueue_t *rq;
+	int i, idx;
+
+	spin_unlock(&this_rq->lock);
+
+	cpu_active_balance(this_cpu) = 0;
+
+	/*
+	 * Is the imbalance still present?
+	 */
+	for_each_sibling(idx, this_rq)
+		if (this_rq->cpu[idx].idle == this_rq->cpu[idx].curr)
+			goto out;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+		rq = cpu_rq(i);
+		if (rq == this_rq)
+			continue;
+
+		/* completely idle CPU? */
+		if (rq->nr_running)
+			continue;
+
+		/*
+		 * At this point it's reasonably sure that we have an
+		 * imbalance. Since we are the migration thread, try to
+	 	 * balance a thread over to the target queue.
+		 */
+		spin_lock(&rq->lock);
+		load_balance(rq, i, 1);
+		spin_unlock(&rq->lock);
+		goto out;
+	}
+out:
+	spin_lock(&this_rq->lock);
+}
+
 /*
- * One of the idle_cpu_tick() and busy_cpu_tick() functions will
- * get called every timer tick, on every CPU. Our balancing action
- * frequency and balancing agressivity depends on whether the CPU is
- * idle or not.
+ * This routine is called to map a CPU into another CPU's runqueue.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
- * systems with HZ=100, every 10 msecs.)
+ * This must be called during bootup with the merged runqueue having
+ * no tasks.
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
-#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+void sched_map_runqueue(int cpu1, int cpu2)
+{
+	runqueue_t *rq1 = cpu_rq(cpu1);
+	runqueue_t *rq2 = cpu_rq(cpu2);
+	int cpu2_idx_orig = cpu_idx(cpu2), cpu2_idx;
+
+	printk("sched_merge_runqueues: CPU#%d <=> CPU#%d, on CPU#%d.\n", cpu1, cpu2, smp_processor_id());
+	if (rq1 == rq2)
+		BUG();
+	if (rq2->nr_running)
+		BUG();
+	/*
+	 * At this point, we dont have anything in the runqueue yet. So,
+	 * there is no need to move processes between the runqueues.
+	 * Only, the idle processes should be combined and accessed
+	 * properly.
+	 */
+	cpu2_idx = rq1->nr_cpus++;
 
-static inline void idle_tick(runqueue_t *rq)
+	if (rq_idx(cpu1) != cpu1)
+		BUG();
+	rq_idx(cpu2) = cpu1;
+	cpu_idx(cpu2) = cpu2_idx;
+	rq1->cpu[cpu2_idx].cpu = cpu2;
+	rq1->cpu[cpu2_idx].idle = rq2->cpu[cpu2_idx_orig].idle;
+	rq1->cpu[cpu2_idx].curr = rq2->cpu[cpu2_idx_orig].curr;
+	INIT_LIST_HEAD(&rq1->cpu[cpu2_idx].migration_queue);
+
+	/* just to be safe: */
+	rq2->cpu[cpu2_idx_orig].idle = NULL;
+	rq2->cpu[cpu2_idx_orig].curr = NULL;
+}
+#endif
+#endif
+
+DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
+
+static inline void idle_tick(runqueue_t *rq, unsigned long j)
 {
-	if (jiffies % IDLE_REBALANCE_TICK)
+	if (j % IDLE_REBALANCE_TICK)
 		return;
 	spin_lock(&rq->lock);
-	load_balance(rq, 1);
+	load_balance(rq, smp_processor_id(), 1);
 	spin_unlock(&rq->lock);
 }
 
-#endif
-
-DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
-
 /*
  * We place interactive tasks back into the active array, if possible.
  *
@@ -1035,9 +1263,9 @@
  * increasing number of running tasks:
  */
 #define EXPIRED_STARVING(rq) \
-		((rq)->expired_timestamp && \
+		(STARVATION_LIMIT && ((rq)->expired_timestamp && \
 		(jiffies - (rq)->expired_timestamp >= \
-			STARVATION_LIMIT * ((rq)->nr_running) + 1))
+			STARVATION_LIMIT * ((rq)->nr_running) + 1)))
 
 /*
  * This function gets called by the timer code, with HZ frequency.
@@ -1050,12 +1278,13 @@
 {
 	int cpu = smp_processor_id();
 	runqueue_t *rq = this_rq();
+	unsigned long j = jiffies;
 	task_t *p = current;
 
  	if (rcu_pending(cpu))
  		rcu_check_callbacks(cpu, user_ticks);
 
-	if (p == rq->idle) {
+	if (p == cpu_idle_ptr(cpu)) {
 		/* note: this timer irq context must be accounted for as well */
 		if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
 			kstat_cpu(cpu).cpustat.system += sys_ticks;
@@ -1063,9 +1292,7 @@
 			kstat_cpu(cpu).cpustat.iowait += sys_ticks;
 		else
 			kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
-		idle_tick(rq);
-#endif
+		idle_tick(rq, j);
 		return;
 	}
 	if (TASK_NICE(p) > 0)
@@ -1074,12 +1301,13 @@
 		kstat_cpu(cpu).cpustat.user += user_ticks;
 	kstat_cpu(cpu).cpustat.system += sys_ticks;
 
+	spin_lock(&rq->lock);
 	/* Task might have expired already, but not scheduled off yet */
 	if (p->array != rq->active) {
 		set_tsk_need_resched(p);
+		spin_unlock(&rq->lock);
 		return;
 	}
-	spin_lock(&rq->lock);
 	if (unlikely(rt_task(p))) {
 		/*
 		 * RR tasks need a special form of timeslice management.
@@ -1115,16 +1343,14 @@
 
 		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
 			if (!rq->expired_timestamp)
-				rq->expired_timestamp = jiffies;
+				rq->expired_timestamp = j;
 			enqueue_task(p, rq->expired);
 		} else
 			enqueue_task(p, rq->active);
 	}
 out:
-#if CONFIG_SMP
-	if (!(jiffies % BUSY_REBALANCE_TICK))
-		load_balance(rq, 0);
-#endif
+	if (!(j % BUSY_REBALANCE_TICK))
+		load_balance(rq, smp_processor_id(), 0);
 	spin_unlock(&rq->lock);
 }
 
@@ -1135,11 +1361,11 @@
  */
 asmlinkage void schedule(void)
 {
+	int idx, this_cpu, retry = 0;
+	struct list_head *queue;
 	task_t *prev, *next;
-	runqueue_t *rq;
 	prio_array_t *array;
-	struct list_head *queue;
-	int idx;
+	runqueue_t *rq;
 
 	/*
 	 * Test if we are atomic.  Since do_exit() needs to call into
@@ -1152,15 +1378,15 @@
 			dump_stack();
 		}
 	}
-
-	check_highmem_ptes();
 need_resched:
+	check_highmem_ptes();
+	this_cpu = smp_processor_id();
 	preempt_disable();
 	prev = current;
 	rq = this_rq();
 
 	release_kernel_lock(prev);
-	prev->sleep_timestamp = jiffies;
+	prev->last_run = jiffies;
 	spin_lock_irq(&rq->lock);
 
 	/*
@@ -1183,12 +1409,14 @@
 	}
 pick_next_task:
 	if (unlikely(!rq->nr_running)) {
-#if CONFIG_SMP
-		load_balance(rq, 1);
+		load_balance(rq, this_cpu, 1);
 		if (rq->nr_running)
 			goto pick_next_task;
-#endif
-		next = rq->idle;
+		active_load_balance(rq, this_cpu);
+		if (rq->nr_running)
+			goto pick_next_task;
+pick_idle:
+		next = cpu_idle_ptr(this_cpu);
 		rq->expired_timestamp = 0;
 		goto switch_tasks;
 	}
@@ -1204,24 +1432,59 @@
 		rq->expired_timestamp = 0;
 	}
 
+new_array:
 	idx = sched_find_first_bit(array->bitmap);
 	queue = array->queue + idx;
 	next = list_entry(queue->next, task_t, run_list);
+	if ((next != prev) && (rq_nr_cpus(rq) > 1)) {
+		struct list_head *tmp = queue->next;
+
+		while (task_running(next) || !task_allowed(next, this_cpu)) {
+			tmp = tmp->next;
+			if (tmp != queue) {
+				next = list_entry(tmp, task_t, run_list);
+				continue;
+			}
+			idx = find_next_bit(array->bitmap, MAX_PRIO, ++idx);
+			if (idx == MAX_PRIO) {
+				if (retry || !rq->expired->nr_active)
+					goto pick_idle;
+				/*
+				 * To avoid infinite changing of arrays,
+				 * when we have only tasks runnable by
+				 * sibling.
+				 */
+				retry = 1;
+
+				array = rq->expired;
+				goto new_array;
+			}
+			queue = array->queue + idx;
+			tmp = queue->next;
+			next = list_entry(tmp, task_t, run_list);
+		}
+	}
 
 switch_tasks:
 	prefetch(next);
 	clear_tsk_need_resched(prev);
-	RCU_qsctr(prev->thread_info->cpu)++;
+	RCU_qsctr(task_cpu(prev))++;
 
 	if (likely(prev != next)) {
+		struct mm_struct *prev_mm;
 		rq->nr_switches++;
-		rq->curr = next;
+		cpu_curr_ptr(this_cpu) = next;
+		set_task_cpu(next, this_cpu);
 	
 		prepare_arch_switch(rq, next);
-		prev = context_switch(prev, next);
+		prev = context_switch(rq, prev, next);
 		barrier();
 		rq = this_rq();
+		prev_mm = rq->prev_mm;
+		rq->prev_mm = NULL;
 		finish_arch_switch(rq, prev);
+		if (prev_mm)
+			mmdrop(prev_mm);
 	} else
 		spin_unlock_irq(&rq->lock);
 
@@ -1481,9 +1744,8 @@
 		 * If the task is running and lowered its priority,
 		 * or increased its priority then reschedule its CPU:
 		 */
-		if ((NICE_TO_PRIO(nice) < p->static_prio) ||
-							task_running(rq, p))
-			resched_task(rq->curr);
+		if ((NICE_TO_PRIO(nice) < p->static_prio) || task_running(p))
+			resched_task(cpu_curr_ptr(task_cpu(p)));
 	}
 out_unlock:
 	task_rq_unlock(rq, &flags);
@@ -1561,7 +1823,7 @@
  */
 int task_curr(task_t *p)
 {
-	return cpu_curr(task_cpu(p)) == p;
+	return cpu_curr_ptr(task_cpu(p)) == p;
 }
 
 /**
@@ -1570,7 +1832,7 @@
  */
 int idle_cpu(int cpu)
 {
-	return cpu_curr(cpu) == cpu_rq(cpu)->idle;
+	return cpu_curr_ptr(cpu) == cpu_idle_ptr(cpu);
 }
 
 /**
@@ -1660,7 +1922,7 @@
 	else
 		p->prio = p->static_prio;
 	if (array)
-		activate_task(p, task_rq(p));
+		__activate_task(p, task_rq(p));
 
 out_unlock:
 	task_rq_unlock(rq, &flags);
@@ -2135,7 +2397,7 @@
 	local_irq_save(flags);
 	double_rq_lock(idle_rq, rq);
 
-	idle_rq->curr = idle_rq->idle = idle;
+	cpu_curr_ptr(cpu) = cpu_idle_ptr(cpu) = idle;
 	deactivate_task(idle, rq);
 	idle->array = NULL;
 	idle->prio = MAX_PRIO;
@@ -2190,6 +2452,7 @@
 	unsigned long flags;
 	migration_req_t req;
 	runqueue_t *rq;
+	int cpu;
 
 #if 0 /* FIXME: Grab cpu_lock, return error on this case. --RR */
 	new_mask &= cpu_online_map;
@@ -2211,31 +2474,31 @@
 	 * If the task is not on a runqueue (and not running), then
 	 * it is sufficient to simply update the task's cpu field.
 	 */
-	if (!p->array && !task_running(rq, p)) {
+	if (!p->array && !task_running(p)) {
 		set_task_cpu(p, __ffs(p->cpus_allowed));
 		task_rq_unlock(rq, &flags);
 		return;
 	}
 	init_completion(&req.done);
 	req.task = p;
-	list_add(&req.list, &rq->migration_queue);
+	cpu = task_cpu(p);
+	list_add(&req.list, migration_queue(cpu));
 	task_rq_unlock(rq, &flags);
-
-	wake_up_process(rq->migration_thread);
+	wake_up_process(migration_thread(cpu));
 
 	wait_for_completion(&req.done);
 }
 
 /*
- * migration_thread - this is a highprio system thread that performs
+ * migration_task - this is a highprio system thread that performs
  * thread migration by 'pulling' threads into the target runqueue.
  */
-static int migration_thread(void * data)
+static int migration_task(void * data)
 {
 	struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
 	int cpu = (long) data;
 	runqueue_t *rq;
-	int ret;
+	int ret, idx;
 
 	daemonize();
 	sigfillset(&current->blocked);
@@ -2250,7 +2513,8 @@
 	ret = setscheduler(0, SCHED_FIFO, &param);
 
 	rq = this_rq();
-	rq->migration_thread = current;
+	migration_thread(cpu) = current;
+	idx = cpu_idx(cpu);
 
 	sprintf(current->comm, "migration/%d", smp_processor_id());
 
@@ -2263,7 +2527,9 @@
 		task_t *p;
 
 		spin_lock_irqsave(&rq->lock, flags);
-		head = &rq->migration_queue;
+		if (cpu_active_balance(cpu))
+			do_active_balance(rq, cpu);
+		head = migration_queue(cpu);
 		current->state = TASK_INTERRUPTIBLE;
 		if (list_empty(head)) {
 			spin_unlock_irqrestore(&rq->lock, flags);
@@ -2292,9 +2558,8 @@
 			set_task_cpu(p, cpu_dest);
 			if (p->array) {
 				deactivate_task(p, rq_src);
-				activate_task(p, rq_dest);
-				if (p->prio < rq_dest->curr->prio)
-					resched_task(rq_dest->curr);
+				__activate_task(p, rq_dest);
+				wake_up_cpu(rq_dest, cpu_dest, p);
 			}
 		}
 		double_rq_unlock(rq_src, rq_dest);
@@ -2312,12 +2577,13 @@
 			  unsigned long action,
 			  void *hcpu)
 {
+	long cpu = (long) hcpu;
+
 	switch (action) {
 	case CPU_ONLINE:
-		printk("Starting migration thread for cpu %li\n",
-		       (long)hcpu);
-		kernel_thread(migration_thread, hcpu, CLONE_KERNEL);
-		while (!cpu_rq((long)hcpu)->migration_thread)
+		printk("Starting migration thread for cpu %li\n", cpu);
+		kernel_thread(migration_task, hcpu, CLONE_KERNEL);
+		while (!migration_thread(cpu))
 			yield();
 		break;
 	}
@@ -2392,11 +2658,20 @@
 	for (i = 0; i < NR_CPUS; i++) {
 		prio_array_t *array;
 
+		/*
+		 * Start with a 1:1 mapping between CPUs and runqueues:
+		 */
+#if CONFIG_SHARE_RUNQUEUE
+		rq_idx(i) = i;
+		cpu_idx(i) = 0;
+#endif
 		rq = cpu_rq(i);
 		rq->active = rq->arrays;
 		rq->expired = rq->arrays + 1;
 		spin_lock_init(&rq->lock);
-		INIT_LIST_HEAD(&rq->migration_queue);
+		INIT_LIST_HEAD(migration_queue(i));
+		rq->nr_cpus = 1;
+		rq->cpu[cpu_idx(i)].cpu = i;
 		atomic_set(&rq->nr_iowait, 0);
 		nr_running_init(rq);
 
@@ -2414,9 +2689,13 @@
 	 * We have to do a little magic to get the first
 	 * thread right in SMP mode.
 	 */
-	rq = this_rq();
-	rq->curr = current;
-	rq->idle = current;
+	cpu_curr_ptr(smp_processor_id()) = current;
+	cpu_idle_ptr(smp_processor_id()) = current;
+	printk("sched_init().\n");
+	printk("smp_processor_id(): %d.\n", smp_processor_id());
+	printk("rq_idx(smp_processor_id()): %ld.\n", rq_idx(smp_processor_id()));
+	printk("this_rq(): %p.\n", this_rq());
+
 	set_task_cpu(current, smp_processor_id());
 	wake_up_process(current);
 
--- linux/init/main.c.orig	2003-01-20 19:29:09.000000000 +0100
+++ linux/init/main.c	2003-01-20 22:58:36.000000000 +0100
@@ -354,7 +354,14 @@
 
 static void rest_init(void)
 {
+	/* 
+	 *	We count on the initial thread going ok 
+	 *	Like idlers init is an unlocked kernel thread, which will
+	 *	make syscalls (and thus be locked).
+	 */
+	init_idle(current, smp_processor_id());
 	kernel_thread(init, NULL, CLONE_KERNEL);
+
 	unlock_kernel();
  	cpu_idle();
 } 
@@ -438,13 +445,6 @@
 	check_bugs();
 	printk("POSIX conformance testing by UNIFIX\n");
 
-	/* 
-	 *	We count on the initial thread going ok 
-	 *	Like idlers init is an unlocked kernel thread, which will
-	 *	make syscalls (and thus be locked).
-	 */
-	init_idle(current, smp_processor_id());
-
 	/* Do the rest non-__init'ed, we're now alive */
 	rest_init();
 }




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] HT scheduler, sched-2.5.59-D7
  2003-01-20 21:18                                                                               ` [patch] HT scheduler, sched-2.5.59-D7 Ingo Molnar
@ 2003-01-20 22:28                                                                                 ` Andrew Morton
  2003-01-21  1:11                                                                                   ` Michael Hohnbaum
  2003-01-22  3:15                                                                                 ` Michael Hohnbaum
  2003-02-03 18:23                                                                                 ` [patch] HT scheduler, sched-2.5.59-E2 Ingo Molnar
  2 siblings, 1 reply; 96+ messages in thread
From: Andrew Morton @ 2003-01-20 22:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: mbligh, habanero, efocht, hohnbaum, colpatch, hch, rml, torvalds,
	linux-kernel, lse-tech, anton, andrea

Ingo Molnar <mingo@elte.hu> wrote:
>
> 
> the attached patch (against 2.5.59) is my current scheduler tree, it
> includes two main areas of changes:
> 
>  - interactivity improvements, mostly reworked bits from Andrea's tree and 
>    various tunings.
> 

Thanks for doing this.  Initial testing with one workload which is extremely
bad with 2.5.59: huge improvement.

The workload is:

workstation> ssh laptop
laptop> setenv DISPLAY workstation:0
laptop> make -j0 bzImage&
laptop> some-x-application &

For some reason, X-across-ethernet performs terribly when there's a kernel
compile on the client machine - lots of half-second lags.

All gone now.


wrt this:

	if (SMART_WAKE_CHILD) {
		if (unlikely(!current->array))
			__activate_task(p, rq);
		else {
			p->prio = current->prio;
			list_add_tail(&p->run_list, &current->run_list);
			p->array = current->array;
			p->array->nr_active++;
			nr_running_inc(rq);
		}

for some reason I decided that RT tasks need special handling here.  I
forget why though ;)  Please double-check that.
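
One plausible reading of that special handling - a sketch only, the
rt_task() fallback below is a guess at what was meant, not code from any
posted patch:

	if (SMART_WAKE_CHILD) {
		/*
		 * RT children keep their own (RT) priority, so send them
		 * through the normal queueing path instead of splicing
		 * them in behind the parent at the parent's priority:
		 */
		if (unlikely(!current->array || rt_task(p)))
			__activate_task(p, rq);
		else {
			p->prio = current->prio;
			list_add_tail(&p->run_list, &current->run_list);
			p->array = current->array;
			p->array->nr_active++;
			nr_running_inc(rq);
		}
	} else
		activate_task(p, rq);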


--- 25/kernel/sched.c~rcf-simple	Thu Dec 26 02:34:11 2002
+++ 25-akpm/kernel/sched.c	Thu Dec 26 02:34:40 2002
@@ -452,6 +452,8 @@ int wake_up_process(task_t * p)
 void wake_up_forked_process(task_t * p)
 {
 	runqueue_t *rq = this_rq_lock();
+	struct task_struct *this_task = current;
+	prio_array_t *array = this_task->array;
 
 	p->state = TASK_RUNNING;
 	if (!rt_task(p)) {
@@ -467,6 +469,19 @@ void wake_up_forked_process(task_t * p)
 	set_task_cpu(p, smp_processor_id());
 	activate_task(p, rq);
 
+	/*
+	 * Take caller off the head of the runqueue, so child will run first.
+	 */
+	if (array) {
+		if (!rt_task(current)) {
+			dequeue_task(this_task, array);
+			enqueue_task(this_task, array);
+		} else {
+			list_del(&this_task->run_list);
+			list_add_tail(&this_task->run_list,
+					array->queue + this_task->prio);
+		}
+	}
 	rq_unlock(rq);
 }
 



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] HT scheduler, sched-2.5.59-D7
  2003-01-20 22:28                                                                                 ` Andrew Morton
@ 2003-01-21  1:11                                                                                   ` Michael Hohnbaum
  0 siblings, 0 replies; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-21  1:11 UTC (permalink / raw)
  To: Andrew Morton, mingo
  Cc: Martin J. Bligh, Andrew Theurer, Erich Focht, Matthew Dobson,
	hch, rml, torvalds, linux-kernel, lse-tech, anton, andrea

On Mon, 2003-01-20 at 14:28, Andrew Morton wrote:
> Ingo Molnar <mingo@elte.hu> wrote:
> >
> > 
> > the attached patch (against 2.5.59) is my current scheduler tree, it
> > includes two main areas of changes:
> > 
> >  - interactivity improvements, mostly reworked bits from Andrea's tree and 
> >    various tunings.
> > 
> 
> Thanks for doing this.  Initial testing with one workload which is extremely
> bad with 2.5.59: huge improvement.
> 
Initial testing on my NUMA box looks good.  So far, I only have
kernbench numbers, but the system time shows a nice decrease,
and the CPU % busy time has gone up.  

Kernbench:
                        Elapsed       User     System        CPU
           ingoD7-59    28.944s   285.008s    79.998s    1260.8%
             stock59    29.668s   283.762s    82.286s      1233%

I'll try to get more testing in over the next few days.

There was one minor build error introduced by the patch:

#define NODE_THRESHOLD          125

was removed, but the NUMA code still uses it.
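
(A minimal local workaround until that is sorted out is to put the stock
2.5.59 value back into kernel/sched.c - the 125 below is the stock value,
not something from the D7 patch:)

	/* the NUMA balancing code still expects this: */
	#define NODE_THRESHOLD          125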

-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] sched-2.5.59-A2
  2003-01-20 16:56                                                               ` Ingo Molnar
  2003-01-20 17:04                                                                 ` Ingo Molnar
  2003-01-20 17:04                                                                 ` [patch] sched-2.5.59-A2 Martin J. Bligh
@ 2003-01-21 17:44                                                                 ` Erich Focht
  2 siblings, 0 replies; 96+ messages in thread
From: Erich Focht @ 2003-01-21 17:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Michael Hohnbaum, Martin J. Bligh, Matthew Dobson,
	Christoph Hellwig, Robert Love, Andrew Theurer, Linus Torvalds,
	linux-kernel, lse-tech

On Monday 20 January 2003 17:56, Ingo Molnar wrote:
> On Mon, 20 Jan 2003, Erich Focht wrote:
> > Could you please explain your idea? As far as I understand, the SMP
> > balancer (pre-NUMA) tries a global rebalance at each call. Maybe you
> > mean something different...
>
> yes, but eg. in the idle-rebalance case we are more aggressive at moving
> tasks across SMP CPUs. We could perhaps do a similar ->nr_balanced logic
> to do this 'aggressive' balancing even if not triggered from the
> CPU-will-be-idle path. Ie. _perhaps_ the SMP balancer could become a bit
> more aggressive.

Do you mean: make the SMP balancer more aggressive by lowering the
125% threshold?

> ie. SMP is just the first level in the cache-hierarchy, NUMA is the second
> level. (lets hope we dont have to deal with a third caching level anytime
> soon - although that could as well happen once SMT CPUs start doing NUMA.)
> There's no real reason to do balancing in a different way on each level -
> the weight might be different, but the core logic should be synced up.
> (one thing that is indeed different for the NUMA step is locality of
> uncached memory.)

We have an IA64 machine with a 2-level node hierarchy and 32 CPUs (NEC
TX7). In the "old" node-affine scheduler patch the multilevel feature
was implemented via different cross-node steal delays (longer if the
node is further away). In the current approach we could just add another
counter, such that we call the cross-supernode balancer only if the
intra-supernode balancer failed a few times. No idea whether this helps...

> > Yes! Actually the currently implemented nr_balanced logic is pretty
> > dumb: the counter reaches the cross-node balance threshold after a
> > certain number of calls to intra-node lb, no matter whether these were
> > successful or not. I'd like to try incrementing the counter only on
> > unsuccessful load balances; this would give a clear priority to
> > intra-node balancing and a clear and controllable delay for cross-node
> > balancing. A tiny patch for this (for 2.5.59) is attached. As the name
> > nr_balanced would be misleading for this kind of usage, I renamed it to
> > nr_lb_failed.
>
> indeed this approach makes much more sense than the simple ->nr_balanced
> counter. A similar approach makes sense on the SMP level as well: if the
> current 'busy' rebalancer fails to get a new task, we can try the current
> 'idle' rebalancer. Ie. a CPU going idle would do the less intrusive
> rebalancing first.
>
> have you experimented with making the counter limit == 1 actually? Ie.
> immediately trying to do a global balancing once the less intrusive
> balancing fails?

Didn't have time to try and probably won't be able to check this
before the beginning of next week :-( .
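
A minimal sketch of that counting scheme, for reference - the helper
names, the NODE_BALANCE_RATE threshold and the reset-on-success detail
are all illustrative, this is not the attached patch:

	/*
	 * Count only *failed* intra-node balance attempts; escalate to a
	 * cross-node balance only after several consecutive failures.
	 */
	static void numa_rebalance(runqueue_t *this_rq, int this_cpu, int idle)
	{
		if (intra_node_balance(this_rq, this_cpu, idle)) {
			/* local balancing found work - stay node-local */
			this_rq->nr_lb_failed = 0;
			return;
		}
		if (++this_rq->nr_lb_failed >= NODE_BALANCE_RATE) {
			this_rq->nr_lb_failed = 0;
			inter_node_balance(this_rq, this_cpu, idle);
		}
	}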

Regards,
Erich


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] HT scheduler, sched-2.5.59-D7
  2003-01-20 21:18                                                                               ` [patch] HT scheduler, sched-2.5.59-D7 Ingo Molnar
  2003-01-20 22:28                                                                                 ` Andrew Morton
@ 2003-01-22  3:15                                                                                 ` Michael Hohnbaum
  2003-01-22 16:41                                                                                   ` Andrew Theurer
  2003-02-03 18:23                                                                                 ` [patch] HT scheduler, sched-2.5.59-E2 Ingo Molnar
  2 siblings, 1 reply; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-22  3:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin J. Bligh, Andrew Theurer, Erich Focht, Matthew Dobson,
	Christoph Hellwig, Robert Love, Linus Torvalds, linux-kernel,
	lse-tech, Anton Blanchard, Andrea Arcangeli

On Mon, 2003-01-20 at 13:18, Ingo Molnar wrote:
> 
> the attached patch (against 2.5.59) is my current scheduler tree, it
> includes two main areas of changes:
> 
>  - interactivity improvements, mostly reworked bits from Andrea's tree and 
>    various tunings.
> 
>  - HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
>    passive load balancing, active-balancing, HT-aware task pickup,
>    HT-aware affinity and HT-aware wakeup.

I ran Erich's numatest on a system with this patch, plus the 
cputime_stats patch (so that we would get meaningful numbers),
and found a problem.  It appears that on the lightly loaded system
sched_best_cpu is now loading up one node before moving on to the
next.  Once the system is loaded (i.e., a process per cpu) things
even out.  Before applying the D7 patch, processes were distributed
evenly across nodes, even in low load situations.  

The schedbench numbers below show that on the 4- and 8-process runs,
the D7 patch gets the same average user time as on the 16-and-greater
runs.  Without the D7 patch, however, the 4- and 8-process runs tend
to have significantly lower average user time.

Below I'm including the summarized output, and then the detailed
output for the relevant runs on both D7 patched systems and stock.

Overall performance is improved with the D7 patch, so I would like
to find out and fix what went wrong in the light-load cases and
encourage the adoption of the D7 patch (or at least the parts 
that make the NUMA scheduler even faster).  I'm not likely to have 
time to chase this down for the next few days, so am posting
results to see if anyone else can find the cause.

kernels: 
 * stock59-stats = stock 2.5.59 with the cputime_stats patch
 * ingoD7-59.stats = testD7-59 = stock59-stats + Ingo's D7 patch

Kernbench:
                        Elapsed       User     System        CPU
           testD7-59     28.96s   285.314s    79.616s    1260.6%
     ingoD7-59.stats    28.824s   284.834s    79.164s    1263.6%
       stock59-stats    29.498s   283.976s     83.05s    1243.8%

Schedbench 4:
                        AvgUser    Elapsed  TotalUser   TotalSys
           testD7-59      53.19      53.43     212.81       0.59
     ingoD7-59.stats      44.77      46.52     179.10       0.78
       stock59-stats      22.25      35.94      89.06       0.81

Schedbench 8:
                        AvgUser    Elapsed  TotalUser   TotalSys
           testD7-59      53.22      53.66     425.81       1.40
     ingoD7-59.stats      39.44      47.15     315.62       1.62
       stock59-stats      28.40      42.25     227.26       1.67

Schedbench 16:
                        AvgUser    Elapsed  TotalUser   TotalSys
           testD7-59      52.84      58.26     845.49       2.78
     ingoD7-59.stats      52.85      57.31     845.68       3.29
       stock59-stats      52.97      57.19     847.70       3.29

Schedbench 32:
                        AvgUser    Elapsed  TotalUser   TotalSys
           testD7-59      56.77     122.51    1816.80       7.58
     ingoD7-59.stats      56.54     125.79    1809.67       6.97
       stock59-stats      56.57     118.05    1810.53       5.97

Schedbench 64:
                        AvgUser    Elapsed  TotalUser   TotalSys
           testD7-59      57.52     234.27    3681.86      18.18
     ingoD7-59.stats      58.25     242.61    3728.46      17.40
       stock59-stats      56.75     234.12    3632.72      15.70

Detailed stats from running numatest with 4 processes on the D7 patch. 
Note how most of the load is put on node 0.

Executing 4 times: ./randupdt 1000000 
Running 'hackbench 20' in the background: Time: 5.039
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1   100.0    0.0    0.0    0.0 |    0     0    |  46.27
  2    58.4    0.0    0.0   41.6 |    0     0    |  41.18
  3   100.0    0.0    0.0    0.0 |    0     0    |  45.72
  4   100.0    0.0    0.0    0.0 |    0     0    |  45.89
AverageUserTime 44.77 seconds
ElapsedTime     46.52
TotalUserTime   179.10
TotalSysTime    0.78

Detailed stats from running numatest with 8 processes on the D7 patch.
In this one it appears that node 0 was loaded, then node 1.

Executing 8 times: ./randupdt 1000000 
Running 'hackbench 20' in the background: Time: 11.185
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1   100.0    0.0    0.0    0.0 |    0     0    |  46.89
  2   100.0    0.0    0.0    0.0 |    0     0    |  46.20
  3   100.0    0.0    0.0    0.0 |    0     0    |  46.31
  4     0.0  100.0    0.0    0.0 |    1     1    |  39.44
  5     0.0    0.0   99.9    0.0 |    2     2    |  16.00
  6    62.6    0.0    0.0   37.4 |    0     0    |  42.23
  7     0.0  100.0    0.0    0.0 |    1     1    |  39.12
  8     0.0  100.0    0.0    0.0 |    1     1    |  39.35
AverageUserTime 39.44 seconds
ElapsedTime     47.15
TotalUserTime   315.62
TotalSysTime    1.62

Control run - detailed stats running numatest with 4 processes
on a stock59 kernel.

Executing 4 times: ./randupdt 1000000 
Running 'hackbench 20' in the background: Time: 8.297
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0   99.8    0.0    0.0 |    1     1    |  16.63
  2   100.0    0.0    0.0    0.0 |    0     0    |  27.83
  3     0.0    0.0   99.9    0.0 |    2     2    |  16.27
  4   100.0    0.0    0.0    0.0 |    0     0    |  28.29
AverageUserTime 22.25 seconds
ElapsedTime     35.94
TotalUserTime   89.06
TotalSysTime    0.81

Control run - detailed stats running numatest with 8 processes
on a stock59 kernel.

Executing 8 times: ./randupdt 1000000 
Running 'hackbench 20' in the background: Time: 9.458
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0   99.9    0.0    0.0 |    1     1    |  27.77
  2   100.0    0.0    0.0    0.0 |    0     0    |  29.34
  3     0.0    0.0  100.0    0.0 |    2     2    |  28.03
  4     0.0    0.0    0.0  100.0 |    3     3    |  24.15
  5    13.1    0.0    0.0   86.9 |    0     3   *|  33.36
  6     0.0  100.0    0.0    0.0 |    1     1    |  27.94
  7     0.0    0.0  100.0    0.0 |    2     2    |  28.02
  8   100.0    0.0    0.0    0.0 |    0     0    |  28.58

-- 
Michael Hohnbaum            503-578-5486
hohnbaum@us.ibm.com         T/L 775-5486


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] HT scheduler, sched-2.5.59-D7
  2003-01-22 16:41                                                                                   ` Andrew Theurer
@ 2003-01-22 16:17                                                                                     ` Martin J. Bligh
  2003-01-22 16:20                                                                                       ` Andrew Theurer
  2003-01-22 16:35                                                                                     ` Michael Hohnbaum
  1 sibling, 1 reply; 96+ messages in thread
From: Martin J. Bligh @ 2003-01-22 16:17 UTC (permalink / raw)
  To: Andrew Theurer, Michael Hohnbaum, Ingo Molnar
  Cc: Erich Focht, Matthew Dobson, Christoph Hellwig, Robert Love,
	Linus Torvalds, linux-kernel, lse-tech, Anton Blanchard,
	Andrea Arcangeli

> Michael,  my experience has been that 2.5.59 loaded up the first node before
> distributing out tasks (at least on kernbench). 

Could you throw a printk in map_cpu_to_node and check which cpus come out 
on which nodes?

> I'll try to get some of these tests on x440 asap to compare.

So what are you running on now? the HT stuff?

M.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] HT scheduler, sched-2.5.59-D7
  2003-01-22 16:17                                                                                     ` Martin J. Bligh
@ 2003-01-22 16:20                                                                                       ` Andrew Theurer
  0 siblings, 0 replies; 96+ messages in thread
From: Andrew Theurer @ 2003-01-22 16:20 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum, Ingo Molnar
  Cc: Erich Focht, Matthew Dobson, Christoph Hellwig, Robert Love,
	Linus Torvalds, linux-kernel, lse-tech, Anton Blanchard,
	Andrea Arcangeli

On Wednesday 22 January 2003 10:17, Martin J. Bligh wrote:
> > Michael,  my experience has been that 2.5.59 loaded up the first node
> > before distributing out tasks (at least on kernbench).
>
> Could you throw a printk in map_cpu_to_node and check which cpus come out
> on which nodes?
>
> > I'll try to get some of these tests on x440 asap to compare.
>
> So what are you running on now? the HT stuff?

I am running nothing right now, mainly because I don't have access to that HT 
system anymore. HT-numa may work better than it did with the new load_balance 
idle policy, but I'm not sure it's worth pursuing with Ingo's HT patch.  Let 
me know if you think it's worth investigating.

-Andrew Theurer

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] HT scheduler, sched-2.5.59-D7
  2003-01-22 16:41                                                                                   ` Andrew Theurer
  2003-01-22 16:17                                                                                     ` Martin J. Bligh
@ 2003-01-22 16:35                                                                                     ` Michael Hohnbaum
  1 sibling, 0 replies; 96+ messages in thread
From: Michael Hohnbaum @ 2003-01-22 16:35 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Ingo Molnar, Martin J. Bligh, Erich Focht, Matthew Dobson,
	Christoph Hellwig, Robert Love, Linus Torvalds, linux-kernel,
	lse-tech, Anton Blanchard, Andrea Arcangeli

On Wed, 2003-01-22 at 08:41, Andrew Theurer wrote:
> > On Mon, 2003-01-20 at 13:18, Ingo Molnar wrote:
> > >
> > > the attached patch (against 2.5.59) is my current scheduler tree, it
> > > includes two main areas of changes:
> > >
> > >  - interactivity improvements, mostly reworked bits from Andrea's tree
> and
> > >    various tunings.
> > >
> > >  - HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
> > >    passive load balancing, active-balancing, HT-aware task pickup,
> > >    HT-aware affinity and HT-aware wakeup.
> >
> > I ran Erich's numatest on a system with this patch, plus the
> > cputime_stats patch (so that we would get meaningful numbers),
> > and found a problem.  It appears that on the lightly loaded system
> > sched_best_cpu is now loading up one node before moving on to the
> > next.  Once the system is loaded (i.e., a process per cpu) things
> > even out.  Before applying the D7 patch, processes were distributed
> > evenly across nodes, even in low load situations.
> 
> Michael,  my experience has been that 2.5.59 loaded up the first node before
> distributing out tasks (at least on kernbench).  

Well the data I posted doesn't support that conclusion - it showed at
most two processes on the first node before moving to the next node
for 2.5.59, but for the D7-patched system, the current node was fully
loaded before putting processes on other nodes.  I've repeated this on
multiple runs and obtained similar results.

> The first check in
> sched_best_cpu would almost always place the new task on the same cpu, and
> intra-node balance on an idle cpu in the same node would almost always steal
> it before an inter-node balance could steal it.  Also, sched_best_cpu does
> not appear to be changed in D7. 

That is true, and is the only thing I've had a chance to look at.  
sched_best_cpu depends on data collected elsewhere, so my suspicion
is that it is working with bad data.  I'll try to find time this week
to look further at it.

>  Actually, I expected D7 to have the
> opposite effect you describe (although I have not tried it yet), since
> load_balance will now steal a running task if called by an idle cpu.
> 
> I'll try to get some of these tests on x440 asap to compare.

I'm interested in seeing these results.  Any chance of getting time on
a 4-node x440?
> 
> -Andrew Theurer
> 
> 
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] HT scheduler, sched-2.5.59-D7
  2003-01-22  3:15                                                                                 ` Michael Hohnbaum
@ 2003-01-22 16:41                                                                                   ` Andrew Theurer
  2003-01-22 16:17                                                                                     ` Martin J. Bligh
  2003-01-22 16:35                                                                                     ` Michael Hohnbaum
  0 siblings, 2 replies; 96+ messages in thread
From: Andrew Theurer @ 2003-01-22 16:41 UTC (permalink / raw)
  To: Michael Hohnbaum, Ingo Molnar
  Cc: Martin J. Bligh, Erich Focht, Matthew Dobson, Christoph Hellwig,
	Robert Love, Linus Torvalds, linux-kernel, lse-tech,
	Anton Blanchard, Andrea Arcangeli

> On Mon, 2003-01-20 at 13:18, Ingo Molnar wrote:
> >
> > the attached patch (against 2.5.59) is my current scheduler tree, it
> > includes two main areas of changes:
> >
> >  - interactivity improvements, mostly reworked bits from Andrea's tree
and
> >    various tunings.
> >
> >  - HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
> >    passive load balancing, active-balancing, HT-aware task pickup,
> >    HT-aware affinity and HT-aware wakeup.
>
> I ran Erich's numatest on a system with this patch, plus the
> cputime_stats patch (so that we would get meaningful numbers),
> and found a problem.  It appears that on the lightly loaded system
> sched_best_cpu is now loading up one node before moving on to the
> next.  Once the system is loaded (i.e., a process per cpu) things
> even out.  Before applying the D7 patch, processes were distributed
> evenly across nodes, even in low load situations.

Michael,  my experience has been that 2.5.59 loaded up the first node before
distributing out tasks (at least on kernbench).  The first check in
sched_best_cpu would almost always place the new task on the same cpu, and
intra-node balance on an idle cpu in the same node would almost always steal
it before an inter-node balance could steal it.  Also, sched_best_cpu does
not appear to be changed in D7.  Actually, I expected D7 to have the
opposite effect you describe (although I have not tried it yet), since
load_balance will now steal a running task if called by an idle cpu.
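
Roughly, the placement order being described is (this is an illustration,
not the actual 2.5.59 sched_best_cpu(); find_least_loaded_cpu() and the
threshold value are stand-ins):

	static int placement_sketch(task_t *p)
	{
		int this_cpu = task_cpu(p);

		/*
		 * The "first check": a lightly loaded local queue keeps the
		 * new task on the parent's CPU almost every time.
		 */
		if (cpu_rq(this_cpu)->nr_running <= 2)
			return this_cpu;

		/*
		 * Only when the local queue is busy do we go hunting for the
		 * least loaded node/CPU.  From there, an idle CPU in the same
		 * node steals the task via the frequent intra-node balance
		 * long before the much rarer inter-node balance runs.
		 */
		return find_least_loaded_cpu(p);
	}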

I'll try to get some of these tests on x440 asap to compare.

-Andrew Theurer


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [patch] HT scheduler, sched-2.5.59-E2
  2003-01-20 21:18                                                                               ` [patch] HT scheduler, sched-2.5.59-D7 Ingo Molnar
  2003-01-20 22:28                                                                                 ` Andrew Morton
  2003-01-22  3:15                                                                                 ` Michael Hohnbaum
@ 2003-02-03 18:23                                                                                 ` Ingo Molnar
  2003-02-03 20:47                                                                                   ` Robert Love
  2003-02-04  9:31                                                                                   ` Erich Focht
  2 siblings, 2 replies; 96+ messages in thread
From: Ingo Molnar @ 2003-02-03 18:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Martin J. Bligh, Andrew Theurer, Erich Focht, Michael Hohnbaum,
	Matthew Dobson, Christoph Hellwig, Robert Love, Linus Torvalds,
	lse-tech, Anton Blanchard, Andrea Arcangeli


the attached patch (against 2.5.59 or BK-curr) has two HT-scheduler fixes
over the -D7 patch:

 - HT-aware task pickup did not do the right thing in an important
   special-case: when there are 2 tasks running in a 2-CPU runqueue and
   one of the tasks reschedules but still stays runnable. In this case,
   instead of running the previous task again, the scheduler incorrectly
   picked the idle task, which in turn reduced performance.

 - active load-balancing only worked in a limited way due to an indexing
   bug; now it properly active-balances on all CPUs.

changes in -D7 over 2.5.59:

  - interactivity improvements, mostly reworked bits from Andrea's tree
    and various tunings.
 
  - HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
    passive load balancing, active-balancing, HT-aware task pickup,
    HT-aware affinity and HT-aware wakeup.

the sched-2.5.59-E2 patch can also be downloaded from:
 
    http://redhat.com/~mingo/O(1)-scheduler/

	Ingo

--- linux/arch/i386/kernel/cpu/proc.c.orig	
+++ linux/arch/i386/kernel/cpu/proc.c	
@@ -1,4 +1,5 @@
 #include <linux/smp.h>
+#include <linux/sched.h>
 #include <linux/timex.h>
 #include <linux/string.h>
 #include <asm/semaphore.h>
@@ -101,6 +102,13 @@
 		     fpu_exception ? "yes" : "no",
 		     c->cpuid_level,
 		     c->wp_works_ok ? "yes" : "no");
+#if CONFIG_SHARE_RUNQUEUE
+{
+	extern long __rq_idx[NR_CPUS];
+
+	seq_printf(m, "\nrunqueue\t: %d\n", __rq_idx[n]);
+}
+#endif
 
 	for ( i = 0 ; i < 32*NCAPINTS ; i++ )
 		if ( test_bit(i, c->x86_capability) &&
--- linux/arch/i386/kernel/smpboot.c.orig	
+++ linux/arch/i386/kernel/smpboot.c	
@@ -38,6 +38,7 @@
 #include <linux/kernel.h>
 
 #include <linux/mm.h>
+#include <linux/sched.h>
 #include <linux/kernel_stat.h>
 #include <linux/smp_lock.h>
 #include <linux/irq.h>
@@ -945,6 +946,16 @@
 
 int cpu_sibling_map[NR_CPUS] __cacheline_aligned;
 
+static int test_ht;
+
+static int __init ht_setup(char *str)
+{
+	test_ht = 1;
+	return 1;
+}
+
+__setup("test_ht", ht_setup);
+
 static void __init smp_boot_cpus(unsigned int max_cpus)
 {
 	int apicid, cpu, bit;
@@ -1087,16 +1098,31 @@
 	Dprintk("Boot done.\n");
 
 	/*
-	 * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so
+	 * Here we can be sure that there is an IO-APIC in the system. Let's
+	 * go and set it up:
+	 */
+	smpboot_setup_io_apic();
+
+	setup_boot_APIC_clock();
+
+	/*
+	 * Synchronize the TSC with the AP
+	 */
+	if (cpu_has_tsc && cpucount)
+		synchronize_tsc_bp();
+	/*
+	 * If Hyper-Threading is available, construct cpu_sibling_map[], so
 	 * that we can tell the sibling CPU efficiently.
 	 */
+printk("cpu_has_ht: %d, smp_num_siblings: %d, num_online_cpus(): %d.\n", cpu_has_ht, smp_num_siblings, num_online_cpus());
 	if (cpu_has_ht && smp_num_siblings > 1) {
 		for (cpu = 0; cpu < NR_CPUS; cpu++)
 			cpu_sibling_map[cpu] = NO_PROC_ID;
 		
 		for (cpu = 0; cpu < NR_CPUS; cpu++) {
 			int 	i;
-			if (!test_bit(cpu, &cpu_callout_map)) continue;
+			if (!test_bit(cpu, &cpu_callout_map))
+				continue;
 
 			for (i = 0; i < NR_CPUS; i++) {
 				if (i == cpu || !test_bit(i, &cpu_callout_map))
@@ -1112,17 +1138,41 @@
 				printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu);
 			}
 		}
-	}
-
-	smpboot_setup_io_apic();
-
-	setup_boot_APIC_clock();
+#if CONFIG_SHARE_RUNQUEUE
+		/*
+		 * At this point APs would have synchronised TSC and
+		 * waiting for smp_commenced, with their APIC timer
+		 * disabled. So BP can go ahead do some initialization
+		 * for Hyper-Threading (if needed).
+		 */
+		for (cpu = 0; cpu < NR_CPUS; cpu++) {
+			int i;
+			if (!test_bit(cpu, &cpu_callout_map))
+				continue;
+			for (i = 0; i < NR_CPUS; i++) {
+				if (i <= cpu)
+					continue;
+				if (!test_bit(i, &cpu_callout_map))
+					continue;
 
-	/*
-	 * Synchronize the TSC with the AP
-	 */
-	if (cpu_has_tsc && cpucount)
-		synchronize_tsc_bp();
+				if (phys_proc_id[cpu] != phys_proc_id[i])
+					continue;
+				/*
+				 * merge runqueues, resulting in one
+				 * runqueue per package:
+				 */
+				sched_map_runqueue(cpu, i);
+				break;
+			}
+		}
+#endif
+	}
+#if CONFIG_SHARE_RUNQUEUE
+	if (smp_num_siblings == 1 && test_ht) {
+		printk("Simulating a 2-sibling 1-phys-CPU HT setup!\n");
+		sched_map_runqueue(0, 1);
+	}
+#endif
 }
 
 /* These are wrappers to interface to the new boot process.  Someone
--- linux/arch/i386/Kconfig.orig	
+++ linux/arch/i386/Kconfig	
@@ -408,6 +408,24 @@
 
 	  If you don't know what to do here, say N.
 
+choice
+
+	prompt "Hyperthreading Support"
+	depends on SMP
+	default NR_SIBLINGS_0
+
+config NR_SIBLINGS_0
+	bool "off"
+
+config NR_SIBLINGS_2
+	bool "2 siblings"
+
+config NR_SIBLINGS_4
+	bool "4 siblings"
+
+endchoice
+
+
 config PREEMPT
 	bool "Preemptible Kernel"
 	help
--- linux/fs/pipe.c.orig	
+++ linux/fs/pipe.c	
@@ -117,7 +117,7 @@
 	up(PIPE_SEM(*inode));
 	/* Signal writers asynchronously that there is more room.  */
 	if (do_wakeup) {
-		wake_up_interruptible(PIPE_WAIT(*inode));
+		wake_up_interruptible_sync(PIPE_WAIT(*inode));
 		kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
 	}
 	if (ret > 0)
@@ -205,7 +205,7 @@
 	}
 	up(PIPE_SEM(*inode));
 	if (do_wakeup) {
-		wake_up_interruptible(PIPE_WAIT(*inode));
+		wake_up_interruptible_sync(PIPE_WAIT(*inode));
 		kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
 	}
 	if (ret > 0) {
@@ -275,7 +275,7 @@
 		free_page((unsigned long) info->base);
 		kfree(info);
 	} else {
-		wake_up_interruptible(PIPE_WAIT(*inode));
+		wake_up_interruptible_sync(PIPE_WAIT(*inode));
 		kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
 		kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
 	}
--- linux/include/linux/sched.h.orig	
+++ linux/include/linux/sched.h	
@@ -147,6 +147,24 @@
 extern void sched_init(void);
 extern void init_idle(task_t *idle, int cpu);
 
+/*
+ * Is there a way to do this via Kconfig?
+ */
+#define CONFIG_NR_SIBLINGS 0
+
+#if CONFIG_NR_SIBLINGS_2
+# define CONFIG_NR_SIBLINGS 2
+#elif CONFIG_NR_SIBLINGS_4
+# define CONFIG_NR_SIBLINGS 4
+#endif
+
+#if CONFIG_NR_SIBLINGS
+# define CONFIG_SHARE_RUNQUEUE 1
+#else
+# define CONFIG_SHARE_RUNQUEUE 0
+#endif
+extern void sched_map_runqueue(int cpu1, int cpu2);
+
 extern void show_state(void);
 extern void show_trace(unsigned long *stack);
 extern void show_stack(unsigned long *stack);
@@ -293,7 +311,7 @@
 	prio_array_t *array;
 
 	unsigned long sleep_avg;
-	unsigned long sleep_timestamp;
+	unsigned long last_run;
 
 	unsigned long policy;
 	unsigned long cpus_allowed;
@@ -605,6 +623,8 @@
 #define remove_parent(p)	list_del_init(&(p)->sibling)
 #define add_parent(p, parent)	list_add_tail(&(p)->sibling,&(parent)->children)
 
+#if 1
+
 #define REMOVE_LINKS(p) do {					\
 	if (thread_group_leader(p))				\
 		list_del_init(&(p)->tasks);			\
@@ -633,6 +653,31 @@
 #define while_each_thread(g, t) \
 	while ((t = next_thread(t)) != g)
 
+#else
+
+#define REMOVE_LINKS(p) do {					\
+	list_del_init(&(p)->tasks);				\
+	remove_parent(p);					\
+	} while (0)
+
+#define SET_LINKS(p) do {					\
+	list_add_tail(&(p)->tasks,&init_task.tasks);		\
+	add_parent(p, (p)->parent);				\
+	} while (0)
+
+#define next_task(p)	list_entry((p)->tasks.next, struct task_struct, tasks)
+#define prev_task(p)	list_entry((p)->tasks.prev, struct task_struct, tasks)
+
+#define for_each_process(p) \
+	for (p = &init_task ; (p = next_task(p)) != &init_task ; )
+
+#define do_each_thread(g, t) \
+	for (t = &init_task ; (t = next_task(t)) != &init_task ; )
+
+#define while_each_thread(g, t)
+
+#endif
+
 extern task_t * FASTCALL(next_thread(task_t *p));
 
 #define thread_group_leader(p)	(p->pid == p->tgid)
--- linux/include/asm-i386/apic.h.orig	
+++ linux/include/asm-i386/apic.h	
@@ -98,4 +98,6 @@
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
+extern int phys_proc_id[NR_CPUS];
+
 #endif /* __ASM_APIC_H */
--- linux/kernel/fork.c.orig	
+++ linux/kernel/fork.c	
@@ -876,7 +876,7 @@
 	 */
 	p->first_time_slice = 1;
 	current->time_slice >>= 1;
-	p->sleep_timestamp = jiffies;
+	p->last_run = jiffies;
 	if (!current->time_slice) {
 		/*
 	 	 * This case is rare, it happens when the parent has only
--- linux/kernel/sys.c.orig	
+++ linux/kernel/sys.c	
@@ -220,7 +220,7 @@
 
 	if (error == -ESRCH)
 		error = 0;
-	if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
+	if (0 && niceval < task_nice(p) && !capable(CAP_SYS_NICE))
 		error = -EACCES;
 	else
 		set_user_nice(p, niceval);
--- linux/kernel/sched.c.orig	
+++ linux/kernel/sched.c	
@@ -60,13 +60,15 @@
  */
 #define MIN_TIMESLICE		( 10 * HZ / 1000)
 #define MAX_TIMESLICE		(300 * HZ / 1000)
-#define CHILD_PENALTY		95
+#define CHILD_PENALTY		50
 #define PARENT_PENALTY		100
 #define EXIT_WEIGHT		3
 #define PRIO_BONUS_RATIO	25
 #define INTERACTIVE_DELTA	2
-#define MAX_SLEEP_AVG		(2*HZ)
-#define STARVATION_LIMIT	(2*HZ)
+#define MAX_SLEEP_AVG		(10*HZ)
+#define STARVATION_LIMIT	(30*HZ)
+#define SYNC_WAKEUPS		1
+#define SMART_WAKE_CHILD	1
 #define NODE_THRESHOLD          125
 
 /*
@@ -141,6 +143,48 @@
 };
 
 /*
+ * It's possible for two CPUs to share the same runqueue.
+ * This makes sense if they eg. share caches.
+ *
+ * We take the common 1:1 (SMP, UP) case and optimize it,
+ * the rest goes via remapping: rq_idx(cpu) gives the
+ * runqueue on which a particular cpu is on, cpu_idx(cpu)
+ * gives the rq-specific index of the cpu.
+ *
+ * (Note that the generic scheduler code does not impose any
+ *  restrictions on the mappings - there can be 4 CPUs per
+ *  runqueue or even assymetric mappings.)
+ */
+#if CONFIG_SHARE_RUNQUEUE
+# define MAX_NR_SIBLINGS CONFIG_NR_SIBLINGS
+  long __rq_idx[NR_CPUS] __cacheline_aligned;
+  static long __cpu_idx[NR_CPUS] __cacheline_aligned;
+# define rq_idx(cpu) (__rq_idx[(cpu)])
+# define cpu_idx(cpu) (__cpu_idx[(cpu)])
+# define for_each_sibling(idx, rq) \
+		for ((idx) = 0; (idx) < (rq)->nr_cpus; (idx)++)
+# define rq_nr_cpus(rq) ((rq)->nr_cpus)
+# define cpu_active_balance(c) (cpu_rq(c)->cpu[0].active_balance)
+#else
+# define MAX_NR_SIBLINGS 1
+# define rq_idx(cpu) (cpu)
+# define cpu_idx(cpu) 0
+# define for_each_sibling(idx, rq) while (0)
+# define cpu_active_balance(c) 0
+# define do_active_balance(rq, cpu) do { } while (0)
+# define rq_nr_cpus(rq) 1
+  static inline void active_load_balance(runqueue_t *rq, int this_cpu) { }
+#endif
+
+typedef struct cpu_s {
+	task_t *curr, *idle;
+	task_t *migration_thread;
+	struct list_head migration_queue;
+	int active_balance;
+	int cpu;
+} cpu_t;
+
+/*
  * This is the main, per-CPU runqueue data structure.
  *
  * Locking rule: those places that want to lock multiple runqueues
@@ -151,7 +195,7 @@
 	spinlock_t lock;
 	unsigned long nr_running, nr_switches, expired_timestamp,
 			nr_uninterruptible;
-	task_t *curr, *idle;
+	struct mm_struct *prev_mm;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
 #ifdef CONFIG_NUMA
@@ -159,27 +203,39 @@
 	unsigned int nr_balanced;
 	int prev_node_load[MAX_NUMNODES];
 #endif
-	task_t *migration_thread;
-	struct list_head migration_queue;
+	int nr_cpus;
+	cpu_t cpu[MAX_NR_SIBLINGS];
 
 	atomic_t nr_iowait;
 } ____cacheline_aligned;
 
 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
 
-#define cpu_rq(cpu)		(runqueues + (cpu))
+#define cpu_rq(cpu)		(runqueues + (rq_idx(cpu)))
+#define cpu_int(c)		((cpu_rq(c))->cpu + cpu_idx(c))
+#define cpu_curr_ptr(cpu)	(cpu_int(cpu)->curr)
+#define cpu_idle_ptr(cpu)	(cpu_int(cpu)->idle)
+
 #define this_rq()		cpu_rq(smp_processor_id())
 #define task_rq(p)		cpu_rq(task_cpu(p))
-#define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define rt_task(p)		((p)->prio < MAX_RT_PRIO)
 
+#define migration_thread(cpu)	(cpu_int(cpu)->migration_thread)
+#define migration_queue(cpu)	(&cpu_int(cpu)->migration_queue)
+
+#if NR_CPUS > 1
+# define task_allowed(p, cpu)	((p)->cpus_allowed & (1UL << (cpu)))
+#else
+# define task_allowed(p, cpu)	1
+#endif
+
 /*
  * Default context-switch locking:
  */
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(rq, next)	do { } while(0)
 # define finish_arch_switch(rq, next)	spin_unlock_irq(&(rq)->lock)
-# define task_running(rq, p)		((rq)->curr == (p))
+# define task_running(p)		(cpu_curr_ptr(task_cpu(p)) == (p))
 #endif
 
 #ifdef CONFIG_NUMA
@@ -322,16 +378,21 @@
  * Also update all the scheduling statistics stuff. (sleep average
  * calculation, priority modifiers, etc.)
  */
+static inline void __activate_task(task_t *p, runqueue_t *rq)
+{
+	enqueue_task(p, rq->active);
+	nr_running_inc(rq);
+}
+
 static inline void activate_task(task_t *p, runqueue_t *rq)
 {
-	unsigned long sleep_time = jiffies - p->sleep_timestamp;
-	prio_array_t *array = rq->active;
+	unsigned long sleep_time = jiffies - p->last_run;
 
 	if (!rt_task(p) && sleep_time) {
 		/*
 		 * This code gives a bonus to interactive tasks. We update
 		 * an 'average sleep time' value here, based on
-		 * sleep_timestamp. The more time a task spends sleeping,
+		 * ->last_run. The more time a task spends sleeping,
 		 * the higher the average gets - and the higher the priority
 		 * boost gets as well.
 		 */
@@ -340,8 +401,7 @@
 			p->sleep_avg = MAX_SLEEP_AVG;
 		p->prio = effective_prio(p);
 	}
-	enqueue_task(p, array);
-	nr_running_inc(rq);
+	__activate_task(p, rq);
 }
 
 /*
@@ -382,6 +442,11 @@
 #endif
 }
 
+static inline void resched_cpu(int cpu)
+{
+	resched_task(cpu_curr_ptr(cpu));
+}
+
 #ifdef CONFIG_SMP
 
 /*
@@ -398,7 +463,7 @@
 repeat:
 	preempt_disable();
 	rq = task_rq(p);
-	if (unlikely(task_running(rq, p))) {
+	if (unlikely(task_running(p))) {
 		cpu_relax();
 		/*
 		 * enable/disable preemption just to make this
@@ -409,7 +474,7 @@
 		goto repeat;
 	}
 	rq = task_rq_lock(p, &flags);
-	if (unlikely(task_running(rq, p))) {
+	if (unlikely(task_running(p))) {
 		task_rq_unlock(rq, &flags);
 		preempt_enable();
 		goto repeat;
@@ -431,10 +496,39 @@
  */
 void kick_if_running(task_t * p)
 {
-	if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id()))
+	if ((task_running(p)) && (task_cpu(p) != smp_processor_id()))
 		resched_task(p);
 }
 
+static void wake_up_cpu(runqueue_t *rq, int cpu, task_t *p)
+{
+	cpu_t *curr_cpu;
+	task_t *curr;
+	int idx;
+
+	if (idle_cpu(cpu))
+		return resched_cpu(cpu);
+
+	for_each_sibling(idx, rq) {
+		curr_cpu = rq->cpu + idx;
+		if (!task_allowed(p, curr_cpu->cpu))
+			continue;
+		if (curr_cpu->idle == curr_cpu->curr)
+			return resched_cpu(curr_cpu->cpu);
+	}
+
+	if (p->prio < cpu_curr_ptr(cpu)->prio)
+		return resched_task(cpu_curr_ptr(cpu));
+
+	for_each_sibling(idx, rq) {
+		curr_cpu = rq->cpu + idx;
+		if (!task_allowed(p, curr_cpu->cpu))
+			continue;
+		curr = curr_cpu->curr;
+		if (p->prio < curr->prio)
+			return resched_task(curr);
+	}
+}
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
@@ -455,6 +549,7 @@
 	long old_state;
 	runqueue_t *rq;
 
+	sync &= SYNC_WAKEUPS;
 repeat_lock_task:
 	rq = task_rq_lock(p, &flags);
 	old_state = p->state;
@@ -463,7 +558,7 @@
 		 * Fast-migrate the task if it's not running or runnable
 		 * currently. Do not violate hard affinity.
 		 */
-		if (unlikely(sync && !task_running(rq, p) &&
+		if (unlikely(sync && !task_running(p) &&
 			(task_cpu(p) != smp_processor_id()) &&
 			(p->cpus_allowed & (1UL << smp_processor_id())))) {
 
@@ -473,10 +568,12 @@
 		}
 		if (old_state == TASK_UNINTERRUPTIBLE)
 			rq->nr_uninterruptible--;
-		activate_task(p, rq);
-
-		if (p->prio < rq->curr->prio)
-			resched_task(rq->curr);
+		if (sync)
+			__activate_task(p, rq);
+		else {
+			activate_task(p, rq);
+			wake_up_cpu(rq, task_cpu(p), p);
+		}
 		success = 1;
 	}
 	p->state = TASK_RUNNING;
@@ -512,8 +609,19 @@
 		p->prio = effective_prio(p);
 	}
 	set_task_cpu(p, smp_processor_id());
-	activate_task(p, rq);
 
+	if (SMART_WAKE_CHILD) {
+		if (unlikely(!current->array))
+			__activate_task(p, rq);
+		else {
+			p->prio = current->prio;
+			list_add_tail(&p->run_list, &current->run_list);
+			p->array = current->array;
+			p->array->nr_active++;
+			nr_running_inc(rq);
+		}
+	} else
+		activate_task(p, rq);
 	rq_unlock(rq);
 }
 
@@ -561,7 +669,7 @@
  * context_switch - switch to the new MM and the new
  * thread's register state.
  */
-static inline task_t * context_switch(task_t *prev, task_t *next)
+static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next)
 {
 	struct mm_struct *mm = next->mm;
 	struct mm_struct *oldmm = prev->active_mm;
@@ -575,7 +683,7 @@
 
 	if (unlikely(!prev->mm)) {
 		prev->active_mm = NULL;
-		mmdrop(oldmm);
+		rq->prev_mm = oldmm;
 	}
 
 	/* Here we just switch the register state and the stack. */
@@ -596,8 +704,9 @@
 	unsigned long i, sum = 0;
 
 	for (i = 0; i < NR_CPUS; i++)
-		sum += cpu_rq(i)->nr_running;
-
+		/* Shared runqueues are counted only once. */
+		if (!cpu_idx(i))
+			sum += cpu_rq(i)->nr_running;
 	return sum;
 }
 
@@ -608,7 +717,9 @@
 	for (i = 0; i < NR_CPUS; i++) {
 		if (!cpu_online(i))
 			continue;
-		sum += cpu_rq(i)->nr_uninterruptible;
+		/* Shared runqueues are counted only once. */
+		if (!cpu_idx(i))
+			sum += cpu_rq(i)->nr_uninterruptible;
 	}
 	return sum;
 }
@@ -790,7 +901,23 @@
 
 #endif /* CONFIG_NUMA */
 
-#if CONFIG_SMP
+/*
+ * One of the idle_cpu_tick() and busy_cpu_tick() functions will
+ * get called every timer tick, on every CPU. Our balancing action
+ * frequency and balancing agressivity depends on whether the CPU is
+ * idle or not.
+ *
+ * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * systems with HZ=100, every 10 msecs.)
+ */
+#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+
+#if !CONFIG_SMP
+
+static inline void load_balance(runqueue_t *rq, int this_cpu, int idle) { }
+
+#else
 
 /*
  * double_lock_balance - lock the busiest runqueue
@@ -906,12 +1033,7 @@
 	set_task_cpu(p, this_cpu);
 	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
-	/*
-	 * Note that idle threads have a prio of MAX_PRIO, for this test
-	 * to be always true for them.
-	 */
-	if (p->prio < this_rq->curr->prio)
-		set_need_resched();
+	wake_up_cpu(this_rq, this_cpu, p);
 }
 
 /*
@@ -922,9 +1044,9 @@
  * We call this with the current runqueue locked,
  * irqs disabled.
  */
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int this_cpu, int idle)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
+	int imbalance, idx;
 	runqueue_t *busiest;
 	prio_array_t *array;
 	struct list_head *head, *curr;
@@ -972,12 +1094,14 @@
 	 * 1) running (obviously), or
 	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
 	 * 3) are cache-hot on their current CPU.
+	 *
+	 * (except if we are in idle mode which is a more agressive
+	 *  form of rebalancing.)
 	 */
 
-#define CAN_MIGRATE_TASK(p,rq,this_cpu)					\
-	((jiffies - (p)->sleep_timestamp > cache_decay_ticks) &&	\
-		!task_running(rq, p) &&					\
-			((p)->cpus_allowed & (1UL << (this_cpu))))
+#define CAN_MIGRATE_TASK(p,rq,cpu)					\
+	((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
+		!task_running(p) && task_allowed(p, cpu))
 
 	curr = curr->prev;
 
@@ -1000,31 +1124,136 @@
 	;
 }
 
+#if CONFIG_SHARE_RUNQUEUE
+static void active_load_balance(runqueue_t *this_rq, int this_cpu)
+{
+	runqueue_t *rq;
+	int i, idx;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+		rq = cpu_rq(i);
+		if (rq == this_rq)
+			continue;
+		/*
+		 * Any SMT-specific imbalance?
+		 */
+		for_each_sibling(idx, rq)
+			if (rq->cpu[idx].idle == rq->cpu[idx].curr)
+				goto next_cpu;
+
+		/*
+		 * At this point it's sure that we have a SMT
+		 * imbalance: this (physical) CPU is idle but
+		 * another CPU has two (or more) tasks running.
+		 *
+		 * We wake up one of the migration threads (it
+		 * doesnt matter which one) and let it fix things up:
+		 */
+		if (!cpu_active_balance(i)) {
+			cpu_active_balance(i) = 1;
+			spin_unlock(&this_rq->lock);
+			wake_up_process(rq->cpu[0].migration_thread);
+			spin_lock(&this_rq->lock);
+		}
+next_cpu:
+	}
+}
+
+static void do_active_balance(runqueue_t *this_rq, int this_cpu)
+{
+	runqueue_t *rq;
+	int i, idx;
+
+	spin_unlock(&this_rq->lock);
+
+	cpu_active_balance(this_cpu) = 0;
+
+	/*
+	 * Is the imbalance still present?
+	 */
+	for_each_sibling(idx, this_rq)
+		if (this_rq->cpu[idx].idle == this_rq->cpu[idx].curr)
+			goto out;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+		rq = cpu_rq(i);
+		if (rq == this_rq)
+			continue;
+
+		/* completely idle CPU? */
+		if (rq->nr_running)
+			continue;
+
+		/*
+		 * At this point it's reasonably sure that we have an
+		 * imbalance. Since we are the migration thread, try to
+	 	 * balance a thread over to the target queue.
+		 */
+		spin_lock(&rq->lock);
+		load_balance(rq, i, 1);
+		spin_unlock(&rq->lock);
+		goto out;
+	}
+out:
+	spin_lock(&this_rq->lock);
+}
+
 /*
- * One of the idle_cpu_tick() and busy_cpu_tick() functions will
- * get called every timer tick, on every CPU. Our balancing action
- * frequency and balancing agressivity depends on whether the CPU is
- * idle or not.
+ * This routine is called to map a CPU into another CPU's runqueue.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
- * systems with HZ=100, every 10 msecs.)
+ * This must be called during bootup with the merged runqueue having
+ * no tasks.
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
-#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+void sched_map_runqueue(int cpu1, int cpu2)
+{
+	runqueue_t *rq1 = cpu_rq(cpu1);
+	runqueue_t *rq2 = cpu_rq(cpu2);
+	int cpu2_idx_orig = cpu_idx(cpu2), cpu2_idx;
+
+	printk("sched_merge_runqueues: CPU#%d <=> CPU#%d, on CPU#%d.\n", cpu1, cpu2, smp_processor_id());
+	if (rq1 == rq2)
+		BUG();
+	if (rq2->nr_running)
+		BUG();
+	/*
+	 * At this point, we dont have anything in the runqueue yet. So,
+	 * there is no need to move processes between the runqueues.
+	 * Only, the idle processes should be combined and accessed
+	 * properly.
+	 */
+	cpu2_idx = rq1->nr_cpus++;
 
-static inline void idle_tick(runqueue_t *rq)
+	if (rq_idx(cpu1) != cpu1)
+		BUG();
+	rq_idx(cpu2) = cpu1;
+	cpu_idx(cpu2) = cpu2_idx;
+	rq1->cpu[cpu2_idx].cpu = cpu2;
+	rq1->cpu[cpu2_idx].idle = rq2->cpu[cpu2_idx_orig].idle;
+	rq1->cpu[cpu2_idx].curr = rq2->cpu[cpu2_idx_orig].curr;
+	INIT_LIST_HEAD(&rq1->cpu[cpu2_idx].migration_queue);
+
+	/* just to be safe: */
+	rq2->cpu[cpu2_idx_orig].idle = NULL;
+	rq2->cpu[cpu2_idx_orig].curr = NULL;
+}
+#endif
+#endif
+
+DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
+
+static inline void idle_tick(runqueue_t *rq, unsigned long j)
 {
-	if (jiffies % IDLE_REBALANCE_TICK)
+	if (j % IDLE_REBALANCE_TICK)
 		return;
 	spin_lock(&rq->lock);
-	load_balance(rq, 1);
+	load_balance(rq, smp_processor_id(), 1);
 	spin_unlock(&rq->lock);
 }
 
-#endif
-
-DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
-
 /*
  * We place interactive tasks back into the active array, if possible.
  *
@@ -1035,9 +1264,9 @@
  * increasing number of running tasks:
  */
 #define EXPIRED_STARVING(rq) \
-		((rq)->expired_timestamp && \
+		(STARVATION_LIMIT && ((rq)->expired_timestamp && \
 		(jiffies - (rq)->expired_timestamp >= \
-			STARVATION_LIMIT * ((rq)->nr_running) + 1))
+			STARVATION_LIMIT * ((rq)->nr_running) + 1)))
 
 /*
  * This function gets called by the timer code, with HZ frequency.
@@ -1050,12 +1279,13 @@
 {
 	int cpu = smp_processor_id();
 	runqueue_t *rq = this_rq();
+	unsigned long j = jiffies;
 	task_t *p = current;
 
  	if (rcu_pending(cpu))
  		rcu_check_callbacks(cpu, user_ticks);
 
-	if (p == rq->idle) {
+	if (p == cpu_idle_ptr(cpu)) {
 		/* note: this timer irq context must be accounted for as well */
 		if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
 			kstat_cpu(cpu).cpustat.system += sys_ticks;
@@ -1063,9 +1293,7 @@
 			kstat_cpu(cpu).cpustat.iowait += sys_ticks;
 		else
 			kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
-		idle_tick(rq);
-#endif
+		idle_tick(rq, j);
 		return;
 	}
 	if (TASK_NICE(p) > 0)
@@ -1074,12 +1302,13 @@
 		kstat_cpu(cpu).cpustat.user += user_ticks;
 	kstat_cpu(cpu).cpustat.system += sys_ticks;
 
+	spin_lock(&rq->lock);
 	/* Task might have expired already, but not scheduled off yet */
 	if (p->array != rq->active) {
 		set_tsk_need_resched(p);
+		spin_unlock(&rq->lock);
 		return;
 	}
-	spin_lock(&rq->lock);
 	if (unlikely(rt_task(p))) {
 		/*
 		 * RR tasks need a special form of timeslice management.
@@ -1115,16 +1344,14 @@
 
 		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
 			if (!rq->expired_timestamp)
-				rq->expired_timestamp = jiffies;
+				rq->expired_timestamp = j;
 			enqueue_task(p, rq->expired);
 		} else
 			enqueue_task(p, rq->active);
 	}
 out:
-#if CONFIG_SMP
-	if (!(jiffies % BUSY_REBALANCE_TICK))
-		load_balance(rq, 0);
-#endif
+	if (!(j % BUSY_REBALANCE_TICK))
+		load_balance(rq, smp_processor_id(), 0);
 	spin_unlock(&rq->lock);
 }
 
@@ -1135,11 +1362,11 @@
  */
 asmlinkage void schedule(void)
 {
+	int idx, this_cpu, retry = 0;
+	struct list_head *queue;
 	task_t *prev, *next;
-	runqueue_t *rq;
 	prio_array_t *array;
-	struct list_head *queue;
-	int idx;
+	runqueue_t *rq;
 
 	/*
 	 * Test if we are atomic.  Since do_exit() needs to call into
@@ -1152,15 +1379,15 @@
 			dump_stack();
 		}
 	}
-
-	check_highmem_ptes();
 need_resched:
+	check_highmem_ptes();
+	this_cpu = smp_processor_id();
 	preempt_disable();
 	prev = current;
 	rq = this_rq();
 
 	release_kernel_lock(prev);
-	prev->sleep_timestamp = jiffies;
+	prev->last_run = jiffies;
 	spin_lock_irq(&rq->lock);
 
 	/*
@@ -1183,12 +1410,14 @@
 	}
 pick_next_task:
 	if (unlikely(!rq->nr_running)) {
-#if CONFIG_SMP
-		load_balance(rq, 1);
+		load_balance(rq, this_cpu, 1);
 		if (rq->nr_running)
 			goto pick_next_task;
-#endif
-		next = rq->idle;
+		active_load_balance(rq, this_cpu);
+		if (rq->nr_running)
+			goto pick_next_task;
+pick_idle:
+		next = cpu_idle_ptr(this_cpu);
 		rq->expired_timestamp = 0;
 		goto switch_tasks;
 	}
@@ -1204,24 +1433,60 @@
 		rq->expired_timestamp = 0;
 	}
 
+new_array:
 	idx = sched_find_first_bit(array->bitmap);
 	queue = array->queue + idx;
 	next = list_entry(queue->next, task_t, run_list);
+	if ((next != prev) && (rq_nr_cpus(rq) > 1)) {
+		struct list_head *tmp = queue->next;
+
+		while ((task_running(next) && (next != prev)) || !task_allowed(next, this_cpu)) {
+			tmp = tmp->next;
+			if (tmp != queue) {
+				next = list_entry(tmp, task_t, run_list);
+				continue;
+			}
+			idx = find_next_bit(array->bitmap, MAX_PRIO, ++idx);
+			if (idx == MAX_PRIO) {
+				if (retry || !rq->expired->nr_active) {
+					goto pick_idle;
+				}
+				/*
+				 * To avoid infinite changing of arrays,
+				 * when we have only tasks runnable by
+				 * sibling.
+				 */
+				retry = 1;
+
+				array = rq->expired;
+				goto new_array;
+			}
+			queue = array->queue + idx;
+			tmp = queue->next;
+			next = list_entry(tmp, task_t, run_list);
+		}
+	}
 
 switch_tasks:
 	prefetch(next);
 	clear_tsk_need_resched(prev);
-	RCU_qsctr(prev->thread_info->cpu)++;
+	RCU_qsctr(task_cpu(prev))++;
 
 	if (likely(prev != next)) {
+		struct mm_struct *prev_mm;
 		rq->nr_switches++;
-		rq->curr = next;
+		cpu_curr_ptr(this_cpu) = next;
+		set_task_cpu(next, this_cpu);
 	
 		prepare_arch_switch(rq, next);
-		prev = context_switch(prev, next);
+		prev = context_switch(rq, prev, next);
 		barrier();
 		rq = this_rq();
+		prev_mm = rq->prev_mm;
+		rq->prev_mm = NULL;
 		finish_arch_switch(rq, prev);
+		if (prev_mm)
+			mmdrop(prev_mm);
 	} else
 		spin_unlock_irq(&rq->lock);
 
@@ -1481,9 +1746,8 @@
 		 * If the task is running and lowered its priority,
 		 * or increased its priority then reschedule its CPU:
 		 */
-		if ((NICE_TO_PRIO(nice) < p->static_prio) ||
-							task_running(rq, p))
-			resched_task(rq->curr);
+		if ((NICE_TO_PRIO(nice) < p->static_prio) || task_running(p))
+			resched_task(cpu_curr_ptr(task_cpu(p)));
 	}
 out_unlock:
 	task_rq_unlock(rq, &flags);
@@ -1561,7 +1825,7 @@
  */
 int task_curr(task_t *p)
 {
-	return cpu_curr(task_cpu(p)) == p;
+	return cpu_curr_ptr(task_cpu(p)) == p;
 }
 
 /**
@@ -1570,7 +1834,7 @@
  */
 int idle_cpu(int cpu)
 {
-	return cpu_curr(cpu) == cpu_rq(cpu)->idle;
+	return cpu_curr_ptr(cpu) == cpu_idle_ptr(cpu);
 }
 
 /**
@@ -1660,7 +1924,7 @@
 	else
 		p->prio = p->static_prio;
 	if (array)
-		activate_task(p, task_rq(p));
+		__activate_task(p, task_rq(p));
 
 out_unlock:
 	task_rq_unlock(rq, &flags);
@@ -2135,7 +2399,7 @@
 	local_irq_save(flags);
 	double_rq_lock(idle_rq, rq);
 
-	idle_rq->curr = idle_rq->idle = idle;
+	cpu_curr_ptr(cpu) = cpu_idle_ptr(cpu) = idle;
 	deactivate_task(idle, rq);
 	idle->array = NULL;
 	idle->prio = MAX_PRIO;
@@ -2190,6 +2454,7 @@
 	unsigned long flags;
 	migration_req_t req;
 	runqueue_t *rq;
+	int cpu;
 
 #if 0 /* FIXME: Grab cpu_lock, return error on this case. --RR */
 	new_mask &= cpu_online_map;
@@ -2211,31 +2476,31 @@
 	 * If the task is not on a runqueue (and not running), then
 	 * it is sufficient to simply update the task's cpu field.
 	 */
-	if (!p->array && !task_running(rq, p)) {
+	if (!p->array && !task_running(p)) {
 		set_task_cpu(p, __ffs(p->cpus_allowed));
 		task_rq_unlock(rq, &flags);
 		return;
 	}
 	init_completion(&req.done);
 	req.task = p;
-	list_add(&req.list, &rq->migration_queue);
+	cpu = task_cpu(p);
+	list_add(&req.list, migration_queue(cpu));
 	task_rq_unlock(rq, &flags);
-
-	wake_up_process(rq->migration_thread);
+	wake_up_process(migration_thread(cpu));
 
 	wait_for_completion(&req.done);
 }
 
 /*
- * migration_thread - this is a highprio system thread that performs
+ * migration_task - this is a highprio system thread that performs
  * thread migration by 'pulling' threads into the target runqueue.
  */
-static int migration_thread(void * data)
+static int migration_task(void * data)
 {
 	struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
 	int cpu = (long) data;
 	runqueue_t *rq;
-	int ret;
+	int ret, idx;
 
 	daemonize();
 	sigfillset(&current->blocked);
@@ -2250,7 +2515,8 @@
 	ret = setscheduler(0, SCHED_FIFO, &param);
 
 	rq = this_rq();
-	rq->migration_thread = current;
+	migration_thread(cpu) = current;
+	idx = cpu_idx(cpu);
 
 	sprintf(current->comm, "migration/%d", smp_processor_id());
 
@@ -2263,7 +2529,9 @@
 		task_t *p;
 
 		spin_lock_irqsave(&rq->lock, flags);
-		head = &rq->migration_queue;
+		if (cpu_active_balance(cpu))
+			do_active_balance(rq, cpu);
+		head = migration_queue(cpu);
 		current->state = TASK_INTERRUPTIBLE;
 		if (list_empty(head)) {
 			spin_unlock_irqrestore(&rq->lock, flags);
@@ -2292,9 +2560,8 @@
 			set_task_cpu(p, cpu_dest);
 			if (p->array) {
 				deactivate_task(p, rq_src);
-				activate_task(p, rq_dest);
-				if (p->prio < rq_dest->curr->prio)
-					resched_task(rq_dest->curr);
+				__activate_task(p, rq_dest);
+				wake_up_cpu(rq_dest, cpu_dest, p);
 			}
 		}
 		double_rq_unlock(rq_src, rq_dest);
@@ -2312,12 +2579,13 @@
 			  unsigned long action,
 			  void *hcpu)
 {
+	long cpu = (long) hcpu;
+
 	switch (action) {
 	case CPU_ONLINE:
-		printk("Starting migration thread for cpu %li\n",
-		       (long)hcpu);
-		kernel_thread(migration_thread, hcpu, CLONE_KERNEL);
-		while (!cpu_rq((long)hcpu)->migration_thread)
+		printk("Starting migration thread for cpu %li\n", cpu);
+		kernel_thread(migration_task, hcpu, CLONE_KERNEL);
+		while (!migration_thread(cpu))
 			yield();
 		break;
 	}
@@ -2392,11 +2660,20 @@
 	for (i = 0; i < NR_CPUS; i++) {
 		prio_array_t *array;
 
+		/*
+		 * Start with a 1:1 mapping between CPUs and runqueues:
+		 */
+#if CONFIG_SHARE_RUNQUEUE
+		rq_idx(i) = i;
+		cpu_idx(i) = 0;
+#endif
 		rq = cpu_rq(i);
 		rq->active = rq->arrays;
 		rq->expired = rq->arrays + 1;
 		spin_lock_init(&rq->lock);
-		INIT_LIST_HEAD(&rq->migration_queue);
+		INIT_LIST_HEAD(migration_queue(i));
+		rq->nr_cpus = 1;
+		rq->cpu[cpu_idx(i)].cpu = i;
 		atomic_set(&rq->nr_iowait, 0);
 		nr_running_init(rq);
 
@@ -2414,9 +2691,13 @@
 	 * We have to do a little magic to get the first
 	 * thread right in SMP mode.
 	 */
-	rq = this_rq();
-	rq->curr = current;
-	rq->idle = current;
+	cpu_curr_ptr(smp_processor_id()) = current;
+	cpu_idle_ptr(smp_processor_id()) = current;
+	printk("sched_init().\n");
+	printk("smp_processor_id(): %d.\n", smp_processor_id());
+	printk("rq_idx(smp_processor_id()): %ld.\n", rq_idx(smp_processor_id()));
+	printk("this_rq(): %p.\n", this_rq());
+
 	set_task_cpu(current, smp_processor_id());
 	wake_up_process(current);
 
--- linux/init/main.c.orig	
+++ linux/init/main.c	
@@ -354,7 +354,14 @@
 
 static void rest_init(void)
 {
+	/* 
+	 *	We count on the initial thread going ok 
+	 *	Like idlers init is an unlocked kernel thread, which will
+	 *	make syscalls (and thus be locked).
+	 */
+	init_idle(current, smp_processor_id());
 	kernel_thread(init, NULL, CLONE_KERNEL);
+
 	unlock_kernel();
  	cpu_idle();
 } 
@@ -438,13 +445,6 @@
 	check_bugs();
 	printk("POSIX conformance testing by UNIFIX\n");
 
-	/* 
-	 *	We count on the initial thread going ok 
-	 *	Like idlers init is an unlocked kernel thread, which will
-	 *	make syscalls (and thus be locked).
-	 */
-	init_idle(current, smp_processor_id());
-
 	/* Do the rest non-__init'ed, we're now alive */
 	rest_init();
 }


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] HT scheduler, sched-2.5.59-E2
  2003-02-03 18:23                                                                                 ` [patch] HT scheduler, sched-2.5.59-E2 Ingo Molnar
@ 2003-02-03 20:47                                                                                   ` Robert Love
  2003-02-04  9:31                                                                                   ` Erich Focht
  1 sibling, 0 replies; 96+ messages in thread
From: Robert Love @ 2003-02-03 20:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Martin J. Bligh, Andrew Theurer, Erich Focht,
	Michael Hohnbaum, Matthew Dobson, Christoph Hellwig,
	Linus Torvalds, lse-tech, Anton Blanchard, Andrea Arcangeli

On Mon, 2003-02-03 at 13:23, Ingo Molnar wrote:

> the attached patch (against 2.5.59 or BK-curr) has two HT-scheduler fixes
> over the -D7 patch:

Hi, Ingo.  I am running this now, with good results. Unfortunately I do
not have an HT-capable processor, so I am only testing the interactivity
improvements.  They are looking very good - a step in the right
direction.  Very nice.

A couple bits:

> -		wake_up_interruptible(PIPE_WAIT(*inode));
> +		wake_up_interruptible_sync(PIPE_WAIT(*inode));
> ...
> -		wake_up_interruptible(PIPE_WAIT(*inode));
> +		wake_up_interruptible_sync(PIPE_WAIT(*inode));
>  ...
> -		wake_up_interruptible(PIPE_WAIT(*inode));
> +		wake_up_interruptible_sync(PIPE_WAIT(*inode));

These are not correct, right?  I believe we want normal behavior here,
no?

> --- linux/kernel/sys.c.orig	
> +++ linux/kernel/sys.c	
> @@ -220,7 +220,7 @@
>  
>  	if (error == -ESRCH)
>  		error = 0;
> -	if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
> +	if (0 && niceval < task_nice(p) && !capable(CAP_SYS_NICE))

What is the point of this? Left in for debugging?

> -#define MAX_SLEEP_AVG		(2*HZ)
> -#define STARVATION_LIMIT	(2*HZ)
> +#define MAX_SLEEP_AVG		(10*HZ)
> +#define STARVATION_LIMIT	(30*HZ)

Do you really want a 30-second starvation limit?  Ouch.
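
(If I am reading EXPIRED_STARVING() right, that limit also scales with
nr_running, so interactive tasks can keep being reinserted into the
active array -- and the expired tasks kept waiting -- for up to roughly

	STARVATION_LIMIT * nr_running  =  30s * 10  =  300 seconds

on a queue with ten runnable tasks.  That seems like a very long time
for a task to sit in the expired array.)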

> +	printk("rq_idx(smp_processor_id()): %ld.\n", rq_idx(smp_processor_id()));

This gives a compiler warning on UP:

        kernel/sched.c: In function `sched_init':
        kernel/sched.c:2722: warning: long int format, int arg (arg 2)

That is because rq_idx(), on UP, merely returns its parameter, which is
an int.
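
A minimal way to silence it (just a sketch from me, not something in
your patch) would be an explicit cast, so the format is right no matter
what rq_idx() expands to:

	/* force a long so %ld matches on UP as well, where rq_idx(cpu)
	 * is simply the int argument */
	printk("rq_idx(smp_processor_id()): %ld.\n",
		(long) rq_idx(smp_processor_id()));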

Otherwise, looking nice :)

	Robert Love


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [patch] HT scheduler, sched-2.5.59-E2
  2003-02-03 18:23                                                                                 ` [patch] HT scheduler, sched-2.5.59-E2 Ingo Molnar
  2003-02-03 20:47                                                                                   ` Robert Love
@ 2003-02-04  9:31                                                                                   ` Erich Focht
  1 sibling, 0 replies; 96+ messages in thread
From: Erich Focht @ 2003-02-04  9:31 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Martin J. Bligh, Andrew Theurer, Michael Hohnbaum,
	Matthew Dobson, Christoph Hellwig, Robert Love, Linus Torvalds,
	lse-tech, Anton Blanchard, Andrea Arcangeli

Hi Ingo,

On Monday 03 February 2003 19:23, Ingo Molnar wrote:
> -#define CAN_MIGRATE_TASK(p,rq,this_cpu)					\
> -	((jiffies - (p)->sleep_timestamp > cache_decay_ticks) &&	\
> -		!task_running(rq, p) &&					\
> -			((p)->cpus_allowed & (1UL << (this_cpu))))
> +#define CAN_MIGRATE_TASK(p,rq,cpu)					\
> +	((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
> +		!task_running(p) && task_allowed(p, cpu))

At least for NUMA systems this is too aggressive (though I believe
normal SMP systems could be hurt, too).

The problem: freshly forked tasks are stolen by idle CPUs on the same
node before they exec. This effectively disables the
sched_balance_exec() mechanism, as the tasks to be balanced already
run alone on other CPUs. Which means the whole benefit of having
balanced nodes (maximizing memory bandwidth) is gone.

The change below is less aggressive but in the same philosophy. Could
you please take it instead?

> +#define CAN_MIGRATE_TASK(p,rq,cpu)					\
> +	((jiffies - (p)->last_run > (cache_decay_ticks >> idle)) && \
> +		!task_running(p) && task_allowed(p, cpu))
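
To spell out the difference (illustration only, with made-up helper
names; the task_running() and task_allowed() checks are identical in
both versions and omitted here):

	/* sched-2.5.59-E2: an idle CPU ignores cache-hotness completely */
	static inline int can_migrate_e2(task_t *p, int idle)
	{
		return idle || (jiffies - p->last_run > cache_decay_ticks);
	}

	/* proposed: an idle CPU merely halves the cache-decay window */
	static inline int can_migrate_proposed(task_t *p, int idle)
	{
		return jiffies - p->last_run > (cache_decay_ticks >> idle);
	}

So a freshly forked task (p->last_run == jiffies) cannot be grabbed
immediately by an idle CPU with the second variant, which leaves
sched_balance_exec() a chance to do the node-level placement at exec
time.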

Regards,
Erich


^ permalink raw reply	[flat|nested] 96+ messages in thread

end of thread, other threads:[~2003-02-04  9:21 UTC | newest]

Thread overview: 96+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-01-09 23:54 Minature NUMA scheduler Martin J. Bligh
2003-01-10  5:36 ` [Lse-tech] " Michael Hohnbaum
2003-01-10 16:34   ` Erich Focht
2003-01-10 16:57     ` Martin J. Bligh
2003-01-12 23:35       ` Erich Focht
2003-01-12 23:55       ` NUMA scheduler 2nd approach Erich Focht
2003-01-13  8:02         ` Christoph Hellwig
2003-01-13 11:32           ` Erich Focht
2003-01-13 15:26             ` [Lse-tech] " Christoph Hellwig
2003-01-13 15:46               ` Erich Focht
2003-01-13 19:03             ` Michael Hohnbaum
2003-01-14  1:23         ` Michael Hohnbaum
2003-01-14  4:45           ` [Lse-tech] " Andrew Theurer
2003-01-14  4:56             ` Martin J. Bligh
2003-01-14 11:14               ` Erich Focht
2003-01-14 15:55                 ` [PATCH 2.5.58] new NUMA scheduler Erich Focht
2003-01-14 16:07                   ` [Lse-tech] " Christoph Hellwig
2003-01-14 16:23                   ` [PATCH 2.5.58] new NUMA scheduler: fix Erich Focht
2003-01-14 16:43                     ` Erich Focht
2003-01-14 19:02                       ` Michael Hohnbaum
2003-01-14 21:56                         ` [Lse-tech] " Michael Hohnbaum
2003-01-15 15:10                         ` Erich Focht
2003-01-16  0:14                           ` Michael Hohnbaum
2003-01-16  6:05                           ` Martin J. Bligh
2003-01-16 16:47                             ` Erich Focht
2003-01-16 18:07                               ` Robert Love
2003-01-16 18:48                                 ` Martin J. Bligh
2003-01-16 19:07                                 ` Ingo Molnar
2003-01-16 18:59                                   ` Martin J. Bligh
2003-01-16 19:10                                   ` Christoph Hellwig
2003-01-16 19:44                                     ` Ingo Molnar
2003-01-16 19:43                                       ` Martin J. Bligh
2003-01-16 20:19                                         ` Ingo Molnar
2003-01-16 20:29                                           ` [Lse-tech] " Rick Lindsley
2003-01-16 23:31                                           ` Martin J. Bligh
2003-01-17  7:23                                             ` Ingo Molnar
2003-01-17  8:47                                             ` [patch] sched-2.5.59-A2 Ingo Molnar
2003-01-17 14:35                                               ` Erich Focht
2003-01-17 15:11                                                 ` Ingo Molnar
2003-01-17 15:30                                                   ` Erich Focht
2003-01-17 16:58                                                   ` Martin J. Bligh
2003-01-18 20:54                                                     ` NUMA sched -> pooling scheduler (inc HT) Martin J. Bligh
2003-01-18 21:34                                                       ` [Lse-tech] " Martin J. Bligh
2003-01-19  0:13                                                         ` Andrew Theurer
2003-01-17 18:19                                                   ` [patch] sched-2.5.59-A2 Michael Hohnbaum
2003-01-18  7:08                                                   ` William Lee Irwin III
2003-01-18  8:12                                                     ` Martin J. Bligh
2003-01-18  8:16                                                       ` William Lee Irwin III
2003-01-19  4:22                                                     ` William Lee Irwin III
2003-01-17 17:21                                                 ` Martin J. Bligh
2003-01-17 17:23                                                 ` Martin J. Bligh
2003-01-17 18:11                                                 ` Erich Focht
2003-01-17 19:04                                                   ` Martin J. Bligh
2003-01-17 19:26                                                     ` [Lse-tech] " Martin J. Bligh
2003-01-18  0:13                                                       ` Michael Hohnbaum
2003-01-18 13:31                                                         ` [patch] tunable rebalance rates for sched-2.5.59-B0 Erich Focht
2003-01-18 23:09                                                         ` [patch] sched-2.5.59-A2 Erich Focht
2003-01-20  9:28                                                           ` Ingo Molnar
2003-01-20 12:07                                                             ` Erich Focht
2003-01-20 16:56                                                               ` Ingo Molnar
2003-01-20 17:04                                                                 ` Ingo Molnar
2003-01-20 17:10                                                                   ` Martin J. Bligh
2003-01-20 17:24                                                                     ` Ingo Molnar
2003-01-20 19:13                                                                       ` Andrew Theurer
2003-01-20 19:33                                                                         ` Martin J. Bligh
2003-01-20 19:52                                                                           ` Andrew Theurer
2003-01-20 19:52                                                                             ` Martin J. Bligh
2003-01-20 21:18                                                                               ` [patch] HT scheduler, sched-2.5.59-D7 Ingo Molnar
2003-01-20 22:28                                                                                 ` Andrew Morton
2003-01-21  1:11                                                                                   ` Michael Hohnbaum
2003-01-22  3:15                                                                                 ` Michael Hohnbaum
2003-01-22 16:41                                                                                   ` Andrew Theurer
2003-01-22 16:17                                                                                     ` Martin J. Bligh
2003-01-22 16:20                                                                                       ` Andrew Theurer
2003-01-22 16:35                                                                                     ` Michael Hohnbaum
2003-02-03 18:23                                                                                 ` [patch] HT scheduler, sched-2.5.59-E2 Ingo Molnar
2003-02-03 20:47                                                                                   ` Robert Love
2003-02-04  9:31                                                                                   ` Erich Focht
2003-01-20 17:04                                                                 ` [patch] sched-2.5.59-A2 Martin J. Bligh
2003-01-21 17:44                                                                 ` Erich Focht
2003-01-20 16:23                                                             ` Martin J. Bligh
2003-01-20 16:59                                                               ` Ingo Molnar
2003-01-17 23:09                                                     ` Matthew Dobson
2003-01-16 23:45                                           ` [PATCH 2.5.58] new NUMA scheduler: fix Michael Hohnbaum
2003-01-17 11:10                                           ` Erich Focht
2003-01-17 14:07                                             ` Ingo Molnar
2003-01-16 19:44                                       ` John Bradford
2003-01-14 16:51                     ` Christoph Hellwig
2003-01-15  0:05                     ` Michael Hohnbaum
2003-01-15  7:47                     ` Martin J. Bligh
2003-01-14  5:50             ` [Lse-tech] Re: NUMA scheduler 2nd approach Michael Hohnbaum
2003-01-14 16:52               ` Andrew Theurer
2003-01-14 15:13                 ` Erich Focht
2003-01-14 10:56           ` Erich Focht
2003-01-11 14:43     ` [Lse-tech] Minature NUMA scheduler Bill Davidsen
2003-01-12 23:24       ` Erich Focht

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).