linux-kernel.vger.kernel.org archive mirror
* [RFC] NUMA schedulers benchmark results
@ 2002-10-06 16:51 Erich Focht
  2002-10-06 20:24 ` Erich Focht
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Erich Focht @ 2002-10-06 16:51 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3752 bytes --]

Hi,

here are some results from testing various versions and approaches
to a NUMA scheduler. I used the numa_test benchmark which I'll post
in a separate email. It runs N tasks in parallel, all doing the same
job: randomly accessing a large array. As the array is large enough
not to fit into cache, the test is very sensitive to memory latency,
and to memory bandwidth as well. To emulate a real multi-user
environment, the jobs are disturbed by a short load peak, simulated
by a call to "hackbench" 3 seconds after the tasks are started. The
performance of the user tasks depends very much on where they are
scheduled; as the tasks are CPU hogs, their user times are largely
independent of the non-scheduler parts of the underlying kernel. The
elapsed times depend on "hackbench", which effectively blocks the
machine while it runs. Hackbench itself depends on the underlying
kernel, so one should compare "elapsed_time - hackbench_time".
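
To give an idea of the kind of load each task generates, here is a
rough sketch of such a worker in C (illustrative only; the array size
and access count are made-up placeholders, not the values numa_test
actually uses):

/*
 * Sketch of one numa_test-like worker: random accesses to an array
 * that is too large for the caches, so performance is dominated by
 * memory latency (and bandwidth). Sizes are placeholders.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ARRAY_BYTES (256UL << 20)	/* larger than any cache */
#define ACCESSES    (64UL << 20)

int main(void)
{
	size_t n = ARRAY_BYTES / sizeof(long);
	long *a = malloc(ARRAY_BYTES);
	unsigned long i, sum = 0;

	if (!a)
		return 1;
	for (i = 0; i < n; i++)		/* first touch allocates the pages */
		a[i] = i;
	srandom(getpid());
	for (i = 0; i < ACCESSES; i++)	/* latency-bound random accesses */
		sum += a[random() % n];
	printf("sum %lu\n", sum);
	free(a);
	return 0;
}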

The test machine is a 16 CPU NEC Azusa with Itanium processors and
four nodes. The tested schedulers are:

A: O(1) scheduler in 2.5.39
B: O(1) scheduler with task steal limited to only one task (node
   affine scheduler with CONFIG_NUMA_SCHED=n) under 2.4.18
C: Michael Hohnbaum's "simple NUMA scheduler" under 2.5.39
D: pooling NUMA scheduler, no initial load balancing, idle pool_delay
   set to 0, under 2.4.18
E: node affine scheduler with initial load balancing and static homenode
F: node affine scheduler without initial load balancing and dynamic
   homenode selection (the homenode is the node where most of the
   task's memory is allocated; see the sketch after this list).
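
To make the dynamic homenode idea of (F) concrete, here is a
hypothetical sketch (task_node_pages() and the per-node loop are made
up for illustration; they are not the patch's actual interfaces):

/*
 * Hypothetical sketch of dynamic homenode selection as in (F): the
 * homenode becomes the node holding most of the task's resident
 * pages. task_node_pages() is a made-up helper that would return
 * the task's per-node RSS.
 */
static int select_homenode(struct task_struct *p)
{
	int nid, best_node = 0;
	unsigned long pages, most = 0;

	for (nid = 0; nid < numnodes; nid++) {
		pages = task_node_pages(p, nid);
		if (pages > most) {
			most = pages;
			best_node = nid;
		}
	}
	return best_node;
}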

As I'm rewriting the node affine scheduler to be more modular, I'll
redo the tests for cases D, E, F on top of 2.5.X kernels soon.

The results are summarized in the tables below. A set of outputs (for
N=8, 16, 32) is attached. They show clearly why the node affine
scheduler beats them all: the initial load balancing is node-aware,
and so is the task stealing. Sometimes the node affine force is not
strong enough to bring a task back to its homenode, but it is almost
always good enough.
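
As an illustration of what node-aware stealing could look like, a
hypothetical sketch (node_of_cpu(), runqueue_of() and the homenode
field are made up here; this is not the actual patch code):

/*
 * Hypothetical node-aware steal decision, roughly in the spirit
 * described above: prefer stealing tasks whose homenode is the
 * stealing CPU's node, so migration pulls tasks back towards their
 * memory. All helper and field names are placeholders.
 */
static struct task_struct *pick_task_to_steal(int this_cpu, int busiest_cpu)
{
	int this_node = node_of_cpu(this_cpu);
	struct task_struct *p, *fallback = NULL;

	list_for_each_entry(p, &runqueue_of(busiest_cpu)->queue, run_list) {
		if (p->homenode == this_node)
			return p;	/* brings the task back to its memory */
		if (!fallback)
			fallback = p;	/* steal something else only if needed */
	}
	return fallback;
}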

The (F) solution with a dynamically determined homenode shows that
the initial load balancing is crucial, as an equal balance across the
nodes is not strongly enforced dynamically. So the optimal solution
is probably (F) combined with initial load balancing.


Average user time (U) and total user time (TU), in seconds. The
minimum in each row should be considered (lower is better).
----------------------------------------------------------------------
Scheduler:	A	B	C	D	E	F
N=4	U	28.12	30.77	33.00	-	27.20	30.29
	TU	112.55	123.13	132.08	-	108.88	121.25
N=8	U	30.47	31.39	31.65	30.76	28.67	30.08
	TU	243.86	251.27	253.30	246.23	229.51	240.75
N=16	U	36.42	33.64	32.18	32.27	31.50	32.83
	TU	582.91	538.49	515.11	516.53	504.17	525.59
N=32	U	38.69	34.83	34.05	33.76	33.89	34.11
	TU	1238.4	1114.9	1090.1	1080.8	1084.9	1091.9
N=64	U	39.73	34.73	34.23	-	(33.32)	34.98
	TU	2543.4	2223.4	2191.7	-	(2133)	2239.5


Elapsed time (E) and hackbench time (H), in seconds. Differences
between the 2.4.18- and 2.5.39-based kernels are due to "hackbench",
which performs differently on the two. Compare E-H within a row, but
don't take it too seriously.
--------------------------------------------------------------------
Scheduler:	A	B	C	D	E	F
N=4	E	37.33	37.96	48.31	-	28.14	35.91
	H	9.98	1.49	10.65	-	1.99	1.43
N=8	E	46.17	39.50	42.53	39.72	30.28	38.28
	H	9.64	1.86	7.27	2.07	2.33	1.86
N=16	E	47.21	44.67	49.66	42.97	36.98	42.51
	H	5.90	4.69	2.93	5.178	5.56	5.94
N=32	E	88.60	79.92	80.34	78.35	76.84	77.38
	H	6.29	5.23	2.85	4.51	5.29	4.28
N=64	E	167.10	147.16	150.59	-	(133.9)	148.94
	H	5.96	4.67	3.10	-	(-)	6.86

(The E:N=64 results are without hackbench disturbance.)

Best regards,
Erich

[-- Attachment #2: NUMA_SCHED_8_16_32.results --]
[-- Type: text/plain, Size: 24048 bytes --]

A: 2.5.39 O(1) scheduler

Running 'hackbench 20' in the background: Time: 9.639
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     7.5   92.5    0.0    0.0 |    0     1   *|  38.29
  2     0.0    0.0    0.0  100.0 |    0     3   *|  30.19
  3     0.0  100.0    0.0    0.0 |    0     1   *|  27.84
  4     0.0    0.0    0.0  100.0 |    0     3   *|  30.38
  5   100.0    0.0    0.0    0.0 |    0     0    |  29.96
  6   100.0    0.0    0.0    0.0 |    0     0    |  29.63
  7     0.0    0.0  100.0    0.0 |    0     2   *|  27.33
  8     0.0    0.0    0.0  100.0 |    0     3   *|  30.14
AverageUserTime 30.47 seconds
ElapsedTime     46.17
TotalUserTime   243.86
TotalSysTime    0.41

Running 'hackbench 20' in the background: Time: 5.896
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0   98.7    1.3    0.0 |    0     1   *|  41.07
  2    89.2   10.8    0.0    0.0 |    0     0    |  39.16
  3     9.4    0.0   90.6    0.0 |    0     2   *|  40.72
  4     0.0    0.0    0.0  100.0 |    0     3   *|  31.96
  5     0.0    0.1   99.0    0.9 |    0     2   *|  41.36
  6     0.0    0.0  100.0    0.0 |    0     2   *|  30.62
  7     0.6    0.0   86.2   13.2 |    0     2   *|  40.11
  8   100.0    0.0    0.0    0.0 |    0     0    |  31.25
  9     8.0    0.0    0.0   92.0 |    0     3   *|  40.63
 10     0.0  100.0    0.0    0.0 |    0     1   *|  30.43
 11     0.0    0.0    0.0  100.0 |    0     3   *|  31.64
 12    91.6    0.0    8.4    0.0 |    0     0    |  39.92
 13     0.0  100.0    0.0    0.0 |    0     1   *|  30.46
 14    92.1    0.0    7.9    0.0 |    0     0    |  40.36
 15     0.0    0.0    0.0  100.0 |    0     3   *|  32.29
 16     4.8   95.2    0.0    0.0 |    0     1   *|  40.70
AverageUserTime 36.42 seconds
ElapsedTime     47.21
TotalUserTime   582.91
TotalSysTime    0.85

Running 'hackbench 20' in the background: Time: 6.293
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0    0.0  100.0    0.0 |    0     2   *|  31.53
  2   100.0    0.0    0.0    0.0 |    0     0    |  34.84
  3   100.0    0.0    0.0    0.0 |    0     0    |  32.70
  4     0.0  100.0    0.0    0.0 |    0     1   *|  32.25
  5     0.0    0.0    0.0  100.0 |    0     3   *|  33.41
  6    96.5    2.7    0.8    0.0 |    0     0    |  43.39
  7     5.6    0.0   94.4    0.0 |    0     2   *|  44.42
  8    73.4   22.9    3.7    0.0 |    0     0    |  42.14
  9     0.9    0.0   99.1    0.0 |    0     2   *|  45.10
 10     0.0    8.4   91.6    0.0 |    0     2   *|  42.85
 11     0.0    0.0    0.0  100.0 |    0     3   *|  32.11
 12     0.0  100.0    0.0    0.0 |    0     1   *|  32.16
 13     0.0    0.0    0.0  100.0 |    0     3   *|  33.48
 14     0.0    0.0  100.0    0.0 |    0     2   *|  31.35
 15     0.0    0.3    3.2   96.4 |    0     3   *|  42.27
 16    91.0    0.0    0.0    9.0 |    0     0    |  35.50
 17     0.0  100.0    0.0    0.0 |    0     1   *|  32.45
 18     0.0   97.6    2.4    0.0 |    0     1   *|  41.97
 19     0.0  100.0    0.0    0.0 |    0     1   *|  32.46
 20     1.3    0.0   74.2   24.5 |    0     2   *|  43.78
 21     0.0    0.0    0.9   99.1 |    0     3   *|  32.74
 22     0.8    0.0   99.2    0.0 |    0     2   *|  45.61
 23   100.0    0.0    0.0    0.0 |    0     0    |  33.78
 24    97.0    1.0    0.0    1.9 |    0     0    |  43.74
 25     0.0    1.8    0.0   98.2 |    0     3   *|  43.31
 26     0.0   96.1    0.0    3.9 |    0     1   *|  43.61
 27     2.6    0.0    0.3   97.0 |    0     3   *|  46.54
 28     0.0    0.0    0.0  100.0 |    0     3   *|  33.59
 29     0.0   92.1    0.0    7.9 |    0     1   *|  43.43
 30     2.2    0.0   97.8    0.0 |    0     2   *|  45.19
 31    26.0   72.3    0.0    1.8 |    0     1   *|  43.29
 32    98.5    1.3    0.2    0.0 |    0     0    |  42.97
AverageUserTime 38.69 seconds
ElapsedTime     88.60
TotalUserTime   1238.43
TotalSysTime    1.85

-------------------------------------------------------------
B: O(1) scheduler equivalent, 1 task steal per load balance step
(2.4.18 node affine scheduler with CONFIG_NUMA_SCHED=n)

Running 'hackbench 20' in the background: Time: 1.861
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0    0.0  100.0    0.0 |    2     2    |  31.62
  2     0.0    0.0  100.0    0.0 |    2     2    |  31.68
  3   100.0    0.0    0.0    0.0 |    0     0    |  29.40
  4     0.0    0.0  100.0    0.0 |    2     2    |  31.56
  5     0.0    0.0  100.0    0.0 |    2     2    |  31.60
  6    95.0    5.0    0.0    0.0 |    1     0   *|  38.14
  7   100.0    0.0    0.0    0.0 |    0     0    |  29.27
  8     0.0  100.0    0.0    0.0 |    1     1    |  27.87
AverageUserTime 31.39 seconds
ElapsedTime     39.50
TotalUserTime   251.27
TotalSysTime    0.48

Running 'hackbench 20' in the background: Time: 4.688
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0    0.0  100.0    0.0 |    2     2    |  31.49
  2     0.0    0.0  100.0    0.0 |    2     2    |  31.33
  3     0.0    0.0  100.0    0.0 |    2     2    |  31.30
  4   100.0    0.0    0.0    0.0 |    0     0    |  31.98
  5   100.0    0.0    0.0    0.0 |    0     0    |  32.21
  6     0.0   91.3    8.7    0.0 |    2     1   *|  39.88
  7     0.0  100.0    0.0    0.0 |    1     1    |  30.72
  8     0.0    0.0    0.0  100.0 |    3     3    |  31.56
  9     1.2   98.8    0.0    0.0 |    0     1   *|  41.31
 10     0.0    0.0    0.0  100.0 |    3     3    |  31.56
 11     0.0  100.0    0.0    0.0 |    1     1    |  30.67
 12     0.0    0.0    0.0  100.0 |    3     3    |  31.45
 13   100.0    0.0    0.0    0.0 |    0     0    |  31.94
 14     5.5    0.0   94.5    0.0 |    0     2   *|  40.83
 15     0.0    0.0    0.0  100.0 |    3     3    |  31.53
 16    83.0   17.0    0.0    0.0 |    1     0   *|  38.50
AverageUserTime 33.64 seconds
ElapsedTime     44.67
TotalUserTime   538.49
TotalSysTime    1.02

Running 'hackbench 20' in the background: Time: 5.229
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1    67.6    0.0   17.4   15.0 |    2     0   *|  41.36
  2     0.0    0.0  100.0    0.0 |    2     2    |  32.78
  3   100.0    0.0    0.0    0.0 |    0     0    |  32.92
  4     0.0    0.0  100.0    0.0 |    2     2    |  33.56
  5    22.4   75.3    2.3    0.0 |    2     1   *|  43.26
  6     0.0    0.0  100.0    0.0 |    2     2    |  33.38
  7   100.0    0.0    0.0    0.0 |    0     0    |  33.16
  8     0.0    0.0    0.0  100.0 |    3     3    |  32.03
  9     0.0  100.0    0.0    0.0 |    1     1    |  32.64
 10     0.0    0.0    0.0  100.0 |    3     3    |  32.00
 11     4.5    0.0    0.0   95.5 |    0     3   *|  43.82
 12     0.0    0.0    0.0  100.0 |    3     3    |  31.91
 13   100.0    0.0    0.0    0.0 |    0     0    |  33.46
 14     0.0  100.0    0.0    0.0 |    1     1    |  32.39
 15     0.0    0.0    0.0  100.0 |    3     3    |  31.86
 16     0.0    0.0  100.0    0.0 |    2     2    |  33.35
 17   100.0    0.0    0.0    0.0 |    0     0    |  33.95
 18     0.0    0.0    0.0  100.0 |    3     3    |  31.68
 19     0.0    5.3    0.0   94.7 |    1     3   *|  42.97
 20     2.0   98.0    0.0    0.0 |    0     1   *|  43.56
 21     0.0    0.0    0.0  100.0 |    3     3    |  31.65
 22   100.0    0.0    0.0    0.0 |    0     0    |  34.13
 23     0.0  100.0    0.0    0.0 |    1     1    |  32.52
 24     9.9   90.1    0.0    0.0 |    1     1    |  33.58
 25     0.0    3.2   96.8    0.0 |    1     2   *|  42.51
 26    89.8    0.0    0.0   10.2 |    0     0    |  34.18
 27     0.0  100.0    0.0    0.0 |    1     1    |  32.44
 28     0.0    2.4   97.6    0.0 |    2     2    |  33.59
 29     8.4    0.0   91.6    0.0 |    2     2    |  33.78
 30   100.0    0.0    0.0    0.0 |    0     0    |  33.52
 31     0.0    6.3   93.7    0.0 |    2     2    |  33.37
 32     0.0  100.0    0.0    0.0 |    1     1    |  33.20
AverageUserTime 34.83 seconds
ElapsedTime     79.92
TotalUserTime   1114.91
TotalSysTime    2.30

-----------------------------------------------------------
C: simple scheduler MH

Running 'hackbench 20' in the background: Time: 7.271
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1   100.0    0.0    0.0    0.0 |    0     0    |  32.66
  2   100.0    0.0    0.0    0.0 |    0     0    |  32.53
  3   100.0    0.0    0.0    0.0 |    0     0    |  32.47
  4     0.0  100.0    0.0    0.0 |    1     1    |  30.33
  5     0.0  100.0    0.0    0.0 |    1     1    |  30.20
  6     0.0  100.0    0.0    0.0 |    1     1    |  30.06
  7   100.0    0.0    0.0    0.0 |    0     0    |  32.35
  8   100.0    0.0    0.0    0.0 |    0     0    |  32.58
AverageUserTime 31.65 seconds
ElapsedTime     42.53
TotalUserTime   253.30
TotalSysTime    0.48

Running 'hackbench 20' in the background: Time: 2.929
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1   100.0    0.0    0.0    0.0 |    0     0    |  31.07
  2   100.0    0.0    0.0    0.0 |    0     0    |  30.97
  3   100.0    0.0    0.0    0.0 |    0     0    |  30.86
  4     0.0  100.0    0.0    0.0 |    1     1    |  31.91
  5     0.0  100.0    0.0    0.0 |    1     1    |  32.04
  6     0.0  100.0    0.0    0.0 |    1     1    |  31.98
  7     0.0  100.0    0.0    0.0 |    1     1    |  32.01
  8     0.0    0.0  100.0    0.0 |    2     2    |  31.83
  9     0.0    0.0  100.0    0.0 |    2     2    |  31.66
 10     0.0    0.0  100.0    0.0 |    2     2    |  31.47
 11     0.0    0.0  100.0    0.0 |    2     2    |  31.74
 12     0.0    0.0    0.0  100.0 |    3     3    |  31.54
 13     0.0    0.0    0.0  100.0 |    3     3    |  31.48
 14     0.0    0.0    0.0  100.0 |    3     3    |  31.49
 15     0.0    0.0    0.0  100.0 |    3     3    |  31.97
 16     5.4    0.0    0.0   94.6 |    0     3   *|  40.86
AverageUserTime 32.18 seconds
ElapsedTime     49.66
TotalUserTime   515.11
TotalSysTime    0.95

Running 'hackbench 20' in the background: Time: 2.849
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1   100.0    0.0    0.0    0.0 |    0     0    |  33.75
  2   100.0    0.0    0.0    0.0 |    0     0    |  33.89
  3   100.0    0.0    0.0    0.0 |    0     0    |  33.87
  4     0.0  100.0    0.0    0.0 |    1     1    |  33.34
  5     0.0  100.0    0.0    0.0 |    1     1    |  33.41
  6     0.0  100.0    0.0    0.0 |    1     1    |  33.16
  7     0.0  100.0    0.0    0.0 |    1     1    |  33.25
  8     0.0    0.0  100.0    0.0 |    2     2    |  32.90
  9     0.0    0.0  100.0    0.0 |    2     2    |  33.08
 10     0.0    0.0  100.0    0.0 |    2     2    |  33.07
 11     0.0    0.0  100.0    0.0 |    2     2    |  33.11
 12     0.0    0.0    0.0  100.0 |    3     3    |  31.46
 13     0.0    0.0    0.0  100.0 |    3     3    |  32.04
 14     0.0    0.0    0.0  100.0 |    3     3    |  31.59
 15     0.0    0.0    0.0  100.0 |    3     3    |  31.83
 16   100.0    0.0    0.0    0.0 |    0     0    |  33.95
 17     1.2    0.0    0.0   98.8 |    0     3   *|  44.66
 18   100.0    0.0    0.0    0.0 |    0     0    |  34.02
 19    55.4    0.0    0.0   44.6 |    3     0    |  37.11
 20   100.0    0.0    0.0    0.0 |    0     0    |  33.62
 21     0.0  100.0    0.0    0.0 |    1     1    |  33.33
 22     0.0  100.0    0.0    0.0 |    1     1    |  33.26
 23     0.0  100.0    0.0    0.0 |    1     1    |  33.41
 24     0.0  100.0    0.0    0.0 |    1     1    |  33.14
 25     0.0    0.0  100.0    0.0 |    2     2    |  33.26
 26     0.0    0.0  100.0    0.0 |    2     2    |  33.05
 27     0.0    0.0    3.1   96.9 |    2     3   *|  43.50
 28   100.0    0.0    0.0    0.0 |    0     0    |  34.02
 29    60.4    0.0   39.6    0.0 |    2     0   *|  37.05
 30     0.0    0.0    0.0  100.0 |    3     3    |  31.53
 31   100.0    0.0    0.0    0.0 |    0     0    |  33.94
 32     0.0    0.0  100.0    0.0 |    2     2    |  33.03
AverageUserTime 34.05 seconds
ElapsedTime     80.34
TotalUserTime   1090.10
TotalSysTime    2.15

---------------------------------------------------------------
D: numasched : pooling only (idle pool delay set to 0)

Running 'hackbench 20' in the background: Time: 2.066
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0    0.0    0.0  100.0 |    3     3    |  28.50
  2   100.0    0.0    0.0    0.0 |    0     0    |  30.97
  3     0.0    0.0    0.0  100.0 |    3     3    |  28.49
  4     0.0    0.0  100.0    0.0 |    2     2    |  28.65
  5   100.0    0.0    0.0    0.0 |    0     0    |  31.18
  6   100.0    0.0    0.0    0.0 |    0     0    |  30.96
  7     5.7   94.3    0.0    0.0 |    0     1   *|  38.69
  8     0.0    0.0  100.0    0.0 |    2     2    |  28.65
AverageUserTime 30.76 seconds
ElapsedTime     39.72
TotalUserTime   246.23
TotalSysTime    0.46

Running 'hackbench 20' in the background: Time: 5.178
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1   100.0    0.0    0.0    0.0 |    0     0    |  32.01
  2   100.0    0.0    0.0    0.0 |    0     0    |  31.82
  3     0.0    3.4    0.0   96.6 |    1     3   *|  41.01
  4   100.0    0.0    0.0    0.0 |    0     0    |  32.24
  5   100.0    0.0    0.0    0.0 |    0     0    |  31.91
  6     0.0    0.0  100.0    0.0 |    2     2    |  31.75
  7     0.0    0.0    0.0  100.0 |    3     3    |  30.55
  8     0.0    0.0  100.0    0.0 |    2     2    |  31.74
  9     0.0    0.0  100.0    0.0 |    2     2    |  31.74
 10     0.0    0.0    0.0  100.0 |    3     3    |  30.63
 11     0.0    0.0    0.0  100.0 |    3     3    |  30.67
 12     0.0    0.0  100.0    0.0 |    2     2    |  31.75
 13     0.0  100.0    0.0    0.0 |    1     1    |  31.94
 14     0.0  100.0    0.0    0.0 |    1     1    |  32.20
 15     0.0  100.0    0.0    0.0 |    1     1    |  32.24
 16     0.0  100.0    0.0    0.0 |    1     1    |  32.07
AverageUserTime 32.27 seconds
ElapsedTime     42.97
TotalUserTime   516.53
TotalSysTime    1.16

Running 'hackbench 20' in the background: Time: 4.505
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0   49.4    0.0   50.6 |    3     3    |  36.48
  2     0.3   99.7    0.0    0.0 |    0     1   *|  39.21
  3     0.0  100.0    0.0    0.0 |    1     1    |  32.23
  4   100.0    0.0    0.0    0.0 |    0     0    |  32.88
  5   100.0    0.0    0.0    0.0 |    0     0    |  32.98
  6   100.0    0.0    0.0    0.0 |    0     0    |  32.97
  7     0.0   18.3   81.7    0.0 |    2     2    |  35.13
  8     0.0    5.7   94.3    0.0 |    2     2    |  34.00
  9     0.0    0.0  100.0    0.0 |    2     2    |  33.14
 10     0.0    3.3   96.7    0.0 |    2     2    |  34.22
 11     0.0  100.0    0.0    0.0 |    1     1    |  32.34
 12     0.0  100.0    0.0    0.0 |    1     1    |  32.37
 13     0.0   13.4    0.0   86.6 |    3     3    |  34.55
 14     0.0    0.0    0.0  100.0 |    3     3    |  33.93
 15     0.0    0.0    0.0  100.0 |    3     3    |  33.08
 16   100.0    0.0    0.0    0.0 |    0     0    |  33.31
 17   100.0    0.0    0.0    0.0 |    0     0    |  33.08
 18     0.0    8.2    0.0   91.8 |    3     3    |  33.76
 19     0.0    0.0    0.0  100.0 |    3     3    |  33.23
 20     0.0   32.7    0.0   67.3 |    3     3    |  34.79
 21     0.0    0.0    0.0  100.0 |    3     3    |  32.94
 22   100.0    0.0    0.0    0.0 |    0     0    |  33.23
 23     0.0  100.0    0.0    0.0 |    1     1    |  32.35
 24     0.0    0.0  100.0    0.0 |    2     2    |  33.44
 25     0.0  100.0    0.0    0.0 |    1     1    |  32.42
 26     0.0    0.0  100.0    0.0 |    2     2    |  33.25
 27     0.0    0.0  100.0    0.0 |    2     2    |  33.82
 28     0.0    0.0  100.0    0.0 |    2     2    |  32.88
 29     0.0    0.0    0.0  100.0 |    3     3    |  31.99
 30     0.0  100.0    0.0    0.0 |    1     1    |  31.90
 31   100.0    0.0    0.0    0.0 |    0     0    |  32.93
 32    99.6    0.0    0.4    0.0 |    2     0   *|  41.48
AverageUserTime 33.76 seconds
ElapsedTime     78.35
TotalUserTime   1080.75
TotalSysTime    2.03

-----------------------------------------------------------

E: node affine scheduler with initial load balancing, statically
assigned homenode

Running 'hackbench 20' in the background: Time: 2.330
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0    0.0  100.0    0.0 |    2     2    |  28.86
  2     0.0  100.0    0.0    0.0 |    1     1    |  28.82
  3     0.0  100.0    0.0    0.0 |    1     1    |  28.85
  4     0.0    0.0    0.0  100.0 |    3     3    |  28.47
  5   100.0    0.0    0.0    0.0 |    0     0    |  28.69
  6     0.0    0.0  100.0    0.0 |    2     2    |  28.65
  7     0.0    0.0    0.0  100.0 |    3     3    |  28.49
  8   100.0    0.0    0.0    0.0 |    0     0    |  28.55
AverageUserTime 28.67 seconds
ElapsedTime     30.28
TotalUserTime   229.51
TotalSysTime    0.43


Running 'hackbench 20' in the background: Time: 5.562
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0  100.0    0.0    0.0 |    1     1    |  31.47
  2     0.0  100.0    0.0    0.0 |    1     1    |  31.38
  3     0.0    0.0    0.0  100.0 |    3     3    |  31.53
  4     0.0    0.0    0.0  100.0 |    3     3    |  31.46
  5   100.0    0.0    0.0    0.0 |    0     0    |  31.64
  6   100.0    0.0    0.0    0.0 |    0     0    |  31.57
  7   100.0    0.0    0.0    0.0 |    0     0    |  31.57
  8     0.0    0.0    0.0  100.0 |    3     3    |  31.51
  9     0.0    0.0  100.0    0.0 |    2     2    |  31.63
 10     0.0    0.0  100.0    0.0 |    2     2    |  31.56
 11     0.0    0.0  100.0    0.0 |    2     2    |  31.42
 12     0.0  100.0    0.0    0.0 |    1     1    |  31.44
 13     0.0    0.0    0.0  100.0 |    3     3    |  31.43
 14     0.0    0.0  100.0    0.0 |    2     2    |  31.61
 15   100.0    0.0    0.0    0.0 |    0     0    |  31.34
 16     0.0  100.0    0.0    0.0 |    1     1    |  31.36
AverageUserTime 31.50 seconds
ElapsedTime     36.98
TotalUserTime   504.17
TotalSysTime    1.03

Running 'hackbench 20' in the background: Time: 5.288
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0    0.0   46.1   53.9 |    2     3   *|  37.11
  2     0.0    0.0    0.0  100.0 |    3     3    |  31.32
  3     0.0  100.0    0.0    0.0 |    1     1    |  33.15
  4     0.0    4.9   95.1    0.0 |    2     2    |  33.55
  5     0.0    0.0  100.0    0.0 |    2     2    |  32.81
  6     0.0  100.0    0.0    0.0 |    1     1    |  32.88
  7     0.0  100.0    0.0    0.0 |    1     1    |  33.31
  8    82.9   17.1    0.0    0.0 |    1     0   *|  40.70
  9   100.0    0.0    0.0    0.0 |    0     0    |  33.36
 10     9.3    0.0    0.0   90.7 |    0     3   *|  42.12
 11   100.0    0.0    0.0    0.0 |    0     0    |  33.38
 12     0.0    0.0    0.0  100.0 |    3     3    |  31.28
 13     0.0    0.0    0.0  100.0 |    3     3    |  31.43
 14     0.0  100.0    0.0    0.0 |    1     1    |  33.33
 15     0.0  100.0    0.0    0.0 |    1     1    |  33.04
 16   100.0    0.0    0.0    0.0 |    0     0    |  33.02
 17    95.2    4.8    0.0    0.0 |    0     0    |  33.82
 18     3.8    0.0    0.0   96.2 |    0     3   *|  43.21
 19     0.0    0.0    0.0  100.0 |    3     3    |  31.11
 20     0.0    0.0  100.0    0.0 |    2     2    |  32.19
 21     0.0  100.0    0.0    0.0 |    1     1    |  33.43
 22     0.0    0.0  100.0    0.0 |    2     2    |  33.03
 23     0.0    0.0  100.0    0.0 |    2     2    |  33.06
 24   100.0    0.0    0.0    0.0 |    0     0    |  33.04
 25   100.0    0.0    0.0    0.0 |    0     0    |  33.07
 26     0.0    0.0  100.0    0.0 |    2     2    |  33.07
 27     0.0    0.0  100.0    0.0 |    2     2    |  33.22
 28     0.0    0.0    0.0  100.0 |    3     3    |  31.06
 29     0.0   15.3   84.7    0.0 |    2     2    |  33.25
 30     0.0  100.0    0.0    0.0 |    1     1    |  33.36
 31    85.0    0.0    0.0   15.0 |    0     0    |  34.50
 32     0.0   86.4    0.0   13.6 |    1     1    |  34.27
AverageUserTime 33.89 seconds
ElapsedTime     76.84
TotalUserTime   1084.86
TotalSysTime    2.12

--------------------------------------------------------

F: node affine scheduler, dynamically assigned homenode, no
initial load balancing

Running 'hackbench 20' in the background: Time: 1.861
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0    0.0    0.0  100.0 |    3     3    |  30.75
  2   100.0    0.0    0.0    0.0 |    0     0    |  27.12
  3     0.0    0.0    0.0  100.0 |    3     3    |  30.60
  4     0.0    0.0    0.0  100.0 |    3     3    |  30.65
  5     0.0  100.0    0.0    0.0 |    1     1    |  28.65
  6     0.0    0.0   78.4   21.6 |    3     2   *|  36.63
  7     0.0  100.0    0.0    0.0 |    1     1    |  28.57
  8     0.0    0.0  100.0    0.0 |    2     2    |  27.65
AverageUserTime 30.08 seconds
ElapsedTime     38.28
TotalUserTime   240.75
TotalSysTime    0.49

Running 'hackbench 20' in the background: Time: 5.936
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0    0.0    0.0  100.0 |    3     3    |  31.47
  2   100.0    0.0    0.0    0.0 |    0     0    |  29.90
  3    88.2    0.0    4.6    7.2 |    2     0   *|  40.96
  4     0.0    0.0    0.0  100.0 |    3     3    |  31.53
  5     0.0    0.0    0.0  100.0 |    3     3    |  31.50
  6     0.0  100.0    0.0    0.0 |    1     1    |  32.25
  7     0.0   84.5    0.0   15.5 |    1     1    |  33.41
  8     0.0    0.0  100.0    0.0 |    2     2    |  32.38
  9    83.0   17.0    0.0    0.0 |    1     0   *|  39.29
 10     0.0    0.0    0.0  100.0 |    3     3    |  31.51
 11     0.0  100.0    0.0    0.0 |    1     1    |  32.38
 12     0.0    0.0  100.0    0.0 |    2     2    |  32.30
 13     0.0  100.0    0.0    0.0 |    1     1    |  32.06
 14   100.0    0.0    0.0    0.0 |    0     0    |  29.85
 15     0.0    0.0  100.0    0.0 |    2     2    |  32.18
 16     0.0    0.0  100.0    0.0 |    2     2    |  32.39
AverageUserTime 32.83 seconds
ElapsedTime     42.51
TotalUserTime   525.59
TotalSysTime    1.14

Running 'hackbench 20' in the background: Time: 4.277
Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
  1     0.0    0.0    0.0  100.0 |    3     3    |  31.69
  2     0.0    0.0  100.0    0.0 |    2     2    |  33.81
  3   100.0    0.0    0.0    0.0 |    0     0    |  33.10
  4     1.4    4.9    2.8   90.9 |    2     3   *|  44.26
  5     0.0    0.0    0.0  100.0 |    3     3    |  31.76
  6     0.0    0.0    0.0  100.0 |    3     3    |  31.69
  7     0.0    0.0    3.6   96.4 |    2     3   *|  43.73
  8     0.0  100.0    0.0    0.0 |    1     1    |  33.07
  9     0.0    0.0  100.0    0.0 |    2     2    |  34.04
 10     0.0  100.0    0.0    0.0 |    1     1    |  33.24
 11     0.0  100.0    0.0    0.0 |    1     1    |  33.23
 12   100.0    0.0    0.0    0.0 |    0     0    |  33.77
 13     0.0  100.0    0.0    0.0 |    1     1    |  33.10
 14   100.0    0.0    0.0    0.0 |    0     0    |  33.26
 15     0.0  100.0    0.0    0.0 |    1     1    |  33.16
 16   100.0    0.0    0.0    0.0 |    0     0    |  33.23
 17     3.9    0.0    0.0   96.1 |    0     3   *|  42.74
 18     0.0  100.0    0.0    0.0 |    1     1    |  32.97
 19   100.0    0.0    0.0    0.0 |    0     0    |  33.21
 20     0.0    0.0   92.1    7.9 |    2     2    |  34.30
 21     0.0    0.0   95.9    4.1 |    2     2    |  34.31
 22     0.0    0.0  100.0    0.0 |    2     2    |  33.88
 23     0.0    0.0    0.0  100.0 |    3     3    |  30.85
 24     0.0  100.0    0.0    0.0 |    1     1    |  32.99
 25     0.0    0.0  100.0    0.0 |    2     2    |  33.84
 26     0.0  100.0    0.0    0.0 |    1     1    |  33.35
 27   100.0    0.0    0.0    0.0 |    0     0    |  33.29
 28     0.0    0.0  100.0    0.0 |    2     2    |  33.79
 29   100.0    0.0    0.0    0.0 |    0     0    |  33.25
 30   100.0    0.0    0.0    0.0 |    0     0    |  33.41
 31     0.0    0.0  100.0    0.0 |    2     2    |  33.92
 32     0.0    0.0    0.0  100.0 |    3     3    |  31.16
AverageUserTime 34.11 seconds
ElapsedTime     77.38
TotalUserTime   1091.86
TotalSysTime    2.14


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] NUMA schedulers benchmark results
  2002-10-06 16:51 [RFC] NUMA schedulers benchmark results Erich Focht
@ 2002-10-06 20:24 ` Erich Focht
  2002-10-07  0:00   ` Martin J. Bligh
  2002-10-07  0:58 ` Martin J. Bligh
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Erich Focht @ 2002-10-06 20:24 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Anton Blanchard, Ingo Molnar, linux-kernel, LSE

[-- Attachment #1: Type: text/plain, Size: 4283 bytes --]

Hi,

here comes the benchmark I used for the NUMA scheduler test. Would
be interesting to know whether it is useful to any other NUMA
developer...

Regards,
Erich

PS: it uses a 'time' command that understands the --format option,
e.g. GNU time 1.7. Change it in the main script if it doesn't
work for you.


On Sunday 06 October 2002 18:51, Erich Focht wrote:
> Hi,
>
> here are some results from testing various versions and approaches
> to a NUMA scheduler. I used the numa_test benchmark which I'll post
> in a separate email. It runs in parallel N tasks doing the same job:
> access randomly a large array. As the array is large enough not to
> fit into cache, this is very memory latency sensitive. Also it is
> memory bandwidth sensitive. To emulate a real multi-user environment, the
> jobs are disturbed by a short load peak. This is simulated by a call
> to "hackbench" 3 seconds after the tasks were started. The performance
> of the user tasks is depending very much on where they are scheduled
> and are CPU hoggers such that the user times are quite independent of
> the non-scheduler part of the underlying kernel. The elapsed times
> are depending on "hackbench" which actually blocks the machine for the
> time it is running. Hackbench is depending on the underlying kernel
> and one should compare "elapsed_time - hackbench_time".
>
> The test machine is a 16 CPU NEC Azusa with Itanium processors and
> four nodes. The tested schedulers are:
>
> A: O(1) scheduler in 2.5.39
> B: O(1) scheduler with task steal limited to only one task (node
>    affine scheduler with CONFIG_NUMA_SCHED=n) under 2.4.18
> C: Michael Hohnbaum's "simple NUMA scheduler" under 2.5.39
> D: pooling NUMA scheduler, no initial load balancing, idle pool_delay
>    set to 0, under 2.4.18
> E: node affine scheduler with initial load balancing and static homenode
> F: node affine scheduler without initial load balancing and dynamic
>    homenode selection (homenode selected where most of the memory is
>    allocated).
>
> As I'm rewriting the node affine scheduler to be more modular, I'll
> redo the tests for cases D, E, F on top of 2.5.X kernels soon.
>
> The results are summarized in the tables below. A set of outputs (for
> N=8, 16, 32) is attached. They show clearly why the node affine
> scheduler beats them all: The initial load balancing is node-aware,
> the task-stealing too. Sometimes the node affine force is not large
> enough to bring the task back to the homenode, but it is almost always
> good enough.
>
> The (F) solution with dynamically determined homenode show that the
> initial load balancing is crucial, as the equal node balance is not
> strongly enforced dynamically. So the optimal solution is probably
> (F) with initial load balancing.
>
>
> Average user time (U) and total user time (TU). Minimum per row should
> be considered.
> ----------------------------------------------------------------------
> Scheduler:	A	B	C	D	E	F
> N=4	U	28.12	30.77	33.00	-	27.20	30.29
> 	TU	112.55	123.13	132.08	-	108.88	121.25
> N=8	U	30.47	31.39	31.65	30.76	28.67	30.08
> 	TU	243.86	251.27	253.30	246.23	229.51	240.75
> N=16	U	36.42	33.64	32.18	32.27	31.50	32.83
> 	TU	582.91	538.49	515.11	516.53	504.17	525.59
> N=32	U	38.69	34.83	34.05	33.76	33.89	34.11
> 	TU	1238.4	1114.9	1090.1	1080.8	1084.9	1091.9
> N=64	U	39.73	34.73	34.23	-	(33.32)	34.98
> 	TU	2543.4	2223.4	2191.7	-	(2133)	2239.5
>
>
> Elapsed time (E), hackbench time (H). Diferences between 2.4.18 and
> 2.5.39 based kernels due to "hackbench", which performs differently.
> Compare E-H within a row, but don't take it too seriously.
> --------------------------------------------------------------------
> Scheduler:	A	B	C	D	E	F
> N=4	E	37.33	37.96	48.31	-	28.14	35.91
> 	H	9.98	1.49	10.65	-	1.99	1.43
> N=8	E	46.17	39.50	42.53	39.72	30.28	38.28
> 	H	9.64	1.86	7.27	2.07	2.33	1.86
> N=16	E	47.21	44.67	49.66	42.97	36.98	42.51
> 	H	5.90	4.69	2.93	5.178	5.56	5.94
> N=32	E	88.60	79.92	80.34	78.35	76.84	77.38
> 	H	6.29	5.23	2.85	4.51	5.29	4.28
> N=64	E	167.10	147.16	150.59	-	(133.9)	148.94
> 	H	5.96	4.67	3.10	-	(-)	6.86
>
> (The E:N=64 results are without hackbench disturbance.)
>
> Best regards,
> Erich

[-- Attachment #2: numa_test --]
[-- Type: application/x-shellscript, Size: 8322 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] NUMA schedulers benchmark results
  2002-10-06 20:24 ` Erich Focht
@ 2002-10-07  0:00   ` Martin J. Bligh
  0 siblings, 0 replies; 24+ messages in thread
From: Martin J. Bligh @ 2002-10-07  0:00 UTC (permalink / raw)
  To: Erich Focht, Michael Hohnbaum
  Cc: Anton Blanchard, Ingo Molnar, linux-kernel, LSE

Errm, your magic shellscript is too weird (aka doesn't work).
Can you just send the files out as attachments?

M.

--On Sunday, October 06, 2002 10:24 PM +0200 Erich Focht <efocht@ess.nec.de> wrote:

> Hi,
> 
> here comes the benchmark I used for the NUMA scheduler test. Would
> be interesting to know whether it is useful to any other NUMA
> developer...
> 
> Regards,
> Erich
> 
> PS: it uses a 'time' command that understands the --format option,
> e.g. GNU time 1.7. Change it in the main script, if it doesn't
> work for you.
> 
> 
> On Sunday 06 October 2002 18:51, Erich Focht wrote:
>> Hi,
>> 
>> here are some results from testing various versions and approaches
>> to a NUMA scheduler. I used the numa_test benchmark which I'll post
>> in a separate email. It runs in parallel N tasks doing the same job:
>> access randomly a large array. As the array is large enough not to
>> fit into cache, this is very memory latency sensitive. Also it is
>> memory bandwidth sensitive. To emulate a real multi-user environment, the
>> jobs are disturbed by a short load peak. This is simulated by a call
>> to "hackbench" 3 seconds after the tasks were started. The performance
>> of the user tasks is depending very much on where they are scheduled
>> and are CPU hoggers such that the user times are quite independent of
>> the non-scheduler part of the underlying kernel. The elapsed times
>> are depending on "hackbench" which actually blocks the machine for the
>> time it is running. Hackbench is depending on the underlying kernel
>> and one should compare "elapsed_time - hackbench_time".
>> 
>> The test machine is a 16 CPU NEC Azusa with Itanium processors and
>> four nodes. The tested schedulers are:
>> 
>> A: O(1) scheduler in 2.5.39
>> B: O(1) scheduler with task steal limited to only one task (node
>>    affine scheduler with CONFIG_NUMA_SCHED=n) under 2.4.18
>> C: Michael Hohnbaum's "simple NUMA scheduler" under 2.5.39
>> D: pooling NUMA scheduler, no initial load balancing, idle pool_delay
>>    set to 0, under 2.4.18
>> E: node affine scheduler with initial load balancing and static homenode
>> F: node affine scheduler without initial load balancing and dynamic
>>    homenode selection (homenode selected where most of the memory is
>>    allocated).
>> 
>> As I'm rewriting the node affine scheduler to be more modular, I'll
>> redo the tests for cases D, E, F on top of 2.5.X kernels soon.
>> 
>> The results are summarized in the tables below. A set of outputs (for
>> N=8, 16, 32) is attached. They show clearly why the node affine
>> scheduler beats them all: The initial load balancing is node-aware,
>> the task-stealing too. Sometimes the node affine force is not large
>> enough to bring the task back to the homenode, but it is almost always
>> good enough.
>> 
>> The (F) solution with dynamically determined homenode show that the
>> initial load balancing is crucial, as the equal node balance is not
>> strongly enforced dynamically. So the optimal solution is probably
>> (F) with initial load balancing.
>> 
>> 
>> Average user time (U) and total user time (TU). Minimum per row should
>> be considered.
>> ----------------------------------------------------------------------
>> Scheduler:	A	B	C	D	E	F
>> N=4	U	28.12	30.77	33.00	-	27.20	30.29
>> 	TU	112.55	123.13	132.08	-	108.88	121.25
>> N=8	U	30.47	31.39	31.65	30.76	28.67	30.08
>> 	TU	243.86	251.27	253.30	246.23	229.51	240.75
>> N=16	U	36.42	33.64	32.18	32.27	31.50	32.83
>> 	TU	582.91	538.49	515.11	516.53	504.17	525.59
>> N=32	U	38.69	34.83	34.05	33.76	33.89	34.11
>> 	TU	1238.4	1114.9	1090.1	1080.8	1084.9	1091.9
>> N=64	U	39.73	34.73	34.23	-	(33.32)	34.98
>> 	TU	2543.4	2223.4	2191.7	-	(2133)	2239.5
>> 
>> 
>> Elapsed time (E), hackbench time (H). Diferences between 2.4.18 and
>> 2.5.39 based kernels due to "hackbench", which performs differently.
>> Compare E-H within a row, but don't take it too seriously.
>> --------------------------------------------------------------------
>> Scheduler:	A	B	C	D	E	F
>> N=4	E	37.33	37.96	48.31	-	28.14	35.91
>> 	H	9.98	1.49	10.65	-	1.99	1.43
>> N=8	E	46.17	39.50	42.53	39.72	30.28	38.28
>> 	H	9.64	1.86	7.27	2.07	2.33	1.86
>> N=16	E	47.21	44.67	49.66	42.97	36.98	42.51
>> 	H	5.90	4.69	2.93	5.178	5.56	5.94
>> N=32	E	88.60	79.92	80.34	78.35	76.84	77.38
>> 	H	6.29	5.23	2.85	4.51	5.29	4.28
>> N=64	E	167.10	147.16	150.59	-	(133.9)	148.94
>> 	H	5.96	4.67	3.10	-	(-)	6.86
>> 
>> (The E:N=64 results are without hackbench disturbance.)
>> 
>> Best regards,
>> Erich



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] NUMA schedulers benchmark results
  2002-10-06 16:51 [RFC] NUMA schedulers benchmark results Erich Focht
  2002-10-06 20:24 ` Erich Focht
@ 2002-10-07  0:58 ` Martin J. Bligh
  2002-10-07 16:52   ` Erich Focht
  2002-10-07  7:25 ` Martin J. Bligh
  2002-10-07 16:37 ` [RFC] NUMA schedulers benchmark results Michael Hohnbaum
  3 siblings, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2002-10-07  0:58 UTC (permalink / raw)
  To: Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

> As I'm rewriting the node affine scheduler to be more modular, I'll
> redo the tests for cases D, E, F on top of 2.5.X kernels soon.

It's really hard to see anything comparing 2.4 vs 2.5 results,
so when you have results for everything across the same kernel,
it'll be a lot easier to decipher ...

For the record, I really don't care whose code gets accepted, I'm 
just interested in having a scheduler that works well for me ;-) 

However, I think it would be a good idea to run some other benchmarks 
on these, that are a little more general ... dbench, kernel compile, 
sdet, whatever, as *well* as your own benchmark ... I think we're 
working on that.

> Average user time (U) and total user time (TU). Minimum per row should
> be considered.
> ----------------------------------------------------------------------
> Scheduler:      A       B       C       D       E       F
> N=4     U       28.12   30.77   33.00   -       27.20   30.29
> N=8     U       30.47   31.39   31.65   30.76   28.67   30.08
> N=16    U       36.42   33.64   32.18   32.27   31.50   32.83
> N=32    U       38.69   34.83   34.05   33.76   33.89   34.11
> N=64    U       39.73   34.73   34.23   -       (33.32) 34.98

So lower numbers are better, right? So Michael's stuff seems to
work fine at the higher end, but poorly at the lower end - I think
this may be due to a bug he found on Friday; if that gets fixed, it
might make a difference.

> The results are summarized in the tables below. A set of outputs 
> (for N=8, 16, 32) is attached. They show clearly why the node affine
> scheduler beats them all

It would be a lot clearer if your baselines weren't different.
Now maybe I'm just totally misreading your results (tell us if so)
but I presume it's more valid to compare:

1. difference between A and C (2.5 virgin vs 2.5 + Michael's stuff)
vs
2. difference between E and B (2.4 virgin vs 2.4 + your stuff)

rather than across 2.4 vs 2.5, which seems meaningless.

N=32 (E) .... C vs A 80.34 vs 88.60 (about 10% improvement)
              E vs B 76.84 vs 79.92 (about 5% improvement)
N=32 (H) .... C vs A 2.85 vs 6.29 (over 50% improvement)
              E vs B 5.29 vs 5.23 (about 1% improvement)

Which actually makes C better than E on both measures, surely? 
Obviously Michael's stuff is worse at the low end, but to say 
"They show clearly why the node affine scheduler beats them all"
seems a little optimistic ;-) 

I'm not trying to say you're wrong, or one scheduler is better than
the other ... I'm just saying I can't really draw any clear-cut 
conclusion from your results other than that Michael's stuff looks 
borked for low numbers of tasks ...

I'll run your tests on the NUMA-Q comparing 2.5.40 vs Michael's stuff. 
If you send me a cleaned up version (ie without the ia64 stuff in
generic code) of your stuff against 2.5.40 (or ideally 2.5.40-mm2), 
I'll run that too ... (if you're putting the latencies into arch code,
you can set i386 (well NUMA-Q) for 20:1 if you think that'll help).

M.

PS. the weird "this one was run without hackbench" thing makes the
results even harder to read ...

PPS. Does Kimio's discontigmem patch work for you on 2.5?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] NUMA schedulers benchmark results
  2002-10-06 16:51 [RFC] NUMA schedulers benchmark results Erich Focht
  2002-10-06 20:24 ` Erich Focht
  2002-10-07  0:58 ` Martin J. Bligh
@ 2002-10-07  7:25 ` Martin J. Bligh
  2002-10-07  7:40   ` Ingo Molnar
  2002-10-07 20:09   ` [PATCH] pooling NUMA scheduler with initial load balancing Erich Focht
  2002-10-07 16:37 ` [RFC] NUMA schedulers benchmark results Michael Hohnbaum
  3 siblings, 2 replies; 24+ messages in thread
From: Martin J. Bligh @ 2002-10-07  7:25 UTC (permalink / raw)
  To: Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

Just a data dump for you, I'm sleepy ... will look at it in the
morning. 16-way NUMA-Q running your benchmark:

2.5.40-mm2

time.4:AverageUserTime 40.69 seconds
time.4:ElapsedTime     52.14
time.4:TotalUserTime   162.85
time.4:TotalSysTime    0.71
time.8:AverageUserTime 36.90 seconds
time.8:ElapsedTime     62.12
time.8:TotalUserTime   295.29
time.8:TotalSysTime    1.79
time.16:AverageUserTime 63.91 seconds
time.16:ElapsedTime     88.82
time.16:TotalUserTime   1022.71
time.16:TotalSysTime    4.85
time.32:AverageUserTime 102.11 seconds
time.32:ElapsedTime     224.24
time.32:TotalUserTime   3267.84
time.32:TotalSysTime    11.73
time.64:AverageUserTime 148.33 seconds
time.64:ElapsedTime     611.71
time.64:TotalUserTime   9494.10
time.64:TotalSysTime    25.81

2.5.40-mm2 + michael's stuff (latest on tux1)

time.4:AverageUserTime 31.55 seconds
time.4:ElapsedTime     41.58
time.4:TotalUserTime   126.23
time.4:TotalSysTime    1.11
time.8:AverageUserTime 51.80 seconds
time.8:ElapsedTime     80.38
time.8:TotalUserTime   414.50
time.8:TotalSysTime    2.67
time.16:AverageUserTime 50.75 seconds
time.16:ElapsedTime     72.28
time.16:TotalUserTime   812.08
time.16:TotalSysTime    4.60
time.32:AverageUserTime 55.78 seconds
time.32:ElapsedTime     126.45
time.32:TotalUserTime   1785.21
time.32:TotalSysTime    8.79
time.64:AverageUserTime 57.49 seconds
time.64:ElapsedTime     243.27
time.64:TotalUserTime   3679.94
time.64:TotalSysTime    15.09

Looks pretty sweet to me, apart from 8, which is just weird.

Profile looks horrible (what the hell is fib_node_get_info?
net/ipv4/fib_semantics.c):

without:

241652 total                                      0.2455
120117 default_idle                            
 19728 fib_node_get_info                       
 17400 swapin_readahead                        
 12805 kmalloc                                 
 11047 kfree                                   
  7918 sock_alloc_send_pskb                    
  5956 alloc_skb                               
  4690 __wake_up                               
  4194 kmem_cache_free                         

with:

213131 total                                      0.2165
138082 default_idle                            
 14620 fib_node_get_info                       
  9631 swapin_readahead                        
  5316 kmalloc                                 
  5279 kfree                                   
  5251 sock_alloc_send_pskb                    
  4154 __wake_up                               
  4116 schedule                                
  3002 alloc_skb                 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] NUMA schedulers benchmark results
  2002-10-07  7:25 ` Martin J. Bligh
@ 2002-10-07  7:40   ` Ingo Molnar
  2002-10-07 20:09   ` [PATCH] pooling NUMA scheduler with initial load balancing Erich Focht
  1 sibling, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2002-10-07  7:40 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Erich Focht, Michael Hohnbaum, linux-kernel


On Mon, 7 Oct 2002, Martin J. Bligh wrote:

> Profile looks horrible (what the hell is fib_node_get_info?
> net/ipv4/fib_semantics.c):

most likely the wrong System.map?

	Ingo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] NUMA schedulers benchmark results
  2002-10-06 16:51 [RFC] NUMA schedulers benchmark results Erich Focht
                   ` (2 preceding siblings ...)
  2002-10-07  7:25 ` Martin J. Bligh
@ 2002-10-07 16:37 ` Michael Hohnbaum
  2002-10-07 20:35   ` Erich Focht
  3 siblings, 1 reply; 24+ messages in thread
From: Michael Hohnbaum @ 2002-10-07 16:37 UTC (permalink / raw)
  To: Erich Focht; +Cc: Martin J. Bligh, Ingo Molnar, linux-kernel

Erich,

These numbers look pretty good for your scheduler.  It would be good
if you could post numbers against 2.5.x where x is the same for all
tests.  I thought you had your scheduler working on 2.5.x recently.

Running numa_test on my systems I find a lot of variation (as much
as 10%) between identical runs.  Are all of these results averaged
over multiple runs?  Any idea why the hackbench times vary so widely?

Looking at the results for my scheduler, it is real obvious what the
algorithm is that I use for selecting a cpu for a process.  This has
not been quite as apparent on my systems, probably due to other tasks
running in the background.  The code in your scheduler that causes a
different node to start the search for a cpu really pays off here.  

I know that there are some balance issues, especially under light
loads, for my scheduler.  I've got two tweaks that help with that,
but I think that adding something similar to what you have done
(keeping track of the last node picked and starting the search at
the next node) could be beneficial.
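
Purely as an illustration of that idea (not code from either patch;
numnodes stands for the number of nodes and locking is omitted):

/*
 * Hypothetical sketch: remember where the last task was placed and
 * start the next search one node further on, so a burst of forks or
 * execs gets spread across the nodes instead of piling onto node 0.
 */
static int last_start_node;

static int next_start_node(void)
{
	last_start_node = (last_start_node + 1) % numnodes;
	return last_start_node;
}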

                  Michael

On Sun, 2002-10-06 at 09:51, Erich Focht wrote:
> Hi,
> 
> here are some results from testing various versions and approaches
> to a NUMA scheduler. I used the numa_test benchmark which I'll post
> in a separate email. It runs in parallel N tasks doing the same job:
> access randomly a large array. As the array is large enough not to
> fit into cache, this is very memory latency sensitive. Also it is
> memory bandwidth sensitive. To emulate a real multi-user environment, the
> jobs are disturbed by a short load peak. This is simulated by a call
> to "hackbench" 3 seconds after the tasks were started. The performance
> of the user tasks is depending very much on where they are scheduled
> and are CPU hoggers such that the user times are quite independent of
> the non-scheduler part of the underlying kernel. The elapsed times
> are depending on "hackbench" which actually blocks the machine for the
> time it is running. Hackbench is depending on the underlying kernel
> and one should compare "elapsed_time - hackbench_time".
> 
> The test machine is a 16 CPU NEC Azusa with Itanium processors and
> four nodes. The tested schedulers are:
> 
> A: O(1) scheduler in 2.5.39
> B: O(1) scheduler with task steal limited to only one task (node
>    affine scheduler with CONFIG_NUMA_SCHED=n) under 2.4.18
> C: Michael Hohnbaum's "simple NUMA scheduler" under 2.5.39
> D: pooling NUMA scheduler, no initial load balancing, idle pool_delay
>    set to 0, under 2.4.18
> E: node affine scheduler with initial load balancing and static homenode
> F: node affine scheduler without initial load balancing and dynamic
>    homenode selection (homenode selected where most of the memory is
>    allocated).
> 
> As I'm rewriting the node affine scheduler to be more modular, I'll
> redo the tests for cases D, E, F on top of 2.5.X kernels soon.
> 
> The results are summarized in the tables below. A set of outputs (for
> N=8, 16, 32) is attached. They show clearly why the node affine
> scheduler beats them all: The initial load balancing is node-aware,
> the task-stealing too. Sometimes the node affine force is not large
> enough to bring the task back to the homenode, but it is almost always
> good enough.
> 
> The (F) solution with dynamically determined homenode show that the
> initial load balancing is crucial, as the equal node balance is not
> strongly enforced dynamically. So the optimal solution is probably
> (F) with initial load balancing.
> 
> 
> Average user time (U) and total user time (TU). Minimum per row should
> be considered.
> ----------------------------------------------------------------------
> Scheduler:	A	B	C	D	E	F
> N=4	U	28.12	30.77	33.00	-	27.20	30.29
> 	TU	112.55	123.13	132.08	-	108.88	121.25
> N=8	U	30.47	31.39	31.65	30.76	28.67	30.08
> 	TU	243.86	251.27	253.30	246.23	229.51	240.75
> N=16	U	36.42	33.64	32.18	32.27	31.50	32.83
> 	TU	582.91	538.49	515.11	516.53	504.17	525.59
> N=32	U	38.69	34.83	34.05	33.76	33.89	34.11
> 	TU	1238.4	1114.9	1090.1	1080.8	1084.9	1091.9
> N=64	U	39.73	34.73	34.23	-	(33.32)	34.98
> 	TU	2543.4	2223.4	2191.7	-	(2133)	2239.5
> 
> 
> Elapsed time (E), hackbench time (H). Diferences between 2.4.18 and
> 2.5.39 based kernels due to "hackbench", which performs differently.
> Compare E-H within a row, but don't take it too seriously.
> --------------------------------------------------------------------
> Scheduler:	A	B	C	D	E	F
> N=4	E	37.33	37.96	48.31	-	28.14	35.91
> 	H	9.98	1.49	10.65	-	1.99	1.43
> N=8	E	46.17	39.50	42.53	39.72	30.28	38.28
> 	H	9.64	1.86	7.27	2.07	2.33	1.86
> N=16	E	47.21	44.67	49.66	42.97	36.98	42.51
> 	H	5.90	4.69	2.93	5.178	5.56	5.94
> N=32	E	88.60	79.92	80.34	78.35	76.84	77.38
> 	H	6.29	5.23	2.85	4.51	5.29	4.28
> N=64	E	167.10	147.16	150.59	-	(133.9)	148.94
> 	H	5.96	4.67	3.10	-	(-)	6.86
> 
> (The E:N=64 results are without hackbench disturbance.)
> 
> Best regards,
> Erich
> ----
> 

> A: 2.5.39 O(1) scheduler
> 
> Running 'hackbench 20' in the background: Time: 9.639
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     7.5   92.5    0.0    0.0 |    0     1   *|  38.29
>   2     0.0    0.0    0.0  100.0 |    0     3   *|  30.19
>   3     0.0  100.0    0.0    0.0 |    0     1   *|  27.84
>   4     0.0    0.0    0.0  100.0 |    0     3   *|  30.38
>   5   100.0    0.0    0.0    0.0 |    0     0    |  29.96
>   6   100.0    0.0    0.0    0.0 |    0     0    |  29.63
>   7     0.0    0.0  100.0    0.0 |    0     2   *|  27.33
>   8     0.0    0.0    0.0  100.0 |    0     3   *|  30.14
> AverageUserTime 30.47 seconds
> ElapsedTime     46.17
> TotalUserTime   243.86
> TotalSysTime    0.41
> 
> Running 'hackbench 20' in the background: Time: 5.896
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     0.0   98.7    1.3    0.0 |    0     1   *|  41.07
>   2    89.2   10.8    0.0    0.0 |    0     0    |  39.16
>   3     9.4    0.0   90.6    0.0 |    0     2   *|  40.72
>   4     0.0    0.0    0.0  100.0 |    0     3   *|  31.96
>   5     0.0    0.1   99.0    0.9 |    0     2   *|  41.36
>   6     0.0    0.0  100.0    0.0 |    0     2   *|  30.62
>   7     0.6    0.0   86.2   13.2 |    0     2   *|  40.11
>   8   100.0    0.0    0.0    0.0 |    0     0    |  31.25
>   9     8.0    0.0    0.0   92.0 |    0     3   *|  40.63
>  10     0.0  100.0    0.0    0.0 |    0     1   *|  30.43
>  11     0.0    0.0    0.0  100.0 |    0     3   *|  31.64
>  12    91.6    0.0    8.4    0.0 |    0     0    |  39.92
>  13     0.0  100.0    0.0    0.0 |    0     1   *|  30.46
>  14    92.1    0.0    7.9    0.0 |    0     0    |  40.36
>  15     0.0    0.0    0.0  100.0 |    0     3   *|  32.29
>  16     4.8   95.2    0.0    0.0 |    0     1   *|  40.70
> AverageUserTime 36.42 seconds
> ElapsedTime     47.21
> TotalUserTime   582.91
> TotalSysTime    0.85
> 
> Running 'hackbench 20' in the background: Time: 6.293
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     0.0    0.0  100.0    0.0 |    0     2   *|  31.53
>   2   100.0    0.0    0.0    0.0 |    0     0    |  34.84
>   3   100.0    0.0    0.0    0.0 |    0     0    |  32.70
>   4     0.0  100.0    0.0    0.0 |    0     1   *|  32.25
>   5     0.0    0.0    0.0  100.0 |    0     3   *|  33.41
>   6    96.5    2.7    0.8    0.0 |    0     0    |  43.39
>   7     5.6    0.0   94.4    0.0 |    0     2   *|  44.42
>   8    73.4   22.9    3.7    0.0 |    0     0    |  42.14
>   9     0.9    0.0   99.1    0.0 |    0     2   *|  45.10
>  10     0.0    8.4   91.6    0.0 |    0     2   *|  42.85
>  11     0.0    0.0    0.0  100.0 |    0     3   *|  32.11
>  12     0.0  100.0    0.0    0.0 |    0     1   *|  32.16
>  13     0.0    0.0    0.0  100.0 |    0     3   *|  33.48
>  14     0.0    0.0  100.0    0.0 |    0     2   *|  31.35
>  15     0.0    0.3    3.2   96.4 |    0     3   *|  42.27
>  16    91.0    0.0    0.0    9.0 |    0     0    |  35.50
>  17     0.0  100.0    0.0    0.0 |    0     1   *|  32.45
>  18     0.0   97.6    2.4    0.0 |    0     1   *|  41.97
>  19     0.0  100.0    0.0    0.0 |    0     1   *|  32.46
>  20     1.3    0.0   74.2   24.5 |    0     2   *|  43.78
>  21     0.0    0.0    0.9   99.1 |    0     3   *|  32.74
>  22     0.8    0.0   99.2    0.0 |    0     2   *|  45.61
>  23   100.0    0.0    0.0    0.0 |    0     0    |  33.78
>  24    97.0    1.0    0.0    1.9 |    0     0    |  43.74
>  25     0.0    1.8    0.0   98.2 |    0     3   *|  43.31
>  26     0.0   96.1    0.0    3.9 |    0     1   *|  43.61
>  27     2.6    0.0    0.3   97.0 |    0     3   *|  46.54
>  28     0.0    0.0    0.0  100.0 |    0     3   *|  33.59
>  29     0.0   92.1    0.0    7.9 |    0     1   *|  43.43
>  30     2.2    0.0   97.8    0.0 |    0     2   *|  45.19
>  31    26.0   72.3    0.0    1.8 |    0     1   *|  43.29
>  32    98.5    1.3    0.2    0.0 |    0     0    |  42.97
> AverageUserTime 38.69 seconds
> ElapsedTime     88.60
> TotalUserTime   1238.43
> TotalSysTime    1.85
> 
> -------------------------------------------------------------
> B: O(1) scheduler equivalent, 1 task steal per load balance step
> (2.4.18 node affine scheduler with CONFIG_NUMA_SCHED=n)
> 
> Running 'hackbench 20' in the background: Time: 1.861
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     0.0    0.0  100.0    0.0 |    2     2    |  31.62
>   2     0.0    0.0  100.0    0.0 |    2     2    |  31.68
>   3   100.0    0.0    0.0    0.0 |    0     0    |  29.40
>   4     0.0    0.0  100.0    0.0 |    2     2    |  31.56
>   5     0.0    0.0  100.0    0.0 |    2     2    |  31.60
>   6    95.0    5.0    0.0    0.0 |    1     0   *|  38.14
>   7   100.0    0.0    0.0    0.0 |    0     0    |  29.27
>   8     0.0  100.0    0.0    0.0 |    1     1    |  27.87
> AverageUserTime 31.39 seconds
> ElapsedTime     39.50
> TotalUserTime   251.27
> TotalSysTime    0.48
> 
> Running 'hackbench 20' in the background: Time: 4.688
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     0.0    0.0  100.0    0.0 |    2     2    |  31.49
>   2     0.0    0.0  100.0    0.0 |    2     2    |  31.33
>   3     0.0    0.0  100.0    0.0 |    2     2    |  31.30
>   4   100.0    0.0    0.0    0.0 |    0     0    |  31.98
>   5   100.0    0.0    0.0    0.0 |    0     0    |  32.21
>   6     0.0   91.3    8.7    0.0 |    2     1   *|  39.88
>   7     0.0  100.0    0.0    0.0 |    1     1    |  30.72
>   8     0.0    0.0    0.0  100.0 |    3     3    |  31.56
>   9     1.2   98.8    0.0    0.0 |    0     1   *|  41.31
>  10     0.0    0.0    0.0  100.0 |    3     3    |  31.56
>  11     0.0  100.0    0.0    0.0 |    1     1    |  30.67
>  12     0.0    0.0    0.0  100.0 |    3     3    |  31.45
>  13   100.0    0.0    0.0    0.0 |    0     0    |  31.94
>  14     5.5    0.0   94.5    0.0 |    0     2   *|  40.83
>  15     0.0    0.0    0.0  100.0 |    3     3    |  31.53
>  16    83.0   17.0    0.0    0.0 |    1     0   *|  38.50
> AverageUserTime 33.64 seconds
> ElapsedTime     44.67
> TotalUserTime   538.49
> TotalSysTime    1.02
> 
> Running 'hackbench 20' in the background: Time: 5.229
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1    67.6    0.0   17.4   15.0 |    2     0   *|  41.36
>   2     0.0    0.0  100.0    0.0 |    2     2    |  32.78
>   3   100.0    0.0    0.0    0.0 |    0     0    |  32.92
>   4     0.0    0.0  100.0    0.0 |    2     2    |  33.56
>   5    22.4   75.3    2.3    0.0 |    2     1   *|  43.26
>   6     0.0    0.0  100.0    0.0 |    2     2    |  33.38
>   7   100.0    0.0    0.0    0.0 |    0     0    |  33.16
>   8     0.0    0.0    0.0  100.0 |    3     3    |  32.03
>   9     0.0  100.0    0.0    0.0 |    1     1    |  32.64
>  10     0.0    0.0    0.0  100.0 |    3     3    |  32.00
>  11     4.5    0.0    0.0   95.5 |    0     3   *|  43.82
>  12     0.0    0.0    0.0  100.0 |    3     3    |  31.91
>  13   100.0    0.0    0.0    0.0 |    0     0    |  33.46
>  14     0.0  100.0    0.0    0.0 |    1     1    |  32.39
>  15     0.0    0.0    0.0  100.0 |    3     3    |  31.86
>  16     0.0    0.0  100.0    0.0 |    2     2    |  33.35
>  17   100.0    0.0    0.0    0.0 |    0     0    |  33.95
>  18     0.0    0.0    0.0  100.0 |    3     3    |  31.68
>  19     0.0    5.3    0.0   94.7 |    1     3   *|  42.97
>  20     2.0   98.0    0.0    0.0 |    0     1   *|  43.56
>  21     0.0    0.0    0.0  100.0 |    3     3    |  31.65
>  22   100.0    0.0    0.0    0.0 |    0     0    |  34.13
>  23     0.0  100.0    0.0    0.0 |    1     1    |  32.52
>  24     9.9   90.1    0.0    0.0 |    1     1    |  33.58
>  25     0.0    3.2   96.8    0.0 |    1     2   *|  42.51
>  26    89.8    0.0    0.0   10.2 |    0     0    |  34.18
>  27     0.0  100.0    0.0    0.0 |    1     1    |  32.44
>  28     0.0    2.4   97.6    0.0 |    2     2    |  33.59
>  29     8.4    0.0   91.6    0.0 |    2     2    |  33.78
>  30   100.0    0.0    0.0    0.0 |    0     0    |  33.52
>  31     0.0    6.3   93.7    0.0 |    2     2    |  33.37
>  32     0.0  100.0    0.0    0.0 |    1     1    |  33.20
> AverageUserTime 34.83 seconds
> ElapsedTime     79.92
> TotalUserTime   1114.91
> TotalSysTime    2.30
> 
> -----------------------------------------------------------
> C: simple scheduler MH
> 
> Running 'hackbench 20' in the background: Time: 7.271
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1   100.0    0.0    0.0    0.0 |    0     0    |  32.66
>   2   100.0    0.0    0.0    0.0 |    0     0    |  32.53
>   3   100.0    0.0    0.0    0.0 |    0     0    |  32.47
>   4     0.0  100.0    0.0    0.0 |    1     1    |  30.33
>   5     0.0  100.0    0.0    0.0 |    1     1    |  30.20
>   6     0.0  100.0    0.0    0.0 |    1     1    |  30.06
>   7   100.0    0.0    0.0    0.0 |    0     0    |  32.35
>   8   100.0    0.0    0.0    0.0 |    0     0    |  32.58
> AverageUserTime 31.65 seconds
> ElapsedTime     42.53
> TotalUserTime   253.30
> TotalSysTime    0.48
> 
> Running 'hackbench 20' in the background: Time: 2.929
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1   100.0    0.0    0.0    0.0 |    0     0    |  31.07
>   2   100.0    0.0    0.0    0.0 |    0     0    |  30.97
>   3   100.0    0.0    0.0    0.0 |    0     0    |  30.86
>   4     0.0  100.0    0.0    0.0 |    1     1    |  31.91
>   5     0.0  100.0    0.0    0.0 |    1     1    |  32.04
>   6     0.0  100.0    0.0    0.0 |    1     1    |  31.98
>   7     0.0  100.0    0.0    0.0 |    1     1    |  32.01
>   8     0.0    0.0  100.0    0.0 |    2     2    |  31.83
>   9     0.0    0.0  100.0    0.0 |    2     2    |  31.66
>  10     0.0    0.0  100.0    0.0 |    2     2    |  31.47
>  11     0.0    0.0  100.0    0.0 |    2     2    |  31.74
>  12     0.0    0.0    0.0  100.0 |    3     3    |  31.54
>  13     0.0    0.0    0.0  100.0 |    3     3    |  31.48
>  14     0.0    0.0    0.0  100.0 |    3     3    |  31.49
>  15     0.0    0.0    0.0  100.0 |    3     3    |  31.97
>  16     5.4    0.0    0.0   94.6 |    0     3   *|  40.86
> AverageUserTime 32.18 seconds
> ElapsedTime     49.66
> TotalUserTime   515.11
> TotalSysTime    0.95
> 
> Running 'hackbench 20' in the background: Time: 2.849
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1   100.0    0.0    0.0    0.0 |    0     0    |  33.75
>   2   100.0    0.0    0.0    0.0 |    0     0    |  33.89
>   3   100.0    0.0    0.0    0.0 |    0     0    |  33.87
>   4     0.0  100.0    0.0    0.0 |    1     1    |  33.34
>   5     0.0  100.0    0.0    0.0 |    1     1    |  33.41
>   6     0.0  100.0    0.0    0.0 |    1     1    |  33.16
>   7     0.0  100.0    0.0    0.0 |    1     1    |  33.25
>   8     0.0    0.0  100.0    0.0 |    2     2    |  32.90
>   9     0.0    0.0  100.0    0.0 |    2     2    |  33.08
>  10     0.0    0.0  100.0    0.0 |    2     2    |  33.07
>  11     0.0    0.0  100.0    0.0 |    2     2    |  33.11
>  12     0.0    0.0    0.0  100.0 |    3     3    |  31.46
>  13     0.0    0.0    0.0  100.0 |    3     3    |  32.04
>  14     0.0    0.0    0.0  100.0 |    3     3    |  31.59
>  15     0.0    0.0    0.0  100.0 |    3     3    |  31.83
>  16   100.0    0.0    0.0    0.0 |    0     0    |  33.95
>  17     1.2    0.0    0.0   98.8 |    0     3   *|  44.66
>  18   100.0    0.0    0.0    0.0 |    0     0    |  34.02
>  19    55.4    0.0    0.0   44.6 |    3     0    |  37.11
>  20   100.0    0.0    0.0    0.0 |    0     0    |  33.62
>  21     0.0  100.0    0.0    0.0 |    1     1    |  33.33
>  22     0.0  100.0    0.0    0.0 |    1     1    |  33.26
>  23     0.0  100.0    0.0    0.0 |    1     1    |  33.41
>  24     0.0  100.0    0.0    0.0 |    1     1    |  33.14
>  25     0.0    0.0  100.0    0.0 |    2     2    |  33.26
>  26     0.0    0.0  100.0    0.0 |    2     2    |  33.05
>  27     0.0    0.0    3.1   96.9 |    2     3   *|  43.50
>  28   100.0    0.0    0.0    0.0 |    0     0    |  34.02
>  29    60.4    0.0   39.6    0.0 |    2     0   *|  37.05
>  30     0.0    0.0    0.0  100.0 |    3     3    |  31.53
>  31   100.0    0.0    0.0    0.0 |    0     0    |  33.94
>  32     0.0    0.0  100.0    0.0 |    2     2    |  33.03
> AverageUserTime 34.05 seconds
> ElapsedTime     80.34
> TotalUserTime   1090.10
> TotalSysTime    2.15
> 
> ---------------------------------------------------------------
> D: numasched : pooling only (idle pool delay set to 0)
> 
> Running 'hackbench 20' in the background: Time: 2.066
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     0.0    0.0    0.0  100.0 |    3     3    |  28.50
>   2   100.0    0.0    0.0    0.0 |    0     0    |  30.97
>   3     0.0    0.0    0.0  100.0 |    3     3    |  28.49
>   4     0.0    0.0  100.0    0.0 |    2     2    |  28.65
>   5   100.0    0.0    0.0    0.0 |    0     0    |  31.18
>   6   100.0    0.0    0.0    0.0 |    0     0    |  30.96
>   7     5.7   94.3    0.0    0.0 |    0     1   *|  38.69
>   8     0.0    0.0  100.0    0.0 |    2     2    |  28.65
> AverageUserTime 30.76 seconds
> ElapsedTime     39.72
> TotalUserTime   246.23
> TotalSysTime    0.46
> 
> Running 'hackbench 20' in the background: Time: 5.178
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1   100.0    0.0    0.0    0.0 |    0     0    |  32.01
>   2   100.0    0.0    0.0    0.0 |    0     0    |  31.82
>   3     0.0    3.4    0.0   96.6 |    1     3   *|  41.01
>   4   100.0    0.0    0.0    0.0 |    0     0    |  32.24
>   5   100.0    0.0    0.0    0.0 |    0     0    |  31.91
>   6     0.0    0.0  100.0    0.0 |    2     2    |  31.75
>   7     0.0    0.0    0.0  100.0 |    3     3    |  30.55
>   8     0.0    0.0  100.0    0.0 |    2     2    |  31.74
>   9     0.0    0.0  100.0    0.0 |    2     2    |  31.74
>  10     0.0    0.0    0.0  100.0 |    3     3    |  30.63
>  11     0.0    0.0    0.0  100.0 |    3     3    |  30.67
>  12     0.0    0.0  100.0    0.0 |    2     2    |  31.75
>  13     0.0  100.0    0.0    0.0 |    1     1    |  31.94
>  14     0.0  100.0    0.0    0.0 |    1     1    |  32.20
>  15     0.0  100.0    0.0    0.0 |    1     1    |  32.24
>  16     0.0  100.0    0.0    0.0 |    1     1    |  32.07
> AverageUserTime 32.27 seconds
> ElapsedTime     42.97
> TotalUserTime   516.53
> TotalSysTime    1.16
> 
> Running 'hackbench 20' in the background: Time: 4.505
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     0.0   49.4    0.0   50.6 |    3     3    |  36.48
>   2     0.3   99.7    0.0    0.0 |    0     1   *|  39.21
>   3     0.0  100.0    0.0    0.0 |    1     1    |  32.23
>   4   100.0    0.0    0.0    0.0 |    0     0    |  32.88
>   5   100.0    0.0    0.0    0.0 |    0     0    |  32.98
>   6   100.0    0.0    0.0    0.0 |    0     0    |  32.97
>   7     0.0   18.3   81.7    0.0 |    2     2    |  35.13
>   8     0.0    5.7   94.3    0.0 |    2     2    |  34.00
>   9     0.0    0.0  100.0    0.0 |    2     2    |  33.14
>  10     0.0    3.3   96.7    0.0 |    2     2    |  34.22
>  11     0.0  100.0    0.0    0.0 |    1     1    |  32.34
>  12     0.0  100.0    0.0    0.0 |    1     1    |  32.37
>  13     0.0   13.4    0.0   86.6 |    3     3    |  34.55
>  14     0.0    0.0    0.0  100.0 |    3     3    |  33.93
>  15     0.0    0.0    0.0  100.0 |    3     3    |  33.08
>  16   100.0    0.0    0.0    0.0 |    0     0    |  33.31
>  17   100.0    0.0    0.0    0.0 |    0     0    |  33.08
>  18     0.0    8.2    0.0   91.8 |    3     3    |  33.76
>  19     0.0    0.0    0.0  100.0 |    3     3    |  33.23
>  20     0.0   32.7    0.0   67.3 |    3     3    |  34.79
>  21     0.0    0.0    0.0  100.0 |    3     3    |  32.94
>  22   100.0    0.0    0.0    0.0 |    0     0    |  33.23
>  23     0.0  100.0    0.0    0.0 |    1     1    |  32.35
>  24     0.0    0.0  100.0    0.0 |    2     2    |  33.44
>  25     0.0  100.0    0.0    0.0 |    1     1    |  32.42
>  26     0.0    0.0  100.0    0.0 |    2     2    |  33.25
>  27     0.0    0.0  100.0    0.0 |    2     2    |  33.82
>  28     0.0    0.0  100.0    0.0 |    2     2    |  32.88
>  29     0.0    0.0    0.0  100.0 |    3     3    |  31.99
>  30     0.0  100.0    0.0    0.0 |    1     1    |  31.90
>  31   100.0    0.0    0.0    0.0 |    0     0    |  32.93
>  32    99.6    0.0    0.4    0.0 |    2     0   *|  41.48
> AverageUserTime 33.76 seconds
> ElapsedTime     78.35
> TotalUserTime   1080.75
> TotalSysTime    2.03
> 
> -----------------------------------------------------------
> 
> E: node affine scheduler with initial load balancing, statically
> assigned homenode
> 
> Running 'hackbench 20' in the background: Time: 2.330
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     0.0    0.0  100.0    0.0 |    2     2    |  28.86
>   2     0.0  100.0    0.0    0.0 |    1     1    |  28.82
>   3     0.0  100.0    0.0    0.0 |    1     1    |  28.85
>   4     0.0    0.0    0.0  100.0 |    3     3    |  28.47
>   5   100.0    0.0    0.0    0.0 |    0     0    |  28.69
>   6     0.0    0.0  100.0    0.0 |    2     2    |  28.65
>   7     0.0    0.0    0.0  100.0 |    3     3    |  28.49
>   8   100.0    0.0    0.0    0.0 |    0     0    |  28.55
> AverageUserTime 28.67 seconds
> ElapsedTime     30.28
> TotalUserTime   229.51
> TotalSysTime    0.43
> 
> 
> Running 'hackbench 20' in the background: Time: 5.562
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     0.0  100.0    0.0    0.0 |    1     1    |  31.47
>   2     0.0  100.0    0.0    0.0 |    1     1    |  31.38
>   3     0.0    0.0    0.0  100.0 |    3     3    |  31.53
>   4     0.0    0.0    0.0  100.0 |    3     3    |  31.46
>   5   100.0    0.0    0.0    0.0 |    0     0    |  31.64
>   6   100.0    0.0    0.0    0.0 |    0     0    |  31.57
>   7   100.0    0.0    0.0    0.0 |    0     0    |  31.57
>   8     0.0    0.0    0.0  100.0 |    3     3    |  31.51
>   9     0.0    0.0  100.0    0.0 |    2     2    |  31.63
>  10     0.0    0.0  100.0    0.0 |    2     2    |  31.56
>  11     0.0    0.0  100.0    0.0 |    2     2    |  31.42
>  12     0.0  100.0    0.0    0.0 |    1     1    |  31.44
>  13     0.0    0.0    0.0  100.0 |    3     3    |  31.43
>  14     0.0    0.0  100.0    0.0 |    2     2    |  31.61
>  15   100.0    0.0    0.0    0.0 |    0     0    |  31.34
>  16     0.0  100.0    0.0    0.0 |    1     1    |  31.36
> AverageUserTime 31.50 seconds
> ElapsedTime     36.98
> TotalUserTime   504.17
> TotalSysTime    1.03
> 
> Running 'hackbench 20' in the background: Time: 5.288
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     0.0    0.0   46.1   53.9 |    2     3   *|  37.11
>   2     0.0    0.0    0.0  100.0 |    3     3    |  31.32
>   3     0.0  100.0    0.0    0.0 |    1     1    |  33.15
>   4     0.0    4.9   95.1    0.0 |    2     2    |  33.55
>   5     0.0    0.0  100.0    0.0 |    2     2    |  32.81
>   6     0.0  100.0    0.0    0.0 |    1     1    |  32.88
>   7     0.0  100.0    0.0    0.0 |    1     1    |  33.31
>   8    82.9   17.1    0.0    0.0 |    1     0   *|  40.70
>   9   100.0    0.0    0.0    0.0 |    0     0    |  33.36
>  10     9.3    0.0    0.0   90.7 |    0     3   *|  42.12
>  11   100.0    0.0    0.0    0.0 |    0     0    |  33.38
>  12     0.0    0.0    0.0  100.0 |    3     3    |  31.28
>  13     0.0    0.0    0.0  100.0 |    3     3    |  31.43
>  14     0.0  100.0    0.0    0.0 |    1     1    |  33.33
>  15     0.0  100.0    0.0    0.0 |    1     1    |  33.04
>  16   100.0    0.0    0.0    0.0 |    0     0    |  33.02
>  17    95.2    4.8    0.0    0.0 |    0     0    |  33.82
>  18     3.8    0.0    0.0   96.2 |    0     3   *|  43.21
>  19     0.0    0.0    0.0  100.0 |    3     3    |  31.11
>  20     0.0    0.0  100.0    0.0 |    2     2    |  32.19
>  21     0.0  100.0    0.0    0.0 |    1     1    |  33.43
>  22     0.0    0.0  100.0    0.0 |    2     2    |  33.03
>  23     0.0    0.0  100.0    0.0 |    2     2    |  33.06
>  24   100.0    0.0    0.0    0.0 |    0     0    |  33.04
>  25   100.0    0.0    0.0    0.0 |    0     0    |  33.07
>  26     0.0    0.0  100.0    0.0 |    2     2    |  33.07
>  27     0.0    0.0  100.0    0.0 |    2     2    |  33.22
>  28     0.0    0.0    0.0  100.0 |    3     3    |  31.06
>  29     0.0   15.3   84.7    0.0 |    2     2    |  33.25
>  30     0.0  100.0    0.0    0.0 |    1     1    |  33.36
>  31    85.0    0.0    0.0   15.0 |    0     0    |  34.50
>  32     0.0   86.4    0.0   13.6 |    1     1    |  34.27
> AverageUserTime 33.89 seconds
> ElapsedTime     76.84
> TotalUserTime   1084.86
> TotalSysTime    2.12
> 
> --------------------------------------------------------
> 
> F: node affine scheduler, dynamically assigned homenode, no
> initial load balancing
> 
> Running 'hackbench 20' in the background: Time: 1.861
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     0.0    0.0    0.0  100.0 |    3     3    |  30.75
>   2   100.0    0.0    0.0    0.0 |    0     0    |  27.12
>   3     0.0    0.0    0.0  100.0 |    3     3    |  30.60
>   4     0.0    0.0    0.0  100.0 |    3     3    |  30.65
>   5     0.0  100.0    0.0    0.0 |    1     1    |  28.65
>   6     0.0    0.0   78.4   21.6 |    3     2   *|  36.63
>   7     0.0  100.0    0.0    0.0 |    1     1    |  28.57
>   8     0.0    0.0  100.0    0.0 |    2     2    |  27.65
> AverageUserTime 30.08 seconds
> ElapsedTime     38.28
> TotalUserTime   240.75
> TotalSysTime    0.49
> 
> Running 'hackbench 20' in the background: Time: 5.936
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     0.0    0.0    0.0  100.0 |    3     3    |  31.47
>   2   100.0    0.0    0.0    0.0 |    0     0    |  29.90
>   3    88.2    0.0    4.6    7.2 |    2     0   *|  40.96
>   4     0.0    0.0    0.0  100.0 |    3     3    |  31.53
>   5     0.0    0.0    0.0  100.0 |    3     3    |  31.50
>   6     0.0  100.0    0.0    0.0 |    1     1    |  32.25
>   7     0.0   84.5    0.0   15.5 |    1     1    |  33.41
>   8     0.0    0.0  100.0    0.0 |    2     2    |  32.38
>   9    83.0   17.0    0.0    0.0 |    1     0   *|  39.29
>  10     0.0    0.0    0.0  100.0 |    3     3    |  31.51
>  11     0.0  100.0    0.0    0.0 |    1     1    |  32.38
>  12     0.0    0.0  100.0    0.0 |    2     2    |  32.30
>  13     0.0  100.0    0.0    0.0 |    1     1    |  32.06
>  14   100.0    0.0    0.0    0.0 |    0     0    |  29.85
>  15     0.0    0.0  100.0    0.0 |    2     2    |  32.18
>  16     0.0    0.0  100.0    0.0 |    2     2    |  32.39
> AverageUserTime 32.83 seconds
> ElapsedTime     42.51
> TotalUserTime   525.59
> TotalSysTime    1.14
> 
> Running 'hackbench 20' in the background: Time: 4.277
> Job  node00 node01 node02 node03 | iSched MSched | UserTime(s)
>   1     0.0    0.0    0.0  100.0 |    3     3    |  31.69
>   2     0.0    0.0  100.0    0.0 |    2     2    |  33.81
>   3   100.0    0.0    0.0    0.0 |    0     0    |  33.10
>   4     1.4    4.9    2.8   90.9 |    2     3   *|  44.26
>   5     0.0    0.0    0.0  100.0 |    3     3    |  31.76
>   6     0.0    0.0    0.0  100.0 |    3     3    |  31.69
>   7     0.0    0.0    3.6   96.4 |    2     3   *|  43.73
>   8     0.0  100.0    0.0    0.0 |    1     1    |  33.07
>   9     0.0    0.0  100.0    0.0 |    2     2    |  34.04
>  10     0.0  100.0    0.0    0.0 |    1     1    |  33.24
>  11     0.0  100.0    0.0    0.0 |    1     1    |  33.23
>  12   100.0    0.0    0.0    0.0 |    0     0    |  33.77
>  13     0.0  100.0    0.0    0.0 |    1     1    |  33.10
>  14   100.0    0.0    0.0    0.0 |    0     0    |  33.26
>  15     0.0  100.0    0.0    0.0 |    1     1    |  33.16
>  16   100.0    0.0    0.0    0.0 |    0     0    |  33.23
>  17     3.9    0.0    0.0   96.1 |    0     3   *|  42.74
>  18     0.0  100.0    0.0    0.0 |    1     1    |  32.97
>  19   100.0    0.0    0.0    0.0 |    0     0    |  33.21
>  20     0.0    0.0   92.1    7.9 |    2     2    |  34.30
>  21     0.0    0.0   95.9    4.1 |    2     2    |  34.31
>  22     0.0    0.0  100.0    0.0 |    2     2    |  33.88
>  23     0.0    0.0    0.0  100.0 |    3     3    |  30.85
>  24     0.0  100.0    0.0    0.0 |    1     1    |  32.99
>  25     0.0    0.0  100.0    0.0 |    2     2    |  33.84
>  26     0.0  100.0    0.0    0.0 |    1     1    |  33.35
>  27   100.0    0.0    0.0    0.0 |    0     0    |  33.29
>  28     0.0    0.0  100.0    0.0 |    2     2    |  33.79
>  29   100.0    0.0    0.0    0.0 |    0     0    |  33.25
>  30   100.0    0.0    0.0    0.0 |    0     0    |  33.41
>  31     0.0    0.0  100.0    0.0 |    2     2    |  33.92
>  32     0.0    0.0    0.0  100.0 |    3     3    |  31.16
> AverageUserTime 34.11 seconds
> ElapsedTime     77.38
> TotalUserTime   1091.86
> TotalSysTime    2.14
> 
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] NUMA schedulers benchmark results
  2002-10-07  0:58 ` Martin J. Bligh
@ 2002-10-07 16:52   ` Erich Focht
  0 siblings, 0 replies; 24+ messages in thread
From: Erich Focht @ 2002-10-07 16:52 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

On Monday 07 October 2002 02:58, Martin J. Bligh wrote:
> > As I'm rewriting the node affine scheduler to be more modular, I'll
> > redo the tests for cases D, E, F on top of 2.5.X kernels soon.
>
> It's really hard to see anything comparing 2.4 vs 2.5 results,
> so when you have results for everything across the same kernel,
> it'll be a lot easier to decipher ...

The user times really are comparable, and so is the total user time. I'm
porting the node affine patches to 2.5.39 and will rerun the tests ASAP.

> For the record, I really don't care whose code gets accepted, I'm
> just interested in having a scheduler that works well for me ;-)

I was running several of these to see which features help and to try
to understand where the advantage comes from. In the output tables
you can see where each job has spent its time and on which node it
started. You can also tell from the user time how well a job ran: alone
on its node: 27s; away from its memory: 36s (or so); sharing the node
with others: 27-32s; running on the wrong node together with others: up
to 43s. Seeing these numbers you can understand how well a feature works
or how much you miss another feature. It wasn't my plan to advertise the
node affine scheduler; it simply has most of the features.

> However, I think it would be a good idea to run some other benchmarks
> on these, that are a little more general ... dbench, kernel compile,
> sdet, whatever, as *well* as your own benchmark ... I think we're
> working on that.
Yes, great!

> So lower numbers are better, right? So Michael's stuff seems to
> work fine at the higher end, but poorly at the lower end - I think
> this may be due to a bug he found on Friday, if that gets fixed, it
> might make a difference.

OK, but how about adding the node-wise selection and balancing? That
would bring the approaches closer together...

> I'll run your tests on the NUMA-Q comparing 2.5.40 vs Michael's stuff.
> If you send me a cleaned up version (ie without the ia64 stuff in
> generic code) of your stuff against 2.5.40 (or ideally 2.5.40-mm2),
> I'll run that too ... (if you're putting the latencies into arch code,
> you can set i386 (well NUMA-Q) for 20:1 if you think that'll help).

It's in preparation. First step: the pooling scheduler without
multi-level support, but with node-wise initial balancing.

> PS. the wierd "this one was run without hackbench" thing makes the
> results even harder to read ...
Sorry about that. Didn't have a newer measurement.

> PPS. Does Kimio's discontigmem patch work for you on 2.5?
Yes. We're trying to get that stuff in but David's away from his email
for a while as you know.

Regards,
Erich


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-07  7:25 ` Martin J. Bligh
  2002-10-07  7:40   ` Ingo Molnar
@ 2002-10-07 20:09   ` Erich Focht
       [not found]     ` <1420721189.1034032091@[10.10.2.3]>
  1 sibling, 1 reply; 24+ messages in thread
From: Erich Focht @ 2002-10-07 20:09 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1604 bytes --]

Hi,

here come two patches which together implement the pooling NUMA scheduler
with initial load balancing, stripped out of the node affine scheduler.

The first patch does the following:
 - implements the CPU pools structure,
 - changes find_busiest_queue to first scan its own node and, if no
   imbalance is found there, to find the most loaded remote node and
   steal a task from it if the imbalance exceeds 25% (sketched below),
 - delays the steal by 100ms if we steal from a remote node while our
   own node is loaded close to (within 25% of) the average,
 - delays the steal by 1ms if our own node is underloaded (this still
   gives CPUs inside the overloaded node a head start).
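
In rough C form the scan order looks like this (an illustration only;
the helper names are invented and the attached patch is the
authoritative version):

/* hypothetical helpers, just to keep the sketch short */
extern runqueue_t *most_loaded_rq_on_node(int node);
extern int most_loaded_remote_node(int this_node);
extern int imbalance_over_25pct(runqueue_t *busy, runqueue_t *local);
extern int remote_steal_delay_expired(runqueue_t *local);

static runqueue_t *pick_steal_source(runqueue_t *this_rq, int this_node)
{
	runqueue_t *busiest;

	/* step 1: look inside our own node, steal without any delay */
	busiest = most_loaded_rq_on_node(this_node);
	if (busiest && busiest != this_rq &&
	    imbalance_over_25pct(busiest, this_rq))
		return busiest;

	/* step 2: look at the most loaded remote node */
	busiest = most_loaded_rq_on_node(most_loaded_remote_node(this_node));
	if (!busiest || !imbalance_over_25pct(busiest, this_rq))
		return NULL;

	/* remote steals are delayed: ~100ms if our node is close to the
	 * machine average, ~1ms if it is clearly underloaded */
	if (!remote_steal_delay_expired(this_rq))
		return NULL;
	return busiest;
}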

I dropped the multilevel structure as it only complicates things at
this stage and can be introduced later. The patch is prepared for it.

The second patch implements initial balancing by selecting the least
loaded node and that node's least loaded CPU for a task that calls
exec(). If node loads are equal, the next node is chosen round-robin.
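
The core of it boils down to this (a simplified sketch; the real
sched_balance_exec in the patch also waits for pool_lock and skips the
node search when there is only one pool):

	int node = sched_best_node(current);        /* least loaded node    */
	int cpu  = sched_best_cpu(current, node);   /* its least loaded CPU */

	if (cpu != smp_processor_id())
		sched_migrate_task(current, cpu);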

The node affine patch will be on top of this.

The pool setup is pretty messy; it needs some sort of RCU to become
nicer. Also, undefining CONFIG_NUMA_SCHED could lead to errors; I
haven't cleaned this up. This is mainly for testing and as a base for
experimenting with the node affine extensions.

Michael, you could just take the second patch to add initial load
balancing to your approach. Compared to your patch, this one also tries
to achieve equal node load in the find_busiest_queue step, with more
effort and code, though... I hope it works for you; I removed the arch
dependencies and only need __cpu_to_node().

Regards,
Erich

[-- Attachment #2: 01-numa_sched_core5-noML-2.5.39.patch --]
[-- Type: text/x-diff, Size: 21171 bytes --]

diff -urNp a/arch/i386/config.in b/arch/i386/config.in
--- a/arch/i386/config.in	Fri Sep 27 23:49:42 2002
+++ b/arch/i386/config.in	Mon Oct  7 19:46:52 2002
@@ -177,6 +177,7 @@ else
      fi
      # Common NUMA Features
      if [ "$CONFIG_X86_NUMAQ" = "y" ]; then
+        bool 'Enable NUMA scheduler' CONFIG_NUMA_SCHED
         bool 'Numa Memory Allocation Support' CONFIG_NUMA
         if [ "$CONFIG_NUMA" = "y" ]; then
            define_bool CONFIG_DISCONTIGMEM y
diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c	Fri Sep 27 23:49:54 2002
+++ b/arch/i386/kernel/smpboot.c	Mon Oct  7 19:46:52 2002
@@ -765,6 +765,8 @@ static int __init wakeup_secondary_via_I
 
 extern unsigned long cpu_initialized;
 
+static int __initdata nr_lnodes = 0;
+
 static void __init do_boot_cpu (int apicid) 
 /*
  * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
@@ -1194,6 +1196,11 @@ int __devinit __cpu_up(unsigned int cpu)
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	zap_low_mappings();
+#ifdef CONFIG_NUMA_SCHED
+	pooldata_lock();
+	bld_pools();
+	pooldata_unlock();
+#endif
 }
 
 void __init smp_intr_init()
diff -urNp a/arch/ia64/config.in b/arch/ia64/config.in
--- a/arch/ia64/config.in	Sat Oct  5 17:23:24 2002
+++ b/arch/ia64/config.in	Mon Oct  7 19:47:37 2002
@@ -69,6 +69,7 @@ then
 	bool '  Enable NUMA support' CONFIG_NUMA
 	if [ "$CONFIG_NUMA" = "y" ]; then
 		define_bool CONFIG_DISCONTIGMEM y
+		bool '  Enable NUMA scheduler' CONFIG_NUMA_SCHED
 	fi
 	bool '  Enable IA-64 Machine Check Abort' CONFIG_IA64_MCA
 	define_bool CONFIG_PM y
diff -urNp a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c	Mon Oct  7 17:42:02 2002
+++ b/arch/ia64/kernel/smpboot.c	Mon Oct  7 19:46:52 2002
@@ -397,7 +397,7 @@ unsigned long cache_decay_ticks;	/* # of
 static void
 smp_tune_scheduling (void)
 {
-	cache_decay_ticks = 10;	/* XXX base this on PAL info and cache-bandwidth estimate */
+	cache_decay_ticks = 8;	/* XXX base this on PAL info and cache-bandwidth estimate */
 
 	printk("task migration cache decay timeout: %ld msecs.\n",
 	       (cache_decay_ticks + 1) * 1000 / HZ);
@@ -508,6 +508,11 @@ smp_cpus_done (unsigned int dummy)
 
 	printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
 	       num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+#ifdef CONFIG_NUMA_SCHED
+	pooldata_lock();
+	bld_pools();
+	pooldata_unlock();
+#endif
 }
 
 int __devinit
diff -urNp a/include/asm-i386/atomic.h b/include/asm-i386/atomic.h
--- a/include/asm-i386/atomic.h	Fri Sep 27 23:49:16 2002
+++ b/include/asm-i386/atomic.h	Mon Oct  7 19:46:52 2002
@@ -111,6 +111,18 @@ static __inline__ void atomic_inc(atomic
 }
 
 /**
+ * atomic_inc_return - increment atomic variable and return new value
+ * @v: pointer of type atomic_t
+ *
+ * Atomically increments @v by 1 and returns its new value.  Note that
+ * the guaranteed useful range of an atomic_t is only 24 bits.
+ */
+static inline int atomic_inc_return(atomic_t *v){
+	atomic_inc(v);
+	return v->counter;
+}
+
+/**
  * atomic_dec - decrement atomic variable
  * @v: pointer of type atomic_t
  * 
diff -urNp a/include/asm-ia64/topology.h b/include/asm-ia64/topology.h
--- a/include/asm-ia64/topology.h	Mon Oct  7 17:42:02 2002
+++ b/include/asm-ia64/topology.h	Mon Oct  7 19:46:52 2002
@@ -37,7 +37,7 @@
  * don't use it too early.
  * Who needs this?
  */
-/* #define __node_to_first_cpu(node) pool_cpus[pool_ptr[node]] */
+#define __node_to_first_cpu(node) pool_cpus[pool_ptr[node]]
 
 /* 
  * Returns a bitmask of CPUs on Node 'node'.
diff -urNp a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	Sat Oct  5 17:22:45 2002
+++ b/include/linux/sched.h	Mon Oct  7 19:46:52 2002
@@ -22,6 +22,8 @@ extern unsigned long event;
 #include <asm/mmu.h>
 
 #include <linux/smp.h>
+#include <asm/topology.h>
+#include <asm/numa.h>
 #include <linux/sem.h>
 #include <linux/signal.h>
 #include <linux/securebits.h>
@@ -167,7 +169,6 @@ extern void update_one_process(struct ta
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
 
-
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
 asmlinkage void schedule(void);
@@ -457,6 +458,15 @@ extern void set_cpus_allowed(task_t *p, 
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA_SCHED
+extern int node_levels[NR_NODES];
+extern int nr_node_levels;
+/* extern void find_node_levels(int numpools); */
+extern void pooldata_lock(void);
+extern void pooldata_unlock(void);
+#endif
+extern void sched_migrate_task(task_t *p, int cpu);
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -urNp a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	Fri Sep 27 23:50:27 2002
+++ b/kernel/sched.c	Mon Oct  7 19:46:52 2002
@@ -154,6 +154,10 @@ struct runqueue {
 	task_t *migration_thread;
 	struct list_head migration_queue;
 
+	unsigned long wait_time;
+	int wait_node;
+	int load[2][NR_CPUS];
+
 } ____cacheline_aligned;
 
 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -174,6 +178,95 @@ static struct runqueue runqueues[NR_CPUS
 #endif
 
 /*
+ * Variables for describing and accessing processor pools. Using a
+ * compressed row format like notation. Processor pools are treated
+ * like logical node numbers.
+ *
+ * numpools: number of CPU pools (nodes),
+ * pool_cpus[]: CPUs in pools sorted by their pool ID,
+ * pool_ptr[pool]: index of first element in pool_cpus[] belonging to pool.
+ * pool_mask[]: cpu mask of a pool.
+ *
+ * Example: loop over all CPUs in a pool p:
+ * loop_over_node(i,node) {
+ *      cpu = pool_cpu(i);
+ *      ...
+ * }
+ *                                                      <efocht@ess.nec.de>
+ */
+int numpools, pool_ptr[NR_NODES+1], pool_cpus[NR_CPUS], pool_nr_cpus[NR_NODES];
+unsigned long pool_mask[NR_NODES];
+static atomic_t pool_lock = ATOMIC_INIT(0);
+#define cpu_to_node(cpu) __cpu_to_node(cpu)
+
+/* Avoid zeroes in integer divides for load calculations */
+#define BALANCE_FACTOR 100
+
+#ifdef CONFIG_NUMA_SCHED
+#define POOL_DELAY     (100)
+#define loop_over_node(i,n) for(i=pool_ptr[n]; i<pool_ptr[n+1]; i++)
+#define pool_cpu(i) pool_cpus[i]
+
+void pooldata_lock(void)
+{
+	int i;
+ retry:
+	while (atomic_read(&pool_lock));
+	if (atomic_inc_return(&pool_lock) > 1) {
+		atomic_dec(&pool_lock);
+		goto retry;
+	}
+	/* 
+	 * Wait a while, any loops using pool data should finish
+	 * in between. This is VERY ugly and should be replaced
+	 * by some real RCU stuff. [EF]
+	 */
+	for (i=0; i<100; i++)
+		udelay(1000);
+}
+
+void pooldata_unlock(void)
+{
+	atomic_dec(&pool_lock);
+}
+
+/*
+ * Call pooldata_lock() before calling this function and
+ * pooldata_unlock() after!
+ */
+void bld_pools(void)
+{
+	int n, cpu, ptr, i;
+	unsigned long mask;
+
+	ptr=0;
+	for (n=0; n<numnodes; n++) {
+		mask = pool_mask[n] = __node_to_cpu_mask(n) & cpu_online_map;
+		pool_ptr[n] = ptr;
+		for (cpu=0; cpu<NR_CPUS; cpu++)
+			if (mask  & (1UL << cpu))
+				pool_cpu(ptr++) = cpu;
+		pool_nr_cpus[n] = ptr - pool_ptr[n];
+	}
+	numpools=numnodes;
+	pool_ptr[numpools]=ptr;
+	printk("CPU pools : %d\n",numpools);
+	for (n=0;n<numpools;n++) {
+		printk("pool %d :",n);
+		loop_over_node(i,n)
+			printk("%d ",pool_cpu(i));
+		printk("\n");
+	}
+}
+
+#else
+#define POOL_DELAY  (0)
+#define loop_over_node(i,n) for(i=0; i<NR_CPUS; i++)
+#define pool_cpu(i) (i)
+#endif
+
+
+/*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
  * explicitly disabling preemption.
@@ -632,121 +728,145 @@ static inline unsigned int double_lock_b
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * Calculate load of a CPU pool, store results in data[][NR_CPUS].
+ * Return the index of the most loaded runqueue.
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static int calc_pool_load(int data[][NR_CPUS], int this_cpu, int pool, int idle)
 {
-	int nr_running, load, max_load, i;
-	runqueue_t *busiest, *rq_src;
-
-	/*
-	 * We search all runqueues to find the most busy one.
-	 * We do this lockless to reduce cache-bouncing overhead,
-	 * we re-check the 'best' source CPU later on again, with
-	 * the lock held.
-	 *
-	 * We fend off statistical fluctuations in runqueue lengths by
-	 * saving the runqueue length during the previous load-balancing
-	 * operation and using the smaller one the current and saved lengths.
-	 * If a runqueue is long enough for a longer amount of time then
-	 * we recognize it and pull tasks from it.
-	 *
-	 * The 'current runqueue length' is a statistical maximum variable,
-	 * for that one we take the longer one - to avoid fluctuations in
-	 * the other direction. So for a load-balance to happen it needs
-	 * stable long runqueue on the target CPU and stable short runqueue
-	 * on the local runqueue.
-	 *
-	 * We make an exception if this CPU is about to become idle - in
-	 * that case we are less picky about moving a task across CPUs and
-	 * take what can be taken.
-	 */
-	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
-		nr_running = this_rq->nr_running;
-	else
-		nr_running = this_rq->prev_nr_running[this_cpu];
+	runqueue_t *rq_src, *this_rq = cpu_rq(this_cpu);
+	int this_pool = cpu_to_node(this_cpu);
+	int i, ii, idx=-1, refload, load;
 
-	busiest = NULL;
-	max_load = 1;
-	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
-			continue;
+	data[1][pool] = 0;
+	refload = -1;
 
+	loop_over_node(ii, pool) {
+		i = pool_cpu(ii);
 		rq_src = cpu_rq(i);
 		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
 			load = rq_src->nr_running;
 		else
 			load = this_rq->prev_nr_running[i];
 		this_rq->prev_nr_running[i] = rq_src->nr_running;
-
-		if ((load > max_load) && (rq_src != this_rq)) {
-			busiest = rq_src;
-			max_load = load;
+		data[0][i] = load;
+		data[1][pool] += load;
+		if (load > refload) {
+			idx = i;
+			refload = load;
 		}
 	}
+	data[1][pool] = data[1][pool] * BALANCE_FACTOR / pool_nr_cpus[pool];
+	return idx;
+}
 
-	if (likely(!busiest))
-		goto out;
+/*
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their home node.
+ * This is done in two steps:
+ * 1. First try to find a runqueue within the own CPU pool (AKA node) with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU. Remote runqueues running tasks having their homenode on the
+ * current node are preferred (those tasks count twice in the load calculation).
+ * If the current load is far below the average try to steal a task from the
+ * most loaded node/cpu. Otherwise wait 100ms and give less loaded nodes the
+ * chance to approach the average load.
+ *
+ * This concept can be extended easily to more than two levels (multi-level
+ * scheduler?), e.g.: CPU -> multi-core package -> node -> supernode...
+ *                                                         <efocht@ess.nec.de>
+ */
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu,
+					     int idle, int *nr_running)
+{
+	runqueue_t *busiest = NULL;
+	int imax, best_cpu, pool, max_pool_load, max_pool_idx;
+	int i, del_shift;
+	int avg_load=-1, this_pool = cpu_to_node(this_cpu);
 
-	*imbalance = (max_load - nr_running) / 2;
+	/* Need at least ~25% imbalance to trigger balancing. */
+#define BALANCED(m,t) (((m) <= 1) || (((m) - (t))/2 < (((m) + (t))/2 + 3)/4))
 
-	/* It needs an at least ~25% imbalance to trigger balancing. */
-	if (!idle && (*imbalance < (max_load + 3)/4)) {
-		busiest = NULL;
+	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+		*nr_running = this_rq->nr_running;
+	else
+		*nr_running = this_rq->prev_nr_running[this_cpu];
+
+	best_cpu = calc_pool_load(this_rq->load, this_cpu, this_pool, idle);
+	
+	if (best_cpu != this_cpu)
+		goto check_out;
+
+ scan_all:
+	best_cpu = -1;
+	max_pool_load = this_rq->load[1][this_pool];
+	max_pool_idx = this_pool;
+	avg_load = max_pool_load * pool_nr_cpus[this_pool];
+	for (i = 1; i < numpools; i++) {
+		pool = (i + this_pool) % numpools;
+		imax = calc_pool_load(this_rq->load, this_cpu, pool, idle);
+		avg_load += this_rq->load[1][pool]*pool_nr_cpus[pool];
+		if (this_rq->load[1][pool] > max_pool_load) {
+			max_pool_load = this_rq->load[1][pool];
+			max_pool_idx = pool;
+			best_cpu = imax;
+		}
+	}
+	/* Exit if not enough imbalance on any remote node. */
+	if ((best_cpu < 0) ||
+	    BALANCED(max_pool_load,this_rq->load[1][this_pool])) {
+		this_rq->wait_node = -1;
 		goto out;
 	}
+	avg_load /= num_online_cpus();
+	/* Wait longer before stealing if load is average. */
+	if (BALANCED(avg_load,this_rq->load[1][this_pool]))
+		del_shift = 0;
+	else
+		del_shift = 6;
 
-	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
-	/*
-	 * Make sure nothing changed since we checked the
-	 * runqueue length.
-	 */
-	if (busiest->nr_running <= nr_running + 1) {
-		spin_unlock(&busiest->lock);
-		busiest = NULL;
-	}
-out:
-	return busiest;
-}
+	if (this_rq->wait_node != max_pool_idx) {
+		this_rq->wait_node = max_pool_idx;
+		this_rq->wait_time = jiffies;
+		goto out;
+	} else
+		if (jiffies - this_rq->wait_time < (POOL_DELAY >> del_shift))
+			goto out;
 
-/*
- * pull_task - move a task from a remote runqueue to the local runqueue.
- * Both runqueues must be locked.
- */
-static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
-{
-	dequeue_task(p, src_array);
-	src_rq->nr_running--;
-	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
-	enqueue_task(p, this_rq->active);
-	/*
-	 * Note that idle threads have a prio of MAX_PRIO, for this test
-	 * to be always true for them.
-	 */
-	if (p->prio < this_rq->curr->prio)
-		set_need_resched();
+ check_out:
+	/* Enough imbalance in the remote cpu loads? */
+	if (!BALANCED(this_rq->load[0][best_cpu],*nr_running)) {
+		busiest = cpu_rq(best_cpu);
+		this_rq->wait_node = -1;
+	} else if (avg_load == -1)
+		/* only scanned local pool, so let's look at all of them */
+		goto scan_all;
+ out:
+	return busiest;
 }
 
 /*
- * Current runqueue is empty, or rebalance tick: if there is an
- * inbalance (current runqueue is too short) then pull from
- * busiest runqueue(s).
- *
- * We call this with the current runqueue locked,
- * irqs disabled.
+ * Find a task to steal from the busiest RQ. The busiest->lock must be held
+ * while calling this routine. 
  */
-static void load_balance(runqueue_t *this_rq, int idle)
+static inline task_t *task_to_steal(runqueue_t *busiest, int this_cpu)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
-	runqueue_t *busiest;
+	int idx;
+	task_t *next = NULL, *tmp;
 	prio_array_t *array;
 	struct list_head *head, *curr;
-	task_t *tmp;
+	int this_pool=cpu_to_node(this_cpu), weight, maxweight=0;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
-	if (!busiest)
-		goto out;
+	/*
+	 * We do not migrate tasks that are:
+	 * 1) running (obviously), or
+	 * 2) cannot be migrated to this CPU due to cpus_allowed.
+	 */
+
+#define CAN_MIGRATE_TASK(p,rq,this_cpu)	\
+		((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
+		p != rq->curr && \
+		 ((p)->cpus_allowed & (1UL<<(this_cpu))))
 
 	/*
 	 * We first consider expired tasks. Those will likely not be
@@ -772,7 +892,7 @@ skip_bitmap:
 			array = busiest->active;
 			goto new_array;
 		}
-		goto out_unlock;
+		goto out;
 	}
 
 	head = array->queue + idx;
@@ -780,33 +900,74 @@ skip_bitmap:
 skip_queue:
 	tmp = list_entry(curr, task_t, run_list);
 
+	if (CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) {
+		weight = (jiffies - tmp->sleep_timestamp)/cache_decay_ticks;
+		if (weight > maxweight) {
+			maxweight = weight;
+			next = tmp;
+		}
+	}
+	curr = curr->next;
+	if (curr != head)
+		goto skip_queue;
+	idx++;
+	goto skip_bitmap;
+
+ out:
+	return next;
+}
+
+/*
+ * pull_task - move a task from a remote runqueue to the local runqueue.
+ * Both runqueues must be locked.
+ */
+static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
+{
+	dequeue_task(p, src_array);
+	src_rq->nr_running--;
+	set_task_cpu(p, this_cpu);
+	this_rq->nr_running++;
+	enqueue_task(p, this_rq->active);
 	/*
-	 * We do not migrate tasks that are:
-	 * 1) running (obviously), or
-	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) are cache-hot on their current CPU.
+	 * Note that idle threads have a prio of MAX_PRIO, for this test
+	 * to be always true for them.
 	 */
+	if (p->prio < this_rq->curr->prio)
+		set_need_resched();
+}
 
-#define CAN_MIGRATE_TASK(p,rq,this_cpu)					\
-	((jiffies - (p)->sleep_timestamp > cache_decay_ticks) &&	\
-		!task_running(rq, p) &&					\
-			((p)->cpus_allowed & (1UL << (this_cpu))))
-
-	curr = curr->prev;
-
-	if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) {
-		if (curr != head)
-			goto skip_queue;
-		idx++;
-		goto skip_bitmap;
-	}
-	pull_task(busiest, array, tmp, this_rq, this_cpu);
-	if (!idle && --imbalance) {
-		if (curr != head)
-			goto skip_queue;
-		idx++;
-		goto skip_bitmap;
-	}
+/*
+ * Current runqueue is empty, or rebalance tick: if there is an
+ * inbalance (current runqueue is too short) then pull from
+ * busiest runqueue(s).
+ *
+ * We call this with the current runqueue locked,
+ * irqs disabled.
+ */
+static void load_balance(runqueue_t *this_rq, int idle)
+{
+	int nr_running, this_cpu = task_cpu(this_rq->curr);
+	task_t *tmp;
+	runqueue_t *busiest;
+
+	/* avoid deadlock by timer interrupt on own cpu */
+	if (atomic_read(&pool_lock)) return;
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &nr_running);
+	if (!busiest)
+		goto out;
+
+	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+	/*
+	 * Make sure nothing changed since we checked the
+	 * runqueue length.
+	 */
+	if (busiest->nr_running <= nr_running + 1)
+		goto out_unlock;
+
+	tmp = task_to_steal(busiest, this_cpu);
+	if (!tmp)
+		goto out_unlock;
+	pull_task(busiest, tmp->array, tmp, this_rq, this_cpu);
 out_unlock:
 	spin_unlock(&busiest->lock);
 out:
@@ -819,10 +980,10 @@ out:
  * frequency and balancing agressivity depends on whether the CPU is
  * idle or not.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
 
 static inline void idle_tick(runqueue_t *rq)
@@ -1920,6 +2081,8 @@ typedef struct {
 	struct list_head list;
 	task_t *task;
 	struct completion done;
+	int cpu_dest;
+	int sync;
 } migration_req_t;
 
 /*
@@ -1965,6 +2128,7 @@ void set_cpus_allowed(task_t *p, unsigne
 	}
 	init_completion(&req.done);
 	req.task = p;
+	req.sync = 1;
 	list_add(&req.list, &rq->migration_queue);
 	task_rq_unlock(rq, &flags);
 	wake_up_process(rq->migration_thread);
@@ -1974,6 +2138,17 @@ out:
 	preempt_enable();
 }
 
+void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	set_cpus_allowed(p, 1UL << dest_cpu);
+	set_cpus_allowed(p, old_mask);
+}
+
 /*
  * migration_thread - this is a highprio system thread that performs
  * thread migration by 'pulling' threads into the target runqueue.
@@ -2009,7 +2184,7 @@ static int migration_thread(void * data)
 	for (;;) {
 		runqueue_t *rq_src, *rq_dest;
 		struct list_head *head;
-		int cpu_src, cpu_dest;
+		int cpu_src, cpu_dest, sync;
 		migration_req_t *req;
 		unsigned long flags;
 		task_t *p;
@@ -2024,10 +2199,17 @@ static int migration_thread(void * data)
 		}
 		req = list_entry(head->next, migration_req_t, list);
 		list_del_init(head->next);
-		spin_unlock_irqrestore(&rq->lock, flags);
 
 		p = req->task;
-		cpu_dest = __ffs(p->cpus_allowed);
+		sync = req->sync;
+		if (sync)
+			cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+		else {
+			cpu_dest = req->cpu_dest;
+			req->task = NULL;
+		}
+		spin_unlock_irqrestore(&rq->lock, flags);
+
 		rq_dest = cpu_rq(cpu_dest);
 repeat:
 		cpu_src = task_cpu(p);
@@ -2050,7 +2232,8 @@ repeat:
 		double_rq_unlock(rq_src, rq_dest);
 		local_irq_restore(flags);
 
-		complete(&req->done);
+		if (sync)
+			complete(&req->done);
 	}
 }
 
@@ -2130,6 +2313,15 @@ void __init sched_init(void)
 			__set_bit(MAX_PRIO, array->bitmap);
 		}
 	}
+#ifdef CONFIG_NUMA_SCHED
+	pool_ptr[0] = 0;
+	pool_ptr[1] = NR_CPUS;
+
+	numpools = 1;
+	pool_mask[0] = -1L;
+	pool_nr_cpus[0] = NR_CPUS;
+#endif
+
 	/*
 	 * We have to do a little magic to get the first
 	 * thread right in SMP mode.

[-- Attachment #3: 02-numa_sched_ilb-2.5.39.patch --]
[-- Type: text/x-diff, Size: 2336 bytes --]

diff -urNp a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c	Sat Oct  5 17:22:45 2002
+++ b/fs/exec.c	Mon Oct  7 19:49:08 2002
@@ -993,6 +993,7 @@ int do_execve(char * filename, char ** a
 	int retval;
 	int i;
 
+	sched_balance_exec();
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -urNp a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	Mon Oct  7 19:46:52 2002
+++ b/kernel/sched.c	Mon Oct  7 20:01:56 2002
@@ -2138,6 +2135,78 @@ out:
 	preempt_enable();
 }
 
+#ifdef CONFIG_NUMA_SCHED
+/* used as counter for round-robin node-scheduling */
+static atomic_t sched_node=ATOMIC_INIT(0);
+
+/*
+ * Find the least loaded CPU on the current node of the task.
+ */
+static int sched_best_cpu(struct task_struct *p, int node)
+{
+	int n, cpu, load, best_cpu = task_cpu(p);
+	
+	load = 1000000;
+	loop_over_node(n,node) {
+		cpu = pool_cpu(n);
+		if (!(p->cpus_allowed & (1UL << cpu) & cpu_online_map))
+			continue;
+		if (cpu_rq(cpu)->nr_running < load) {
+			best_cpu = cpu;
+			load = cpu_rq(cpu)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+/*
+ * Find the node with fewest number of tasks running on it.
+ */
+static int sched_best_node(struct task_struct *p)
+{
+	int i, n, best_node=0, min_load, pool_load, min_pool=numa_node_id();
+	int pool, load;
+	unsigned long mask = p->cpus_allowed & cpu_online_map;
+
+	do {
+		best_node = atomic_inc_return(&sched_node) % numpools;
+	} while (!(pool_mask[best_node] & mask));
+
+	min_load = 100000000;
+	for (n = 0; n < numpools; n++) {
+		pool = (best_node + n) % numpools;
+		load = 0;
+		loop_over_node(i, pool) {
+			cpu=pool_cpu(i);
+			if (!cpu_online(cpu)) continue;
+			load += cpu_rq(cpu)->nr_running;
+		}
+		if (pool == numa_node_id()) load--;
+		pool_load = 100*load/pool_nr_cpus[pool];
+		if ((pool_load < min_load) && (pool_mask[pool] & mask)) {
+			min_load = pool_load;
+			min_pool = pool;
+		}
+	}
+	atomic_set(&sched_node, min_pool);
+	return min_pool;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu, new_node=0;
+
+	while (atomic_read(&pool_lock))
+		cpu_relax();
+	if (numpools > 1) {
+		new_node = sched_best_node(current, 0);
+	} 
+	new_cpu = sched_best_cpu(current, new_node);
+	if (new_cpu != smp_processor_id())
+		sched_migrate_task(current, new_cpu);
+}
+#endif
+
 void sched_migrate_task(task_t *p, int dest_cpu)
 {
 	unsigned long old_mask;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] NUMA schedulers benchmark results
  2002-10-07 16:37 ` [RFC] NUMA schedulers benchmark results Michael Hohnbaum
@ 2002-10-07 20:35   ` Erich Focht
  0 siblings, 0 replies; 24+ messages in thread
From: Erich Focht @ 2002-10-07 20:35 UTC (permalink / raw)
  To: Michael Hohnbaum; +Cc: Martin J. Bligh, Ingo Molnar, linux-kernel

Hi Michael,
On Monday 07 October 2002 18:37, Michael Hohnbaum wrote:
> These numbers look pretty good for your scheduler.  It would be good
> if you could post numbers against 2.5.x where x is the same for all
> tests.  I thought you had your scheduler working on 2.5.x recently.
I had it running under 2.5.35, but some of the important peripherals
of the machine didn't work with that and I hate doing experiments
when I only have the serial console. I'll get the node affine stuff
finished tomorrow and will try to find a testing slot (that's another
problem :-(  )

> Running numa_test on my systems I find a lot of variation (as much
> as 10%) between identical runs.  Are all of these results averaged
> over multiple runs?  Any idea why the hackbench times vary so widely?
No, they are not averaged. I often made 2-3 runs and they usually didn't
differ significantly (within 2-3% of average user time), so I just took
one. It would be cool to have a script for averaging several results;
I'll write one if I get some time, but then we can't look at the
distribution over the nodes, and that's what I find most useful for
understanding what's going on.

> Looking at the results for my scheduler, it is real obvious what the
> algorithm is that I use for selecting a cpu for a process.  This has
> not been quite as apparent on my systems, probably due to other tasks
> running in the background.  The code in your scheduler that causes a
> different node to start the search for a cpu really pays off here.
Yes, but it's not perfect, as it cannot predict how long the processes
will live. Also, it takes every running task into account, no matter
whether it's just a short-lived migration_thread or something else.
A running average might be better, but I have no experience with that.
Any ideas?
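
Just to make clear what kind of running average I have in mind (this is
only an assumption on my side, nothing like it is in the patches): an
exponentially decaying runqueue length, updated at every balancing tick,
for example:

#define LOAD_SHIFT 3	/* history weight: 7/8 old value, 1/8 new sample */

/* avg is kept in fixed point; avg >> LOAD_SHIFT approximates the
 * recent runqueue length */
static inline unsigned long update_avg_load(unsigned long avg,
					    unsigned long nr_running)
{
	avg -= avg >> LOAD_SHIFT;
	avg += nr_running;
	return avg;
}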

Regarding the hackbench times: it forks 20 tasks which each fork 20
tasks of their own, and these send messages to each other. Depending on
the load balancing (which is not really predictable) you can sometimes
end up with a bad distribution of senders and receivers.

Best regards,
Erich


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] pooling NUMA scheduler with initial load balancing
       [not found]     ` <1420721189.1034032091@[10.10.2.3]>
@ 2002-10-08 17:33       ` Erich Focht
  2002-10-08 19:44         ` Martin J. Bligh
  2002-10-09  1:15         ` Christoph Hellwig
  0 siblings, 2 replies; 24+ messages in thread
From: Erich Focht @ 2002-10-08 17:33 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 673 bytes --]

On Tuesday 08 October 2002 08:08, Martin J. Bligh wrote:
> I did:
>
> 1. remove asm/numa.h declaration (you need to work out something
> generic here ... mmzone.h should include it, I think?)
> 2. changed NR_NODES to MAX_NUMNODES everywhere.
> 3. Put in an extern declartion into fs/exec.c for sched_balance_exec
>
> Then I got bored around the linking problem. Does this compile for
> you somehow?

Aaargh, you got the wrong second patch :-( Sorry for that...

Thanks for the hints; I cleaned up the first patch, too. No
CONFIG_NUMA_SCHED any more, switched to MAX_NUMNODES, and asm/numa.h is
now included from asm/topology.h, so you don't need to see it.

Erich

[-- Attachment #2: 01-numa_sched_core6-2.5.39.patch --]
[-- Type: text/x-diff, Size: 19311 bytes --]

diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c	Fri Sep 27 23:49:54 2002
+++ b/arch/i386/kernel/smpboot.c	Tue Oct  8 11:37:56 2002
@@ -1194,6 +1194,11 @@ int __devinit __cpu_up(unsigned int cpu)
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	zap_low_mappings();
+#ifdef CONFIG_NUMA
+	pooldata_lock();
+	bld_pools();
+	pooldata_unlock();
+#endif
 }
 
 void __init smp_intr_init()
diff -urNp a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c	Tue Oct  8 10:57:10 2002
+++ b/arch/ia64/kernel/smpboot.c	Tue Oct  8 11:38:17 2002
@@ -397,7 +397,7 @@ unsigned long cache_decay_ticks;	/* # of
 static void
 smp_tune_scheduling (void)
 {
-	cache_decay_ticks = 10;	/* XXX base this on PAL info and cache-bandwidth estimate */
+	cache_decay_ticks = 8;	/* XXX base this on PAL info and cache-bandwidth estimate */
 
 	printk("task migration cache decay timeout: %ld msecs.\n",
 	       (cache_decay_ticks + 1) * 1000 / HZ);
@@ -508,6 +508,11 @@ smp_cpus_done (unsigned int dummy)
 
 	printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
 	       num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+#ifdef CONFIG_NUMA
+	pooldata_lock();
+	bld_pools();
+	pooldata_unlock();
+#endif
 }
 
 int __devinit
diff -urNp a/include/asm-i386/atomic.h b/include/asm-i386/atomic.h
--- a/include/asm-i386/atomic.h	Fri Sep 27 23:49:16 2002
+++ b/include/asm-i386/atomic.h	Tue Oct  8 10:59:13 2002
@@ -111,6 +111,18 @@ static __inline__ void atomic_inc(atomic
 }
 
 /**
+ * atomic_inc_return - increment atomic variable and return new value
+ * @v: pointer of type atomic_t
+ *
+ * Atomically increments @v by 1 and returns its new value.  Note that
+ * the guaranteed useful range of an atomic_t is only 24 bits.
+ */
+static inline int atomic_inc_return(atomic_t *v){
+	atomic_inc(v);
+	return v->counter;
+}
+
+/**
  * atomic_dec - decrement atomic variable
  * @v: pointer of type atomic_t
  * 
diff -urNp a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	Tue Oct  8 10:56:28 2002
+++ b/include/linux/sched.h	Tue Oct  8 11:30:21 2002
@@ -22,6 +22,7 @@ extern unsigned long event;
 #include <asm/mmu.h>
 
 #include <linux/smp.h>
+#include <asm/topology.h>
 #include <linux/sem.h>
 #include <linux/signal.h>
 #include <linux/securebits.h>
@@ -167,7 +168,6 @@ extern void update_one_process(struct ta
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
 
-
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
 asmlinkage void schedule(void);
@@ -457,6 +457,12 @@ extern void set_cpus_allowed(task_t *p, 
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void pooldata_lock(void);
+extern void pooldata_unlock(void);
+#endif
+extern void sched_migrate_task(task_t *p, int cpu);
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -urNp a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	Fri Sep 27 23:50:27 2002
+++ b/kernel/sched.c	Tue Oct  8 11:19:40 2002
@@ -154,6 +154,10 @@ struct runqueue {
 	task_t *migration_thread;
 	struct list_head migration_queue;
 
+	unsigned long wait_time;
+	int wait_node;
+	int load[2][NR_CPUS];
+
 } ____cacheline_aligned;
 
 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -174,6 +178,95 @@ static struct runqueue runqueues[NR_CPUS
 #endif
 
 /*
+ * Variables for describing and accessing processor pools. Using a
+ * compressed row format like notation. Processor pools are treated
+ * like logical node numbers.
+ *
+ * numpools: number of CPU pools (nodes),
+ * pool_cpus[]: CPUs in pools sorted by their pool ID,
+ * pool_ptr[pool]: index of first element in pool_cpus[] belonging to pool.
+ * pool_mask[]: cpu mask of a pool.
+ *
+ * Example: loop over all CPUs in a pool p:
+ * loop_over_node(i,node) {
+ *      cpu = pool_cpu(i);
+ *      ...
+ * }
+ *                                                      <efocht@ess.nec.de>
+ */
+int numpools, pool_ptr[MAX_NUMNODES+1], pool_cpus[NR_CPUS], pool_nr_cpus[MAX_NUMNODES];
+unsigned long pool_mask[MAX_NUMNODES];
+static atomic_t pool_lock = ATOMIC_INIT(0);
+#define cpu_to_node(cpu) __cpu_to_node(cpu)
+
+/* Avoid zeroes in integer divides for load calculations */
+#define BALANCE_FACTOR 100
+
+#ifdef CONFIG_NUMA
+#define POOL_DELAY     (100)
+#define loop_over_node(i,n) for(i=pool_ptr[n]; i<pool_ptr[n+1]; i++)
+#define pool_cpu(i) pool_cpus[i]
+
+void pooldata_lock(void)
+{
+	int i;
+ retry:
+	while (atomic_read(&pool_lock));
+	if (atomic_inc_return(&pool_lock) > 1) {
+		atomic_dec(&pool_lock);
+		goto retry;
+	}
+	/* 
+	 * Wait a while, any loops using pool data should finish
+	 * in between. This is VERY ugly and should be replaced
+	 * by some real RCU stuff. [EF]
+	 */
+	for (i=0; i<100; i++)
+		udelay(1000);
+}
+
+void pooldata_unlock(void)
+{
+	atomic_dec(&pool_lock);
+}
+
+/*
+ * Call pooldata_lock() before calling this function and
+ * pooldata_unlock() after!
+ */
+void bld_pools(void)
+{
+	int n, cpu, ptr, i;
+	unsigned long mask;
+
+	ptr=0;
+	for (n=0; n<numnodes; n++) {
+		mask = pool_mask[n] = __node_to_cpu_mask(n) & cpu_online_map;
+		pool_ptr[n] = ptr;
+		for (cpu=0; cpu<NR_CPUS; cpu++)
+			if (mask  & (1UL << cpu))
+				pool_cpu(ptr++) = cpu;
+		pool_nr_cpus[n] = ptr - pool_ptr[n];
+	}
+	numpools=numnodes;
+	pool_ptr[numpools]=ptr;
+	printk("CPU pools : %d\n",numpools);
+	for (n=0;n<numpools;n++) {
+		printk("pool %d :",n);
+		loop_over_node(i,n)
+			printk("%d ",pool_cpu(i));
+		printk("\n");
+	}
+}
+
+#else
+#define POOL_DELAY  (0)
+#define loop_over_node(i,n) for(i=0; i<NR_CPUS; i++)
+#define pool_cpu(i) (i)
+#endif
+
+
+/*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
  * explicitly disabling preemption.
@@ -632,121 +725,144 @@ static inline unsigned int double_lock_b
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * Calculate load of a CPU pool, store results in data[][NR_CPUS].
+ * Return the index of the most loaded runqueue.
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static int calc_pool_load(int data[][NR_CPUS], int this_cpu, int pool, int idle)
 {
-	int nr_running, load, max_load, i;
-	runqueue_t *busiest, *rq_src;
-
-	/*
-	 * We search all runqueues to find the most busy one.
-	 * We do this lockless to reduce cache-bouncing overhead,
-	 * we re-check the 'best' source CPU later on again, with
-	 * the lock held.
-	 *
-	 * We fend off statistical fluctuations in runqueue lengths by
-	 * saving the runqueue length during the previous load-balancing
-	 * operation and using the smaller one the current and saved lengths.
-	 * If a runqueue is long enough for a longer amount of time then
-	 * we recognize it and pull tasks from it.
-	 *
-	 * The 'current runqueue length' is a statistical maximum variable,
-	 * for that one we take the longer one - to avoid fluctuations in
-	 * the other direction. So for a load-balance to happen it needs
-	 * stable long runqueue on the target CPU and stable short runqueue
-	 * on the local runqueue.
-	 *
-	 * We make an exception if this CPU is about to become idle - in
-	 * that case we are less picky about moving a task across CPUs and
-	 * take what can be taken.
-	 */
-	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
-		nr_running = this_rq->nr_running;
-	else
-		nr_running = this_rq->prev_nr_running[this_cpu];
+	runqueue_t *rq_src, *this_rq = cpu_rq(this_cpu);
+	int i, ii, idx=-1, refload, load;
 
-	busiest = NULL;
-	max_load = 1;
-	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
-			continue;
+	data[1][pool] = 0;
+	refload = -1;
 
+	loop_over_node(ii, pool) {
+		i = pool_cpu(ii);
 		rq_src = cpu_rq(i);
 		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
 			load = rq_src->nr_running;
 		else
 			load = this_rq->prev_nr_running[i];
 		this_rq->prev_nr_running[i] = rq_src->nr_running;
-
-		if ((load > max_load) && (rq_src != this_rq)) {
-			busiest = rq_src;
-			max_load = load;
+		data[0][i] = load;
+		data[1][pool] += load;
+		if (load > refload) {
+			idx = i;
+			refload = load;
 		}
 	}
+	data[1][pool] = data[1][pool] * BALANCE_FACTOR / pool_nr_cpus[pool];
+	return idx;
+}
 
-	if (likely(!busiest))
-		goto out;
+/*
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their home node.
+ * This is done in two steps:
+ * 1. First try to find a runqueue within the own CPU pool (AKA node) with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU. Remote runqueues running tasks having their homenode on the
+ * current node are preferred (those tasks count twice in the load calculation).
+ * If the current load is far below the average try to steal a task from the
+ * most loaded node/cpu. Otherwise wait 100ms and give less loaded nodes the
+ * chance to approach the average load.
+ *
+ * This concept can be extended easilly to more than two levels (multi-level
+ * scheduler?), e.g.: CPU -> multi-core package -> node -> supernode...
+ *                                                         <efocht@ess.nec.de>
+ */
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu,
+					     int idle, int *nr_running)
+{
+	runqueue_t *busiest = NULL;
+	int imax, best_cpu, pool, max_pool_load, max_pool_idx;
+	int i, del_shift;
+	int avg_load=-1, this_pool = cpu_to_node(this_cpu);
+
+	/* Need at least ~25% imbalance to trigger balancing. */
+#define BALANCED(m,t) (((m) <= 1) || (((m) - (t))/2 < (((m) + (t))/2 + 3)/4))
+
+	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+		*nr_running = this_rq->nr_running;
+	else
+		*nr_running = this_rq->prev_nr_running[this_cpu];
 
-	*imbalance = (max_load - nr_running) / 2;
+	best_cpu = calc_pool_load(this_rq->load, this_cpu, this_pool, idle);
+	
+	if (best_cpu != this_cpu)
+		goto check_out;
 
-	/* It needs an at least ~25% imbalance to trigger balancing. */
-	if (!idle && (*imbalance < (max_load + 3)/4)) {
-		busiest = NULL;
+ scan_all:
+	best_cpu = -1;
+	max_pool_load = this_rq->load[1][this_pool];
+	max_pool_idx = this_pool;
+	avg_load = max_pool_load * pool_nr_cpus[this_pool];
+	for (i = 1; i < numpools; i++) {
+		pool = (i + this_pool) % numpools;
+		imax = calc_pool_load(this_rq->load, this_cpu, pool, idle);
+		avg_load += this_rq->load[1][pool]*pool_nr_cpus[pool];
+		if (this_rq->load[1][pool] > max_pool_load) {
+			max_pool_load = this_rq->load[1][pool];
+			max_pool_idx = pool;
+			best_cpu = imax;
+		}
+	}
+	/* Exit if not enough imbalance on any remote node. */
+	if ((best_cpu < 0) ||
+	    BALANCED(max_pool_load,this_rq->load[1][this_pool])) {
+		this_rq->wait_node = -1;
 		goto out;
 	}
+	avg_load /= num_online_cpus();
+	/* Wait longer before stealing if load is average. */
+	if (BALANCED(avg_load,this_rq->load[1][this_pool]))
+		del_shift = 0;
+	else
+		del_shift = 6;
 
-	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
-	/*
-	 * Make sure nothing changed since we checked the
-	 * runqueue length.
-	 */
-	if (busiest->nr_running <= nr_running + 1) {
-		spin_unlock(&busiest->lock);
-		busiest = NULL;
-	}
-out:
+	if (this_rq->wait_node != max_pool_idx) {
+		this_rq->wait_node = max_pool_idx;
+		this_rq->wait_time = jiffies;
+		goto out;
+	} else
+		if (jiffies - this_rq->wait_time < (POOL_DELAY >> del_shift))
+			goto out;
+
+ check_out:
+	/* Enough imbalance in the remote cpu loads? */
+	if (!BALANCED(this_rq->load[0][best_cpu],*nr_running)) {
+		busiest = cpu_rq(best_cpu);
+		this_rq->wait_node = -1;
+	} else if (avg_load == -1)
+		/* only scanned local pool, so let's look at all of them */
+		goto scan_all;
+ out:
 	return busiest;
 }
 
 /*
- * pull_task - move a task from a remote runqueue to the local runqueue.
- * Both runqueues must be locked.
+ * Find a task to steal from the busiest RQ. The busiest->lock must be held
+ * while calling this routine. 
  */
-static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
+static inline task_t *task_to_steal(runqueue_t *busiest, int this_cpu)
 {
-	dequeue_task(p, src_array);
-	src_rq->nr_running--;
-	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
-	enqueue_task(p, this_rq->active);
-	/*
-	 * Note that idle threads have a prio of MAX_PRIO, for this test
-	 * to be always true for them.
-	 */
-	if (p->prio < this_rq->curr->prio)
-		set_need_resched();
-}
-
-/*
- * Current runqueue is empty, or rebalance tick: if there is an
- * inbalance (current runqueue is too short) then pull from
- * busiest runqueue(s).
- *
- * We call this with the current runqueue locked,
- * irqs disabled.
- */
-static void load_balance(runqueue_t *this_rq, int idle)
-{
-	int imbalance, idx, this_cpu = smp_processor_id();
-	runqueue_t *busiest;
+	int idx;
+	task_t *next = NULL, *tmp;
 	prio_array_t *array;
 	struct list_head *head, *curr;
-	task_t *tmp;
+	int weight, maxweight=0;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
-	if (!busiest)
-		goto out;
+	/*
+	 * We do not migrate tasks that are:
+	 * 1) running (obviously), or
+	 * 2) cannot be migrated to this CPU due to cpus_allowed.
+	 */
+
+#define CAN_MIGRATE_TASK(p,rq,this_cpu)	\
+		((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
+		p != rq->curr && \
+		 ((p)->cpus_allowed & (1UL<<(this_cpu))))
 
 	/*
 	 * We first consider expired tasks. Those will likely not be
@@ -772,7 +888,7 @@ skip_bitmap:
 			array = busiest->active;
 			goto new_array;
 		}
-		goto out_unlock;
+		goto out;
 	}
 
 	head = array->queue + idx;
@@ -780,33 +896,74 @@ skip_bitmap:
 skip_queue:
 	tmp = list_entry(curr, task_t, run_list);
 
+	if (CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) {
+		weight = (jiffies - tmp->sleep_timestamp)/cache_decay_ticks;
+		if (weight > maxweight) {
+			maxweight = weight;
+			next = tmp;
+		}
+	}
+	curr = curr->next;
+	if (curr != head)
+		goto skip_queue;
+	idx++;
+	goto skip_bitmap;
+
+ out:
+	return next;
+}
+
+/*
+ * pull_task - move a task from a remote runqueue to the local runqueue.
+ * Both runqueues must be locked.
+ */
+static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
+{
+	dequeue_task(p, src_array);
+	src_rq->nr_running--;
+	set_task_cpu(p, this_cpu);
+	this_rq->nr_running++;
+	enqueue_task(p, this_rq->active);
 	/*
-	 * We do not migrate tasks that are:
-	 * 1) running (obviously), or
-	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) are cache-hot on their current CPU.
+	 * Note that idle threads have a prio of MAX_PRIO, for this test
+	 * to be always true for them.
 	 */
+	if (p->prio < this_rq->curr->prio)
+		set_need_resched();
+}
 
-#define CAN_MIGRATE_TASK(p,rq,this_cpu)					\
-	((jiffies - (p)->sleep_timestamp > cache_decay_ticks) &&	\
-		!task_running(rq, p) &&					\
-			((p)->cpus_allowed & (1UL << (this_cpu))))
-
-	curr = curr->prev;
-
-	if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) {
-		if (curr != head)
-			goto skip_queue;
-		idx++;
-		goto skip_bitmap;
-	}
-	pull_task(busiest, array, tmp, this_rq, this_cpu);
-	if (!idle && --imbalance) {
-		if (curr != head)
-			goto skip_queue;
-		idx++;
-		goto skip_bitmap;
-	}
+/*
+ * Current runqueue is empty, or rebalance tick: if there is an
+ * inbalance (current runqueue is too short) then pull from
+ * busiest runqueue(s).
+ *
+ * We call this with the current runqueue locked,
+ * irqs disabled.
+ */
+static void load_balance(runqueue_t *this_rq, int idle)
+{
+	int nr_running, this_cpu = task_cpu(this_rq->curr);
+	task_t *tmp;
+	runqueue_t *busiest;
+
+	/* avoid deadlock by timer interrupt on own cpu */
+	if (atomic_read(&pool_lock)) return;
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &nr_running);
+	if (!busiest)
+		goto out;
+
+	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+	/*
+	 * Make sure nothing changed since we checked the
+	 * runqueue length.
+	 */
+	if (busiest->nr_running <= nr_running + 1)
+		goto out_unlock;
+
+	tmp = task_to_steal(busiest, this_cpu);
+	if (!tmp)
+		goto out_unlock;
+	pull_task(busiest, tmp->array, tmp, this_rq, this_cpu);
 out_unlock:
 	spin_unlock(&busiest->lock);
 out:
@@ -819,10 +976,10 @@ out:
  * frequency and balancing agressivity depends on whether the CPU is
  * idle or not.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
 
 static inline void idle_tick(runqueue_t *rq)
@@ -1920,6 +2077,8 @@ typedef struct {
 	struct list_head list;
 	task_t *task;
 	struct completion done;
+	int cpu_dest;
+	int sync;
 } migration_req_t;
 
 /*
@@ -1965,6 +2124,7 @@ void set_cpus_allowed(task_t *p, unsigne
 	}
 	init_completion(&req.done);
 	req.task = p;
+	req.sync = 1;
 	list_add(&req.list, &rq->migration_queue);
 	task_rq_unlock(rq, &flags);
 	wake_up_process(rq->migration_thread);
@@ -1974,6 +2134,17 @@ out:
 	preempt_enable();
 }
 
+void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	set_cpus_allowed(p, 1UL << dest_cpu);
+	set_cpus_allowed(p, old_mask);
+}
+
 /*
  * migration_thread - this is a highprio system thread that performs
  * thread migration by 'pulling' threads into the target runqueue.
@@ -2009,7 +2180,7 @@ static int migration_thread(void * data)
 	for (;;) {
 		runqueue_t *rq_src, *rq_dest;
 		struct list_head *head;
-		int cpu_src, cpu_dest;
+		int cpu_src, cpu_dest, sync;
 		migration_req_t *req;
 		unsigned long flags;
 		task_t *p;
@@ -2024,10 +2195,17 @@ static int migration_thread(void * data)
 		}
 		req = list_entry(head->next, migration_req_t, list);
 		list_del_init(head->next);
-		spin_unlock_irqrestore(&rq->lock, flags);
 
 		p = req->task;
-		cpu_dest = __ffs(p->cpus_allowed);
+		sync = req->sync;
+		if (sync)
+			cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+		else {
+			cpu_dest = req->cpu_dest;
+			req->task = NULL;
+		}
+		spin_unlock_irqrestore(&rq->lock, flags);
+
 		rq_dest = cpu_rq(cpu_dest);
 repeat:
 		cpu_src = task_cpu(p);
@@ -2050,7 +2228,8 @@ repeat:
 		double_rq_unlock(rq_src, rq_dest);
 		local_irq_restore(flags);
 
-		complete(&req->done);
+		if (sync)
+			complete(&req->done);
 	}
 }
 
@@ -2130,6 +2309,15 @@ void __init sched_init(void)
 			__set_bit(MAX_PRIO, array->bitmap);
 		}
 	}
+#ifdef CONFIG_NUMA
+	pool_ptr[0] = 0;
+	pool_ptr[1] = NR_CPUS;
+
+	numpools = 1;
+	pool_mask[0] = -1L;
+	pool_nr_cpus[0] = NR_CPUS;
+#endif
+
 	/*
 	 * We have to do a little magic to get the first
 	 * thread right in SMP mode.

[-- Attachment #3: 02-numa_sched_ilb-2.5.39.patch --]
[-- Type: text/x-diff, Size: 2790 bytes --]

diff -urNp a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c	Tue Oct  8 15:03:54 2002
+++ b/fs/exec.c	Tue Oct  8 15:08:22 2002
@@ -993,6 +993,7 @@ int do_execve(char * filename, char ** a
 	int retval;
 	int i;
 
+	sched_balance_exec();
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -urNp a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	Tue Oct  8 15:07:48 2002
+++ b/include/linux/sched.h	Tue Oct  8 15:09:27 2002
@@ -460,6 +460,9 @@ extern void set_cpus_allowed(task_t *p, 
 #ifdef CONFIG_NUMA
 extern void pooldata_lock(void);
 extern void pooldata_unlock(void);
+extern void sched_balance_exec(void);
+#else
+#define sched_balance_exec() {}
 #endif
 extern void sched_migrate_task(task_t *p, int cpu);
 
diff -urNp a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	Tue Oct  8 15:07:48 2002
+++ b/kernel/sched.c	Tue Oct  8 15:08:22 2002
@@ -2134,6 +2134,78 @@ out:
 	preempt_enable();
 }
 
+#ifdef CONFIG_NUMA
+/* used as counter for round-robin node-scheduling */
+static atomic_t sched_node=ATOMIC_INIT(0);
+
+/*
+ * Find the least loaded CPU on the current node of the task.
+ */
+static int sched_best_cpu(struct task_struct *p, int node)
+{
+	int n, cpu, load, best_cpu = task_cpu(p);
+	
+	load = 1000000;
+	loop_over_node(n,node) {
+		cpu = pool_cpu(n);
+		if (!(p->cpus_allowed & (1UL << cpu) & cpu_online_map))
+			continue;
+		if (cpu_rq(cpu)->nr_running < load) {
+			best_cpu = cpu;
+			load = cpu_rq(cpu)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+/*
+ * Find the node with fewest number of tasks running on it.
+ */
+static int sched_best_node(struct task_struct *p)
+{
+	int i, n, best_node=0, min_load, pool_load, min_pool=numa_node_id();
+	int cpu, pool, load;
+	unsigned long mask = p->cpus_allowed & cpu_online_map;
+
+	do {
+		best_node = atomic_inc_return(&sched_node) % numpools;
+	} while (!(pool_mask[best_node] & mask));
+
+	min_load = 100000000;
+	for (n = 0; n < numpools; n++) {
+		pool = (best_node + n) % numpools;
+		load = 0;
+		loop_over_node(i, pool) {
+			cpu=pool_cpu(i);
+			if (!cpu_online(cpu)) continue;
+			load += cpu_rq(cpu)->nr_running;
+		}
+		if (pool == numa_node_id()) load--;
+		pool_load = 100*load/pool_nr_cpus[pool];
+		if ((pool_load < min_load) && (pool_mask[pool] & mask)) {
+			min_load = pool_load;
+			min_pool = pool;
+		}
+	}
+	atomic_set(&sched_node, min_pool);
+	return min_pool;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu, new_node=0;
+
+	while (atomic_read(&pool_lock))
+		cpu_relax();
+	if (numpools > 1) {
+		new_node = sched_best_node(current);
+	} 
+	new_cpu = sched_best_cpu(current, new_node);
+	if (new_cpu != smp_processor_id())
+		sched_migrate_task(current, new_cpu);
+}
+#endif
+
 void sched_migrate_task(task_t *p, int dest_cpu)
 {
 	unsigned long old_mask;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-08 17:33       ` Erich Focht
@ 2002-10-08 19:44         ` Martin J. Bligh
  2002-10-09 16:26           ` Erich Focht
  2002-10-09  1:15         ` Christoph Hellwig
  1 sibling, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2002-10-08 19:44 UTC (permalink / raw)
  To: Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

> Aaargh, you got the wrong second patch :-( Sorry for that...

No problem
 
> Thanks for the hints, I cleaned up the first patch, too. No
> CONFIG_NUMA_SCHED any more, switched to MAX_NUMNODES, including
> asm/numa.h from asm/topology.h, so no need for you to see it.

Cool. Well it compiles now, but:

CPU 3 IS NOW UP!
Starting migration thread for cpu 3
Bringing up 4
CP4 4ivi e e UPr: 0000
ing
igrU:i   4
eaIP:    p060:[<c0114231>]    Not tainted
EFLAGS: 00010046
EIP is at calc_pool_load+0x109/0x120
eax: 00000000   ebx: 00000001   ecx: c034f380   edx: 00000000
esi: c028efa0   edi: 00000020   ebp: c3a37efc   esp: c3a37ed4
ds: 0068   es: 0068   ss: 0068
Process swapper (pid: 0, threadinfo=c3a36000 task=f01bf060)
Stack: c028f948 c028efa0 f01bf060 00002928 00002944 00002944 ffffffff ffffffff 
       c028efa0 ffffffff c3a37f78 c01142d5 c028f948 00000004 00000001 00000001 
       c3a36000 c028efa0 f01bf060 00000010 00000008 00000000 00000001 00002a13 
Call Trace:
 [<c01142d5>] load_balance+0x8d/0x5c4
 [<c0118a8d>] call_console_drivers+0xdd/0xe4
 [<c0118cc2>] release_console_sem+0x42/0xa4
 [<c0114c21>] schedule+0xd5/0x3ac
 [<c0105300>] default_idle+0x0/0x34
 [<c01053bf>] cpu_idle+0x43/0x48
 [<c0118cc2>] release_console_sem+0x42/0xa4
 [<c0118c1d>] printk+0x125/0x140

Code: f7 3c 99 8b 55 08 89 84 9a 80 00 00 00 8b 45 f4 5b 5e 5f 89 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-08 17:33       ` Erich Focht
  2002-10-08 19:44         ` Martin J. Bligh
@ 2002-10-09  1:15         ` Christoph Hellwig
  2002-10-09 10:29           ` Erich Focht
  1 sibling, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2002-10-09  1:15 UTC (permalink / raw)
  To: Erich Focht; +Cc: Martin J. Bligh, Michael Hohnbaum, Ingo Molnar, linux-kernel

On Tue, Oct 08, 2002 at 07:33:06PM +0200, Erich Focht wrote:
> Aaargh, you got the wrong second patch :-( Sorry for that...
> 
> Thanks for the hints, I cleaned up the first patch, too. No
> CONFIG_NUMA_SCHED any more, switched to MAX_NUMNODES, including
> asm/numa.h from asm/topology.h, so no need for you to see it.


diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c	Fri Sep 27 23:49:54 2002
+++ b/arch/i386/kernel/smpboot.c	Tue Oct  8 11:37:56 2002
@@ -1194,6 +1194,11 @@ int __devinit __cpu_up(unsigned int cpu)
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	zap_low_mappings();
+#ifdef CONFIG_NUMA
+	pooldata_lock();
+	bld_pools();
+	pooldata_unlock();
+#endif

All callers of bld_pools() need the pooldata lock - taking
it inside that function makes the code a little more readable..
Also I'd suggest to rename bld_pools() to build_pools() ;)

-	cache_decay_ticks = 10;	/* XXX base this on PAL info and cache-bandwidth estimate */
+	cache_decay_ticks = 8;	/* XXX base this on PAL info and cache-bandwidth estimate */

Could you explain this change?  And its effect on non-NUMA IA64
machines?

 /**
+ * atomic_inc_return - increment atomic variable and return new value
+ * @v: pointer of type atomic_t
+ *
+ * Atomically increments @v by 1 and return it's new value.  Note that
+ * the guaranteed useful range of an atomic_t is only 24 bits.
+ */
+static inline int atomic_inc_return(atomic_t *v){
+	atomic_inc(v);
+	return v->counter;
+}

How do you guarantee that this is atomic?  Please make it fit
Documentation/CodingStyle, btw..

+int numpools, pool_ptr[MAX_NUMNODES+1], pool_cpus[NR_CPUS], pool_nr_cpus[MAX_NUMNODES];
+unsigned long pool_mask[MAX_NUMNODES];

Hmm, shouldn't those [MAX_NUMNODES] arrays be in some per-node array
to avoid cacheline-bouncing?

+void pooldata_lock(void)
+{
+	int i;
+ retry:
+	while (atomic_read(&pool_lock));
+	if (atomic_inc_return(&pool_lock) > 1) {
+		atomic_dec(&pool_lock);
+		goto retry;
+	}

Why not a simple spin_lock()?

+	/* 
+	 * Wait a while, any loops using pool data should finish
+	 * in between. This is VERY ugly and should be replaced
+	 * by some real RCU stuff. [EF]
+	 */
+	for (i=0; i<100; i++)
+		udelay(1000);

Urgg.  I'd suggest you switch to RCU now and make your patch apply
on top of it - another reason to apply the RCU core patch..

+void pooldata_unlock(void)
+{
+	atomic_dec(&pool_lock);
+}

Ditto for spin_unlock.

+	/* avoid deadlock by timer interrupt on own cpu */
+	if (atomic_read(&pool_lock)) return;

spin_trylock..

All in all your code doesn't seem to be very cacheline-friendly,
lots of global bouncing.  Do you have any numbers on what your
patch changes for normal SMP configurations?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-09  1:15         ` Christoph Hellwig
@ 2002-10-09 10:29           ` Erich Focht
  0 siblings, 0 replies; 24+ messages in thread
From: Erich Focht @ 2002-10-09 10:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin J. Bligh, Michael Hohnbaum, Ingo Molnar, linux-kernel

Hi Christoph,

thanks for looking at this and the comments. Replies are below.

On Wednesday 09 October 2002 03:15, Christoph Hellwig wrote:
> On Tue, Oct 08, 2002 at 07:33:06PM +0200, Erich Focht wrote:
> diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
> --- a/arch/i386/kernel/smpboot.c	Fri Sep 27 23:49:54 2002
> +++ b/arch/i386/kernel/smpboot.c	Tue Oct  8 11:37:56 2002
> @@ -1194,6 +1194,11 @@ int __devinit __cpu_up(unsigned int cpu)
>  void __init smp_cpus_done(unsigned int max_cpus)
>  {
>  	zap_low_mappings();
> +#ifdef CONFIG_NUMA
> +	pooldata_lock();
> +	bld_pools();
> +	pooldata_unlock();
> +#endif
>
> All callers of bld_pools() need the pooldata lock - taking
> it inside that function makes the code a little more readable..
> Also I'd suggest to rename bld_pools() to build_pools() ;)
The separation is due to the needs of CPU hot-add. When integrating
with Kimi's CPU hotplug patch within the Atlas project, we found that
there was more to do in the scheduler blocked phase when adding one CPU
to the running system (besides recomputing the pools). Looking too far
into the future here...

> -	cache_decay_ticks = 10;	/* XXX base this on PAL info and cache-bandwidth estimate */
> +	cache_decay_ticks = 8;	/* XXX base this on PAL info and cache-bandwidth estimate */
>
> Could you explain this change?  And its effect on non-NUMA IA64
> machines?

At some stage I looked up the minimal timeslice and found it to be 10.
This means that when you have two tasks on a runqueue and zero on
another, you might see both tasks as absolutely cache-hot and be unable
to steal either of them. cache_decay_ticks is anyway a very rough
approximation. With the task_to_steal() selection from the patch this
change should have no significant impact, because we make a ranking
and select the cache-coolest task which slept more than 8ms (e.g. with
cache_decay_ticks = 8, candidates idle for 25 and 9 ticks get weights 3
and 1, so the first one wins). The default O(1) behavior is to select
the first task which slept more than 10ms, without the ranking.

>  /**
> + * atomic_inc_return - increment atomic variable and return new value
> + * @v: pointer of type atomic_t
> + *
> + * Atomically increments @v by 1 and return it's new value.  Note that
> + * the guaranteed useful range of an atomic_t is only 24 bits.
> + */
> +static inline int atomic_inc_return(atomic_t *v){
> +	atomic_inc(v);
> +	return v->counter;
> +}
>
> How do you guarantee that this is atomic?  Please make it fit
> Documentation/CodingStyle, btw..

This code comes from a port of the node affine scheduler done by
somebody from IBM for the 2.4.18 version. On IA64 we have a truly
atomic primitive for this operation. It's a quick hack for now; I'm not
good at ia32 assembler and don't know how to replace it.
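
Something along the following lines might work (untested guess on my
side): a locked xadd should make the increment and the returned value a
single atomic operation on 486+ CPUs, reusing the LOCK macro defined
earlier in asm-i386/atomic.h:

static inline int atomic_inc_return(atomic_t *v)
{
	int i = 1;

	/* xaddl adds i into v->counter and leaves the old counter value in i */
	__asm__ __volatile__(
		LOCK "xaddl %0, %1"
		: "=r" (i)
		: "m" (v->counter), "0" (i));
	return i + 1;
}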

> +int numpools, pool_ptr[MAX_NUMNODES+1], pool_cpus[NR_CPUS], pool_nr_cpus[MAX_NUMNODES];
> +unsigned long pool_mask[MAX_NUMNODES];
>
> Hmm, shouldn't those [MAX_NUMNODES] arrays be in some per-node array
> to avoid cacheline-bouncing?

They could be replicated but I don't think this is necessary. They
populate cachelines on each CPU. As these arrays are never changed,
there is no need for bouncing them. The cachelines on every CPU are
only invalidated when some CPU changes these arrays, which just doesn't
happen after boot (except when you add a CPU, which you'll do rarely).

> +void pooldata_lock(void)
> +{
> +	int i;
> + retry:
> +	while (atomic_read(&pool_lock));
> +	if (atomic_inc_return(&pool_lock) > 1) {
> +		atomic_dec(&pool_lock);
> +		goto retry;
> +	}
>
> Why not a simple spin_lock()?

Actually we need something like write_lock and read_lock. But read_lock
counts the readers, therefore it changes the lock and we get
cacheline bouncing.

A spinlock here would be OK. Then the readers would need to do something
like "while (spin_is_locked(&pool_lock));". This only reads the lock
and (if nobody changes it) avoids cacheline bouncing. We still need to
check that all readers went out of the section where the pool data is
accessed (the following ugly block)...

> +	/*
> +	 * Wait a while, any loops using pool data should finish
> +	 * in between. This is VERY ugly and should be replaced
> +	 * by some real RCU stuff. [EF]
> +	 */
> +	for (i=0; i<100; i++)
> +		udelay(1000);
>
> Urgg.  I'd suggest you switch to RCU now and make your patch apply
> ontop of it - another reason to apply the RCU core patch..

I agree (see the comment). I didn't replace it by RCU because RCU is
not in the mainline and it is easier to develop with an ugly but portable
replacement. But add this to the list of potential users of RCU.

> +void pooldata_unlock(void)
> +{
> +	atomic_dec(&pool_lock);
> +}
>
> Dito for spin_unlock.
>
> +	/* avoid deadlock by timer interrupt on own cpu */
> +	if (atomic_read(&pool_lock)) return;
>
> spin_trylock..
No! No trylock here, that would change the lock and lead to cacheline
bouncing! We don't need the lock, just make sure that the pool data is
not being changed right now.
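
A minimal sketch of what I mean on the read side (hypothetical helper
name, assuming the spinlock version of pool_lock from the updated
patch):

/*
 * Readers only read the lock word, so its cacheline can stay shared in
 * all caches; load_balance() simply skips the round while the pools
 * are being rebuilt.
 */
static inline int pooldata_is_stable(void)
{
	return !spin_is_locked(&pool_lock);
}

/* e.g. at the top of load_balance():
 *	if (!pooldata_is_stable())
 *		return;
 */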

> All in all your code doesn't seem to be very cachelign-friendly,
> lots of global bouncing.  Do you have any numbers on what your
> patch changes for normal SMP configurations?

Once again: once the cacheline holding pool_lock is in the cache, I
don't expect any bouncing. The value of that lock almost never changes
(except in the boot phase and when you want to hot-add/remove a CPU),
therefore there is no reason for cacheline bouncing. The same applies
to the pool variables.

As you see, I thought a lot about cacheline bouncing and tried to avoid
it. I'll replace the pool_lock with a spinlock, as the atomic_inc_return
for ia32 isn't really atomic, you're absolutely right here. As the
load balancing code is called pretty rarely (at most every 1ms, normally
every 250ms per CPU), additional cycles invested here are OK if the
benefit for NUMA is reasonable. If CONFIG_NUMA is undefined, the code
should be optimized by the compiler such that it looks like normal SMP
code. Still need some changes to get there, but that's not the big
problem.

Regards,
Erich


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-08 19:44         ` Martin J. Bligh
@ 2002-10-09 16:26           ` Erich Focht
  2002-10-09 17:33             ` Martin J. Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Erich Focht @ 2002-10-09 16:26 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1903 bytes --]

On Tuesday 08 October 2002 21:44, Martin J. Bligh wrote:
> > Thanks for the hints, I cleaned up the first patch, too. No
> > CONFIG_NUMA_SCHED any more, switched to MAX_NUMNODES, including
> > asm/numa.h from asm/topology.h, so no need for you to see it.
>
> Cool. Well it compiles now, but:
>
> CPU 3 IS NOW UP!
> Starting migration thread for cpu 3
> Bringing up 4
> CP4 4ivi e e UPr: 0000
> ing
> igrU:i   4
> eaIP:    p060:[<c0114231>]    Not tainted
> EFLAGS: 00010046
> EIP is at calc_pool_load+0x109/0x120

This is strange. It works for me really reliably. I added a check
for non-online CPUs in calc_pool_load and changed the pool_lock to
be a spinlock. The patches are attached again.

Results so far (all based on 2.5.39 + ia64 + discontig, measured on
16 CPU Azusa), averaged over 4 measurements. The statistics show that
we need more measurements to get something meaningful.

                     2.5.39          numa_sched+ilb
numa_test 4 4
AverageUserTime    31.18+- 0.67     31.25+- 2.51
    ElapsedTime    41.59+- 1.95     38.55+- 4.79
  TotalUserTime   124.78+- 2.71    125.08+-10.03

numa_test 8 4
AverageUserTime    30.84+- 0.51     28.94+- 0.04
    ElapsedTime    45.02+- 1.54     30.00+- 0.24
  TotalUserTime   246.82+- 4.02    231.66+- 0.34

numa_test 16 4
AverageUserTime    33.61+- 0.72     32.15+- 0.33
    ElapsedTime    47.16+- 0.95     37.45+- 3.20
  TotalUserTime   538.05+-11.58    514.72+- 5.25

numa_test 32 4
AverageUserTime    37.94+- 0.74     33.36+- 0.02
    ElapsedTime    84.83+- 1.28     68.78+- 0.18
  TotalUserTime  1214.37+-23.64   1068.06+- 0.53

numa_test 64 4
AverageUserTime    39.58+- 1.34     33.64+- 0.09
    ElapsedTime   168.04+- 4.77    142.84+- 5.44
  TotalUserTime  2533.82+-85.96   2153.69+- 5.58

Regards,
Erich

[-- Attachment #2: 01-numa_sched_core7-2.5.39.patch --]
[-- Type: text/x-diff, Size: 19262 bytes --]

diff -urNp a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c	Fri Sep 27 23:49:54 2002
+++ b/arch/i386/kernel/smpboot.c	Tue Oct  8 15:07:48 2002
@@ -1194,6 +1194,11 @@ int __devinit __cpu_up(unsigned int cpu)
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	zap_low_mappings();
+#ifdef CONFIG_NUMA
+	pooldata_lock();
+	bld_pools();
+	pooldata_unlock();
+#endif
 }
 
 void __init smp_intr_init()
diff -urNp a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c	Tue Oct  8 15:04:54 2002
+++ b/arch/ia64/kernel/smpboot.c	Tue Oct  8 15:07:48 2002
@@ -397,7 +397,7 @@ unsigned long cache_decay_ticks;	/* # of
 static void
 smp_tune_scheduling (void)
 {
-	cache_decay_ticks = 10;	/* XXX base this on PAL info and cache-bandwidth estimate */
+	cache_decay_ticks = 8;	/* XXX base this on PAL info and cache-bandwidth estimate */
 
 	printk("task migration cache decay timeout: %ld msecs.\n",
 	       (cache_decay_ticks + 1) * 1000 / HZ);
@@ -508,6 +508,11 @@ smp_cpus_done (unsigned int dummy)
 
 	printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
 	       num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+#ifdef CONFIG_NUMA
+	pooldata_lock();
+	bld_pools();
+	pooldata_unlock();
+#endif
 }
 
 int __devinit
diff -urNp a/include/asm-i386/atomic.h b/include/asm-i386/atomic.h
--- a/include/asm-i386/atomic.h	Fri Sep 27 23:49:16 2002
+++ b/include/asm-i386/atomic.h	Tue Oct  8 15:07:48 2002
@@ -111,6 +111,18 @@ static __inline__ void atomic_inc(atomic
 }
 
 /**
+ * atomic_inc_return - increment atomic variable and return new value
+ * @v: pointer of type atomic_t
+ *
+ * Atomically increments @v by 1 and return it's new value.  Note that
+ * the guaranteed useful range of an atomic_t is only 24 bits.
+ */
+static inline int atomic_inc_return(atomic_t *v){
+	atomic_inc(v);
+	return v->counter;
+}
+
+/**
  * atomic_dec - decrement atomic variable
  * @v: pointer of type atomic_t
  * 
diff -urNp a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	Tue Oct  8 15:03:54 2002
+++ b/include/linux/sched.h	Tue Oct  8 15:07:48 2002
@@ -22,6 +22,7 @@ extern unsigned long event;
 #include <asm/mmu.h>
 
 #include <linux/smp.h>
+#include <asm/topology.h>
 #include <linux/sem.h>
 #include <linux/signal.h>
 #include <linux/securebits.h>
@@ -167,7 +168,6 @@ extern void update_one_process(struct ta
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
 
-
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
 asmlinkage void schedule(void);
@@ -457,6 +457,12 @@ extern void set_cpus_allowed(task_t *p, 
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void pooldata_lock(void);
+extern void pooldata_unlock(void);
+#endif
+extern void sched_migrate_task(task_t *p, int cpu);
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -urNp a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	Fri Sep 27 23:50:27 2002
+++ b/kernel/sched.c	Wed Oct  9 17:06:04 2002
@@ -154,6 +154,10 @@ struct runqueue {
 	task_t *migration_thread;
 	struct list_head migration_queue;
 
+	unsigned long wait_time;
+	int wait_node;
+	int load[2][NR_CPUS];
+
 } ____cacheline_aligned;
 
 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -174,6 +178,90 @@ static struct runqueue runqueues[NR_CPUS
 #endif
 
 /*
+ * Variables for describing and accessing processor pools. Using a
+ * compressed row format like notation. Processor pools are treated
+ * like logical node numbers.
+ *
+ * numpools: number of CPU pools (nodes),
+ * pool_cpus[]: CPUs in pools sorted by their pool ID,
+ * pool_ptr[pool]: index of first element in pool_cpus[] belonging to pool.
+ * pool_mask[]: cpu mask of a pool.
+ *
+ * Example: loop over all CPUs in a pool p:
+ * loop_over_node(i,node) {
+ *      cpu = pool_cpu(i);
+ *      ...
+ * }
+ *                                                      <efocht@ess.nec.de>
+ */
+int numpools, pool_ptr[MAX_NUMNODES+1], pool_cpus[NR_CPUS], pool_nr_cpus[MAX_NUMNODES];
+unsigned long pool_mask[MAX_NUMNODES];
+static spinlock_t pool_lock = SPIN_LOCK_UNLOCKED;
+#define cpu_to_node(cpu) __cpu_to_node(cpu)
+
+/* Avoid zeroes in integer divides for load calculations */
+#define BALANCE_FACTOR 100
+
+#ifdef CONFIG_NUMA
+#define POOL_DELAY     (100)
+#define loop_over_node(i,n) for(i=pool_ptr[n]; i<pool_ptr[n+1]; i++)
+#define pool_cpu(i) pool_cpus[i]
+
+void pooldata_lock(void)
+{
+	int i;
+	spin_lock(&pool_lock);
+	/* 
+	 * Wait a while, any loops using pool data should finish
+	 * in between. This is VERY ugly and should be replaced
+	 * by some real RCU stuff. [EF]
+	 */
+	for (i=0; i<100; i++)
+		udelay(1000);
+}
+
+void pooldata_unlock(void)
+{
+	spin_unlock(&pool_lock);
+}
+
+/*
+ * Call pooldata_lock() before calling this function and
+ * pooldata_unlock() after!
+ */
+void bld_pools(void)
+{
+	int n, cpu, ptr, i;
+	unsigned long mask;
+
+	ptr=0;
+	for (n=0; n<numnodes; n++) {
+		mask = pool_mask[n] = __node_to_cpu_mask(n) & cpu_online_map;
+		pool_ptr[n] = ptr;
+		for (cpu=0; cpu<NR_CPUS; cpu++)
+			if (mask  & (1UL << cpu))
+				pool_cpu(ptr++) = cpu;
+		pool_nr_cpus[n] = ptr - pool_ptr[n];
+	}
+	numpools=numnodes;
+	pool_ptr[numpools]=ptr;
+	printk("CPU pools : %d\n",numpools);
+	for (n=0;n<numpools;n++) {
+		printk("pool %d :",n);
+		loop_over_node(i,n)
+			printk("%d ",pool_cpu(i));
+		printk("\n");
+	}
+}
+
+#else
+#define POOL_DELAY  (0)
+#define loop_over_node(i,n) for(i=0; i<NR_CPUS; i++)
+#define pool_cpu(i) (i)
+#endif
+
+
+/*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
  * explicitly disabling preemption.
@@ -632,121 +720,145 @@ static inline unsigned int double_lock_b
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * Calculate load of a CPU pool, store results in data[][NR_CPUS].
+ * Return the index of the most loaded runqueue.
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static int calc_pool_load(int data[][NR_CPUS], int this_cpu, int pool, int idle)
 {
-	int nr_running, load, max_load, i;
-	runqueue_t *busiest, *rq_src;
+	runqueue_t *rq_src, *this_rq = cpu_rq(this_cpu);
+	int i, ii, idx=-1, refload, load;
 
-	/*
-	 * We search all runqueues to find the most busy one.
-	 * We do this lockless to reduce cache-bouncing overhead,
-	 * we re-check the 'best' source CPU later on again, with
-	 * the lock held.
-	 *
-	 * We fend off statistical fluctuations in runqueue lengths by
-	 * saving the runqueue length during the previous load-balancing
-	 * operation and using the smaller one the current and saved lengths.
-	 * If a runqueue is long enough for a longer amount of time then
-	 * we recognize it and pull tasks from it.
-	 *
-	 * The 'current runqueue length' is a statistical maximum variable,
-	 * for that one we take the longer one - to avoid fluctuations in
-	 * the other direction. So for a load-balance to happen it needs
-	 * stable long runqueue on the target CPU and stable short runqueue
-	 * on the local runqueue.
-	 *
-	 * We make an exception if this CPU is about to become idle - in
-	 * that case we are less picky about moving a task across CPUs and
-	 * take what can be taken.
-	 */
-	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
-		nr_running = this_rq->nr_running;
-	else
-		nr_running = this_rq->prev_nr_running[this_cpu];
-
-	busiest = NULL;
-	max_load = 1;
-	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
-			continue;
+	data[1][pool] = 0;
+	refload = -1;
 
+	loop_over_node(ii, pool) {
+		i = pool_cpu(ii);
+		if (!cpu_online(i)) continue;
 		rq_src = cpu_rq(i);
 		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
 			load = rq_src->nr_running;
 		else
 			load = this_rq->prev_nr_running[i];
 		this_rq->prev_nr_running[i] = rq_src->nr_running;
-
-		if ((load > max_load) && (rq_src != this_rq)) {
-			busiest = rq_src;
-			max_load = load;
+		data[0][i] = load;
+		data[1][pool] += load;
+		if (load > refload) {
+			idx = i;
+			refload = load;
 		}
 	}
+	data[1][pool] = data[1][pool] * BALANCE_FACTOR / pool_nr_cpus[pool];
+	return idx;
+}
 
-	if (likely(!busiest))
-		goto out;
+/*
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their home node.
+ * This is done in two steps:
+ * 1. First try to find a runqueue within the own CPU pool (AKA node) with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU. Remote runqueues running tasks having their homenode on the
+ * current node are preferred (those tasks count twice in the load calculation).
+ * If the current load is far below the average try to steal a task from the
+ * most loaded node/cpu. Otherwise wait 100ms and give less loaded nodes the
+ * chance to approach the average load.
+ *
+ * This concept can be extended easilly to more than two levels (multi-level
+ * scheduler?), e.g.: CPU -> multi-core package -> node -> supernode...
+ *                                                         <efocht@ess.nec.de>
+ */
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu,
+					     int idle, int *nr_running)
+{
+	runqueue_t *busiest = NULL;
+	int imax, best_cpu, pool, max_pool_load, max_pool_idx;
+	int i, del_shift;
+	int avg_load=-1, this_pool = cpu_to_node(this_cpu);
+
+	/* Need at least ~25% imbalance to trigger balancing. */
+#define BALANCED(m,t) (((m) <= 1) || (((m) - (t))/2 < (((m) + (t))/2 + 3)/4))
+
+	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+		*nr_running = this_rq->nr_running;
+	else
+		*nr_running = this_rq->prev_nr_running[this_cpu];
 
-	*imbalance = (max_load - nr_running) / 2;
+	best_cpu = calc_pool_load(this_rq->load, this_cpu, this_pool, idle);
+	
+	if (best_cpu != this_cpu)
+		goto check_out;
 
-	/* It needs an at least ~25% imbalance to trigger balancing. */
-	if (!idle && (*imbalance < (max_load + 3)/4)) {
-		busiest = NULL;
+ scan_all:
+	best_cpu = -1;
+	max_pool_load = this_rq->load[1][this_pool];
+	max_pool_idx = this_pool;
+	avg_load = max_pool_load * pool_nr_cpus[this_pool];
+	for (i = 1; i < numpools; i++) {
+		pool = (i + this_pool) % numpools;
+		imax = calc_pool_load(this_rq->load, this_cpu, pool, idle);
+		avg_load += this_rq->load[1][pool]*pool_nr_cpus[pool];
+		if (this_rq->load[1][pool] > max_pool_load) {
+			max_pool_load = this_rq->load[1][pool];
+			max_pool_idx = pool;
+			best_cpu = imax;
+		}
+	}
+	/* Exit if not enough imbalance on any remote node. */
+	if ((best_cpu < 0) ||
+	    BALANCED(max_pool_load,this_rq->load[1][this_pool])) {
+		this_rq->wait_node = -1;
 		goto out;
 	}
+	avg_load /= num_online_cpus();
+	/* Wait longer before stealing if load is average. */
+	if (BALANCED(avg_load,this_rq->load[1][this_pool]))
+		del_shift = 0;
+	else
+		del_shift = 6;
 
-	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
-	/*
-	 * Make sure nothing changed since we checked the
-	 * runqueue length.
-	 */
-	if (busiest->nr_running <= nr_running + 1) {
-		spin_unlock(&busiest->lock);
-		busiest = NULL;
-	}
-out:
-	return busiest;
-}
+	if (this_rq->wait_node != max_pool_idx) {
+		this_rq->wait_node = max_pool_idx;
+		this_rq->wait_time = jiffies;
+		goto out;
+	} else
+		if (jiffies - this_rq->wait_time < (POOL_DELAY >> del_shift))
+			goto out;
 
-/*
- * pull_task - move a task from a remote runqueue to the local runqueue.
- * Both runqueues must be locked.
- */
-static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
-{
-	dequeue_task(p, src_array);
-	src_rq->nr_running--;
-	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
-	enqueue_task(p, this_rq->active);
-	/*
-	 * Note that idle threads have a prio of MAX_PRIO, for this test
-	 * to be always true for them.
-	 */
-	if (p->prio < this_rq->curr->prio)
-		set_need_resched();
+ check_out:
+	/* Enough imbalance in the remote cpu loads? */
+	if (!BALANCED(this_rq->load[0][best_cpu],*nr_running)) {
+		busiest = cpu_rq(best_cpu);
+		this_rq->wait_node = -1;
+	} else if (avg_load == -1)
+		/* only scanned local pool, so let's look at all of them */
+		goto scan_all;
+ out:
+	return busiest;
 }
 
 /*
- * Current runqueue is empty, or rebalance tick: if there is an
- * inbalance (current runqueue is too short) then pull from
- * busiest runqueue(s).
- *
- * We call this with the current runqueue locked,
- * irqs disabled.
+ * Find a task to steal from the busiest RQ. The busiest->lock must be held
+ * while calling this routine. 
  */
-static void load_balance(runqueue_t *this_rq, int idle)
+static inline task_t *task_to_steal(runqueue_t *busiest, int this_cpu)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
-	runqueue_t *busiest;
+	int idx;
+	task_t *next = NULL, *tmp;
 	prio_array_t *array;
 	struct list_head *head, *curr;
-	task_t *tmp;
+	int weight, maxweight=0;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
-	if (!busiest)
-		goto out;
+	/*
+	 * We do not migrate tasks that are:
+	 * 1) running (obviously), or
+	 * 2) cannot be migrated to this CPU due to cpus_allowed.
+	 */
+
+#define CAN_MIGRATE_TASK(p,rq,this_cpu)	\
+		((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
+		p != rq->curr && \
+		 ((p)->cpus_allowed & (1UL<<(this_cpu))))
 
 	/*
 	 * We first consider expired tasks. Those will likely not be
@@ -772,7 +884,7 @@ skip_bitmap:
 			array = busiest->active;
 			goto new_array;
 		}
-		goto out_unlock;
+		goto out;
 	}
 
 	head = array->queue + idx;
@@ -780,33 +892,74 @@ skip_bitmap:
 skip_queue:
 	tmp = list_entry(curr, task_t, run_list);
 
+	if (CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) {
+		weight = (jiffies - tmp->sleep_timestamp)/cache_decay_ticks;
+		if (weight > maxweight) {
+			maxweight = weight;
+			next = tmp;
+		}
+	}
+	curr = curr->next;
+	if (curr != head)
+		goto skip_queue;
+	idx++;
+	goto skip_bitmap;
+
+ out:
+	return next;
+}
+
+/*
+ * pull_task - move a task from a remote runqueue to the local runqueue.
+ * Both runqueues must be locked.
+ */
+static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
+{
+	dequeue_task(p, src_array);
+	src_rq->nr_running--;
+	set_task_cpu(p, this_cpu);
+	this_rq->nr_running++;
+	enqueue_task(p, this_rq->active);
 	/*
-	 * We do not migrate tasks that are:
-	 * 1) running (obviously), or
-	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) are cache-hot on their current CPU.
+	 * Note that idle threads have a prio of MAX_PRIO, for this test
+	 * to be always true for them.
 	 */
+	if (p->prio < this_rq->curr->prio)
+		set_need_resched();
+}
 
-#define CAN_MIGRATE_TASK(p,rq,this_cpu)					\
-	((jiffies - (p)->sleep_timestamp > cache_decay_ticks) &&	\
-		!task_running(rq, p) &&					\
-			((p)->cpus_allowed & (1UL << (this_cpu))))
-
-	curr = curr->prev;
-
-	if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) {
-		if (curr != head)
-			goto skip_queue;
-		idx++;
-		goto skip_bitmap;
-	}
-	pull_task(busiest, array, tmp, this_rq, this_cpu);
-	if (!idle && --imbalance) {
-		if (curr != head)
-			goto skip_queue;
-		idx++;
-		goto skip_bitmap;
-	}
+/*
+ * Current runqueue is empty, or rebalance tick: if there is an
+ * inbalance (current runqueue is too short) then pull from
+ * busiest runqueue(s).
+ *
+ * We call this with the current runqueue locked,
+ * irqs disabled.
+ */
+static void load_balance(runqueue_t *this_rq, int idle)
+{
+	int nr_running, this_cpu = task_cpu(this_rq->curr);
+	task_t *tmp;
+	runqueue_t *busiest;
+
+	/* avoid deadlock by timer interrupt on own cpu */
+	if (spin_is_locked(&pool_lock)) return;
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &nr_running);
+	if (!busiest)
+		goto out;
+
+	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+	/*
+	 * Make sure nothing changed since we checked the
+	 * runqueue length.
+	 */
+	if (busiest->nr_running <= nr_running + 1)
+		goto out_unlock;
+
+	tmp = task_to_steal(busiest, this_cpu);
+	if (!tmp)
+		goto out_unlock;
+	pull_task(busiest, tmp->array, tmp, this_rq, this_cpu);
 out_unlock:
 	spin_unlock(&busiest->lock);
 out:
@@ -819,10 +972,10 @@ out:
  * frequency and balancing agressivity depends on whether the CPU is
  * idle or not.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
 
 static inline void idle_tick(runqueue_t *rq)
@@ -1920,6 +2073,8 @@ typedef struct {
 	struct list_head list;
 	task_t *task;
 	struct completion done;
+	int cpu_dest;
+	int sync;
 } migration_req_t;
 
 /*
@@ -1965,6 +2120,7 @@ void set_cpus_allowed(task_t *p, unsigne
 	}
 	init_completion(&req.done);
 	req.task = p;
+	req.sync = 1;
 	list_add(&req.list, &rq->migration_queue);
 	task_rq_unlock(rq, &flags);
 	wake_up_process(rq->migration_thread);
@@ -1974,6 +2130,17 @@ out:
 	preempt_enable();
 }
 
+void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	set_cpus_allowed(p, 1UL << dest_cpu);
+	set_cpus_allowed(p, old_mask);
+}
+
 /*
  * migration_thread - this is a highprio system thread that performs
  * thread migration by 'pulling' threads into the target runqueue.
@@ -2009,7 +2176,7 @@ static int migration_thread(void * data)
 	for (;;) {
 		runqueue_t *rq_src, *rq_dest;
 		struct list_head *head;
-		int cpu_src, cpu_dest;
+		int cpu_src, cpu_dest, sync;
 		migration_req_t *req;
 		unsigned long flags;
 		task_t *p;
@@ -2024,10 +2191,17 @@ static int migration_thread(void * data)
 		}
 		req = list_entry(head->next, migration_req_t, list);
 		list_del_init(head->next);
-		spin_unlock_irqrestore(&rq->lock, flags);
 
 		p = req->task;
-		cpu_dest = __ffs(p->cpus_allowed);
+		sync = req->sync;
+		if (sync)
+			cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+		else {
+			cpu_dest = req->cpu_dest;
+			req->task = NULL;
+		}
+		spin_unlock_irqrestore(&rq->lock, flags);
+
 		rq_dest = cpu_rq(cpu_dest);
 repeat:
 		cpu_src = task_cpu(p);
@@ -2050,7 +2224,8 @@ repeat:
 		double_rq_unlock(rq_src, rq_dest);
 		local_irq_restore(flags);
 
-		complete(&req->done);
+		if (sync)
+			complete(&req->done);
 	}
 }
 
@@ -2130,6 +2305,15 @@ void __init sched_init(void)
 			__set_bit(MAX_PRIO, array->bitmap);
 		}
 	}
+#ifdef CONFIG_NUMA
+	pool_ptr[0] = 0;
+	pool_ptr[1] = NR_CPUS;
+
+	numpools = 1;
+	pool_mask[0] = -1L;
+	pool_nr_cpus[0] = NR_CPUS;
+#endif
+
 	/*
 	 * We have to do a little magic to get the first
 	 * thread right in SMP mode.

[-- Attachment #3: 02-numa_sched_ilb-2.5.39.patch --]
[-- Type: text/x-diff, Size: 2793 bytes --]

diff -urNp a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c	Tue Oct  8 15:03:54 2002
+++ b/fs/exec.c	Tue Oct  8 15:08:22 2002
@@ -993,6 +993,7 @@ int do_execve(char * filename, char ** a
 	int retval;
 	int i;
 
+	sched_balance_exec();
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -urNp a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	Tue Oct  8 15:07:48 2002
+++ b/include/linux/sched.h	Tue Oct  8 15:09:27 2002
@@ -460,6 +460,9 @@ extern void set_cpus_allowed(task_t *p, 
 #ifdef CONFIG_NUMA
 extern void pooldata_lock(void);
 extern void pooldata_unlock(void);
+extern void sched_balance_exec(void);
+#else
+#define sched_balance_exec() {}
 #endif
 extern void sched_migrate_task(task_t *p, int cpu);
 
diff -urNp a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	Tue Oct  8 15:07:48 2002
+++ b/kernel/sched.c	Tue Oct  8 15:08:22 2002
@@ -2134,6 +2134,78 @@ out:
 	preempt_enable();
 }
 
+#ifdef CONFIG_NUMA
+/* used as counter for round-robin node-scheduling */
+static atomic_t sched_node=ATOMIC_INIT(0);
+
+/*
+ * Find the least loaded CPU on the current node of the task.
+ */
+static int sched_best_cpu(struct task_struct *p, int node)
+{
+	int n, cpu, load, best_cpu = task_cpu(p);
+	
+	load = 1000000;
+	loop_over_node(n,node) {
+		cpu = pool_cpu(n);
+		if (!(p->cpus_allowed & (1UL << cpu) & cpu_online_map))
+			continue;
+		if (cpu_rq(cpu)->nr_running < load) {
+			best_cpu = cpu;
+			load = cpu_rq(cpu)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+/*
+ * Find the node with fewest number of tasks running on it.
+ */
+static int sched_best_node(struct task_struct *p)
+{
+	int i, n, best_node=0, min_load, pool_load, min_pool=numa_node_id();
+	int cpu, pool, load;
+	unsigned long mask = p->cpus_allowed & cpu_online_map;
+
+	do {
+		best_node = atomic_inc_return(&sched_node) % numpools;
+	} while (!(pool_mask[best_node] & mask));
+
+	min_load = 100000000;
+	for (n = 0; n < numpools; n++) {
+		pool = (best_node + n) % numpools;
+		load = 0;
+		loop_over_node(i, pool) {
+			cpu=pool_cpu(i);
+			if (!cpu_online(cpu)) continue;
+			load += cpu_rq(cpu)->nr_running;
+		}
+		if (pool == numa_node_id()) load--;
+		pool_load = 100*load/pool_nr_cpus[pool];
+		if ((pool_load < min_load) && (pool_mask[pool] & mask)) {
+			min_load = pool_load;
+			min_pool = pool;
+		}
+	}
+	atomic_set(&sched_node, min_pool);
+	return min_pool;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu, new_node=0;
+
+	while (spin_is_locked(&pool_lock))
+		cpu_relax();
+	if (numpools > 1) {
+		new_node = sched_best_node(current);
+	} 
+	new_cpu = sched_best_cpu(current, new_node);
+	if (new_cpu != smp_processor_id())
+		sched_migrate_task(current, new_cpu);
+}
+#endif
+
 void sched_migrate_task(task_t *p, int dest_cpu)
 {
 	unsigned long old_mask;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-09 16:26           ` Erich Focht
@ 2002-10-09 17:33             ` Martin J. Bligh
  2002-10-09 17:58               ` Andrew Theurer
  0 siblings, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2002-10-09 17:33 UTC (permalink / raw)
  To: Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

> This is strange. It works for me really reliably. I added a check
> for non-online CPUs in calc_pool_load and changed the pool_lock to
> be a spinlock. The patches are attached again.

I'm testing on 2.5.41-mm1 ... your patches still apply clean. That
has a whole bunch of nice NUMA mods, you might want to try that 
instead? All the changes in there will end up in mainline real soon
anyway ;-)

One minor warning:

arch/i386/kernel/smpboot.c: In function `smp_cpus_done':
arch/i386/kernel/smpboot.c:1199: warning: implicit declaration of function `bld_pools'
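
Presumably just a missing prototype; a hypothetical one-liner that would
silence it (placement assumed, next to the other pool declarations in
include/linux/sched.h):

#ifdef CONFIG_NUMA
extern void pooldata_lock(void);
extern void pooldata_unlock(void);
extern void bld_pools(void);	/* assumed addition: not in the posted patch */
#endif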

And the same panic:

Starting migration thread for cpu 3
Bringing up 4
CPU>dividNOWrro!
                 0St0
igrU:    4
ead :or  0060:[<c011422d>]    Not tainted
EFLAGS: 00010046
EIP is at calc_pool_load+0x115/0x12c
eax: 00000000   ebx: 00000001   ecx: c028f948   edx: 00000000
esi: c034f380   edi: 00000020   ebp: c3a37efc   esp: c3a37ed4
ds: 0068   es: 0068   ss: 0068
Process swapper (pid: 0, threadinfo=c3a36000 task=f01bf060)
Stack: c028f948 c028efa0 f01bf060 00002928 00002944 00002944 ffffffff ffffffff 
       c028efa0 ffffffff c3a37f78 c01142d1 c028f948 00000004 00000001 00000001 
       c3a36000 c028efa0 f01bf060 00000010 00000008 00000000 00000001 00002a13 
Call Trace:
 [<c01142d1>] load_balance+0x8d/0x5c8
 [<c0118a9d>] call_console_drivers+0xdd/0xe4
 [<c0118cd2>] release_console_sem+0x42/0xa4
 [<c0114c21>] schedule+0xd5/0x3ac
 [<c0105300>] default_idle+0x0/0x34
 [<c01053bf>] cpu_idle+0x43/0x48
 [<c0118cd2>] release_console_sem+0x42/0xa4
 [<c0118c2d>] printk+0x125/0x140

Code: f7 3c 9e 89 84 99 80 00 00 00 8b 45 f4 5b 5e 5f 89 ec 5d c3 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-09 17:33             ` Martin J. Bligh
@ 2002-10-09 17:58               ` Andrew Theurer
  2002-10-09 18:13                 ` Andrew Theurer
  2002-10-09 23:02                 ` Erich Focht
  0 siblings, 2 replies; 24+ messages in thread
From: Andrew Theurer @ 2002-10-09 17:58 UTC (permalink / raw)
  To: Martin J. Bligh, Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

> I'm testing on 2.5.41-mm1 ... your patches still apply clean. That
> has a whole bunch of nice NUMA mods, you might want to try that
> instead? All the changes in there will end up in mainline real soon
> anyway ;-)
>
> One minor warning:
>
> arch/i386/kernel/smpboot.c: In function `smp_cpus_done':
> arch/i386/kernel/smpboot.c:1199: warning: implicit declaration of function
> `bld_pools'
>
> And the same panic:
>
> Starting migration thread for cpu 3
> Bringing up 4
> CPU>dividNOWrro!

I got the same thing on 2.5.40-mm1.  It looks like it may be a divide by 
zero in calc_pool_load.  I am attempting to boot a band-aid version right 
now (roughly the guard sketched after the trace below).  OK, got a little further:

PCI: Cannot allocate resource region 1 of device 03:0a.0
PCI: Cannot allocate resource region 4 of device 03:0e.1
PCI: Cannot allocate resource region 2 of device 05:0a.0
PCI: Cannot allocate resource region 0 of device 05:0b.0
PCI: Cannot allocate resource region 4 of device 05:0e.1
PCI: Cannot allocate resource region 1 of device 07:0a.0
PCI: Cannot allocate resource region 0 of device 07:0b.0
PCI: Cannot allocate resource region 4 of device 07:0e.1
Starting kswapd
d<4>highmem bounce poo
sCPU:    3
eEIP:    0060:[<c011a3d3>]    Not tainted
EFLAGS: 00010046
eax: 00000001   ebx: 00000000   ecx: 00000003   edx: 00000000
esi: c02c98c4   edi: c39c5880   ebp: f0199ee8   esp: f0199ec0
ds: 0068   es: 0068   ss: 0068
Process swapper (pid: 0, threadinfo=f0198000 task=f01c1740)
Stack: c39c58a0 c02c94c8 c02c993c 00000000 c02c94c4 00000000 0000007d c02c94a0 
       00000001 00000003 f0199f1c c0117bec c02c94a0 00000003 00000003 00000001 
       00000000 c01135d1 f01a3f10 00000000 00000003 f01c1740 c02cb4e0 f0199f44 
Call Trace: [<c0117bec>] [<c01135d1>] [<c0117eed>] [<c0122587>] [<c0105420>] 
[<c0125e60>] [<c0113ca7>] [<c0105420>] [<c010818e>] [<c0105420>] [<c0105420>] 
[<c010544a>] [<c01054ea>] [<c011cf50>]
Code: f7 f3 3b 45 e4 7e 06 89 45 e4 89 7d ec 8b 45 d8 8b 00 39 f0 
 
-Andrew Theurer




* Re: [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-09 17:58               ` Andrew Theurer
@ 2002-10-09 18:13                 ` Andrew Theurer
  2002-10-09 23:02                 ` Erich Focht
  1 sibling, 0 replies; 24+ messages in thread
From: Andrew Theurer @ 2002-10-09 18:13 UTC (permalink / raw)
  To: Martin J. Bligh, Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

On Wednesday 09 October 2002 12:58 pm, Andrew Theurer wrote:
> > I'm testing on 2.5.41-mm1 ... your patches still apply clean. That
> > has a whole bunch of nice NUMA mods, you might want to try that
> > instead? All the changes in there will end up in mainline real soon
> > anyway ;-)
> >
> > One minor warning:
> >
> > arch/i386/kernel/smpboot.c: In function `smp_cpus_done':
> > arch/i386/kernel/smpboot.c:1199: warning: implicit declaration of
> > function `bld_pools'
> >
> > And the same panic:
> >
> > Starting migration thread for cpu 3
> > Bringing up 4
> > CPU>dividNOWrro!
>
> I got the same thing on 2.5.40-mm1.  It looks like it may be a divide by
> zero in calc_pool_load.  I am attempting to boot a band-aid version right
> now.  OK, got a little further:
>
A little more detail:
Code: f7 f3 3b 45 e4 7e 06 89 45 e4 89 7d ec 8b 45 d8 8b 00 39 f0 
EFLAGS: 00010046
eax: 00000001   ebx: 00000000   ecx: 00000003   edx: 00000000
esi: c02c98c4   edi: c39c5880   ebp: f0199ee8   esp: f0199ec0
ds: 0068   es: 0068   ss: 0068
Stack: c39c58a0 c02c94c8 c02c993c 00000000 c02c94c4 00000000 0000007d c02c94a0 
       00000001 00000003 f0199f1c c0117bec c02c94a0 00000003 00000003 00000001 
       00000000 c01135d1 f01a3f10 00000000 00000003 f01c1740 c02cb4e0 f0199f44 
Call Trace: [<c0117bec>] [<c01135d1>] [<c0117eed>] [<c0122587>] [<c0105420>] 
[<c0125e60>] [<c0113ca7>] [<c0105420>] [<c010818e>] [<c0105420>] [<c0105420>] 
[<c010544a>] [<c01054ea>] [<c011cf50>]
Code: f7 f3 3b 45 e4 7e 06 89 45 e4 89 7d ec 8b 45 d8 8b 00 39 f0 
Using defaults from ksymoops -t elf32-i386 -a i386


>>esi; c02c98c4 <runqueues+424/15800>
>>edi; c39c5880 <END_OF_CODE+36419c4/????>
>>ebp; f0199ee8 <END_OF_CODE+2fe1602c/????>
>>esp; f0199ec0 <END_OF_CODE+2fe16004/????>

Trace; c0117bec <load_balance+8c/140>
Trace; c01135d1 <smp_call_function_interrupt+41/90>
Trace; c0117eed <scheduler_tick+24d/390>
Trace; c0122587 <tasklet_hi_action+77/c0>
Trace; c0105420 <default_idle+0/50>
Trace; c0125e60 <update_process_times+40/50>
Trace; c0113ca7 <smp_apic_timer_interrupt+117/120>
Trace; c0105420 <default_idle+0/50>
Trace; c010818e <apic_timer_interrupt+1a/20>
Trace; c0105420 <default_idle+0/50>
Trace; c0105420 <default_idle+0/50>
Trace; c010544a <default_idle+2a/50>
Trace; c01054ea <cpu_idle+3a/50>
Trace; c011cf50 <printk+140/180>

Code;  00000000 Before first symbol
00000000 <_EIP>:
Code;  00000000 Before first symbol
   0:   f7 f3                     div    %ebx
Code;  00000002 Before first symbol
   2:   3b 45 e4                  cmp    0xffffffe4(%ebp),%eax
Code;  00000005 Before first symbol
   5:   7e 06                     jle    d <_EIP+0xd> 0000000d Before first symbol
Code;  00000007 Before first symbol
   7:   89 45 e4                  mov    %eax,0xffffffe4(%ebp)
Code;  0000000a Before first symbol
   a:   89 7d ec                  mov    %edi,0xffffffec(%ebp)
Code;  0000000d Before first symbol
   d:   8b 45 d8                  mov    0xffffffd8(%ebp),%eax
Code;  00000010 Before first symbol
  10:   8b 00                     mov    (%eax),%eax
Code;  00000012 Before first symbol
  12:   39 f0                     cmp    %esi,%eax






* Re: [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-09 17:58               ` Andrew Theurer
  2002-10-09 18:13                 ` Andrew Theurer
@ 2002-10-09 23:02                 ` Erich Focht
  2002-10-10 17:34                   ` Andrew Theurer
  1 sibling, 1 reply; 24+ messages in thread
From: Erich Focht @ 2002-10-09 23:02 UTC (permalink / raw)
  To: Andrew Theurer, Martin J. Bligh, Michael Hohnbaum
  Cc: Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1031 bytes --]

> > Starting migration thread for cpu 3
> > Bringing up 4
> > CPU>dividNOWrro!
>
> I got the same thing on 2.5.40-mm1.  It looks like it may be a divide by
> zero in calc_pool_load.  I am attempting to boot a band-aid version right
> now.  OK, got a little further:

This opened my eyes, thanks for all your help and patience!!!

The problem is that the load balancer is called before the CPU pools
are set up. That's fine, I thought, because in sched_init I define the
default pool 0 to include all CPUs. But in find_busiest_queue(),
cpu_to_node(this_cpu) returns a non-zero pool which is not set up yet,
so pool_nr_cpus[pool] = 0 and we get a divide by zero.
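
To make the failure mode concrete, here is a minimal sketch; it is not
the actual calc_pool_load() code, and the helper name pool_avg_load()
is made up for illustration:

        extern int pool_nr_cpus[];      /* per-pool CPU counts */

        /* pool_nr_cpus[] stays all-zero until bld_pools() has run, so a
         * load-balance call that arrives earlier divides by zero here. */
        static int pool_avg_load(int pool, int total_load)
        {
                return total_load / pool_nr_cpus[pool];
        }

Forcing this_pool to 0 while numpools == 1, as the attached patch does,
keeps the early balancing inside the one pool that is guaranteed to be
set up.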

I'm still wondering why this doesn't happen on our architecture. Maybe
the interrupts are disabled longer; I'll check. Anyway, a fix is to
force this_pool to be 0 as long as numpools == 1. The attached patch is
a quick, untested hack; maybe one can do it better. It has to be applied
on top of the other two.

Going to sleep now...

Bye,
Erich

[-- Attachment #2: 01a-numa_sched_fix-2.5.39.patch --]
[-- Type: text/x-diff, Size: 1299 bytes --]

diff -urNp 2.5.39-disc-ns/kernel/sched.c 2.5.39-disc-ns8/kernel/sched.c
--- 2.5.39-disc-ns/kernel/sched.c	Wed Oct  9 17:06:04 2002
+++ 2.5.39-disc-ns8/kernel/sched.c	Thu Oct 10 00:51:20 2002
@@ -774,7 +774,7 @@ static inline runqueue_t *find_busiest_q
 	runqueue_t *busiest = NULL;
 	int imax, best_cpu, pool, max_pool_load, max_pool_idx;
 	int i, del_shift;
-	int avg_load=-1, this_pool = cpu_to_node(this_cpu);
+	int avg_load=-1, this_pool;
 
 	/* Need at least ~25% imbalance to trigger balancing. */
 #define BALANCED(m,t) (((m) <= 1) || (((m) - (t))/2 < (((m) + (t))/2 + 3)/4))
@@ -784,10 +784,13 @@ static inline runqueue_t *find_busiest_q
 	else
 		*nr_running = this_rq->prev_nr_running[this_cpu];
 
+	this_pool = (numpools == 1 ? 0 : cpu_to_node(this_cpu));
 	best_cpu = calc_pool_load(this_rq->load, this_cpu, this_pool, idle);
 	
 	if (best_cpu != this_cpu)
 		goto check_out;
+	else if (numpools == 1)
+		goto out;
 
  scan_all:
 	best_cpu = -1;
@@ -830,7 +833,7 @@ static inline runqueue_t *find_busiest_q
 	if (!BALANCED(this_rq->load[0][best_cpu],*nr_running)) {
 		busiest = cpu_rq(best_cpu);
 		this_rq->wait_node = -1;
-	} else if (avg_load == -1)
+	} else if (avg_load == -1 && numpools > 1)
 		/* only scanned local pool, so let's look at all of them */
 		goto scan_all;
  out:


* Re: [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-09 23:02                 ` Erich Focht
@ 2002-10-10 17:34                   ` Andrew Theurer
       [not found]                     ` <200210110947.11714.efocht@ess.nec.de>
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Theurer @ 2002-10-10 17:34 UTC (permalink / raw)
  To: Erich Focht, Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

On Wednesday 09 October 2002 6:02 pm, Erich Focht wrote:
> > > Starting migration thread for cpu 3
> > > Bringing up 4
> > > CPU>dividNOWrro!
> >
> > I got the same thing on 2.5.40-mm1.  It looks like it may be a divide
> > by zero in calc_pool_load.  I am attempting to boot a band-aid version
> > right now.  OK, got a little further:
>
> This opened my eyes, thanks for all your help and patience!!!
>
> The problem is that the load balancer is called before the CPU pools
> are set up. That's fine, I thought, because in sched_init I define the
> default pool 0 to include all CPUs. But in find_busiest_queue(),
> cpu_to_node(this_cpu) returns a non-zero pool which is not set up yet,
> so pool_nr_cpus[pool] = 0 and we get a divide by zero.
>
> I'm still wondering why this doesn't happen on our architecture. Maybe
> the interrupts are disabled longer; I'll check. Anyway, a fix is to
> force this_pool to be 0 as long as numpools == 1. The attached patch is
> a quick, untested hack; maybe one can do it better. It has to be applied
> on top of the other two.

Thanks very much Erich.  I did come across another problem here on numa-q.  In 
task_to_steal() there is a divide by cache_decay_ticks, which apparently is 0 
on my system.  This may have to do with notsc, but I am not sure.  I set 
cache_decay_ticks to 8 (I cannot boot without using notsc), which is probably 
not correct, but I can now boot a 16-processor numa-q on 2.5.40-mm1 with your 
patches!  I'll get some benchmark results soon.
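
A guard on the divisor would probably be safer than hard-coding a
value; something like the sketch below, where cooling_weight() and the
sleep_timestamp field are only illustrative, not the actual
task_to_steal() code:

        /* Treat cache_decay_ticks == 0 (e.g. when booting with "notsc"
         * leaves cpu_khz == 0) as 1, so the division can never trap. */
        static unsigned long cooling_weight(task_t *p)
        {
                unsigned long decay = cache_decay_ticks ? cache_decay_ticks : 1;

                return (jiffies - p->sleep_timestamp) / decay;
        }

That way an unset cache_decay_ticks degrades to "no cache cooling"
instead of an oops.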

Andrew Theurer


* Re: [PATCH] pooling NUMA scheduler with initial load balancing
       [not found]                     ` <200210110947.11714.efocht@ess.nec.de>
@ 2002-10-11  8:27                       ` Erich Focht
  2002-10-11 14:47                         ` Martin J. Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Erich Focht @ 2002-10-11  8:27 UTC (permalink / raw)
  To: Andrew Theurer, Martin J. Bligh, Michael Hohnbaum
  Cc: Ingo Molnar, linux-kernel

On Friday 11 October 2002 09:47, Erich Focht wrote:
> Hi Andrew,
>
> On Thursday 10 October 2002 19:34, Andrew Theurer wrote:
> > Thanks very much Erich.  I did come across another problem here on
> > numa-q. In task_to_steal() there is a divide by cache_decay_ticks, which
> > apparently is 0 on my system.  This may have to do with notsc, but I am
> > not sure.  I set cache_decay_ticks to 8 (I cannot boot without using
> > notsc), which is probably not correct, but I can now boot a 16-processor
> > numa-q on 2.5.40-mm1 with your patches!  I'll get some benchmark results
> > soon.
>
> oops... This is a bug in 2.5-i386. It means that the O(1) scheduler in
> 2.5 doesn't work well either because it doesn't take into account cache
> coolness. I'll post a fix to LKML in a few minutes.

Sorry, I thought the smp_tune_scheduling() call got lost during the
transition to the new cpu boot scheme. But it's there, and the problem
is indeed "notsc". So you'll have to fix it; I can't.

If you set cache_decay_ticks to something non-zero, you should
_really_ do this for all the scheduler tests; otherwise your measurements
will not be comparable.

Regards,
Erich



* Re: [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-11  8:27                       ` Erich Focht
@ 2002-10-11 14:47                         ` Martin J. Bligh
  2002-10-11 15:29                           ` Erich Focht
  0 siblings, 1 reply; 24+ messages in thread
From: Martin J. Bligh @ 2002-10-11 14:47 UTC (permalink / raw)
  To: Erich Focht, Andrew Theurer, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

> Sorry, I thought the smp_tune_scheduling() call got lost during the
> transition to the new cpu boot scheme. But it's there, and the problem
> is indeed "notsc". So you'll have to fix it; I can't.

Errrm ... not sure what you mean by this. notsc is a perfectly
valid thing to do, so if your patch panics with that option, I 
suggest you make something conditional on notsc inside your patch?
Works just fine without your patch, or with Michael's patch ....

M.



* Re: [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-11 14:47                         ` Martin J. Bligh
@ 2002-10-11 15:29                           ` Erich Focht
  2002-10-11 15:34                             ` Martin J. Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Erich Focht @ 2002-10-11 15:29 UTC (permalink / raw)
  To: Martin J. Bligh, Andrew Theurer, Michael Hohnbaum
  Cc: Ingo Molnar, linux-kernel

On Friday 11 October 2002 16:47, Martin J. Bligh wrote:
> > Sorry, I thought the smp_tune_scheduling() call got lost during the
> > transition to the new cpu boot scheme. But it's there, and the problem
> > is indeed "notsc". So you'll have to fix it; I can't.
>
> Errrm ... not sure what you mean by this. notsc is a perfectly
> valid thing to do, so if your patch panics with that option, I
> suggest you make something conditional on notsc inside your patch?
> Works just fine without your patch, or with Michael's patch ....

Martin,

arch/i386/kernel/smpboot.c:smp_tune_scheduling() says:

       if (!cpu_khz) {
                /*
                 * this basically disables processor-affinity
                 * scheduling on SMP without a TSC.
                 */
                cacheflush_time = 0;
                return;

If you boot with notsc, you won't have cache affinity on your machine,
which means that the load balancer will eventually select cache-hot tasks
for stealing. The O(1) scheduler doesn't do that under normal conditions!
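
The eligibility test in the O(1) load balancer looks roughly like this
(paraphrased from memory, not a verbatim quote of the 2.5 source):

        /* A task only becomes a steal candidate once it has been off
         * the CPU for more than cache_decay_ticks.  With
         * cache_decay_ticks == 0 every non-running task qualifies,
         * i.e. cache affinity is effectively disabled. */
        int can_steal = (jiffies - p->sleep_timestamp > cache_decay_ticks) &&
                        !task_running(rq, p) &&
                        (p->cpus_allowed & (1UL << this_cpu));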

Of course I'll add something to my patch so that it doesn't crash
if cache_decay_ticks is unset. But you might be measuring the wrong
things right now if you leave cache_decay_ticks = 0, because cache
affinity on NUMA-Q is then switched off both with the vanilla O(1)
scheduler and with Michael's patch. In other words: you cannot evaluate
the impact of Michael's patches if you don't fix that. This issue is
independent of my patches.

Regards,
Erich



* Re: [PATCH] pooling NUMA scheduler with initial load balancing
  2002-10-11 15:29                           ` Erich Focht
@ 2002-10-11 15:34                             ` Martin J. Bligh
  0 siblings, 0 replies; 24+ messages in thread
From: Martin J. Bligh @ 2002-10-11 15:34 UTC (permalink / raw)
  To: Erich Focht, Andrew Theurer, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

> arch/i386/kernel/smpboot.c:smp_tune_scheduling() says:
> 
>        if (!cpu_khz) {
>                 /*
>                  * this basically disables processor-affinity
>                  * scheduling on SMP without a TSC.
>                  */
>                 cacheflush_time = 0;
>                 return;
> 
> If you boot with notsc, you won't have cache affinity on your machine,
> which means that the load balancer will eventually select cache-hot tasks
> for stealing. The O(1) scheduler doesn't do that under normal conditions!

OK, that makes more sense ... I'll go stare at the code some more and
see what can be done.
 
> Of course I'll add something to my patch so that it doesn't crash
> if cache_decay_ticks is unset. But you might be measuring the wrong
> things right now if you leave cache_decay_ticks = 0, because cache
> affinity on NUMA-Q is then switched off both with the vanilla O(1)
> scheduler and with Michael's patch. In other words: you cannot evaluate
> the impact of Michael's patches if you don't fix that. This issue is
> independent of my patches.

OK, I'll make sure the tests are uniform somehow to get a fair
comparison.

Thanks!

M.




Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
2002-10-06 16:51 [RFC] NUMA schedulers benchmark results Erich Focht
2002-10-06 20:24 ` Erich Focht
2002-10-07  0:00   ` Martin J. Bligh
2002-10-07  0:58 ` Martin J. Bligh
2002-10-07 16:52   ` Erich Focht
2002-10-07  7:25 ` Martin J. Bligh
2002-10-07  7:40   ` Ingo Molnar
2002-10-07 20:09   ` [PATCH] pooling NUMA scheduler with initial load balancing Erich Focht
     [not found]     ` <1420721189.1034032091@[10.10.2.3]>
2002-10-08 17:33       ` Erich Focht
2002-10-08 19:44         ` Martin J. Bligh
2002-10-09 16:26           ` Erich Focht
2002-10-09 17:33             ` Martin J. Bligh
2002-10-09 17:58               ` Andrew Theurer
2002-10-09 18:13                 ` Andrew Theurer
2002-10-09 23:02                 ` Erich Focht
2002-10-10 17:34                   ` Andrew Theurer
     [not found]                     ` <200210110947.11714.efocht@ess.nec.de>
2002-10-11  8:27                       ` Erich Focht
2002-10-11 14:47                         ` Martin J. Bligh
2002-10-11 15:29                           ` Erich Focht
2002-10-11 15:34                             ` Martin J. Bligh
2002-10-09  1:15         ` Christoph Hellwig
2002-10-09 10:29           ` Erich Focht
2002-10-07 16:37 ` [RFC] NUMA schedulers benchmark results Michael Hohnbaum
2002-10-07 20:35   ` Erich Focht
