linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2
@ 2013-07-03 14:21 Mel Gorman
  2013-07-03 14:21 ` [PATCH 01/13] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
                   ` (14 more replies)
  0 siblings, 15 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This builds a bit further on the V1 series. The performance still needs to
be pinned down but it brings in a few more essential basics. Note that Peter
has posted another patch related to avoiding overloading compute nodes but I
have not had the chance to examine it yet. I'll be doing that after this
is posted as I decided not to postpone releasing this series any further;
I'm already two days overdue with an update.

Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
  task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
  preferred node
o Laughably basic accounting of a compute overloaded node when selecting
  the preferred node.
o Applied review comments

This series integrates basic scheduler support for automatic NUMA balancing.
It borrows very heavily from Peter Zijlstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).

This is still far from complete and there are known performance gaps
between this series and manual binding (when that is possible). As before,
the intention is not to complete the work but to incrementally improve
mainline and preserve bisectability for any bug reports that crop up. In
some cases performance may unfortunately be worse and when that happens
it will have to be judged whether the system overhead is lower and, if so,
whether it is still an acceptable direction as a stepping stone to something better.

Patch 1 adds sysctl documentation

Patch 2 tracks NUMA hinting faults per-task and per-node

Patches 3-5 select a preferred node at the end of a PTE scan based on which
	node incurred the highest number of NUMA faults. When the balancer
	is comparing two CPUs it will prefer to locate tasks on their
	preferred node (a simplified sketch of this idea follows this list).

Patch 6 reschedules a task when a preferred node is selected if it is not
	running on that node already. This avoids waiting for the scheduler
	to move the task slowly.

Patch 7 splits the accounting of faults between those that passed the
	two-stage filter and those that did not. Task placement favours
	the filtered faults initially although ultimately this will need
	more smarts when node-local faults do not dominate.

Patch 8 replaces the PTE scanning reset hammer and instead increases the
	scanning rate when an otherwise settled task changes its
	preferred node.

Patch 9 favours moving tasks towards nodes where more faults were incurred
	even if it is not the preferred node

Patch 10 sets the scan rate proportional to the size of the task being scanned.

Patch 11 avoids some unnecessary allocation

Patch 12 kicks away some training wheels and scans shared pages

Patch 13 accounts for how many "preferred placed" tasks are running on a node
	 and attempts to avoid overloading them
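
As a point of reference, the core idea behind patches 2-5 can be modelled
in a few lines of userspace C. This is only an illustrative sketch, not
code from the series; the node count, fault samples and helper names below
are invented, while the kernel implementation operates on p->numa_faults
and nr_node_ids in kernel/sched/fair.c.

/* toy_preferred_node.c: standalone model, not kernel code */
#include <stdio.h>

#define NR_NODES 4			/* invented node count */

static unsigned long numa_faults[NR_NODES];

/* Record a hinting fault; faults that required a migration count double */
static void record_fault(int node, int pages, int migrated)
{
        numa_faults[node] += pages << migrated;
}

/* Return the node with the most accumulated faults, decaying old samples */
static int preferred_node(void)
{
        unsigned long max_faults = 0;
        int nid, max_nid = 0;

        for (nid = 0; nid < NR_NODES; nid++) {
                unsigned long faults = numa_faults[nid];

                numa_faults[nid] >>= 1;		/* decay for the next window */
                if (faults > max_faults) {
                        max_faults = faults;
                        max_nid = nid;
                }
        }
        return max_nid;
}

int main(void)
{
        record_fault(1, 10, 0);	/* ten faults on node 1, no migration */
        record_fault(2, 4, 1);	/* four faults on node 2, pages migrated */
        printf("preferred node: %d\n", preferred_node());	/* prints 1 */
        return 0;
}

The real series further splits this into a per-scan buffer (patch 4) and
private/shared fault classes (patch 7).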

Testing on this is only partial as full tests take a long time to run.

I tested 5 kernels using 3.9.0 as a baseline

o 3.9.0-vanilla		vanilla kernel with automatic numa balancing enabled
o 3.9.0-morefaults	Patches 1-9
o 3.9.0-scalescan	Patches 1-10
o 3.9.0-scanshared	Patches 1-12
o 3.9.0-accountpreferred Patches 1-13

autonumabench
                                          3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0
                                        vanilla       morefaults            scalescan            scanshared      accountpreferred      
User    NUMA01               52623.86 (  0.00%)    53408.85 ( -1.49%)    52042.73 (  1.10%)    60404.32 (-14.79%)    57403.32 ( -9.08%)
User    NUMA01_THEADLOCAL    17595.48 (  0.00%)    17902.64 ( -1.75%)    18070.07 ( -2.70%)    18937.22 ( -7.63%)    17675.13 ( -0.45%)
User    NUMA02                2043.84 (  0.00%)     2029.40 (  0.71%)     2183.84 ( -6.85%)     2173.80 ( -6.36%)     2259.45 (-10.55%)
User    NUMA02_SMT            1057.11 (  0.00%)      999.71 (  5.43%)     1045.10 (  1.14%)     1046.01 (  1.05%)     1048.58 (  0.81%)
System  NUMA01                 414.17 (  0.00%)      328.68 ( 20.64%)      326.08 ( 21.27%)      155.69 ( 62.41%)      144.53 ( 65.10%)
System  NUMA01_THEADLOCAL      105.17 (  0.00%)       93.22 ( 11.36%)       97.63 (  7.17%)       95.46 (  9.23%)      102.47 (  2.57%)
System  NUMA02                   9.36 (  0.00%)        9.39 ( -0.32%)        9.25 (  1.18%)        8.42 ( 10.04%)       10.46 (-11.75%)
System  NUMA02_SMT               3.54 (  0.00%)        3.32 (  6.21%)        4.27 (-20.62%)        3.41 (  3.67%)        3.72 ( -5.08%)
Elapsed NUMA01                1201.52 (  0.00%)     1238.04 ( -3.04%)     1220.85 ( -1.61%)     1385.58 (-15.32%)     1335.06 (-11.11%)
Elapsed NUMA01_THEADLOCAL      393.91 (  0.00%)      410.64 ( -4.25%)      414.33 ( -5.18%)      434.54 (-10.31%)      406.84 ( -3.28%)
Elapsed NUMA02                  50.30 (  0.00%)       50.30 (  0.00%)       54.49 ( -8.33%)       52.14 ( -3.66%)       56.81 (-12.94%)
Elapsed NUMA02_SMT              58.48 (  0.00%)       52.91 (  9.52%)       58.71 ( -0.39%)       53.12 (  9.17%)       60.82 ( -4.00%)
CPU     NUMA01                4414.00 (  0.00%)     4340.00 (  1.68%)     4289.00 (  2.83%)     4370.00 (  1.00%)     4310.00 (  2.36%)
CPU     NUMA01_THEADLOCAL     4493.00 (  0.00%)     4382.00 (  2.47%)     4384.00 (  2.43%)     4379.00 (  2.54%)     4369.00 (  2.76%)
CPU     NUMA02                4081.00 (  0.00%)     4052.00 (  0.71%)     4024.00 (  1.40%)     4184.00 ( -2.52%)     3995.00 (  2.11%)
CPU     NUMA02_SMT            1813.00 (  0.00%)     1895.00 ( -4.52%)     1787.00 (  1.43%)     1975.00 ( -8.94%)     1730.00 (  4.58%)

               3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
             vanilla  morefaults   scalescan  scanshared  accountpreferred
User        73328.02    74347.84    73349.21    82568.87    78393.27
System        532.89      435.24      437.90      263.61      261.79
Elapsed      1714.18     1763.03     1759.02     1936.17     1869.51

numa01 suffers a bit here but numa01 is also an adverse workload on this
machine. The result is poor but I'm not concentrating on it right now.

Patches 1-9 alone (morefaults) perform ok. numa02 is flat and numa02_smt
sees a small performance gain. I do not have variance data to establish
whether this is significant. After that, altering the scanning had a
large impact and I'll re-examine whether the default scan rate is simply
too slow.

It's worth noting the impact on system CPU time. Overall it is much
reduced and I think we need to keep pushing to keep the overhead as low
as possible, particularly as memory sizes grow in the future.

                                 3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
                               vanilla  morefaults   scalescan  scanshared  accountpreferred
THP fault alloc                  14325       14293       14103       14259       14081
THP collapse alloc                   6           3           1          10           5
THP splits                           4           5           5           3           2
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success           9020528     5227450     5355703     5597558     5637844
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                   9363        5426        5559        5810        5852
NUMA PTE updates             119292401    79765854    73441393    76125744    75857594
NUMA hint faults                755901      384660      206195      214063      193969
NUMA hint local faults          595478      292221      120436      113812      109472
NUMA pages migrated            9020528     5227450     5355703     5597558     5637844
AutoNUMA cost                     4785        2580        1646        1709        1607

The primary take-away point is the reduction in NUMA hinting faults and
PTE updates.  Patches 1-9 incur about half the number of faults for
comparable overall performance.

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system. Only a limited number of clients are executed
to save on time. The full set is queued.

specjbb
                        3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0
                      vanilla       morefaults            scalescan            scanshared      accountpreferred      
TPut 1      26099.00 (  0.00%)     24848.00 ( -4.79%)     23990.00 ( -8.08%)     24350.00 ( -6.70%)     24248.00 ( -7.09%)
TPut 7     187276.00 (  0.00%)    189731.00 (  1.31%)    189065.00 (  0.96%)    188680.00 (  0.75%)    189774.00 (  1.33%)
TPut 13    318028.00 (  0.00%)    337374.00 (  6.08%)    339016.00 (  6.60%)    329143.00 (  3.49%)    338743.00 (  6.51%)
TPut 19    368547.00 (  0.00%)    429440.00 ( 16.52%)    423973.00 ( 15.04%)    403563.00 (  9.50%)    430941.00 ( 16.93%)
TPut 25    377522.00 (  0.00%)    497621.00 ( 31.81%)    488108.00 ( 29.29%)    437813.00 ( 15.97%)    485013.00 ( 28.47%)
TPut 31    347642.00 (  0.00%)    487253.00 ( 40.16%)    466366.00 ( 34.15%)    386972.00 ( 11.31%)    437104.00 ( 25.73%)
TPut 37    313439.00 (  0.00%)    478601.00 ( 52.69%)    443415.00 ( 41.47%)    379081.00 ( 20.94%)    425452.00 ( 35.74%)
TPut 43    291958.00 (  0.00%)    458614.00 ( 57.08%)    398195.00 ( 36.39%)    349661.00 ( 19.76%)    393102.00 ( 34.64%)

Patches 1-9 again perform extremely well here. Reducing the scan rate
had an impact, as did scanning shared pages, which may indicate that the
shared/private identification is insufficient. Reducing the scan rate
might be the dominant factor as the tests are very short lived -- 30
seconds each, which is just 10 PTE scan windows. Basic accounting of
compute load helped again and overall the series was competitive.

specjbb Peaks
                         3.9.0                      3.9.0        3.9.0                3.9.0                      3.9.0
                       vanilla            morefaults          scalescan          scanshared           accountpreferred      
 Expctd Warehouse     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)
 Actual Warehouse     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)
 Actual Peak Bops 377522.00 (  0.00%) 497621.00 ( 31.81%) 488108.00 ( 29.29%) 437813.00 ( 15.97%) 485013.00 ( 28.47%)

At least peak bops always improved.

               3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
             vanilla  morefaults   scalescan  scanshared  accountpreferred
User         5184.53     5190.44     5195.01     5173.33     5185.88
System         59.61       58.91       61.47       73.84       64.81
Elapsed       254.52      254.17      254.12      254.80      254.55

Interestingly system CPU times were mixed. Scan shared incurred fewer
faults, migrated fewer pages and updated fewer PTEs so the time is being
lost elsewhere.

                                 3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
                               vanilla morefaults    scalescan  scanshared accountpreferred      
THP fault alloc                  33297       33251       34306       35144       33898
THP collapse alloc                   9           8          14           8          16
THP splits                           3           4           5           4           4
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success           1773768     1716596     2075251     1815999     1858598
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                   1841        1781        2154        1885        1929
NUMA PTE updates              17461135    17268534    18766637    17602518    17406295
NUMA hint faults                 85873      170092       86195       80027       84052
NUMA hint local faults           27145      116070       30651       28293       29919
NUMA pages migrated            1773768     1716596     2075251     1815999     1858598
AutoNUMA cost                      585        1003         601         557         577

Not much of note there, other than that patches 1-9 had a very high number
of hinting faults and it's not immediately obvious why.

I also ran SpecJBB with THP enabled and one JVM running per NUMA node
in the system. Similar to the other test, only a limited number of
clients are executed to save time.

specjbb
                          3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0
                        vanilla       morefaults-v2r2       scalescan-v2r12      scanshared-v2r12   accountpreferred-v2r12
Mean   1      30331.00 (  0.00%)     31076.75 (  2.46%)     30813.50 (  1.59%)     30612.50 (  0.93%)     30411.00 (  0.26%)
Mean   7     150487.75 (  0.00%)    153060.00 (  1.71%)    155117.50 (  3.08%)    152602.25 (  1.41%)    151431.25 (  0.63%)
Mean   13    130513.00 (  0.00%)    135521.25 (  3.84%)    136205.50 (  4.36%)    135635.25 (  3.92%)    130575.50 (  0.05%)
Mean   19    123404.75 (  0.00%)    131505.75 (  6.56%)    126020.75 (  2.12%)    127171.25 (  3.05%)    119632.75 ( -3.06%)
Mean   25    116276.00 (  0.00%)    120041.75 (  3.24%)    117053.25 (  0.67%)    121249.75 (  4.28%)    112591.75 ( -3.17%)
Mean   31    108080.00 (  0.00%)    113237.00 (  4.77%)    113738.00 (  5.24%)    114078.00 (  5.55%)    106955.00 ( -1.04%)
Mean   37    102704.00 (  0.00%)    107246.75 (  4.42%)    113435.75 ( 10.45%)    111945.50 (  9.00%)    106184.75 (  3.39%)
Mean   43     98132.00 (  0.00%)    105014.75 (  7.01%)    109398.75 ( 11.48%)    106662.75 (  8.69%)    103322.75 (  5.29%)
Stddev 1        792.83 (  0.00%)      1127.16 (-42.17%)      1321.59 (-66.69%)      1356.36 (-71.08%)       715.51 (  9.75%)
Stddev 7       4080.34 (  0.00%)       526.84 ( 87.09%)      3153.16 ( 22.72%)      3781.85 (  7.32%)      2863.35 ( 29.83%)
Stddev 13      6614.16 (  0.00%)      2086.04 ( 68.46%)      4139.26 ( 37.42%)      2486.95 ( 62.40%)      4066.48 ( 38.52%)
Stddev 19      2835.73 (  0.00%)      1928.86 ( 31.98%)      4097.14 (-44.48%)       591.59 ( 79.14%)      3182.51 (-12.23%)
Stddev 25      3608.71 (  0.00%)      3198.96 ( 11.35%)      5391.60 (-49.41%)      1606.37 ( 55.49%)      3326.21 (  7.83%)
Stddev 31      2778.25 (  0.00%)       784.02 ( 71.78%)      6802.53 (-144.85%)      1738.20 ( 37.44%)      1126.27 ( 59.46%)
Stddev 37      4069.13 (  0.00%)      5009.93 (-23.12%)      5022.13 (-23.42%)      4191.94 ( -3.02%)      1031.05 ( 74.66%)
Stddev 43      9215.73 (  0.00%)      5589.12 ( 39.35%)      8915.80 (  3.25%)      8042.72 ( 12.73%)      3113.04 ( 66.22%)
TPut   1     121324.00 (  0.00%)    124307.00 (  2.46%)    123254.00 (  1.59%)    122450.00 (  0.93%)    121644.00 (  0.26%)
TPut   7     601951.00 (  0.00%)    612240.00 (  1.71%)    620470.00 (  3.08%)    610409.00 (  1.41%)    605725.00 (  0.63%)
TPut   13    522052.00 (  0.00%)    542085.00 (  3.84%)    544822.00 (  4.36%)    542541.00 (  3.92%)    522302.00 (  0.05%)
TPut   19    493619.00 (  0.00%)    526023.00 (  6.56%)    504083.00 (  2.12%)    508685.00 (  3.05%)    478531.00 ( -3.06%)
TPut   25    465104.00 (  0.00%)    480167.00 (  3.24%)    468213.00 (  0.67%)    484999.00 (  4.28%)    450367.00 ( -3.17%)
TPut   31    432320.00 (  0.00%)    452948.00 (  4.77%)    454952.00 (  5.24%)    456312.00 (  5.55%)    427820.00 ( -1.04%)
TPut   37    410816.00 (  0.00%)    428987.00 (  4.42%)    453743.00 ( 10.45%)    447782.00 (  9.00%)    424739.00 (  3.39%)
TPut   43    392528.00 (  0.00%)    420059.00 (  7.01%)    437595.00 ( 11.48%)    426651.00 (  8.69%)    413291.00 (  5.29%)

These are the mean throughput figures between JVMs and the standard
deviation. Note that with the patches applied there is a lot less
deviation between JVMs in many cases. As the number of clients increases
the performance improves. This is still far short of the theoretical best
performance but it's a step in the right direction.

specjbb Peaks
                         3.9.0                      3.9.0               3.9.0              3.9.0                3.9.0
                       vanilla            morefaults-v2r2     scalescan-v2r12    scanshared-v2r12 accountpreferred-v2r12
 Expctd Warehouse     12.00 (  0.00%)     12.00 (  0.00%)     12.00 (  0.00%)     12.00 (  0.00%)     12.00 (  0.00%)
 Actual Warehouse      8.00 (  0.00%)      8.00 (  0.00%)      8.00 (  0.00%)      8.00 (  0.00%)      8.00 (  0.00%)
 Actual Peak Bops 601951.00 (  0.00%) 612240.00 (  1.71%) 620470.00 (  3.08%) 610409.00 (  1.41%) 605725.00 (  0.63%)

Peaks are only marginally improved even though many of the individual
throughput figures look ok.

               3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
             vanilla  morefaults-v2r2  scalescan-v2r12  scanshared-v2r12  accountpreferred-v2r12
User        78020.94    77250.13    78334.69    78027.78    77752.27
System        305.94      261.17      228.74      234.28      240.04
Elapsed      1744.52     1717.79     1744.31     1742.62     1730.79

And the performance is improved with a healthy reduction in system CPU time.

                                 3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
                               vanilla  morefaults-v2r2  scalescan-v2r12  scanshared-v2r12  accountpreferred-v2r12
THP fault alloc                  65433       64779       64234       65547       63519
THP collapse alloc                  51          54          58          55          55
THP splits                          55          49          46          51          56
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success          20348847    15323475    11375529    11777597    12110444
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                  21122       15905       11807       12225       12570
NUMA PTE updates             180124094   145320534   109608785   108346894   107390100
NUMA hint faults               2358728     1623277     1489903     1472556     1378097
NUMA hint local faults          835051      603375      585183      516949      425342
NUMA pages migrated           20348847    15323475    11375529    11777597    12110444
AutoNUMA cost                    13441        9424        8432        8344        7872

Far fewer PTE updates and faults.

The performance is still a mixed bag. Patches 1-9 are generally
good. Conceptually I think the other patches make sense but need a bit
more love. The last patch in particular will be replaced with more of
Peter's work.

 Documentation/sysctl/kernel.txt |  68 +++++++++
 include/linux/migrate.h         |   7 +-
 include/linux/mm_types.h        |   3 -
 include/linux/sched.h           |  21 ++-
 include/linux/sched/sysctl.h    |   1 -
 kernel/sched/core.c             |  33 +++-
 kernel/sched/fair.c             | 330 ++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h            |  16 ++
 kernel/sysctl.c                 |  14 +-
 mm/huge_memory.c                |   7 +-
 mm/memory.c                     |  13 +-
 mm/migrate.c                    |  17 +--
 mm/mprotect.c                   |   4 +-
 13 files changed, 462 insertions(+), 72 deletions(-)

-- 
1.8.1.4


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 01/13] mm: numa: Document automatic NUMA balancing sysctls
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 02/13] sched: Track NUMA hinting faults on per-node basis Mel Gorman
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccd4258..0fe678c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,72 @@ utilize.
 
 ==============================================================
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running.  Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases.  The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases.  The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a tasks scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
 osrelease, ostype & version:
 
 # cat osrelease
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 02/13] sched: Track NUMA hinting faults on per-node basis
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
  2013-07-03 14:21 ` [PATCH 01/13] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 03/13] sched: Select a preferred node with the most numa hinting faults Mel Gorman
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch tracks which nodes NUMA hinting faults were incurred on.
Greater weight is given if the pages had to be migrated, on the
understanding that such faults cost significantly more. If a task has
paid the cost of migrating data to a node then in future it would be
preferable if the task did not migrate the data away again unnecessarily.
This information is later used to schedule a task on the node incurring
the most NUMA hinting faults.
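
To illustrate just the weighting (with invented numbers, not anything
measured), a fault that required a migration is recorded with double the
weight of one that did not:

#include <stdio.h>

int main(void)
{
        int pages = 512;	/* invented sample size */

        /* same form as the hunk below: migrated faults count double */
        printf("not migrated: %d\n", pages << 0);	/* 512 */
        printf("migrated:     %d\n", pages << 1);	/* 1024 */
        return 0;
}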

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 12 +++++++++++-
 kernel/sched/sched.h  | 11 +++++++++++
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..72861b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1505,6 +1505,8 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+
+	unsigned long *numa_faults;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..f332ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
+	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
@@ -1853,6 +1854,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		task_numa_free(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..904fd6f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (!sched_feat_numa(NUMA))
 		return;
 
-	/* FIXME: Allocate task-specific structure for placement policy here */
+	/* Allocate buffer to track faults on a per-node basis */
+	if (unlikely(!p->numa_faults)) {
+		int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		if (!p->numa_faults)
+			return;
+	}
 
 	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
@@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
 			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
+
+	/* Record the fault, double the weight if pages were migrated */
+	p->numa_faults[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..c5f773d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -503,6 +503,17 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline void task_numa_free(struct task_struct *p)
+{
+	kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 03/13] sched: Select a preferred node with the most numa hinting faults
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
  2013-07-03 14:21 ` [PATCH 01/13] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
  2013-07-03 14:21 ` [PATCH 02/13] sched: Track NUMA hinting faults on per-node basis Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 04/13] sched: Update NUMA hinting faults once per scan Mel Gorman
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 10 ++++++++++
 kernel/sched/fair.c   | 16 ++++++++++++++--
 kernel/sched/sched.h  |  1 +
 4 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 72861b4..ba46a64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,7 @@ struct task_struct {
 	struct callback_head numa_work;
 
 	unsigned long *numa_faults;
+	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f332ec0..019baae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -5713,6 +5714,15 @@ enum s_alloc {
 
 struct sched_domain_topology_level;
 
+#ifdef CONFIG_NUMA_BALANCING
+
+/* Set a tasks preferred NUMA node */
+void sched_setnuma(struct task_struct *p, int nid)
+{
+	p->numa_preferred_nid = nid;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 904fd6f..f8c3f61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -793,7 +793,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq;
+	int seq, nid, max_nid = 0;
+	unsigned long max_faults = 0;
 
 	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
 		return;
@@ -802,7 +803,18 @@ static void task_numa_placement(struct task_struct *p)
 		return;
 	p->numa_scan_seq = seq;
 
-	/* FIXME: Scheduling placement policy hints go here */
+	/* Find the node with the highest number of faults */
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		unsigned long faults = p->numa_faults[nid];
+		p->numa_faults[nid] >>= 1;
+		if (faults > max_faults) {
+			max_faults = faults;
+			max_nid = nid;
+		}
+	}
+
+	if (max_faults && max_nid != p->numa_preferred_nid)
+		sched_setnuma(p, max_nid);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5f773d..65a0cf0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,6 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int nid);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 04/13] sched: Update NUMA hinting faults once per scan
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (2 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 03/13] sched: Select a preferred node with the most numa hinting faults Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 05/13] sched: Favour moving tasks towards the preferred node Mel Gorman
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

NUMA hinting fault counts and placement decisions are both recorded in the
same array which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay, creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artificially high
count due to very recent faulting activity and may create a bouncing effect.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.
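
As an illustrative userspace sketch only (the sample values are invented),
the decay-and-copy step at the end of a scan window behaves as follows;
the kernel equivalent is the loop added to task_numa_placement() below:

#include <stdio.h>

#define NR_NODES 2			/* invented node count */

static unsigned long numa_faults[NR_NODES];		/* decaying history */
static unsigned long numa_faults_buffer[NR_NODES];	/* current window */

/* At the end of a scan window: decay the history and fold the buffer in */
static void scan_complete(void)
{
        int nid;

        for (nid = 0; nid < NR_NODES; nid++) {
                numa_faults[nid] >>= 1;
                numa_faults[nid] += numa_faults_buffer[nid];
                numa_faults_buffer[nid] = 0;
        }
}

int main(void)
{
        numa_faults_buffer[0] = 100;
        scan_complete();	/* history is now 100, 0 */
        numa_faults_buffer[1] = 40;
        scan_complete();	/* history is now 50, 40: old samples decay */
        printf("node0=%lu node1=%lu\n", numa_faults[0], numa_faults[1]);
        return 0;
}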

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h | 13 +++++++++++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 16 +++++++++++++---
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba46a64..42f9818 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1506,7 +1506,20 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	/*
+	 * Exponential decaying average of faults on a per-node basis.
+	 * Scheduling placement decisions are made based on the these counts.
+	 * The values remain static for the duration of a PTE scan
+	 */
 	unsigned long *numa_faults;
+
+	/*
+	 * numa_faults_buffer records faults per node during the current
+	 * scan window. When the scan completes, the counts in numa_faults
+	 * decay and these values are copied.
+	 */
+	unsigned long *numa_faults_buffer;
+
 	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 019baae..b00b81a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1596,6 +1596,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
+	p->numa_faults_buffer = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8c3f61..5893399 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -805,8 +805,14 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
-		unsigned long faults = p->numa_faults[nid];
+		unsigned long faults;
+
+		/* Decay existing window and copy faults since last scan */
 		p->numa_faults[nid] >>= 1;
+		p->numa_faults[nid] += p->numa_faults_buffer[nid];
+		p->numa_faults_buffer[nid] = 0;
+
+		faults = p->numa_faults[nid];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -831,9 +837,13 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * nr_node_ids;
 
-		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		/* numa_faults and numa_faults_buffer share the allocation */
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
+
+		BUG_ON(p->numa_faults_buffer);
+		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
 	}
 
 	/*
@@ -847,7 +857,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults[node] += pages << migrated;
+	p->numa_faults_buffer[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 05/13] sched: Favour moving tasks towards the preferred node
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (3 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 04/13] sched: Update NUMA hinting faults once per scan Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch favours moving tasks towards the preferred NUMA node when it
has just been selected. Ideally this is self-reinforcing as the longer
the task runs on that node, the more faults it should incur, causing
task_numa_placement to keep the task running on that node. In reality a
big weakness is that the node's CPUs can be overloaded and it would be more
efficient to queue tasks on an idle node and migrate to the new node. This
would require additional smarts in the balancer so for now the balancer
will simply prefer to place the task on the preferred node for a number of
PTE scans which is controlled by the numa_balancing_settle_count sysctl.
Once the settle_count number of scans has completed the scheduler is free
to place the task on an alternative node if the load is imbalanced.
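
A simplified standalone model of that bias is sketched below. The settle
count and node numbers are invented; the in-kernel check added to
can_migrate_task() in the hunk below additionally requires SD_NUMA and a
valid p->numa_faults array.

#include <stdbool.h>
#include <stdio.h>

#define SETTLE_COUNT 3	/* stand-in for sysctl_numa_balancing_settle_count */

/*
 * Toy model: prefer pulling a task to dst_nid only while it is still
 * "settling" on a newly selected preferred node.
 */
static bool migrate_improves_locality(int src_nid, int dst_nid,
                                      int preferred_nid, int migrate_seq)
{
        if (src_nid == dst_nid || migrate_seq >= SETTLE_COUNT)
                return false;
        return preferred_nid == dst_nid;
}

int main(void)
{
        /* Freshly selected preferred node 1, task currently on node 0 */
        printf("%d\n", migrate_improves_locality(0, 1, 1, 0)); /* 1: pull it */
        printf("%d\n", migrate_improves_locality(0, 1, 1, 5)); /* 0: settled */
        return 0;
}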

[srikar@linux.vnet.ibm.com: Fixed statistics]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt |  8 +++++-
 include/linux/sched.h           |  1 +
 kernel/sched/core.c             |  4 ++-
 kernel/sched/fair.c             | 56 ++++++++++++++++++++++++++++++++++++++---
 kernel/sysctl.c                 |  7 ++++++
 5 files changed, 71 insertions(+), 5 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 0fe678c..246b128 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,7 +374,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
 
 ==============================================================
 
@@ -418,6 +419,11 @@ scanned for a given scan.
 numa_balancing_scan_period_reset is a blunt instrument that controls how
 often a tasks scan delay is reset to detect sudden changes in task behaviour.
 
+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42f9818..82a6136 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b00b81a..ba9470e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1591,7 +1591,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_migrate_seq = 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
@@ -5721,6 +5721,7 @@ struct sched_domain_topology_level;
 void sched_setnuma(struct task_struct *p, int nid)
 {
 	p->numa_preferred_nid = nid;
+	p->numa_migrate_seq = 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
@@ -6150,6 +6151,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5893399..2a0bbc2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -791,6 +791,15 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -802,6 +811,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_migrate_seq++;
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
@@ -3897,6 +3907,35 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (p->numa_preferred_nid == dst_nid)
+		return true;
+
+	return false;
+}
+#else
+static inline bool migrate_improves_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -3945,11 +3984,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) task is cache cold, or
-	 * 2) too many balance attempts have failed.
+	 * 1) destination numa is preferred
+	 * 2) task is cache cold, or
+	 * 3) too many balance attempts have failed.
 	 */
-
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+
+	if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+		if (tsk_cache_hot) {
+			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+			schedstat_inc(p, se.statistics.nr_forced_migrations);
+		}
+#endif
+		return 1;
+	}
+
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..263486f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -393,6 +393,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_settle_count",
+		.data           = &sysctl_numa_balancing_settle_count,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (4 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 05/13] sched: Favour moving tasks towards the preferred node Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-04 12:26   ` Srikar Dronamraju
  2013-07-03 14:21 ` [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter Mel Gorman
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

A preferred node is selected based on the node the most NUMA hinting
faults were incurred on. There is no guarantee that the task is running
on that node at the time so this patch reschedules the task to run on
the most idle CPU of the selected node when the preferred node is
selected. This avoids waiting for the balancer to make a decision.
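
A toy model of the CPU selection follows, with invented load figures;
the kernel version (find_idlest_cpu_node() in the hunk below) uses
weighted_cpuload() instead and also skips CPUs whose current task already
prefers the target node.

#include <limits.h>
#include <stdio.h>

#define NR_CPUS_ON_NODE 4	/* invented CPU count on the target node */

/* Return the index of the least loaded CPU among nr_cpus candidates */
static int find_idlest_cpu(const unsigned long load[], int nr_cpus)
{
        unsigned long min_load = ULONG_MAX;
        int i, idlest = 0;

        for (i = 0; i < nr_cpus; i++) {
                if (load[i] < min_load) {
                        min_load = load[i];
                        idlest = i;
                }
        }
        return idlest;
}

int main(void)
{
        unsigned long load[NR_CPUS_ON_NODE] = { 2048, 1024, 3072, 512 };

        printf("idlest cpu: %d\n", find_idlest_cpu(load, NR_CPUS_ON_NODE));
        return 0;
}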

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  | 18 ++++++++++++++++--
 kernel/sched/fair.c  | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  2 +-
 3 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ba9470e..b4722d6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5717,11 +5717,25 @@ struct sched_domain_topology_level;
 
 #ifdef CONFIG_NUMA_BALANCING
 
-/* Set a tasks preferred NUMA node */
-void sched_setnuma(struct task_struct *p, int nid)
+/* Set a tasks preferred NUMA node and reschedule to it */
+void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
 {
+	int curr_cpu = task_cpu(p);
+	struct migration_arg arg = { p, idlest_cpu };
+
 	p->numa_preferred_nid = nid;
 	p->numa_migrate_seq = 0;
+
+	/* Do not reschedule if already running on the target CPU */
+	if (idlest_cpu == curr_cpu)
+		return;
+
+	/* Ensure the target CPU is eligible */
+	if (!cpumask_test_cpu(idlest_cpu, tsk_cpus_allowed(p)))
+		return;
+
+	/* Move current running task to idlest CPU on preferred node */
+	stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2a0bbc2..b9139be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -800,6 +800,37 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static unsigned long weighted_cpuload(const int cpu);
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int i, idlest_cpu = this_cpu;
+
+	BUG_ON(cpu_to_node(this_cpu) == nid);
+
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		load = weighted_cpuload(i);
+
+		if (load < min_load) {
+			struct task_struct *p;
+
+			/* Do not preempt a task running on its preferred node */
+			struct rq *rq = cpu_rq(i);
+			raw_spin_lock_irq(&rq->lock);
+			p = rq->curr;
+			if (p->numa_preferred_nid != nid) {
+				min_load = load;
+				idlest_cpu = i;
+			}
+			raw_spin_unlock_irq(&rq->lock);
+		}
+	}
+
+	return idlest_cpu;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -829,8 +860,27 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	if (max_faults && max_nid != p->numa_preferred_nid)
-		sched_setnuma(p, max_nid);
+	/*
+	 * Record the preferred node as the node with the most faults,
+	 * requeue the task to be running on the idlest CPU on the
+	 * preferred node and reset the scanning rate to recheck
+	 * the working set placement.
+	 */
+	if (max_faults && max_nid != p->numa_preferred_nid) {
+		int preferred_cpu;
+
+		/*
+		 * If the task is not on the preferred node then find the most
+		 * idle CPU to migrate to.
+		 */
+		preferred_cpu = task_cpu(p);
+		if (cpu_to_node(preferred_cpu) != max_nid) {
+			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+							     max_nid);
+		}
+
+		sched_setnuma(p, max_nid, preferred_cpu);
+	}
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 65a0cf0..64c37a3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,7 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void sched_setnuma(struct task_struct *p, int nid);
+extern void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (5 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 21:56   ` Johannes Weiner
  2013-07-03 14:21 ` [PATCH 08/13] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared. This would require
that the last task that accessed a page for a hinting fault would be
recorded which would increase the size of struct page. Instead this patch
approximates private pages by assuming that faults that pass the two-stage
filter are private pages and all others are shared. The preferred NUMA
node is then selected based on where the maximum number of approximately
private faults were measured. Shared faults are not taken into
consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and the tasks
will be scheduled away later, only to bounce back again. Alternatively
the shared tasks would just bounce around nodes because the fault
information is effectively noise. Either way, accounting for shared faults
the same as private faults may result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and its node
would be selected as the preferred node even though it is the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.
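
A standalone sketch with invented values may help illustrate the layout:
the per-task fault array doubles in size and is indexed as 2 * nid + priv,
where a fault is approximated as private when the faulting CPU's node
matches the node that last touched the page.

#include <stdio.h>

#define NR_NODES 4	/* invented node count */

/* Faults are stored as numa_faults[2 * nid + priv] */
static unsigned long numa_faults[2 * NR_NODES];

static int task_faults_idx(int nid, int priv)
{
        return 2 * nid + priv;
}

/* this_nid: node of the faulting CPU, last_nid: node that last touched */
static void record_fault(int last_nid, int this_nid, int node, int pages,
                         int migrated)
{
        int priv = (this_nid == last_nid);

        numa_faults[task_faults_idx(node, priv)] += pages << migrated;
}

int main(void)
{
        record_fault(1, 1, 1, 4, 0);	/* repeat access: counted as private */
        record_fault(0, 1, 1, 4, 0);	/* last touched elsewhere: shared */
        printf("node1 private=%lu shared=%lu\n",
               numa_faults[task_faults_idx(1, 1)],
               numa_faults[task_faults_idx(1, 0)]);
        return 0;
}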

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  4 ++--
 kernel/sched/fair.c   | 30 +++++++++++++++++++++---------
 mm/huge_memory.c      |  7 ++++---
 mm/memory.c           |  9 ++++++---
 4 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 82a6136..a41edea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1600,10 +1600,10 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages, bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9139be..a66f2bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -831,6 +831,11 @@ find_idlest_cpu_node(int this_cpu, int nid)
 	return idlest_cpu;
 }
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -847,13 +852,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
+
+			/* Decay existing window and copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
 
-		faults = p->numa_faults[nid];
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -886,16 +897,17 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
 
 	if (!sched_feat_numa(NUMA))
 		return;
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
 		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
@@ -903,7 +915,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -917,7 +929,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults_buffer[node] += pages << migrated;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..7cd7114 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int target_nid;
+	int target_nid, last_nid;
 	int current_nid = -1;
 	bool migrated;
 
@@ -1307,6 +1307,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
+	last_nid = page_nid_last(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		put_page(page);
@@ -1332,7 +1333,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!migrated)
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
 	return 0;
 
 check_same:
@@ -1347,7 +1348,7 @@ clear_pmdnuma:
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+		task_numa_fault(last_nid, current_nid, HPAGE_PMD_NR, false);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index ba94dec..c28bf52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int current_nid = -1, last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3566,6 +3566,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
+	last_nid = page_nid_last(page);
 	current_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3586,7 +3587,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+		task_numa_fault(last_nid, current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3602,6 +3603,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3654,6 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * migrated to.
 		 */
 		curr_nid = local_nid;
+		last_nid = page_nid_last(page);
 		target_nid = numa_migrate_prep(page, vma, addr,
 					       page_to_nid(page));
 		if (target_nid == -1) {
@@ -3666,7 +3669,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		migrated = migrate_misplaced_page(page, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		task_numa_fault(last_nid, curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 08/13] sched: Increase NUMA PTE scanning when a new preferred node is selected
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (6 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults Mel Gorman
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

The NUMA PTE scan is reset every sysctl_numa_balancing_scan_period_reset
milliseconds in case of phase changes. This is crude and the resets are
clearly visible in graphs even when the workload is already balanced. This
patch instead increases the scan rate if the preferred node is updated and
the task is currently running on that node, so that the placement decision
is rechecked quickly. In the optimistic expectation that the placement
decisions will be correct, the maximum period between scans is also
increased to reduce the overhead of automatic NUMA balancing.
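
As a rough illustration of the new behaviour, using only the default values
from this patch (scan_period_min = 100ms, scan_period_max = 100*600 =
60000ms):

	settled task with numa_scan_period backed off to 60000ms
	preferred node changes              -> period halved to 30000ms
	changes again while still settled   -> 15000ms and so on, but
	                                       never below the 100ms minimum
	task not yet settled                -> period left untouched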

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 +++--------
 include/linux/mm_types.h        |  3 ---
 include/linux/sched/sysctl.h    |  1 -
 kernel/sched/core.c             |  1 -
 kernel/sched/fair.c             | 27 ++++++++++++---------------
 kernel/sysctl.c                 |  7 -------
 6 files changed, 15 insertions(+), 35 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 246b128..a275042 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -373,15 +373,13 @@ guarantee. If the target workload is already bound to NUMA nodes then this
 feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
-numa_balancing_settle_count sysctls.
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
 
 ==============================================================
 
 numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_size_mb
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
 
 Automatic NUMA balancing scans tasks address space and unmaps pages to
 detect if pages are properly placed or if the data should be migrated to a
@@ -416,9 +414,6 @@ effectively controls the minimum scanning rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
 
-numa_balancing_scan_period_reset is a blunt instrument that controls how
-often a tasks scan delay is reset to detect sudden changes in task behaviour.
-
 numa_balancing_settle_count is how many scan periods must complete before
 the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..de70964 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -421,9 +421,6 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
-	/* numa_next_reset is when the PTE scanner period will be reset */
-	unsigned long numa_next_reset;
-
 	/* Restart point for scanning and setting pte_numa */
 	unsigned long numa_scan_offset;
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..10d16c4f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -47,7 +47,6 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
-extern unsigned int sysctl_numa_balancing_scan_period_reset;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_settle_count;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b4722d6..2d1fd93 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1585,7 +1585,6 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_next_scan = jiffies;
-		p->mm->numa_next_reset = jiffies;
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a66f2bb..e8d9b3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -782,8 +782,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * numa task sample period in ms
  */
 unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
-unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -879,6 +878,7 @@ static void task_numa_placement(struct task_struct *p)
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		int preferred_cpu;
+		int old_migrate_seq = p->numa_migrate_seq;
 
 		/*
 		 * If the task is not on the preferred node then find the most
@@ -891,6 +891,16 @@ static void task_numa_placement(struct task_struct *p)
 		}
 
 		sched_setnuma(p, max_nid, preferred_cpu);
+
+		/*
+		 * If the preferred node changes frequently then the scan rate
+		 * will be continually high. Mitigate this by increasing the
+		 * scan rate only if the task was settled.
+		 */
+		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
+			p->numa_scan_period = max(p->numa_scan_period >> 1,
+					sysctl_numa_balancing_scan_period_min);
+		}
 	}
 }
 
@@ -984,19 +994,6 @@ void task_numa_work(struct callback_head *work)
 	}
 
 	/*
-	 * Reset the scan period if enough time has gone by. Objective is that
-	 * scanning will be reduced if pages are properly placed. As tasks
-	 * can enter different phases this needs to be re-examined. Lacking
-	 * proper tracking of reference behaviour, this blunt hammer is used.
-	 */
-	migrate = mm->numa_next_reset;
-	if (time_after(now, migrate)) {
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
-		next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
-		xchg(&mm->numa_next_reset, next_scan);
-	}
-
-	/*
 	 * Enforce maximal scan/migration frequency..
 	 */
 	migrate = mm->numa_next_scan;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 263486f..1fcbc68 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -373,13 +373,6 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname	= "numa_balancing_scan_period_reset",
-		.data		= &sysctl_numa_balancing_scan_period_reset,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
-	{
 		.procname	= "numa_balancing_scan_period_max_ms",
 		.data		= &sysctl_numa_balancing_scan_period_max,
 		.maxlen		= sizeof(unsigned int),
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (7 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 08/13] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 18:27   ` Peter Zijlstra
  2013-07-03 14:21 ` [PATCH 10/13] sched: Set the scan rate proportional to the size of the task being scanned Mel Gorman
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

The scheduler already favours moving tasks towards their preferred node but
does nothing special if the destination node is anything else. This patch
favours moving tasks towards a destination node if more NUMA hinting faults
were recorded on it. Similarly, if migrating to a destination node would
degrade locality based on NUMA hinting faults then the migration will be
resisted.
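
As a concrete example of the comparison being added (the fault counts are
made up for illustration), suppose a not-yet-settled task has recorded
private hinting faults of 120 on node0 and 40 on node1:

	moving it from a node1 CPU to a node0 CPU improves locality
	(120 > 40) and is favoured even if node0 is not the preferred node

	moving it the other way degrades locality and is treated the same
	as migrating a cache-hot task in can_migrate_task()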

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8d9b3e..e451859 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3967,22 +3967,54 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+
+static bool migrate_locality_prepare(struct task_struct *p, struct lb_env *env,
+				int *src_nid, int *dst_nid)
+{
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	*src_nid = cpu_to_node(env->src_cpu);
+	*dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (*src_nid == *dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	return true;
+}
+
 /* Returns true if the destination node has incurred more faults */
 static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 {
 	int src_nid, dst_nid;
 
-	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+	if (!migrate_locality_prepare(p, env, &src_nid, &dst_nid))
 		return false;
 
-	src_nid = cpu_to_node(env->src_cpu);
-	dst_nid = cpu_to_node(env->dst_cpu);
+	/* Move towards node if it is the preferred node */
+	if (p->numa_preferred_nid == dst_nid)
+		return true;
 
-	if (src_nid == dst_nid ||
-	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+	/*
+	 * Move towards node if there were a higher number of private
+	 * NUMA hinting faults recorded on it
+	 */
+	if (p->numa_faults[task_faults_idx(dst_nid, 1)] >
+	    p->numa_faults[task_faults_idx(src_nid, 1)])
+		return true;
+
+	return false;
+}
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!migrate_locality_prepare(p, env, &src_nid, &dst_nid))
 		return false;
 
-	if (p->numa_preferred_nid == dst_nid)
+	if (p->numa_faults[src_nid] > p->numa_faults[dst_nid])
 		return true;
 
 	return false;
@@ -3993,6 +4025,14 @@ static inline bool migrate_improves_locality(struct task_struct *p,
 {
 	return false;
 }
+
+
+static inline bool migrate_degrades_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
+
 #endif
 
 /*
@@ -4048,6 +4088,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) too many balance attempts have failed.
 	 */
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+	if (!tsk_cache_hot)
+		tsk_cache_hot = migrate_degrades_locality(p, env);
 
 	if (migrate_improves_locality(p, env)) {
 #ifdef CONFIG_SCHEDSTATS
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 10/13] sched: Set the scan rate proportional to the size of the task being scanned
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (8 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 11/13] sched: Check current->mm before allocating NUMA faults Mel Gorman
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size. This scan rate is independent of the size
of the task and as an aside it is further complicated by the fact that
numa_balancing_scan_size controls how many pages are marked pte_numa and
not how much virtual memory is scanned.

In combination, it is almost impossible to meaningfully tune the min and
max scan periods and reasoning about performance is complex when the time
to complete a full scan is partially a function of the task's memory
size. This patch alters the semantics of the min and max tunables to be
about tuning the length of time it takes to complete a scan of a task's
virtual address space. Conceptually this is a lot easier to understand.
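
A worked example of the new semantics, using only the default values in
this patch (a full pass in 3000ms at the fastest and 300000ms at the
slowest, scanning 256MB per window):

	task with 2GB of address space  -> 2048MB / 256MB  = 8 scan windows
	minimum period per 256MB window -> 3000ms / 8      = 375ms
	maximum period per 256MB window -> 300000ms / 8    = 37500ms

A larger task is therefore scanned in more, and more frequent, windows while
the time to cover its whole address space stays roughly constant.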

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 ++++----
 kernel/sched/fair.c             | 56 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a275042..f38d4f4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -401,15 +401,16 @@ workload pattern changes and minimises performance impact due to remote
 memory accesses. These sysctls control the thresholds for scan delays and
 the number of pages scanned.
 
-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
 
 numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
 when it initially forks.
 
-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
 
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e451859..336074f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -779,10 +779,12 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the tasks virtual memory size and
+ * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_min = 3000;
+unsigned int sysctl_numa_balancing_scan_period_max = 300000;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -790,6 +792,34 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+	unsigned long nr_vm_pages = 0;
+	unsigned long nr_scan_pages;
+
+	nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+	nr_vm_pages = p->mm->total_vm;
+	if (!nr_vm_pages)
+		nr_vm_pages = nr_scan_pages;
+
+	nr_vm_pages = round_up(nr_vm_pages, nr_scan_pages);
+	return nr_vm_pages / nr_scan_pages;
+}
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+	unsigned int period;
+
+	/* For scanning sanity sake, never scan faster than 100ms */
+	period = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+	return max_t(unsigned int, 100, period);
+}
+
+static unsigned int task_scan_max(struct task_struct *p)
+{
+	return sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+}
+
 /*
  * Once a preferred node is selected the scheduler balancer will prefer moving
  * a task to that node for sysctl_numa_balancing_settle_count number of PTE
@@ -899,7 +929,7 @@ static void task_numa_placement(struct task_struct *p)
 		 */
 		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
 			p->numa_scan_period = max(p->numa_scan_period >> 1,
-					sysctl_numa_balancing_scan_period_min);
+					task_scan_min(p));
 		}
 	}
 }
@@ -933,7 +963,7 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	 * This is reset periodically in case of phase changes
 	 */
         if (!migrated)
-		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
+		p->numa_scan_period = min(task_scan_max(p),
 			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
@@ -959,6 +989,7 @@ void task_numa_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
+	unsigned long nr_pte_updates = 0;
 	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -1001,7 +1032,7 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	if (p->numa_scan_period == 0)
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+		p->numa_scan_period = task_scan_min(p);
 
 	next_scan = now + msecs_to_jiffies(p->numa_scan_period);
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
@@ -1040,10 +1071,17 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			pages -= change_prot_numa(vma, start, end);
+			nr_pte_updates += change_prot_numa(vma, start, end);
+			pages -= (end - start) >> PAGE_SHIFT;
 
 			start = end;
-			if (pages <= 0)
+
+			/*
+			 * Scan sysctl_numa_balancing_scan_size but ensure that
+			 * least one PTE is updated so that unused virtual
+			 * address space is quickly skipped
+			 */
+			if (pages <= 0 && nr_pte_updates)
 				goto out;
 		} while (end != vma->vm_end);
 	}
@@ -1087,7 +1125,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 
 	if (now - curr->node_stamp > period) {
 		if (!curr->node_stamp)
-			curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+			curr->numa_scan_period = task_scan_min(curr);
 		curr->node_stamp = now;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 11/13] sched: Check current->mm before allocating NUMA faults
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (9 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 10/13] sched: Set the scan rate proportional to the size of the task being scanned Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 15:33   ` Mel Gorman
  2013-07-04 12:48   ` Srikar Dronamraju
  2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
                   ` (3 subsequent siblings)
  14 siblings, 2 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

task_numa_placement checks current->mm but only after the buffers for
tracking faults have already been uselessly allocated. Move the check
earlier.

[peterz@infradead.org: Identified the problem]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 336074f..3c796b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -870,8 +870,6 @@ static void task_numa_placement(struct task_struct *p)
 	int seq, nid, max_nid = 0;
 	unsigned long max_faults = 0;
 
-	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
-		return;
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
 		return;
@@ -945,6 +943,12 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!sched_feat_numa(NUMA))
 		return;
 
+	/* for example, ksmd faulting in a user's mm */
+	if (!p->mm) {
+		p->numa_scan_period = sysctl_numa_balancing_scan_period_max;
+		return;
+	}
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
@@ -1072,16 +1076,18 @@ void task_numa_work(struct callback_head *work)
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
 			nr_pte_updates += change_prot_numa(vma, start, end);
-			pages -= (end - start) >> PAGE_SHIFT;
-
-			start = end;
 
 			/*
 			 * Scan sysctl_numa_balancing_scan_size but ensure that
-			 * least one PTE is updated so that unused virtual
-			 * address space is quickly skipped
+			 * at least one PTE is updated so that unused virtual
+			 * address space is quickly skipped.
 			 */
-			if (pages <= 0 && nr_pte_updates)
+			if (nr_pte_updates)
+				pages -= (end - start) >> PAGE_SHIFT;
+
+			start = end;
+
+			if (pages <= 0)
 				goto out;
 		} while (end != vma->vm_end);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (10 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 11/13] sched: Check current->mm before allocating NUMA faults Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 18:35   ` Peter Zijlstra
                     ` (2 more replies)
  2013-07-03 14:21 ` [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
                   ` (2 subsequent siblings)
  14 siblings, 3 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Initial support for automatic NUMA balancing was unable to distinguish
between false shared versus private pages except by ignoring pages with an
elevated page_mapcount entirely. This patch kicks away the training wheels
as initial support for identifying shared/private pages is now in place.
Note that the patch still leaves shared, file-backed in VM_EXEC vmas in
place guessing that these are shared library pages. Migrating them are
likely to be of major benefit as generally the expectation would be that
these are read-shared between caches and that iTLB and iCache pressure is
generally low.
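
For clarity, this is a sketch of the resulting decision table for the base
page case in migrate_misplaced_page below:

	page_mapcount == 1                            -> may migrate
	page_mapcount > 1, anonymous                  -> may migrate (new)
	page_mapcount > 1, file-backed, not VM_EXEC   -> may migrate (new)
	page_mapcount > 1, file-backed, VM_EXEC vma   -> skipped as a probable
	                                                 shared library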

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |  7 ++++---
 mm/memory.c             |  4 ++--
 mm/migrate.c            | 17 ++++++-----------
 mm/mprotect.c           |  4 +---
 4 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..e7e26af 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -92,11 +92,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+				  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+					 struct vm_area_struct *vma, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index c28bf52..b06022a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3581,7 +3581,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, target_nid);
+	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated)
 		current_nid = target_nid;
 
@@ -3666,7 +3666,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		/* Migrate to the requested node */
 		pte_unmap_unlock(pte, ptl);
-		migrated = migrate_misplaced_page(page, target_nid);
+		migrated = migrate_misplaced_page(page, vma, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
 		task_numa_fault(last_nid, curr_nid, 1, migrated);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3bbaf5d..23f8122 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1579,7 +1579,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+			   int node)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
@@ -1587,10 +1588,11 @@ int migrate_misplaced_page(struct page *page, int node)
 	LIST_HEAD(migratepages);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
+	 * Don't migrate file pages that are mapped in multiple processes
+	 * with execute permissions as they are probably shared libraries.
 	 */
-	if (page_mapcount(page) != 1)
+	if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+	    (vma->vm_flags & VM_EXEC))
 		goto out;
 
 	/*
@@ -1641,13 +1643,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int page_lru = page_is_file_cache(page);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
-	 */
-	if (page_mapcount(page) != 1)
-		goto out_dropref;
-
-	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..cacc64a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					if (last_nid != this_nid)
 						all_same_node = false;
 
-					/* only check non-shared pages */
-					if (!pte_numa(oldpte) &&
-					    page_mapcount(page) == 1) {
+					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
 					}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (11 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 18:32   ` Peter Zijlstra
  2013-07-03 16:19 ` [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
  2013-07-04 18:02 ` [PATCH RFC WIP] Process weights based scheduling for better consolidation Srikar Dronamraju
  14 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

It is preferred that tasks always run local to their memory but it is
not optimal if that node is compute overloaded and the task is failing
to get access to a CPU. In that case NUMA balancing moving the task back
would compete with the load balancer trying to move tasks off the node.

Ultimately it will be necessary to calculate the compute load of each
node and minimise that, as well as minimising the number of remote
accesses, until the optimal balance point is reached. Begin this process
by simply accounting for the number of tasks that are running on their
preferred node. When deciding what node to place a task on, do not place
it on a node that already has more preferred-placement tasks running
than it has CPUs.
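
As implemented in the fair.c hunk below, the check is actually against twice
the CPU count rather than the CPU count itself. Illustrative numbers:

	node with 8 CPUs, 15 tasks preferring it running there  -> considered
	node with 8 CPUs, 16+ such tasks (>= nr_cpus << 1)      -> skipped when
	                                                           selecting a
	                                                           preferred node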

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c  | 45 ++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h |  4 ++++
 2 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c796b0..9ffdff3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -777,6 +777,18 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_preferred_running +=
+			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_preferred_running -=
+			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
@@ -865,6 +877,21 @@ static inline int task_faults_idx(int nid, int priv)
 	return 2 * nid + priv;
 }
 
+/* Returns true if the given node is compute overloaded */
+static bool sched_numa_overloaded(int nid)
+{
+	int nr_cpus = 0;
+	int nr_preferred = 0;
+	int i;
+
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		nr_cpus++;
+		nr_preferred += cpu_rq(i)->nr_preferred_running;
+	}
+
+	return nr_preferred >= nr_cpus << 1;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -892,7 +919,7 @@ static void task_numa_placement(struct task_struct *p)
 
 		/* Find maximum private faults */
 		faults = p->numa_faults[task_faults_idx(nid, 1)];
-		if (faults > max_faults) {
+		if (faults > max_faults && !sched_numa_overloaded(nid)) {
 			max_faults = faults;
 			max_nid = nid;
 		}
@@ -1144,6 +1171,14 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1153,8 +1188,10 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_enqueue(rq_of(cfs_rq), task_of(se));
 		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	}
 #endif
 	cfs_rq->nr_running++;
 }
@@ -1165,8 +1202,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
+	}
 	cfs_rq->nr_running--;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 64c37a3..f05b31b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -433,6 +433,10 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long nr_preferred_running;
+#endif
+
 	u64 rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 11/13] sched: Check current->mm before allocating NUMA faults
  2013-07-03 14:21 ` [PATCH 11/13] sched: Check current->mm before allocating NUMA faults Mel Gorman
@ 2013-07-03 15:33   ` Mel Gorman
  2013-07-04 12:48   ` Srikar Dronamraju
  1 sibling, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 15:33 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:38PM +0100, Mel Gorman wrote:
> @@ -1072,16 +1076,18 @@ void task_numa_work(struct callback_head *work)
>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>  			end = min(end, vma->vm_end);
>  			nr_pte_updates += change_prot_numa(vma, start, end);
> -			pages -= (end - start) >> PAGE_SHIFT;
> -
> -			start = end;
>  
>  			/*
>  			 * Scan sysctl_numa_balancing_scan_size but ensure that
> -			 * least one PTE is updated so that unused virtual
> -			 * address space is quickly skipped
> +			 * at least one PTE is updated so that unused virtual
> +			 * address space is quickly skipped.
>  			 */
> -			if (pages <= 0 && nr_pte_updates)
> +			if (nr_pte_updates)
> +				pages -= (end - start) >> PAGE_SHIFT;
> +
> +			start = end;
> +
> +			if (pages <= 0)
>  				goto out;
>  		} while (end != vma->vm_end);

This hunk is a rebasing error that should have been in the previous
patch. Fixed now.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (12 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
@ 2013-07-03 16:19 ` Mel Gorman
  2013-07-03 16:26   ` Mel Gorman
  2013-07-04 18:02 ` [PATCH RFC WIP] Process weights based scheduling for better consolidation Srikar Dronamraju
  14 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 16:19 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:27PM +0100, Mel Gorman wrote:
> o 3.9.0-vanilla		vanilla kernel with automatic numa balancing enabled
> o 3.9.0-morefaults	Patches 1-9
> o 3.9.0-scalescan	Patches 1-10
> o 3.9.0-scanshared	Patches 1-12
> o 3.9.0-accountpreferred Patches 1-13
> 

I screwed up the testing as 3.9.0-morefaults is not patches 1-9 at all and
I only noticed when examining an anomaly. It's a unreleased series that
I screwed up the patch generation for. The conclusions about patches 1-9
are invalid. I'll redo the testing.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2
  2013-07-03 16:19 ` [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
@ 2013-07-03 16:26   ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 16:26 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 05:19:15PM +0100, Mel Gorman wrote:
> On Wed, Jul 03, 2013 at 03:21:27PM +0100, Mel Gorman wrote:
> > o 3.9.0-vanilla		vanilla kernel with automatic numa balancing enabled
> > o 3.9.0-morefaults	Patches 1-9
> > o 3.9.0-scalescan	Patches 1-10
> > o 3.9.0-scanshared	Patches 1-12
> > o 3.9.0-accountpreferred Patches 1-13
> > 
> 
> I screwed up the testing as 3.9.0-morefaults is not patches 1-9 at all and
> I only noticed when examining an anomaly. It's a unreleased series that
> I screwed up the patch generation for. The conclusions about patches 1-9
> are invalid. I'll redo the testing.
> 

Wow..... I'm a double idiot. 3.9.0-morefaults really is patches 1-9 that
was released. The conclusions are ok. I was looking at the unreleased v3
of the series just there and the anomaly is in there.  Long day, giving
up before I manage to stick my foot in it for a third time.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults
  2013-07-03 14:21 ` [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults Mel Gorman
@ 2013-07-03 18:27   ` Peter Zijlstra
  2013-07-04  9:25     ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-03 18:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:36PM +0100, Mel Gorman wrote:
>  static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
>  {

> +	if (p->numa_faults[task_faults_idx(dst_nid, 1)] >
> +	    p->numa_faults[task_faults_idx(src_nid, 1)])
> +		return true;

> +}

> +static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
> +{

> +	if (p->numa_faults[src_nid] > p->numa_faults[dst_nid])
>  		return true;

I bet you wanted to use task_faults_idx() there too ;-)



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-03 14:21 ` [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
@ 2013-07-03 18:32   ` Peter Zijlstra
  2013-07-04  9:37     ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-03 18:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:40PM +0100, Mel Gorman wrote:
> ---
>  kernel/sched/fair.c  | 45 ++++++++++++++++++++++++++++++++++++++++++---
>  kernel/sched/sched.h |  4 ++++
>  2 files changed, 46 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3c796b0..9ffdff3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -777,6 +777,18 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   * Scheduling class queueing methods:
>   */
>  
> +static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
> +{
> +	rq->nr_preferred_running +=
> +			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
> +}
> +
> +static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
> +{
> +	rq->nr_preferred_running -=
> +			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
> +}

Ah doing this requires you dequeue before changing ->numa_preferred_nid. I
don't remember seeing that change in this series.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount
  2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
@ 2013-07-03 18:35   ` Peter Zijlstra
  2013-07-04  9:27     ` Mel Gorman
  2013-07-03 18:41   ` Peter Zijlstra
  2013-07-03 18:42   ` Peter Zijlstra
  2 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-03 18:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:39PM +0100, Mel Gorman wrote:
> Initial support for automatic NUMA balancing was unable to distinguish
> between false shared versus private pages except by ignoring pages with an
> elevated page_mapcount entirely. This patch kicks away the training wheels
> as initial support for identifying shared/private pages is now in place.
> Note that the patch still leaves shared, file-backed in VM_EXEC vmas in
> place guessing that these are shared library pages. Migrating them are
> likely to be of major benefit as generally the expectation would be that
> these are read-shared between caches and that iTLB and iCache pressure is
> generally low.

This reminds me; there a clause in task_numa_work() that skips 'small' VMAs. I
don't see the point of that.

In fact; when using things like electric fence this might mean skipping most
memory.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount
  2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
  2013-07-03 18:35   ` Peter Zijlstra
@ 2013-07-03 18:41   ` Peter Zijlstra
  2013-07-04  9:32     ` Mel Gorman
  2013-07-03 18:42   ` Peter Zijlstra
  2 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-03 18:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:39PM +0100, Mel Gorman wrote:

> Note that the patch still leaves shared, file-backed in VM_EXEC vmas in
> place guessing that these are shared library pages. Migrating them are
> likely to be of major benefit as generally the expectation would be that
> these are read-shared between caches and that iTLB and iCache pressure is
> generally low.

I'm failing to grasp.. we don't migrate them because migrating them would
likely be beneficial?

Missing a negative somewhere?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount
  2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
  2013-07-03 18:35   ` Peter Zijlstra
  2013-07-03 18:41   ` Peter Zijlstra
@ 2013-07-03 18:42   ` Peter Zijlstra
  2 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-03 18:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:39PM +0100, Mel Gorman wrote:
> @@ -1587,10 +1588,11 @@ int migrate_misplaced_page(struct page *page, int node)
>  	LIST_HEAD(migratepages);
>  
>  	/*
> +	 * Don't migrate file pages that are mapped in multiple processes
> +	 * with execute permissions as they are probably shared libraries.
>  	 */
> +	if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
> +	    (vma->vm_flags & VM_EXEC))
>  		goto out;

So we will migrate DSOs that are mapped but once. That's fair enough I suppose.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-03 14:21 ` [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter Mel Gorman
@ 2013-07-03 21:56   ` Johannes Weiner
  2013-07-04  9:23     ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2013-07-03 21:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:34PM +0100, Mel Gorman wrote:
> Ideally it would be possible to distinguish between NUMA hinting faults
> that are private to a task and those that are shared. This would require
> that the last task that accessed a page for a hinting fault would be
> recorded which would increase the size of struct page. Instead this patch
> approximates private pages by assuming that faults that pass the two-stage
> filter are private pages and all others are shared. The preferred NUMA
> node is then selected based on where the maximum number of approximately
> private faults were measured. Shared faults are not taken into
> consideration for a few reasons.

Ingo had a patch that would just encode a few bits of the PID along
with the last_nid (last_cpu in his case) member of struct page.  No
extra space required and should be accurate enough.

Otherwise this is blind to sharedness within the node the task is
currently running on, right?

> First, if there are many tasks sharing the page then they'll all move
> towards the same node. The node will be compute overloaded and then
> scheduled away later only to bounce back again. Alternatively the shared
> tasks would just bounce around nodes because the fault information is
> effectively noise. Either way accounting for shared faults the same as
> private faults may result in lower performance overall.

When the node with many shared pages is compute overloaded then there
is arguably not an optimal node for the tasks and moving them off is
inevitable.  However, the node with the most page accesses, private or
shared, is still the preferred node from a memory stand point.
Compute load being equal, the task should go to the node with 2GB of
shared memory and not to the one with 2 private pages.

If the load balancer moves the task off due to cpu load reasons,
wouldn't the settle count mechanism prevent it from bouncing back?

Likewise, if the cpu load situation changes, the balancer could move
the task back to its truly preferred node.

> The second reason is based on a hypothetical workload that has a small
> number of very important, heavily accessed private pages but a large shared
> array. The shared array would dominate the number of faults and be selected
> as a preferred node even though it's the wrong decision.

That's a scan granularity problem and I can't see how you solve it
with ignoring the shared pages.  What if the situation is opposite
with a small, heavily used shared set and many rarely used private
pages?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-03 21:56   ` Johannes Weiner
@ 2013-07-04  9:23     ` Mel Gorman
  2013-07-04 14:24       ` Rik van Riel
  2013-07-04 19:36       ` Johannes Weiner
  0 siblings, 2 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-04  9:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Linux-MM, LKML

On Wed, Jul 03, 2013 at 05:56:54PM -0400, Johannes Weiner wrote:
> On Wed, Jul 03, 2013 at 03:21:34PM +0100, Mel Gorman wrote:
> > Ideally it would be possible to distinguish between NUMA hinting faults
> > that are private to a task and those that are shared. This would require
> > that the last task that accessed a page for a hinting fault would be
> > recorded which would increase the size of struct page. Instead this patch
> > approximates private pages by assuming that faults that pass the two-stage
> > filter are private pages and all others are shared. The preferred NUMA
> > node is then selected based on where the maximum number of approximately
> > private faults were measured. Shared faults are not taken into
> > consideration for a few reasons.
> 
> Ingo had a patch that would just encode a few bits of the PID along
> with the last_nid (last_cpu in his case) member of struct page.  No
> extra space required and should be accurate enough.
> 

Yes, I'm aware of it. I noted in the changelog that ideally we'd record
the task both to remind myself and so that the patch that introduces it
could refer to this changelog so there is some sort of logical progression
for reviewers.

I was not keen on the use of last_cpu because I felt there was an implicit
assumption that scanning would always be fast enough to record hinting
faults before a task got moved to another CPU for any reason. I feared this
would be worse as memory and task sizes increased. That's why I stayed
with tracking the nid for the two-stage filter until it could be proven
it was insufficient for some reason.

The reason there is nothing resembling pid tracking yet is that the series
is already a bit of a mouthful and I thought the other parts were more
important for now.
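
Purely to illustrate the sort of packing being talked about, something like
the below (the names and the bit split are made up, this is not what Ingo's
patch actually did):

/* hypothetical: pack the low bits of the pid next to the nid in one int */
#define NIDPID_PID_BITS		8
#define NIDPID_PID_MASK		((1 << NIDPID_PID_BITS) - 1)

static inline int nidpid_encode(int nid, int pid)
{
	return (nid << NIDPID_PID_BITS) | (pid & NIDPID_PID_MASK);
}

/* a hinting fault would only be treated as private if both halves match */
static inline bool nidpid_match(int last_nidpid, int nid, int pid)
{
	return last_nidpid == nidpid_encode(nid, pid);
}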

> Otherwise this is blind to sharedness within the node the task is
> currently running on, right?
> 

Yes, it is.

> > First, if there are many tasks sharing the page then they'll all move
> > towards the same node. The node will be compute overloaded and then
> > scheduled away later only to bounce back again. Alternatively the shared
> > tasks would just bounce around nodes because the fault information is
> > effectively noise. Either way accounting for shared faults the same as
> > private faults may result in lower performance overall.
> 
> When the node with many shared pages is compute overloaded then there
> is arguably not an optimal node for the tasks and moving them off is
> inevitable. 

Yes. If such an event occurs then the ideal is that the task interleaves
between a subset of nodes. The situation could be partially detected by
tracking whether the number of historical faults is appreciably larger
than those counted on the preferred node and then interleaving between the
top N most-faulted nodes until the working set fits. Starting the
interleave should just be a matter of coding. The difficulty is correctly
backing that off if there is a phase change.
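
A rough, untested sketch of how the interleave set could be selected, just
to show the shape of it (entirely hypothetical, nothing like this is in the
series):

/*
 * Build a nodemask of the nr nodes with the most private hinting faults.
 * A task whose working set does not fit on one node could then interleave
 * its memory between them.
 */
static void task_numa_top_nodes(struct task_struct *p, int nr,
				nodemask_t *mask)
{
	nodes_clear(*mask);

	while (nr--) {
		unsigned long best_faults = 0;
		int nid, best_nid = -1;

		for_each_online_node(nid) {
			unsigned long faults;

			if (node_isset(nid, *mask))
				continue;

			faults = p->numa_faults[task_faults_idx(nid, 1)];
			if (faults > best_faults) {
				best_faults = faults;
				best_nid = nid;
			}
		}

		if (best_nid == -1)
			break;
		node_set(best_nid, *mask);
	}
}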

> However, the node with the most page accesses, private or
> shared, is still the preferred node from a memory stand point.
> Compute load being equal, the task should go to the node with 2GB of
> shared memory and not to the one with 2 private pages.
> 

Agreed. The level of shared vs private needs to be detected. The problem
here is that detecting private dominated workloads is not straight-forward,
particularly as the scan rate slows as we've already discussed.

> If the load balancer moves the task off due to cpu load reasons,
> wouldn't the settle count mechanism prevent it from bouncing back?
> 
> Likewise, if the cpu load situation changes, the balancer could move
> the task back to its truly preferred node.
> 
> > The second reason is based on a hypothetical workload that has a small
> > number of very important, heavily accessed private pages but a large shared
> > array. The shared array would dominate the number of faults and be selected
> > as a preferred node even though it's the wrong decision.
> 
> That's a scan granularity problem and I can't see how you solve it
> with ignoring the shared pages. 

I acknowledge it's a problem and basically I'm making a big assumption
that private-dominated workloads are going to be the common case. Threaded
applications on UMA with heavy amounts of shared data (within cache lines)
already suck in terms of performance so I'm expecting programmers already
try to avoid this sort of sharing. Obviously we are at page granularity
here so the assumption will depend entirely on alignments and buffer sizes
and it might still fall apart.

I think that dealing with this specific problem is a series all on its
own and treating it on its own in isolation would be best.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults
  2013-07-03 18:27   ` Peter Zijlstra
@ 2013-07-04  9:25     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-04  9:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 08:27:48PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2013 at 03:21:36PM +0100, Mel Gorman wrote:
> >  static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> >  {
> 
> > +	if (p->numa_faults[task_faults_idx(dst_nid, 1)] >
> > +	    p->numa_faults[task_faults_idx(src_nid, 1)])
> > +		return true;
> 
> > +}
> 
> > +static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
> > +{
> 
> > +	if (p->numa_faults[src_nid] > p->numa_faults[dst_nid])
> >  		return true;
> 
> I bet you wanted to use task_faults_idx() there too ;-)
> 

You won that bet. Fixed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount
  2013-07-03 18:35   ` Peter Zijlstra
@ 2013-07-04  9:27     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-04  9:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 08:35:17PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2013 at 03:21:39PM +0100, Mel Gorman wrote:
> > Initial support for automatic NUMA balancing was unable to distinguish
> > between false shared versus private pages except by ignoring pages with an
> > elevated page_mapcount entirely. This patch kicks away the training wheels
> > as initial support for identifying shared/private pages is now in place.
> > Note that the patch still leaves shared, file-backed in VM_EXEC vmas in
> > place guessing that these are shared library pages. Migrating them are
> > likely to be of major benefit as generally the expectation would be that
> > these are read-shared between caches and that iTLB and iCache pressure is
> > generally low.
> 
> This reminds me; there a clause in task_numa_work() that skips 'small' VMAs. I
> don't see the point of that.
> 

It was a stupid hack initially to keep scan rates down and it was on the
TODO list to get rid of it and replace it with something else. I'll just
get rid of it for now without the replacement. Patch looks like this.

---8<---
sched: Remove check that skips small VMAs

task_numa_work skips small VMAs. At the time the logic was to reduce the
scanning overhead which was considerable. It is a dubious hack at best. It
would make much more sense to cache where faults have been observed and
only rescan those regions during subsequent PTE scans. Remove this hack
as motivation to do it properly in the future.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d34c6e..921265b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1075,10 +1075,6 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma))
 			continue;
 
-		/* Skip small VMAs. They are not likely to be of relevance */
-		if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
-			continue;
-
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount
  2013-07-03 18:41   ` Peter Zijlstra
@ 2013-07-04  9:32     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-04  9:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 08:41:24PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2013 at 03:21:39PM +0100, Mel Gorman wrote:
> 
> > Note that the patch still leaves shared, file-backed in VM_EXEC vmas in
> > place guessing that these are shared library pages. Migrating them are
> > likely to be of major benefit as generally the expectation would be that
> > these are read-shared between caches and that iTLB and iCache pressure is
> > generally low.
> 
> I'm failing to grasp.. we don't migrate them because migrating them would
> likely be beneficial?
> 
> Missing a negative somewhere?

Yes.

Note that the patch does not migrate shared, file-backed pages within vmas
marked VM_EXEC as these are generally shared library pages. Migrating such
pages is not beneficial as there is an expectation they are read-shared
between caches and iTLB and iCache pressure is generally low.
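
For illustration only, the kind of check described above, placed in the VMA
walk of task_numa_work(), might look something like this (the exact test in
the posted patch may differ):

		/*
		 * Illustrative sketch: skip file-backed, executable,
		 * non-writable mappings on the assumption that they are
		 * shared library text, which is read-shared between caches
		 * and not worth migrating.
		 */
		if (vma->vm_file && (vma->vm_flags & VM_EXEC) &&
		    !(vma->vm_flags & VM_WRITE))
			continue;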

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-03 18:32   ` Peter Zijlstra
@ 2013-07-04  9:37     ` Mel Gorman
  2013-07-04 13:07       ` Srikar Dronamraju
  0 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-04  9:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 08:32:43PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2013 at 03:21:40PM +0100, Mel Gorman wrote:
> > ---
> >  kernel/sched/fair.c  | 45 ++++++++++++++++++++++++++++++++++++++++++---
> >  kernel/sched/sched.h |  4 ++++
> >  2 files changed, 46 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3c796b0..9ffdff3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -777,6 +777,18 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >   * Scheduling class queueing methods:
> >   */
> >  
> > +static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
> > +{
> > +	rq->nr_preferred_running +=
> > +			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
> > +}
> > +
> > +static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
> > +{
> > +	rq->nr_preferred_running -=
> > +			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
> > +}
> 
> Ah doing this requires you dequeue before changing ->numa_preferred_nid. I
> don't remember seeing that change in this series.

That's because it never happened. This is what the patch currently looks
like after the shuffling.

---8<---

sched: Account for the number of preferred tasks running on a node when selecting a preferred node

It is preferred that tasks always run local to their memory but it is
not optimal if that node is compute-overloaded and the task is failing
to get access to a CPU. That would have the load balancer trying to move
tasks off the node while NUMA balancing moves them back.

Ultimately, it will be required that the compute load of each node be
calculated and minimised along with the number of remote accesses until
the optimal balance point is reached. Begin this process by simply
accounting for the number of tasks that are running on their preferred
node. When deciding what node to place a task on, do not place a task
on a node that already has more preferred-placement tasks than it has
CPUs.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  | 34 ++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c  | 49 +++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  5 +++++
 3 files changed, 82 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 02db92a..13b9068 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6112,6 +6112,40 @@ static struct sched_domain_topology_level default_topology[] = {
 
 static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+#ifdef CONFIG_NUMA_BALANCING
+void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
+{
+	struct rq *rq;
+	unsigned long flags;
+	bool on_rq, running;
+
+	/*
+	 * Dequeue task before updating preferred_nid so
+	 * rq->nr_preferred_running is accurate
+	 */
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	/* Update the preferred nid and migrate task if possible */
+	p->numa_preferred_nid = nid;
+	p->numa_migrate_seq = 0;
+
+	/* Requeue task if necessary */
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
+
+#endif
+
 #ifdef CONFIG_NUMA
 
 static int sched_domains_numa_levels;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 921265b..99f3cad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -777,6 +777,18 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_preferred_running +=
+			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_preferred_running -=
+			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
@@ -870,6 +882,21 @@ static inline int task_faults_idx(int nid, int priv)
 	return 2 * nid + priv;
 }
 
+/* Returns true if the given node is compute overloaded */
+static bool sched_numa_overloaded(int nid)
+{
+	int nr_cpus = 0;
+	int nr_preferred = 0;
+	int i;
+
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		nr_cpus++;
+		nr_preferred += cpu_rq(i)->nr_preferred_running;
+	}
+
+	return nr_preferred >= nr_cpus << 1;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -897,7 +924,7 @@ static void task_numa_placement(struct task_struct *p)
 
 		/* Find maximum private faults */
 		faults = p->numa_faults[task_faults_idx(nid, 1)];
-		if (faults > max_faults) {
+		if (faults > max_faults && !sched_numa_overloaded(nid)) {
 			max_faults = faults;
 			max_nid = nid;
 		}
@@ -923,9 +950,7 @@ static void task_numa_placement(struct task_struct *p)
 							     max_nid);
 		}
 
-		/* Update the preferred nid and migrate task if possible */
-		p->numa_preferred_nid = max_nid;
-		p->numa_migrate_seq = 0;
+		sched_setnuma(p, max_nid, preferred_cpu);
 		migrate_task_to(p, preferred_cpu);
 
 		/*
@@ -1148,6 +1173,14 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1157,8 +1190,10 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_enqueue(rq_of(cfs_rq), task_of(se));
 		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	}
 #endif
 	cfs_rq->nr_running++;
 }
@@ -1169,8 +1204,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
+	}
 	cfs_rq->nr_running--;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 795346d..1d7c0fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -433,6 +433,10 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long nr_preferred_running;
+#endif
+
 	u64 rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
@@ -504,6 +508,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu);
 extern int migrate_task_to(struct task_struct *p, int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected
  2013-07-03 14:21 ` [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
@ 2013-07-04 12:26   ` Srikar Dronamraju
  2013-07-04 13:29     ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Srikar Dronamraju @ 2013-07-04 12:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-07-03 15:21:33]:

> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2a0bbc2..b9139be 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -800,6 +800,37 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
>   */
>  unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
> 
> +static unsigned long weighted_cpuload(const int cpu);
> +
> +static int
> +find_idlest_cpu_node(int this_cpu, int nid)
> +{
> +	unsigned long load, min_load = ULONG_MAX;
> +	int i, idlest_cpu = this_cpu;
> +
> +	BUG_ON(cpu_to_node(this_cpu) == nid);
> +
> +	for_each_cpu(i, cpumask_of_node(nid)) {
> +		load = weighted_cpuload(i);
> +
> +		if (load < min_load) {
> +			struct task_struct *p;
> +
> +			/* Do not preempt a task running on its preferred node */
> +			struct rq *rq = cpu_rq(i);
> +			raw_spin_lock_irq(&rq->lock);

Not sure why we need this spin_lock. Can't this be done in an RCU block
instead?


-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 11/13] sched: Check current->mm before allocating NUMA faults
  2013-07-03 14:21 ` [PATCH 11/13] sched: Check current->mm before allocating NUMA faults Mel Gorman
  2013-07-03 15:33   ` Mel Gorman
@ 2013-07-04 12:48   ` Srikar Dronamraju
  2013-07-05 10:07     ` Mel Gorman
  1 sibling, 1 reply; 43+ messages in thread
From: Srikar Dronamraju @ 2013-07-04 12:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-07-03 15:21:38]:

> task_numa_placement checks current->mm but after buffers for faults
> have already been uselessly allocated. Move the check earlier.
> 
> [peterz@infradead.org: Identified the problem]
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  kernel/sched/fair.c | 22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 336074f..3c796b0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -870,8 +870,6 @@ static void task_numa_placement(struct task_struct *p)
>  	int seq, nid, max_nid = 0;
>  	unsigned long max_faults = 0;
> 
> -	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
> -		return;
>  	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
>  	if (p->numa_scan_seq == seq)
>  		return;
> @@ -945,6 +943,12 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
>  	if (!sched_feat_numa(NUMA))
>  		return;
> 
> +	/* for example, ksmd faulting in a user's mm */
> +	if (!p->mm) {
> +		p->numa_scan_period = sysctl_numa_balancing_scan_period_max;

Naive question:
Why are we resetting the scan_period?

> +		return;
> +	}
> +
>  	/* Allocate buffer to track faults on a per-node basis */
>  	if (unlikely(!p->numa_faults)) {
>  		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
> @@ -1072,16 +1076,18 @@ void task_numa_work(struct callback_head *work)
>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>  			end = min(end, vma->vm_end);
>  			nr_pte_updates += change_prot_numa(vma, start, end);
> -			pages -= (end - start) >> PAGE_SHIFT;
> -
> -			start = end;
> 
>  			/*
>  			 * Scan sysctl_numa_balancing_scan_size but ensure that
> -			 * least one PTE is updated so that unused virtual
> -			 * address space is quickly skipped
> +			 * at least one PTE is updated so that unused virtual
> +			 * address space is quickly skipped.
>  			 */
> -			if (pages <= 0 && nr_pte_updates)
> +			if (nr_pte_updates)
> +				pages -= (end - start) >> PAGE_SHIFT;
> +
> +			start = end;
> +
> +			if (pages <= 0)
>  				goto out;
>  		} while (end != vma->vm_end);
>  	}
> -- 
> 1.8.1.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-04  9:37     ` Mel Gorman
@ 2013-07-04 13:07       ` Srikar Dronamraju
  2013-07-04 13:54         ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Srikar Dronamraju @ 2013-07-04 13:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

>  static void task_numa_placement(struct task_struct *p)
>  {
>  	int seq, nid, max_nid = 0;
> @@ -897,7 +924,7 @@ static void task_numa_placement(struct task_struct *p)
> 
>  		/* Find maximum private faults */
>  		faults = p->numa_faults[task_faults_idx(nid, 1)];
> -		if (faults > max_faults) {
> +		if (faults > max_faults && !sched_numa_overloaded(nid)) {

Should we take the other approach of setting the preferred nid but not 
moving the task to the node?

So if some task moves out of the preferred node, then we should still be
able to move this task there. 

However, your current approach has an advantage in that it at least runs on
the second preferred choice if not the first.

Also should sched_numa_overloaded() also consider pinned tasks?

>  			max_faults = faults;
>  			max_nid = nid;
>  		}
> @@ -923,9 +950,7 @@ static void task_numa_placement(struct task_struct *p)
>  							     max_nid);
>  		}
> 
> -		/* Update the preferred nid and migrate task if possible */
> -		p->numa_preferred_nid = max_nid;
> -		p->numa_migrate_seq = 0;
> +		sched_setnuma(p, max_nid, preferred_cpu);
>  		migrate_task_to(p, preferred_cpu);
> 
>  		/*


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected
  2013-07-04 12:26   ` Srikar Dronamraju
@ 2013-07-04 13:29     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-04 13:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 05:56:44PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-07-03 15:21:33]:
> 
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2a0bbc2..b9139be 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -800,6 +800,37 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
> >   */
> >  unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
> > 
> > +static unsigned long weighted_cpuload(const int cpu);
> > +
> > +static int
> > +find_idlest_cpu_node(int this_cpu, int nid)
> > +{
> > +	unsigned long load, min_load = ULONG_MAX;
> > +	int i, idlest_cpu = this_cpu;
> > +
> > +	BUG_ON(cpu_to_node(this_cpu) == nid);
> > +
> > +	for_each_cpu(i, cpumask_of_node(nid)) {
> > +		load = weighted_cpuload(i);
> > +
> > +		if (load < min_load) {
> > +			struct task_struct *p;
> > +
> > +			/* Do not preempt a task running on its preferred node */
> > +			struct rq *rq = cpu_rq(i);
> > +			raw_spin_lock_irq(&rq->lock);
> 
> Not sure why we need this spin_lock. Can't this be done in an RCU block
> instead?
> 

Judging by how find_idlest_cpu works it would appear you are correct.
Thanks very much, I'm still pretty much a scheduler wuss. I know what I
want but not always how to get it :)
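
For illustration, a lock-free variant along the lines Srikar suggests might
look roughly like this (a sketch only, not the code that was posted):

static int find_idlest_cpu_node(int this_cpu, int nid)
{
	unsigned long load, min_load = ULONG_MAX;
	int i, idlest_cpu = this_cpu;

	BUG_ON(cpu_to_node(this_cpu) == nid);

	rcu_read_lock();
	for_each_cpu(i, cpumask_of_node(nid)) {
		load = weighted_cpuload(i);

		if (load < min_load) {
			/*
			 * A stale ->curr is acceptable; this is only a
			 * heuristic to avoid preempting a task that is
			 * already running on its preferred node.
			 */
			struct task_struct *p = ACCESS_ONCE(cpu_rq(i)->curr);

			if (p->numa_preferred_nid != nid) {
				min_load = load;
				idlest_cpu = i;
			}
		}
	}
	rcu_read_unlock();

	return idlest_cpu;
}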

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-04 13:07       ` Srikar Dronamraju
@ 2013-07-04 13:54         ` Mel Gorman
  2013-07-04 14:06           ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-04 13:54 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 06:37:19PM +0530, Srikar Dronamraju wrote:
> >  static void task_numa_placement(struct task_struct *p)
> >  {
> >  	int seq, nid, max_nid = 0;
> > @@ -897,7 +924,7 @@ static void task_numa_placement(struct task_struct *p)
> > 
> >  		/* Find maximum private faults */
> >  		faults = p->numa_faults[task_faults_idx(nid, 1)];
> > -		if (faults > max_faults) {
> > +		if (faults > max_faults && !sched_numa_overloaded(nid)) {
> 
> Should we take the other approach of setting the preferred nid but not 
> moving the task to the node?
> 

Why would that be better?

> So if some task moves out of the preferred node, then we should still be
> able to move this task there. 
> 

I think if we were to do that then I'd revisit the "task swap" logic from
autonuma (numacore had something similar) and search for pairs of tasks
that both benefit from a swap. I prototyped something basic along these
lines but it was premature. It's a more directed approach but one that
should be done only when the private/shared and load logic is solidified.

> However, your current approach has an advantage in that it at least runs on
> the second preferred choice if not the first.
> 

That was the intention.

> Also should sched_numa_overloaded() also consider pinned tasks?
> 

I don't think sched_numa_overloaded() needs to as such, at least I don't see
how it would do it sensibly right now. However, you still make an important
point in that find_idlest_cpu_node should take it into account. How about this?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9247345..387f28d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -863,9 +863,13 @@ find_idlest_cpu_node(int this_cpu, int nid)
 		load = weighted_cpuload(i);
 
 		if (load < min_load) {
-			/* Do not preempt a task running on a preferred node */
+			/*
+			 * Do not preempt a task running on a preferred node or
+			 * a task that is pinned to its current CPU
+			 */
 			struct task_struct *p = cpu_rq(i)->curr;
-			if (p->numa_preferred_nid != nid) {
+			if (p->numa_preferred_nid != nid &&
+			    cpumask_weight(tsk_cpus_allowed(p)) > 1) {
 				min_load = load;
 				idlest_cpu = i;
 			}

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-04 13:54         ` Mel Gorman
@ 2013-07-04 14:06           ` Peter Zijlstra
  2013-07-04 14:40             ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-04 14:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 04, 2013 at 02:54:15PM +0100, Mel Gorman wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9247345..387f28d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -863,9 +863,13 @@ find_idlest_cpu_node(int this_cpu, int nid)
>  		load = weighted_cpuload(i);
>  
>  		if (load < min_load) {
> -			/* Do not preempt a task running on a preferred node */
> +			/*
> +			 * Do not preempt a task running on a preferred node or
> +			 * a task that is pinned to its current CPU
> +			 */
>  			struct task_struct *p = cpu_rq(i)->curr;
> -			if (p->numa_preferred_nid != nid) {
> +			if (p->numa_preferred_nid != nid &&
> +			    cpumask_weight(tsk_cpus_allowed(p)) > 1) {

We have p->nr_cpus_allowed for that.

>  				min_load = load;
>  				idlest_cpu = i;
>  			}

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-04  9:23     ` Mel Gorman
@ 2013-07-04 14:24       ` Rik van Riel
  2013-07-04 19:36       ` Johannes Weiner
  1 sibling, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2013-07-04 14:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Peter Zijlstra, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Linux-MM, LKML

On 07/04/2013 05:23 AM, Mel Gorman wrote:

> I think that dealing with this specific problem is a series all on its
> own and treating it on its own in isolation would be best.

Agreed, let's tackle one thing at a time, otherwise we will
(once again) end up with a patch series that is too large
to merge.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-04 14:06           ` Peter Zijlstra
@ 2013-07-04 14:40             ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-04 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 04, 2013 at 04:06:13PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 04, 2013 at 02:54:15PM +0100, Mel Gorman wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9247345..387f28d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -863,9 +863,13 @@ find_idlest_cpu_node(int this_cpu, int nid)
> >  		load = weighted_cpuload(i);
> >  
> >  		if (load < min_load) {
> > -			/* Do not preempt a task running on a preferred node */
> > +			/*
> > +			 * Do not preempt a task running on a preferred node or
> > +			 * a task that is pinned to its current CPU
> > +			 */
> >  			struct task_struct *p = cpu_rq(i)->curr;
> > -			if (p->numa_preferred_nid != nid) {
> > +			if (p->numa_preferred_nid != nid &&
> > +			    cpumask_weight(tsk_cpus_allowed(p)) > 1) {
> 
> We have p->nr_cpus_allowed for that.
> 

/me slaps self

That's a bit easier to calculate, thanks.
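
i.e. the condition would end up along these lines (illustrative fragment from
inside the loop):

			/*
			 * Do not preempt a task running on its preferred node
			 * or a task that is pinned to its current CPU.
			 */
			struct task_struct *p = cpu_rq(i)->curr;
			if (p->numa_preferred_nid != nid && p->nr_cpus_allowed > 1) {
				min_load = load;
				idlest_cpu = i;
			}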

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC WIP] Process weights based scheduling for better consolidation
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (13 preceding siblings ...)
  2013-07-03 16:19 ` [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
@ 2013-07-04 18:02 ` Srikar Dronamraju
  2013-07-05 10:16   ` Peter Zijlstra
  14 siblings, 1 reply; 43+ messages in thread
From: Srikar Dronamraju @ 2013-07-04 18:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

Here is an approach to look at NUMA-balanced scheduling from a non-NUMA-fault
angle. This approach uses process weights instead of faults as the basis to
move or bring tasks together.

Here are the advantages of this approach.
1. Provides excellent consolidation of tasks.
	- I have verified this with sched_autonuma_dump_mm(), which was part
	  of Andrea's autonuma patches; refer to commit id:

	commit aba373d04251691b5e0987a0fff2fa7007311810
	Author: Andrea Arcangeli <aarcange@redhat.com>
	Date:   Fri Mar 23 20:35:07 2012 +0100

	    autonuma: CPU follows memory algorithm
 
 From limited experiments, I have found that the better the task
 consolidation, the better the memory layout we achieve, which results
 in better performance.

2. Provides good benefit in whatever limited testing I have done so
  far. For example, it provides a _20+%_ improvement for numa01
  (autonuma-benchmark).

3. Since it doesn't depend on NUMA faulting, it doesn't have the overhead
  of having to get the scanning rate right.

4. Aims to extend the load balancer, especially when the CPUs are idling.

5. Code looks much simpler and naive to me. (But yes this is relative!!!)

Results on a 2 node 12 core system:

KernelVersion: 3.9.0 (with hyper threading)
		Testcase:      Min      Max      Avg
		  numa01:   220.12   246.96   239.18
		  numa02:    41.85    43.02    42.43

KernelVersion: 3.9.0 + code (with hyper threading)
		Testcase:      Min      Max      Avg  %Change
		  numa01:   174.97   219.38   198.99   20.20%
		  numa02:    38.24    38.50    38.38   10.55%

KernelVersion: 3.9.0 (noht)
		Testcase:      Min      Max      Avg
		  numa01:   118.72   121.04   120.23
		  numa02:    36.64    37.56    36.99

KernelVersion: 3.9.0 + code (noht)
		Testcase:      Min      Max      Avg  %Change
		  numa01:    92.95   113.36   108.21   11.11%
		  numa02:    36.76    38.91    37.34   -0.94%


/usr/bin/time -f %e %S %U %c %w 
i.e. elapsed, sys, user, involuntary and voluntary context switches
Best case performance for v3.9

numa01 		220.12 17.14 5041.27 522147 1273
numa02		 41.91 2.47 887.46 92079 8

Best case performance for v3.9 + code.
numa01			 174.97 17.46 4102.64 433804 1846
numa01_THREAD_ALLOC	 288.04 15.76 6783.86 718220 174
numa02			 38.41 0.75 905.65 95364 5
numa02_SMT		 46.43 0.55 487.30 66416 7

Best case memory layout for v3.9
9	416.44		5728.73	
19	356.42		5788.75	
30	722.49		5422.68	
40	1936.50		4208.67	
50	1372.40		4772.77	
60	1354.39		4790.78	
71	1512.39		4632.78	
81	1598.40		4546.77	
91	2242.40		3902.77	
101	2242.40		3902.78	
111	2654.41		3490.77	
122	2654.40		3490.77	
132	2976.30		3168.87	
142	2956.30		3188.87	
152	2956.30		3188.87	
162	2956.30		3188.87	
173	3044.30		3100.87	
183	3058.30		3086.87	
193	3204.20		2942.87	
203	3262.20		2884.89	
213	3262.18		2884.91	

Best case memory layout for v3.9 + code
10	6140.55		4.64	
20	3728.99		2416.18	
30	3066.45		3078.73	
40	3072.46		3072.73	
51	3072.46		3072.73	
61	3072.46		3072.73	
71	3072.46		3072.73	
81	3072.46		3072.73	
91	3072.46		3072.73	
102	3072.46		3072.73	
112	3072.46		3072.73	
122	3072.46		3072.73	
132	3072.46		3072.73	
142	3072.46		3072.73	
152	3072.46		3072.73	
163	3072.46		3072.73	
173	3072.44		3072.74	


Having said that, I am sure the experts have already thought of this
approach and might have reasons to discard it. Hence the code is not yet
in patchset format, nor do I have the extensive analysis that Mel has for
his patchset. I thought of posting the code out in some form so that I
know if there are any obvious pitfalls with this approach.

Here is the outline of the approach.

- Every process has a per-node array where we store the weight of all
  its tasks running on that node. This array gets updated on task
  enqueue/dequeue.

- Added a two-pass mechanism (somewhat taken from numacore but not
  exactly) when choosing tasks to move across nodes.

  In the first pass, choose only tasks that are ideal to be moved.
  While choosing a task, look at the per-node process arrays to see if
  moving the task helps.
  If the first pass fails to move a task, any task can be chosen on the
  second pass.
 
- If the regular load balancer (rebalance_domains()) fails to balance the
  load (or finds no imbalance) and the CPU is idle, use that CPU to
  consolidate tasks to the nodes by using the information in the per-node
  process arrays.

  Every idle CPU, if it doesn't have tasks queued after load balance,
  - will walk through the CPUs in its node and check if there are buddy
    tasks that are not part of the node but ideally should have been
    part of this node.
  - To make sure that we don't pull all buddy tasks and create an
    imbalance, we look at the load on the node, pinned tasks and the
    process's contribution to the load for this node.
  - Each CPU looks at the node which has the least number of buddy tasks
    running and tries to pull tasks from such nodes.

  - Once it finds the CPU from which to pull the tasks, it triggers
    active balancing. This type of active balancing makes just one
    pass, i.e. it only fetches tasks that increase NUMA locality.

Thanks for taking a look and providing your valuable inputs.

---8<---

sched: Using process weights to consolidate tasks

If we consolidate related tasks to one node, memory tends to follow to
that node. If the memory and tasks end up in one node, it results in
better performance. 

To achieve this, the code below tries to extend the current load
balancing while idling to move tasks in such a way that related tasks
end up based on the same node. Care is taken not to overload a node
while moving the tasks.

This code also adds iterations logic to the regular move task logic to
further consolidate tasks while performing the regular load balancing.

Not-yet-signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 fs/exec.c                |    4 +
 include/linux/migrate.h  |    1 -
 include/linux/mm_types.h |    1 +
 include/linux/sched.h    |    2 +
 kernel/fork.c            |   10 +-
 kernel/sched/core.c      |    2 +
 kernel/sched/fair.c      |  338 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h     |    4 +
 8 files changed, 344 insertions(+), 18 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a96a488..54589d0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -833,6 +833,10 @@ static int exec_mmap(struct mm_struct *mm)
 	activate_mm(active_mm, mm);
 	task_unlock(tsk);
 	arch_pick_mmap_layout(mm);
+#ifdef CONFIG_NUMA_BALANCING
+	mm->numa_weights = kzalloc(sizeof(unsigned long) * (nr_node_ids + 1), GFP_KERNEL);
+	tsk->task_load = 0;
+#endif
 	if (old_mm) {
 		up_read(&old_mm->mmap_sem);
 		BUG_ON(active_mm != old_mm);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3d..086bd33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -93,7 +93,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 
 #ifdef CONFIG_NUMA_BALANCING
 extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
 extern bool migrate_ratelimited(int node);
 #else
 static inline int migrate_misplaced_page(struct page *page, int node)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..bb402d3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -435,6 +435,7 @@ struct mm_struct {
 	 * a different node than Make PTE Scan Go Now.
 	 */
 	int first_nid;
+	unsigned long *numa_weights;
 #endif
 	struct uprobes_state uprobes_state;
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..2736ec6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
@@ -1505,6 +1506,7 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+	unsigned long task_load;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/fork.c b/kernel/fork.c
index 1766d32..14c7aea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -617,6 +617,9 @@ void mmput(struct mm_struct *mm)
 		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
+#ifdef CONFIG_NUMA_BALANCING
+		kfree(mm->numa_weights);
+#endif
 		if (!list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
 			list_del(&mm->mmlist);
@@ -823,9 +826,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
-#ifdef CONFIG_NUMA_BALANCING
-	mm->first_nid = NUMA_PTE_SCAN_INIT;
-#endif
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
@@ -844,6 +844,10 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 	if (mm->binfmt && !try_module_get(mm->binfmt->module))
 		goto free_pt;
 
+#ifdef CONFIG_NUMA_BALANCING
+	mm->first_nid = NUMA_PTE_SCAN_INIT;
+	mm->numa_weights = kzalloc(sizeof(unsigned long) * (nr_node_ids + 1), GFP_KERNEL);
+#endif
 	return mm;
 
 free_pt:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..82f8f79 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->task_load = 0;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_NUMA_BALANCING */
 }
@@ -6136,6 +6137,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..15d71a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -777,6 +777,8 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static unsigned long task_h_load(struct task_struct *p);
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * numa task sample period in ms
@@ -791,6 +793,60 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	unsigned long task_load = 0;
+	int curnode = cpu_to_node(cpu_of(rq));
+#ifdef CONFIG_SCHED_AUTOGROUP
+	struct sched_entity *se;
+
+	se = cfs_rq->tg->se[cpu_of(rq)];
+	if (!se)
+		return;
+
+	if (cfs_rq->load.weight) {
+		task_load =  p->se.load.weight * se->load.weight;
+		task_load /= cfs_rq->load.weight;
+	} else {
+		task_load = 0;
+	}
+#else
+	task_load = p->se.load.weight;
+#endif
+	p->task_load = 0;
+	if (!task_load)
+		return;
+
+	if (p->mm && p->mm->numa_weights) {
+		p->mm->numa_weights[curnode] += task_load;
+		p->mm->numa_weights[nr_node_ids] += task_load;
+	}
+
+	if (p->nr_cpus_allowed != num_online_cpus())
+		rq->pinned_load += task_load;
+	p->task_load = task_load;
+}
+
+static void account_numa_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	unsigned long task_load = p->task_load;
+	int curnode = cpu_to_node(cpu_of(rq));
+
+	p->task_load = 0;
+	if (!task_load)
+		return;
+
+	if (p->mm && p->mm->numa_weights) {
+		p->mm->numa_weights[curnode] -= task_load;
+		p->mm->numa_weights[nr_node_ids] -= task_load;
+	}
+
+	if (p->nr_cpus_allowed != num_online_cpus())
+		rq->pinned_load -= task_load;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq;
@@ -999,6 +1055,12 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+}
+static void account_numa_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1008,8 +1070,11 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
-		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	if (entity_is_task(se)) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		list_add(&se->group_node, &rq->cfs_tasks);
+	}
 #endif
 	cfs_rq->nr_running++;
 }
@@ -1713,6 +1778,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
+	if (entity_is_task(se))
+		account_numa_enqueue(cfs_rq, task_of(se));
 
 	if (cfs_rq->nr_running == 1) {
 		list_add_leaf_cfs_rq(cfs_rq);
@@ -1810,6 +1877,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
+	if (entity_is_task(se))
+		account_numa_dequeue(cfs_rq, task_of(se));
 }
 
 /*
@@ -3292,6 +3361,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static int
+check_numa_affinity(struct task_struct *p, int cpu, int prev_cpu)
+{
+	struct mm_struct *mm = p->mm;
+	struct rq *rq = cpu_rq(prev_cpu);
+	int source_node = cpu_to_node(prev_cpu);
+	int target_node = cpu_to_node(cpu);
+
+	if (mm && mm->numa_weights) {
+		unsigned long *weights = mm->numa_weights;
+
+		if (weights[target_node] > weights[source_node]) {
+			if (!rq->ab_node_load || weights[target_node] < rq->ab_node_load)
+				return 1;
+		}
+	}
+	return 0;
+}
+#else
+static int
+check_numa_affinity(struct task_struct *p, int cpu, int prev_cpu)
+{
+	return 0;
+}
+#endif
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3317,7 +3413,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return prev_cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) && check_numa_affinity(p, cpu, prev_cpu))
 			want_affine = 1;
 		new_cpu = prev_cpu;
 	}
@@ -3819,6 +3915,7 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+	unsigned int		iterations;
 };
 
 /*
@@ -3865,6 +3962,37 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static bool force_migrate(struct lb_env *env, struct task_struct *p)
+{
+	struct mm_struct *mm = p->mm;
+	struct rq *rq = env->src_rq;
+	int source_node = cpu_to_node(env->src_cpu);
+	int target_node = cpu_to_node(env->dst_cpu);
+
+	if (env->sd->nr_balance_failed > env->sd->cache_nice_tries)
+		return true;
+
+	if (!(env->sd->flags & SD_NUMA))
+		return false;
+
+	if (mm && mm->numa_weights) {
+		unsigned long *weights = mm->numa_weights;
+
+		if (weights[target_node] > weights[source_node]) {
+			if (!rq->ab_node_load || weights[target_node] < rq->ab_node_load)
+				return true;
+		}
+	}
+	return false;
+}
+#else
+static bool force_migrate(struct lb_env *env, struct task_struct *p)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -3916,26 +4044,51 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 1) task is cache cold, or
 	 * 2) too many balance attempts have failed.
 	 */
-
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
-	if (!tsk_cache_hot ||
-		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
+	if (tsk_cache_hot) {
+		if (force_migrate(env, p)) {
 #ifdef CONFIG_SCHEDSTATS
-		if (tsk_cache_hot) {
 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
 			schedstat_inc(p, se.statistics.nr_forced_migrations);
-		}
 #endif
-		return 1;
-	}
-
-	if (tsk_cache_hot) {
+			return 1;
+		}
 		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
 		return 0;
 	}
 	return 1;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static int preferred_node(struct task_struct *p, struct lb_env *env)
+{
+	struct mm_struct *mm = p->mm;
+
+	if (!(env->sd->flags & SD_NUMA))
+		return false;
+
+	if (mm && mm->numa_weights) {
+		struct rq *rq = env->src_rq;
+		unsigned long *weights = mm->numa_weights;
+		int target_node = cpu_to_node(env->dst_cpu);
+		int source_node = cpu_to_node(env->src_cpu);
+
+		if (weights[target_node] > weights[source_node]) {
+			if (!rq->ab_node_load || weights[target_node] < rq->ab_node_load)
+				return 1;
+		}
+	}
+	if (env->iterations)
+		return 1;
+	return 0;
+}
+#else
+static int preferred_node(struct task_struct *p, struct lb_env *env)
+{
+	return 0;
+}
+#endif
+
 /*
  * move_one_task tries to move exactly one task from busiest to this_rq, as
  * part of active balancing operations within "domain".
@@ -3947,7 +4100,11 @@ static int move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
+again:
 	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+		if (!preferred_node(p, env))
+			continue;
+
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
 
@@ -3955,6 +4112,7 @@ static int move_one_task(struct lb_env *env)
 			continue;
 
 		move_task(p, env);
+
 		/*
 		 * Right now, this is only the second place move_task()
 		 * is called, so we can safely collect move_task()
@@ -3963,11 +4121,12 @@ static int move_one_task(struct lb_env *env)
 		schedstat_inc(env->sd, lb_gained[env->idle]);
 		return 1;
 	}
+	if (!env->iterations++  && env->src_rq->active_balance != 2)
+		goto again;
+
 	return 0;
 }
 
-static unsigned long task_h_load(struct task_struct *p);
-
 static const unsigned int sched_nr_migrate_break = 32;
 
 /*
@@ -4002,6 +4161,9 @@ static int move_tasks(struct lb_env *env)
 			break;
 		}
 
+		if (!preferred_node(p, env))
+			goto next;
+
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 			goto next;
 
@@ -5005,6 +5167,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+		.iterations	= 1,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -5047,6 +5210,11 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		env.src_rq    = busiest;
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
+		if (sd->flags & SD_NUMA) {
+			if (cpu_to_node(env.dst_cpu) != cpu_to_node(env.src_cpu))
+				env.iterations = 0;
+		}
+
 		update_h_load(env.src_cpu);
 more_balance:
 		local_irq_save(flags);
@@ -5066,6 +5234,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			goto more_balance;
 		}
 
+		if (!ld_moved && !env.iterations) {
+			env.iterations++;
+			env.loop	 = 0;
+			env.loop_break	 = sched_nr_migrate_break;
+			goto more_balance;
+		}
+
 		/*
 		 * some other cpu did the load balance for us.
 		 */
@@ -5152,6 +5327,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			if (!busiest->active_balance) {
 				busiest->active_balance = 1;
 				busiest->push_cpu = this_cpu;
+#ifdef CONFIG_NUMA_BALANCING
+				busiest->ab_node_load = 0;
+#endif
 				active_balance = 1;
 			}
 			raw_spin_unlock_irqrestore(&busiest->lock, flags);
@@ -5313,8 +5491,14 @@ static int active_load_balance_cpu_stop(void *data)
 			.src_cpu	= busiest_rq->cpu,
 			.src_rq		= busiest_rq,
 			.idle		= CPU_IDLE,
+			.iterations	= 1,
 		};
 
+		if ((sd->flags & SD_NUMA)) {
+			if (cpu_to_node(env.dst_cpu) != cpu_to_node(env.src_cpu))
+				env.iterations = 0;
+		}
+
 		schedstat_inc(sd, alb_count);
 
 		if (move_one_task(&env))
@@ -5326,6 +5510,9 @@ static int active_load_balance_cpu_stop(void *data)
 	double_unlock_balance(busiest_rq, target_rq);
 out_unlock:
 	busiest_rq->active_balance = 0;
+#ifdef CONFIG_NUMA_BALANCING
+	busiest_rq->ab_node_load = 0;
+#endif
 	raw_spin_unlock_irq(&busiest_rq->lock);
 	return 0;
 }
@@ -5464,6 +5651,59 @@ void update_max_interval(void)
 	max_load_balance_interval = HZ*num_online_cpus()/10;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static int migrate_from_cpu(struct mm_struct *this_mm, int this_cpu, int nid)
+{
+	struct mm_struct *mm;
+	struct rq *rq;
+	int cpu;
+
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		rq = cpu_rq(cpu);
+		mm = rq->curr->mm;
+
+		if (mm == this_mm) {
+			if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(rq->curr)))
+				return cpu;
+		}
+	}
+	return -1;
+}
+
+static int migrate_from_node(unsigned long *weights, unsigned long load, int nid)
+{
+	unsigned long least_weight = weights[nid];
+	unsigned long node_load;
+	int least_node = -1;
+	int node, cpu;
+
+	for_each_online_node(node) {
+		if (node == nid)
+			continue;
+		if (weights[node] == 0)
+			continue;
+
+		node_load = 0;
+		for_each_cpu(cpu, cpumask_of_node(node)) {
+			node_load += weighted_cpuload(cpu);
+		}
+
+		if (load > node_load) {
+			if (load * nr_node_ids >= node_load * (nr_node_ids + 1))
+				continue;
+			if (weights[node] == least_weight)
+				continue;
+		}
+
+		if (weights[node] <=  least_weight) {
+			least_weight = weights[node];
+			least_node = node;
+		}
+	}
+	return least_node;
+}
+#endif
+
 /*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
@@ -5529,6 +5769,76 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 		if (!balance)
 			break;
 	}
+#ifdef CONFIG_NUMA_BALANCING
+	if (!rq->nr_running) {
+		struct mm_struct *prev_mm = NULL;
+		unsigned long load = 0, pinned_load = 0;
+		unsigned long *weights = NULL;
+		int node, nid, dcpu;
+		int this_cpu = -1;
+
+		nid = cpu_to_node(cpu);
+
+		/* Traverse only the allowed CPUs */
+		for_each_cpu(dcpu, cpumask_of_node(nid)) {
+			load += weighted_cpuload(dcpu);
+			pinned_load += cpu_rq(dcpu)->pinned_load;
+		}
+		for_each_cpu(dcpu, cpumask_of_node(nid)) {
+			struct rq *rq = cpu_rq(dcpu);
+			struct mm_struct *mm = rq->curr->mm;
+
+			if (!mm || !mm->numa_weights)
+				continue;
+
+			weights = mm->numa_weights;
+			if (!weights[nr_node_ids] || !weights[nid])
+				continue;
+
+			if (weights[nid] + pinned_load >= load)
+				break;
+			if (weights[nr_node_ids]/weights[nid] > nr_node_ids)
+				continue;
+
+			if (mm == prev_mm)
+				continue;
+
+			prev_mm = mm;
+			node = migrate_from_node(weights, load, nid);
+			if (node == -1)
+				continue;
+			this_cpu = migrate_from_cpu(mm, cpu, node);
+			if (this_cpu != -1)
+				break;
+		}
+		if (this_cpu != -1) {
+			struct rq *this_rq;
+			unsigned long flags;
+			int active_balance;
+
+			this_rq = cpu_rq(this_cpu);
+			active_balance = 0;
+
+			/*
+			 * ->active_balance synchronizes accesses to
+			 * ->active_balance_work.  Once set, it's cleared
+			 * only after active load balance is finished.
+			 */
+			raw_spin_lock_irqsave(&this_rq->lock, flags);
+			if (!this_rq->active_balance) {
+				this_rq->active_balance = 2;
+				this_rq->push_cpu = cpu;
+				this_rq->ab_node_load = load - pinned_load;
+				active_balance = 1;
+			}
+			raw_spin_unlock_irqrestore(&this_rq->lock, flags);
+
+			if (active_balance) {
+				stop_one_cpu_nowait(this_cpu, active_load_balance_cpu_stop, this_rq, &this_rq->active_balance_work);
+			}
+		}
+	}
+#endif
 	rcu_read_unlock();
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..0011bba 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -484,6 +484,10 @@ struct rq {
 #endif
 
 	struct sched_avg avg;
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long pinned_load;
+	unsigned long ab_node_load;
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-04  9:23     ` Mel Gorman
  2013-07-04 14:24       ` Rik van Riel
@ 2013-07-04 19:36       ` Johannes Weiner
  2013-07-05  9:41         ` Mel Gorman
  2013-07-05 10:48         ` Peter Zijlstra
  1 sibling, 2 replies; 43+ messages in thread
From: Johannes Weiner @ 2013-07-04 19:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 10:23:56AM +0100, Mel Gorman wrote:
> On Wed, Jul 03, 2013 at 05:56:54PM -0400, Johannes Weiner wrote:
> > On Wed, Jul 03, 2013 at 03:21:34PM +0100, Mel Gorman wrote:
> > > Ideally it would be possible to distinguish between NUMA hinting faults
> > > that are private to a task and those that are shared. This would require
> > > that the last task that accessed a page for a hinting fault would be
> > > recorded which would increase the size of struct page. Instead this patch
> > > approximates private pages by assuming that faults that pass the two-stage
> > > filter are private pages and all others are shared. The preferred NUMA
> > > node is then selected based on where the maximum number of approximately
> > > private faults were measured. Shared faults are not taken into
> > > consideration for a few reasons.
> > 
> > Ingo had a patch that would just encode a few bits of the PID along
> > with the last_nid (last_cpu in his case) member of struct page.  No
> > extra space required and should be accurate enough.
> > 
> 
> Yes, I'm aware of it. I noted in the changelog that ideally we'd record
> the task both to remind myself and so that the patch that introduces it
> could refer to this changelog so there is some sort of logical progression
> for reviewers.
> 
> I was not keen on the use of last_cpu because I felt there was an implicit
> assumption that scanning would always be fast enough to record hinting
> faults before a task got moved to another CPU for any reason. I feared this
> would be worse as memory and task sizes increased. That's why I stayed
> with tracking the nid for the two-stage filter until it could be proven
> it was insufficient for some reason.
> 
> The lack of anything resembling pid tracking now is that the series is
> already a bit of a mouthful and I thought the other parts were more
> important for now.

Fair enough.

> > Otherwise this is blind to sharedness within the node the task is
> > currently running on, right?
> > 
> 
> Yes, it is.
> 
> > > First, if there are many tasks sharing the page then they'll all move
> > > towards the same node. The node will be compute overloaded and then
> > > scheduled away later only to bounce back again. Alternatively the shared
> > > tasks would just bounce around nodes because the fault information is
> > > effectively noise. Either way accounting for shared faults the same as
> > > private faults may result in lower performance overall.
> > 
> > When the node with many shared pages is compute overloaded then there
> > is arguably not an optimal node for the tasks and moving them off is
> > inevitable. 
> 
> Yes. If such an event occurs then the ideal is that the task interleaves
> between a subset of nodes. The situation could be partially detected by
> tracking if the number of historical faults is approximately larger than
> the preferred node and then interleave between the top N nodes most faulted
> nodes until the working set fits. Starting the interleave should just be
> a matter of coding. The difficulty is correctly backing off that if there
> is a phase change.

Agreed, optimizing second-best placement can be dealt with later.  I'm
worried about optimal placement, though.

And I'm worried about skewing memory access statistics in order to
steer future situations that the CPU load balancer should handle instead.

> > However, the node with the most page accesses, private or
> > shared, is still the preferred node from a memory stand point.
> > Compute load being equal, the task should go to the node with 2GB of
> > shared memory and not to the one with 2 private pages.
> > 
> 
> Agreed. The level of shared vs private needs to be detected. The problem
> here is that detecting private dominated workloads is not straight-forward,
> particularly as the scan rate slows as we've already discussed.

I was going for the opposite conclusion: that it does not matter
whether memory is accessed privately or in a shared fashion, because
there is no obvious connection to its access frequency, not to me at
least.  Short of accurate access frequency sampling and supportive
data, the node with the most accesses in general should be the preferred
node, not the one with the most private accesses because it's the
smaller assumption to make.

> > > The second reason is based on a hypothetical workload that has a small
> > > number of very important, heavily accessed private pages but a large shared
> > > array. The shared array would dominate the number of faults and be selected
> > > as a preferred node even though it's the wrong decision.
> > 
> > That's a scan granularity problem and I can't see how you solve it
> > with ignoring the shared pages. 
> 
> I acknowledge it's a problem and basically I'm making a big assumption
> that private-dominated workloads are going to be the common case. Threaded
> applications on UMA with heavy amounts of shared data (within cache lines)
> already suck in terms of performance so I'm expecting programmers already
> try to avoid this sort of sharing. Obviously we are at a page granularity
> here so the assumption will depend entirely on alignments and buffer sizes
> so it might still fall apart.

Don't basically all VM-based multithreaded programs have this usage
pattern?  The whole runtime (text, heap) is shared between threads.
If some thread-local memory spills over to another node, should the
scheduler move this thread off node from a memory standpoint?  I don't
think so at all.  I would expect it to always gravitate back towards
this node with the VM on it, only get moved off for CPU load reasons,
and get moved back as soon as the load situation permits.

Meanwhile, if there is little shared memory and private memory spills
over to other nodes, you still don't know which node's memory is more
frequently used beyond scan period granularity.  Ignoring shared
memory accesses would not actually help find the right node in this
case.  You might even make it worse: since private accesses are all
treated equally, you assume equal access frequency among them.  Shared
accesses could now be the distinguishing data point to tell which node
sees more memory accesses.

> I think that dealing with this specific problem is a series all on its
> own and treating it on its own in isolation would be best.

The scan granularity issue is indeed a separate issue; I'm just really
suspicious of the assumption you make to attempt working around it in
this patch, because of the risk of moving away from optimal placement
prematurely.

The null hypothesis is that there is no connection between accesses
being shared or private and their actual frequency.  I would be
interested if this assumption has had a positive effect in your
testing or if this is based on the theoretical cases you mentioned in
the changelog, i.e. why you chose to make the bigger assumption.
-ENODATA ;-)

I think answering this question is precisely the scope of this patch.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-04 19:36       ` Johannes Weiner
@ 2013-07-05  9:41         ` Mel Gorman
  2013-07-05 10:48         ` Peter Zijlstra
  1 sibling, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-05  9:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 03:36:38PM -0400, Johannes Weiner wrote:
> On Thu, Jul 04, 2013 at 10:23:56AM +0100, Mel Gorman wrote:
> > On Wed, Jul 03, 2013 at 05:56:54PM -0400, Johannes Weiner wrote:
> > > On Wed, Jul 03, 2013 at 03:21:34PM +0100, Mel Gorman wrote:
> > > > Ideally it would be possible to distinguish between NUMA hinting faults
> > > > that are private to a task and those that are shared. This would require
> > > > that the last task that accessed a page for a hinting fault would be
> > > > recorded which would increase the size of struct page. Instead this patch
> > > > approximates private pages by assuming that faults that pass the two-stage
> > > > filter are private pages and all others are shared. The preferred NUMA
> > > > node is then selected based on where the maximum number of approximately
> > > > private faults were measured. Shared faults are not taken into
> > > > consideration for a few reasons.
> > > 
> > > Ingo had a patch that would just encode a few bits of the PID along
> > > with the last_nid (last_cpu in his case) member of struct page.  No
> > > extra space required and should be accurate enough.
> > > 
> > 
> > Yes, I'm aware of it. I noted in the changelog that ideally we'd record
> > the task both to remind myself and so that the patch that introduces it
> > could refer to this changelog so there is some sort of logical progression
> > for reviewers.
> > 
> > I was not keen on the use of last_cpu because I felt there was an implicit
> > assumption that scanning would always be fast enough to record hinting
> > faults before a task got moved to another CPU for any reason. I feared this
> > would be worse as memory and task sizes increased. That's why I stayed
> > with tracking the nid for the two-stage filter until it could be proven
> > it was insufficient for some reason.
> > 
> > The lack of anything resembling pid tracking now is that the series is
> > already a bit of a mouthful and I thought the other parts were more
> > important for now.
> 
> Fair enough.
> 

I prototyped a node/pid tracker where the node is used for misplacement
detection and the pid is used for private/shared detection. Tests are
running. I'll include it in the next release if it works out.
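
For reference, the encoding itself is nothing fancy. Something along
these lines, although treat it as an illustrative sketch only; the
helper names and the number of pid bits are placeholders rather than
what is in the test queue:

#define NIDPID_PID_BITS		8
#define NIDPID_PID_MASK		((1 << NIDPID_PID_BITS) - 1)

/* Pack the node and a few low-order pid bits into the existing field */
static inline int nidpid_encode(int nid, int pid)
{
	return (nid << NIDPID_PID_BITS) | (pid & NIDPID_PID_MASK);
}

static inline int nidpid_to_nid(int nidpid)
{
	return nidpid >> NIDPID_PID_BITS;
}

/* Only good for an approximate "same task as last time?" check */
static inline int nidpid_to_pid(int nidpid)
{
	return nidpid & NIDPID_PID_MASK;
}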

> > > Otherwise this is blind to sharedness within the node the task is
> > > currently running on, right?
> > > 
> > 
> > Yes, it is.
> > 
> > > > First, if there are many tasks sharing the page then they'll all move
> > > > towards the same node. The node will be compute overloaded and then
> > > > scheduled away later only to bounce back again. Alternatively the shared
> > > > tasks would just bounce around nodes because the fault information is
> > > > effectively noise. Either way accounting for shared faults the same as
> > > > private faults may result in lower performance overall.
> > > 
> > > When the node with many shared pages is compute overloaded then there
> > > is arguably not an optimal node for the tasks and moving them off is
> > > inevitable. 
> > 
> > Yes. If such an event occurs then the ideal is that the task interleaves
> > between a subset of nodes. The situation could be partially detected by
> > tracking if the number of historical faults is approximately larger than
> > the preferred node and then interleave between the top N nodes most faulted
> > nodes until the working set fits. Starting the interleave should just be
> > a matter of coding. The difficulty is correctly backing off that if there
> > is a phase change.
> 
> Agreed, optimizing second-best placement can be dealt with later.  I'm
> worried about optimal placement, though.
> 
> And I'm worried about skewing memory access statistics in order to
> steer future situations the CPU load balancer should handle instead.
> 

And I'm worried about shared accesses drowning out any useful data. Ok,
at the very least I can test this:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1647a87..26b57b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4092,7 +4092,8 @@ static bool migrate_locality_prepare(struct task_struct *p, struct lb_env *env,
 /* Returns true if the destination node has incurred more faults */
 static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 {
-	int src_nid, dst_nid;
+	int src_nid, dst_nid, priv;
+	unsigned long src_faults = 0, dst_faults = 0;
 
 	if (!migrate_locality_prepare(p, env, &src_nid, &dst_nid))
 		return false;
@@ -4101,12 +4102,13 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 	if (p->numa_preferred_nid == dst_nid)
 		return true;
 
-	/*
-	 * Move towards node if there were a higher number of private
-	 * NUMA hinting faults recorded on it
-	 */
-	if (p->numa_faults[task_faults_idx(dst_nid, 1)] >
-	    p->numa_faults[task_faults_idx(src_nid, 1)])
+	/* Move towards node if there were more NUMA hinting faults recorded */
+	for (priv = 0; priv < 2; priv++) {
+		src_faults += p->numa_faults[task_faults_idx(src_nid, priv)];
+		dst_faults += p->numa_faults[task_faults_idx(dst_nid, priv)];
+	}
+
+	if (dst_faults > src_faults)
 		return true;
 
 	return false;

It still selects the preferred node based on private accesses, but when
considering a migration to any other node it takes the total number of
faults into account.

> > > However, the node with the most page accesses, private or
> > > shared, is still the preferred node from a memory stand point.
> > > Compute load being equal, the task should go to the node with 2GB of
> > > shared memory and not to the one with 2 private pages.
> > > 
> > 
> > Agreed. The level of shared vs private needs to be detected. The problem
> > here is that detecting private dominated workloads is not straight-forward,
> > particularly as the scan rate slows as we've already discussed.
> 
> I was going for the opposite conclusion: that it does not matter
> whether memory is accessed privately or in a shared fashion, because
> there is no obvious connection to its access frequency, not to me at
> least. 

We cannot accurately detect access frequency because we are limited to
receiving the hinting faults.

> Short of accurate access frequency sampling and supportive
> data, the node with most accesses in general should be the preferred
> node, not the one with the most private accesses because it's the
> smaller assumption to make.
> 

Workloads with many shared accesses (numa01 on > 2 node machines) will
bounce all over the place.

> > > > The second reason is based on a hypothetical workload that has a small
> > > > number of very important, heavily accessed private pages but a large shared
> > > > array. The shared array would dominate the number of faults and be selected
> > > > as a preferred node even though it's the wrong decision.
> > > 
> > > That's a scan granularity problem and I can't see how you solve it
> > > with ignoring the shared pages. 
> > 
> > I acknowledge it's a problem and basically I'm making a big assumption
> > that private-dominated workloads are going to be the common case. Threaded
> > application on UMA with heavy amounts of shared data (within cache lines)
> > already suck in terms of performance so I'm expecting programmers already
> > try and avoid this sort of sharing. Obviously we are at a page granularity
> > here so the assumption will depend entirely on alignments and buffer sizes
> > so it might still fall apart.
> 
> Don't basically all VM-based multithreaded programs have this usage
> pattern?  The whole runtime (text, heap) is shared between threads.

There are allocators with thread-local storage, and large allocations in
glibc are satisfied using mmap, which reduces the opportunity for false
sharing due to unaligned buffers.
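
As a quick userspace illustration of the glibc point (this assumes glibc
malloc and its default mmap threshold of 128K; it has nothing to do with
the kernel side of the series):

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/*
	 * Allocations above the mmap threshold get their own page-aligned
	 * mmap()ed region, so large buffers from different threads never
	 * land in the same page. Small allocations may still share pages
	 * within an arena.
	 */
	mallopt(M_MMAP_THRESHOLD, 128 * 1024);

	void *small = malloc(4096);
	void *large = malloc(1 << 20);

	printf("small=%p large=%p\n", small, large);

	free(large);
	free(small);
	return 0;
}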

> If some thread-local memory spills over to another node, should the
> scheduler move this thread off node from a memory standpoint?

If thread-local memory spills over to another node, there will be attempts
to migrate it towards where the task is running. If those migrations fail,
then what you'd expect is that faults would accumulate on the remote node
and eventually it becomes the preferred node, but that is not what currently
happens. The tracking in the failed migration case needs revisiting.

> I don't
> think so at all.  I would expect it to always gravitate back towards
> this node with the VM on it, only get moved off for CPU load reasons,
> and get moved back as soon as the load situation permits.
> 
> Meanwhile, if there is little shared memory and private memory spills
> over to other nodes, you still don't know which node's memory is more
> frequently used beyond scan period granularity. 

That's true whether we account private/shared separately or not.

> Ignoring shared
> memory accesses would not actually help finding the right node in this
> case.  You might even make it worse: since private accesses are all
> treated equally, you assume equal access frequency among them.  Shared
> accesses could now be the distinguishing data point to tell which node
> sees more memory accesses.
> 

Similar points can be made the other way as well. If the shared/private
distinction is ignored then infrequent accesses can dominate continual
accesses due to the limitations of the scan-based sampling.

> > I think that dealing with this specific problem is a series all on its
> > own and treating it on its own in isolation would be best.
> 
> The scan granularity issue is indeed a separate issue, I'm just really
> suspicious of the assumption you make to attempt working around it in
> this patch, because of the risk of moving away from optimal placement
> prematurely.
> 

And the crux of the matter is whether the optimal placement is hurt by
prioritising private accesses or not. Initially I had been testing for
this but the last data I have on this is now relatively old. It's simple
enough to test by patching to make priv = 0 in task_numa_fault.
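
Roughly the following, although the exact placement depends on how priv
is derived from the two-stage filter inside task_numa_fault() and the
accounting line is written from memory, so treat it as a sketch of the
test hack rather than a patch against any particular tree:

@@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
+	/* Test hack: account every hinting fault as shared */
+	priv = 0;
+
 	p->numa_faults[task_faults_idx(node, priv)] += pages;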

> The null hypothesis is that there is no connection between accesses
> being shared or private and their actual frequency.  I would be
> interested if this assumption has had a positive effect in your
> testing or if this is based on the theoretical cases you mentioned in
> the changelog, i.e. why you chose to make the bigger assumption.
> -ENODATA ;-)
> 

It was tested for in v1 but I did not publish the relevant data or every
patch series would be a paper. When I was testing for it, there was a
benefit to the patch, but the series has changed too much since then to
state with certainty that it's still correct.

I'll see if I can rework the series so that the private/shared split
happens later, making it easier to measure whether it benefits in
isolation.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 11/13] sched: Check current->mm before allocating NUMA faults
  2013-07-04 12:48   ` Srikar Dronamraju
@ 2013-07-05 10:07     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-05 10:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 06:18:23PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-07-03 15:21:38]:
> 
> > task_numa_placement checks current->mm but after buffers for faults
> > have already been uselessly allocated. Move the check earlier.
> > 
> > [peterz@infradead.org: Identified the problem]
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  kernel/sched/fair.c | 22 ++++++++++++++--------
> >  1 file changed, 14 insertions(+), 8 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 336074f..3c796b0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -870,8 +870,6 @@ static void task_numa_placement(struct task_struct *p)
> >  	int seq, nid, max_nid = 0;
> >  	unsigned long max_faults = 0;
> > 
> > -	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
> > -		return;
> >  	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
> >  	if (p->numa_scan_seq == seq)
> >  		return;
> > @@ -945,6 +943,12 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
> >  	if (!sched_feat_numa(NUMA))
> >  		return;
> > 
> > +	/* for example, ksmd faulting in a user's mm */
> > +	if (!p->mm) {
> > +		p->numa_scan_period = sysctl_numa_balancing_scan_period_max;
> 
> Naive question:
> Why are we resetting the scan_period?
> 

At the time I wrote it I was thinking of tick times and meant to recheck
whether it was necessary, but then it slipped my mind. The reset is
unnecessary as curr->mm is already checked.
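
In other words, that hunk should reduce to just the early return. A
sketch of what I expect the next version to carry:

	/* for example, ksmd faulting in a user's mm */
	if (!p->mm)
		return;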

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC WIP] Process weights based scheduling for better consolidation
  2013-07-04 18:02 ` [PATCH RFC WIP] Process weights based scheduling for better consolidation Srikar Dronamraju
@ 2013-07-05 10:16   ` Peter Zijlstra
  2013-07-05 12:49     ` Srikar Dronamraju
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-05 10:16 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 11:32:27PM +0530, Srikar Dronamraju wrote:
> Here is an approach to look at numa balanced scheduling from a non numa fault
> angle. This approach uses process weights instead of faults as a basis to
> move or bring tasks together.

That doesn't make any sense..... how would weight be related to numa
placement?

What it appears to do is simply group tasks based on ->mm. And by
keeping them somewhat sticky to the same node it gets locality.

What about multi-process shared memory workloads? It's one of the things
I disliked about autonuma. It completely disregards the multi-process
scenario.

If you want to go without faults; you also won't migrate memory along
and if you just happen to place your workload elsewhere you've no idea
where your memory is. If you have the faults, you might as well account
them to get a notion of where the memory is at; it's nearly free at that
point anyway.

Load spikes/fluctuations can easily lead to transient task movement to
keep balance. If these movements are indeed transient you want to return
to where you came from; however if they are not.. you want the memory to
come to you.

> +static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
> +{
> +	struct rq *rq = rq_of(cfs_rq);
> +	unsigned long task_load = 0;
> +	int curnode = cpu_to_node(cpu_of(rq));
> +#ifdef CONFIG_SCHED_AUTOGROUP
> +	struct sched_entity *se;
> +
> +	se = cfs_rq->tg->se[cpu_of(rq)];
> +	if (!se)
> +		return;
> +
> +	if (cfs_rq->load.weight) {
> +		task_load =  p->se.load.weight * se->load.weight;
> +		task_load /= cfs_rq->load.weight;
> +	} else {
> +		task_load = 0;
> +	}
> +#else
> +	task_load = p->se.load.weight;
> +#endif

This looks broken; didn't you want to use task_h_load() here? There's
nothing autogroup specific about task_load. If anything you want to do
full cgroup which I think reduces to task_h_load() here.
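
To illustrate, something along these lines is what I had in mind -- only
a sketch on top of your posted code, with the rest of the bookkeeping
kept as-is:

static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
{
	struct rq *rq = rq_of(cfs_rq);
	int curnode = cpu_to_node(cpu_of(rq));
	/*
	 * task_h_load() already folds in the hierarchical group weights,
	 * so no autogroup (or any other cgroup) special case is needed.
	 */
	unsigned long task_load = task_h_load(p);

	p->task_load = 0;
	if (!task_load)
		return;

	if (p->mm && p->mm->numa_weights) {
		p->mm->numa_weights[curnode] += task_load;
		p->mm->numa_weights[nr_node_ids] += task_load;
	}

	if (p->nr_cpus_allowed != num_online_cpus())
		rq->pinned_load += task_load;
	p->task_load = task_load;
}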

> +	p->task_load = 0;
> +	if (!task_load)
> +		return;
> +
> +	if (p->mm && p->mm->numa_weights) {
> +		p->mm->numa_weights[curnode] += task_load;
> +		p->mm->numa_weights[nr_node_ids] += task_load;
> +	}
> +
> +	if (p->nr_cpus_allowed != num_online_cpus())
> +		rq->pinned_load += task_load;
> +	p->task_load = task_load;
> +}
> +

> @@ -5529,6 +5769,76 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
>  		if (!balance)
>  			break;
>  	}
> +#ifdef CONFIG_NUMA_BALANCING
> +	if (!rq->nr_running) {

This would only work for under utilized systems...

> +	}
> +#endif

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-04 19:36       ` Johannes Weiner
  2013-07-05  9:41         ` Mel Gorman
@ 2013-07-05 10:48         ` Peter Zijlstra
  1 sibling, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-05 10:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 03:36:38PM -0400, Johannes Weiner wrote:

> I was going for the opposite conclusion: that it does not matter
> whether memory is accessed privately or in a shared fashion, because
> there is no obvious connection to its access frequency, not to me at
> least.  

There is a relation to access freq; however due to the low sample rate
(once every 100ms or so) we obviously miss all high freq data there.

> > I acknowledge it's a problem and basically I'm making a big assumption
> > that private-dominated workloads are going to be the common case. Threaded
> > application on UMA with heavy amounts of shared data (within cache lines)
> > already suck in terms of performance so I'm expecting programmers already
> > try and avoid this sort of sharing. Obviously we are at a page granularity
> > here so the assumption will depend entirely on alignments and buffer sizes
> > so it might still fall apart.
> 
> Don't basically all VM-based multithreaded programs have this usage
> pattern?  The whole runtime (text, heap) is shared between threads.
> If some thread-local memory spills over to another node, should the
> scheduler move this thread off node from a memory standpoint?  I don't
> think so at all.  I would expect it to always gravitate back towards
> this node with the VM on it, only get moved off for CPU load reasons,
> and get moved back as soon as the load situation permits.

All data being allocated on the same heap and being shared in the access
sense doesn't imply all threads will indeed use all data; even if TLS is
not used.

For a concurrent program to reach any useful level of concurrency gain
you need data partitioning. Threads must work on different data sets
otherwise they'd constantly be waiting on serialization -- which makes
your concurrency gain tank.

There are two main issues here:

Firstly, the question is whether there is much false sharing at page
granularity. Typically you want the compute time per data fragment to be
significantly higher than the demux + mux overhead, which favours larger
data units.

Secondly, you want your scan period to be at most half the compute time
per data fragment (with the current ~100ms scan period that means at
least ~200ms of compute per fragment). Otherwise you run the risk of
never seeing the data as local to that thread.

So for optimal benefit you want to minimize sharing pages between data
fragments and have your data fragment compute time as long as possible.
Luckily both are also goals for maximizing concurrency gain so we should
be good there.

This should cover all 'traditional' concurrent stuff; most of the 'new'
concurrency stuff can be different though -- some of it was simply never
thought out or designed for concurrency and just hopes it works. Others,
most notably the multi-core concurrency stuff, assume the demux+mux costs
are _very_ low and therefore the data fragment and associated compute
time shrink to useless levels :/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC WIP] Process weights based scheduling for better consolidation
  2013-07-05 10:16   ` Peter Zijlstra
@ 2013-07-05 12:49     ` Srikar Dronamraju
  0 siblings, 0 replies; 43+ messages in thread
From: Srikar Dronamraju @ 2013-07-05 12:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Peter Zijlstra <peterz@infradead.org> [2013-07-05 12:16:54]:

> On Thu, Jul 04, 2013 at 11:32:27PM +0530, Srikar Dronamraju wrote:
> > Here is an approach to look at numa balanced scheduling from a non numa fault
> > angle. This approach uses process weights instead of faults as a basis to
> > move or bring tasks together.
> 
> That doesn't make any sense..... how would weight be related to numa
> placement?
> 

Since it groups tasks to a node, it makes sure that all the memory moves
to that node (courtesy of the existing numa balancing in the kernel). So
we have both the tasks and the memory on the same node.

> What it appears to do is simply group tasks based on ->mm. And by
> keeping them somewhat sticky to the same node it gets locality.
> 

Yes, that's the key thing it tries to achieve.

> What about multi-process shared memory workloads? It's one of the things
> I disliked about autonuma. It completely disregards the multi-process
> scenario.
> 

Yes, this approach doesn't work that well with multi-process shared
memory workloads. However, Mel's current proposal also disregards
shared pages for the preferred_node logic. Further, if we consider
multiple processes sharing memory, then they would probably be sharing
more memory within themselves, and that is one of the observations that
Mel made in defense of accounting private faults.

Also, processes that share data within themselves are probably far more
common than processes that share data with other processes.
Shouldn't we be optimizing for the majority case first?

With my suggested approach, it would be a problem if two processes share
data and are so big that they cannot fit on the same node.

I think numa faults should be part of scheduling and should solve these
cases, but it might/should kick in later. Do you agree that the case
where tasks share data within themselves is the more important problem
to solve now? (I too had the code for numa faults, but I thought we
needed to get this in first, so I moved it out. And I am happy that Mel
is taking care of that approach.)

> If you want to go without faults; you also won't migrate memory along
> and if you just happen to place your workload elsewhere you've no idea

Why? The memory moves to the workload because of numa faults; I am not
disabling numa faults. So if all or a majority of the tasks move to that
node, the memory obviously should follow to that node, and I am seeing
that happen. Do you see a reason why it wouldn't move?

> where your memory is. If you have the faults, you might as well account
> them to get a notion of where the memory is at; it's nearly free at that
> point anyway.
> 

And I am not against numa fault based scheduling; for now I think the
primary step should be grouping tasks based on mm and then on
fault stats.

> Load spikes/fluctuations can easily lead to transient task movement to
> keep balance. If these movements are indeed transient you want to return
> to where you came from; however if they are not.. you want the memory to
> come to you.
> 

Yes, this should be achieved because during a load spike not all the load
runs on that node and not all tasks from this mm get moved out of the
node; hence the node weights should still be in similar proportions. In
fact, with the checks and iterations in can_migrate_task(), it is most
likely that the tasks that have a numa weightage get a preference to
stay on their node.

> > +static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
> > +{
> > +	struct rq *rq = rq_of(cfs_rq);
> > +	unsigned long task_load = 0;
> > +	int curnode = cpu_to_node(cpu_of(rq));
> > +#ifdef CONFIG_SCHED_AUTOGROUP
> > +	struct sched_entity *se;
> > +
> > +	se = cfs_rq->tg->se[cpu_of(rq)];
> > +	if (!se)
> > +		return;
> > +
> > +	if (cfs_rq->load.weight) {
> > +		task_load =  p->se.load.weight * se->load.weight;
> > +		task_load /= cfs_rq->load.weight;
> > +	} else {
> > +		task_load = 0;
> > +	}
> > +#else
> > +	task_load = p->se.load.weight;
> > +#endif
> 
> This looks broken; didn't you want to use task_h_load() here? There's
> nothing autogroup specific about task_load. If anything you want to do
> full cgroup which I think reduces to task_h_load() here.
> 

Yes, I realize.
I actually tried task_h_load(). In the autogroup case the load on the cpu
was showing 83, while task_h_load() returned 1024; the cgroup load was
2048 and the cgroup's se load was 12.

So I concluded that the cgroup's load contribution to the total load is
12 out of 83 and the proportion of this se was 6, i.e. 1024 * 12 / 2048
= 6. Hence the equation. I will retry.

There are probably half-a-dozen such crap in my code which I
still need to fix. Thanks for pointing this one out.

One other easy to locate issue is some sort of missing synchronization
in migrate_from_cpu/migrate_from_node.

> > +	p->task_load = 0;
> > +	if (!task_load)
> > +		return;
> > +
> > +	if (p->mm && p->mm->numa_weights) {
> > +		p->mm->numa_weights[curnode] += task_load;
> > +		p->mm->numa_weights[nr_node_ids] += task_load;
> > +	}
> > +
> > +	if (p->nr_cpus_allowed != num_online_cpus())
> > +		rq->pinned_load += task_load;
> > +	p->task_load = task_load;
> > +}
> > +
> 
> > @@ -5529,6 +5769,76 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
> >  		if (!balance)
> >  			break;
> >  	}
> > +#ifdef CONFIG_NUMA_BALANCING
> > +	if (!rq->nr_running) {
> 
> This would only work for under utilized systems...
> 

Why? Even on 2x or 4x loaded machines, I see rebalance_domain being
called with NEWLY_IDLE and failing to do any balancing. I made this
observation based on schedstats. So unless we see 0% idle time, this
code should kick in. Right?

Further, if the machine is loaded, the checks introduced by
preferred_node/force_migrate will be more than useful for moving tasks.
We would ideally need active balance on lightly loaded machines, because
there the tasks that we want to move are more likely to be active on the
cpus and hence the regular scheduler cannot do the right thing.

> > +	}
> > +#endif
> 

And finally, thanks for taking a look.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2013-07-05 12:49 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
2013-07-03 14:21 ` [PATCH 01/13] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
2013-07-03 14:21 ` [PATCH 02/13] sched: Track NUMA hinting faults on per-node basis Mel Gorman
2013-07-03 14:21 ` [PATCH 03/13] sched: Select a preferred node with the most numa hinting faults Mel Gorman
2013-07-03 14:21 ` [PATCH 04/13] sched: Update NUMA hinting faults once per scan Mel Gorman
2013-07-03 14:21 ` [PATCH 05/13] sched: Favour moving tasks towards the preferred node Mel Gorman
2013-07-03 14:21 ` [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
2013-07-04 12:26   ` Srikar Dronamraju
2013-07-04 13:29     ` Mel Gorman
2013-07-03 14:21 ` [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter Mel Gorman
2013-07-03 21:56   ` Johannes Weiner
2013-07-04  9:23     ` Mel Gorman
2013-07-04 14:24       ` Rik van Riel
2013-07-04 19:36       ` Johannes Weiner
2013-07-05  9:41         ` Mel Gorman
2013-07-05 10:48         ` Peter Zijlstra
2013-07-03 14:21 ` [PATCH 08/13] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
2013-07-03 14:21 ` [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults Mel Gorman
2013-07-03 18:27   ` Peter Zijlstra
2013-07-04  9:25     ` Mel Gorman
2013-07-03 14:21 ` [PATCH 10/13] sched: Set the scan rate proportional to the size of the task being scanned Mel Gorman
2013-07-03 14:21 ` [PATCH 11/13] sched: Check current->mm before allocating NUMA faults Mel Gorman
2013-07-03 15:33   ` Mel Gorman
2013-07-04 12:48   ` Srikar Dronamraju
2013-07-05 10:07     ` Mel Gorman
2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
2013-07-03 18:35   ` Peter Zijlstra
2013-07-04  9:27     ` Mel Gorman
2013-07-03 18:41   ` Peter Zijlstra
2013-07-04  9:32     ` Mel Gorman
2013-07-03 18:42   ` Peter Zijlstra
2013-07-03 14:21 ` [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
2013-07-03 18:32   ` Peter Zijlstra
2013-07-04  9:37     ` Mel Gorman
2013-07-04 13:07       ` Srikar Dronamraju
2013-07-04 13:54         ` Mel Gorman
2013-07-04 14:06           ` Peter Zijlstra
2013-07-04 14:40             ` Mel Gorman
2013-07-03 16:19 ` [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
2013-07-03 16:26   ` Mel Gorman
2013-07-04 18:02 ` [PATCH RFC WIP] Process weights based scheduling for better consolidation Srikar Dronamraju
2013-07-05 10:16   ` Peter Zijlstra
2013-07-05 12:49     ` Srikar Dronamraju
