* [PATCH 0/8] Basic scheduler support for automatic NUMA balancing
@ 2013-06-26 14:37 ` Mel Gorman
  0 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

It's several months overdue and everything was quiet after 3.8 came out
but I recently had a chance to revisit automatic NUMA balancing for a few
days. I looked at basic scheduler integration resulting in the following
small series. Much of the following is heavily based on the numacore series
which itself takes parts of the autonuma series from back in November. In
particular it borrows heavily from Peter Zijlstra's work in "sched, numa,
mm: Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).

This is still far from complete and there are known performance gaps between
this and manual binding where that is possible and, depending on the workload,
between it and interleaving when hard bindings are not an option. As before,
the intention is not to complete the work but to incrementally improve
mainline and preserve bisectability for any bug reports that crop up. This
will allow us to validate each step and keep reviewer stress to a minimum.

Patch 1 adds sysctl documentation

Patch 2 tracks NUMA hinting faults per-task and per-node

Patches 3-5 select a preferred node at the end of a PTE scan based on which
	node incurred the highest number of NUMA faults. When the balancer
	is comparing two CPUs it will prefer to locate tasks on their
	preferred node (a rough sketch of this selection follows the patch
	summaries below).

Patch 6 reschedules a task when a preferred node is selected if it is not
	running on that node already. This avoids waiting for the scheduler
	to move the task slowly.

Patch 7 splits the accounting of faults between those that passed the
	two-stage filter and those that did not. Task placement favours
	the filtered faults initially although ultimately this will need
	more smarts when node-local faults do not dominate.

Patch 8 replaces the PTE scanning reset hammer and instead increases the
	scanning rate when an otherwise settled task changes its
	preferred node.
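
As a rough illustration of where patches 2-5 are heading, the sketch below
picks a preferred node as the node that incurred the most NUMA hinting
faults. This is simplified, illustrative code rather than the code in the
patches; the helper name, the plain array parameter and the explicit node
count are assumptions made here for clarity. The series itself accumulates
the counts in p->numa_faults[] (see patch 2).

	/*
	 * Illustrative only: choose the node with the highest per-node
	 * NUMA hinting fault count as the task's preferred node.
	 */
	static int pick_preferred_node(const unsigned long *numa_faults,
				       int nr_nodes)
	{
		unsigned long max_faults = 0;
		int nid, preferred = -1;

		for (nid = 0; nid < nr_nodes; nid++) {
			if (numa_faults[nid] > max_faults) {
				max_faults = numa_faults[nid];
				preferred = nid;
			}
		}

		/* -1 means no faults have been recorded yet */
		return preferred;
	}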

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system.
                        3.9.0                 3.9.0
                      vanilla       resetscan-v1r29
TPut 1      24770.00 (  0.00%)     24735.00 ( -0.14%)
TPut 2      54639.00 (  0.00%)     55727.00 (  1.99%)
TPut 3      88338.00 (  0.00%)     87322.00 ( -1.15%)
TPut 4     115379.00 (  0.00%)    115912.00 (  0.46%)
TPut 5     143165.00 (  0.00%)    142017.00 ( -0.80%)
TPut 6     170256.00 (  0.00%)    171133.00 (  0.52%)
TPut 7     194410.00 (  0.00%)    200601.00 (  3.18%)
TPut 8     225864.00 (  0.00%)    225518.00 ( -0.15%)
TPut 9     248977.00 (  0.00%)    251078.00 (  0.84%)
TPut 10    274911.00 (  0.00%)    275088.00 (  0.06%)
TPut 11    299963.00 (  0.00%)    305233.00 (  1.76%)
TPut 12    329709.00 (  0.00%)    326502.00 ( -0.97%)
TPut 13    347794.00 (  0.00%)    352284.00 (  1.29%)
TPut 14    372475.00 (  0.00%)    375917.00 (  0.92%)
TPut 15    392596.00 (  0.00%)    391675.00 ( -0.23%)
TPut 16    405273.00 (  0.00%)    418292.00 (  3.21%)
TPut 17    429656.00 (  0.00%)    438006.00 (  1.94%)
TPut 18    447152.00 (  0.00%)    458248.00 (  2.48%)
TPut 19    453475.00 (  0.00%)    482686.00 (  6.44%)
TPut 20    473828.00 (  0.00%)    494508.00 (  4.36%)
TPut 21    477896.00 (  0.00%)    516264.00 (  8.03%)
TPut 22    502557.00 (  0.00%)    521956.00 (  3.86%)
TPut 23    503415.00 (  0.00%)    545774.00 (  8.41%)
TPut 24    516095.00 (  0.00%)    555747.00 (  7.68%)
TPut 25    515441.00 (  0.00%)    562987.00 (  9.22%)
TPut 26    517906.00 (  0.00%)    562589.00 (  8.63%)
TPut 27    517312.00 (  0.00%)    551823.00 (  6.67%)
TPut 28    511740.00 (  0.00%)    548546.00 (  7.19%)
TPut 29    515789.00 (  0.00%)    552132.00 (  7.05%)
TPut 30    501366.00 (  0.00%)    556688.00 ( 11.03%)
TPut 31    509797.00 (  0.00%)    558124.00 (  9.48%)
TPut 32    514932.00 (  0.00%)    553529.00 (  7.50%)
TPut 33    502227.00 (  0.00%)    550933.00 (  9.70%)
TPut 34    509668.00 (  0.00%)    530995.00 (  4.18%)
TPut 35    500032.00 (  0.00%)    539452.00 (  7.88%)
TPut 36    483231.00 (  0.00%)    527146.00 (  9.09%)
TPut 37    493236.00 (  0.00%)    524913.00 (  6.42%)
TPut 38    483924.00 (  0.00%)    521526.00 (  7.77%)
TPut 39    467308.00 (  0.00%)    523683.00 ( 12.06%)
TPut 40    461353.00 (  0.00%)    494697.00 (  7.23%)
TPut 41    462128.00 (  0.00%)    513593.00 ( 11.14%)
TPut 42    450428.00 (  0.00%)    505080.00 ( 12.13%)
TPut 43    444065.00 (  0.00%)    491715.00 ( 10.73%)
TPut 44    455875.00 (  0.00%)    473548.00 (  3.88%)
TPut 45    413063.00 (  0.00%)    474189.00 ( 14.80%)
TPut 46    421084.00 (  0.00%)    457423.00 (  8.63%)
TPut 47    399403.00 (  0.00%)    450189.00 ( 12.72%)
TPut 48    411438.00 (  0.00%)    443868.00 (  7.88%)

A somewhat respectable performance improvement is seen for most numbers of clients.

specjbb Peaks
                                       3.9.0                      3.9.0
                                     vanilla            resetscan-v1r29
 Expctd Warehouse                   48.00 (  0.00%)                   48.00 (  0.00%)
 Expctd Peak Bops               399403.00 (  0.00%)               450189.00 ( 12.72%)
 Actual Warehouse                   27.00 (  0.00%)                   26.00 ( -3.70%)
 Actual Peak Bops               517906.00 (  0.00%)               562987.00 (  8.70%)
 SpecJBB Bops                     8397.00 (  0.00%)                 9059.00 (  7.88%)
 SpecJBB Bops/JVM                 8397.00 (  0.00%)                 9059.00 (  7.88%)

The specjbb score and peak bops are improved. The actual peak warehouse
is lower, which is unfortunate.

               3.9.0       3.9.0
             vanilla resetscan-v1r29
User        44532.91    44541.85
System        145.18      133.87
Elapsed      1667.08     1666.65

System CPU usage is slightly lower, so we get higher performance for lower overhead.

                                 3.9.0       3.9.0
                               vanilla resetscan-v1r29
Minor Faults                   1951410     1864310
Major Faults                       149         130
Swap Ins                             0           0
Swap Outs                            0           0
Direct pages scanned                 0           0
Kswapd pages scanned                 0           0
Kswapd pages reclaimed               0           0
Direct pages reclaimed               0           0
Kswapd efficiency                 100%        100%
Kswapd velocity                  0.000       0.000
Direct efficiency                 100%        100%
Direct velocity                  0.000       0.000
Percentage direct scans             0%          0%
Zone normal velocity             0.000       0.000
Zone dma32 velocity              0.000       0.000
Zone dma velocity                0.000       0.000
Page writes by reclaim           0.000       0.000
Page writes file                     0           0
Page writes anon                     0           0
Page reclaim immediate               0           0
Sector Reads                     61964       37260
Sector Writes                    23408       17708
Page rescued immediate               0           0
Slabs scanned                        0           0
Direct inode steals                  0           0
Kswapd inode steals                  0           0
Kswapd skipped wait                  0           0
THP fault alloc                  42876       40951
THP collapse alloc                  61          66
THP splits                          58          52
THP fault fallback                   0           0
THP collapse fail                    0           0
Compaction stalls                    0           0
Compaction success                   0           0
Compaction failures                  0           0
Page migrate success          14446025    13710610
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                  14994       14231
NUMA PTE updates             112474717   106764423
NUMA hint faults                692716      543202
NUMA hint local faults          272512      154250
NUMA pages migrated           14446025    13710610
AutoNUMA cost                     4525        3723

Note that there are marginally fewer PTE updates, NUMA hinting faults and
pages migrated, again showing that we get the higher performance for lower overhead.

I also ran SpecJBB with THP enabled and one JVM running per
NUMA node in the system. It's a lot of data unfortunately.

                          3.9.0                 3.9.0
                        vanilla       resetscan-v1r29
Mean   1      30420.25 (  0.00%)     30813.00 (  1.29%)
Mean   2      61628.50 (  0.00%)     62773.00 (  1.86%)
Mean   3      89830.25 (  0.00%)     90780.00 (  1.06%)
Mean   4     115535.00 (  0.00%)    115962.50 (  0.37%)
Mean   5     138453.75 (  0.00%)    137142.00 ( -0.95%)
Mean   6     157207.75 (  0.00%)    154942.50 ( -1.44%)
Mean   7     159087.50 (  0.00%)    158301.75 ( -0.49%)
Mean   8     158453.00 (  0.00%)    157125.00 ( -0.84%)
Mean   9     156613.75 (  0.00%)    151507.50 ( -3.26%)
Mean   10    151129.75 (  0.00%)    146982.25 ( -2.74%)
Mean   11    141945.00 (  0.00%)    136831.50 ( -3.60%)
Mean   12    136653.75 (  0.00%)    132907.50 ( -2.74%)
Mean   13    135432.00 (  0.00%)    130598.50 ( -3.57%)
Mean   14    132629.00 (  0.00%)    130460.50 ( -1.64%)
Mean   15    127698.00 (  0.00%)    132509.25 (  3.77%)
Mean   16    128686.75 (  0.00%)    130936.25 (  1.75%)
Mean   17    123666.50 (  0.00%)    125579.75 (  1.55%)
Mean   18    121543.75 (  0.00%)    122923.50 (  1.14%)
Mean   19    118704.75 (  0.00%)    127232.00 (  7.18%)
Mean   20    117251.50 (  0.00%)    124994.75 (  6.60%)
Mean   21    114060.25 (  0.00%)    123165.50 (  7.98%)
Mean   22    108594.00 (  0.00%)    116716.00 (  7.48%)
Mean   23    108471.25 (  0.00%)    115118.25 (  6.13%)
Mean   24    110019.25 (  0.00%)    114149.75 (  3.75%)
Mean   25    109250.50 (  0.00%)    112506.75 (  2.98%)
Mean   26    107827.75 (  0.00%)    112699.50 (  4.52%)
Mean   27    104496.25 (  0.00%)    114260.00 (  9.34%)
Mean   28    104117.75 (  0.00%)    114140.75 (  9.63%)
Mean   29    103018.75 (  0.00%)    109829.50 (  6.61%)
Mean   30    104718.00 (  0.00%)    108194.25 (  3.32%)
Mean   31    101520.50 (  0.00%)    108311.25 (  6.69%)
Mean   32     97662.75 (  0.00%)    105314.75 (  7.84%)
Mean   33    101508.50 (  0.00%)    106076.25 (  4.50%)
Mean   34     98576.50 (  0.00%)    111020.50 ( 12.62%)
Mean   35    105180.75 (  0.00%)    108971.25 (  3.60%)
Mean   36    101517.00 (  0.00%)    108781.25 (  7.16%)
Mean   37    100664.00 (  0.00%)    109634.50 (  8.91%)
Mean   38    101012.25 (  0.00%)    110988.25 (  9.88%)
Mean   39    101967.00 (  0.00%)    105927.75 (  3.88%)
Mean   40     97732.50 (  0.00%)    110570.00 ( 13.14%)
Mean   41    103773.25 (  0.00%)    111583.00 (  7.53%)
Mean   42    105105.00 (  0.00%)    110321.00 (  4.96%)
Mean   43    102351.50 (  0.00%)    107145.75 (  4.68%)
Mean   44    105980.00 (  0.00%)    107938.50 (  1.85%)
Mean   45    111055.00 (  0.00%)    111159.25 (  0.09%)
Mean   46    112757.25 (  0.00%)    114807.00 (  1.82%)
Mean   47     93706.75 (  0.00%)    113681.25 ( 21.32%)
Mean   48    106624.00 (  0.00%)    117423.75 ( 10.13%)
Stddev 1       1371.00 (  0.00%)       872.33 ( 36.37%)
Stddev 2       1326.07 (  0.00%)       310.98 ( 76.55%)
Stddev 3       1160.36 (  0.00%)      1074.95 (  7.36%)
Stddev 4       1689.80 (  0.00%)      1461.05 ( 13.54%)
Stddev 5       2214.45 (  0.00%)      1089.81 ( 50.79%)
Stddev 6       1756.74 (  0.00%)      2138.00 (-21.70%)
Stddev 7       3419.70 (  0.00%)      3335.13 (  2.47%)
Stddev 8       6511.71 (  0.00%)      4716.75 ( 27.57%)
Stddev 9       5373.19 (  0.00%)      2899.89 ( 46.03%)
Stddev 10      3732.23 (  0.00%)      2558.50 ( 31.45%)
Stddev 11      4616.71 (  0.00%)      5919.34 (-28.22%)
Stddev 12      5503.15 (  0.00%)      5953.85 ( -8.19%)
Stddev 13      5202.46 (  0.00%)      7507.23 (-44.30%)
Stddev 14      3526.10 (  0.00%)      2296.23 ( 34.88%)
Stddev 15      3576.78 (  0.00%)      3450.47 (  3.53%)
Stddev 16      2786.08 (  0.00%)       950.31 ( 65.89%)
Stddev 17      3055.44 (  0.00%)      2881.78 (  5.68%)
Stddev 18      2543.08 (  0.00%)      1332.83 ( 47.59%)
Stddev 19      3936.65 (  0.00%)      1403.64 ( 64.34%)
Stddev 20      3005.94 (  0.00%)      1342.59 ( 55.34%)
Stddev 21      2657.19 (  0.00%)      2498.95 (  5.96%)
Stddev 22      2016.42 (  0.00%)      2078.84 ( -3.10%)
Stddev 23      2209.88 (  0.00%)      2939.24 (-33.00%)
Stddev 24      5325.86 (  0.00%)      2760.85 ( 48.16%)
Stddev 25      4659.26 (  0.00%)      1433.24 ( 69.24%)
Stddev 26      1169.78 (  0.00%)      1977.32 (-69.03%)
Stddev 27      2923.78 (  0.00%)      2675.50 (  8.49%)
Stddev 28      5335.85 (  0.00%)      1874.29 ( 64.87%)
Stddev 29      4381.68 (  0.00%)      3660.16 ( 16.47%)
Stddev 30      3437.44 (  0.00%)      6535.20 (-90.12%)
Stddev 31      3979.56 (  0.00%)      5032.62 (-26.46%)
Stddev 32      2614.04 (  0.00%)      5118.99 (-95.83%)
Stddev 33      5358.35 (  0.00%)      2488.64 ( 53.56%)
Stddev 34      6375.57 (  0.00%)      4105.34 ( 35.61%)
Stddev 35      8079.76 (  0.00%)      3696.10 ( 54.25%)
Stddev 36      8665.59 (  0.00%)      5155.29 ( 40.51%)
Stddev 37      8002.37 (  0.00%)      8660.12 ( -8.22%)
Stddev 38      4955.36 (  0.00%)      8615.78 (-73.87%)
Stddev 39      9940.79 (  0.00%)      9620.33 (  3.22%)
Stddev 40     12344.56 (  0.00%)     11248.42 (  8.88%)
Stddev 41     15834.32 (  0.00%)     13587.05 ( 14.19%)
Stddev 42     12006.48 (  0.00%)     10554.10 ( 12.10%)
Stddev 43      4141.73 (  0.00%)     13565.76 (-227.54%)
Stddev 44      7476.54 (  0.00%)     16442.62 (-119.92%)
Stddev 45     16048.04 (  0.00%)     17095.94 ( -6.53%)
Stddev 46     16198.20 (  0.00%)     17323.97 ( -6.95%)
Stddev 47     15743.04 (  0.00%)     17748.58 (-12.74%)
Stddev 48     12627.98 (  0.00%)     17082.27 (-35.27%)

These are the mean throughput figures across JVMs and the standard
deviation. Note that with the patches applied there is a lot less
deviation between JVMs in many cases. As the number of clients increases
the performance improves. This is still far short of the theoretical best
performance but it's a step in the right direction.

TPut   1     121681.00 (  0.00%)    123252.00 (  1.29%)
TPut   2     246514.00 (  0.00%)    251092.00 (  1.86%)
TPut   3     359321.00 (  0.00%)    363120.00 (  1.06%)
TPut   4     462140.00 (  0.00%)    463850.00 (  0.37%)
TPut   5     553815.00 (  0.00%)    548568.00 ( -0.95%)
TPut   6     628831.00 (  0.00%)    619770.00 ( -1.44%)
TPut   7     636350.00 (  0.00%)    633207.00 ( -0.49%)
TPut   8     633812.00 (  0.00%)    628500.00 ( -0.84%)
TPut   9     626455.00 (  0.00%)    606030.00 ( -3.26%)
TPut   10    604519.00 (  0.00%)    587929.00 ( -2.74%)
TPut   11    567780.00 (  0.00%)    547326.00 ( -3.60%)
TPut   12    546615.00 (  0.00%)    531630.00 ( -2.74%)
TPut   13    541728.00 (  0.00%)    522394.00 ( -3.57%)
TPut   14    530516.00 (  0.00%)    521842.00 ( -1.64%)
TPut   15    510792.00 (  0.00%)    530037.00 (  3.77%)
TPut   16    514747.00 (  0.00%)    523745.00 (  1.75%)
TPut   17    494666.00 (  0.00%)    502319.00 (  1.55%)
TPut   18    486175.00 (  0.00%)    491694.00 (  1.14%)
TPut   19    474819.00 (  0.00%)    508928.00 (  7.18%)
TPut   20    469006.00 (  0.00%)    499979.00 (  6.60%)
TPut   21    456241.00 (  0.00%)    492662.00 (  7.98%)
TPut   22    434376.00 (  0.00%)    466864.00 (  7.48%)
TPut   23    433885.00 (  0.00%)    460473.00 (  6.13%)
TPut   24    440077.00 (  0.00%)    456599.00 (  3.75%)
TPut   25    437002.00 (  0.00%)    450027.00 (  2.98%)
TPut   26    431311.00 (  0.00%)    450798.00 (  4.52%)
TPut   27    417985.00 (  0.00%)    457040.00 (  9.34%)
TPut   28    416471.00 (  0.00%)    456563.00 (  9.63%)
TPut   29    412075.00 (  0.00%)    439318.00 (  6.61%)
TPut   30    418872.00 (  0.00%)    432777.00 (  3.32%)
TPut   31    406082.00 (  0.00%)    433245.00 (  6.69%)
TPut   32    390651.00 (  0.00%)    421259.00 (  7.84%)
TPut   33    406034.00 (  0.00%)    424305.00 (  4.50%)
TPut   34    394306.00 (  0.00%)    444082.00 ( 12.62%)
TPut   35    420723.00 (  0.00%)    435885.00 (  3.60%)
TPut   36    406068.00 (  0.00%)    435125.00 (  7.16%)
TPut   37    402656.00 (  0.00%)    438538.00 (  8.91%)
TPut   38    404049.00 (  0.00%)    443953.00 (  9.88%)
TPut   39    407868.00 (  0.00%)    423711.00 (  3.88%)
TPut   40    390930.00 (  0.00%)    442280.00 ( 13.14%)
TPut   41    415093.00 (  0.00%)    446332.00 (  7.53%)
TPut   42    420420.00 (  0.00%)    441284.00 (  4.96%)
TPut   43    409406.00 (  0.00%)    428583.00 (  4.68%)
TPut   44    423920.00 (  0.00%)    431754.00 (  1.85%)
TPut   45    444220.00 (  0.00%)    444637.00 (  0.09%)
TPut   46    451029.00 (  0.00%)    459228.00 (  1.82%)
TPut   47    374827.00 (  0.00%)    454725.00 ( 21.32%)
TPut   48    426496.00 (  0.00%)    469695.00 ( 10.13%)

Similarly, overall throughput is improved for larger numbers of clients.

specjbb Peaks
                                       3.9.0                      3.9.0
                                     vanilla            resetscan-v1r29
 Expctd Warehouse                   12.00 (  0.00%)                   12.00 (  0.00%)
 Expctd Peak Bops               567780.00 (  0.00%)               547326.00 ( -3.60%)
 Actual Warehouse                    8.00 (  0.00%)                    8.00 (  0.00%)
 Actual Peak Bops               636350.00 (  0.00%)               633207.00 ( -0.49%)
 SpecJBB Bops                   487204.00 (  0.00%)               500705.00 (  2.77%)
 SpecJBB Bops/JVM               121801.00 (  0.00%)               125176.00 (  2.77%)

Peak performance is not great but the specjbb score is slightly improved.


               3.9.0       3.9.0
             vanilla resetscan-v1r29
User       479120.95   479525.04
System       1395.40     1124.93
Elapsed     10363.40    10376.34

System CPU time is reduced by quite a lot, so automatic NUMA balancing now has less overhead.

                                 3.9.0       3.9.0
                               vanilla resetscan-v1r29
Minor Faults                  15711256    14962529
Major Faults                       132         151
Swap Ins                             0           0
Swap Outs                            0           0
Direct pages scanned                 0           0
Kswapd pages scanned                 0           0
Kswapd pages reclaimed               0           0
Direct pages reclaimed               0           0
Kswapd efficiency                 100%        100%
Kswapd velocity                  0.000       0.000
Direct efficiency                 100%        100%
Direct velocity                  0.000       0.000
Percentage direct scans             0%          0%
Zone normal velocity             0.000       0.000
Zone dma32 velocity              0.000       0.000
Zone dma velocity                0.000       0.000
Page writes by reclaim           0.000       0.000
Page writes file                     0           0
Page writes anon                     0           0
Page reclaim immediate               0           0
Sector Reads                     32700       67420
Sector Writes                   108660      116092
Page rescued immediate               0           0
Slabs scanned                        0           0
Direct inode steals                  0           0
Kswapd inode steals                  0           0
Kswapd skipped wait                  0           0
THP fault alloc                  77041       76063
THP collapse alloc                 194         208
THP splits                         430         428
THP fault fallback                   0           0
THP collapse fail                    0           0
Compaction stalls                    0           0
Compaction success                   0           0
Compaction failures                  0           0
Page migrate success         134743458   102408111
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                 139863      106299
NUMA PTE updates            1167722150   961427213
NUMA hint faults               9915871     8411075
NUMA hint local faults         3660769     3212050
NUMA pages migrated          134743458   102408111
AutoNUMA cost                    60313       50731

Note that there are roughly 18% fewer PTE updates, reflecting the changes in
the scan rates. Similarly, there are fewer hinting faults incurred and fewer
pages migrated.

Overall the performance has improved slightly, but in general there is
less system overhead when delivering that performance, so it is at least
a step in the right direction, albeit far short of what it ultimately
needs to be.


 Documentation/sysctl/kernel.txt |  67 ++++++++++++++++
 include/linux/mm_types.h        |   3 -
 include/linux/sched.h           |  21 ++++-
 include/linux/sched/sysctl.h    |   1 -
 kernel/sched/core.c             |  33 +++++++-
 kernel/sched/fair.c             | 169 +++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h            |  12 +++
 kernel/sysctl.c                 |  14 ++--
 mm/huge_memory.c                |   7 +-
 mm/memory.c                     |   9 ++-
 10 files changed, 294 insertions(+), 42 deletions(-)

-- 
1.8.1.4



* [PATCH 1/8] mm: numa: Document automatic NUMA balancing sysctls
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccd4258..0fe678c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,72 @@ utilize.
 
 ==============================================================
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans a task's address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running.  Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases.  The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases.  The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a task's memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a task's scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
 osrelease, ostype & version:
 
 # cat osrelease
-- 
1.8.1.4
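
For a feel for the scan rate arithmetic described in the documentation
above, a tiny worked example follows. The values are hypothetical and are
not defaults set anywhere in this series; the point is only that "scan
delay" and "scan size" together bound how much of the address space can
be unmapped per second of task runtime.

	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical tunables, not the series' defaults */
		unsigned int scan_period_ms = 1000;	/* "scan delay" */
		unsigned int scan_size_mb = 256;	/* "scan size"  */

		/*
		 * At most scan_size_mb is unmapped every scan_period_ms,
		 * so the worst-case scan rate is bounded as follows.
		 */
		unsigned int mb_per_sec = scan_size_mb * 1000 / scan_period_ms;

		printf("worst-case scan rate: %u MB/sec of runtime\n",
		       mb_per_sec);
		return 0;
	}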



* [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch tracks which nodes NUMA hinting faults were incurred on.  Greater
weight is given if the pages had to be migrated, on the understanding
that such faults cost significantly more. If a task has paid the cost of
migrating data to a node then in future it is preferable that the task
does not migrate that data again unnecessarily. This information is later
used to schedule a task on the node incurring the most NUMA hinting faults.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 12 +++++++++++-
 kernel/sched/sched.h  | 12 ++++++++++++
 4 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..72861b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1505,6 +1505,8 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+
+	unsigned long *numa_faults;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..f332ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
+	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
@@ -1853,6 +1854,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		task_numa_free(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..904fd6f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (!sched_feat_numa(NUMA))
 		return;
 
-	/* FIXME: Allocate task-specific structure for placement policy here */
+	/* Allocate buffer to track faults on a per-node basis */
+	if (unlikely(!p->numa_faults)) {
+		int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		if (!p->numa_faults)
+			return;
+	}
 
 	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
@@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
 			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
+
+	/* Record the fault, double the weight if pages were migrated */
+	p->numa_faults[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..9c26d88 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -503,6 +503,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+#ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int node, int shared);
+static inline void task_numa_free(struct task_struct *p)
+{
+	kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
1.8.1.4
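
To make the weighting in task_numa_fault() above concrete: with the
"pages << migrated" accumulation, a hinting fault covering 512 pages whose
data also had to be migrated adds twice as much to that node's counter as
a fault on properly placed pages. The snippet below simply restates that
arithmetic outside the kernel; the helper name and values are illustrative
assumptions, not part of the patch.

	#include <stdbool.h>
	#include <stdio.h>

	/* Illustrative restatement of the accounting in task_numa_fault() */
	static unsigned long fault_weight(int pages, bool migrated)
	{
		/* Doubled when the fault also caused a migration */
		return (unsigned long)pages << migrated;
	}

	int main(void)
	{
		printf("properly placed: %lu\n", fault_weight(512, false));
		printf("migrated:        %lu\n", fault_weight(512, true));
		return 0;
	}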


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 3/8] sched: Select a preferred node with the most numa hinting faults
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch selects a preferred node for a task to run on based on where
its NUMA hinting faults were incurred. This information is later used to
migrate tasks towards that node during load balancing.
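
In effect the placement pass picks the node with the highest fault count,
halving each count as it is read so that stale history decays. A rough
userspace sketch of that loop (illustrative only, not the kernel code):

#include <stdio.h>

#define NR_NODES 4

static unsigned long numa_faults[NR_NODES] = { 10, 80, 25, 5 };

/* pick the node with the most faults; halve counts so history decays */
static int pick_preferred_node(void)
{
	unsigned long max_faults = 0;
	int nid, max_nid = -1;

	for (nid = 0; nid < NR_NODES; nid++) {
		unsigned long faults = numa_faults[nid];

		numa_faults[nid] >>= 1;
		if (faults > max_faults) {
			max_faults = faults;
			max_nid = nid;
		}
	}
	return max_nid;
}

int main(void)
{
	printf("preferred node: %d\n", pick_preferred_node());	/* node 1 */
	return 0;
}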

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 10 ++++++++++
 kernel/sched/fair.c   | 16 ++++++++++++++--
 kernel/sched/sched.h  |  2 +-
 4 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 72861b4..ba46a64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,7 @@ struct task_struct {
 	struct callback_head numa_work;
 
 	unsigned long *numa_faults;
+	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f332ec0..019baae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -5713,6 +5714,15 @@ enum s_alloc {
 
 struct sched_domain_topology_level;
 
+#ifdef CONFIG_NUMA_BALANCING
+
+/* Set a tasks preferred NUMA node */
+void sched_setnuma(struct task_struct *p, int nid)
+{
+	p->numa_preferred_nid = nid;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 904fd6f..f8c3f61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -793,7 +793,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq;
+	int seq, nid, max_nid = 0;
+	unsigned long max_faults = 0;
 
 	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
 		return;
@@ -802,7 +803,18 @@ static void task_numa_placement(struct task_struct *p)
 		return;
 	p->numa_scan_seq = seq;
 
-	/* FIXME: Scheduling placement policy hints go here */
+	/* Find the node with the highest number of faults */
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		unsigned long faults = p->numa_faults[nid];
+		p->numa_faults[nid] >>= 1;
+		if (faults > max_faults) {
+			max_faults = faults;
+			max_nid = nid;
+		}
+	}
+
+	if (max_faults && max_nid != p->numa_preferred_nid)
+		sched_setnuma(p, max_nid);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9c26d88..65a0cf0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,7 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void sched_setnuma(struct task_struct *p, int node, int shared);
+extern void sched_setnuma(struct task_struct *p, int nid);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 4/8] sched: Update NUMA hinting faults once per scan
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA hinting fault counts and placement decisions are both recorded in the
same array, which distorts the samples in an unpredictable fashion. The values
accumulate linearly during the scan and then decay, creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best, it is very difficult to state that the buffer holds
a decaying average of past faulting behaviour. At worst, it can confuse the
load balancer if one node shows an artificially high count due to very recent
faulting activity, which may cause tasks to bounce between nodes.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.
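
The end-of-window update is then an exponential decay plus a carry from
the scratch buffer; roughly the following, in userspace terms with
illustrative names:

#include <stdio.h>

#define NR_NODES 4

static unsigned long numa_faults[NR_NODES]        = { 40, 8, 0, 0 };
static unsigned long numa_faults_buffer[NR_NODES] = {  4, 32, 2, 0 };

/* fold one completed scan window into the long-term counts */
static void end_scan_window(void)
{
	for (int nid = 0; nid < NR_NODES; nid++) {
		numa_faults[nid] >>= 1;			/* decay history */
		numa_faults[nid] += numa_faults_buffer[nid];
		numa_faults_buffer[nid] = 0;		/* reset for next window */
	}
}

int main(void)
{
	end_scan_window();
	for (int nid = 0; nid < NR_NODES; nid++)
		printf("node %d: %lu\n", nid, numa_faults[nid]);
	return 0;
}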

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h | 13 +++++++++++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 16 +++++++++++++---
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba46a64..42f9818 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1506,7 +1506,20 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	/*
+	 * Exponential decaying average of faults on a per-node basis.
+	 * Scheduling placement decisions are made based on these counts.
+	 * The values remain static for the duration of a PTE scan
+	 */
 	unsigned long *numa_faults;
+
+	/*
+	 * numa_faults_buffer records faults per node during the current
+	 * scan window. When the scan completes, the counts in numa_faults
+	 * decay and these values are copied.
+	 */
+	unsigned long *numa_faults_buffer;
+
 	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 019baae..b00b81a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1596,6 +1596,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
+	p->numa_faults_buffer = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8c3f61..5893399 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -805,8 +805,14 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
-		unsigned long faults = p->numa_faults[nid];
+		unsigned long faults;
+
+		/* Decay existing window and copy faults since last scan */
 		p->numa_faults[nid] >>= 1;
+		p->numa_faults[nid] += p->numa_faults_buffer[nid];
+		p->numa_faults_buffer[nid] = 0;
+
+		faults = p->numa_faults[nid];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -831,9 +837,13 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * nr_node_ids;
 
-		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		/* numa_faults and numa_faults_buffer share the allocation */
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
+
+		BUG_ON(p->numa_faults_buffer);
+		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
 	}
 
 	/*
@@ -847,7 +857,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults[node] += pages << migrated;
+	p->numa_faults_buffer[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch favours moving tasks towards the preferred NUMA node when
it has just been selected. Ideally this is self-reinforcing as the
longer the task runs on that node, the more faults it should incur,
causing task_numa_placement to keep the task running on that node. In
reality a big weakness is that the node's CPUs can be overloaded and it
would be more efficient to queue tasks on an idle node and migrate to
the new node. This would require additional smarts in the balancer so
for now the balancer simply prefers to place the task on the
preferred node for a tunable number of PTE scans.
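
The "settle" rule the balancer applies can be pictured with a small
standalone sketch; the names below are illustrative and the tunable
simply mirrors the sysctl default added by this patch:

#include <stdbool.h>
#include <stdio.h>

static unsigned int settle_count = 3;	/* illustrative default */

struct task_info {
	int preferred_nid;	/* -1 until a node is selected */
	int migrate_seq;	/* scan periods since the node was selected */
};

/*
 * A move from src_nid to dst_nid is treated as improving locality only
 * while the task is still settling on a freshly selected preferred node
 * and the destination is that node.
 */
static bool move_improves_locality(struct task_info *t, int src_nid, int dst_nid)
{
	if (t->preferred_nid < 0 || src_nid == dst_nid)
		return false;
	return t->migrate_seq < settle_count && t->preferred_nid == dst_nid;
}

int main(void)
{
	struct task_info t = { .preferred_nid = 1, .migrate_seq = 0 };

	printf("%d\n", move_improves_locality(&t, 0, 1));	/* 1: pull towards node 1 */
	t.migrate_seq = 5;
	printf("%d\n", move_improves_locality(&t, 0, 1));	/* 0: already settled */
	return 0;
}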

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt |  8 +++++++-
 include/linux/sched.h           |  1 +
 kernel/sched/core.c             |  4 +++-
 kernel/sched/fair.c             | 40 ++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c                 |  7 +++++++
 5 files changed, 56 insertions(+), 4 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 0fe678c..246b128 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,7 +374,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
 
 ==============================================================
 
@@ -418,6 +419,11 @@ scanned for a given scan.
 numa_balancing_scan_period_reset is a blunt instrument that controls how
 often a tasks scan delay is reset to detect sudden changes in task behaviour.
 
+numa_balancing_settle_count is how many scan periods must complete before
+the load balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42f9818..82a6136 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b00b81a..ba9470e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1591,7 +1591,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_migrate_seq = 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
@@ -5721,6 +5721,7 @@ struct sched_domain_topology_level;
 void sched_setnuma(struct task_struct *p, int nid)
 {
 	p->numa_preferred_nid = nid;
+	p->numa_migrate_seq = 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
@@ -6150,6 +6151,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5893399..5e7f728 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -791,6 +791,15 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the node's CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -802,6 +811,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_migrate_seq++;
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
@@ -3897,6 +3907,28 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid)
+		return false;
+
+	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
+	    p->numa_preferred_nid == dst_nid)
+		return true;
+
+	return false;
+}
+
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) task is cache cold, or
-	 * 2) too many balance attempts have failed.
+	 * 1) destination numa is preferred
+	 * 2) task is cache cold, or
+	 * 3) too many balance attempts have failed.
 	 */
 
+	if (migrate_improves_locality(p, env))
+		return 1;
+
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..263486f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -393,6 +393,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_settle_count",
+		.data           = &sysctl_numa_balancing_settle_count,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

A preferred node is selected based on the node that incurred the most
NUMA hinting faults. There is no guarantee that the task is running
on that node at the time, so this patch reschedules the task to run on
the idlest CPU of the preferred node as soon as it is selected. This avoids
waiting for the balancer to make a decision.
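
The CPU selection is a least-loaded search across the node's CPUs that
skips any CPU whose running task already prefers that node. A toy
userspace version of the idea (illustrative data and names, not the
scheduler's structures):

#include <stdio.h>

#define NR_CPUS 8

/* toy per-CPU state standing in for weighted_cpuload() and rq->curr */
struct cpu_info {
	int node;
	unsigned long load;
	int curr_preferred_nid;
};

static struct cpu_info cpus[NR_CPUS] = {
	{ 0, 900, 0 }, { 0, 300, 0 }, { 0, 150, 1 }, { 0, 100, 0 },
	{ 1, 500, 1 }, { 1, 200, 1 }, { 1, 250, 0 }, { 1, 700, 1 },
};

/* least-loaded CPU on @nid whose running task does not already prefer @nid */
static int find_idlest_cpu_on_node(int this_cpu, int nid)
{
	unsigned long min_load = (unsigned long)-1;
	int cpu, idlest = this_cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpus[cpu].node != nid)
			continue;
		if (cpus[cpu].load < min_load &&
		    cpus[cpu].curr_preferred_nid != nid) {
			min_load = cpus[cpu].load;
			idlest = cpu;
		}
	}
	return idlest;
}

int main(void)
{
	printf("idlest CPU on node 1: %d\n", find_idlest_cpu_on_node(0, 1));	/* 6 */
	return 0;
}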

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  | 18 +++++++++++++++--
 kernel/sched/fair.c  | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  2 +-
 3 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ba9470e..b4722d6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5717,11 +5717,25 @@ struct sched_domain_topology_level;
 
 #ifdef CONFIG_NUMA_BALANCING
 
-/* Set a tasks preferred NUMA node */
-void sched_setnuma(struct task_struct *p, int nid)
+/* Set a tasks preferred NUMA node and reschedule to it */
+void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
 {
+	int curr_cpu = task_cpu(p);
+	struct migration_arg arg = { p, idlest_cpu };
+
 	p->numa_preferred_nid = nid;
 	p->numa_migrate_seq = 0;
+
+	/* Do not reschedule if already running on the target CPU */
+	if (idlest_cpu == curr_cpu)
+		return;
+
+	/* Ensure the target CPU is eligible */
+	if (!cpumask_test_cpu(idlest_cpu, tsk_cpus_allowed(p)))
+		return;
+
+	/* Move current running task to idlest CPU on preferred node */
+	stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5e7f728..99951a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -800,6 +800,39 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static unsigned long weighted_cpuload(const int cpu);
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int i, idlest_cpu = this_cpu;
+
+	BUG_ON(cpu_to_node(this_cpu) == nid);
+
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		load = weighted_cpuload(i);
+
+		if (load < min_load) {
+			struct task_struct *p;
+
+			/* Do not preempt a task running on its preferred node */
+			struct rq *rq = cpu_rq(i);
+			local_irq_disable();
+			raw_spin_lock(&rq->lock);
+			p = rq->curr;
+			if (p->numa_preferred_nid != nid) {
+				min_load = load;
+				idlest_cpu = i;
+			}
+			raw_spin_unlock(&rq->lock);
+			local_irq_enable();
+		}
+	}
+
+	return idlest_cpu;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -829,8 +862,26 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	if (max_faults && max_nid != p->numa_preferred_nid)
-		sched_setnuma(p, max_nid);
+	/*
+	 * Record the preferred node as the node with the most faults,
+	 * requeue the task to be running on the idlest CPU on the
+	 * preferred node and reset the scanning rate to recheck
+	 * the working set placement.
+	 */
+	if (max_faults && max_nid != p->numa_preferred_nid) {
+		int preferred_cpu;
+
+		/*
+		 * If the task is not on the preferred node then find the most
+		 * idle CPU to migrate to.
+		 */
+		preferred_cpu = task_cpu(p);
+		if (cpu_to_node(preferred_cpu) != max_nid)
+			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+							     max_nid);
+
+		sched_setnuma(p, max_nid, preferred_cpu);
+	}
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 65a0cf0..64c37a3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,7 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void sched_setnuma(struct task_struct *p, int nid);
+extern void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared. Doing so would require
recording the last task that accessed each page on a hinting fault, which
would increase the size of struct page. Instead, this patch approximates:
faults that pass the two-stage filter are treated as private and all others
as shared. The preferred NUMA node is then selected based on where the
maximum number of approximately private faults was measured.
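
The bookkeeping change is an interleaved layout of the per-node counters,
indexed by 2 * nid + priv. A small userspace sketch of the index scheme
(illustrative names; the private test here is a simplification comparing
the faulting task's node with the node recorded as last accessing the
page):

#include <stdio.h>

#define NR_NODES 4

/*
 * per-task fault stats laid out as [node0-shared, node0-private,
 * node1-shared, node1-private, ...]
 */
static unsigned long faults[2 * NR_NODES];

static int task_faults_idx(int nid, int priv)
{
	return 2 * nid + priv;
}

static void record_fault(int this_nid, int last_nid, int page_nid,
			 int pages, int migrated)
{
	/* approximately "private": the faulting task runs on the same
	 * node that was last recorded as accessing the page */
	int priv = (this_nid == last_nid);

	faults[task_faults_idx(page_nid, priv)] += pages << migrated;
}

int main(void)
{
	record_fault(1, 1, 1, 8, 0);	/* private fault on node 1 */
	record_fault(1, 0, 1, 8, 0);	/* shared fault on node 1 */

	printf("node 1: private %lu, shared %lu\n",
	       faults[task_faults_idx(1, 1)], faults[task_faults_idx(1, 0)]);
	return 0;
}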

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  4 ++--
 kernel/sched/fair.c   | 32 ++++++++++++++++++++++----------
 mm/huge_memory.c      |  7 ++++---
 mm/memory.c           |  9 ++++++---
 4 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 82a6136..a41edea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1600,10 +1600,10 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages, bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99951a8..490e601 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -833,6 +833,11 @@ find_idlest_cpu_node(int this_cpu, int nid)
 	return idlest_cpu;
 }
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -849,13 +854,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
+
+			/* Decay existing window and copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
 
-		faults = p->numa_faults[nid];
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -887,24 +898,25 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
 
 	if (!sched_feat_numa(NUMA))
 		return;
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
-		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
+		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -918,7 +930,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults_buffer[node] += pages << migrated;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..7cd7114 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int target_nid;
+	int target_nid, last_nid;
 	int current_nid = -1;
 	bool migrated;
 
@@ -1307,6 +1307,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
+	last_nid = page_nid_last(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		put_page(page);
@@ -1332,7 +1333,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!migrated)
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
 	return 0;
 
 check_same:
@@ -1347,7 +1348,7 @@ clear_pmdnuma:
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+		task_numa_fault(last_nid, current_nid, HPAGE_PMD_NR, false);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index ba94dec..c28bf52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int current_nid = -1, last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3566,6 +3566,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
+	last_nid = page_nid_last(page);
 	current_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3586,7 +3587,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+		task_numa_fault(last_nid, current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3602,6 +3603,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3654,6 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * migrated to.
 		 */
 		curr_nid = local_nid;
+		last_nid = page_nid_last(page);
 		target_nid = numa_migrate_prep(page, vma, addr,
 					       page_to_nid(page));
 		if (target_nid == -1) {
@@ -3666,7 +3669,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		migrated = migrate_misplaced_page(page, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		task_numa_fault(last_nid, curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
@ 2013-06-26 14:38   ` Mel Gorman
  0 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared. This would require
that the last task that accessed a page for a hinting fault would be
recorded which would increase the size of struct page. Instead this patch
approximates private pages by assuming that faults that pass the two-stage
filter are private pages and all others are shared. The preferred NUMA
node is then selected based on where the maximum number of approximately
private faults were measured.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  4 ++--
 kernel/sched/fair.c   | 32 ++++++++++++++++++++++----------
 mm/huge_memory.c      |  7 ++++---
 mm/memory.c           |  9 ++++++---
 4 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 82a6136..a41edea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1600,10 +1600,10 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages, bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99951a8..490e601 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -833,6 +833,11 @@ find_idlest_cpu_node(int this_cpu, int nid)
 	return idlest_cpu;
 }
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -849,13 +854,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
+
+			/* Decay existing window and copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
 
-		faults = p->numa_faults[nid];
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -887,24 +898,25 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
 
 	if (!sched_feat_numa(NUMA))
 		return;
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
-		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
+		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -918,7 +930,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults_buffer[node] += pages << migrated;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..7cd7114 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int target_nid;
+	int target_nid, last_nid;
 	int current_nid = -1;
 	bool migrated;
 
@@ -1307,6 +1307,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
+	last_nid = page_nid_last(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		put_page(page);
@@ -1332,7 +1333,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!migrated)
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
 	return 0;
 
 check_same:
@@ -1347,7 +1348,7 @@ clear_pmdnuma:
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+		task_numa_fault(last_nid, current_nid, HPAGE_PMD_NR, false);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index ba94dec..c28bf52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int current_nid = -1, last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3566,6 +3566,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
+	last_nid = page_nid_last(page);
 	current_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3586,7 +3587,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+		task_numa_fault(last_nid, current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3602,6 +3603,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3654,6 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * migrated to.
 		 */
 		curr_nid = local_nid;
+		last_nid = page_nid_last(page);
 		target_nid = numa_migrate_prep(page, vma, addr,
 					       page_to_nid(page));
 		if (target_nid == -1) {
@@ -3666,7 +3669,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		migrated = migrate_misplaced_page(page, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		task_numa_fault(last_nid, curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 8/8] sched: Increase NUMA PTE scanning when a new preferred node is selected
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

The NUMA PTE scan is currently reset every sysctl_numa_balancing_scan_period_reset
milliseconds to catch phase changes. This is a crude hammer and it is clearly
visible in graphs when the PTE scanner resets even though the workload is
already balanced. This patch removes the periodic reset and instead increases
the scan rate when the preferred node is updated while the task is running on
that node, so the placement decision is rechecked sooner. In the optimistic
expectation that placement decisions will be correct, the maximum period
between scans is also increased to reduce the overhead of automatic NUMA
balancing.
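
As a rough stand-alone illustration (not part of the patch; the 800ms starting
period is an assumed value, the 100ms floor mirrors
numa_balancing_scan_period_min_ms, and the 100*50 -> 100*600 cap change is
taken from the diff below):

#include <stdio.h>

/* Hypothetical user-space sketch of the clamped halving applied to
 * numa_scan_period when a settled task selects a new preferred node. */
static unsigned int speed_up_scan(unsigned int period_ms)
{
	const unsigned int min_ms = 100;

	period_ms >>= 1;
	return period_ms < min_ms ? min_ms : period_ms;
}

int main(void)
{
	unsigned int period = 800;	/* assumed current scan period in ms */
	int i;

	for (i = 0; i < 4; i++) {
		period = speed_up_scan(period);
		printf("scan period now %u ms\n", period); /* 400, 200, 100, 100 */
	}
	return 0;
}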

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 +++--------
 include/linux/mm_types.h        |  3 ---
 include/linux/sched/sysctl.h    |  1 -
 kernel/sched/core.c             |  1 -
 kernel/sched/fair.c             | 26 +++++++++++---------------
 kernel/sysctl.c                 |  7 -------
 6 files changed, 14 insertions(+), 35 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 246b128..a275042 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -373,15 +373,13 @@ guarantee. If the target workload is already bound to NUMA nodes then this
 feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
-numa_balancing_settle_count sysctls.
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
 
 ==============================================================
 
 numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_size_mb
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
 
 Automatic NUMA balancing scans tasks address space and unmaps pages to
 detect if pages are properly placed or if the data should be migrated to a
@@ -416,9 +414,6 @@ effectively controls the minimum scanning rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
 
-numa_balancing_scan_period_reset is a blunt instrument that controls how
-often a tasks scan delay is reset to detect sudden changes in task behaviour.
-
 numa_balancing_settle_count is how many scan periods must complete before
 the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..de70964 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -421,9 +421,6 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
-	/* numa_next_reset is when the PTE scanner period will be reset */
-	unsigned long numa_next_reset;
-
 	/* Restart point for scanning and setting pte_numa */
 	unsigned long numa_scan_offset;
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..10d16c4f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -47,7 +47,6 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
-extern unsigned int sysctl_numa_balancing_scan_period_reset;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_settle_count;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b4722d6..2d1fd93 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1585,7 +1585,6 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_next_scan = jiffies;
-		p->mm->numa_next_reset = jiffies;
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 490e601..e9bbb70 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -782,8 +782,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * numa task sample period in ms
  */
 unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
-unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -881,6 +880,7 @@ static void task_numa_placement(struct task_struct *p)
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		int preferred_cpu;
+		int old_migrate_seq = p->numa_migrate_seq;
 
 		/*
 		 * If the task is not on the preferred node then find the most
@@ -892,6 +892,15 @@ static void task_numa_placement(struct task_struct *p)
 							     max_nid);
 
 		sched_setnuma(p, max_nid, preferred_cpu);
+
+		/*
+		 * If the preferred node changes frequently then the scan rate
+		 * will be continually high. Mitigate this by increasing the
+		 * scan rate only if the task was settled.
+		 */
+		if (old_migrate_seq >= sysctl_numa_balancing_settle_count)
+			p->numa_scan_period = max(p->numa_scan_period >> 1,
+					sysctl_numa_balancing_scan_period_min);
 	}
 }
 
@@ -985,19 +994,6 @@ void task_numa_work(struct callback_head *work)
 	}
 
 	/*
-	 * Reset the scan period if enough time has gone by. Objective is that
-	 * scanning will be reduced if pages are properly placed. As tasks
-	 * can enter different phases this needs to be re-examined. Lacking
-	 * proper tracking of reference behaviour, this blunt hammer is used.
-	 */
-	migrate = mm->numa_next_reset;
-	if (time_after(now, migrate)) {
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
-		next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
-		xchg(&mm->numa_next_reset, next_scan);
-	}
-
-	/*
 	 * Enforce maximal scan/migration frequency..
 	 */
 	migrate = mm->numa_next_scan;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 263486f..1fcbc68 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -373,13 +373,6 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname	= "numa_balancing_scan_period_reset",
-		.data		= &sysctl_numa_balancing_scan_period_reset,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
-	{
 		.procname	= "numa_balancing_scan_period_max_ms",
 		.data		= &sysctl_numa_balancing_scan_period_max,
 		.maxlen		= sizeof(unsigned int),
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 14:52     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 14:52 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> @@ -3897,6 +3907,28 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +/* Returns true if the destination node has incurred more faults */
> +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	int src_nid, dst_nid;
> +
> +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> +		return false;
> +
> +	src_nid = cpu_to_node(env->src_cpu);
> +	dst_nid = cpu_to_node(env->dst_cpu);
> +
> +	if (src_nid == dst_nid)
> +		return false;
> +
> +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> +	    p->numa_preferred_nid == dst_nid)
> +		return true;
> +
> +	return false;
> +}
> +
> +
>  /*
>   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
>   */
> @@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  
>  	/*
>  	 * Aggressive migration if:
> -	 * 1) task is cache cold, or
> -	 * 2) too many balance attempts have failed.
> +	 * 1) destination numa is preferred
> +	 * 2) task is cache cold, or
> +	 * 3) too many balance attempts have failed.
>  	 */
>  
> +	if (migrate_improves_locality(p, env))
> +		return 1;
> +
>  	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
>  	if (!tsk_cache_hot ||
>  		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {

Should we not also do the reverse; make it harder to worsen locality?

Similar to the task_hot() thing: do not allow migrating a task at a low
nr_balance_failed count when it makes the locality worse.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 14:53     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 14:53 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> This patch favours moving tasks towards the preferred NUMA node when
> it has just been selected. Ideally this is self-reinforcing as the
> longer the the task runs on that node, the more faults it should incur
> causing task_numa_placement to keep the task running on that node. In
> reality a big weakness is that the nodes CPUs can be overloaded and it
> would be more effficient to queue tasks on an idle node and migrate to
> the new node. This would require additional smarts in the balancer so
> for now the balancer will simply prefer to place the task on the
> preferred node for a tunable number of PTE scans.

This changelog fails to mention why you're adding the settle stuff in
this patch.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 14:54     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 14:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:05PM +0100, Mel Gorman wrote:
> +static int
> +find_idlest_cpu_node(int this_cpu, int nid)
> +{
> +	unsigned long load, min_load = ULONG_MAX;
> +	int i, idlest_cpu = this_cpu;
> +
> +	BUG_ON(cpu_to_node(this_cpu) == nid);
> +
> +	for_each_cpu(i, cpumask_of_node(nid)) {
> +		load = weighted_cpuload(i);
> +
> +		if (load < min_load) {
> +			struct task_struct *p;
> +
> +			/* Do not preempt a task running on its preferred node */
> +			struct rq *rq = cpu_rq(i);
> +			local_irq_disable();
> +			raw_spin_lock(&rq->lock);

raw_spin_lock_irq() ?

> +			p = rq->curr;
> +			if (p->numa_preferred_nid != nid) {
> +				min_load = load;
> +				idlest_cpu = i;
> +			}
> +			raw_spin_unlock(&rq->lock);
> +			local_irq_disable();
> +		}
> +	}
> +
> +	return idlest_cpu;
> +}
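
For clarity, a hedged kernel-style sketch of the quoted function using the
combined primitive being suggested (based only on the hunk above; note that
raw_spin_unlock_irq() also re-enables interrupts, which the second
local_irq_disable() in the quoted code presumably intended to be):

static int find_idlest_cpu_node_sketch(int this_cpu, int nid)
{
	unsigned long load, min_load = ULONG_MAX;
	int i, idlest_cpu = this_cpu;

	BUG_ON(cpu_to_node(this_cpu) == nid);

	for_each_cpu(i, cpumask_of_node(nid)) {
		load = weighted_cpuload(i);

		if (load < min_load) {
			struct rq *rq = cpu_rq(i);

			/* Disable IRQs and take the runqueue lock in one step */
			raw_spin_lock_irq(&rq->lock);
			/* Do not preempt a task running on its preferred node */
			if (rq->curr->numa_preferred_nid != nid) {
				min_load = load;
				idlest_cpu = i;
			}
			raw_spin_unlock_irq(&rq->lock);	/* re-enables IRQs */
		}
	}

	return idlest_cpu;
}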

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 14:56     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 14:56 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:06PM +0100, Mel Gorman wrote:
> +void task_numa_fault(int last_nid, int node, int pages, bool migrated)
>  {
>  	struct task_struct *p = current;
> +	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
>  
>  	if (!sched_feat_numa(NUMA))
>  		return;
>  
>  	/* Allocate buffer to track faults on a per-node basis */
>  	if (unlikely(!p->numa_faults)) {
> -		int size = sizeof(*p->numa_faults) * nr_node_ids;
> +		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
>  
>  		/* numa_faults and numa_faults_buffer share the allocation */
> -		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
> +		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
>  		if (!p->numa_faults)
>  			return;

So you need a buffer 2x the size in total; but you're now allocating
a buffer 4x larger than before.

Isn't doubling size alone sufficient?
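
For reference, a stand-alone sketch of the arithmetic behind this question
(nr_node_ids = 4 is an arbitrary value for the example; the two-counters-per-node
layout is taken from the quoted hunk):

#include <stdio.h>

int main(void)
{
	unsigned long nr_node_ids = 4;			/* assumed for illustration */
	unsigned long entry = sizeof(unsigned long);	/* sizeof(*p->numa_faults) */
	unsigned long size = entry * 2 * nr_node_ids;	/* private + shared per node */

	/* numa_faults and numa_faults_buffer share one allocation, so two
	 * arrays of 2 * nr_node_ids counters are all that is required. */
	printf("needed:    %lu bytes (size * 2)\n", size * 2);
	/* The quoted hunk allocates twice that much. */
	printf("allocated: %lu bytes (size * 4)\n", size * 4);
	return 0;
}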

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-27 14:59   ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 14:59 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:37:59PM +0100, Mel Gorman wrote:
> It's several months overdue and everything was quiet after 3.8 came out
> but I recently had a chance to revisit automatic NUMA balancing for a few
> days. I looked at basic scheduler integration resulting in the following
> small series. Much of the following is heavily based on the numacore series
> which in itself takes part of the autonuma series from back in November. In
> particular it borrows heavily from Peter Ziljstra's work in "sched, numa,
> mm: Add adaptive NUMA affinity support" but deviates too much to preserve
> Signed-off-bys. As before, if the relevant authors are ok with it I'll
> add Signed-off-bys (or add them yourselves if you pick the patches up).
> 
> This is still far from complete and there are known performance gaps between
> this and manual binding where possible and depending on the workload between
> it and interleaving when hard bindings are not an option.  As before,
> the intention is not to complete the work but to incrementally improve
> mainline and preserve bisectability for any bug reports that crop up. This
> will allow us to validate each step and keep reviewer stress to a minimum.

Yah..

Except for the few things I've already replied to; and a very strong
urge to run:

  sed -e 's/NUMA_BALANCE/SCHED_NUMA/g' -e 's/numa_balance/sched_numa/'

on both the tree and these patches I'm all for merging this.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 15:57     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 15:57 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:01PM +0100, Mel Gorman wrote:
> @@ -503,6 +503,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
>  #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
>  #define raw_rq()		(&__raw_get_cpu_var(runqueues))
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +extern void sched_setnuma(struct task_struct *p, int node, int shared);

Stray line; you're introducing that function later with a different
signature.

> +static inline void task_numa_free(struct task_struct *p)
> +{
> +	kfree(p->numa_faults);
> +}
> +#else /* CONFIG_NUMA_BALANCING */
> +static inline void task_numa_free(struct task_struct *p)
> +{
> +}
> +#endif /* CONFIG_NUMA_BALANCING */
> +
>  #ifdef CONFIG_SMP
>  
>  #define rcu_dereference_check_sched_domain(p) \
> -- 
> 1.8.1.4
> 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 16:01     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 16:01 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> @@ -3897,6 +3907,28 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +/* Returns true if the destination node has incurred more faults */
> +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	int src_nid, dst_nid;
> +
> +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> +		return false;
> +
> +	src_nid = cpu_to_node(env->src_cpu);
> +	dst_nid = cpu_to_node(env->dst_cpu);
> +
> +	if (src_nid == dst_nid)
> +		return false;
> +
> +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> +	    p->numa_preferred_nid == dst_nid)
> +		return true;
> +
> +	return false;
> +}
> +

This references ->numa_faults, which is declared under NUMA_BALANCING
but lacks any such conditionality here.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 16:11     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 16:11 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> +/* Returns true if the destination node has incurred more faults */
> +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	int src_nid, dst_nid;
> +
> +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> +		return false;
> +
> +	src_nid = cpu_to_node(env->src_cpu);
> +	dst_nid = cpu_to_node(env->dst_cpu);
> +
> +	if (src_nid == dst_nid)
> +		return false;
> +
> +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> +	    p->numa_preferred_nid == dst_nid)
> +		return true;
> +
> +	return false;
> +}

Also, until I just actually _read_ that function, I assumed it would
compare p->numa_faults[src_nid] and p->numa_faults[dst_nid], because
even when dst_nid isn't the preferred nid it might still have more
pages than where we currently are.

Idem with the proposed migrate_degrades_locality().

Something like so I suppose

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3969,6 +3969,7 @@ task_hot(struct task_struct *p, u64 now,
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
 /* Returns true if the destination node has incurred more faults */
 static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 {
@@ -3983,13 +3984,50 @@ static bool migrate_improves_locality(st
 	if (src_nid == dst_nid)
 		return false;
 
-	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
-	    p->numa_preferred_nid == dst_nid)
+	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (p->numa_preferred_nid == dst_nid)
+		return true;
+
+	if (p->numa_faults[src_nid] < p->numa_faults[dst_nid])
+		return true;
+
+	return false;
+}
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid)
+		return false;
+
+	if (p->numa_faults[src_nid] > p->numa_faults[dst_nid])
 		return true;
 
 	return false;
 }
 
+#else
+
+static inline bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	return false;
+}
+
+static inline bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	return false;
+}
+
+#endif /* CONFIG_NUMA_BALANCING */
 
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
@@ -4055,8 +4093,10 @@ int can_migrate_task(struct task_struct
 		return 1;
 
 	tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+	if (!tsk_cache_hot)
+		tsk_cache_hot = migrate_degrades_locality(p, env);
 	if (!tsk_cache_hot ||
-		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
+	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 
 		if (tsk_cache_hot) {
 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-28  6:08     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28  6:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:01]:

> This patch tracks what nodes numa hinting faults were incurred on.  Greater
> weight is given if the pages were to be migrated on the understanding
> that such faults cost significantly more. If a task has paid the cost to
> migrating data to that node then in the future it would be preferred if the
> task did not migrate the data again unnecessarily. This information is later
> used to schedule a task on the node incurring the most NUMA hinting faults.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/sched.h |  2 ++
>  kernel/sched/core.c   |  3 +++
>  kernel/sched/fair.c   | 12 +++++++++++-
>  kernel/sched/sched.h  | 12 ++++++++++++
>  4 files changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e692a02..72861b4 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1505,6 +1505,8 @@ struct task_struct {
>  	unsigned int numa_scan_period;
>  	u64 node_stamp;			/* migration stamp  */
>  	struct callback_head numa_work;
> +
> +	unsigned long *numa_faults;
>  #endif /* CONFIG_NUMA_BALANCING */
>  
>  	struct rcu_head rcu;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 67d0465..f332ec0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
>  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
>  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
>  	p->numa_work.next = &p->numa_work;
> +	p->numa_faults = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
>  }
>  
> @@ -1853,6 +1854,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>  	if (mm)
>  		mmdrop(mm);
>  	if (unlikely(prev_state == TASK_DEAD)) {
> +		task_numa_free(prev);
> +
>  		/*
>  		 * Remove function-return probe instances associated with this
>  		 * task and put them back on the free list.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a33e59..904fd6f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
>  	if (!sched_feat_numa(NUMA))
>  		return;
>  
> -	/* FIXME: Allocate task-specific structure for placement policy here */
> +	/* Allocate buffer to track faults on a per-node basis */
> +	if (unlikely(!p->numa_faults)) {
> +		int size = sizeof(*p->numa_faults) * nr_node_ids;
> +
> +		p->numa_faults = kzalloc(size, GFP_KERNEL);
> +		if (!p->numa_faults)
> +			return;
> +	}
>  
>  	/*
>  	 * If pages are properly placed (did not migrate) then scan slower.
> @@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
>  			p->numa_scan_period + jiffies_to_msecs(10));
>  
>  	task_numa_placement(p);
> +
> +	/* Record the fault, double the weight if pages were migrated */
> +	p->numa_faults[node] += pages << migrated;


Why are we doing this after the placement?
I mean, we should probably be doing this in task_numa_placement().

Since doubling the pages can have an effect on the preferred node, if we
do it here, won't we end up in a case where the numa_faults count on one
node is actually higher but that node does not end up as the preferred
node?
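
A minimal sketch of the reordering being suggested, reusing the names from
the quoted hunk (whether placement should see the doubled weight is exactly
the open question):

	/* Record the fault first, doubling the weight if the pages were
	 * migrated, so that task_numa_placement() sees the updated counts
	 * when it selects the preferred node. */
	p->numa_faults[node] += pages << migrated;

	task_numa_placement(p);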

>  }
>  
>  static void reset_ptenuma_scan(struct task_struct *p)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index cc03cfd..9c26d88 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -503,6 +503,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
>  #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
>  #define raw_rq()		(&__raw_get_cpu_var(runqueues))
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +extern void sched_setnuma(struct task_struct *p, int node, int shared);
> +static inline void task_numa_free(struct task_struct *p)
> +{
> +	kfree(p->numa_faults);
> +}
> +#else /* CONFIG_NUMA_BALANCING */
> +static inline void task_numa_free(struct task_struct *p)
> +{
> +}
> +#endif /* CONFIG_NUMA_BALANCING */
> +
>  #ifdef CONFIG_SMP
>  
>  #define rcu_dereference_check_sched_domain(p) \
> -- 
> 1.8.1.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 3/8] sched: Select a preferred node with the most numa hinting faults
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-28  6:14     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28  6:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:02]:

> This patch selects a preferred node for a task to run on based on the
> NUMA hinting faults. This information is later used to migrate tasks
> towards the node during balancing.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/sched.h |  1 +
>  kernel/sched/core.c   | 10 ++++++++++
>  kernel/sched/fair.c   | 16 ++++++++++++++--
>  kernel/sched/sched.h  |  2 +-
>  4 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 72861b4..ba46a64 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1507,6 +1507,7 @@ struct task_struct {
>  	struct callback_head numa_work;
>  
>  	unsigned long *numa_faults;
> +	int numa_preferred_nid;
>  #endif /* CONFIG_NUMA_BALANCING */
>  
>  	struct rcu_head rcu;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f332ec0..019baae 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
>  	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
>  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
>  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> +	p->numa_preferred_nid = -1;

Though we may not want to inherit the fault statistics, tasks generally
share pages with their siblings and parent. So would it make sense to
inherit the preferred node?
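
A sketch of the alternative being raised, assuming it were done in
__sched_fork() in place of the quoted initialisation (hypothetical; the
patch as posted starts every task at -1):

	/* Child starts out preferring the parent's node rather than none;
	 * "current" is the forking parent at this point. */
	p->numa_preferred_nid = current->numa_preferred_nid;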

>  	p->numa_work.next = &p->numa_work;
>  	p->numa_faults = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
> @@ -5713,6 +5714,15 @@ enum s_alloc {
>  
>  struct sched_domain_topology_level;
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +
> +/* Set a tasks preferred NUMA node */
> +void sched_setnuma(struct task_struct *p, int nid)
> +{
> +	p->numa_preferred_nid = nid;
> +}
> +#endif /* CONFIG_NUMA_BALANCING */
> +
>  typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
>  typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 904fd6f..f8c3f61 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -793,7 +793,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
>  
>  static void task_numa_placement(struct task_struct *p)
>  {
> -	int seq;
> +	int seq, nid, max_nid = 0;
> +	unsigned long max_faults = 0;
>  
>  	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
>  		return;
> @@ -802,7 +803,18 @@ static void task_numa_placement(struct task_struct *p)
>  		return;
>  	p->numa_scan_seq = seq;
>  
> -	/* FIXME: Scheduling placement policy hints go here */
> +	/* Find the node with the highest number of faults */
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		unsigned long faults = p->numa_faults[nid];
> +		p->numa_faults[nid] >>= 1;
> +		if (faults > max_faults) {
> +			max_faults = faults;
> +			max_nid = nid;
> +		}
> +	}
> +
> +	if (max_faults && max_nid != p->numa_preferred_nid)
> +		sched_setnuma(p, max_nid);
>  }
>  
>  /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9c26d88..65a0cf0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -504,7 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
>  #define raw_rq()		(&__raw_get_cpu_var(runqueues))
>  
>  #ifdef CONFIG_NUMA_BALANCING
> -extern void sched_setnuma(struct task_struct *p, int node, int shared);
> +extern void sched_setnuma(struct task_struct *p, int nid);
>  static inline void task_numa_free(struct task_struct *p)
>  {
>  	kfree(p->numa_faults);
> -- 
> 1.8.1.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 4/8] sched: Update NUMA hinting faults once per scan
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-28  6:32     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28  6:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:03]:

> NUMA hinting fault counts and placement decisions are both recorded in the
> same array, which distorts the samples in an unpredictable fashion. The values
> linearly accumulate during the scan and then decay, creating a sawtooth-like
> pattern in the per-node counts. It also means that placement decisions are
> time sensitive. At best it means that it is very difficult to state that
> the buffer holds a decaying average of past faulting behaviour. At worst,
> it can confuse the load balancer if it sees one node with an artificially high
> count due to very recent faulting activity and may create a bouncing effect.
> 
> This patch adds a second array. numa_faults stores the historical data
> which is used for placement decisions. numa_faults_buffer holds the
> fault activity during the current scan window. When the scan completes,
> numa_faults decays and the values from numa_faults_buffer are copied
> across.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/sched.h | 13 +++++++++++++
>  kernel/sched/core.c   |  1 +
>  kernel/sched/fair.c   | 16 +++++++++++++---
>  3 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ba46a64..42f9818 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1506,7 +1506,20 @@ struct task_struct {
>  	u64 node_stamp;			/* migration stamp  */
>  	struct callback_head numa_work;
>  
> +	/*
> +	 * Exponential decaying average of faults on a per-node basis.
> +	 * Scheduling placement decisions are made based on these counts.
> +	 * The values remain static for the duration of a PTE scan
> +	 */
>  	unsigned long *numa_faults;
> +
> +	/*
> +	 * numa_faults_buffer records faults per node during the current
> +	 * scan window. When the scan completes, the counts in numa_faults
> +	 * decay and these values are copied.
> +	 */
> +	unsigned long *numa_faults_buffer;
> +
>  	int numa_preferred_nid;
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 019baae..b00b81a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1596,6 +1596,7 @@ static void __sched_fork(struct task_struct *p)
>  	p->numa_preferred_nid = -1;
>  	p->numa_work.next = &p->numa_work;
>  	p->numa_faults = NULL;
> +	p->numa_faults_buffer = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
>  }
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f8c3f61..5893399 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -805,8 +805,14 @@ static void task_numa_placement(struct task_struct *p)
>  
>  	/* Find the node with the highest number of faults */
>  	for (nid = 0; nid < nr_node_ids; nid++) {
> -		unsigned long faults = p->numa_faults[nid];
> +		unsigned long faults;
> +
> +		/* Decay existing window and copy faults since last scan */
>  		p->numa_faults[nid] >>= 1;
> +		p->numa_faults[nid] += p->numa_faults_buffer[nid];
> +		p->numa_faults_buffer[nid] = 0;
> +
> +		faults = p->numa_faults[nid];
>  		if (faults > max_faults) {
>  			max_faults = faults;
>  			max_nid = nid;
> @@ -831,9 +837,13 @@ void task_numa_fault(int node, int pages, bool migrated)
>  	if (unlikely(!p->numa_faults)) {
>  		int size = sizeof(*p->numa_faults) * nr_node_ids;
>  
> -		p->numa_faults = kzalloc(size, GFP_KERNEL);
> +		/* numa_faults and numa_faults_buffer share the allocation */
> +		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);

Instead of allocating a buffer to hold the current faults, can't we pass
the number of pages and the node information (and probably the migrate
flag) to task_numa_placement()?

Why should task_struct be passed as an argument to task_numa_placement()?
It seems it will always be current.
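
For illustration, roughly the interface being suggested here -- a
hypothetical sketch only (the posted patch keeps the separate
numa_faults_buffer[] instead):

	/* Hypothetical: account the fault in the placement path directly */
	static void task_numa_placement(struct task_struct *p, int node,
					int pages, bool migrated)
	{
		/* record the fault, double the weight if pages were migrated */
		p->numa_faults[node] += pages << migrated;

		/* existing once-per-scan decay and preferred-node selection
		 * would follow here */
	}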

>  		if (!p->numa_faults)
>  			return;
> +
> +		BUG_ON(p->numa_faults_buffer);
> +		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
>  	}
>  
>  	/*
> @@ -847,7 +857,7 @@ void task_numa_fault(int node, int pages, bool migrated)
>  	task_numa_placement(p);
>  
>  	/* Record the fault, double the weight if pages were migrated */
> -	p->numa_faults[node] += pages << migrated;
> +	p->numa_faults_buffer[node] += pages << migrated;
>  }
>  
>  static void reset_ptenuma_scan(struct task_struct *p)
> -- 
> 1.8.1.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-28  7:00     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28  7:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:06]:

> Ideally it would be possible to distinguish between NUMA hinting faults
> that are private to a task and those that are shared. This would require
> recording the last task that accessed a page for a hinting fault, which
> would increase the size of struct page. Instead this patch approximates
> private pages by assuming that faults that pass the two-stage filter are
> private pages and all others are shared. The preferred NUMA node is then
> selected based on where the maximum number of approximately private
> faults was measured.

Should we consider only private faults for the preferred node?
I would think that if tasks have shared pages then moving all tasks that
share the same pages to the node where those shared pages reside would be
preferred. No? If yes, how does the preferred-node logic help to achieve
that?
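
For reference, a minimal sketch of the layout the quoted patch uses; the
task_faults_idx() helper is taken from the diff below, while
numa_fault_is_private() is only an illustrative name for the priv test in
task_numa_fault():

	/*
	 * Two counters per node:
	 *   numa_faults[2 * nid + 0]  - approximately shared faults
	 *   numa_faults[2 * nid + 1]  - approximately private faults
	 * Only the private counts feed the preferred-node choice so far.
	 */
	static inline int task_faults_idx(int nid, int priv)
	{
		return 2 * nid + priv;
	}

	/* A fault is treated as private when the faulting task is running on
	 * the same node that took the previous fault on this page. */
	static inline int numa_fault_is_private(struct task_struct *p, int last_nid)
	{
		return cpu_to_node(task_cpu(p)) == last_nid;
	}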

> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/sched.h |  4 ++--
>  kernel/sched/fair.c   | 32 ++++++++++++++++++++++----------
>  mm/huge_memory.c      |  7 ++++---
>  mm/memory.c           |  9 ++++++---
>  4 files changed, 34 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 82a6136..a41edea 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1600,10 +1600,10 @@ struct task_struct {
>  #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
>  
>  #ifdef CONFIG_NUMA_BALANCING
> -extern void task_numa_fault(int node, int pages, bool migrated);
> +extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
>  extern void set_numabalancing_state(bool enabled);
>  #else
> -static inline void task_numa_fault(int node, int pages, bool migrated)
> +static inline void task_numa_fault(int last_node, int node, int pages, bool migrated)
>  {
>  }
>  static inline void set_numabalancing_state(bool enabled)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 99951a8..490e601 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -833,6 +833,11 @@ find_idlest_cpu_node(int this_cpu, int nid)
>  	return idlest_cpu;
>  }
>  
> +static inline int task_faults_idx(int nid, int priv)
> +{
> +	return 2 * nid + priv;
> +}
> +
>  static void task_numa_placement(struct task_struct *p)
>  {
>  	int seq, nid, max_nid = 0;
> @@ -849,13 +854,19 @@ static void task_numa_placement(struct task_struct *p)
>  	/* Find the node with the highest number of faults */
>  	for (nid = 0; nid < nr_node_ids; nid++) {
>  		unsigned long faults;
> +		int priv, i;
>  
> -		/* Decay existing window and copy faults since last scan */
> -		p->numa_faults[nid] >>= 1;
> -		p->numa_faults[nid] += p->numa_faults_buffer[nid];
> -		p->numa_faults_buffer[nid] = 0;
> +		for (priv = 0; priv < 2; priv++) {
> +			i = task_faults_idx(nid, priv);
> +
> +			/* Decay existing window and copy faults since last scan */
> +			p->numa_faults[i] >>= 1;
> +			p->numa_faults[i] += p->numa_faults_buffer[i];
> +			p->numa_faults_buffer[i] = 0;
> +		}
>  
> -		faults = p->numa_faults[nid];
> +		/* Find maximum private faults */
> +		faults = p->numa_faults[task_faults_idx(nid, 1)];
>  		if (faults > max_faults) {
>  			max_faults = faults;
>  			max_nid = nid;
> @@ -887,24 +898,25 @@ static void task_numa_placement(struct task_struct *p)
>  /*
>   * Got a PROT_NONE fault for a page on @node.
>   */
> -void task_numa_fault(int node, int pages, bool migrated)
> +void task_numa_fault(int last_nid, int node, int pages, bool migrated)
>  {
>  	struct task_struct *p = current;
> +	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
>  
>  	if (!sched_feat_numa(NUMA))
>  		return;
>  
>  	/* Allocate buffer to track faults on a per-node basis */
>  	if (unlikely(!p->numa_faults)) {
> -		int size = sizeof(*p->numa_faults) * nr_node_ids;
> +		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
>  
>  		/* numa_faults and numa_faults_buffer share the allocation */
> -		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
> +		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
>  		if (!p->numa_faults)
>  			return;
>  
>  		BUG_ON(p->numa_faults_buffer);
> -		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
> +		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
>  	}
>  
>  	/*
> @@ -918,7 +930,7 @@ void task_numa_fault(int node, int pages, bool migrated)
>  	task_numa_placement(p);
>  
>  	/* Record the fault, double the weight if pages were migrated */
> -	p->numa_faults_buffer[node] += pages << migrated;
> +	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
>  }
>  
>  static void reset_ptenuma_scan(struct task_struct *p)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e2f7f5aa..7cd7114 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  {
>  	struct page *page;
>  	unsigned long haddr = addr & HPAGE_PMD_MASK;
> -	int target_nid;
> +	int target_nid, last_nid;
>  	int current_nid = -1;
>  	bool migrated;
>  
> @@ -1307,6 +1307,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (current_nid == numa_node_id())
>  		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
>  
> +	last_nid = page_nid_last(page);
>  	target_nid = mpol_misplaced(page, vma, haddr);
>  	if (target_nid == -1) {
>  		put_page(page);
> @@ -1332,7 +1333,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (!migrated)
>  		goto check_same;
>  
> -	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
> +	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
>  	return 0;
>  
>  check_same:
> @@ -1347,7 +1348,7 @@ clear_pmdnuma:
>  out_unlock:
>  	spin_unlock(&mm->page_table_lock);
>  	if (current_nid != -1)
> -		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
> +		task_numa_fault(last_nid, current_nid, HPAGE_PMD_NR, false);
>  	return 0;
>  }
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index ba94dec..c28bf52 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  {
>  	struct page *page = NULL;
>  	spinlock_t *ptl;
> -	int current_nid = -1;
> +	int current_nid = -1, last_nid;
>  	int target_nid;
>  	bool migrated = false;
>  
> @@ -3566,6 +3566,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		return 0;
>  	}
>  
> +	last_nid = page_nid_last(page);
>  	current_nid = page_to_nid(page);
>  	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
>  	pte_unmap_unlock(ptep, ptl);
> @@ -3586,7 +3587,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  out:
>  	if (current_nid != -1)
> -		task_numa_fault(current_nid, 1, migrated);
> +		task_numa_fault(last_nid, current_nid, 1, migrated);
>  	return 0;
>  }
>  
> @@ -3602,6 +3603,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	spinlock_t *ptl;
>  	bool numa = false;
>  	int local_nid = numa_node_id();
> +	int last_nid;
>  
>  	spin_lock(&mm->page_table_lock);
>  	pmd = *pmdp;
> @@ -3654,6 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * migrated to.
>  		 */
>  		curr_nid = local_nid;
> +		last_nid = page_nid_last(page);
>  		target_nid = numa_migrate_prep(page, vma, addr,
>  					       page_to_nid(page));
>  		if (target_nid == -1) {
> @@ -3666,7 +3669,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		migrated = migrate_misplaced_page(page, target_nid);
>  		if (migrated)
>  			curr_nid = target_nid;
> -		task_numa_fault(curr_nid, 1, migrated);
> +		task_numa_fault(last_nid, curr_nid, 1, migrated);
>  
>  		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>  	}
> -- 
> 1.8.1.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-28  8:11     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28  8:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:04]:

> This patch favours moving tasks towards the preferred NUMA node when
> it has just been selected. Ideally this is self-reinforcing as the
> longer the task runs on that node, the more faults it should incur,
> causing task_numa_placement to keep the task running on that node. In
> reality a big weakness is that the node's CPUs can be overloaded and it
> would be more efficient to queue tasks on an idle node and migrate to
> the new node. This would require additional smarts in the balancer, so
> for now the balancer will simply prefer to place the task on the
> preferred node for a tunable number of PTE scans.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  Documentation/sysctl/kernel.txt |  8 +++++++-
>  include/linux/sched.h           |  1 +
>  kernel/sched/core.c             |  4 +++-
>  kernel/sched/fair.c             | 40 ++++++++++++++++++++++++++++++++++++++--
>  kernel/sysctl.c                 |  7 +++++++
>  5 files changed, 56 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 0fe678c..246b128 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -374,7 +374,8 @@ feature should be disabled. Otherwise, if the system overhead from the
>  feature is too high then the rate the kernel samples for NUMA hinting
>  faults may be controlled by the numa_balancing_scan_period_min_ms,
>  numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
> -numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
> +numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
> +numa_balancing_settle_count sysctls.
>  
>  ==============================================================
>  
> @@ -418,6 +419,11 @@ scanned for a given scan.
>  numa_balancing_scan_period_reset is a blunt instrument that controls how
>  often a tasks scan delay is reset to detect sudden changes in task behaviour.
>  
> +numa_balancing_settle_count is how many scan periods must complete before
> +the scheduler stops pushing the task towards a preferred node. This
> +gives the scheduler a chance to place the task on an alternative node if the
> +preferred node is overloaded.
> +
>  ==============================================================
>  
>  osrelease, ostype & version:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 42f9818..82a6136 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -815,6 +815,7 @@ enum cpu_idle_type {
>  #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
>  #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
>  #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
> +#define SD_NUMA			0x4000	/* cross-node balancing */
>  
>  extern int __weak arch_sd_sibiling_asym_packing(void);
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b00b81a..ba9470e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1591,7 +1591,7 @@ static void __sched_fork(struct task_struct *p)
>  
>  	p->node_stamp = 0ULL;
>  	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> -	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> +	p->numa_migrate_seq = 0;
>  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
>  	p->numa_preferred_nid = -1;
>  	p->numa_work.next = &p->numa_work;
> @@ -5721,6 +5721,7 @@ struct sched_domain_topology_level;
>  void sched_setnuma(struct task_struct *p, int nid)
>  {
>  	p->numa_preferred_nid = nid;
> +	p->numa_migrate_seq = 0;
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> @@ -6150,6 +6151,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
>  					| 0*SD_SHARE_PKG_RESOURCES
>  					| 1*SD_SERIALIZE
>  					| 0*SD_PREFER_SIBLING
> +					| 1*SD_NUMA
>  					| sd_local_flags(level)
>  					,
>  		.last_balance		= jiffies,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5893399..5e7f728 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -791,6 +791,15 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
>  /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
>  unsigned int sysctl_numa_balancing_scan_delay = 1000;
>  
> +/*
> + * Once a preferred node is selected the scheduler balancer will prefer moving
> + * a task to that node for sysctl_numa_balancing_settle_count number of PTE
> + * scans. This will give the process the chance to accumulate more faults on
> + * the preferred node but still allow the scheduler to move the task again if
> + * the node's CPUs are overloaded.
> + */
> +unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
> +
>  static void task_numa_placement(struct task_struct *p)
>  {
>  	int seq, nid, max_nid = 0;
> @@ -802,6 +811,7 @@ static void task_numa_placement(struct task_struct *p)
>  	if (p->numa_scan_seq == seq)
>  		return;
>  	p->numa_scan_seq = seq;
> +	p->numa_migrate_seq++;
>  
>  	/* Find the node with the highest number of faults */
>  	for (nid = 0; nid < nr_node_ids; nid++) {
> @@ -3897,6 +3907,28 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +/* Returns true if the destination node has incurred more faults */
> +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	int src_nid, dst_nid;
> +
> +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> +		return false;
> +
> +	src_nid = cpu_to_node(env->src_cpu);
> +	dst_nid = cpu_to_node(env->dst_cpu);
> +
> +	if (src_nid == dst_nid)
> +		return false;
> +
> +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&

Let's say numa_migrate_seq is greater than settle_count but the task is
running on the wrong node; shouldn't this be taken as a good opportunity
to move the task?
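
For illustration, the relaxation being asked about might look roughly like
this inside migrate_improves_locality() -- hypothetical, not what the
posted patch does:

	/* Hypothetical: keep pulling towards the preferred node even after
	 * the task has settled, relying on the normal balancer paths to deal
	 * with overload on that node.
	 */
	if (p->numa_preferred_nid == dst_nid)
		return true;

	return false;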

> +	    p->numa_preferred_nid == dst_nid)
> +		return true;
> +
> +	return false;
> +}
> +
> +
>  /*
>   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
>   */
> @@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  
>  	/*
>  	 * Aggressive migration if:
> -	 * 1) task is cache cold, or
> -	 * 2) too many balance attempts have failed.
> +	 * 1) destination numa is preferred
> +	 * 2) task is cache cold, or
> +	 * 3) too many balance attempts have failed.
>  	 */
>  
> +	if (migrate_improves_locality(p, env))
> +		return 1;

Shouldn't this be under the tsk_cache_hot check?

If the task is cache hot, then we would have to update the corresponding
schedstat metrics.
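
For illustration, one way to fold the locality check into the existing
cache-hot path so the schedstat accounting is preserved -- a hypothetical
rearrangement of the quoted hunk; the schedstat lines follow the existing
code in can_migrate_task():

	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
	if (!tsk_cache_hot || migrate_improves_locality(p, env) ||
	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
#ifdef CONFIG_SCHEDSTATS
		if (tsk_cache_hot) {
			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
			schedstat_inc(p, se.statistics.nr_forced_migrations);
		}
#endif
		return 1;
	}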


> +
>  	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
>  	if (!tsk_cache_hot ||
>  		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index afc1dc6..263486f 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -393,6 +393,13 @@ static struct ctl_table kern_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> +	{
> +		.procname       = "numa_balancing_settle_count",
> +		.data           = &sysctl_numa_balancing_settle_count,
> +		.maxlen         = sizeof(unsigned int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec,
> +	},
>  #endif /* CONFIG_NUMA_BALANCING */
>  #endif /* CONFIG_SCHED_DEBUG */
>  	{
> -- 
> 1.8.1.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis
  2013-06-28  6:08     ` Srikar Dronamraju
@ 2013-06-28  8:56       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28  8:56 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 11:38:29AM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:01]:
> > @@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
> >  			p->numa_scan_period + jiffies_to_msecs(10));
> >  
> >  	task_numa_placement(p);
> > +
> > +	/* Record the fault, double the weight if pages were migrated */
> > +	p->numa_faults[node] += pages << migrated;
> 
> 
> Why are we doing this after the placement?
> I mean we should probably be doing this in the task_numa_placement,

The placement only does something when we've completed a full scan; this
would then be the first fault of the next scan. Hence we do placement
first so as not to add this first fault of the next scan to
->numa_faults[].

This all gets changed later on when ->numa_faults_curr[] gets
introduced.
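
Spelled out against the hunk quoted above (an annotated restatement, not
new code):

	/* Placement first: it only acts on a scan-sequence change, using the
	 * counts accumulated during the window that just completed. */
	task_numa_placement(p);

	/* Then record this fault, which already belongs to the next scan
	 * window; migrated pages count double. */
	p->numa_faults[node] += pages << migrated;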

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 3/8] sched: Select a preferred node with the most numa hinting faults
  2013-06-28  6:14     ` Srikar Dronamraju
@ 2013-06-28  8:59       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28  8:59 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 11:44:28AM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:02]:
> 
> > This patch selects a preferred node for a task to run on based on the
> > NUMA hinting faults. This information is later used to migrate tasks
> > towards the node during balancing.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  include/linux/sched.h |  1 +
> >  kernel/sched/core.c   | 10 ++++++++++
> >  kernel/sched/fair.c   | 16 ++++++++++++++--
> >  kernel/sched/sched.h  |  2 +-
> >  4 files changed, 26 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 72861b4..ba46a64 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1507,6 +1507,7 @@ struct task_struct {
> >  	struct callback_head numa_work;
> >  
> >  	unsigned long *numa_faults;
> > +	int numa_preferred_nid;
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  
> >  	struct rcu_head rcu;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f332ec0..019baae 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
> >  	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> >  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> >  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> > +	p->numa_preferred_nid = -1;
> 
> Though we may not want to inherit faults, tasks generally share pages
> with their siblings and parent. So would it make sense to inherit the
> preferred node?

One of the patches I have locally wipes the numa state on exec(). I
think we want to do that if we're going to think about inheriting stuff.
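
For illustration, such a wipe might look roughly like this -- a
hypothetical sketch (the local patch mentioned above is not posted here);
task_numa_reset() is an invented name and the fields follow the quoted
series:

	static void task_numa_reset(struct task_struct *p)
	{
		p->numa_preferred_nid = -1;
		p->numa_scan_period = sysctl_numa_balancing_scan_delay;
		/* numa_faults and numa_faults_buffer share one allocation */
		kfree(p->numa_faults);
		p->numa_faults = NULL;
		p->numa_faults_buffer = NULL;
	}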



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 4/8] sched: Update NUMA hinting faults once per scan
  2013-06-28  6:32     ` Srikar Dronamraju
@ 2013-06-28  9:01       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28  9:01 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 12:02:33PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:03]:
> > @@ -831,9 +837,13 @@ void task_numa_fault(int node, int pages, bool migrated)
> >  	if (unlikely(!p->numa_faults)) {
> >  		int size = sizeof(*p->numa_faults) * nr_node_ids;
> >  
> > -		p->numa_faults = kzalloc(size, GFP_KERNEL);
> > +		/* numa_faults and numa_faults_buffer share the allocation */
> > +		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
> 
> Instead of allocating a buffer to hold the current faults, can't we pass
> the number of pages and node information (and probably migrated) to
> task_numa_placement()?

I'm afraid I don't get your question; there's more storage required than
just the arguments.

> Why should task_struct be passed as an argument to task_numa_placement()?
> It seems it will always be current.

Customary in these parts -- motivated by the fact that using current
is/can be more expensive than passing an argument.
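
For reference, the layout the quoted hunk implies is one allocation carved
into two halves: the per-node counters used for placement plus a scratch
buffer the scan fills in. A sketch (the surrounding code is not quoted in
this thread, so the exact form is an assumption):

	int size = sizeof(*p->numa_faults) * nr_node_ids;

	/* One allocation, two halves: the first nr_node_ids entries hold the
	 * per-node fault counts used for placement, the second half is the
	 * scratch buffer the current scan accumulates into. */
	p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
	if (!p->numa_faults)
		return;
	p->numa_faults_buffer = p->numa_faults + nr_node_ids;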

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28  8:11     ` Srikar Dronamraju
@ 2013-06-28  9:04       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28  9:04 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 01:41:20PM +0530, Srikar Dronamraju wrote:

Please trim your replies.

> > +/* Returns true if the destination node has incurred more faults */
> > +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> > +{
> > +	int src_nid, dst_nid;
> > +
> > +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> > +		return false;
> > +
> > +	src_nid = cpu_to_node(env->src_cpu);
> > +	dst_nid = cpu_to_node(env->dst_cpu);
> > +
> > +	if (src_nid == dst_nid)
> > +		return false;
> > +
> > +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> 
> Let's say even if the numa_migrate_seq is greater than settle_count but running
> on a wrong node, then shouldn't this be taken as a good opportunity to
> move the task?

I think that's what it's doing; so this stmt says: if seq is large and
we're trying to move to the 'right' node, move it now.

> > +	    p->numa_preferred_nid == dst_nid)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> > +
> >  /*
> >   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> >   */
> > @@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >  
> >  	/*
> >  	 * Aggressive migration if:
> > -	 * 1) task is cache cold, or
> > -	 * 2) too many balance attempts have failed.
> > +	 * 1) destination numa is preferred
> > +	 * 2) task is cache cold, or
> > +	 * 3) too many balance attempts have failed.
> >  	 */
> >  
> > +	if (migrate_improves_locality(p, env))
> > +		return 1;
> 
> Shouldnt this be under tsk_cache_hot check?
> 
> If the task is cache hot, then we would have to update the corresponding  schedstat
> metrics.

No; you want migrate_degrades_locality() to be like task_hot(). You want
to _always_ migrate tasks towards better locality irrespective of local
cache hotness.
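
A minimal sketch of the symmetric helper being described (its shape here is
an assumption; Peter posts his own version later in the thread): it mirrors
migrate_improves_locality() but fires when a move would take the task away
from its preferred node, so the balancer can treat that like cache hotness.

	static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
	{
		int src_nid, dst_nid;

		if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
			return false;

		src_nid = cpu_to_node(env->src_cpu);
		dst_nid = cpu_to_node(env->dst_cpu);

		if (src_nid == dst_nid)
			return false;

		/* Moving off the preferred node while still settling */
		return p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
		       p->numa_preferred_nid == src_nid;
	}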

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-28  7:00     ` Srikar Dronamraju
@ 2013-06-28  9:36       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28  9:36 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 12:30:27PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:06]:
> 
> > Ideally it would be possible to distinguish between NUMA hinting faults
> > that are private to a task and those that are shared. This would require
> > that the last task that accessed a page for a hinting fault would be
> > recorded which would increase the size of struct page. Instead this patch
> > approximates private pages by assuming that faults that pass the two-stage
> > filter are private pages and all others are shared. The preferred NUMA
> > node is then selected based on where the maximum number of approximately
> > private faults were measured.
> 
> Should we consider only private faults for preferred node?

I don't think so; it's optimal for the task to be nearest to most of its pages,
irrespective of whether they are private or shared.

> I would think if tasks have shared pages then moving all tasks that share
> the same pages to a node where the share pages are around would be
> preferred. No? 

Well no; not if there's only 5 shared pages but 1024 private pages.

> If yes, how does the preferred node logic help to achieve
> the above?

There's no packing logic yet...
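
For context, the preferred-node selection being discussed boils down to an
argmax over the per-node fault counters. A minimal sketch of that selection
(task_numa_placement() itself is not quoted in full in this thread, so this
is an approximation, not the posted code):

	static void task_numa_placement(struct task_struct *p)
	{
		int nid, max_nid = -1;
		unsigned long max_faults = 0;

		if (!p->numa_faults)
			return;

		/* Pick the node that has accumulated the most hinting faults. */
		for_each_online_node(nid) {
			unsigned long faults = p->numa_faults[nid];

			if (faults > max_faults) {
				max_faults = faults;
				max_nid = nid;
			}
		}

		if (max_nid != -1)
			p->numa_preferred_nid = max_nid;
	}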

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28  9:04       ` Peter Zijlstra
@ 2013-06-28 10:07         ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28 10:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

> > > +
> > > +
> > >  /*
> > >   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> > >   */
> > > @@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> > >  
> > >  	/*
> > >  	 * Aggressive migration if:
> > > -	 * 1) task is cache cold, or
> > > -	 * 2) too many balance attempts have failed.
> > > +	 * 1) destination numa is preferred
> > > +	 * 2) task is cache cold, or
> > > +	 * 3) too many balance attempts have failed.
> > >  	 */
> > >  
> > > +	if (migrate_improves_locality(p, env))
> > > +		return 1;
> > 
> > Shouldnt this be under tsk_cache_hot check?
> > 
> > If the task is cache hot, then we would have to update the corresponding  schedstat
> > metrics.
> 
> No; you want migrate_degrades_locality() to be like task_hot(). You want
> to _always_ migrate tasks towards better locality irrespective of local
> cache hotness.
> 

Yes, I understand that numa should take priority over cache.
But the schedstats will not be updated to reflect whether the task was hot or
cold.

So let's say the task was cache hot but numa wants it to move; then we
should certainly move it, but we should update the schedstats to record that we
moved a cache-hot task.

Something akin to this.

	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
	if (tsk_cache_hot) {
		if (migrate_improves_locality(p, env) || 
		 	(env->sd->nr_balance_failed > env->sd->cache_nice_tries)) {
#ifdef CONFIG_SCHEDSTATS
			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
			schedstat_inc(p, se.statistics.nr_forced_migrations);
#endif
			return 1;
		}
		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
		return 0;
	}
	return 1;

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-28  9:36       ` Peter Zijlstra
@ 2013-06-28 10:12         ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28 10:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

> > 
> > > Ideally it would be possible to distinguish between NUMA hinting faults
> > > that are private to a task and those that are shared. This would require
> > > that the last task that accessed a page for a hinting fault would be
> > > recorded which would increase the size of struct page. Instead this patch
> > > approximates private pages by assuming that faults that pass the two-stage
> > > filter are private pages and all others are shared. The preferred NUMA
> > > node is then selected based on where the maximum number of approximately
> > > private faults were measured.
> > 
> > Should we consider only private faults for preferred node?
> 
> I don't think so; its optimal for the task to be nearest most of its pages;
> irrespective of whether they be private or shared.

Then the preferred node should have been chosen based on both the
private and shared faults and not just private faults.

> 
> > I would think if tasks have shared pages then moving all tasks that share
> > the same pages to a node where the share pages are around would be
> > preferred. No? 
> 
> Well no; not if there's only 5 shared pages but 1024 private pages.

Yes, agree, but should we try to give the shared pages some additional weightage?

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 3/8] sched: Select a preferred node with the most numa hinting faults
  2013-06-28  8:59       ` Peter Zijlstra
@ 2013-06-28 10:24         ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28 10:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

> > >  
> > >  	struct rcu_head rcu;
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f332ec0..019baae 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
> > >  	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> > >  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> > >  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> > > +	p->numa_preferred_nid = -1;
> > 
> > Though we may not want to inherit faults, I think the tasks generally
> > share pages with their siblings, parent. So will it make sense to
> > inherit the preferred node?
> 
> One of the patches I have locally wipes the numa state on exec(). I
> think we want to do that if we're going to think about inheriting stuff.
> 
> 

Agreed, if we inherit the preferred node, we would have to reset it on exec.
Since we have to reset numa_faults on exec as well, the reset of the
preferred node can go in task_numa_free().
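
A minimal sketch of what that could look like (task_numa_free() is only
referenced, not quoted, in this thread, so the body below is an assumption):

	void task_numa_free(struct task_struct *p)
	{
		/* Drop the per-node fault counters and forget the
		 * preferred node along with them. */
		kfree(p->numa_faults);
		p->numa_faults = NULL;
		p->numa_preferred_nid = -1;
	}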

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28 10:07         ` Srikar Dronamraju
@ 2013-06-28 10:24           ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28 10:24 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 03:37:23PM +0530, Srikar Dronamraju wrote:
> > > > +
> > > > +
> > > >  /*
> > > >   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> > > >   */
> > > > @@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> > > >  
> > > >  	/*
> > > >  	 * Aggressive migration if:
> > > > -	 * 1) task is cache cold, or
> > > > -	 * 2) too many balance attempts have failed.
> > > > +	 * 1) destination numa is preferred
> > > > +	 * 2) task is cache cold, or
> > > > +	 * 3) too many balance attempts have failed.
> > > >  	 */
> > > >  
> > > > +	if (migrate_improves_locality(p, env))
> > > > +		return 1;
> > > 
> > > Shouldnt this be under tsk_cache_hot check?
> > > 
> > > If the task is cache hot, then we would have to update the corresponding  schedstat
> > > metrics.
> > 
> > No; you want migrate_degrades_locality() to be like task_hot(). You want
> > to _always_ migrate tasks towards better locality irrespective of local
> > cache hotness.
> > 
> 
> Yes, I understand that numa should have more priority over cache.
> But the schedstats will not be updated about whether the task was hot or
> cold.
> 
> So lets say the task was cache hot but numa wants it to move, then we
> should certainly move it but we should update the schedstats to mention that we
> moved a cache hot task.
> 
> Something akin to this.
> 
> 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
> 	if (tsk_cache_hot) {
> 		if (migrate_improves_locality(p, env) || 
> 		 	(env->sd->nr_balance_failed > env->sd->cache_nice_tries)) {
> #ifdef CONFIG_SCHEDSTATS
> 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> 			schedstat_inc(p, se.statistics.nr_forced_migrations);
> #endif
> 			return 1;
> 		}
> 		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
> 		return 0;
> 	}
> 	return 1;

Ah right.. ok that might make sense.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-28 10:12         ` Srikar Dronamraju
@ 2013-06-28 10:33           ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28 10:33 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 03:42:45PM +0530, Srikar Dronamraju wrote:
> > > 
> > > > Ideally it would be possible to distinguish between NUMA hinting faults
> > > > that are private to a task and those that are shared. This would require
> > > > that the last task that accessed a page for a hinting fault would be
> > > > recorded which would increase the size of struct page. Instead this patch
> > > > approximates private pages by assuming that faults that pass the two-stage
> > > > filter are private pages and all others are shared. The preferred NUMA
> > > > node is then selected based on where the maximum number of approximately
> > > > private faults were measured.
> > > 
> > > Should we consider only private faults for preferred node?
> > 
> > I don't think so; its optimal for the task to be nearest most of its pages;
> > irrespective of whether they be private or shared.
> 
> Then the preferred node should have been chosen based on both the
> private and shared faults and not just private faults.

Oh duh, indeed. I totally missed that it did that. The changelog also isn't
giving a rationale for this. Mel?

> > 
> > > I would think if tasks have shared pages then moving all tasks that share
> > > the same pages to a node where the share pages are around would be
> > > preferred. No? 
> > 
> > Well no; not if there's only 5 shared pages but 1024 private pages.
> 
> Yes, agree, but should we try to give the shared pages some additional weightage?

Yes, because you'll get 1/n of the faults on shared pages for threads --
other threads will contend for the same PTE fault. And no, because for
inter-process shared memory they'll each have their own PTE. And maybe,
because even for the threaded case it's hard to tell how many threads
will actually contend for that one PTE.

Confused enough? :-)
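
If such a weighting were attempted, it might look like the sketch below when
scoring a node for preferred-node selection. This is purely hypothetical --
the split arrays and the tunable are made up for illustration, nothing like
this is in the posted series:

	/* Hypothetical: boost shared faults to compensate for the 1/n
	 * sampling of contended PTEs among threads.  numa_faults_private,
	 * numa_faults_shared and SHARED_FAULT_WEIGHT are all assumptions. */
	static unsigned long task_node_score(struct task_struct *p, int nid)
	{
		return p->numa_faults_private[nid] +
		       p->numa_faults_shared[nid] * SHARED_FAULT_WEIGHT;
	}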


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis
  2013-06-27 15:57     ` Peter Zijlstra
@ 2013-06-28 12:22       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 12:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Jun 27, 2013 at 05:57:48PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2013 at 03:38:01PM +0100, Mel Gorman wrote:
> > @@ -503,6 +503,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
> >  #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
> >  #define raw_rq()		(&__raw_get_cpu_var(runqueues))
> >  
> > +#ifdef CONFIG_NUMA_BALANCING
> > +extern void sched_setnuma(struct task_struct *p, int node, int shared);
> 
> Stray line; you're introducing that function later with a different
> signature.
> 

Fixed, thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis
  2013-06-28  6:08     ` Srikar Dronamraju
@ 2013-06-28 12:30       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 12:30 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 11:38:29AM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:01]:
> 
> > This patch tracks what nodes numa hinting faults were incurred on.  Greater
> > weight is given if the pages were to be migrated on the understanding
> > that such faults cost significantly more. If a task has paid the cost to
> > migrating data to that node then in the future it would be preferred if the
> > task did not migrate the data again unnecessarily. This information is later
> > used to schedule a task on the node incurring the most NUMA hinting faults.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  include/linux/sched.h |  2 ++
> >  kernel/sched/core.c   |  3 +++
> >  kernel/sched/fair.c   | 12 +++++++++++-
> >  kernel/sched/sched.h  | 12 ++++++++++++
> >  4 files changed, 28 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index e692a02..72861b4 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1505,6 +1505,8 @@ struct task_struct {
> >  	unsigned int numa_scan_period;
> >  	u64 node_stamp;			/* migration stamp  */
> >  	struct callback_head numa_work;
> > +
> > +	unsigned long *numa_faults;
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  
> >  	struct rcu_head rcu;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 67d0465..f332ec0 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
> >  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> >  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> >  	p->numa_work.next = &p->numa_work;
> > +	p->numa_faults = NULL;
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  }
> >  
> > @@ -1853,6 +1854,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> >  	if (mm)
> >  		mmdrop(mm);
> >  	if (unlikely(prev_state == TASK_DEAD)) {
> > +		task_numa_free(prev);
> > +
> >  		/*
> >  		 * Remove function-return probe instances associated with this
> >  		 * task and put them back on the free list.
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7a33e59..904fd6f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
> >  	if (!sched_feat_numa(NUMA))
> >  		return;
> >  
> > -	/* FIXME: Allocate task-specific structure for placement policy here */
> > +	/* Allocate buffer to track faults on a per-node basis */
> > +	if (unlikely(!p->numa_faults)) {
> > +		int size = sizeof(*p->numa_faults) * nr_node_ids;
> > +
> > +		p->numa_faults = kzalloc(size, GFP_KERNEL);
> > +		if (!p->numa_faults)
> > +			return;
> > +	}
> >  
> >  	/*
> >  	 * If pages are properly placed (did not migrate) then scan slower.
> > @@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
> >  			p->numa_scan_period + jiffies_to_msecs(10));
> >  
> >  	task_numa_placement(p);
> > +
> > +	/* Record the fault, double the weight if pages were migrated */
> > +	p->numa_faults[node] += pages << migrated;
> 
> 
> Why are we doing this after the placement.
> I mean we should probably be doing this in the task_numa_placement,
> 

Peter covered this.

> Since doubling the pages can have an effect on the preferred node. If we
> do it here, wont it end up in a case where the numa_faults on one node
> is actually higher but it may end up being not the preferred node?
> 

Possibly, but it's important to take the cost of migration into account. I
want to prefer keeping tasks on nodes that data was already migrated to.

There is a much more serious problem with fault sampling that I have yet
to think of a good solution for. Consider a task that exhibits very high
locality in a small private array but occasionally updates shared statistics
kept in a large array. In this case the PTE scanner will incur a larger
number of faults in the shared array even though it is less important to
the workload, so the preferred node will be wrong, which is a much more
serious problem.
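
To make the skew concrete, a toy sketch of such a workload (illustrative
only, not from the thread): the hot private array spans only a few pages and
so can contribute only a few hinting faults per scan, while the occasional
updates land on many distinct pages of the shared array and end up dominating
the per-node fault counts.

	#define PRIVATE_WORDS	1024		/* a couple of pages, touched constantly */
	#define SHARED_WORDS	(256UL << 20)	/* spans many pages, touched rarely */

	static long private_data[PRIVATE_WORDS];
	static long *shared_stats;	/* SHARED_WORDS entries, shared with other tasks */

	static void worker(void)
	{
		unsigned long i, iter = 0;

		for (;;) {
			/* Dominant work: extremely high locality, few pages. */
			for (i = 0; i < PRIVATE_WORDS; i++)
				private_data[i]++;

			/* Occasional statistics update that lands on a
			 * different shared page each time.  Hinting faults
			 * are taken once per page per scan window, so these
			 * sparse touches rack up far more faults than the
			 * hot loop above. */
			shared_stats[(++iter * 65537UL) % SHARED_WORDS]++;
		}
	}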

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 3/8] sched: Select a preferred node with the most numa hinting faults
  2013-06-28  6:14     ` Srikar Dronamraju
@ 2013-06-28 12:33       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 12:33 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 11:44:28AM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:02]:
> 
> > This patch selects a preferred node for a task to run on based on the
> > NUMA hinting faults. This information is later used to migrate tasks
> > towards the node during balancing.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  include/linux/sched.h |  1 +
> >  kernel/sched/core.c   | 10 ++++++++++
> >  kernel/sched/fair.c   | 16 ++++++++++++++--
> >  kernel/sched/sched.h  |  2 +-
> >  4 files changed, 26 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 72861b4..ba46a64 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1507,6 +1507,7 @@ struct task_struct {
> >  	struct callback_head numa_work;
> >  
> >  	unsigned long *numa_faults;
> > +	int numa_preferred_nid;
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  
> >  	struct rcu_head rcu;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f332ec0..019baae 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
> >  	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> >  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> >  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> > +	p->numa_preferred_nid = -1;
> 
> Though we may not want to inherit faults, I think the tasks generally
> share pages with their siblings, parent. So will it make sense to
> inherit the preferred node?
> 

If it really shares data with its parent then that will be detected by the PTE
scanner later as normal. I would expect that initially it would be scheduled
to run on CPUs on the local node, and I would think that inheriting the
preferred node here will not make a detectable difference. If you think it
will, I can do it, but then the data should certainly be cleared on exec.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-27 14:53     ` Peter Zijlstra
@ 2013-06-28 13:00       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Jun 27, 2013 at 04:53:45PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> > This patch favours moving tasks towards the preferred NUMA node when
> > it has just been selected. Ideally this is self-reinforcing as the
> > longer the the task runs on that node, the more faults it should incur
> > causing task_numa_placement to keep the task running on that node. In
> > reality a big weakness is that the nodes CPUs can be overloaded and it
> > would be more effficient to queue tasks on an idle node and migrate to
> > the new node. This would require additional smarts in the balancer so
> > for now the balancer will simply prefer to place the task on the
> > preferred node for a tunable number of PTE scans.
> 
> This changelog fails to mention why you're adding the settle stuff in
> this patch.

Updated the changelog:

This patch favours moving tasks towards the preferred NUMA node when it
has just been selected. Ideally this is self-reinforcing as the longer
the task runs on that node, the more faults it should incur causing
task_numa_placement to keep the task running on that node. In reality
a big weakness is that the node's CPUs can be overloaded and it would be
more efficient to queue tasks on an idle node and migrate to the new node.
This would require additional smarts in the balancer, so for now the balancer
will simply prefer to place the task on the preferred node for a number of
PTE scans, controlled by the numa_balancing_settle_count sysctl. Once the
settle_count number of scans has completed, the scheduler is free to place
the task on an alternative node if the load is imbalanced.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-27 16:01     ` Peter Zijlstra
@ 2013-06-28 13:01       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 13:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Jun 27, 2013 at 06:01:27PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> > @@ -3897,6 +3907,28 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
> >  	return delta < (s64)sysctl_sched_migration_cost;
> >  }
> >  
> > +/* Returns true if the destination node has incurred more faults */
> > +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> > +{
> > +	int src_nid, dst_nid;
> > +
> > +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> > +		return false;
> > +
> > +	src_nid = cpu_to_node(env->src_cpu);
> > +	dst_nid = cpu_to_node(env->dst_cpu);
> > +
> > +	if (src_nid == dst_nid)
> > +		return false;
> > +
> > +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> > +	    p->numa_preferred_nid == dst_nid)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> 
> This references ->numa_faults, which is declared under NUMA_BALANCING
> but lacks any such conditionality here.

Fixed.
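
The fix presumably takes the usual form: guard the helper with
CONFIG_NUMA_BALANCING and provide a stub for builds without it. A sketch
(the follow-up patch itself is not quoted here):

	#ifdef CONFIG_NUMA_BALANCING
	/* Returns true if the destination node has incurred more faults */
	static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
	{
		int src_nid, dst_nid;

		if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
			return false;

		src_nid = cpu_to_node(env->src_cpu);
		dst_nid = cpu_to_node(env->dst_cpu);

		if (src_nid == dst_nid)
			return false;

		if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
		    p->numa_preferred_nid == dst_nid)
			return true;

		return false;
	}
	#else
	static inline bool migrate_improves_locality(struct task_struct *p,
						     struct lb_env *env)
	{
		return false;
	}
	#endif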

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-27 16:11     ` Peter Zijlstra
@ 2013-06-28 13:45       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 13:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Jun 27, 2013 at 06:11:27PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> > +/* Returns true if the destination node has incurred more faults */
> > +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> > +{
> > +	int src_nid, dst_nid;
> > +
> > +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> > +		return false;
> > +
> > +	src_nid = cpu_to_node(env->src_cpu);
> > +	dst_nid = cpu_to_node(env->dst_cpu);
> > +
> > +	if (src_nid == dst_nid)
> > +		return false;
> > +
> > +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> > +	    p->numa_preferred_nid == dst_nid)
> > +		return true;
> > +
> > +	return false;
> > +}
> 
> Also, until I just actually _read_ that function; I assumed it would
> compare p->numa_faults[src_nid] and p->numa_faults[dst_nid]. Because
> even when the dst_nid isn't the preferred nid; it might still have more
> pages than where we currently are.
> 

I tested something like this and also tested it when only taking shared
accesses into account but it performed badly in some cases.  I've included
the last patch I tested below for reference but dropped it until I figured
out why it performed badly. I guessed it was due to increased bouncing
due to shared faults but didn't prove it.

> Idem with the proposed migrate_degrades_locality().
> 
> Something like so I suppose
> 
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3969,6 +3969,7 @@ task_hot(struct task_struct *p, u64 now,
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +#ifdef CONFIG_NUMA_BALANCING
>  /* Returns true if the destination node has incurred more faults */
>  static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
>  {
> @@ -3983,13 +3984,50 @@ static bool migrate_improves_locality(st
>  	if (src_nid == dst_nid)
>  		return false;
>  
> -	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> -	    p->numa_preferred_nid == dst_nid)
> +	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
> +		return false;
> +
> +	if (p->numa_preferred_nid == dst_nid)
> +		return true;
> +
> +	if (p->numa_faults[src_nid] < p->numa_faults[dst_nid])
> +		return true;
> +
> +	return false;
> +}
> +

I tested something like this.

> +static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	int src_nid, dst_nid;
> +
> +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> +		return false;
> +
> +	src_nid = cpu_to_node(env->src_cpu);
> +	dst_nid = cpu_to_node(env->dst_cpu);
> +
> +	if (src_nid == dst_nid)
> +		return false;
> +
> +	if (p->numa_faults[src_nid] > p->numa_faults[dst_nid])
>  		return true;
>  
>  	return false;
>  }

But I had not tried this and it makes sense. I'll test it out and include
it in the next revision if it looks good. Unless you object I'll add
your Signed-off-by because the version of the patch I'm about to test looks
almost identical to this.

>  
> +#else
> +
> +static inline bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	return false;
> +}
> +
> +static inline bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_NUMA_BALANCING */
>  
>  /*
>   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> @@ -4055,8 +4093,10 @@ int can_migrate_task(struct task_struct
>  		return 1;
>  
>  	tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
> +	if (!tsk_cache_hot)
> +		tsk_cache_hot = migrate_degrades_locality(p, env);
>  	if (!tsk_cache_hot ||
> -		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
> +	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
>  
>  		if (tsk_cache_hot) {
>  			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> 

This is the last patch along these lines that I tested.

---8<---
sched: Favour moving tasks towards nodes that incurred more faults

Signed-off-by: Mel Gorman <mgorman@suse.de>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9bbb70..3379ca4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3980,9 +3980,18 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 	if (src_nid == dst_nid)
 		return false;
 
-	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
-	    p->numa_preferred_nid == dst_nid)
-		return true;
+	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count) {
+		if (p->numa_preferred_nid == dst_nid)
+			return true;
+
+		/*
+		 * Move towards node if there were a higher number of shared
+		 * NUMA hinting faults
+		 */
+		if (p->numa_faults[task_faults_idx(dst_nid, 0)] >
+		    p->numa_faults[task_faults_idx(src_nid, 0)])
+			return true;
+	}
 
 	return false;
 }


-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28 10:07         ` Srikar Dronamraju
@ 2013-06-28 13:51           ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 13:51 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 03:37:23PM +0530, Srikar Dronamraju wrote:
> > > > +
> > > > +
> > > >  /*
> > > >   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> > > >   */
> > > > @@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> > > >  
> > > >  	/*
> > > >  	 * Aggressive migration if:
> > > > -	 * 1) task is cache cold, or
> > > > -	 * 2) too many balance attempts have failed.
> > > > +	 * 1) destination numa is preferred
> > > > +	 * 2) task is cache cold, or
> > > > +	 * 3) too many balance attempts have failed.
> > > >  	 */
> > > >  
> > > > +	if (migrate_improves_locality(p, env))
> > > > +		return 1;
> > > 
> > > Shouldnt this be under tsk_cache_hot check?
> > > 
> > > If the task is cache hot, then we would have to update the corresponding  schedstat
> > > metrics.
> > 
> > No; you want migrate_degrades_locality() to be like task_hot(). You want
> > to _always_ migrate tasks towards better locality irrespective of local
> > cache hotness.
> > 
> 
> Yes, I understand that numa should have more priority over cache.
> But the schedstats will not be updated about whether the task was hot or
> cold.
> 
> So lets say the task was cache hot but numa wants it to move, then we
> should certainly move it but we should update the schedstats to mention that we
> moved a cache hot task.
> 
> Something akin to this.
> 
> 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
> 	if (tsk_cache_hot) {
> 		if (migrate_improves_locality(p, env) || 
> 		 	(env->sd->nr_balance_failed > env->sd->cache_nice_tries)) {
> #ifdef CONFIG_SCHEDSTATS
> 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> 			schedstat_inc(p, se.statistics.nr_forced_migrations);
> #endif
> 			return 1;
> 		}
> 		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
> 		return 0;
> 	}
> 	return 1;
> 

Thanks. Is this acceptable?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3848e0..c3a153e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4088,8 +4088,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) too many balance attempts have failed.
 	 */
 
-	if (migrate_improves_locality(p, env))
+	if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+		schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+		schedstat_inc(p, se.statistics.nr_forced_migrations);
+#endif
 		return 1;
+	}
 
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
 	if (!tsk_cache_hot)

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-28 13:54   ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28 13:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:37:59]:

> It's several months overdue and everything was quiet after 3.8 came out
> but I recently had a chance to revisit automatic NUMA balancing for a few
> days. I looked at basic scheduler integration resulting in the following
> small series. Much of the following is heavily based on the numacore series
> which in itself takes part of the autonuma series from back in November. In
> particular it borrows heavily from Peter Ziljstra's work in "sched, numa,
> mm: Add adaptive NUMA affinity support" but deviates too much to preserve
> Signed-off-bys. As before, if the relevant authors are ok with it I'll
> add Signed-off-bys (or add them yourselves if you pick the patches up).


Here is a snapshot of the results of running autonuma-benchmark on an
8 node, 64 cpu system with hyperthreading disabled. I ran 5 iterations for
each setup.

	KernelVersion: 3.9.0-mainline_v39+()
				Testcase:      Min      Max      Avg
				  numa01:  1784.16  1864.15  1800.16
				  numa02:    32.07    32.72    32.59

	KernelVersion: 3.9.0-mainline_v39+() + mel's patches
				Testcase:      Min      Max      Avg  %Change
				  numa01:  1752.48  1859.60  1785.60    0.82%
				  numa02:    47.21    60.58    53.43  -39.00%

So in the numa02 case we see a degradation of around 39%.

Details below
-----------------------------------------------------------------------------------------

numa01
	KernelVersion: 3.9.0-mainline_v39+()
	 Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
		   554,289 cs                                                           [100.00%]
		    26,727 migrations                                                   [100.00%]
		 1,982,054 faults                                                       [100.00%]
		     5,819 migrate:mm_migrate_pages                                    

	    1784.171745972 seconds time elapsed

	numa01 1784.16 352.58 68140.96 141242 4862

	KernelVersion: 3.9.0-mainline_v39+() + mel's patches
	 Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':

		 1,072,118 cs                                                           [100.00%]
		    43,796 migrations                                                   [100.00%]
		 5,226,896 faults                                                       [100.00%]
		     2,815 migrate:mm_migrate_pages                                    

	    1763.961631143 seconds time elapsed

	numa01 1763.95 321.62 78358.88 233740 2712


numa02
	KernelVersion: 3.9.0-mainline_v39+()

	 Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':

		    14,018 cs                                                           [100.00%]
		     1,209 migrations                                                   [100.00%]
		    40,847 faults                                                       [100.00%]
		       629 migrate:mm_migrate_pages                                    

	      32.729238004 seconds time elapsed

	numa02 32.72 51.25 1415.06 6013 111

	KernelVersion: 3.9.0-mainline_v39+() + mel's patches

	 Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':

		    35,891 cs                                                           [100.00%]
		     1,579 migrations                                                   [100.00%]
		   173,443 faults                                                       [100.00%]
		     1,106 migrate:mm_migrate_pages                                    

	      53.970814899 seconds time elapsed

	numa02 53.96 128.90 2301.90 9291 148

Notes:
In the numa01 case, we see a slight benefit plus lower system and user time.
We see more context switches and task migrations but fewer page migrations.

In the numa02 case, we see a larger degradation plus higher system and higher
user time. We see more context switches and more page migrations too.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-06-27 14:54     ` Peter Zijlstra
@ 2013-06-28 13:54       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 13:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Jun 27, 2013 at 04:54:58PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2013 at 03:38:05PM +0100, Mel Gorman wrote:
> > +static int
> > +find_idlest_cpu_node(int this_cpu, int nid)
> > +{
> > +	unsigned long load, min_load = ULONG_MAX;
> > +	int i, idlest_cpu = this_cpu;
> > +
> > +	BUG_ON(cpu_to_node(this_cpu) == nid);
> > +
> > +	for_each_cpu(i, cpumask_of_node(nid)) {
> > +		load = weighted_cpuload(i);
> > +
> > +		if (load < min_load) {
> > +			struct task_struct *p;
> > +
> > +			/* Do not preempt a task running on its preferred node */
> > +			struct rq *rq = cpu_rq(i);
> > +			local_irq_disable();
> > +			raw_spin_lock(&rq->lock);
> 
> raw_spin_lock_irq() ?
> 

/me slaps self

Fixed. Thanks.

> > +			p = rq->curr;
> > +			if (p->numa_preferred_nid != nid) {
> > +				min_load = load;
> > +				idlest_cpu = i;
> > +			}
> > +			raw_spin_unlock(&rq->lock);
> > +			local_irq_enable();
> > +		}
> > +	}
> > +
> > +	return idlest_cpu;
> > +}
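
With raw_spin_lock_irq() the scan over the node's CPUs presumably ends up
looking something like this (a sketch of the fix, not the exact hunk that
was applied):

        for_each_cpu(i, cpumask_of_node(nid)) {
                load = weighted_cpuload(i);

                if (load < min_load) {
                        struct rq *rq = cpu_rq(i);
                        struct task_struct *p;

                        /* Do not preempt a task running on its preferred node */
                        raw_spin_lock_irq(&rq->lock);
                        p = rq->curr;
                        if (p->numa_preferred_nid != nid) {
                                min_load = load;
                                idlest_cpu = i;
                        }
                        raw_spin_unlock_irq(&rq->lock);
                }
        }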

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-27 14:56     ` Peter Zijlstra
@ 2013-06-28 14:00       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Jun 27, 2013 at 04:56:58PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2013 at 03:38:06PM +0100, Mel Gorman wrote:
> > +void task_numa_fault(int last_nid, int node, int pages, bool migrated)
> >  {
> >  	struct task_struct *p = current;
> > +	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
> >  
> >  	if (!sched_feat_numa(NUMA))
> >  		return;
> >  
> >  	/* Allocate buffer to track faults on a per-node basis */
> >  	if (unlikely(!p->numa_faults)) {
> > -		int size = sizeof(*p->numa_faults) * nr_node_ids;
> > +		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
> >  
> >  		/* numa_faults and numa_faults_buffer share the allocation */
> > -		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
> > +		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
> >  		if (!p->numa_faults)
> >  			return;
> 
> So you need a buffer 2x the size in total; but you're now allocating
> a buffer 4x larger than before.
> 
> Isn't doubling size alone sufficient?

/me slaps self

This was a rebase screwup. Thanks.
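
For reference, the corrected allocation should then just be twice the
already-doubled per-node size, i.e. something like this sketch:

        /* Allocate buffer to track faults on a per-node basis */
        if (unlikely(!p->numa_faults)) {
                /* two counters per node: private and shared hinting faults */
                int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;

                /* numa_faults and numa_faults_buffer share the allocation */
                p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
                if (!p->numa_faults)
                        return;
        }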

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-28 10:33           ` Peter Zijlstra
@ 2013-06-28 14:29             ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 14:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Fri, Jun 28, 2013 at 12:33:04PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 28, 2013 at 03:42:45PM +0530, Srikar Dronamraju wrote:
> > > > 
> > > > > Ideally it would be possible to distinguish between NUMA hinting faults
> > > > > that are private to a task and those that are shared. This would require
> > > > > that the last task that accessed a page for a hinting fault would be
> > > > > recorded which would increase the size of struct page. Instead this patch
> > > > > approximates private pages by assuming that faults that pass the two-stage
> > > > > filter are private pages and all others are shared. The preferred NUMA
> > > > > node is then selected based on where the maximum number of approximately
> > > > > private faults were measured.
> > > > 
> > > > Should we consider only private faults for preferred node?
> > > 
> > > I don't think so; its optimal for the task to be nearest most of its pages;
> > > irrespective of whether they be private or shared.
> > 
> > Then the preferred node should have been chosen based on both the
> > private and shared faults and not just private faults.
> 
> Oh duh indeed. I totally missed it did that. Changelog also isn't giving
> rationale for this. Mel?
> 

There were a few reasons

First, if there are many tasks sharing the page then they'll all move towards
the same node. That node will become compute overloaded and the tasks will be
scheduled away later, only to bounce back again. Alternatively the sharing
tasks would just bounce around nodes because the fault information is
effectively noise. Either way I felt that accounting for shared faults
alongside private faults would be slower overall.

The second reason was based on a hypothetical workload that had a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.

The third reason was because multiple threads in a process will race
each other to fault the shared page making the information unreliable.

It is important that *something* be done with shared faults but I haven't
thought of what exactly yet. One possibility would be to give them a
different weight, maybe based on the number of active NUMA nodes, but I had
not tested anything yet. Peter suggested privately that if shared faults
dominate the workload then the shared pages would be migrated based on an
interleave policy, which has some potential.
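
To make that concrete, preferred node selection on top of the split counters
boils down to picking the node with the most approximately-private faults,
roughly like the sketch below (the helper name is invented; index 1 is
assumed to be the private slot, matching the 0 == shared convention in the
earlier patch):

        static void sketch_pick_preferred_node(struct task_struct *p)
        {
                int nid, max_nid = -1;
                unsigned long max_faults = 0;

                for_each_online_node(nid) {
                        /* count only faults that passed the two-stage filter */
                        unsigned long faults =
                                p->numa_faults[task_faults_idx(nid, 1)];

                        if (faults > max_faults) {
                                max_faults = faults;
                                max_nid = nid;
                        }
                }

                if (max_nid != -1 && max_nid != p->numa_preferred_nid)
                        p->numa_preferred_nid = max_nid;
        }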

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28 13:45       ` Mel Gorman
@ 2013-06-28 15:10         ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28 15:10 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Fri, Jun 28, 2013 at 02:45:35PM +0100, Mel Gorman wrote:

> > Also, until I just actually _read_ that function; I assumed it would
> > compare p->numa_faults[src_nid] and p->numa_faults[dst_nid]. Because
> > even when the dst_nid isn't the preferred nid; it might still have more
> > pages than where we currently are.
> > 
> 
> I tested something like this and also tested it when only taking shared
> accesses into account but it performed badly in some cases.  I've included
> the last patch I tested below for reference but dropped it until I figured
> out why it performed badly. I guessed it was due to increased bouncing
> due to shared faults but didn't prove it.

Oh, interesting. Yeah it would be good to figure out why that gave
funnies.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-28 14:29             ` Mel Gorman
@ 2013-06-28 15:12               ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28 15:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Fri, Jun 28, 2013 at 03:29:25PM +0100, Mel Gorman wrote:
> > Oh duh indeed. I totally missed it did that. Changelog also isn't giving
> > rationale for this. Mel?
> > 
> 
> There were a few reasons
> 
> First, if there are many tasks sharing the page then they'll all move towards
> the same node. That node will become compute overloaded and the tasks will be
> scheduled away later, only to bounce back again. Alternatively the sharing
> tasks would just bounce around nodes because the fault information is
> effectively noise. Either way I felt that accounting for shared faults
> alongside private faults would be slower overall.
> 
> The second reason was based on a hypothetical workload that had a small
> number of very important, heavily accessed private pages but a large shared
> array. The shared array would dominate the number of faults and be selected
> as a preferred node even though it's the wrong decision.
> 
> The third reason was because multiple threads in a process will race
> each other to fault the shared page making the information unreliable.
> 
> It is important that *something* be done with shared faults but I haven't
> thought of what exactly yet. One possibility would be to give them a
> different weight, maybe based on the number of active NUMA nodes, but I had
> not tested anything yet. Peter suggested privately that if shared faults
> dominate the workload then the shared pages would be migrated based on an
> interleave policy, which has some potential.
> 

It would be good to put something like this in the Changelog, or even as
a comment near how we select the preferred node.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28 13:51           ` Mel Gorman
@ 2013-06-28 17:14             ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28 17:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

> > Yes, I understand that numa should have more priority over cache.
> > But the schedstats will not be updated about whether the task was hot or
> > cold.
> > 
> > So lets say the task was cache hot but numa wants it to move, then we
> > should certainly move it but we should update the schedstats to mention that we
> > moved a cache hot task.
> > 
> > Something akin to this.
> > 
> > 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
> > 	if (tsk_cache_hot) {
> > 		if (migrate_improves_locality(p, env) || 
> > 		 	(env->sd->nr_balance_failed > env->sd->cache_nice_tries)) {
> > #ifdef CONFIG_SCHEDSTATS
> > 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> > 			schedstat_inc(p, se.statistics.nr_forced_migrations);
> > #endif
> > 			return 1;
> > 		}
> > 		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
> > 		return 0;
> > 	}
> > 	return 1;
> > 
> 
> Thanks. Is this acceptable?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b3848e0..c3a153e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4088,8 +4088,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	 * 3) too many balance attempts have failed.
>  	 */
> 
> -	if (migrate_improves_locality(p, env))
> +	if (migrate_improves_locality(p, env)) {
> +#ifdef CONFIG_SCHEDSTATS
> +		schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> +		schedstat_inc(p, se.statistics.nr_forced_migrations);
> +#endif
>  		return 1;
> +	}
> 

In this case, we account even cache cold threads as _cache hot_ in
schedstats.

We need the task_hot() call to determine if task is cache hot or not.
So the migrate_improves_locality(), I think should be called within the
tsk_cache_hot check.

Do you have issues with the above snippet that I posted earlier?

>  	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
>  	if (!tsk_cache_hot)
> 
> -- 
> Mel Gorman
> SUSE Labs
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28 17:14             ` Srikar Dronamraju
@ 2013-06-28 17:34               ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 17:34 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 10:44:27PM +0530, Srikar Dronamraju wrote:
> > > Yes, I understand that numa should have more priority over cache.
> > > But the schedstats will not be updated about whether the task was hot or
> > > cold.
> > > 
> > > So lets say the task was cache hot but numa wants it to move, then we
> > > should certainly move it but we should update the schedstats to mention that we
> > > moved a cache hot task.
> > > 
> > > Something akin to this.
> > > 
> > > 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
> > > 	if (tsk_cache_hot) {
> > > 		if (migrate_improves_locality(p, env) || 
> > > 		 	(env->sd->nr_balance_failed > env->sd->cache_nice_tries)) {
> > > #ifdef CONFIG_SCHEDSTATS
> > > 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> > > 			schedstat_inc(p, se.statistics.nr_forced_migrations);
> > > #endif
> > > 			return 1;
> > > 		}
> > > 		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
> > > 		return 0;
> > > 	}
> > > 	return 1;
> > > 
> > 
> > Thanks. Is this acceptable?
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b3848e0..c3a153e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4088,8 +4088,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >  	 * 3) too many balance attempts have failed.
> >  	 */
> > 
> > -	if (migrate_improves_locality(p, env))
> > +	if (migrate_improves_locality(p, env)) {
> > +#ifdef CONFIG_SCHEDSTATS
> > +		schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> > +		schedstat_inc(p, se.statistics.nr_forced_migrations);
> > +#endif
> >  		return 1;
> > +	}
> > 
> 
> In this case, we account even cache cold threads as _cache hot_ in
> schedstats.
> 
> We need the task_hot() call to determine if task is cache hot or not.
> So the migrate_improves_locality(), I think should be called within the
> tsk_cache_hot check.
> 
> Do you have issues with the above snippet that I posted earlier?
> 

The migrate_improves_locality call had already happened so it cannot be
true after the tsk_cache_hot check is made so I was confused. If the call is
moved within task cache hot then it changes the intent of the patch because
cache hotness then trumps memory locality which is not intended. Memory
locality is expected to trump cache hotness.

How about this?

        tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);

        if (migrate_improves_locality(p, env)) {
#ifdef CONFIG_SCHEDSTATS
                if (tsk_cache_hot) {
                        schedstat_inc(env->sd, lb_hot_gained[env->idle]);
                        schedstat_inc(p, se.statistics.nr_forced_migrations);
                }
#endif
                return 1;
        }

        if (!tsk_cache_hot ||
                env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
#ifdef CONFIG_SCHEDSTATS
                if (tsk_cache_hot) {
                        schedstat_inc(env->sd, lb_hot_gained[env->idle]);
                        schedstat_inc(p, se.statistics.nr_forced_migrations);
                }
#endif
                return 1;
        }


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28 17:34               ` Mel Gorman
@ 2013-06-28 17:44                 ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28 17:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

> > > 
> > > -	if (migrate_improves_locality(p, env))
> > > +	if (migrate_improves_locality(p, env)) {
> > > +#ifdef CONFIG_SCHEDSTATS
> > > +		schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> > > +		schedstat_inc(p, se.statistics.nr_forced_migrations);
> > > +#endif
> > >  		return 1;
> > > +	}
> > > 
> > 
> > In this case, we account even cache cold threads as _cache hot_ in
> > schedstats.
> > 
> > We need the task_hot() call to determine if task is cache hot or not.
> > So the migrate_improves_locality(), I think should be called within the
> > tsk_cache_hot check.
> > 
> > Do you have issues with the above snippet that I posted earlier?
> > 
> 
> The migrate_improves_locality call had already happened so it cannot be
> true after the tsk_cache_hot check is made so I was confused. If the call is
> moved within task cache hot then it changes the intent of the patch because

Yes, I was suggesting moving it inside.

> cache hotness then trumps memory locality which is not intended. Memory
> locality is expected to trump cache hotness.
> 

But whether memory locality trumps cache hotness or the other way around, the
result would still be the same, just with slightly more concise code.

> How about this?
> 
>         tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
> 
>         if (migrate_improves_locality(p, env)) {
> #ifdef CONFIG_SCHEDSTATS
>                 if (tsk_cache_hot) {
>                         schedstat_inc(env->sd, lb_hot_gained[env->idle]);
>                         schedstat_inc(p, se.statistics.nr_forced_migrations);
>                 }
> #endif
>                 return 1;
>         }
> 
>         if (!tsk_cache_hot ||
>                 env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
> #ifdef CONFIG_SCHEDSTATS
>                 if (tsk_cache_hot) {
>                         schedstat_inc(env->sd, lb_hot_gained[env->idle]);
>                         schedstat_inc(p, se.statistics.nr_forced_migrations);
>                 }
> #endif
>                 return 1;
>         }

Yes, this looks fine to me.
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-06-28 13:54   ` Srikar Dronamraju
@ 2013-07-01  5:39     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-07-01  5:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2013-06-28 19:24:22]:

> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:37:59]:
> 
> > It's several months overdue and everything was quiet after 3.8 came out
> > but I recently had a chance to revisit automatic NUMA balancing for a few
> > days. I looked at basic scheduler integration resulting in the following
> > small series. Much of the following is heavily based on the numacore series
> > which in itself takes part of the autonuma series from back in November. In
> > particular it borrows heavily from Peter Ziljstra's work in "sched, numa,
> > mm: Add adaptive NUMA affinity support" but deviates too much to preserve
> > Signed-off-bys. As before, if the relevant authors are ok with it I'll
> > add Signed-off-bys (or add them yourselves if you pick the patches up).
> 
> 
> Here is a snapshot of the results of running autonuma-benchmark running on 8
> node 64 cpu system with hyper threading disabled. Ran 5 iterations for each
> setup
> 
> 	KernelVersion: 3.9.0-mainline_v39+()
> 				Testcase:      Min      Max      Avg
> 				  numa01:  1784.16  1864.15  1800.16
> 				  numa02:    32.07    32.72    32.59
> 
> 	KernelVersion: 3.9.0-mainline_v39+() + mel's patches
> 				Testcase:      Min      Max      Avg  %Change
> 				  numa01:  1752.48  1859.60  1785.60    0.82%
> 				  numa02:    47.21    60.58    53.43  -39.00%
> 
> So numa02 case; we see a degradation of around 39%.
> 

I reran the tests again 

KernelVersion: 3.9.0-mainline_v39+()
                        Testcase:      Min      Max      Avg
                          numa01:  1784.16  1864.15  1800.16
             numa01_THREAD_ALLOC:   293.75   315.35   311.03
                          numa02:    32.07    32.72    32.59
                      numa02_SMT:    39.27    39.79    39.69

KernelVersion: 3.9.0-mainline_v39+() + your patches
                        Testcase:      Min      Max      Avg  %Change
                          numa01:  1720.40  1876.89  1767.75    1.83%
             numa01_THREAD_ALLOC:   464.34   554.82   496.64  -37.37%
                          numa02:    52.02    58.57    56.21  -42.02%
                      numa02_SMT:    42.07    52.64    47.33  -16.14%
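
For what it is worth, the %Change column here appears to be the vanilla
average expressed relative to the patched average, i.e. (vanilla - patched) /
patched. A minimal user-space sketch (not from any posted tooling) using the
numa02 averages above:

	#include <stdio.h>

	int main(void)
	{
		double vanilla_avg = 32.59;	/* 3.9.0-mainline_v39+ numa02 average */
		double patched_avg = 56.21;	/* with the series applied */

		/* (vanilla - patched) / patched * 100 reproduces the -42.02% above */
		printf("%%Change: %.2f%%\n",
		       (vanilla_avg - patched_avg) / patched_avg * 100.0);
		return 0;
	}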


-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-07-01  5:39     ` Srikar Dronamraju
@ 2013-07-01  8:43       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-07-01  8:43 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Mon, Jul 01, 2013 at 11:09:47AM +0530, Srikar Dronamraju wrote:
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2013-06-28 19:24:22]:
> 
> > * Mel Gorman <mgorman@suse.de> [2013-06-26 15:37:59]:
> > 
> > > It's several months overdue and everything was quiet after 3.8 came out
> > > but I recently had a chance to revisit automatic NUMA balancing for a few
> > > days. I looked at basic scheduler integration resulting in the following
> > > small series. Much of the following is heavily based on the numacore series
> > > which in itself takes part of the autonuma series from back in November. In
> > > particular it borrows heavily from Peter Ziljstra's work in "sched, numa,
> > > mm: Add adaptive NUMA affinity support" but deviates too much to preserve
> > > Signed-off-bys. As before, if the relevant authors are ok with it I'll
> > > add Signed-off-bys (or add them yourselves if you pick the patches up).
> > 
> > 
> > Here is a snapshot of the results of running autonuma-benchmark running on 8
> > node 64 cpu system with hyper threading disabled. Ran 5 iterations for each
> > setup
> > 
> > 	KernelVersion: 3.9.0-mainline_v39+()
> > 				Testcase:      Min      Max      Avg
> > 				  numa01:  1784.16  1864.15  1800.16
> > 				  numa02:    32.07    32.72    32.59
> > 
> > 	KernelVersion: 3.9.0-mainline_v39+() + mel's patches
> > 				Testcase:      Min      Max      Avg  %Change
> > 				  numa01:  1752.48  1859.60  1785.60    0.82%
> > 				  numa02:    47.21    60.58    53.43  -39.00%
> > 
> > So numa02 case; we see a degradation of around 39%.
> > 
> 
> I reran the tests again 
> 
> KernelVersion: 3.9.0-mainline_v39+()
>                         Testcase:      Min      Max      Avg
>                           numa01:  1784.16  1864.15  1800.16
>              numa01_THREAD_ALLOC:   293.75   315.35   311.03
>                           numa02:    32.07    32.72    32.59
>                       numa02_SMT:    39.27    39.79    39.69
> 
> KernelVersion: 3.9.0-mainline_v39+() + your patches
>                         Testcase:      Min      Max      Avg  %Change
>                           numa01:  1720.40  1876.89  1767.75    1.83%
>              numa01_THREAD_ALLOC:   464.34   554.82   496.64  -37.37%
>                           numa02:    52.02    58.57    56.21  -42.02%
>                       numa02_SMT:    42.07    52.64    47.33  -16.14%
> 

Thanks. Each of the two runs had 5 iterations and there is a
difference in the reported average. Do you know what the standard
deviation is of the results?

I'm less concerned about the numa01 results as it is an adverse
workload on machines with more than two sockets but the numa02 results
are certainly of concern. My own testing for numa02 showed little or no
change. Would you mind testing with "Increase NUMA PTE scanning when a
new preferred node is selected" reverted please?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-07-01  8:43       ` Mel Gorman
@ 2013-07-02  5:28         ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-07-02  5:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-07-01 09:43:21]:

> 
> Thanks. Each of the two runs had 5 iterations and there is a
> difference in the reported average. Do you know what the standard
> deviation is of the results?

Yes, the results were from 2 different runs.
I hadn't calculated the standard deviation for those runs.
> 
> I'm less concerned about the numa01 results as it is an adverse
> workload on machines with more than two sockets but the numa02 results
> are certainly of concern. My own testing for numa02 showed little or no
> change. Would you mind testing with "Increase NUMA PTE scanning when a
> new preferred node is selected" reverted please?
> 

Here are the results with the last patch reverted as requested by you.

KernelVersion: 3.9.0-mainline_v39+ your patches - last patch
		Testcase:      Min      Max      Avg  StdDev  %Change
		  numa01:  1704.50  1841.82  1757.55   49.27    2.42%
     numa01_THREAD_ALLOC:   433.25   517.07   464.17   28.15  -32.99%
		  numa02:    55.64    61.75    57.70    2.19  -43.52%
	      numa02_SMT:    44.78    53.45    48.72    2.91  -18.53%



Detailed run output here 

numa01 1704.50 248.67 71999.86 207091 1093
numa01_THREAD_ALLOC 461.62 416.89 23064.79 90283 961
numa02 61.75 93.86 2444.21 10652 6
numa02_SMT 46.79 23.13 977.94 1925 8
numa01 1769.09 262.00 74607.77 226677 1313
numa01_THREAD_ALLOC 433.25 365.12 21994.25 88597 773
numa02 55.64 89.52 2250.01 8848 210
numa02_SMT 49.39 19.81 938.86 1376 33
numa01 1841.82 407.73 78683.69 227428 1834
numa01_THREAD_ALLOC 517.07 465.71 26152.60 111689 978
numa02 55.95 103.26 2223.36 8471 158
numa02_SMT 53.45 19.73 962.08 1349 26
numa01 1760.41 474.74 76094.03 231278 2802
numa01_THREAD_ALLOC 456.80 395.35 23170.23 88049 835
numa02 57.18 87.31 2390.11 10804 3
numa02_SMT 44.78 26.48 944.28 1314 7
numa01 1711.91 421.49 77728.30 224185 2103
numa01_THREAD_ALLOC 452.09 430.88 22271.38 83418 2035
numa02 57.97 126.86 2354.34 8991 135
numa02_SMT 49.19 34.99 914.35 1308 22
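
As a cross-check, the numa02 summary line above can be reproduced from the
five per-iteration elapsed times listed here; the StdDev appears to be a
population (divide-by-N) standard deviation. A minimal user-space sketch
using those five values:

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		/* numa02 elapsed times from the five iterations above */
		double t[] = { 61.75, 55.64, 55.95, 57.18, 57.97 };
		int i, n = sizeof(t) / sizeof(t[0]);
		double sum = 0.0, mean, var = 0.0;

		for (i = 0; i < n; i++)
			sum += t[i];
		mean = sum / n;

		for (i = 0; i < n; i++)
			var += (t[i] - mean) * (t[i] - mean);

		/* prints approximately "avg 57.70 stddev 2.19" */
		printf("avg %.2f stddev %.2f\n", mean, sqrt(var / n));
		return 0;
	}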


> -- 
> Mel Gorman
> SUSE Labs
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-06-28 13:54   ` Srikar Dronamraju
@ 2013-07-02  7:46     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-02  7:46 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 07:24:22PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:37:59]:
> 
> > It's several months overdue and everything was quiet after 3.8 came out
> > but I recently had a chance to revisit automatic NUMA balancing for a few
> > days. I looked at basic scheduler integration resulting in the following
> > small series. Much of the following is heavily based on the numacore series
> > which in itself takes part of the autonuma series from back in November. In
> > particular it borrows heavily from Peter Ziljstra's work in "sched, numa,
> > mm: Add adaptive NUMA affinity support" but deviates too much to preserve
> > Signed-off-bys. As before, if the relevant authors are ok with it I'll
> > add Signed-off-bys (or add them yourselves if you pick the patches up).
> 
> 
> Here is a snapshot of the results of running autonuma-benchmark running on 8
> node 64 cpu system with hyper threading disabled. Ran 5 iterations for each
> setup
> 
> 	KernelVersion: 3.9.0-mainline_v39+()
> 				Testcase:      Min      Max      Avg
> 				  numa01:  1784.16  1864.15  1800.16
> 				  numa02:    32.07    32.72    32.59
> 
> 	KernelVersion: 3.9.0-mainline_v39+() + mel's patches
> 				Testcase:      Min      Max      Avg  %Change
> 				  numa01:  1752.48  1859.60  1785.60    0.82%
> 				  numa02:    47.21    60.58    53.43  -39.00%

I had to go look at these benchmarks again: numa02 is the one that's purely
private and thus should run well with this patch set, while numa01 is the
purely shared one and should fare less well for now.


So on the biggest system I've got; 4 nodes 32 cpus:

 Performance counter stats for './numa02' (5 runs):

3.10.0+ - NO_NUMA		57.973118199 seconds time elapsed    ( +-  0.71% )
3.10.0+ -    NUMA		17.619811716 seconds time elapsed    ( +-  0.32% )

3.10.0+ + patches - NO_NUMA	58.235353126 seconds time elapsed    ( +-  0.45% )
3.10.0+ + patches -    NUMA     17.580963359 seconds time elapsed    ( +-  0.09% )


Which is a small to no improvement. We'd have to look at what makes the
8-node machine go funny, but I don't think it's realistic to hold off on the
patches for that system.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-07-02  7:46     ` Peter Zijlstra
@ 2013-07-02  8:55       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-02  8:55 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Tue, Jul 02, 2013 at 09:46:59AM +0200, Peter Zijlstra wrote:
> So on the biggest system I've got; 4 nodes 32 cpus:
> 
>  Performance counter stats for './numa02' (5 runs):
> 
> 3.10.0+ + patches - NO_NUMA	58.235353126 seconds time elapsed    ( +-  0.45% )
> 3.10.0+ + patches -    NUMA   17.580963359 seconds time elapsed    ( +-  0.09% )

I just 'noticed' that I included my migrate_degrades_locality patch -- the one
posted somewhere in this thread (+ compile fixes).

Let me re-run without that one to see if there's any difference.

NO_NUMA		57.961384751 seconds time elapsed    ( +-  0.64% )
   NUMA		17.482115801 seconds time elapsed    ( +-  0.15% )

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-06-26 14:38   ` Mel Gorman
@ 2013-07-02 12:06     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-07-02 12:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

> A preferred node is selected based on the node the most NUMA hinting
> faults was incurred on. There is no guarantee that the task is running
> on that node at the time so this patch reschedules the task to run on
> the most idle CPU of the selected node when selected. This avoids
> waiting for the balancer to make a decision.
> 

Should we be making this decision just on the numa hinting faults alone?

How are we making sure that the preferred node selection is persistent?
i.e. due to memory access patterns, what if the preferred node
selection keeps moving.

If a large process having several threads were to allocate memory in one
node, then all threads will try to mark that node as their preferred
node. Till they get a chance those tasks will move pages over to the
local node. But if they get a chance to move to their preferred node
before moving enough number of pages, then it would have to fetch back
all the pages.

Can we look at accumulating process weights and using the process
weights to consolidate tasks to one node?

> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  kernel/sched/core.c  | 18 +++++++++++++++--
>  kernel/sched/fair.c  | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |  2 +-
>  3 files changed, 70 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ba9470e..b4722d6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5717,11 +5717,25 @@ struct sched_domain_topology_level;
>  
>  #ifdef CONFIG_NUMA_BALANCING
>  
> -/* Set a tasks preferred NUMA node */
> -void sched_setnuma(struct task_struct *p, int nid)
> +/* Set a tasks preferred NUMA node and reschedule to it */
> +void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
>  {
> +	int curr_cpu = task_cpu(p);
> +	struct migration_arg arg = { p, idlest_cpu };
> +
>  	p->numa_preferred_nid = nid;
>  	p->numa_migrate_seq = 0;
> +
> +	/* Do not reschedule if already running on the target CPU */
> +	if (idlest_cpu == curr_cpu)
> +		return;
> +
> +	/* Ensure the target CPU is eligible */
> +	if (!cpumask_test_cpu(idlest_cpu, tsk_cpus_allowed(p)))
> +		return;
> +
> +	/* Move current running task to idlest CPU on preferred node */
> +	stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);

Here, moving tasks this way doesn't update the schedstats at all.
So task migrations from perf stat and schedstats don't match.
I know migration_cpu_stop was used this way before, but we are making
schedstats more unreliable. Also I don't think migration_cpu_stop was
used all that much. But now it gets used pretty persistently.
Probably we need to make migration_cpu_stop schedstats aware.

migration_cpu_stop has the other drawback that it doesn't check for
CPU throttling. So we might move a task from the present cpu to a
different cpu and the task might end up being throttled instead of being
run.

>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5e7f728..99951a8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -800,6 +800,39 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
>   */
>  unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
>  
> +static unsigned long weighted_cpuload(const int cpu);
> +
> +static int
> +find_idlest_cpu_node(int this_cpu, int nid)
> +{
> +	unsigned long load, min_load = ULONG_MAX;
> +	int i, idlest_cpu = this_cpu;
> +
> +	BUG_ON(cpu_to_node(this_cpu) == nid);
> +
> +	for_each_cpu(i, cpumask_of_node(nid)) {
> +		load = weighted_cpuload(i);
> +
> +		if (load < min_load) {
> +			struct task_struct *p;
> +
> +			/* Do not preempt a task running on its preferred node */
> +			struct rq *rq = cpu_rq(i);
> +			local_irq_disable();
> +			raw_spin_lock(&rq->lock);
> +			p = rq->curr;
> +			if (p->numa_preferred_nid != nid) {
> +				min_load = load;
> +				idlest_cpu = i;
> +			}
> +			raw_spin_unlock(&rq->lock);
> +			local_irq_disable();
> +		}
> +	}
> +
> +	return idlest_cpu;

Here we are not checking whether the preferred node is already loaded. If the
preferred node is already more loaded than the current local node (either
because of task pinning or cpuset configurations), pushing the task to that
node might only end up with the task being pulled back in the next
balancing cycle.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-02 12:06     ` Srikar Dronamraju
@ 2013-07-02 16:29       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-07-02 16:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Tue, Jul 02, 2013 at 05:36:55PM +0530, Srikar Dronamraju wrote:
> > A preferred node is selected based on the node the most NUMA hinting
> > faults was incurred on. There is no guarantee that the task is running
> > on that node at the time so this patch reschedules the task to run on
> > the most idle CPU of the selected node when selected. This avoids
> > waiting for the balancer to make a decision.
> > 
> 
> Should we be making this decision just on the numa hinting faults alone?
> 

No, we should not. More is required, which will expand the scope of this
series. If a task is not running on its preferred node then why? Probably
because that node was compute overloaded and the scheduler moved the task
off. Now this is trying to push it back on. Instead we should account for
how many "preferred placed" tasks are running on that node and, if it's
more than the number of CPUs, select the second most preferred node (or one
further down the list) instead. Alternatively, on the preferred node, find
the task with the fewest faults for that node and swap nodes with it.
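
Not code from this series, but a rough sketch of the accounting idea above
(the helper name and where it would be called from are made up here):

	/*
	 * Hypothetical sketch: count how many currently running tasks already
	 * treat @nid as their preferred node. Task placement could compare
	 * this against the number of CPUs on the node before pushing yet
	 * another task there. Locking, RCU and hotplug details are ignored.
	 */
	static int nr_preferred_running(int nid)
	{
		int cpu, nr = 0;

		for_each_cpu(cpu, cpumask_of_node(nid)) {
			struct task_struct *curr = cpu_rq(cpu)->curr;

			if (curr->numa_preferred_nid == nid)
				nr++;
		}

		return nr;
	}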

> How are we making sure that the preferred node selection is persistent?

We aren't. That's why, with the full series applied, we only stick to a node
for a number of PTE scans, using this check:

        if (*src_nid == *dst_nid ||
            p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
                return false;


> i.e. due to memory access patterns, what if the preferred node
> selection keeps moving.
> 

If the preferred node keeps moving we are certainly in trouble currently.

> If a large process having several threads were to allocate memory in one
> node, then all threads will try to mark that node as their preferred
> node. Till they get a chance those tasks will move pages over to the
> local node. But if they get a chance to move to their preferred node
> before moving enough pages, then they would have to fetch back
> all the pages.
> 
> Can we look at accumulating process weights and using the process
> weights to consolidate tasks to one node?
> 

Yes, that is ultimately required. Peter's original numacore series did
something like this, but I had not dissected which parts of it actually
matter in this round.

> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  kernel/sched/core.c  | 18 +++++++++++++++--
> >  kernel/sched/fair.c  | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  kernel/sched/sched.h |  2 +-
> >  3 files changed, 70 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index ba9470e..b4722d6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5717,11 +5717,25 @@ struct sched_domain_topology_level;
> >  
> >  #ifdef CONFIG_NUMA_BALANCING
> >  
> > -/* Set a tasks preferred NUMA node */
> > -void sched_setnuma(struct task_struct *p, int nid)
> > +/* Set a tasks preferred NUMA node and reschedule to it */
> > +void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
> >  {
> > +	int curr_cpu = task_cpu(p);
> > +	struct migration_arg arg = { p, idlest_cpu };
> > +
> >  	p->numa_preferred_nid = nid;
> >  	p->numa_migrate_seq = 0;
> > +
> > +	/* Do not reschedule if already running on the target CPU */
> > +	if (idlest_cpu == curr_cpu)
> > +		return;
> > +
> > +	/* Ensure the target CPU is eligible */
> > +	if (!cpumask_test_cpu(idlest_cpu, tsk_cpus_allowed(p)))
> > +		return;
> > +
> > +	/* Move current running task to idlest CPU on preferred node */
> > +	stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
> 
> Here, moving tasks this way doesn't update the schedstats at all.

I did not update the stats because the existing users did not either. Then
again, they are doing things like exec or updating the allowed mask, so
it's not "interesting" as such.

> So task migrations from perf stat and schedstats don't match.
> I know migration_cpu_stop was used this way before, but we are making
> schedstats more unreliable. Also I don't think migration_cpu_stop was
> used all that much. But now it gets used pretty persistently.

I know, this is going to be a concern, particularly when task swapping is
added to the mix. However, I'm not seeing a better way around it right now
other than waiting for the load balancer to kick in, which is far from optimal.

> Probably we need to make migration_cpu_stop schedstats aware.
> 

Due to a lack of deep familiarity with the scheduler, it's not obvious
what the appropriate stats are. Do you mean duplicating something like
what set_task_cpu does within migration_cpu_stop?

> migration_cpu_stop has the other drawback that it doesn't check for
> CPU throttling. So we might move a task from the present cpu to a
> different cpu and the task might end up being throttled instead of being
> run.
> 

find_idlest_cpu_node at least reduces the risk of this but sure, if even
the most idle CPU on the target node is overloaded then it's still a problem.

> >  }
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 5e7f728..99951a8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -800,6 +800,39 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
> >   */
> >  unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
> >  
> > +static unsigned long weighted_cpuload(const int cpu);
> > +
> > +static int
> > +find_idlest_cpu_node(int this_cpu, int nid)
> > +{
> > +	unsigned long load, min_load = ULONG_MAX;
> > +	int i, idlest_cpu = this_cpu;
> > +
> > +	BUG_ON(cpu_to_node(this_cpu) == nid);
> > +
> > +	for_each_cpu(i, cpumask_of_node(nid)) {
> > +		load = weighted_cpuload(i);
> > +
> > +		if (load < min_load) {
> > +			struct task_struct *p;
> > +
> > +			/* Do not preempt a task running on its preferred node */
> > +			struct rq *rq = cpu_rq(i);
> > +			local_irq_disable();
> > +			raw_spin_lock(&rq->lock);
> > +			p = rq->curr;
> > +			if (p->numa_preferred_nid != nid) {
> > +				min_load = load;
> > +				idlest_cpu = i;
> > +			}
> > +			raw_spin_unlock(&rq->lock);
> > +			local_irq_enable();
> > +		}
> > +	}
> > +
> > +	return idlest_cpu;
> 
> Here we are not checking whether the preferred node is already loaded.

Correct. Long term we would need to check load based on the number of
"preferred node" tasks running on it and also on what the absolute load is.
I had not planned on dealing with it in this cycle as this number of patches
is already quite a mouthful, but I'm aware the problem needs to be addressed.

> If the
> preferred node is already more loaded than the current local node (either
> because of task pinning or cpuset configurations), pushing the task to that
> node might only end up with the task being pulled back in the next
> balancing cycle.
> 

Yes, this is true.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-06-26 14:38   ` Mel Gorman
@ 2013-07-02 18:15     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-02 18:15 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML



Something like this should avoid tasks being lumped back onto one node.

Compile-tested only; need food.

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1037,6 +1037,23 @@ struct migration_arg {
 
 static int migration_cpu_stop(void *data);
 
+#ifdef CONFIG_NUMA_BALANCING
+int migrate_curr_to(int cpu)
+{
+	struct task_struct *p = current;
+	struct migration_arg arg = { p, cpu };
+	int curr_cpu = task_cpu(p);
+
+	if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		return -EINVAL;
+
+	if (curr_cpu == cpu)
+		return 0;
+
+	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -5183,30 +5200,6 @@ enum s_alloc {
 
 struct sched_domain_topology_level;
 
-#ifdef CONFIG_NUMA_BALANCING
-
-/* Set a tasks preferred NUMA node and reschedule to it */
-void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
-{
-	int curr_cpu = task_cpu(p);
-	struct migration_arg arg = { p, idlest_cpu };
-
-	p->numa_preferred_nid = nid;
-	p->numa_migrate_seq = 0;
-
-	/* Do not reschedule if already running on the target CPU */
-	if (idlest_cpu == curr_cpu)
-		return;
-
-	/* Ensure the target CPU is eligible */
-	if (!cpumask_test_cpu(idlest_cpu, tsk_cpus_allowed(p)))
-		return;
-
-	/* Move current running task to idlest CPU on preferred node */
-	stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -838,34 +838,76 @@ unsigned int sysctl_numa_balancing_scan_
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
-static unsigned long weighted_cpuload(const int cpu);
 
-static int find_idlest_cpu_node(int this_cpu, int nid)
-{
-	unsigned long load, min_load = ULONG_MAX;
-	int i, idlest_cpu = this_cpu;
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
+
+static int task_numa_find_cpu(struct task_struct *p)
+{
+	int nid = p->numa_preferred_nid;
+	int node_cpu = cpumask_first(cpumask_of_node(nid));
+	int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
+	unsigned long src_load, dst_load, min_load = ULONG_MAX;
+	struct task_group *tg = task_group(p);
+	s64 src_eff_load, dst_eff_load;
+	struct sched_domain *sd;
+	unsigned long weight;
+	bool balanced;
+	int idx;
 
-	BUG_ON(cpu_to_node(this_cpu) == nid);
+	rcu_read_lock();
+	for_each_domain(src_cpu, sd) {
+		if (cpumask_test_cpu(src_cpu,  sched_domain_span(sd)) &&
+		    cpumask_test_cpu(node_cpu, sched_domain_span(sd)))
+			break;
+	}
 
-	for_each_cpu(i, cpumask_of_node(nid)) {
-		load = weighted_cpuload(i);
+	if (WARN_ON_ONCE(!sd)) {
+		rcu_read_unlock();
+		return dst_cpu;
+	}
 
-		if (load < min_load) {
-			struct task_struct *p;
-			struct rq *rq = cpu_rq(i);
+	/*
+	 * XXX the below is mostly nicked from wake_affine(); we should
+	 * see about sharing a bit if at all possible; also it might want
+	 * some per entity weight love.
+	 */
 
-			/* Do not preempt a task running on its preferred node */
-			raw_spin_lock_irq(&rq->lock);
-			p = rq->curr;
-			if (p->numa_preferred_nid != nid) {
-				min_load = load;
-				idlest_cpu = i;
-			}
-			raw_spin_unlock_irq(&rq->lock);
+	idx = sd->busy_idx; /* XXX do we want another idx? */
+	weight = p->se.load.weight;
+
+	src_load = source_load(src_cpu, idx);
+
+	src_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+	src_eff_load *= power_of(src_cpu);
+	src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
+
+	rcu_read_unlock();
+
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		dst_load = target_load(cpu, idx);
+
+		dst_eff_load = 100;
+		dst_eff_load *= power_of(cpu);
+		dst_eff_load *= dst_load + effective_load(tg, cpu, weight, weight);
+
+		balanced = (dst_eff_load <= src_eff_load);
+
+		/*
+		 * If the dst cpu wasn't idle; don't allow imbalances
+		 */
+		if (dst_load && !balanced)
+			continue;
+
+		if (dst_load < min_load) {
+			min_load = dst_load;
+			dst_cpu = cpu;
 		}
 	}
 
-	return idlest_cpu;
+	return dst_cpu;
 }
 
 static inline int task_faults_idx(int nid, int priv)
@@ -915,29 +957,31 @@ static void task_numa_placement(struct t
 	 * the working set placement.
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
-		int preferred_cpu;
 		int old_migrate_seq = p->numa_migrate_seq;
 
-		/*
-		 * If the task is not on the preferred node then find the most
-		 * idle CPU to migrate to.
-		 */
-		preferred_cpu = task_cpu(p);
-		if (cpu_to_node(preferred_cpu) != max_nid)
-			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
-							     max_nid);
-
-		sched_setnuma(p, max_nid, preferred_cpu);
+		p->numa_preferred_nid = max_nid;
+		p->numa_migrate_seq = 0;
 
 		/*
 		 * If preferred nodes changes frequently then the scan rate
 		 * will be continually high. Mitigate this by increaseing the
 		 * scan rate only if the task was settled.
 		 */
-		if (old_migrate_seq >= sysctl_numa_balancing_settle_count)
-			p->numa_scan_period = max(p->numa_scan_period >> 1,
-					sysctl_numa_balancing_scan_period_min);
+		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
+			p->numa_scan_period =
+				max(p->numa_scan_period >> 1,
+				    sysctl_numa_balancing_scan_period_min);
+		}
 	}
+
+	if (p->numa_preferred_nid == numa_node_id())
+		return;
+
+	/*
+	 * If the task is not on the preferred node then find the most
+	 * idle CPU to migrate to.
+	 */
+	migrate_curr_to(task_numa_find_cpu(p));
 }
 
 /*
@@ -956,7 +1000,7 @@ void task_numa_fault(int last_nid, int n
 		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
-		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
 
@@ -968,9 +1012,10 @@ void task_numa_fault(int last_nid, int n
 	 * If pages are properly placed (did not migrate) then scan slower.
 	 * This is reset periodically in case of phase changes
 	 */
-        if (!migrated)
+        if (!migrated) {
 		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
 			p->numa_scan_period + jiffies_to_msecs(10));
+	}
 
 	task_numa_placement(p);
 
@@ -3245,8 +3290,7 @@ static long effective_load(struct task_g
 }
 #else
 
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
-		unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	return wl;
 }
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -555,7 +555,7 @@ static inline u64 rq_clock_task(struct r
 }
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu);
+extern int migrate_curr_to(int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);




^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-02 12:06     ` Srikar Dronamraju
@ 2013-07-02 18:17       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-02 18:17 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Tue, Jul 02, 2013 at 05:36:55PM +0530, Srikar Dronamraju wrote:
> Here, moving tasks this way doesn't update the schedstats at all.

Do you actually use schedstats? 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-02 18:15     ` Peter Zijlstra
@ 2013-07-03  9:50       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-03  9:50 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Tue, Jul 02, 2013 at 08:15:22PM +0200, Peter Zijlstra wrote:
> 
> 
> Something like this should avoid tasks being lumped back onto one node..
> 
> Compile tested only, need food.

OK, this one actually ran on my system and showed no negative effects on
numa02 -- then again, I didn't have the problem to begin with :/

Srikar, could you see what your 8-node does with this?

I'll go dig around to see where I left my SpecJBB.

---
Subject: sched, numa: Rework direct migration code to take load levels into account

Srikar mentioned he saw the direct migration code bounce all tasks
back to the first node only to be spread out by the regular balancer.

Rewrite the direct migration code to take load balance into account
such that we will not migrate to a cpu if the result is in direct
conflict with the load balance goals.

I removed the clause where we would not migrate towards a cpu that is
already running a task on the right node. If the balance allows it, it's
perfectly fine to run two tasks per cpu -- think overloaded scenarios.

There are a few XXXs in there that want consideration, but the code
compiles and runs.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/core.c  |   41 +++++++-----------
 kernel/sched/fair.c  |  115 ++++++++++++++++++++++++++++++++++-----------------
 kernel/sched/sched.h |    2 
 3 files changed, 95 insertions(+), 63 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1037,6 +1037,23 @@ struct migration_arg {
 
 static int migration_cpu_stop(void *data);
 
+#ifdef CONFIG_NUMA_BALANCING
+int migrate_curr_to(int cpu)
+{
+	struct task_struct *p = current;
+	struct migration_arg arg = { p, cpu };
+	int curr_cpu = task_cpu(p);
+
+	if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		return -EINVAL;
+
+	if (curr_cpu == cpu)
+		return 0;
+
+	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -5188,30 +5205,6 @@ enum s_alloc {
 
 struct sched_domain_topology_level;
 
-#ifdef CONFIG_NUMA_BALANCING
-
-/* Set a tasks preferred NUMA node and reschedule to it */
-void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
-{
-	int curr_cpu = task_cpu(p);
-	struct migration_arg arg = { p, idlest_cpu };
-
-	p->numa_preferred_nid = nid;
-	p->numa_migrate_seq = 0;
-
-	/* Do not reschedule if already running on the target CPU */
-	if (idlest_cpu == curr_cpu)
-		return;
-
-	/* Ensure the target CPU is eligible */
-	if (!cpumask_test_cpu(idlest_cpu, tsk_cpus_allowed(p)))
-		return;
-
-	/* Move current running task to idlest CPU on preferred node */
-	stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -838,34 +838,71 @@ unsigned int sysctl_numa_balancing_scan_
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
-static unsigned long weighted_cpuload(const int cpu);
 
-static int find_idlest_cpu_node(int this_cpu, int nid)
-{
-	unsigned long load, min_load = ULONG_MAX;
-	int i, idlest_cpu = this_cpu;
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
+
+static int task_numa_find_cpu(struct task_struct *p)
+{
+	int nid = p->numa_preferred_nid;
+	int node_cpu = cpumask_first(cpumask_of_node(nid));
+	int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
+	unsigned long src_load, dst_load, min_load = ULONG_MAX;
+	struct task_group *tg = task_group(p);
+	s64 src_eff_load, dst_eff_load;
+	struct sched_domain *sd;
+	unsigned long weight;
+	bool balanced;
+	int idx = 0, imbalance_pct = 125;
 
-	BUG_ON(cpu_to_node(this_cpu) == nid);
+	rcu_read_lock();
+	for_each_domain(src_cpu, sd) {
+		if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
+			idx = sd->busy_idx; /* XXX another idx? */
+			imbalance_pct = sd->imbalance_pct;
+			break;
+		}
+	}
+	rcu_read_unlock();
 
-	for_each_cpu(i, cpumask_of_node(nid)) {
-		load = weighted_cpuload(i);
+	/*
+	 * XXX the below is mostly nicked from wake_affine(); we should
+	 * see about sharing a bit if at all possible; also it might want
+	 * some per entity weight love.
+	 */
 
-		if (load < min_load) {
-			struct task_struct *p;
-			struct rq *rq = cpu_rq(i);
+	weight = p->se.load.weight;
 
-			/* Do not preempt a task running on its preferred node */
-			raw_spin_lock_irq(&rq->lock);
-			p = rq->curr;
-			if (p->numa_preferred_nid != nid) {
-				min_load = load;
-				idlest_cpu = i;
-			}
-			raw_spin_unlock_irq(&rq->lock);
+	src_load = source_load(src_cpu, idx);
+
+	src_eff_load = 100 + (imbalance_pct - 100) / 2;
+	src_eff_load *= power_of(src_cpu);
+	src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
+
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		dst_load = target_load(cpu, idx);
+
+		dst_eff_load = 100;
+		dst_eff_load *= power_of(cpu);
+		dst_eff_load *= dst_load + effective_load(tg, cpu, weight, weight);
+
+		balanced = (dst_eff_load <= src_eff_load);
+
+		/*
+		 * If the dst cpu wasn't idle; don't allow imbalances
+		 */
+		if (dst_load && !balanced)
+			continue;
+
+		if (dst_load < min_load) {
+			min_load = dst_load;
+			dst_cpu = cpu;
 		}
 	}
 
-	return idlest_cpu;
+	return dst_cpu;
 }
 
 static inline int task_faults_idx(int nid, int priv)
@@ -915,29 +952,31 @@ static void task_numa_placement(struct t
 	 * the working set placement.
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
-		int preferred_cpu;
 		int old_migrate_seq = p->numa_migrate_seq;
 
-		/*
-		 * If the task is not on the preferred node then find the most
-		 * idle CPU to migrate to.
-		 */
-		preferred_cpu = task_cpu(p);
-		if (cpu_to_node(preferred_cpu) != max_nid)
-			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
-							     max_nid);
-
-		sched_setnuma(p, max_nid, preferred_cpu);
+		p->numa_preferred_nid = max_nid;
+		p->numa_migrate_seq = 0;
 
 		/*
 		 * If preferred nodes changes frequently then the scan rate
 		 * will be continually high. Mitigate this by increaseing the
 		 * scan rate only if the task was settled.
 		 */
-		if (old_migrate_seq >= sysctl_numa_balancing_settle_count)
-			p->numa_scan_period = max(p->numa_scan_period >> 1,
-					sysctl_numa_balancing_scan_period_min);
+		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
+			p->numa_scan_period =
+				max(p->numa_scan_period >> 1,
+				    sysctl_numa_balancing_scan_period_min);
+		}
 	}
+
+	if (p->numa_preferred_nid == numa_node_id())
+		return;
+
+	/*
+	 * If the task is not on the preferred node then find the most
+	 * idle CPU to migrate to.
+	 */
+	migrate_curr_to(task_numa_find_cpu(p));
 }
 
 /*
@@ -956,7 +995,7 @@ void task_numa_fault(int last_nid, int n
 		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
-		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
 
@@ -968,9 +1007,10 @@ void task_numa_fault(int last_nid, int n
 	 * If pages are properly placed (did not migrate) then scan slower.
 	 * This is reset periodically in case of phase changes
 	 */
-        if (!migrated)
+        if (!migrated) {
 		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
 			p->numa_scan_period + jiffies_to_msecs(10));
+	}
 
 	task_numa_placement(p);
 
@@ -3263,8 +3303,7 @@ static long effective_load(struct task_g
 }
 #else
 
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
-		unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	return wl;
 }
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -555,7 +555,7 @@ static inline u64 rq_clock_task(struct r
 }
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu);
+extern int migrate_curr_to(int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-03  9:50       ` Peter Zijlstra
@ 2013-07-03 15:28         ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-07-03 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 11:50:59AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 02, 2013 at 08:15:22PM +0200, Peter Zijlstra wrote:
> > 
> > 
> > Something like this should avoid tasks being lumped back onto one node..
> > 
> > Compile tested only, need food.
> 
> OK, this one actually ran on my system and showed no negative effects on
> numa02 -- then again, I didn't have the problem to begin with :/
> 
> Srikar, could you see what your 8-node does with this?
> 
> I'll go dig around to see where I left my SpecJBB.
> 

I reshuffled the v2 series a bit to match your implied preference for
layout and rebased this on top of the end result. I may not have the
beans to absorb it before I quit for the evening, but I'll at least
queue it up overnight.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-03 15:28         ` Mel Gorman
@ 2013-07-03 18:46           ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-03 18:46 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 04:28:21PM +0100, Mel Gorman wrote:

> I reshuffled the v2 series a bit to match your implied preference for layout
> and rebased this on top of the end result. May not have the beans to
> absorb it before I quit for the evening but I'll at least queue it up
> overnight.

It probably caused that snafu that got you all tangled up with your v3 series
:-) Just my luck.

I couldn't find much difference on my SpecJBB runs -- in fact so little that
I'm beginning to think I'm doing something really wrong :/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-02 18:17       ` Peter Zijlstra
@ 2013-07-06  6:44         ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-07-06  6:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Peter Zijlstra <peterz@infradead.org> [2013-07-02 20:17:32]:

> On Tue, Jul 02, 2013 at 05:36:55PM +0530, Srikar Dronamraju wrote:
> > Here, moving tasks this way doesn't update the schedstats at all.
> 
> Do you actually use schedstats? 
> 

Yes, I do use schedstats. Are there any plans to obsolete it?

It gave me good information about how many times we attempted load
balancing and how often the load balancing succeeded, especially across
domains.
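
As an aside, that kind of summary can be pulled out of /proc/schedstat
from userspace with something like the untested sketch below. The field
layout (eight load_balance() counters per idle type, three idle types,
immediately after the domain cpumask) is assumed from
Documentation/scheduler/sched-stats.txt and may differ between
schedstat versions:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[4096];
	FILE *f = fopen("/proc/schedstat", "r");

	if (!f) {
		perror("/proc/schedstat");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		char name[32], mask[256];
		unsigned long long v[24];
		unsigned long long calls, balanced;
		int n;

		/* Only the per-domain lines carry load_balance() stats */
		if (strncmp(line, "domain", 6))
			continue;

		n = sscanf(line, "%31s %255s "
			   "%llu %llu %llu %llu %llu %llu %llu %llu "
			   "%llu %llu %llu %llu %llu %llu %llu %llu "
			   "%llu %llu %llu %llu %llu %llu %llu %llu",
			   name, mask,
			   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5],
			   &v[6], &v[7], &v[8], &v[9], &v[10], &v[11],
			   &v[12], &v[13], &v[14], &v[15], &v[16], &v[17],
			   &v[18], &v[19], &v[20], &v[21], &v[22], &v[23]);
		if (n < 26)
			continue;

		/*
		 * lb_count is the 1st field of each idle-type group,
		 * lb_balanced the 2nd.
		 */
		calls = v[0] + v[8] + v[16];
		balanced = v[1] + v[9] + v[17];

		printf("%s %s: load_balance() calls=%llu no-op=%llu\n",
		       name, mask, calls, balanced);
	}

	fclose(f);
	return 0;
}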

-- 
Thanks and Regards
Srikar Dronamraju



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-06  6:44         ` Srikar Dronamraju
@ 2013-07-06 10:47           ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-06 10:47 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Sat, Jul 06, 2013 at 12:14:08PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2013-07-02 20:17:32]:
> 
> > On Tue, Jul 02, 2013 at 05:36:55PM +0530, Srikar Dronamraju wrote:
> > > Here, moving tasks this way doesn't update the schedstats at all.
> > 
> > Do you actually use schedstats? 
> > 
> 
> Yes, I do use schedstats. Are there any plans to obsolete it?

Not really, it's just something I've never used, and keeping the stats
correct made Mel's patch uglier, which makes me dislike them more ;-)

^ permalink raw reply	[flat|nested] 124+ messages in thread

end of thread, other threads:[~2013-07-06 10:48 UTC | newest]

Thread overview: 124+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-26 14:37 [PATCH 0/6] Basic scheduler support for automatic NUMA balancing Mel Gorman
2013-06-26 14:37 ` Mel Gorman
2013-06-26 14:38 ` [PATCH 1/8] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-26 14:38 ` [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-27 15:57   ` Peter Zijlstra
2013-06-27 15:57     ` Peter Zijlstra
2013-06-28 12:22     ` Mel Gorman
2013-06-28 12:22       ` Mel Gorman
2013-06-28  6:08   ` Srikar Dronamraju
2013-06-28  6:08     ` Srikar Dronamraju
2013-06-28  8:56     ` Peter Zijlstra
2013-06-28  8:56       ` Peter Zijlstra
2013-06-28 12:30     ` Mel Gorman
2013-06-28 12:30       ` Mel Gorman
2013-06-26 14:38 ` [PATCH 3/8] sched: Select a preferred node with the most numa hinting faults Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-28  6:14   ` Srikar Dronamraju
2013-06-28  6:14     ` Srikar Dronamraju
2013-06-28  8:59     ` Peter Zijlstra
2013-06-28  8:59       ` Peter Zijlstra
2013-06-28 10:24       ` Srikar Dronamraju
2013-06-28 10:24         ` Srikar Dronamraju
2013-06-28 12:33     ` Mel Gorman
2013-06-28 12:33       ` Mel Gorman
2013-06-26 14:38 ` [PATCH 4/8] sched: Update NUMA hinting faults once per scan Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-28  6:32   ` Srikar Dronamraju
2013-06-28  6:32     ` Srikar Dronamraju
2013-06-28  9:01     ` Peter Zijlstra
2013-06-28  9:01       ` Peter Zijlstra
2013-06-26 14:38 ` [PATCH 5/8] sched: Favour moving tasks towards the preferred node Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-27 14:52   ` Peter Zijlstra
2013-06-27 14:52     ` Peter Zijlstra
2013-06-27 14:53   ` Peter Zijlstra
2013-06-27 14:53     ` Peter Zijlstra
2013-06-28 13:00     ` Mel Gorman
2013-06-28 13:00       ` Mel Gorman
2013-06-27 16:01   ` Peter Zijlstra
2013-06-27 16:01     ` Peter Zijlstra
2013-06-28 13:01     ` Mel Gorman
2013-06-28 13:01       ` Mel Gorman
2013-06-27 16:11   ` Peter Zijlstra
2013-06-27 16:11     ` Peter Zijlstra
2013-06-28 13:45     ` Mel Gorman
2013-06-28 13:45       ` Mel Gorman
2013-06-28 15:10       ` Peter Zijlstra
2013-06-28 15:10         ` Peter Zijlstra
2013-06-28  8:11   ` Srikar Dronamraju
2013-06-28  8:11     ` Srikar Dronamraju
2013-06-28  9:04     ` Peter Zijlstra
2013-06-28  9:04       ` Peter Zijlstra
2013-06-28 10:07       ` Srikar Dronamraju
2013-06-28 10:07         ` Srikar Dronamraju
2013-06-28 10:24         ` Peter Zijlstra
2013-06-28 10:24           ` Peter Zijlstra
2013-06-28 13:51         ` Mel Gorman
2013-06-28 13:51           ` Mel Gorman
2013-06-28 17:14           ` Srikar Dronamraju
2013-06-28 17:14             ` Srikar Dronamraju
2013-06-28 17:34             ` Mel Gorman
2013-06-28 17:34               ` Mel Gorman
2013-06-28 17:44               ` Srikar Dronamraju
2013-06-28 17:44                 ` Srikar Dronamraju
2013-06-26 14:38 ` [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-27 14:54   ` Peter Zijlstra
2013-06-27 14:54     ` Peter Zijlstra
2013-06-28 13:54     ` Mel Gorman
2013-06-28 13:54       ` Mel Gorman
2013-07-02 12:06   ` Srikar Dronamraju
2013-07-02 12:06     ` Srikar Dronamraju
2013-07-02 16:29     ` Mel Gorman
2013-07-02 16:29       ` Mel Gorman
2013-07-02 18:17     ` Peter Zijlstra
2013-07-02 18:17       ` Peter Zijlstra
2013-07-06  6:44       ` Srikar Dronamraju
2013-07-06  6:44         ` Srikar Dronamraju
2013-07-06 10:47         ` Peter Zijlstra
2013-07-06 10:47           ` Peter Zijlstra
2013-07-02 18:15   ` Peter Zijlstra
2013-07-02 18:15     ` Peter Zijlstra
2013-07-03  9:50     ` Peter Zijlstra
2013-07-03  9:50       ` Peter Zijlstra
2013-07-03 15:28       ` Mel Gorman
2013-07-03 15:28         ` Mel Gorman
2013-07-03 18:46         ` Peter Zijlstra
2013-07-03 18:46           ` Peter Zijlstra
2013-06-26 14:38 ` [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-27 14:56   ` Peter Zijlstra
2013-06-27 14:56     ` Peter Zijlstra
2013-06-28 14:00     ` Mel Gorman
2013-06-28 14:00       ` Mel Gorman
2013-06-28  7:00   ` Srikar Dronamraju
2013-06-28  7:00     ` Srikar Dronamraju
2013-06-28  9:36     ` Peter Zijlstra
2013-06-28  9:36       ` Peter Zijlstra
2013-06-28 10:12       ` Srikar Dronamraju
2013-06-28 10:12         ` Srikar Dronamraju
2013-06-28 10:33         ` Peter Zijlstra
2013-06-28 10:33           ` Peter Zijlstra
2013-06-28 14:29           ` Mel Gorman
2013-06-28 14:29             ` Mel Gorman
2013-06-28 15:12             ` Peter Zijlstra
2013-06-28 15:12               ` Peter Zijlstra
2013-06-26 14:38 ` [PATCH 8/8] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-27 14:59 ` [PATCH 0/6] Basic scheduler support for automatic NUMA balancing Peter Zijlstra
2013-06-27 14:59   ` Peter Zijlstra
2013-06-28 13:54 ` Srikar Dronamraju
2013-06-28 13:54   ` Srikar Dronamraju
2013-07-01  5:39   ` Srikar Dronamraju
2013-07-01  5:39     ` Srikar Dronamraju
2013-07-01  8:43     ` Mel Gorman
2013-07-01  8:43       ` Mel Gorman
2013-07-02  5:28       ` Srikar Dronamraju
2013-07-02  5:28         ` Srikar Dronamraju
2013-07-02  7:46   ` Peter Zijlstra
2013-07-02  7:46     ` Peter Zijlstra
2013-07-02  8:55     ` Peter Zijlstra
2013-07-02  8:55       ` Peter Zijlstra
