* [PATCH 0/8] Basic scheduler support for automatic NUMA balancing
@ 2013-06-26 14:37 ` Mel Gorman
  0 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

It's several months overdue and everything was quiet after 3.8 came out
but I recently had a chance to revisit automatic NUMA balancing for a few
days. I looked at basic scheduler integration resulting in the following
small series. Much of the following is heavily based on the numacore series
which itself takes parts of the autonuma series from back in November. In
particular it borrows heavily from Peter Zijlstra's work in "sched, numa,
mm: Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).

This is still far from complete and there are known performance gaps between
this and manual binding where that is possible and, depending on the workload,
between it and interleaving when hard bindings are not an option. As before,
the intention is not to complete the work but to incrementally improve
mainline and preserve bisectability for any bug reports that crop up. This
will allow us to validate each step and keep reviewer stress to a minimum.

Patch 1 adds sysctl documentation

Patch 2 tracks NUMA hinting faults per-task and per-node

Patches 3-5 select a preferred node at the end of a PTE scan based on which
	node incurred the highest number of NUMA faults. When the balancer
	is comparing two CPUs it will prefer to locate tasks on their
	preferred node (a rough sketch of this selection follows the patch
	summaries below).

Patch 6 reschedules a task when a preferred node is selected if it is not
	running on that node already. This avoids waiting for the scheduler
	to move the task slowly.

Patch 7 splits the accounting of faults between those that passed the
	two-stage filter and those that did not. Task placement favours
	the filtered faults initially although ultimately this will need
	more smarts when node-local faults do not dominate.

Patch 8 replaces the PTE scanning reset hammer and instead increases the
	scanning rate when an otherwise settled task changes its
	preferred node.
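
As a rough illustration of where patches 2-5 are heading, the sketch below
picks a preferred node as the node that incurred the most NUMA hinting
faults. This is simplified, illustrative code rather than the code in the
patches; the helper name, the plain array parameter and the explicit node
count are assumptions made here for clarity. The series itself accumulates
the counts in p->numa_faults[] (see patch 2).

	/*
	 * Illustrative only: choose the node with the highest per-node
	 * NUMA hinting fault count as the task's preferred node.
	 */
	static int pick_preferred_node(const unsigned long *numa_faults,
				       int nr_nodes)
	{
		unsigned long max_faults = 0;
		int nid, preferred = -1;

		for (nid = 0; nid < nr_nodes; nid++) {
			if (numa_faults[nid] > max_faults) {
				max_faults = numa_faults[nid];
				preferred = nid;
			}
		}

		/* -1 means no faults have been recorded yet */
		return preferred;
	}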

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system.
                        3.9.0                 3.9.0
                      vanilla       resetscan-v1r29
TPut 1      24770.00 (  0.00%)     24735.00 ( -0.14%)
TPut 2      54639.00 (  0.00%)     55727.00 (  1.99%)
TPut 3      88338.00 (  0.00%)     87322.00 ( -1.15%)
TPut 4     115379.00 (  0.00%)    115912.00 (  0.46%)
TPut 5     143165.00 (  0.00%)    142017.00 ( -0.80%)
TPut 6     170256.00 (  0.00%)    171133.00 (  0.52%)
TPut 7     194410.00 (  0.00%)    200601.00 (  3.18%)
TPut 8     225864.00 (  0.00%)    225518.00 ( -0.15%)
TPut 9     248977.00 (  0.00%)    251078.00 (  0.84%)
TPut 10    274911.00 (  0.00%)    275088.00 (  0.06%)
TPut 11    299963.00 (  0.00%)    305233.00 (  1.76%)
TPut 12    329709.00 (  0.00%)    326502.00 ( -0.97%)
TPut 13    347794.00 (  0.00%)    352284.00 (  1.29%)
TPut 14    372475.00 (  0.00%)    375917.00 (  0.92%)
TPut 15    392596.00 (  0.00%)    391675.00 ( -0.23%)
TPut 16    405273.00 (  0.00%)    418292.00 (  3.21%)
TPut 17    429656.00 (  0.00%)    438006.00 (  1.94%)
TPut 18    447152.00 (  0.00%)    458248.00 (  2.48%)
TPut 19    453475.00 (  0.00%)    482686.00 (  6.44%)
TPut 20    473828.00 (  0.00%)    494508.00 (  4.36%)
TPut 21    477896.00 (  0.00%)    516264.00 (  8.03%)
TPut 22    502557.00 (  0.00%)    521956.00 (  3.86%)
TPut 23    503415.00 (  0.00%)    545774.00 (  8.41%)
TPut 24    516095.00 (  0.00%)    555747.00 (  7.68%)
TPut 25    515441.00 (  0.00%)    562987.00 (  9.22%)
TPut 26    517906.00 (  0.00%)    562589.00 (  8.63%)
TPut 27    517312.00 (  0.00%)    551823.00 (  6.67%)
TPut 28    511740.00 (  0.00%)    548546.00 (  7.19%)
TPut 29    515789.00 (  0.00%)    552132.00 (  7.05%)
TPut 30    501366.00 (  0.00%)    556688.00 ( 11.03%)
TPut 31    509797.00 (  0.00%)    558124.00 (  9.48%)
TPut 32    514932.00 (  0.00%)    553529.00 (  7.50%)
TPut 33    502227.00 (  0.00%)    550933.00 (  9.70%)
TPut 34    509668.00 (  0.00%)    530995.00 (  4.18%)
TPut 35    500032.00 (  0.00%)    539452.00 (  7.88%)
TPut 36    483231.00 (  0.00%)    527146.00 (  9.09%)
TPut 37    493236.00 (  0.00%)    524913.00 (  6.42%)
TPut 38    483924.00 (  0.00%)    521526.00 (  7.77%)
TPut 39    467308.00 (  0.00%)    523683.00 ( 12.06%)
TPut 40    461353.00 (  0.00%)    494697.00 (  7.23%)
TPut 41    462128.00 (  0.00%)    513593.00 ( 11.14%)
TPut 42    450428.00 (  0.00%)    505080.00 ( 12.13%)
TPut 43    444065.00 (  0.00%)    491715.00 ( 10.73%)
TPut 44    455875.00 (  0.00%)    473548.00 (  3.88%)
TPut 45    413063.00 (  0.00%)    474189.00 ( 14.80%)
TPut 46    421084.00 (  0.00%)    457423.00 (  8.63%)
TPut 47    399403.00 (  0.00%)    450189.00 ( 12.72%)
TPut 48    411438.00 (  0.00%)    443868.00 (  7.88%)

A somewhat respectable performance improvement is seen for most numbers of clients.

specjbb Peaks
                                       3.9.0                      3.9.0
                                     vanilla            resetscan-v1r29
 Expctd Warehouse                   48.00 (  0.00%)                   48.00 (  0.00%)
 Expctd Peak Bops               399403.00 (  0.00%)               450189.00 ( 12.72%)
 Actual Warehouse                   27.00 (  0.00%)                   26.00 ( -3.70%)
 Actual Peak Bops               517906.00 (  0.00%)               562987.00 (  8.70%)
 SpecJBB Bops                     8397.00 (  0.00%)                 9059.00 (  7.88%)
 SpecJBB Bops/JVM                 8397.00 (  0.00%)                 9059.00 (  7.88%)

The specjbb score and peak bops are improved. The actual peak warehouse
is lower, which is unfortunate.

               3.9.0       3.9.0
             vanilla resetscan-v1r29
User        44532.91    44541.85
System        145.18      133.87
Elapsed      1667.08     1666.65

System CPU usage is slightly lower, so we get higher performance for lower overhead.

                                 3.9.0       3.9.0
                               vanilla resetscan-v1r29
Minor Faults                   1951410     1864310
Major Faults                       149         130
Swap Ins                             0           0
Swap Outs                            0           0
Direct pages scanned                 0           0
Kswapd pages scanned                 0           0
Kswapd pages reclaimed               0           0
Direct pages reclaimed               0           0
Kswapd efficiency                 100%        100%
Kswapd velocity                  0.000       0.000
Direct efficiency                 100%        100%
Direct velocity                  0.000       0.000
Percentage direct scans             0%          0%
Zone normal velocity             0.000       0.000
Zone dma32 velocity              0.000       0.000
Zone dma velocity                0.000       0.000
Page writes by reclaim           0.000       0.000
Page writes file                     0           0
Page writes anon                     0           0
Page reclaim immediate               0           0
Sector Reads                     61964       37260
Sector Writes                    23408       17708
Page rescued immediate               0           0
Slabs scanned                        0           0
Direct inode steals                  0           0
Kswapd inode steals                  0           0
Kswapd skipped wait                  0           0
THP fault alloc                  42876       40951
THP collapse alloc                  61          66
THP splits                          58          52
THP fault fallback                   0           0
THP collapse fail                    0           0
Compaction stalls                    0           0
Compaction success                   0           0
Compaction failures                  0           0
Page migrate success          14446025    13710610
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                  14994       14231
NUMA PTE updates             112474717   106764423
NUMA hint faults                692716      543202
NUMA hint local faults          272512      154250
NUMA pages migrated           14446025    13710610
AutoNUMA cost                     4525        3723

Note that there are marginally fewer PTE updates, NUMA hinting faults and
pages migrated, again showing that we get the higher performance for lower overhead.

I also ran SpecJBB with THP enabled and one JVM running per
NUMA node in the system. It's a lot of data unfortunately.

                          3.9.0                 3.9.0
                        vanilla       resetscan-v1r29
Mean   1      30420.25 (  0.00%)     30813.00 (  1.29%)
Mean   2      61628.50 (  0.00%)     62773.00 (  1.86%)
Mean   3      89830.25 (  0.00%)     90780.00 (  1.06%)
Mean   4     115535.00 (  0.00%)    115962.50 (  0.37%)
Mean   5     138453.75 (  0.00%)    137142.00 ( -0.95%)
Mean   6     157207.75 (  0.00%)    154942.50 ( -1.44%)
Mean   7     159087.50 (  0.00%)    158301.75 ( -0.49%)
Mean   8     158453.00 (  0.00%)    157125.00 ( -0.84%)
Mean   9     156613.75 (  0.00%)    151507.50 ( -3.26%)
Mean   10    151129.75 (  0.00%)    146982.25 ( -2.74%)
Mean   11    141945.00 (  0.00%)    136831.50 ( -3.60%)
Mean   12    136653.75 (  0.00%)    132907.50 ( -2.74%)
Mean   13    135432.00 (  0.00%)    130598.50 ( -3.57%)
Mean   14    132629.00 (  0.00%)    130460.50 ( -1.64%)
Mean   15    127698.00 (  0.00%)    132509.25 (  3.77%)
Mean   16    128686.75 (  0.00%)    130936.25 (  1.75%)
Mean   17    123666.50 (  0.00%)    125579.75 (  1.55%)
Mean   18    121543.75 (  0.00%)    122923.50 (  1.14%)
Mean   19    118704.75 (  0.00%)    127232.00 (  7.18%)
Mean   20    117251.50 (  0.00%)    124994.75 (  6.60%)
Mean   21    114060.25 (  0.00%)    123165.50 (  7.98%)
Mean   22    108594.00 (  0.00%)    116716.00 (  7.48%)
Mean   23    108471.25 (  0.00%)    115118.25 (  6.13%)
Mean   24    110019.25 (  0.00%)    114149.75 (  3.75%)
Mean   25    109250.50 (  0.00%)    112506.75 (  2.98%)
Mean   26    107827.75 (  0.00%)    112699.50 (  4.52%)
Mean   27    104496.25 (  0.00%)    114260.00 (  9.34%)
Mean   28    104117.75 (  0.00%)    114140.75 (  9.63%)
Mean   29    103018.75 (  0.00%)    109829.50 (  6.61%)
Mean   30    104718.00 (  0.00%)    108194.25 (  3.32%)
Mean   31    101520.50 (  0.00%)    108311.25 (  6.69%)
Mean   32     97662.75 (  0.00%)    105314.75 (  7.84%)
Mean   33    101508.50 (  0.00%)    106076.25 (  4.50%)
Mean   34     98576.50 (  0.00%)    111020.50 ( 12.62%)
Mean   35    105180.75 (  0.00%)    108971.25 (  3.60%)
Mean   36    101517.00 (  0.00%)    108781.25 (  7.16%)
Mean   37    100664.00 (  0.00%)    109634.50 (  8.91%)
Mean   38    101012.25 (  0.00%)    110988.25 (  9.88%)
Mean   39    101967.00 (  0.00%)    105927.75 (  3.88%)
Mean   40     97732.50 (  0.00%)    110570.00 ( 13.14%)
Mean   41    103773.25 (  0.00%)    111583.00 (  7.53%)
Mean   42    105105.00 (  0.00%)    110321.00 (  4.96%)
Mean   43    102351.50 (  0.00%)    107145.75 (  4.68%)
Mean   44    105980.00 (  0.00%)    107938.50 (  1.85%)
Mean   45    111055.00 (  0.00%)    111159.25 (  0.09%)
Mean   46    112757.25 (  0.00%)    114807.00 (  1.82%)
Mean   47     93706.75 (  0.00%)    113681.25 ( 21.32%)
Mean   48    106624.00 (  0.00%)    117423.75 ( 10.13%)
Stddev 1       1371.00 (  0.00%)       872.33 ( 36.37%)
Stddev 2       1326.07 (  0.00%)       310.98 ( 76.55%)
Stddev 3       1160.36 (  0.00%)      1074.95 (  7.36%)
Stddev 4       1689.80 (  0.00%)      1461.05 ( 13.54%)
Stddev 5       2214.45 (  0.00%)      1089.81 ( 50.79%)
Stddev 6       1756.74 (  0.00%)      2138.00 (-21.70%)
Stddev 7       3419.70 (  0.00%)      3335.13 (  2.47%)
Stddev 8       6511.71 (  0.00%)      4716.75 ( 27.57%)
Stddev 9       5373.19 (  0.00%)      2899.89 ( 46.03%)
Stddev 10      3732.23 (  0.00%)      2558.50 ( 31.45%)
Stddev 11      4616.71 (  0.00%)      5919.34 (-28.22%)
Stddev 12      5503.15 (  0.00%)      5953.85 ( -8.19%)
Stddev 13      5202.46 (  0.00%)      7507.23 (-44.30%)
Stddev 14      3526.10 (  0.00%)      2296.23 ( 34.88%)
Stddev 15      3576.78 (  0.00%)      3450.47 (  3.53%)
Stddev 16      2786.08 (  0.00%)       950.31 ( 65.89%)
Stddev 17      3055.44 (  0.00%)      2881.78 (  5.68%)
Stddev 18      2543.08 (  0.00%)      1332.83 ( 47.59%)
Stddev 19      3936.65 (  0.00%)      1403.64 ( 64.34%)
Stddev 20      3005.94 (  0.00%)      1342.59 ( 55.34%)
Stddev 21      2657.19 (  0.00%)      2498.95 (  5.96%)
Stddev 22      2016.42 (  0.00%)      2078.84 ( -3.10%)
Stddev 23      2209.88 (  0.00%)      2939.24 (-33.00%)
Stddev 24      5325.86 (  0.00%)      2760.85 ( 48.16%)
Stddev 25      4659.26 (  0.00%)      1433.24 ( 69.24%)
Stddev 26      1169.78 (  0.00%)      1977.32 (-69.03%)
Stddev 27      2923.78 (  0.00%)      2675.50 (  8.49%)
Stddev 28      5335.85 (  0.00%)      1874.29 ( 64.87%)
Stddev 29      4381.68 (  0.00%)      3660.16 ( 16.47%)
Stddev 30      3437.44 (  0.00%)      6535.20 (-90.12%)
Stddev 31      3979.56 (  0.00%)      5032.62 (-26.46%)
Stddev 32      2614.04 (  0.00%)      5118.99 (-95.83%)
Stddev 33      5358.35 (  0.00%)      2488.64 ( 53.56%)
Stddev 34      6375.57 (  0.00%)      4105.34 ( 35.61%)
Stddev 35      8079.76 (  0.00%)      3696.10 ( 54.25%)
Stddev 36      8665.59 (  0.00%)      5155.29 ( 40.51%)
Stddev 37      8002.37 (  0.00%)      8660.12 ( -8.22%)
Stddev 38      4955.36 (  0.00%)      8615.78 (-73.87%)
Stddev 39      9940.79 (  0.00%)      9620.33 (  3.22%)
Stddev 40     12344.56 (  0.00%)     11248.42 (  8.88%)
Stddev 41     15834.32 (  0.00%)     13587.05 ( 14.19%)
Stddev 42     12006.48 (  0.00%)     10554.10 ( 12.10%)
Stddev 43      4141.73 (  0.00%)     13565.76 (-227.54%)
Stddev 44      7476.54 (  0.00%)     16442.62 (-119.92%)
Stddev 45     16048.04 (  0.00%)     17095.94 ( -6.53%)
Stddev 46     16198.20 (  0.00%)     17323.97 ( -6.95%)
Stddev 47     15743.04 (  0.00%)     17748.58 (-12.74%)
Stddev 48     12627.98 (  0.00%)     17082.27 (-35.27%)

These are the mean throughput figures across JVMs and the standard
deviation. Note that with the patches applied there is a lot less
deviation between JVMs in many cases. As the number of clients increases
the performance improves. This is still far short of the theoretical best
performance but it's a step in the right direction.

TPut   1     121681.00 (  0.00%)    123252.00 (  1.29%)
TPut   2     246514.00 (  0.00%)    251092.00 (  1.86%)
TPut   3     359321.00 (  0.00%)    363120.00 (  1.06%)
TPut   4     462140.00 (  0.00%)    463850.00 (  0.37%)
TPut   5     553815.00 (  0.00%)    548568.00 ( -0.95%)
TPut   6     628831.00 (  0.00%)    619770.00 ( -1.44%)
TPut   7     636350.00 (  0.00%)    633207.00 ( -0.49%)
TPut   8     633812.00 (  0.00%)    628500.00 ( -0.84%)
TPut   9     626455.00 (  0.00%)    606030.00 ( -3.26%)
TPut   10    604519.00 (  0.00%)    587929.00 ( -2.74%)
TPut   11    567780.00 (  0.00%)    547326.00 ( -3.60%)
TPut   12    546615.00 (  0.00%)    531630.00 ( -2.74%)
TPut   13    541728.00 (  0.00%)    522394.00 ( -3.57%)
TPut   14    530516.00 (  0.00%)    521842.00 ( -1.64%)
TPut   15    510792.00 (  0.00%)    530037.00 (  3.77%)
TPut   16    514747.00 (  0.00%)    523745.00 (  1.75%)
TPut   17    494666.00 (  0.00%)    502319.00 (  1.55%)
TPut   18    486175.00 (  0.00%)    491694.00 (  1.14%)
TPut   19    474819.00 (  0.00%)    508928.00 (  7.18%)
TPut   20    469006.00 (  0.00%)    499979.00 (  6.60%)
TPut   21    456241.00 (  0.00%)    492662.00 (  7.98%)
TPut   22    434376.00 (  0.00%)    466864.00 (  7.48%)
TPut   23    433885.00 (  0.00%)    460473.00 (  6.13%)
TPut   24    440077.00 (  0.00%)    456599.00 (  3.75%)
TPut   25    437002.00 (  0.00%)    450027.00 (  2.98%)
TPut   26    431311.00 (  0.00%)    450798.00 (  4.52%)
TPut   27    417985.00 (  0.00%)    457040.00 (  9.34%)
TPut   28    416471.00 (  0.00%)    456563.00 (  9.63%)
TPut   29    412075.00 (  0.00%)    439318.00 (  6.61%)
TPut   30    418872.00 (  0.00%)    432777.00 (  3.32%)
TPut   31    406082.00 (  0.00%)    433245.00 (  6.69%)
TPut   32    390651.00 (  0.00%)    421259.00 (  7.84%)
TPut   33    406034.00 (  0.00%)    424305.00 (  4.50%)
TPut   34    394306.00 (  0.00%)    444082.00 ( 12.62%)
TPut   35    420723.00 (  0.00%)    435885.00 (  3.60%)
TPut   36    406068.00 (  0.00%)    435125.00 (  7.16%)
TPut   37    402656.00 (  0.00%)    438538.00 (  8.91%)
TPut   38    404049.00 (  0.00%)    443953.00 (  9.88%)
TPut   39    407868.00 (  0.00%)    423711.00 (  3.88%)
TPut   40    390930.00 (  0.00%)    442280.00 ( 13.14%)
TPut   41    415093.00 (  0.00%)    446332.00 (  7.53%)
TPut   42    420420.00 (  0.00%)    441284.00 (  4.96%)
TPut   43    409406.00 (  0.00%)    428583.00 (  4.68%)
TPut   44    423920.00 (  0.00%)    431754.00 (  1.85%)
TPut   45    444220.00 (  0.00%)    444637.00 (  0.09%)
TPut   46    451029.00 (  0.00%)    459228.00 (  1.82%)
TPut   47    374827.00 (  0.00%)    454725.00 ( 21.32%)
TPut   48    426496.00 (  0.00%)    469695.00 ( 10.13%)

Similarly, overall throughput is improved for larger numbers of clients.

specjbb Peaks
                                       3.9.0                      3.9.0
                                     vanilla            resetscan-v1r29
 Expctd Warehouse                   12.00 (  0.00%)                   12.00 (  0.00%)
 Expctd Peak Bops               567780.00 (  0.00%)               547326.00 ( -3.60%)
 Actual Warehouse                    8.00 (  0.00%)                    8.00 (  0.00%)
 Actual Peak Bops               636350.00 (  0.00%)               633207.00 ( -0.49%)
 SpecJBB Bops                   487204.00 (  0.00%)               500705.00 (  2.77%)
 SpecJBB Bops/JVM               121801.00 (  0.00%)               125176.00 (  2.77%)

Peak performance is not great but the specjbb score is slightly improved.


               3.9.0       3.9.0
             vanilla resetscan-v1r29
User       479120.95   479525.04
System       1395.40     1124.93
Elapsed     10363.40    10376.34

System CPU time is reduced by quite a lot, so automatic NUMA balancing now has less overhead.

                                 3.9.0       3.9.0
                               vanilla resetscan-v1r29
Minor Faults                  15711256    14962529
Major Faults                       132         151
Swap Ins                             0           0
Swap Outs                            0           0
Direct pages scanned                 0           0
Kswapd pages scanned                 0           0
Kswapd pages reclaimed               0           0
Direct pages reclaimed               0           0
Kswapd efficiency                 100%        100%
Kswapd velocity                  0.000       0.000
Direct efficiency                 100%        100%
Direct velocity                  0.000       0.000
Percentage direct scans             0%          0%
Zone normal velocity             0.000       0.000
Zone dma32 velocity              0.000       0.000
Zone dma velocity                0.000       0.000
Page writes by reclaim           0.000       0.000
Page writes file                     0           0
Page writes anon                     0           0
Page reclaim immediate               0           0
Sector Reads                     32700       67420
Sector Writes                   108660      116092
Page rescued immediate               0           0
Slabs scanned                        0           0
Direct inode steals                  0           0
Kswapd inode steals                  0           0
Kswapd skipped wait                  0           0
THP fault alloc                  77041       76063
THP collapse alloc                 194         208
THP splits                         430         428
THP fault fallback                   0           0
THP collapse fail                    0           0
Compaction stalls                    0           0
Compaction success                   0           0
Compaction failures                  0           0
Page migrate success         134743458   102408111
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                 139863      106299
NUMA PTE updates            1167722150   961427213
NUMA hint faults               9915871     8411075
NUMA hint local faults         3660769     3212050
NUMA pages migrated          134743458   102408111
AutoNUMA cost                    60313       50731

Note that there are roughly 18% fewer PTE updates, reflecting the changes in
the scan rates. Similarly, there are fewer hinting faults incurred and fewer
pages migrated.

Overall the performance has improved slightly, but in general there is
less system overhead when delivering that performance, so it is at least
a step in the right direction, albeit far short of what it ultimately
needs to be.


 Documentation/sysctl/kernel.txt |  67 ++++++++++++++++
 include/linux/mm_types.h        |   3 -
 include/linux/sched.h           |  21 ++++-
 include/linux/sched/sysctl.h    |   1 -
 kernel/sched/core.c             |  33 +++++++-
 kernel/sched/fair.c             | 169 +++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h            |  12 +++
 kernel/sysctl.c                 |  14 ++--
 mm/huge_memory.c                |   7 +-
 mm/memory.c                     |   9 ++-
 10 files changed, 294 insertions(+), 42 deletions(-)

-- 
1.8.1.4



* [PATCH 1/8] mm: numa: Document automatic NUMA balancing sysctls
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccd4258..0fe678c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,72 @@ utilize.
 
 ==============================================================
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans a task's address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running.  Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases.  The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases.  The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a task's memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a task's scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
 osrelease, ostype & version:
 
 # cat osrelease
-- 
1.8.1.4
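
For a feel for the scan rate arithmetic described in the documentation
above, a tiny worked example follows. The values are hypothetical and are
not defaults set anywhere in this series; the point is only that "scan
delay" and "scan size" together bound how much of the address space can
be unmapped per second of task runtime.

	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical tunables, not the series' defaults */
		unsigned int scan_period_ms = 1000;	/* "scan delay" */
		unsigned int scan_size_mb = 256;	/* "scan size"  */

		/*
		 * At most scan_size_mb is unmapped every scan_period_ms,
		 * so the worst-case scan rate is bounded as follows.
		 */
		unsigned int mb_per_sec = scan_size_mb * 1000 / scan_period_ms;

		printf("worst-case scan rate: %u MB/sec of runtime\n",
		       mb_per_sec);
		return 0;
	}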



* [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch tracks which nodes NUMA hinting faults were incurred on.  Greater
weight is given if the pages had to be migrated, on the understanding
that such faults cost significantly more. If a task has paid the cost of
migrating data to a node then in future it is preferable that the task
does not migrate that data again unnecessarily. This information is later
used to schedule a task on the node incurring the most NUMA hinting faults.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 12 +++++++++++-
 kernel/sched/sched.h  | 12 ++++++++++++
 4 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..72861b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1505,6 +1505,8 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+
+	unsigned long *numa_faults;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..f332ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
+	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
@@ -1853,6 +1854,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		task_numa_free(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..904fd6f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (!sched_feat_numa(NUMA))
 		return;
 
-	/* FIXME: Allocate task-specific structure for placement policy here */
+	/* Allocate buffer to track faults on a per-node basis */
+	if (unlikely(!p->numa_faults)) {
+		int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		if (!p->numa_faults)
+			return;
+	}
 
 	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
@@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
 			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
+
+	/* Record the fault, double the weight if pages were migrated */
+	p->numa_faults[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..9c26d88 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -503,6 +503,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+#ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int node, int shared);
+static inline void task_numa_free(struct task_struct *p)
+{
+	kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
1.8.1.4
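
To make the weighting in task_numa_fault() above concrete: with the
"pages << migrated" accumulation, a hinting fault covering 512 pages whose
data also had to be migrated adds twice as much to that node's counter as
a fault on properly placed pages. The snippet below simply restates that
arithmetic outside the kernel; the helper name and values are illustrative
assumptions, not part of the patch.

	#include <stdbool.h>
	#include <stdio.h>

	/* Illustrative restatement of the accounting in task_numa_fault() */
	static unsigned long fault_weight(int pages, bool migrated)
	{
		/* Doubled when the fault also caused a migration */
		return (unsigned long)pages << migrated;
	}

	int main(void)
	{
		printf("properly placed: %lu\n", fault_weight(512, false));
		printf("migrated:        %lu\n", fault_weight(512, true));
		return 0;
	}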


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 3/8] sched: Select a preferred node with the most numa hinting faults
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch selects a preferred node for a task to run on based on where
its NUMA hinting faults were incurred. This information is later used to
migrate tasks towards that node during load balancing.
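
In effect the placement pass picks the node with the highest fault count,
halving each count as it is read so that stale history decays. A rough
userspace sketch of that loop (illustrative only, not the kernel code):

#include <stdio.h>

#define NR_NODES 4

static unsigned long numa_faults[NR_NODES] = { 10, 80, 25, 5 };

/* pick the node with the most faults; halve counts so history decays */
static int pick_preferred_node(void)
{
	unsigned long max_faults = 0;
	int nid, max_nid = -1;

	for (nid = 0; nid < NR_NODES; nid++) {
		unsigned long faults = numa_faults[nid];

		numa_faults[nid] >>= 1;
		if (faults > max_faults) {
			max_faults = faults;
			max_nid = nid;
		}
	}
	return max_nid;
}

int main(void)
{
	printf("preferred node: %d\n", pick_preferred_node());	/* node 1 */
	return 0;
}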

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 10 ++++++++++
 kernel/sched/fair.c   | 16 ++++++++++++++--
 kernel/sched/sched.h  |  2 +-
 4 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 72861b4..ba46a64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,7 @@ struct task_struct {
 	struct callback_head numa_work;
 
 	unsigned long *numa_faults;
+	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f332ec0..019baae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -5713,6 +5714,15 @@ enum s_alloc {
 
 struct sched_domain_topology_level;
 
+#ifdef CONFIG_NUMA_BALANCING
+
+/* Set a tasks preferred NUMA node */
+void sched_setnuma(struct task_struct *p, int nid)
+{
+	p->numa_preferred_nid = nid;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 904fd6f..f8c3f61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -793,7 +793,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq;
+	int seq, nid, max_nid = 0;
+	unsigned long max_faults = 0;
 
 	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
 		return;
@@ -802,7 +803,18 @@ static void task_numa_placement(struct task_struct *p)
 		return;
 	p->numa_scan_seq = seq;
 
-	/* FIXME: Scheduling placement policy hints go here */
+	/* Find the node with the highest number of faults */
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		unsigned long faults = p->numa_faults[nid];
+		p->numa_faults[nid] >>= 1;
+		if (faults > max_faults) {
+			max_faults = faults;
+			max_nid = nid;
+		}
+	}
+
+	if (max_faults && max_nid != p->numa_preferred_nid)
+		sched_setnuma(p, max_nid);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9c26d88..65a0cf0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,7 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void sched_setnuma(struct task_struct *p, int node, int shared);
+extern void sched_setnuma(struct task_struct *p, int nid);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 4/8] sched: Update NUMA hinting faults once per scan
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA hinting fault counts and placement decisions are both recorded in the
same array, which distorts the samples in an unpredictable fashion. The values
accumulate linearly during the scan and then decay, creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best, it is very difficult to state that the buffer holds
a decaying average of past faulting behaviour. At worst, it can confuse the
load balancer if one node shows an artificially high count due to very recent
faulting activity, which may cause tasks to bounce between nodes.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.
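
The end-of-window update is then an exponential decay plus a carry from
the scratch buffer; roughly the following, in userspace terms with
illustrative names:

#include <stdio.h>

#define NR_NODES 4

static unsigned long numa_faults[NR_NODES]        = { 40, 8, 0, 0 };
static unsigned long numa_faults_buffer[NR_NODES] = {  4, 32, 2, 0 };

/* fold one completed scan window into the long-term counts */
static void end_scan_window(void)
{
	for (int nid = 0; nid < NR_NODES; nid++) {
		numa_faults[nid] >>= 1;			/* decay history */
		numa_faults[nid] += numa_faults_buffer[nid];
		numa_faults_buffer[nid] = 0;		/* reset for next window */
	}
}

int main(void)
{
	end_scan_window();
	for (int nid = 0; nid < NR_NODES; nid++)
		printf("node %d: %lu\n", nid, numa_faults[nid]);
	return 0;
}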

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h | 13 +++++++++++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 16 +++++++++++++---
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba46a64..42f9818 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1506,7 +1506,20 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	/*
+	 * Exponential decaying average of faults on a per-node basis.
+	 * Scheduling placement decisions are made based on these counts.
+	 * The values remain static for the duration of a PTE scan
+	 */
 	unsigned long *numa_faults;
+
+	/*
+	 * numa_faults_buffer records faults per node during the current
+	 * scan window. When the scan completes, the counts in numa_faults
+	 * decay and these values are copied.
+	 */
+	unsigned long *numa_faults_buffer;
+
 	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 019baae..b00b81a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1596,6 +1596,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
+	p->numa_faults_buffer = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8c3f61..5893399 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -805,8 +805,14 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
-		unsigned long faults = p->numa_faults[nid];
+		unsigned long faults;
+
+		/* Decay existing window and copy faults since last scan */
 		p->numa_faults[nid] >>= 1;
+		p->numa_faults[nid] += p->numa_faults_buffer[nid];
+		p->numa_faults_buffer[nid] = 0;
+
+		faults = p->numa_faults[nid];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -831,9 +837,13 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * nr_node_ids;
 
-		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		/* numa_faults and numa_faults_buffer share the allocation */
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
+
+		BUG_ON(p->numa_faults_buffer);
+		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
 	}
 
 	/*
@@ -847,7 +857,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults[node] += pages << migrated;
+	p->numa_faults_buffer[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch favours moving tasks towards the preferred NUMA node when
it has just been selected. Ideally this is self-reinforcing as the
longer the task runs on that node, the more faults it should incur,
causing task_numa_placement to keep the task running on that node. In
reality a big weakness is that the node's CPUs can be overloaded and it
would be more efficient to queue tasks on an idle node and migrate to
the new node. This would require additional smarts in the balancer so
for now the balancer simply prefers to place the task on the
preferred node for a tunable number of PTE scans.
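
The "settle" rule the balancer applies can be pictured with a small
standalone sketch; the names below are illustrative and the tunable
simply mirrors the sysctl default added by this patch:

#include <stdbool.h>
#include <stdio.h>

static unsigned int settle_count = 3;	/* illustrative default */

struct task_info {
	int preferred_nid;	/* -1 until a node is selected */
	int migrate_seq;	/* scan periods since the node was selected */
};

/*
 * A move from src_nid to dst_nid is treated as improving locality only
 * while the task is still settling on a freshly selected preferred node
 * and the destination is that node.
 */
static bool move_improves_locality(struct task_info *t, int src_nid, int dst_nid)
{
	if (t->preferred_nid < 0 || src_nid == dst_nid)
		return false;
	return t->migrate_seq < settle_count && t->preferred_nid == dst_nid;
}

int main(void)
{
	struct task_info t = { .preferred_nid = 1, .migrate_seq = 0 };

	printf("%d\n", move_improves_locality(&t, 0, 1));	/* 1: pull towards node 1 */
	t.migrate_seq = 5;
	printf("%d\n", move_improves_locality(&t, 0, 1));	/* 0: already settled */
	return 0;
}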

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt |  8 +++++++-
 include/linux/sched.h           |  1 +
 kernel/sched/core.c             |  4 +++-
 kernel/sched/fair.c             | 40 ++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c                 |  7 +++++++
 5 files changed, 56 insertions(+), 4 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 0fe678c..246b128 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,7 +374,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
 
 ==============================================================
 
@@ -418,6 +419,11 @@ scanned for a given scan.
 numa_balancing_scan_period_reset is a blunt instrument that controls how
 often a tasks scan delay is reset to detect sudden changes in task behaviour.
 
+numa_balancing_settle_count is how many scan periods must complete before
+the load balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42f9818..82a6136 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b00b81a..ba9470e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1591,7 +1591,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_migrate_seq = 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
@@ -5721,6 +5721,7 @@ struct sched_domain_topology_level;
 void sched_setnuma(struct task_struct *p, int nid)
 {
 	p->numa_preferred_nid = nid;
+	p->numa_migrate_seq = 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
@@ -6150,6 +6151,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5893399..5e7f728 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -791,6 +791,15 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the node's CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -802,6 +811,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_migrate_seq++;
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
@@ -3897,6 +3907,28 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid)
+		return false;
+
+	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
+	    p->numa_preferred_nid == dst_nid)
+		return true;
+
+	return false;
+}
+
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) task is cache cold, or
-	 * 2) too many balance attempts have failed.
+	 * 1) destination numa is preferred
+	 * 2) task is cache cold, or
+	 * 3) too many balance attempts have failed.
 	 */
 
+	if (migrate_improves_locality(p, env))
+		return 1;
+
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..263486f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -393,6 +393,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_settle_count",
+		.data           = &sysctl_numa_balancing_settle_count,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

A preferred node is selected based on the node that incurred the most
NUMA hinting faults. There is no guarantee that the task is running
on that node at the time, so this patch reschedules the task to run on
the idlest CPU of the preferred node as soon as it is selected. This avoids
waiting for the balancer to make a decision.
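
The CPU selection is a least-loaded search across the node's CPUs that
skips any CPU whose running task already prefers that node. A toy
userspace version of the idea (illustrative data and names, not the
scheduler's structures):

#include <stdio.h>

#define NR_CPUS 8

/* toy per-CPU state standing in for weighted_cpuload() and rq->curr */
struct cpu_info {
	int node;
	unsigned long load;
	int curr_preferred_nid;
};

static struct cpu_info cpus[NR_CPUS] = {
	{ 0, 900, 0 }, { 0, 300, 0 }, { 0, 150, 1 }, { 0, 100, 0 },
	{ 1, 500, 1 }, { 1, 200, 1 }, { 1, 250, 0 }, { 1, 700, 1 },
};

/* least-loaded CPU on @nid whose running task does not already prefer @nid */
static int find_idlest_cpu_on_node(int this_cpu, int nid)
{
	unsigned long min_load = (unsigned long)-1;
	int cpu, idlest = this_cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpus[cpu].node != nid)
			continue;
		if (cpus[cpu].load < min_load &&
		    cpus[cpu].curr_preferred_nid != nid) {
			min_load = cpus[cpu].load;
			idlest = cpu;
		}
	}
	return idlest;
}

int main(void)
{
	printf("idlest CPU on node 1: %d\n", find_idlest_cpu_on_node(0, 1));	/* 6 */
	return 0;
}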

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  | 18 +++++++++++++++--
 kernel/sched/fair.c  | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  2 +-
 3 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ba9470e..b4722d6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5717,11 +5717,25 @@ struct sched_domain_topology_level;
 
 #ifdef CONFIG_NUMA_BALANCING
 
-/* Set a tasks preferred NUMA node */
-void sched_setnuma(struct task_struct *p, int nid)
+/* Set a tasks preferred NUMA node and reschedule to it */
+void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
 {
+	int curr_cpu = task_cpu(p);
+	struct migration_arg arg = { p, idlest_cpu };
+
 	p->numa_preferred_nid = nid;
 	p->numa_migrate_seq = 0;
+
+	/* Do not reschedule if already running on the target CPU */
+	if (idlest_cpu == curr_cpu)
+		return;
+
+	/* Ensure the target CPU is eligible */
+	if (!cpumask_test_cpu(idlest_cpu, tsk_cpus_allowed(p)))
+		return;
+
+	/* Move current running task to idlest CPU on preferred node */
+	stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5e7f728..99951a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -800,6 +800,39 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static unsigned long weighted_cpuload(const int cpu);
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int i, idlest_cpu = this_cpu;
+
+	BUG_ON(cpu_to_node(this_cpu) == nid);
+
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		load = weighted_cpuload(i);
+
+		if (load < min_load) {
+			struct task_struct *p;
+
+			/* Do not preempt a task running on its preferred node */
+			struct rq *rq = cpu_rq(i);
+			local_irq_disable();
+			raw_spin_lock(&rq->lock);
+			p = rq->curr;
+			if (p->numa_preferred_nid != nid) {
+				min_load = load;
+				idlest_cpu = i;
+			}
+			raw_spin_unlock(&rq->lock);
+			local_irq_enable();
+		}
+	}
+
+	return idlest_cpu;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -829,8 +862,26 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	if (max_faults && max_nid != p->numa_preferred_nid)
-		sched_setnuma(p, max_nid);
+	/*
+	 * Record the preferred node as the node with the most faults,
+	 * requeue the task to be running on the idlest CPU on the
+	 * preferred node and reset the scanning rate to recheck
+	 * the working set placement.
+	 */
+	if (max_faults && max_nid != p->numa_preferred_nid) {
+		int preferred_cpu;
+
+		/*
+		 * If the task is not on the preferred node then find the most
+		 * idle CPU to migrate to.
+		 */
+		preferred_cpu = task_cpu(p);
+		if (cpu_to_node(preferred_cpu) != max_nid)
+			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+							     max_nid);
+
+		sched_setnuma(p, max_nid, preferred_cpu);
+	}
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 65a0cf0..64c37a3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,7 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void sched_setnuma(struct task_struct *p, int nid);
+extern void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared. Doing so would require
recording the last task that accessed each page on a hinting fault, which
would increase the size of struct page. Instead, this patch approximates:
faults that pass the two-stage filter are treated as private and all others
as shared. The preferred NUMA node is then selected based on where the
maximum number of approximately private faults was measured.
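
The bookkeeping change is an interleaved layout of the per-node counters,
indexed by 2 * nid + priv. A small userspace sketch of the index scheme
(illustrative names; the private test here is a simplification comparing
the faulting task's node with the node recorded as last accessing the
page):

#include <stdio.h>

#define NR_NODES 4

/*
 * per-task fault stats laid out as [node0-shared, node0-private,
 * node1-shared, node1-private, ...]
 */
static unsigned long faults[2 * NR_NODES];

static int task_faults_idx(int nid, int priv)
{
	return 2 * nid + priv;
}

static void record_fault(int this_nid, int last_nid, int page_nid,
			 int pages, int migrated)
{
	/* approximately "private": the faulting task runs on the same
	 * node that was last recorded as accessing the page */
	int priv = (this_nid == last_nid);

	faults[task_faults_idx(page_nid, priv)] += pages << migrated;
}

int main(void)
{
	record_fault(1, 1, 1, 8, 0);	/* private fault on node 1 */
	record_fault(1, 0, 1, 8, 0);	/* shared fault on node 1 */

	printf("node 1: private %lu, shared %lu\n",
	       faults[task_faults_idx(1, 1)], faults[task_faults_idx(1, 0)]);
	return 0;
}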

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  4 ++--
 kernel/sched/fair.c   | 32 ++++++++++++++++++++++----------
 mm/huge_memory.c      |  7 ++++---
 mm/memory.c           |  9 ++++++---
 4 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 82a6136..a41edea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1600,10 +1600,10 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages, bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99951a8..490e601 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -833,6 +833,11 @@ find_idlest_cpu_node(int this_cpu, int nid)
 	return idlest_cpu;
 }
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -849,13 +854,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
+
+			/* Decay existing window and copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
 
-		faults = p->numa_faults[nid];
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -887,24 +898,25 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
 
 	if (!sched_feat_numa(NUMA))
 		return;
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
-		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
+		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -918,7 +930,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults_buffer[node] += pages << migrated;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..7cd7114 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int target_nid;
+	int target_nid, last_nid;
 	int current_nid = -1;
 	bool migrated;
 
@@ -1307,6 +1307,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
+	last_nid = page_nid_last(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		put_page(page);
@@ -1332,7 +1333,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!migrated)
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
 	return 0;
 
 check_same:
@@ -1347,7 +1348,7 @@ clear_pmdnuma:
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+		task_numa_fault(last_nid, current_nid, HPAGE_PMD_NR, false);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index ba94dec..c28bf52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int current_nid = -1, last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3566,6 +3566,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
+	last_nid = page_nid_last(page);
 	current_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3586,7 +3587,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+		task_numa_fault(last_nid, current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3602,6 +3603,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3654,6 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * migrated to.
 		 */
 		curr_nid = local_nid;
+		last_nid = page_nid_last(page);
 		target_nid = numa_migrate_prep(page, vma, addr,
 					       page_to_nid(page));
 		if (target_nid == -1) {
@@ -3666,7 +3669,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		migrated = migrate_misplaced_page(page, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		task_numa_fault(last_nid, curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
@ 2013-06-26 14:38   ` Mel Gorman
  0 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared. This would require
that the last task that accessed a page for a hinting fault would be
recorded which would increase the size of struct page. Instead this patch
approximates private pages by assuming that faults that pass the two-stage
filter are private pages and all others are shared. The preferred NUMA
node is then selected based on where the maximum number of approximately
private faults were measured.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  4 ++--
 kernel/sched/fair.c   | 32 ++++++++++++++++++++++----------
 mm/huge_memory.c      |  7 ++++---
 mm/memory.c           |  9 ++++++---
 4 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 82a6136..a41edea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1600,10 +1600,10 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages, bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99951a8..490e601 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -833,6 +833,11 @@ find_idlest_cpu_node(int this_cpu, int nid)
 	return idlest_cpu;
 }
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -849,13 +854,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
+
+			/* Decay existing window and copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
 
-		faults = p->numa_faults[nid];
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -887,24 +898,25 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
 
 	if (!sched_feat_numa(NUMA))
 		return;
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
-		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
+		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -918,7 +930,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults_buffer[node] += pages << migrated;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..7cd7114 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int target_nid;
+	int target_nid, last_nid;
 	int current_nid = -1;
 	bool migrated;
 
@@ -1307,6 +1307,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
+	last_nid = page_nid_last(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		put_page(page);
@@ -1332,7 +1333,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!migrated)
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
 	return 0;
 
 check_same:
@@ -1347,7 +1348,7 @@ clear_pmdnuma:
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+		task_numa_fault(last_nid, current_nid, HPAGE_PMD_NR, false);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index ba94dec..c28bf52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int current_nid = -1, last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3566,6 +3566,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
+	last_nid = page_nid_last(page);
 	current_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3586,7 +3587,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+		task_numa_fault(last_nid, current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3602,6 +3603,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3654,6 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * migrated to.
 		 */
 		curr_nid = local_nid;
+		last_nid = page_nid_last(page);
 		target_nid = numa_migrate_prep(page, vma, addr,
 					       page_to_nid(page));
 		if (target_nid == -1) {
@@ -3666,7 +3669,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		migrated = migrate_misplaced_page(page, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		task_numa_fault(last_nid, curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 8/8] sched: Increase NUMA PTE scanning when a new preferred node is selected
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-26 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-26 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

The NUMA PTE scan is currently reset every sysctl_numa_balancing_scan_period_reset
milliseconds to catch phase changes. This is a crude hammer and it is clearly
visible in graphs when the PTE scanner resets even though the workload is
already balanced. This patch removes the periodic reset and instead increases
the scan rate when the preferred node is updated while the task is running on
that node, so the placement decision is rechecked sooner. In the optimistic
expectation that placement decisions will be correct, the maximum period
between scans is also increased to reduce the overhead of automatic NUMA
balancing.
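
As a rough stand-alone illustration (not part of the patch; the 800ms starting
period is an assumed value, the 100ms floor mirrors
numa_balancing_scan_period_min_ms, and the 100*50 -> 100*600 cap change is
taken from the diff below):

#include <stdio.h>

/* Hypothetical user-space sketch of the clamped halving applied to
 * numa_scan_period when a settled task selects a new preferred node. */
static unsigned int speed_up_scan(unsigned int period_ms)
{
	const unsigned int min_ms = 100;

	period_ms >>= 1;
	return period_ms < min_ms ? min_ms : period_ms;
}

int main(void)
{
	unsigned int period = 800;	/* assumed current scan period in ms */
	int i;

	for (i = 0; i < 4; i++) {
		period = speed_up_scan(period);
		printf("scan period now %u ms\n", period); /* 400, 200, 100, 100 */
	}
	return 0;
}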

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 +++--------
 include/linux/mm_types.h        |  3 ---
 include/linux/sched/sysctl.h    |  1 -
 kernel/sched/core.c             |  1 -
 kernel/sched/fair.c             | 26 +++++++++++---------------
 kernel/sysctl.c                 |  7 -------
 6 files changed, 14 insertions(+), 35 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 246b128..a275042 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -373,15 +373,13 @@ guarantee. If the target workload is already bound to NUMA nodes then this
 feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
-numa_balancing_settle_count sysctls.
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
 
 ==============================================================
 
 numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_size_mb
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
 
 Automatic NUMA balancing scans tasks address space and unmaps pages to
 detect if pages are properly placed or if the data should be migrated to a
@@ -416,9 +414,6 @@ effectively controls the minimum scanning rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
 
-numa_balancing_scan_period_reset is a blunt instrument that controls how
-often a tasks scan delay is reset to detect sudden changes in task behaviour.
-
 numa_balancing_settle_count is how many scan periods must complete before
 the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..de70964 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -421,9 +421,6 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
-	/* numa_next_reset is when the PTE scanner period will be reset */
-	unsigned long numa_next_reset;
-
 	/* Restart point for scanning and setting pte_numa */
 	unsigned long numa_scan_offset;
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..10d16c4f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -47,7 +47,6 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
-extern unsigned int sysctl_numa_balancing_scan_period_reset;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_settle_count;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b4722d6..2d1fd93 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1585,7 +1585,6 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_next_scan = jiffies;
-		p->mm->numa_next_reset = jiffies;
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 490e601..e9bbb70 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -782,8 +782,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * numa task sample period in ms
  */
 unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
-unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -881,6 +880,7 @@ static void task_numa_placement(struct task_struct *p)
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		int preferred_cpu;
+		int old_migrate_seq = p->numa_migrate_seq;
 
 		/*
 		 * If the task is not on the preferred node then find the most
@@ -892,6 +892,15 @@ static void task_numa_placement(struct task_struct *p)
 							     max_nid);
 
 		sched_setnuma(p, max_nid, preferred_cpu);
+
+		/*
+		 * If the preferred node changes frequently then the scan rate
+		 * will be continually high. Mitigate this by increasing the
+		 * scan rate only if the task was settled.
+		 */
+		if (old_migrate_seq >= sysctl_numa_balancing_settle_count)
+			p->numa_scan_period = max(p->numa_scan_period >> 1,
+					sysctl_numa_balancing_scan_period_min);
 	}
 }
 
@@ -985,19 +994,6 @@ void task_numa_work(struct callback_head *work)
 	}
 
 	/*
-	 * Reset the scan period if enough time has gone by. Objective is that
-	 * scanning will be reduced if pages are properly placed. As tasks
-	 * can enter different phases this needs to be re-examined. Lacking
-	 * proper tracking of reference behaviour, this blunt hammer is used.
-	 */
-	migrate = mm->numa_next_reset;
-	if (time_after(now, migrate)) {
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
-		next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
-		xchg(&mm->numa_next_reset, next_scan);
-	}
-
-	/*
 	 * Enforce maximal scan/migration frequency..
 	 */
 	migrate = mm->numa_next_scan;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 263486f..1fcbc68 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -373,13 +373,6 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname	= "numa_balancing_scan_period_reset",
-		.data		= &sysctl_numa_balancing_scan_period_reset,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
-	{
 		.procname	= "numa_balancing_scan_period_max_ms",
 		.data		= &sysctl_numa_balancing_scan_period_max,
 		.maxlen		= sizeof(unsigned int),
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 14:52     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 14:52 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> @@ -3897,6 +3907,28 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +/* Returns true if the destination node has incurred more faults */
> +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	int src_nid, dst_nid;
> +
> +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> +		return false;
> +
> +	src_nid = cpu_to_node(env->src_cpu);
> +	dst_nid = cpu_to_node(env->dst_cpu);
> +
> +	if (src_nid == dst_nid)
> +		return false;
> +
> +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> +	    p->numa_preferred_nid == dst_nid)
> +		return true;
> +
> +	return false;
> +}
> +
> +
>  /*
>   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
>   */
> @@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  
>  	/*
>  	 * Aggressive migration if:
> -	 * 1) task is cache cold, or
> -	 * 2) too many balance attempts have failed.
> +	 * 1) destination numa is preferred
> +	 * 2) task is cache cold, or
> +	 * 3) too many balance attempts have failed.
>  	 */
>  
> +	if (migrate_improves_locality(p, env))
> +		return 1;
> +
>  	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
>  	if (!tsk_cache_hot ||
>  		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {

Should we not also do the reverse; make it harder to worsen locality?

Similar to the task_hot() thing: do not allow migrating a task at a low
nr_balance_failed count when it makes the locality worse.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 14:53     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 14:53 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> This patch favours moving tasks towards the preferred NUMA node when
> it has just been selected. Ideally this is self-reinforcing as the
> longer the the task runs on that node, the more faults it should incur
> causing task_numa_placement to keep the task running on that node. In
> reality a big weakness is that the nodes CPUs can be overloaded and it
> would be more effficient to queue tasks on an idle node and migrate to
> the new node. This would require additional smarts in the balancer so
> for now the balancer will simply prefer to place the task on the
> preferred node for a tunable number of PTE scans.

This changelog fails to mention why you're adding the settle stuff in
this patch.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 14:54     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 14:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:05PM +0100, Mel Gorman wrote:
> +static int
> +find_idlest_cpu_node(int this_cpu, int nid)
> +{
> +	unsigned long load, min_load = ULONG_MAX;
> +	int i, idlest_cpu = this_cpu;
> +
> +	BUG_ON(cpu_to_node(this_cpu) == nid);
> +
> +	for_each_cpu(i, cpumask_of_node(nid)) {
> +		load = weighted_cpuload(i);
> +
> +		if (load < min_load) {
> +			struct task_struct *p;
> +
> +			/* Do not preempt a task running on its preferred node */
> +			struct rq *rq = cpu_rq(i);
> +			local_irq_disable();
> +			raw_spin_lock(&rq->lock);

raw_spin_lock_irq() ?

> +			p = rq->curr;
> +			if (p->numa_preferred_nid != nid) {
> +				min_load = load;
> +				idlest_cpu = i;
> +			}
> +			raw_spin_unlock(&rq->lock);
> +			local_irq_disable();
> +		}
> +	}
> +
> +	return idlest_cpu;
> +}
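
For clarity, a hedged kernel-style sketch of the quoted function using the
combined primitive being suggested (based only on the hunk above; note that
raw_spin_unlock_irq() also re-enables interrupts, which the second
local_irq_disable() in the quoted code presumably intended to be):

static int find_idlest_cpu_node_sketch(int this_cpu, int nid)
{
	unsigned long load, min_load = ULONG_MAX;
	int i, idlest_cpu = this_cpu;

	BUG_ON(cpu_to_node(this_cpu) == nid);

	for_each_cpu(i, cpumask_of_node(nid)) {
		load = weighted_cpuload(i);

		if (load < min_load) {
			struct rq *rq = cpu_rq(i);

			/* Disable IRQs and take the runqueue lock in one step */
			raw_spin_lock_irq(&rq->lock);
			/* Do not preempt a task running on its preferred node */
			if (rq->curr->numa_preferred_nid != nid) {
				min_load = load;
				idlest_cpu = i;
			}
			raw_spin_unlock_irq(&rq->lock);	/* re-enables IRQs */
		}
	}

	return idlest_cpu;
}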

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 14:56     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 14:56 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:06PM +0100, Mel Gorman wrote:
> +void task_numa_fault(int last_nid, int node, int pages, bool migrated)
>  {
>  	struct task_struct *p = current;
> +	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
>  
>  	if (!sched_feat_numa(NUMA))
>  		return;
>  
>  	/* Allocate buffer to track faults on a per-node basis */
>  	if (unlikely(!p->numa_faults)) {
> -		int size = sizeof(*p->numa_faults) * nr_node_ids;
> +		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
>  
>  		/* numa_faults and numa_faults_buffer share the allocation */
> -		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
> +		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
>  		if (!p->numa_faults)
>  			return;

So you need a buffer 2x the size in total; but you're now allocating
a buffer 4x larger than before.

Isn't doubling size alone sufficient?
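
For reference, a stand-alone sketch of the arithmetic behind this question
(nr_node_ids = 4 is an arbitrary value for the example; the two-counters-per-node
layout is taken from the quoted hunk):

#include <stdio.h>

int main(void)
{
	unsigned long nr_node_ids = 4;			/* assumed for illustration */
	unsigned long entry = sizeof(unsigned long);	/* sizeof(*p->numa_faults) */
	unsigned long size = entry * 2 * nr_node_ids;	/* private + shared per node */

	/* numa_faults and numa_faults_buffer share one allocation, so two
	 * arrays of 2 * nr_node_ids counters are all that is required. */
	printf("needed:    %lu bytes (size * 2)\n", size * 2);
	/* The quoted hunk allocates twice that much. */
	printf("allocated: %lu bytes (size * 4)\n", size * 4);
	return 0;
}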

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-27 14:59   ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 14:59 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:37:59PM +0100, Mel Gorman wrote:
> It's several months overdue and everything was quiet after 3.8 came out
> but I recently had a chance to revisit automatic NUMA balancing for a few
> days. I looked at basic scheduler integration resulting in the following
> small series. Much of the following is heavily based on the numacore series
> which in itself takes part of the autonuma series from back in November. In
> particular it borrows heavily from Peter Ziljstra's work in "sched, numa,
> mm: Add adaptive NUMA affinity support" but deviates too much to preserve
> Signed-off-bys. As before, if the relevant authors are ok with it I'll
> add Signed-off-bys (or add them yourselves if you pick the patches up).
> 
> This is still far from complete and there are known performance gaps between
> this and manual binding where possible and depending on the workload between
> it and interleaving when hard bindings are not an option.  As before,
> the intention is not to complete the work but to incrementally improve
> mainline and preserve bisectability for any bug reports that crop up. This
> will allow us to validate each step and keep reviewer stress to a minimum.

Yah..

Except for the few things I've already replied to; and a very strong
urge to run:

  sed -e 's/NUMA_BALANCE/SCHED_NUMA/g' -e 's/numa_balance/sched_numa/'

on both the tree and these patches I'm all for merging this.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 15:57     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 15:57 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:01PM +0100, Mel Gorman wrote:
> @@ -503,6 +503,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
>  #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
>  #define raw_rq()		(&__raw_get_cpu_var(runqueues))
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +extern void sched_setnuma(struct task_struct *p, int node, int shared);

Stray line; you're introducing that function later with a different
signature.

> +static inline void task_numa_free(struct task_struct *p)
> +{
> +	kfree(p->numa_faults);
> +}
> +#else /* CONFIG_NUMA_BALANCING */
> +static inline void task_numa_free(struct task_struct *p)
> +{
> +}
> +#endif /* CONFIG_NUMA_BALANCING */
> +
>  #ifdef CONFIG_SMP
>  
>  #define rcu_dereference_check_sched_domain(p) \
> -- 
> 1.8.1.4
> 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 16:01     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 16:01 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> @@ -3897,6 +3907,28 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +/* Returns true if the destination node has incurred more faults */
> +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	int src_nid, dst_nid;
> +
> +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> +		return false;
> +
> +	src_nid = cpu_to_node(env->src_cpu);
> +	dst_nid = cpu_to_node(env->dst_cpu);
> +
> +	if (src_nid == dst_nid)
> +		return false;
> +
> +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> +	    p->numa_preferred_nid == dst_nid)
> +		return true;
> +
> +	return false;
> +}
> +

This references ->numa_faults, which is declared under NUMA_BALANCING
but lacks any such conditionality here.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-27 16:11     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-27 16:11 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> +/* Returns true if the destination node has incurred more faults */
> +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	int src_nid, dst_nid;
> +
> +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> +		return false;
> +
> +	src_nid = cpu_to_node(env->src_cpu);
> +	dst_nid = cpu_to_node(env->dst_cpu);
> +
> +	if (src_nid == dst_nid)
> +		return false;
> +
> +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> +	    p->numa_preferred_nid == dst_nid)
> +		return true;
> +
> +	return false;
> +}

Also, until I just actually _read_ that function, I assumed it would
compare p->numa_faults[src_nid] and p->numa_faults[dst_nid], because
even when dst_nid isn't the preferred nid it might still have more
pages than where we currently are.

Idem with the proposed migrate_degrades_locality().

Something like so I suppose

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3969,6 +3969,7 @@ task_hot(struct task_struct *p, u64 now,
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
 /* Returns true if the destination node has incurred more faults */
 static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 {
@@ -3983,13 +3984,50 @@ static bool migrate_improves_locality(st
 	if (src_nid == dst_nid)
 		return false;
 
-	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
-	    p->numa_preferred_nid == dst_nid)
+	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (p->numa_preferred_nid == dst_nid)
+		return true;
+
+	if (p->numa_faults[src_nid] < p->numa_faults[dst_nid])
+		return true;
+
+	return false;
+}
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid)
+		return false;
+
+	if (p->numa_faults[src_nid] > p->numa_faults[dst_nid])
 		return true;
 
 	return false;
 }
 
+#else
+
+static inline bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	return false;
+}
+
+static inline bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	return false;
+}
+
+#endif /* CONFIG_NUMA_BALANCING */
 
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
@@ -4055,8 +4093,10 @@ int can_migrate_task(struct task_struct
 		return 1;
 
 	tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+	if (!tsk_cache_hot)
+		tsk_cache_hot = migrate_degrades_locality(p, env);
 	if (!tsk_cache_hot ||
-		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
+	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 
 		if (tsk_cache_hot) {
 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-28  6:08     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28  6:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:01]:

> This patch tracks what nodes numa hinting faults were incurred on.  Greater
> weight is given if the pages were to be migrated on the understanding
> that such faults cost significantly more. If a task has paid the cost to
> migrating data to that node then in the future it would be preferred if the
> task did not migrate the data again unnecessarily. This information is later
> used to schedule a task on the node incurring the most NUMA hinting faults.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/sched.h |  2 ++
>  kernel/sched/core.c   |  3 +++
>  kernel/sched/fair.c   | 12 +++++++++++-
>  kernel/sched/sched.h  | 12 ++++++++++++
>  4 files changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e692a02..72861b4 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1505,6 +1505,8 @@ struct task_struct {
>  	unsigned int numa_scan_period;
>  	u64 node_stamp;			/* migration stamp  */
>  	struct callback_head numa_work;
> +
> +	unsigned long *numa_faults;
>  #endif /* CONFIG_NUMA_BALANCING */
>  
>  	struct rcu_head rcu;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 67d0465..f332ec0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
>  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
>  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
>  	p->numa_work.next = &p->numa_work;
> +	p->numa_faults = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
>  }
>  
> @@ -1853,6 +1854,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>  	if (mm)
>  		mmdrop(mm);
>  	if (unlikely(prev_state == TASK_DEAD)) {
> +		task_numa_free(prev);
> +
>  		/*
>  		 * Remove function-return probe instances associated with this
>  		 * task and put them back on the free list.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a33e59..904fd6f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
>  	if (!sched_feat_numa(NUMA))
>  		return;
>  
> -	/* FIXME: Allocate task-specific structure for placement policy here */
> +	/* Allocate buffer to track faults on a per-node basis */
> +	if (unlikely(!p->numa_faults)) {
> +		int size = sizeof(*p->numa_faults) * nr_node_ids;
> +
> +		p->numa_faults = kzalloc(size, GFP_KERNEL);
> +		if (!p->numa_faults)
> +			return;
> +	}
>  
>  	/*
>  	 * If pages are properly placed (did not migrate) then scan slower.
> @@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
>  			p->numa_scan_period + jiffies_to_msecs(10));
>  
>  	task_numa_placement(p);
> +
> +	/* Record the fault, double the weight if pages were migrated */
> +	p->numa_faults[node] += pages << migrated;


Why are we doing this after the placement?
I mean, we should probably be doing this in task_numa_placement().

Since doubling the pages can have an effect on the preferred node, if we
do it here, won't we end up in a case where the numa_faults count on one
node is actually higher but that node does not end up as the preferred
node?
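
A minimal sketch of the reordering being suggested, reusing the names from
the quoted hunk (whether placement should see the doubled weight is exactly
the open question):

	/* Record the fault first, doubling the weight if the pages were
	 * migrated, so that task_numa_placement() sees the updated counts
	 * when it selects the preferred node. */
	p->numa_faults[node] += pages << migrated;

	task_numa_placement(p);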

>  }
>  
>  static void reset_ptenuma_scan(struct task_struct *p)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index cc03cfd..9c26d88 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -503,6 +503,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
>  #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
>  #define raw_rq()		(&__raw_get_cpu_var(runqueues))
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +extern void sched_setnuma(struct task_struct *p, int node, int shared);
> +static inline void task_numa_free(struct task_struct *p)
> +{
> +	kfree(p->numa_faults);
> +}
> +#else /* CONFIG_NUMA_BALANCING */
> +static inline void task_numa_free(struct task_struct *p)
> +{
> +}
> +#endif /* CONFIG_NUMA_BALANCING */
> +
>  #ifdef CONFIG_SMP
>  
>  #define rcu_dereference_check_sched_domain(p) \
> -- 
> 1.8.1.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 3/8] sched: Select a preferred node with the most numa hinting faults
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-28  6:14     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28  6:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:02]:

> This patch selects a preferred node for a task to run on based on the
> NUMA hinting faults. This information is later used to migrate tasks
> towards the node during balancing.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/sched.h |  1 +
>  kernel/sched/core.c   | 10 ++++++++++
>  kernel/sched/fair.c   | 16 ++++++++++++++--
>  kernel/sched/sched.h  |  2 +-
>  4 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 72861b4..ba46a64 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1507,6 +1507,7 @@ struct task_struct {
>  	struct callback_head numa_work;
>  
>  	unsigned long *numa_faults;
> +	int numa_preferred_nid;
>  #endif /* CONFIG_NUMA_BALANCING */
>  
>  	struct rcu_head rcu;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f332ec0..019baae 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
>  	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
>  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
>  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> +	p->numa_preferred_nid = -1;

Though we may not want to inherit the fault statistics, tasks generally
share pages with their siblings and parent. So would it make sense to
inherit the preferred node?
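
A sketch of the alternative being raised, assuming it were done in
__sched_fork() in place of the quoted initialisation (hypothetical; the
patch as posted starts every task at -1):

	/* Child starts out preferring the parent's node rather than none;
	 * "current" is the forking parent at this point. */
	p->numa_preferred_nid = current->numa_preferred_nid;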

>  	p->numa_work.next = &p->numa_work;
>  	p->numa_faults = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
> @@ -5713,6 +5714,15 @@ enum s_alloc {
>  
>  struct sched_domain_topology_level;
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +
> +/* Set a tasks preferred NUMA node */
> +void sched_setnuma(struct task_struct *p, int nid)
> +{
> +	p->numa_preferred_nid = nid;
> +}
> +#endif /* CONFIG_NUMA_BALANCING */
> +
>  typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
>  typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 904fd6f..f8c3f61 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -793,7 +793,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
>  
>  static void task_numa_placement(struct task_struct *p)
>  {
> -	int seq;
> +	int seq, nid, max_nid = 0;
> +	unsigned long max_faults = 0;
>  
>  	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
>  		return;
> @@ -802,7 +803,18 @@ static void task_numa_placement(struct task_struct *p)
>  		return;
>  	p->numa_scan_seq = seq;
>  
> -	/* FIXME: Scheduling placement policy hints go here */
> +	/* Find the node with the highest number of faults */
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		unsigned long faults = p->numa_faults[nid];
> +		p->numa_faults[nid] >>= 1;
> +		if (faults > max_faults) {
> +			max_faults = faults;
> +			max_nid = nid;
> +		}
> +	}
> +
> +	if (max_faults && max_nid != p->numa_preferred_nid)
> +		sched_setnuma(p, max_nid);
>  }
>  
>  /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9c26d88..65a0cf0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -504,7 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
>  #define raw_rq()		(&__raw_get_cpu_var(runqueues))
>  
>  #ifdef CONFIG_NUMA_BALANCING
> -extern void sched_setnuma(struct task_struct *p, int node, int shared);
> +extern void sched_setnuma(struct task_struct *p, int nid);
>  static inline void task_numa_free(struct task_struct *p)
>  {
>  	kfree(p->numa_faults);
> -- 
> 1.8.1.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 4/8] sched: Update NUMA hinting faults once per scan
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-28  6:32     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28  6:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:03]:

> NUMA hinting fault counts and placement decisions are both recorded in the
> same array, which distorts the samples in an unpredictable fashion. The values
> linearly accumulate during the scan and then decay, creating a sawtooth-like
> pattern in the per-node counts. It also means that placement decisions are
> time sensitive. At best it means that it is very difficult to state that
> the buffer holds a decaying average of past faulting behaviour. At worst,
> it can confuse the load balancer if it sees one node with an artificially high
> count due to very recent faulting activity and may create a bouncing effect.
> 
> This patch adds a second array. numa_faults stores the historical data
> which is used for placement decisions. numa_faults_buffer holds the
> fault activity during the current scan window. When the scan completes,
> numa_faults decays and the values from numa_faults_buffer are copied
> across.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/sched.h | 13 +++++++++++++
>  kernel/sched/core.c   |  1 +
>  kernel/sched/fair.c   | 16 +++++++++++++---
>  3 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ba46a64..42f9818 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1506,7 +1506,20 @@ struct task_struct {
>  	u64 node_stamp;			/* migration stamp  */
>  	struct callback_head numa_work;
>  
> +	/*
> +	 * Exponential decaying average of faults on a per-node basis.
> +	 * Scheduling placement decisions are made based on these counts.
> +	 * The values remain static for the duration of a PTE scan
> +	 */
>  	unsigned long *numa_faults;
> +
> +	/*
> +	 * numa_faults_buffer records faults per node during the current
> +	 * scan window. When the scan completes, the counts in numa_faults
> +	 * decay and these values are copied.
> +	 */
> +	unsigned long *numa_faults_buffer;
> +
>  	int numa_preferred_nid;
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 019baae..b00b81a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1596,6 +1596,7 @@ static void __sched_fork(struct task_struct *p)
>  	p->numa_preferred_nid = -1;
>  	p->numa_work.next = &p->numa_work;
>  	p->numa_faults = NULL;
> +	p->numa_faults_buffer = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
>  }
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f8c3f61..5893399 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -805,8 +805,14 @@ static void task_numa_placement(struct task_struct *p)
>  
>  	/* Find the node with the highest number of faults */
>  	for (nid = 0; nid < nr_node_ids; nid++) {
> -		unsigned long faults = p->numa_faults[nid];
> +		unsigned long faults;
> +
> +		/* Decay existing window and copy faults since last scan */
>  		p->numa_faults[nid] >>= 1;
> +		p->numa_faults[nid] += p->numa_faults_buffer[nid];
> +		p->numa_faults_buffer[nid] = 0;
> +
> +		faults = p->numa_faults[nid];
>  		if (faults > max_faults) {
>  			max_faults = faults;
>  			max_nid = nid;
> @@ -831,9 +837,13 @@ void task_numa_fault(int node, int pages, bool migrated)
>  	if (unlikely(!p->numa_faults)) {
>  		int size = sizeof(*p->numa_faults) * nr_node_ids;
>  
> -		p->numa_faults = kzalloc(size, GFP_KERNEL);
> +		/* numa_faults and numa_faults_buffer share the allocation */
> +		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);

Instead of allocating a buffer to hold the current faults, can't we pass
the number of pages and the node information (and probably the migrate
flag) to task_numa_placement()?

Why should task_struct be passed as an argument to task_numa_placement()?
It seems it will always be current.
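
For illustration, roughly the interface being suggested here -- a
hypothetical sketch only (the posted patch keeps the separate
numa_faults_buffer[] instead):

	/* Hypothetical: account the fault in the placement path directly */
	static void task_numa_placement(struct task_struct *p, int node,
					int pages, bool migrated)
	{
		/* record the fault, double the weight if pages were migrated */
		p->numa_faults[node] += pages << migrated;

		/* existing once-per-scan decay and preferred-node selection
		 * would follow here */
	}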

>  		if (!p->numa_faults)
>  			return;
> +
> +		BUG_ON(p->numa_faults_buffer);
> +		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
>  	}
>  
>  	/*
> @@ -847,7 +857,7 @@ void task_numa_fault(int node, int pages, bool migrated)
>  	task_numa_placement(p);
>  
>  	/* Record the fault, double the weight if pages were migrated */
> -	p->numa_faults[node] += pages << migrated;
> +	p->numa_faults_buffer[node] += pages << migrated;
>  }
>  
>  static void reset_ptenuma_scan(struct task_struct *p)
> -- 
> 1.8.1.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-28  7:00     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28  7:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:06]:

> Ideally it would be possible to distinguish between NUMA hinting faults
> that are private to a task and those that are shared. This would require
> recording the last task that accessed a page for a hinting fault, which
> would increase the size of struct page. Instead this patch approximates
> private pages by assuming that faults that pass the two-stage filter are
> private pages and all others are shared. The preferred NUMA node is then
> selected based on where the maximum number of approximately private
> faults was measured.

Should we consider only private faults for the preferred node?
I would think that if tasks have shared pages then moving all tasks that
share the same pages to the node where those shared pages reside would be
preferred. No? If yes, how does the preferred-node logic help to achieve
that?
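
For reference, a minimal sketch of the layout the quoted patch uses; the
task_faults_idx() helper is taken from the diff below, while
numa_fault_is_private() is only an illustrative name for the priv test in
task_numa_fault():

	/*
	 * Two counters per node:
	 *   numa_faults[2 * nid + 0]  - approximately shared faults
	 *   numa_faults[2 * nid + 1]  - approximately private faults
	 * Only the private counts feed the preferred-node choice so far.
	 */
	static inline int task_faults_idx(int nid, int priv)
	{
		return 2 * nid + priv;
	}

	/* A fault is treated as private when the faulting task is running on
	 * the same node that took the previous fault on this page. */
	static inline int numa_fault_is_private(struct task_struct *p, int last_nid)
	{
		return cpu_to_node(task_cpu(p)) == last_nid;
	}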

> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/sched.h |  4 ++--
>  kernel/sched/fair.c   | 32 ++++++++++++++++++++++----------
>  mm/huge_memory.c      |  7 ++++---
>  mm/memory.c           |  9 ++++++---
>  4 files changed, 34 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 82a6136..a41edea 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1600,10 +1600,10 @@ struct task_struct {
>  #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
>  
>  #ifdef CONFIG_NUMA_BALANCING
> -extern void task_numa_fault(int node, int pages, bool migrated);
> +extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
>  extern void set_numabalancing_state(bool enabled);
>  #else
> -static inline void task_numa_fault(int node, int pages, bool migrated)
> +static inline void task_numa_fault(int last_node, int node, int pages, bool migrated)
>  {
>  }
>  static inline void set_numabalancing_state(bool enabled)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 99951a8..490e601 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -833,6 +833,11 @@ find_idlest_cpu_node(int this_cpu, int nid)
>  	return idlest_cpu;
>  }
>  
> +static inline int task_faults_idx(int nid, int priv)
> +{
> +	return 2 * nid + priv;
> +}
> +
>  static void task_numa_placement(struct task_struct *p)
>  {
>  	int seq, nid, max_nid = 0;
> @@ -849,13 +854,19 @@ static void task_numa_placement(struct task_struct *p)
>  	/* Find the node with the highest number of faults */
>  	for (nid = 0; nid < nr_node_ids; nid++) {
>  		unsigned long faults;
> +		int priv, i;
>  
> -		/* Decay existing window and copy faults since last scan */
> -		p->numa_faults[nid] >>= 1;
> -		p->numa_faults[nid] += p->numa_faults_buffer[nid];
> -		p->numa_faults_buffer[nid] = 0;
> +		for (priv = 0; priv < 2; priv++) {
> +			i = task_faults_idx(nid, priv);
> +
> +			/* Decay existing window and copy faults since last scan */
> +			p->numa_faults[i] >>= 1;
> +			p->numa_faults[i] += p->numa_faults_buffer[i];
> +			p->numa_faults_buffer[i] = 0;
> +		}
>  
> -		faults = p->numa_faults[nid];
> +		/* Find maximum private faults */
> +		faults = p->numa_faults[task_faults_idx(nid, 1)];
>  		if (faults > max_faults) {
>  			max_faults = faults;
>  			max_nid = nid;
> @@ -887,24 +898,25 @@ static void task_numa_placement(struct task_struct *p)
>  /*
>   * Got a PROT_NONE fault for a page on @node.
>   */
> -void task_numa_fault(int node, int pages, bool migrated)
> +void task_numa_fault(int last_nid, int node, int pages, bool migrated)
>  {
>  	struct task_struct *p = current;
> +	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
>  
>  	if (!sched_feat_numa(NUMA))
>  		return;
>  
>  	/* Allocate buffer to track faults on a per-node basis */
>  	if (unlikely(!p->numa_faults)) {
> -		int size = sizeof(*p->numa_faults) * nr_node_ids;
> +		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
>  
>  		/* numa_faults and numa_faults_buffer share the allocation */
> -		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
> +		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
>  		if (!p->numa_faults)
>  			return;
>  
>  		BUG_ON(p->numa_faults_buffer);
> -		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
> +		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
>  	}
>  
>  	/*
> @@ -918,7 +930,7 @@ void task_numa_fault(int node, int pages, bool migrated)
>  	task_numa_placement(p);
>  
>  	/* Record the fault, double the weight if pages were migrated */
> -	p->numa_faults_buffer[node] += pages << migrated;
> +	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
>  }
>  
>  static void reset_ptenuma_scan(struct task_struct *p)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e2f7f5aa..7cd7114 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  {
>  	struct page *page;
>  	unsigned long haddr = addr & HPAGE_PMD_MASK;
> -	int target_nid;
> +	int target_nid, last_nid;
>  	int current_nid = -1;
>  	bool migrated;
>  
> @@ -1307,6 +1307,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (current_nid == numa_node_id())
>  		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
>  
> +	last_nid = page_nid_last(page);
>  	target_nid = mpol_misplaced(page, vma, haddr);
>  	if (target_nid == -1) {
>  		put_page(page);
> @@ -1332,7 +1333,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (!migrated)
>  		goto check_same;
>  
> -	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
> +	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
>  	return 0;
>  
>  check_same:
> @@ -1347,7 +1348,7 @@ clear_pmdnuma:
>  out_unlock:
>  	spin_unlock(&mm->page_table_lock);
>  	if (current_nid != -1)
> -		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
> +		task_numa_fault(last_nid, current_nid, HPAGE_PMD_NR, false);
>  	return 0;
>  }
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index ba94dec..c28bf52 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  {
>  	struct page *page = NULL;
>  	spinlock_t *ptl;
> -	int current_nid = -1;
> +	int current_nid = -1, last_nid;
>  	int target_nid;
>  	bool migrated = false;
>  
> @@ -3566,6 +3566,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		return 0;
>  	}
>  
> +	last_nid = page_nid_last(page);
>  	current_nid = page_to_nid(page);
>  	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
>  	pte_unmap_unlock(ptep, ptl);
> @@ -3586,7 +3587,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  out:
>  	if (current_nid != -1)
> -		task_numa_fault(current_nid, 1, migrated);
> +		task_numa_fault(last_nid, current_nid, 1, migrated);
>  	return 0;
>  }
>  
> @@ -3602,6 +3603,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	spinlock_t *ptl;
>  	bool numa = false;
>  	int local_nid = numa_node_id();
> +	int last_nid;
>  
>  	spin_lock(&mm->page_table_lock);
>  	pmd = *pmdp;
> @@ -3654,6 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * migrated to.
>  		 */
>  		curr_nid = local_nid;
> +		last_nid = page_nid_last(page);
>  		target_nid = numa_migrate_prep(page, vma, addr,
>  					       page_to_nid(page));
>  		if (target_nid == -1) {
> @@ -3666,7 +3669,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		migrated = migrate_misplaced_page(page, target_nid);
>  		if (migrated)
>  			curr_nid = target_nid;
> -		task_numa_fault(curr_nid, 1, migrated);
> +		task_numa_fault(last_nid, curr_nid, 1, migrated);
>  
>  		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>  	}
> -- 
> 1.8.1.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-26 14:38   ` Mel Gorman
@ 2013-06-28  8:11     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28  8:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:04]:

> This patch favours moving tasks towards the preferred NUMA node when
> it has just been selected. Ideally this is self-reinforcing as the
> longer the task runs on that node, the more faults it should incur,
> causing task_numa_placement to keep the task running on that node. In
> reality a big weakness is that the node's CPUs can be overloaded and it
> would be more efficient to queue tasks on an idle node and migrate to
> the new node. This would require additional smarts in the balancer, so
> for now the balancer will simply prefer to place the task on the
> preferred node for a tunable number of PTE scans.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  Documentation/sysctl/kernel.txt |  8 +++++++-
>  include/linux/sched.h           |  1 +
>  kernel/sched/core.c             |  4 +++-
>  kernel/sched/fair.c             | 40 ++++++++++++++++++++++++++++++++++++++--
>  kernel/sysctl.c                 |  7 +++++++
>  5 files changed, 56 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 0fe678c..246b128 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -374,7 +374,8 @@ feature should be disabled. Otherwise, if the system overhead from the
>  feature is too high then the rate the kernel samples for NUMA hinting
>  faults may be controlled by the numa_balancing_scan_period_min_ms,
>  numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
> -numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
> +numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
> +numa_balancing_settle_count sysctls.
>  
>  ==============================================================
>  
> @@ -418,6 +419,11 @@ scanned for a given scan.
>  numa_balancing_scan_period_reset is a blunt instrument that controls how
>  often a tasks scan delay is reset to detect sudden changes in task behaviour.
>  
> +numa_balancing_settle_count is how many scan periods must complete before
> +the scheduler stops pushing the task towards a preferred node. This
> +gives the scheduler a chance to place the task on an alternative node if the
> +preferred node is overloaded.
> +
>  ==============================================================
>  
>  osrelease, ostype & version:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 42f9818..82a6136 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -815,6 +815,7 @@ enum cpu_idle_type {
>  #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
>  #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
>  #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
> +#define SD_NUMA			0x4000	/* cross-node balancing */
>  
>  extern int __weak arch_sd_sibiling_asym_packing(void);
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b00b81a..ba9470e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1591,7 +1591,7 @@ static void __sched_fork(struct task_struct *p)
>  
>  	p->node_stamp = 0ULL;
>  	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> -	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> +	p->numa_migrate_seq = 0;
>  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
>  	p->numa_preferred_nid = -1;
>  	p->numa_work.next = &p->numa_work;
> @@ -5721,6 +5721,7 @@ struct sched_domain_topology_level;
>  void sched_setnuma(struct task_struct *p, int nid)
>  {
>  	p->numa_preferred_nid = nid;
> +	p->numa_migrate_seq = 0;
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> @@ -6150,6 +6151,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
>  					| 0*SD_SHARE_PKG_RESOURCES
>  					| 1*SD_SERIALIZE
>  					| 0*SD_PREFER_SIBLING
> +					| 1*SD_NUMA
>  					| sd_local_flags(level)
>  					,
>  		.last_balance		= jiffies,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5893399..5e7f728 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -791,6 +791,15 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
>  /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
>  unsigned int sysctl_numa_balancing_scan_delay = 1000;
>  
> +/*
> + * Once a preferred node is selected the scheduler balancer will prefer moving
> + * a task to that node for sysctl_numa_balancing_settle_count number of PTE
> + * scans. This will give the process the chance to accumulate more faults on
> + * the preferred node but still allow the scheduler to move the task again if
> + * the node's CPUs are overloaded.
> + */
> +unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
> +
>  static void task_numa_placement(struct task_struct *p)
>  {
>  	int seq, nid, max_nid = 0;
> @@ -802,6 +811,7 @@ static void task_numa_placement(struct task_struct *p)
>  	if (p->numa_scan_seq == seq)
>  		return;
>  	p->numa_scan_seq = seq;
> +	p->numa_migrate_seq++;
>  
>  	/* Find the node with the highest number of faults */
>  	for (nid = 0; nid < nr_node_ids; nid++) {
> @@ -3897,6 +3907,28 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +/* Returns true if the destination node has incurred more faults */
> +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	int src_nid, dst_nid;
> +
> +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> +		return false;
> +
> +	src_nid = cpu_to_node(env->src_cpu);
> +	dst_nid = cpu_to_node(env->dst_cpu);
> +
> +	if (src_nid == dst_nid)
> +		return false;
> +
> +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&

Let's say numa_migrate_seq is greater than settle_count but the task is
running on the wrong node; shouldn't this be taken as a good opportunity
to move the task?
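
For illustration, the relaxation being asked about might look roughly like
this inside migrate_improves_locality() -- hypothetical, not what the
posted patch does:

	/* Hypothetical: keep pulling towards the preferred node even after
	 * the task has settled, relying on the normal balancer paths to deal
	 * with overload on that node.
	 */
	if (p->numa_preferred_nid == dst_nid)
		return true;

	return false;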

> +	    p->numa_preferred_nid == dst_nid)
> +		return true;
> +
> +	return false;
> +}
> +
> +
>  /*
>   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
>   */
> @@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  
>  	/*
>  	 * Aggressive migration if:
> -	 * 1) task is cache cold, or
> -	 * 2) too many balance attempts have failed.
> +	 * 1) destination numa is preferred
> +	 * 2) task is cache cold, or
> +	 * 3) too many balance attempts have failed.
>  	 */
>  
> +	if (migrate_improves_locality(p, env))
> +		return 1;

Shouldn't this be under the tsk_cache_hot check?

If the task is cache hot, then we would have to update the corresponding
schedstat metrics.
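
For illustration, one way to fold the locality check into the existing
cache-hot path so the schedstat accounting is preserved -- a hypothetical
rearrangement of the quoted hunk; the schedstat lines follow the existing
code in can_migrate_task():

	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
	if (!tsk_cache_hot || migrate_improves_locality(p, env) ||
	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
#ifdef CONFIG_SCHEDSTATS
		if (tsk_cache_hot) {
			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
			schedstat_inc(p, se.statistics.nr_forced_migrations);
		}
#endif
		return 1;
	}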


> +
>  	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
>  	if (!tsk_cache_hot ||
>  		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index afc1dc6..263486f 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -393,6 +393,13 @@ static struct ctl_table kern_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> +	{
> +		.procname       = "numa_balancing_settle_count",
> +		.data           = &sysctl_numa_balancing_settle_count,
> +		.maxlen         = sizeof(unsigned int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec,
> +	},
>  #endif /* CONFIG_NUMA_BALANCING */
>  #endif /* CONFIG_SCHED_DEBUG */
>  	{
> -- 
> 1.8.1.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis
  2013-06-28  6:08     ` Srikar Dronamraju
@ 2013-06-28  8:56       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28  8:56 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 11:38:29AM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:01]:
> > @@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
> >  			p->numa_scan_period + jiffies_to_msecs(10));
> >  
> >  	task_numa_placement(p);
> > +
> > +	/* Record the fault, double the weight if pages were migrated */
> > +	p->numa_faults[node] += pages << migrated;
> 
> 
> Why are we doing this after the placement?
> I mean we should probably be doing this in the task_numa_placement,

The placement only does something when we've completed a full scan; this
would then be the first fault of the next scan. Hence we do placement
first so as not to add this first fault of the next scan to
->numa_faults[].

This all gets changed later on when ->numa_faults_curr[] gets
introduced.
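
Spelled out against the hunk quoted above (an annotated restatement, not
new code):

	/* Placement first: it only acts on a scan-sequence change, using the
	 * counts accumulated during the window that just completed. */
	task_numa_placement(p);

	/* Then record this fault, which already belongs to the next scan
	 * window; migrated pages count double. */
	p->numa_faults[node] += pages << migrated;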

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 3/8] sched: Select a preferred node with the most numa hinting faults
  2013-06-28  6:14     ` Srikar Dronamraju
@ 2013-06-28  8:59       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28  8:59 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 11:44:28AM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:02]:
> 
> > This patch selects a preferred node for a task to run on based on the
> > NUMA hinting faults. This information is later used to migrate tasks
> > towards the node during balancing.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  include/linux/sched.h |  1 +
> >  kernel/sched/core.c   | 10 ++++++++++
> >  kernel/sched/fair.c   | 16 ++++++++++++++--
> >  kernel/sched/sched.h  |  2 +-
> >  4 files changed, 26 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 72861b4..ba46a64 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1507,6 +1507,7 @@ struct task_struct {
> >  	struct callback_head numa_work;
> >  
> >  	unsigned long *numa_faults;
> > +	int numa_preferred_nid;
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  
> >  	struct rcu_head rcu;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f332ec0..019baae 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
> >  	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> >  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> >  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> > +	p->numa_preferred_nid = -1;
> 
> Though we may not want to inherit faults, tasks generally share pages
> with their siblings and parent. So would it make sense to inherit the
> preferred node?

One of the patches I have locally wipes the numa state on exec(). I
think we want to do that if we're going to think about inheriting stuff.
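
For illustration, such a wipe might look roughly like this -- a
hypothetical sketch (the local patch mentioned above is not posted here);
task_numa_reset() is an invented name and the fields follow the quoted
series:

	static void task_numa_reset(struct task_struct *p)
	{
		p->numa_preferred_nid = -1;
		p->numa_scan_period = sysctl_numa_balancing_scan_delay;
		/* numa_faults and numa_faults_buffer share one allocation */
		kfree(p->numa_faults);
		p->numa_faults = NULL;
		p->numa_faults_buffer = NULL;
	}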



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 4/8] sched: Update NUMA hinting faults once per scan
  2013-06-28  6:32     ` Srikar Dronamraju
@ 2013-06-28  9:01       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28  9:01 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 12:02:33PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:03]:
> > @@ -831,9 +837,13 @@ void task_numa_fault(int node, int pages, bool migrated)
> >  	if (unlikely(!p->numa_faults)) {
> >  		int size = sizeof(*p->numa_faults) * nr_node_ids;
> >  
> > -		p->numa_faults = kzalloc(size, GFP_KERNEL);
> > +		/* numa_faults and numa_faults_buffer share the allocation */
> > +		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
> 
> Instead of allocating a buffer to hold the current faults, can't we pass
> the number of pages and node information (and probably migrated) to
> task_numa_placement()?

I'm afraid I don't get your question; there's more storage required than
just the arguments.

> Why should task_struct be passed as an argument to task_numa_placement()?
> It seems it will always be current.

Customary in these parts -- motivated by the fact that using current
is/can be more expensive than passing an argument.
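
For reference, the layout the quoted hunk implies is one allocation carved
into two halves: the per-node counters used for placement plus a scratch
buffer the scan fills in. A sketch (the surrounding code is not quoted in
this thread, so the exact form is an assumption):

	int size = sizeof(*p->numa_faults) * nr_node_ids;

	/* One allocation, two halves: the first nr_node_ids entries hold the
	 * per-node fault counts used for placement, the second half is the
	 * scratch buffer the current scan accumulates into. */
	p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
	if (!p->numa_faults)
		return;
	p->numa_faults_buffer = p->numa_faults + nr_node_ids;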

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28  8:11     ` Srikar Dronamraju
@ 2013-06-28  9:04       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28  9:04 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 01:41:20PM +0530, Srikar Dronamraju wrote:

Please trim your replies.

> > +/* Returns true if the destination node has incurred more faults */
> > +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> > +{
> > +	int src_nid, dst_nid;
> > +
> > +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> > +		return false;
> > +
> > +	src_nid = cpu_to_node(env->src_cpu);
> > +	dst_nid = cpu_to_node(env->dst_cpu);
> > +
> > +	if (src_nid == dst_nid)
> > +		return false;
> > +
> > +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> 
> Let's say even if the numa_migrate_seq is greater than settle_count but running
> on a wrong node, then shouldn't this be taken as a good opportunity to
> move the task?

I think that's what it's doing; so this stmt says: if seq is large and
we're trying to move to the 'right' node, move it now.

> > +	    p->numa_preferred_nid == dst_nid)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> > +
> >  /*
> >   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> >   */
> > @@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >  
> >  	/*
> >  	 * Aggressive migration if:
> > -	 * 1) task is cache cold, or
> > -	 * 2) too many balance attempts have failed.
> > +	 * 1) destination numa is preferred
> > +	 * 2) task is cache cold, or
> > +	 * 3) too many balance attempts have failed.
> >  	 */
> >  
> > +	if (migrate_improves_locality(p, env))
> > +		return 1;
> 
> Shouldnt this be under tsk_cache_hot check?
> 
> If the task is cache hot, then we would have to update the corresponding  schedstat
> metrics.

No; you want migrate_degrades_locality() to be like task_hot(). You want
to _always_ migrate tasks towards better locality irrespective of local
cache hotness.
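
A minimal sketch of the symmetric helper being described (its shape here is
an assumption; Peter posts his own version later in the thread): it mirrors
migrate_improves_locality() but fires when a move would take the task away
from its preferred node, so the balancer can treat that like cache hotness.

	static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
	{
		int src_nid, dst_nid;

		if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
			return false;

		src_nid = cpu_to_node(env->src_cpu);
		dst_nid = cpu_to_node(env->dst_cpu);

		if (src_nid == dst_nid)
			return false;

		/* Moving off the preferred node while still settling */
		return p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
		       p->numa_preferred_nid == src_nid;
	}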

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-28  7:00     ` Srikar Dronamraju
@ 2013-06-28  9:36       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28  9:36 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 12:30:27PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:06]:
> 
> > Ideally it would be possible to distinguish between NUMA hinting faults
> > that are private to a task and those that are shared. This would require
> > that the last task that accessed a page for a hinting fault would be
> > recorded which would increase the size of struct page. Instead this patch
> > approximates private pages by assuming that faults that pass the two-stage
> > filter are private pages and all others are shared. The preferred NUMA
> > node is then selected based on where the maximum number of approximately
> > private faults were measured.
> 
> Should we consider only private faults for preferred node?

I don't think so; it's optimal for the task to be nearest to most of its pages,
irrespective of whether they are private or shared.

> I would think if tasks have shared pages then moving all tasks that share
> the same pages to a node where the share pages are around would be
> preferred. No? 

Well no; not if there's only 5 shared pages but 1024 private pages.

> If yes, how does the preferred node logic help to achieve
> the above?

There's no packing logic yet...
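
For context, the preferred-node selection being discussed boils down to an
argmax over the per-node fault counters. A minimal sketch of that selection
(task_numa_placement() itself is not quoted in full in this thread, so this
is an approximation, not the posted code):

	static void task_numa_placement(struct task_struct *p)
	{
		int nid, max_nid = -1;
		unsigned long max_faults = 0;

		if (!p->numa_faults)
			return;

		/* Pick the node that has accumulated the most hinting faults. */
		for_each_online_node(nid) {
			unsigned long faults = p->numa_faults[nid];

			if (faults > max_faults) {
				max_faults = faults;
				max_nid = nid;
			}
		}

		if (max_nid != -1)
			p->numa_preferred_nid = max_nid;
	}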

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28  9:04       ` Peter Zijlstra
@ 2013-06-28 10:07         ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28 10:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

> > > +
> > > +
> > >  /*
> > >   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> > >   */
> > > @@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> > >  
> > >  	/*
> > >  	 * Aggressive migration if:
> > > -	 * 1) task is cache cold, or
> > > -	 * 2) too many balance attempts have failed.
> > > +	 * 1) destination numa is preferred
> > > +	 * 2) task is cache cold, or
> > > +	 * 3) too many balance attempts have failed.
> > >  	 */
> > >  
> > > +	if (migrate_improves_locality(p, env))
> > > +		return 1;
> > 
> > Shouldnt this be under tsk_cache_hot check?
> > 
> > If the task is cache hot, then we would have to update the corresponding  schedstat
> > metrics.
> 
> No; you want migrate_degrades_locality() to be like task_hot(). You want
> to _always_ migrate tasks towards better locality irrespective of local
> cache hotness.
> 

Yes, I understand that numa should take priority over cache.
But the schedstats will not be updated to reflect whether the task was hot or
cold.

So let's say the task was cache hot but numa wants it to move; then we
should certainly move it, but we should update the schedstats to record that we
moved a cache-hot task.

Something akin to this.

	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
	if (tsk_cache_hot) {
		if (migrate_improves_locality(p, env) || 
		 	(env->sd->nr_balance_failed > env->sd->cache_nice_tries)) {
#ifdef CONFIG_SCHEDSTATS
			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
			schedstat_inc(p, se.statistics.nr_forced_migrations);
#endif
			return 1;
		}
		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
		return 0;
	}
	return 1;

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-28  9:36       ` Peter Zijlstra
@ 2013-06-28 10:12         ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28 10:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

> > 
> > > Ideally it would be possible to distinguish between NUMA hinting faults
> > > that are private to a task and those that are shared. This would require
> > > that the last task that accessed a page for a hinting fault would be
> > > recorded which would increase the size of struct page. Instead this patch
> > > approximates private pages by assuming that faults that pass the two-stage
> > > filter are private pages and all others are shared. The preferred NUMA
> > > node is then selected based on where the maximum number of approximately
> > > private faults were measured.
> > 
> > Should we consider only private faults for preferred node?
> 
> I don't think so; its optimal for the task to be nearest most of its pages;
> irrespective of whether they be private or shared.

Then the preferred node should have been chosen based on both the
private and shared faults and not just private faults.

> 
> > I would think if tasks have shared pages then moving all tasks that share
> > the same pages to a node where the share pages are around would be
> > preferred. No? 
> 
> Well no; not if there's only 5 shared pages but 1024 private pages.

Yes, agree, but should we try to give the shared pages some additional weightage?

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 3/8] sched: Select a preferred node with the most numa hinting faults
  2013-06-28  8:59       ` Peter Zijlstra
@ 2013-06-28 10:24         ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28 10:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

> > >  
> > >  	struct rcu_head rcu;
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f332ec0..019baae 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
> > >  	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> > >  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> > >  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> > > +	p->numa_preferred_nid = -1;
> > 
> > Though we may not want to inherit faults, I think the tasks generally
> > share pages with their siblings, parent. So will it make sense to
> > inherit the preferred node?
> 
> One of the patches I have locally wipes the numa state on exec(). I
> think we want to do that if we're going to think about inheriting stuff.
> 
> 

Agreed, if we inherit the preferred node, we would have to reset it on exec.
Since we have to reset numa_faults on exec as well, the reset of the
preferred node can go in task_numa_free().
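
A minimal sketch of what that could look like (task_numa_free() is only
referenced, not quoted, in this thread, so the body below is an assumption):

	void task_numa_free(struct task_struct *p)
	{
		/* Drop the per-node fault counters and forget the
		 * preferred node along with them. */
		kfree(p->numa_faults);
		p->numa_faults = NULL;
		p->numa_preferred_nid = -1;
	}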

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28 10:07         ` Srikar Dronamraju
@ 2013-06-28 10:24           ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28 10:24 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 03:37:23PM +0530, Srikar Dronamraju wrote:
> > > > +
> > > > +
> > > >  /*
> > > >   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> > > >   */
> > > > @@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> > > >  
> > > >  	/*
> > > >  	 * Aggressive migration if:
> > > > -	 * 1) task is cache cold, or
> > > > -	 * 2) too many balance attempts have failed.
> > > > +	 * 1) destination numa is preferred
> > > > +	 * 2) task is cache cold, or
> > > > +	 * 3) too many balance attempts have failed.
> > > >  	 */
> > > >  
> > > > +	if (migrate_improves_locality(p, env))
> > > > +		return 1;
> > > 
> > > Shouldnt this be under tsk_cache_hot check?
> > > 
> > > If the task is cache hot, then we would have to update the corresponding  schedstat
> > > metrics.
> > 
> > No; you want migrate_degrades_locality() to be like task_hot(). You want
> > to _always_ migrate tasks towards better locality irrespective of local
> > cache hotness.
> > 
> 
> Yes, I understand that numa should have more priority over cache.
> But the schedstats will not be updated about whether the task was hot or
> cold.
> 
> So lets say the task was cache hot but numa wants it to move, then we
> should certainly move it but we should update the schedstats to mention that we
> moved a cache hot task.
> 
> Something akin to this.
> 
> 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
> 	if (tsk_cache_hot) {
> 		if (migrate_improves_locality(p, env) || 
> 		 	(env->sd->nr_balance_failed > env->sd->cache_nice_tries)) {
> #ifdef CONFIG_SCHEDSTATS
> 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> 			schedstat_inc(p, se.statistics.nr_forced_migrations);
> #endif
> 			return 1;
> 		}
> 		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
> 		return 0;
> 	}
> 	return 1;

Ah right.. ok that might make sense.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-28 10:12         ` Srikar Dronamraju
@ 2013-06-28 10:33           ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28 10:33 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 03:42:45PM +0530, Srikar Dronamraju wrote:
> > > 
> > > > Ideally it would be possible to distinguish between NUMA hinting faults
> > > > that are private to a task and those that are shared. This would require
> > > > that the last task that accessed a page for a hinting fault would be
> > > > recorded which would increase the size of struct page. Instead this patch
> > > > approximates private pages by assuming that faults that pass the two-stage
> > > > filter are private pages and all others are shared. The preferred NUMA
> > > > node is then selected based on where the maximum number of approximately
> > > > private faults were measured.
> > > 
> > > Should we consider only private faults for preferred node?
> > 
> > I don't think so; its optimal for the task to be nearest most of its pages;
> > irrespective of whether they be private or shared.
> 
> Then the preferred node should have been chosen based on both the
> private and shared faults and not just private faults.

Oh duh, indeed. I totally missed that it did that. The changelog also isn't
giving a rationale for this. Mel?

> > 
> > > I would think if tasks have shared pages then moving all tasks that share
> > > the same pages to a node where the share pages are around would be
> > > preferred. No? 
> > 
> > Well no; not if there's only 5 shared pages but 1024 private pages.
> 
> Yes, agree, but should we try to give the shared pages some additional weightage?

Yes, because you'll get 1/n of the faults on shared pages for threads --
other threads will contend for the same PTE fault. And no, because for
inter-process shared memory they'll each have their own PTE. And maybe,
because even for the threaded case it's hard to tell how many threads
will actually contend for that one PTE.

Confused enough? :-)
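
If such a weighting were attempted, it might look like the sketch below when
scoring a node for preferred-node selection. This is purely hypothetical --
the split arrays and the tunable are made up for illustration, nothing like
this is in the posted series:

	/* Hypothetical: boost shared faults to compensate for the 1/n
	 * sampling of contended PTEs among threads.  numa_faults_private,
	 * numa_faults_shared and SHARED_FAULT_WEIGHT are all assumptions. */
	static unsigned long task_node_score(struct task_struct *p, int nid)
	{
		return p->numa_faults_private[nid] +
		       p->numa_faults_shared[nid] * SHARED_FAULT_WEIGHT;
	}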


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis
  2013-06-27 15:57     ` Peter Zijlstra
@ 2013-06-28 12:22       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 12:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Jun 27, 2013 at 05:57:48PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2013 at 03:38:01PM +0100, Mel Gorman wrote:
> > @@ -503,6 +503,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
> >  #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
> >  #define raw_rq()		(&__raw_get_cpu_var(runqueues))
> >  
> > +#ifdef CONFIG_NUMA_BALANCING
> > +extern void sched_setnuma(struct task_struct *p, int node, int shared);
> 
> Stray line; you're introducing that function later with a different
> signature.
> 

Fixed, thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis
  2013-06-28  6:08     ` Srikar Dronamraju
@ 2013-06-28 12:30       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 12:30 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 11:38:29AM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:01]:
> 
> > This patch tracks what nodes numa hinting faults were incurred on.  Greater
> > weight is given if the pages were to be migrated on the understanding
> > that such faults cost significantly more. If a task has paid the cost to
> > migrating data to that node then in the future it would be preferred if the
> > task did not migrate the data again unnecessarily. This information is later
> > used to schedule a task on the node incurring the most NUMA hinting faults.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  include/linux/sched.h |  2 ++
> >  kernel/sched/core.c   |  3 +++
> >  kernel/sched/fair.c   | 12 +++++++++++-
> >  kernel/sched/sched.h  | 12 ++++++++++++
> >  4 files changed, 28 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index e692a02..72861b4 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1505,6 +1505,8 @@ struct task_struct {
> >  	unsigned int numa_scan_period;
> >  	u64 node_stamp;			/* migration stamp  */
> >  	struct callback_head numa_work;
> > +
> > +	unsigned long *numa_faults;
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  
> >  	struct rcu_head rcu;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 67d0465..f332ec0 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
> >  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> >  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> >  	p->numa_work.next = &p->numa_work;
> > +	p->numa_faults = NULL;
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  }
> >  
> > @@ -1853,6 +1854,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> >  	if (mm)
> >  		mmdrop(mm);
> >  	if (unlikely(prev_state == TASK_DEAD)) {
> > +		task_numa_free(prev);
> > +
> >  		/*
> >  		 * Remove function-return probe instances associated with this
> >  		 * task and put them back on the free list.
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7a33e59..904fd6f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
> >  	if (!sched_feat_numa(NUMA))
> >  		return;
> >  
> > -	/* FIXME: Allocate task-specific structure for placement policy here */
> > +	/* Allocate buffer to track faults on a per-node basis */
> > +	if (unlikely(!p->numa_faults)) {
> > +		int size = sizeof(*p->numa_faults) * nr_node_ids;
> > +
> > +		p->numa_faults = kzalloc(size, GFP_KERNEL);
> > +		if (!p->numa_faults)
> > +			return;
> > +	}
> >  
> >  	/*
> >  	 * If pages are properly placed (did not migrate) then scan slower.
> > @@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
> >  			p->numa_scan_period + jiffies_to_msecs(10));
> >  
> >  	task_numa_placement(p);
> > +
> > +	/* Record the fault, double the weight if pages were migrated */
> > +	p->numa_faults[node] += pages << migrated;
> 
> 
> Why are we doing this after the placement.
> I mean we should probably be doing this in the task_numa_placement,
> 

Peter covered this.

> Since doubling the pages can have an effect on the preferred node. If we
> do it here, wont it end up in a case where the numa_faults on one node
> is actually higher but it may end up being not the preferred node?
> 

Possibly, but it's important to take the cost of migration into account. I
want to prefer keeping tasks on nodes that data was already migrated to.

There is a much more serious problem with fault sampling that I have yet
to think of a good solution for. Consider a task that exhibits very high
locality in a small private array but occasionally updates shared statistics
kept in a large array. In this case the PTE scanner will incur a larger
number of faults in the shared array even though it is less important to
the workload, so the preferred node will be wrong, which is a much more
serious problem.
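
To make the skew concrete, a toy sketch of such a workload (illustrative
only, not from the thread): the hot private array spans only a few pages and
so can contribute only a few hinting faults per scan, while the occasional
updates land on many distinct pages of the shared array and end up dominating
the per-node fault counts.

	#define PRIVATE_WORDS	1024		/* a couple of pages, touched constantly */
	#define SHARED_WORDS	(256UL << 20)	/* spans many pages, touched rarely */

	static long private_data[PRIVATE_WORDS];
	static long *shared_stats;	/* SHARED_WORDS entries, shared with other tasks */

	static void worker(void)
	{
		unsigned long i, iter = 0;

		for (;;) {
			/* Dominant work: extremely high locality, few pages. */
			for (i = 0; i < PRIVATE_WORDS; i++)
				private_data[i]++;

			/* Occasional statistics update that lands on a
			 * different shared page each time.  Hinting faults
			 * are taken once per page per scan window, so these
			 * sparse touches rack up far more faults than the
			 * hot loop above. */
			shared_stats[(++iter * 65537UL) % SHARED_WORDS]++;
		}
	}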

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 3/8] sched: Select a preferred node with the most numa hinting faults
  2013-06-28  6:14     ` Srikar Dronamraju
@ 2013-06-28 12:33       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 12:33 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 11:44:28AM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:02]:
> 
> > This patch selects a preferred node for a task to run on based on the
> > NUMA hinting faults. This information is later used to migrate tasks
> > towards the node during balancing.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  include/linux/sched.h |  1 +
> >  kernel/sched/core.c   | 10 ++++++++++
> >  kernel/sched/fair.c   | 16 ++++++++++++++--
> >  kernel/sched/sched.h  |  2 +-
> >  4 files changed, 26 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 72861b4..ba46a64 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1507,6 +1507,7 @@ struct task_struct {
> >  	struct callback_head numa_work;
> >  
> >  	unsigned long *numa_faults;
> > +	int numa_preferred_nid;
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  
> >  	struct rcu_head rcu;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f332ec0..019baae 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
> >  	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> >  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> >  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> > +	p->numa_preferred_nid = -1;
> 
> Though we may not want to inherit faults, I think the tasks generally
> share pages with their siblings, parent. So will it make sense to
> inherit the preferred node?
> 

If it really shares data with its parent then that will be detected by the PTE
scanner later as normal. I would expect that initially it would be scheduled
to run on CPUs on the local node, and I would think that inheriting the
preferred node here will not make a detectable difference. If you think it
will, I can do it, but then the data should certainly be cleared on exec.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-27 14:53     ` Peter Zijlstra
@ 2013-06-28 13:00       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Jun 27, 2013 at 04:53:45PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> > This patch favours moving tasks towards the preferred NUMA node when
> > it has just been selected. Ideally this is self-reinforcing as the
> > longer the the task runs on that node, the more faults it should incur
> > causing task_numa_placement to keep the task running on that node. In
> > reality a big weakness is that the nodes CPUs can be overloaded and it
> > would be more effficient to queue tasks on an idle node and migrate to
> > the new node. This would require additional smarts in the balancer so
> > for now the balancer will simply prefer to place the task on the
> > preferred node for a tunable number of PTE scans.
> 
> This changelog fails to mention why you're adding the settle stuff in
> this patch.

Updated the changelog:

This patch favours moving tasks towards the preferred NUMA node when it
has just been selected. Ideally this is self-reinforcing as the longer
the task runs on that node, the more faults it should incur causing
task_numa_placement to keep the task running on that node. In reality
a big weakness is that the node's CPUs can be overloaded and it would be
more efficient to queue tasks on an idle node and migrate to the new node.
This would require additional smarts in the balancer, so for now the balancer
will simply prefer to place the task on the preferred node for a number of
PTE scans, controlled by the numa_balancing_settle_count sysctl. Once the
settle_count number of scans has completed, the scheduler is free to place
the task on an alternative node if the load is imbalanced.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-27 16:01     ` Peter Zijlstra
@ 2013-06-28 13:01       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 13:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Jun 27, 2013 at 06:01:27PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> > @@ -3897,6 +3907,28 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
> >  	return delta < (s64)sysctl_sched_migration_cost;
> >  }
> >  
> > +/* Returns true if the destination node has incurred more faults */
> > +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> > +{
> > +	int src_nid, dst_nid;
> > +
> > +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> > +		return false;
> > +
> > +	src_nid = cpu_to_node(env->src_cpu);
> > +	dst_nid = cpu_to_node(env->dst_cpu);
> > +
> > +	if (src_nid == dst_nid)
> > +		return false;
> > +
> > +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> > +	    p->numa_preferred_nid == dst_nid)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> 
> This references ->numa_faults, which is declared under NUMA_BALANCING
> but lacks any such conditionality here.

Fixed.
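
The fix presumably takes the usual form: guard the helper with
CONFIG_NUMA_BALANCING and provide a stub for builds without it. A sketch
(the follow-up patch itself is not quoted here):

	#ifdef CONFIG_NUMA_BALANCING
	/* Returns true if the destination node has incurred more faults */
	static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
	{
		int src_nid, dst_nid;

		if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
			return false;

		src_nid = cpu_to_node(env->src_cpu);
		dst_nid = cpu_to_node(env->dst_cpu);

		if (src_nid == dst_nid)
			return false;

		if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
		    p->numa_preferred_nid == dst_nid)
			return true;

		return false;
	}
	#else
	static inline bool migrate_improves_locality(struct task_struct *p,
						     struct lb_env *env)
	{
		return false;
	}
	#endif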

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-27 16:11     ` Peter Zijlstra
@ 2013-06-28 13:45       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 13:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Jun 27, 2013 at 06:11:27PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2013 at 03:38:04PM +0100, Mel Gorman wrote:
> > +/* Returns true if the destination node has incurred more faults */
> > +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> > +{
> > +	int src_nid, dst_nid;
> > +
> > +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> > +		return false;
> > +
> > +	src_nid = cpu_to_node(env->src_cpu);
> > +	dst_nid = cpu_to_node(env->dst_cpu);
> > +
> > +	if (src_nid == dst_nid)
> > +		return false;
> > +
> > +	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> > +	    p->numa_preferred_nid == dst_nid)
> > +		return true;
> > +
> > +	return false;
> > +}
> 
> Also, until I just actually _read_ that function; I assumed it would
> compare p->numa_faults[src_nid] and p->numa_faults[dst_nid]. Because
> even when the dst_nid isn't the preferred nid; it might still have more
> pages than where we currently are.
> 

I tested something like this and also tested it when only taking shared
accesses into account but it performed badly in some cases.  I've included
the last patch I tested below for reference but dropped it until I figured
out why it performed badly. I guessed it was due to increased bouncing
due to shared faults but didn't prove it.

> Idem with the proposed migrate_degrades_locality().
> 
> Something like so I suppose
> 
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3969,6 +3969,7 @@ task_hot(struct task_struct *p, u64 now,
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +#ifdef CONFIG_NUMA_BALANCING
>  /* Returns true if the destination node has incurred more faults */
>  static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
>  {
> @@ -3983,13 +3984,50 @@ static bool migrate_improves_locality(st
>  	if (src_nid == dst_nid)
>  		return false;
>  
> -	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
> -	    p->numa_preferred_nid == dst_nid)
> +	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
> +		return false;
> +
> +	if (p->numa_preferred_nid == dst_nid)
> +		return true;
> +
> +	if (p->numa_faults[src_nid] < p->numa_faults[dst_nid])
> +		return true;
> +
> +	return false;
> +}
> +

I tested something like this.

> +static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	int src_nid, dst_nid;
> +
> +	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> +		return false;
> +
> +	src_nid = cpu_to_node(env->src_cpu);
> +	dst_nid = cpu_to_node(env->dst_cpu);
> +
> +	if (src_nid == dst_nid)
> +		return false;
> +
> +	if (p->numa_faults[src_nid] > p->numa_faults[dst_nid])
>  		return true;
>  
>  	return false;
>  }

But I had not tried this and it makes sense. I'll test it out and include
it in the next revision if it looks good. Unless you object I'll add
your Signed-off-by because the version of the patch I'm about to test looks
almost identical to this.

>  
> +#else
> +
> +static inline bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	return false;
> +}
> +
> +static inline bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_NUMA_BALANCING */
>  
>  /*
>   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> @@ -4055,8 +4093,10 @@ int can_migrate_task(struct task_struct
>  		return 1;
>  
>  	tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
> +	if (!tsk_cache_hot)
> +		tsk_cache_hot = migrate_degrades_locality(p, env);
>  	if (!tsk_cache_hot ||
> -		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
> +	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
>  
>  		if (tsk_cache_hot) {
>  			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> 

This is the last patch along these lines that I tested.

---8<---
sched: Favour moving tasks towards nodes that incurred more faults

Signed-off-by: Mel Gorman <mgorman@suse.de>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9bbb70..3379ca4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3980,9 +3980,18 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 	if (src_nid == dst_nid)
 		return false;
 
-	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
-	    p->numa_preferred_nid == dst_nid)
-		return true;
+	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count) {
+		if (p->numa_preferred_nid == dst_nid)
+			return true;
+
+		/*
+		 * Move towards node if there were a higher number of shared
+		 * NUMA hinting faults
+		 */
+		if (p->numa_faults[task_faults_idx(dst_nid, 0)] >
+		    p->numa_faults[task_faults_idx(src_nid, 0)])
+			return true;
+	}
 
 	return false;
 }


-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28 10:07         ` Srikar Dronamraju
@ 2013-06-28 13:51           ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 13:51 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 03:37:23PM +0530, Srikar Dronamraju wrote:
> > > > +
> > > > +
> > > >  /*
> > > >   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> > > >   */
> > > > @@ -3945,10 +3977,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> > > >  
> > > >  	/*
> > > >  	 * Aggressive migration if:
> > > > -	 * 1) task is cache cold, or
> > > > -	 * 2) too many balance attempts have failed.
> > > > +	 * 1) destination numa is preferred
> > > > +	 * 2) task is cache cold, or
> > > > +	 * 3) too many balance attempts have failed.
> > > >  	 */
> > > >  
> > > > +	if (migrate_improves_locality(p, env))
> > > > +		return 1;
> > > 
> > > Shouldnt this be under tsk_cache_hot check?
> > > 
> > > If the task is cache hot, then we would have to update the corresponding  schedstat
> > > metrics.
> > 
> > No; you want migrate_degrades_locality() to be like task_hot(). You want
> > to _always_ migrate tasks towards better locality irrespective of local
> > cache hotness.
> > 
> 
> Yes, I understand that numa should have more priority over cache.
> But the schedstats will not be updated about whether the task was hot or
> cold.
> 
> So lets say the task was cache hot but numa wants it to move, then we
> should certainly move it but we should update the schedstats to mention that we
> moved a cache hot task.
> 
> Something akin to this.
> 
> 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
> 	if (tsk_cache_hot) {
> 		if (migrate_improves_locality(p, env) || 
> 		 	(env->sd->nr_balance_failed > env->sd->cache_nice_tries)) {
> #ifdef CONFIG_SCHEDSTATS
> 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> 			schedstat_inc(p, se.statistics.nr_forced_migrations);
> #endif
> 			return 1;
> 		}
> 		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
> 		return 0;
> 	}
> 	return 1;
> 

Thanks. Is this acceptable?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3848e0..c3a153e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4088,8 +4088,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) too many balance attempts have failed.
 	 */
 
-	if (migrate_improves_locality(p, env))
+	if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+		schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+		schedstat_inc(p, se.statistics.nr_forced_migrations);
+#endif
 		return 1;
+	}
 
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
 	if (!tsk_cache_hot)

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-06-26 14:37 ` Mel Gorman
@ 2013-06-28 13:54   ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28 13:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:37:59]:

> It's several months overdue and everything was quiet after 3.8 came out
> but I recently had a chance to revisit automatic NUMA balancing for a few
> days. I looked at basic scheduler integration resulting in the following
> small series. Much of the following is heavily based on the numacore series
> which in itself takes part of the autonuma series from back in November. In
> particular it borrows heavily from Peter Ziljstra's work in "sched, numa,
> mm: Add adaptive NUMA affinity support" but deviates too much to preserve
> Signed-off-bys. As before, if the relevant authors are ok with it I'll
> add Signed-off-bys (or add them yourselves if you pick the patches up).


Here is a snapshot of the results of running autonuma-benchmark on an
8 node, 64 cpu system with hyperthreading disabled. I ran 5 iterations for
each setup.

	KernelVersion: 3.9.0-mainline_v39+()
				Testcase:      Min      Max      Avg
				  numa01:  1784.16  1864.15  1800.16
				  numa02:    32.07    32.72    32.59

	KernelVersion: 3.9.0-mainline_v39+() + mel's patches
				Testcase:      Min      Max      Avg  %Change
				  numa01:  1752.48  1859.60  1785.60    0.82%
				  numa02:    47.21    60.58    53.43  -39.00%

So in the numa02 case we see a degradation of around 39%.

Details below
-----------------------------------------------------------------------------------------

numa01
	KernelVersion: 3.9.0-mainline_v39+()
	 Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
		   554,289 cs                                                           [100.00%]
		    26,727 migrations                                                   [100.00%]
		 1,982,054 faults                                                       [100.00%]
		     5,819 migrate:mm_migrate_pages                                    

	    1784.171745972 seconds time elapsed

	numa01 1784.16 352.58 68140.96 141242 4862

	KernelVersion: 3.9.0-mainline_v39+() + mel's patches
	 Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':

		 1,072,118 cs                                                           [100.00%]
		    43,796 migrations                                                   [100.00%]
		 5,226,896 faults                                                       [100.00%]
		     2,815 migrate:mm_migrate_pages                                    

	    1763.961631143 seconds time elapsed

	numa01 1763.95 321.62 78358.88 233740 2712


numa02
	KernelVersion: 3.9.0-mainline_v39+()

	 Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':

		    14,018 cs                                                           [100.00%]
		     1,209 migrations                                                   [100.00%]
		    40,847 faults                                                       [100.00%]
		       629 migrate:mm_migrate_pages                                    

	      32.729238004 seconds time elapsed

	numa02 32.72 51.25 1415.06 6013 111

	KernelVersion: 3.9.0-mainline_v39+() + mel's patches

	 Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':

		    35,891 cs                                                           [100.00%]
		     1,579 migrations                                                   [100.00%]
		   173,443 faults                                                       [100.00%]
		     1,106 migrate:mm_migrate_pages                                    

	      53.970814899 seconds time elapsed

	numa02 53.96 128.90 2301.90 9291 148

Notes:
In the numa01 case, we see a slight benefit plus lower system and user time.
We see more context switches and task migrations but fewer page migrations.

In the numa02 case, we see a larger degradation plus higher system and higher
user time. We see more context switches and more page migrations too.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-06-27 14:54     ` Peter Zijlstra
@ 2013-06-28 13:54       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 13:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Jun 27, 2013 at 04:54:58PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2013 at 03:38:05PM +0100, Mel Gorman wrote:
> > +static int
> > +find_idlest_cpu_node(int this_cpu, int nid)
> > +{
> > +	unsigned long load, min_load = ULONG_MAX;
> > +	int i, idlest_cpu = this_cpu;
> > +
> > +	BUG_ON(cpu_to_node(this_cpu) == nid);
> > +
> > +	for_each_cpu(i, cpumask_of_node(nid)) {
> > +		load = weighted_cpuload(i);
> > +
> > +		if (load < min_load) {
> > +			struct task_struct *p;
> > +
> > +			/* Do not preempt a task running on its preferred node */
> > +			struct rq *rq = cpu_rq(i);
> > +			local_irq_disable();
> > +			raw_spin_lock(&rq->lock);
> 
> raw_spin_lock_irq() ?
> 

/me slaps self

Fixed. Thanks.

> > +			p = rq->curr;
> > +			if (p->numa_preferred_nid != nid) {
> > +				min_load = load;
> > +				idlest_cpu = i;
> > +			}
> > +			raw_spin_unlock(&rq->lock);
> > +			local_irq_enable();
> > +		}
> > +	}
> > +
> > +	return idlest_cpu;
> > +}
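
With raw_spin_lock_irq() the scan over the node's CPUs presumably ends up
looking something like this (a sketch of the fix, not the exact hunk that
was applied):

        for_each_cpu(i, cpumask_of_node(nid)) {
                load = weighted_cpuload(i);

                if (load < min_load) {
                        struct rq *rq = cpu_rq(i);
                        struct task_struct *p;

                        /* Do not preempt a task running on its preferred node */
                        raw_spin_lock_irq(&rq->lock);
                        p = rq->curr;
                        if (p->numa_preferred_nid != nid) {
                                min_load = load;
                                idlest_cpu = i;
                        }
                        raw_spin_unlock_irq(&rq->lock);
                }
        }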

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-27 14:56     ` Peter Zijlstra
@ 2013-06-28 14:00       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Jun 27, 2013 at 04:56:58PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2013 at 03:38:06PM +0100, Mel Gorman wrote:
> > +void task_numa_fault(int last_nid, int node, int pages, bool migrated)
> >  {
> >  	struct task_struct *p = current;
> > +	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
> >  
> >  	if (!sched_feat_numa(NUMA))
> >  		return;
> >  
> >  	/* Allocate buffer to track faults on a per-node basis */
> >  	if (unlikely(!p->numa_faults)) {
> > -		int size = sizeof(*p->numa_faults) * nr_node_ids;
> > +		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
> >  
> >  		/* numa_faults and numa_faults_buffer share the allocation */
> > -		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
> > +		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
> >  		if (!p->numa_faults)
> >  			return;
> 
> So you need a buffer 2x the size in total; but you're now allocating
> a buffer 4x larger than before.
> 
> Isn't doubling size alone sufficient?

/me slaps self

This was a rebase screwup. Thanks.
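
For reference, the corrected allocation should then just be twice the
already-doubled per-node size, i.e. something like this sketch:

        /* Allocate buffer to track faults on a per-node basis */
        if (unlikely(!p->numa_faults)) {
                /* two counters per node: private and shared hinting faults */
                int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;

                /* numa_faults and numa_faults_buffer share the allocation */
                p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
                if (!p->numa_faults)
                        return;
        }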

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-28 10:33           ` Peter Zijlstra
@ 2013-06-28 14:29             ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 14:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Fri, Jun 28, 2013 at 12:33:04PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 28, 2013 at 03:42:45PM +0530, Srikar Dronamraju wrote:
> > > > 
> > > > > Ideally it would be possible to distinguish between NUMA hinting faults
> > > > > that are private to a task and those that are shared. This would require
> > > > > that the last task that accessed a page for a hinting fault would be
> > > > > recorded which would increase the size of struct page. Instead this patch
> > > > > approximates private pages by assuming that faults that pass the two-stage
> > > > > filter are private pages and all others are shared. The preferred NUMA
> > > > > node is then selected based on where the maximum number of approximately
> > > > > private faults were measured.
> > > > 
> > > > Should we consider only private faults for preferred node?
> > > 
> > > I don't think so; its optimal for the task to be nearest most of its pages;
> > > irrespective of whether they be private or shared.
> > 
> > Then the preferred node should have been chosen based on both the
> > private and shared faults and not just private faults.
> 
> Oh duh indeed. I totally missed it did that. Changelog also isn't giving
> rationale for this. Mel?
> 

There were a few reasons

First, if there are many tasks sharing the page then they'll all move towards
the same node. That node will become compute overloaded and the tasks will be
scheduled away later, only to bounce back again. Alternatively the sharing
tasks would just bounce around nodes because the fault information is
effectively noise. Either way I felt that accounting for shared faults
alongside private faults would be slower overall.

The second reason was based on a hypothetical workload that had a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.

The third reason was because multiple threads in a process will race
each other to fault the shared page making the information unreliable.

It is important that *something* be done with shared faults but I haven't
thought of what exactly yet. One possibility would be to give them a
different weight, maybe based on the number of active NUMA nodes, but I had
not tested anything yet. Peter suggested privately that if shared faults
dominate the workload then the shared pages would be migrated based on an
interleave policy, which has some potential.
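
To make that concrete, preferred node selection on top of the split counters
boils down to picking the node with the most approximately-private faults,
roughly like the sketch below (the helper name is invented; index 1 is
assumed to be the private slot, matching the 0 == shared convention in the
earlier patch):

        static void sketch_pick_preferred_node(struct task_struct *p)
        {
                int nid, max_nid = -1;
                unsigned long max_faults = 0;

                for_each_online_node(nid) {
                        /* count only faults that passed the two-stage filter */
                        unsigned long faults =
                                p->numa_faults[task_faults_idx(nid, 1)];

                        if (faults > max_faults) {
                                max_faults = faults;
                                max_nid = nid;
                        }
                }

                if (max_nid != -1 && max_nid != p->numa_preferred_nid)
                        p->numa_preferred_nid = max_nid;
        }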

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28 13:45       ` Mel Gorman
@ 2013-06-28 15:10         ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28 15:10 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Fri, Jun 28, 2013 at 02:45:35PM +0100, Mel Gorman wrote:

> > Also, until I just actually _read_ that function; I assumed it would
> > compare p->numa_faults[src_nid] and p->numa_faults[dst_nid]. Because
> > even when the dst_nid isn't the preferred nid; it might still have more
> > pages than where we currently are.
> > 
> 
> I tested something like this and also tested it when only taking shared
> accesses into account but it performed badly in some cases.  I've included
> the last patch I tested below for reference but dropped it until I figured
> out why it performed badly. I guessed it was due to increased bouncing
> due to shared faults but didn't prove it.

Oh, interesting. Yeah it would be good to figure out why that gave
funnies.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-06-28 14:29             ` Mel Gorman
@ 2013-06-28 15:12               ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-06-28 15:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Fri, Jun 28, 2013 at 03:29:25PM +0100, Mel Gorman wrote:
> > Oh duh indeed. I totally missed it did that. Changelog also isn't giving
> > rationale for this. Mel?
> > 
> 
> There were a few reasons
> 
> First, if there are many tasks sharing the page then they'll all move towards
> the same node. That node will become compute overloaded and the tasks will be
> scheduled away later, only to bounce back again. Alternatively the sharing
> tasks would just bounce around nodes because the fault information is
> effectively noise. Either way I felt that accounting for shared faults
> alongside private faults would be slower overall.
> 
> The second reason was based on a hypothetical workload that had a small
> number of very important, heavily accessed private pages but a large shared
> array. The shared array would dominate the number of faults and be selected
> as a preferred node even though it's the wrong decision.
> 
> The third reason was because multiple threads in a process will race
> each other to fault the shared page making the information unreliable.
> 
> It is important that *something* be done with shared faults but I haven't
> thought of what exactly yet. One possibility would be to give them a
> different weight, maybe based on the number of active NUMA nodes, but I had
> not tested anything yet. Peter suggested privately that if shared faults
> dominate the workload then the shared pages would be migrated based on an
> interleave policy, which has some potential.
> 

It would be good to put something like this in the Changelog, or even as
a comment near how we select the preferred node.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28 13:51           ` Mel Gorman
@ 2013-06-28 17:14             ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28 17:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

> > Yes, I understand that numa should have more priority over cache.
> > But the schedstats will not be updated about whether the task was hot or
> > cold.
> > 
> > So lets say the task was cache hot but numa wants it to move, then we
> > should certainly move it but we should update the schedstats to mention that we
> > moved a cache hot task.
> > 
> > Something akin to this.
> > 
> > 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
> > 	if (tsk_cache_hot) {
> > 		if (migrate_improves_locality(p, env) || 
> > 		 	(env->sd->nr_balance_failed > env->sd->cache_nice_tries)) {
> > #ifdef CONFIG_SCHEDSTATS
> > 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> > 			schedstat_inc(p, se.statistics.nr_forced_migrations);
> > #endif
> > 			return 1;
> > 		}
> > 		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
> > 		return 0;
> > 	}
> > 	return 1;
> > 
> 
> Thanks. Is this acceptable?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b3848e0..c3a153e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4088,8 +4088,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	 * 3) too many balance attempts have failed.
>  	 */
> 
> -	if (migrate_improves_locality(p, env))
> +	if (migrate_improves_locality(p, env)) {
> +#ifdef CONFIG_SCHEDSTATS
> +		schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> +		schedstat_inc(p, se.statistics.nr_forced_migrations);
> +#endif
>  		return 1;
> +	}
> 

In this case, we account even cache cold threads as _cache hot_ in
schedstats.

We need the task_hot() call to determine if task is cache hot or not.
So the migrate_improves_locality(), I think should be called within the
tsk_cache_hot check.

Do you have issues with the above snippet that I posted earlier?

>  	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
>  	if (!tsk_cache_hot)
> 
> -- 
> Mel Gorman
> SUSE Labs
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28 17:14             ` Srikar Dronamraju
@ 2013-06-28 17:34               ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-06-28 17:34 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 10:44:27PM +0530, Srikar Dronamraju wrote:
> > > Yes, I understand that numa should have more priority over cache.
> > > But the schedstats will not be updated about whether the task was hot or
> > > cold.
> > > 
> > > So lets say the task was cache hot but numa wants it to move, then we
> > > should certainly move it but we should update the schedstats to mention that we
> > > moved a cache hot task.
> > > 
> > > Something akin to this.
> > > 
> > > 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
> > > 	if (tsk_cache_hot) {
> > > 		if (migrate_improves_locality(p, env) || 
> > > 		 	(env->sd->nr_balance_failed > env->sd->cache_nice_tries)) {
> > > #ifdef CONFIG_SCHEDSTATS
> > > 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> > > 			schedstat_inc(p, se.statistics.nr_forced_migrations);
> > > #endif
> > > 			return 1;
> > > 		}
> > > 		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
> > > 		return 0;
> > > 	}
> > > 	return 1;
> > > 
> > 
> > Thanks. Is this acceptable?
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b3848e0..c3a153e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4088,8 +4088,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >  	 * 3) too many balance attempts have failed.
> >  	 */
> > 
> > -	if (migrate_improves_locality(p, env))
> > +	if (migrate_improves_locality(p, env)) {
> > +#ifdef CONFIG_SCHEDSTATS
> > +		schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> > +		schedstat_inc(p, se.statistics.nr_forced_migrations);
> > +#endif
> >  		return 1;
> > +	}
> > 
> 
> In this case, we account even cache cold threads as _cache hot_ in
> schedstats.
> 
> We need the task_hot() call to determine if task is cache hot or not.
> So the migrate_improves_locality(), I think should be called within the
> tsk_cache_hot check.
> 
> Do you have issues with the above snippet that I posted earlier?
> 

The migrate_improves_locality call had already happened so it cannot be
true after the tsk_cache_hot check is made so I was confused. If the call is
moved within task cache hot then it changes the intent of the patch because
cache hotness then trumps memory locality which is not intended. Memory
locality is expected to trump cache hotness.

How about this?

        tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);

        if (migrate_improves_locality(p, env)) {
#ifdef CONFIG_SCHEDSTATS
                if (tsk_cache_hot) {
                        schedstat_inc(env->sd, lb_hot_gained[env->idle]);
                        schedstat_inc(p, se.statistics.nr_forced_migrations);
                }
#endif
                return 1;
        }

        if (!tsk_cache_hot ||
                env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
#ifdef CONFIG_SCHEDSTATS
                if (tsk_cache_hot) {
                        schedstat_inc(env->sd, lb_hot_gained[env->idle]);
                        schedstat_inc(p, se.statistics.nr_forced_migrations);
                }
#endif
                return 1;
        }


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 5/8] sched: Favour moving tasks towards the preferred node
  2013-06-28 17:34               ` Mel Gorman
@ 2013-06-28 17:44                 ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-06-28 17:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

> > > 
> > > -	if (migrate_improves_locality(p, env))
> > > +	if (migrate_improves_locality(p, env)) {
> > > +#ifdef CONFIG_SCHEDSTATS
> > > +		schedstat_inc(env->sd, lb_hot_gained[env->idle]);
> > > +		schedstat_inc(p, se.statistics.nr_forced_migrations);
> > > +#endif
> > >  		return 1;
> > > +	}
> > > 
> > 
> > In this case, we account even cache cold threads as _cache hot_ in
> > schedstats.
> > 
> > We need the task_hot() call to determine if task is cache hot or not.
> > So the migrate_improves_locality(), I think should be called within the
> > tsk_cache_hot check.
> > 
> > Do you have issues with the above snippet that I posted earlier?
> > 
> 
> The migrate_improves_locality call had already happened so it cannot be
> true after the tsk_cache_hot check is made so I was confused. If the call is
> moved within task cache hot then it changes the intent of the patch because

Yes, I was suggesting moving it inside.

> cache hotness then trumps memory locality which is not intended. Memory
> locality is expected to trump cache hotness.
> 

But whether memory locality trumps cache hotness or the other way around, the
result would still be the same, just with slightly more concise code.

> How about this?
> 
>         tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
> 
>         if (migrate_improves_locality(p, env)) {
> #ifdef CONFIG_SCHEDSTATS
>                 if (tsk_cache_hot) {
>                         schedstat_inc(env->sd, lb_hot_gained[env->idle]);
>                         schedstat_inc(p, se.statistics.nr_forced_migrations);
>                 }
> #endif
>                 return 1;
>         }
> 
>         if (!tsk_cache_hot ||
>                 env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
> #ifdef CONFIG_SCHEDSTATS
>                 if (tsk_cache_hot) {
>                         schedstat_inc(env->sd, lb_hot_gained[env->idle]);
>                         schedstat_inc(p, se.statistics.nr_forced_migrations);
>                 }
> #endif
>                 return 1;
>         }

Yes, this looks fine to me.
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-06-28 13:54   ` Srikar Dronamraju
@ 2013-07-01  5:39     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-07-01  5:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2013-06-28 19:24:22]:

> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:37:59]:
> 
> > It's several months overdue and everything was quiet after 3.8 came out
> > but I recently had a chance to revisit automatic NUMA balancing for a few
> > days. I looked at basic scheduler integration resulting in the following
> > small series. Much of the following is heavily based on the numacore series
> > which in itself takes part of the autonuma series from back in November. In
> > particular it borrows heavily from Peter Ziljstra's work in "sched, numa,
> > mm: Add adaptive NUMA affinity support" but deviates too much to preserve
> > Signed-off-bys. As before, if the relevant authors are ok with it I'll
> > add Signed-off-bys (or add them yourselves if you pick the patches up).
> 
> 
> Here is a snapshot of the results of running autonuma-benchmark running on 8
> node 64 cpu system with hyper threading disabled. Ran 5 iterations for each
> setup
> 
> 	KernelVersion: 3.9.0-mainline_v39+()
> 				Testcase:      Min      Max      Avg
> 				  numa01:  1784.16  1864.15  1800.16
> 				  numa02:    32.07    32.72    32.59
> 
> 	KernelVersion: 3.9.0-mainline_v39+() + mel's patches
> 				Testcase:      Min      Max      Avg  %Change
> 				  numa01:  1752.48  1859.60  1785.60    0.82%
> 				  numa02:    47.21    60.58    53.43  -39.00%
> 
> So numa02 case; we see a degradation of around 39%.
> 

I reran the tests again 

KernelVersion: 3.9.0-mainline_v39+()
                        Testcase:      Min      Max      Avg
                          numa01:  1784.16  1864.15  1800.16
             numa01_THREAD_ALLOC:   293.75   315.35   311.03
                          numa02:    32.07    32.72    32.59
                      numa02_SMT:    39.27    39.79    39.69

KernelVersion: 3.9.0-mainline_v39+() + your patches
                        Testcase:      Min      Max      Avg  %Change
                          numa01:  1720.40  1876.89  1767.75    1.83%
             numa01_THREAD_ALLOC:   464.34   554.82   496.64  -37.37%
                          numa02:    52.02    58.57    56.21  -42.02%
                      numa02_SMT:    42.07    52.64    47.33  -16.14%
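
For what it is worth, the %Change column here appears to be the vanilla
average expressed relative to the patched average, i.e. (vanilla - patched) /
patched. A minimal user-space sketch (not from any posted tooling) using the
numa02 averages above:

	#include <stdio.h>

	int main(void)
	{
		double vanilla_avg = 32.59;	/* 3.9.0-mainline_v39+ numa02 average */
		double patched_avg = 56.21;	/* with the series applied */

		/* (vanilla - patched) / patched * 100 reproduces the -42.02% above */
		printf("%%Change: %.2f%%\n",
		       (vanilla_avg - patched_avg) / patched_avg * 100.0);
		return 0;
	}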


-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-07-01  5:39     ` Srikar Dronamraju
@ 2013-07-01  8:43       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-07-01  8:43 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Mon, Jul 01, 2013 at 11:09:47AM +0530, Srikar Dronamraju wrote:
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2013-06-28 19:24:22]:
> 
> > * Mel Gorman <mgorman@suse.de> [2013-06-26 15:37:59]:
> > 
> > > It's several months overdue and everything was quiet after 3.8 came out
> > > but I recently had a chance to revisit automatic NUMA balancing for a few
> > > days. I looked at basic scheduler integration resulting in the following
> > > small series. Much of the following is heavily based on the numacore series
> > > which in itself takes part of the autonuma series from back in November. In
> > > particular it borrows heavily from Peter Ziljstra's work in "sched, numa,
> > > mm: Add adaptive NUMA affinity support" but deviates too much to preserve
> > > Signed-off-bys. As before, if the relevant authors are ok with it I'll
> > > add Signed-off-bys (or add them yourselves if you pick the patches up).
> > 
> > 
> > Here is a snapshot of the results of running autonuma-benchmark running on 8
> > node 64 cpu system with hyper threading disabled. Ran 5 iterations for each
> > setup
> > 
> > 	KernelVersion: 3.9.0-mainline_v39+()
> > 				Testcase:      Min      Max      Avg
> > 				  numa01:  1784.16  1864.15  1800.16
> > 				  numa02:    32.07    32.72    32.59
> > 
> > 	KernelVersion: 3.9.0-mainline_v39+() + mel's patches
> > 				Testcase:      Min      Max      Avg  %Change
> > 				  numa01:  1752.48  1859.60  1785.60    0.82%
> > 				  numa02:    47.21    60.58    53.43  -39.00%
> > 
> > So numa02 case; we see a degradation of around 39%.
> > 
> 
> I reran the tests again 
> 
> KernelVersion: 3.9.0-mainline_v39+()
>                         Testcase:      Min      Max      Avg
>                           numa01:  1784.16  1864.15  1800.16
>              numa01_THREAD_ALLOC:   293.75   315.35   311.03
>                           numa02:    32.07    32.72    32.59
>                       numa02_SMT:    39.27    39.79    39.69
> 
> KernelVersion: 3.9.0-mainline_v39+() + your patches
>                         Testcase:      Min      Max      Avg  %Change
>                           numa01:  1720.40  1876.89  1767.75    1.83%
>              numa01_THREAD_ALLOC:   464.34   554.82   496.64  -37.37%
>                           numa02:    52.02    58.57    56.21  -42.02%
>                       numa02_SMT:    42.07    52.64    47.33  -16.14%
> 

Thanks. Each of the two runs had 5 iterations and there is a
difference in the reported average. Do you know what the standard
deviation is of the results?

I'm less concerned about the numa01 results as it is an adverse
workload on machines with more than two sockets but the numa02 results
are certainly of concern. My own testing for numa02 showed little or no
change. Would you mind testing with "Increase NUMA PTE scanning when a
new preferred node is selected" reverted please?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-07-01  8:43       ` Mel Gorman
@ 2013-07-02  5:28         ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-07-02  5:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-07-01 09:43:21]:

> 
> Thanks. Each of the two runs had 5 iterations and there is a
> difference in the reported average. Do you know what the standard
> deviation is of the results?

Yes, the results were from 2 different runs.
I hadn't calculated the standard deviation for those runs.
> 
> I'm less concerned about the numa01 results as it is an adverse
> workload on machines with more than two sockets but the numa02 results
> are certainly of concern. My own testing for numa02 showed little or no
> change. Would you mind testing with "Increase NUMA PTE scanning when a
> new preferred node is selected" reverted please?
> 

Here are the results with the last patch reverted as requested by you.

KernelVersion: 3.9.0-mainline_v39+ your patches - last patch
		Testcase:      Min      Max      Avg  StdDev  %Change
		  numa01:  1704.50  1841.82  1757.55   49.27    2.42%
     numa01_THREAD_ALLOC:   433.25   517.07   464.17   28.15  -32.99%
		  numa02:    55.64    61.75    57.70    2.19  -43.52%
	      numa02_SMT:    44.78    53.45    48.72    2.91  -18.53%



Detailed run output here 

numa01 1704.50 248.67 71999.86 207091 1093
numa01_THREAD_ALLOC 461.62 416.89 23064.79 90283 961
numa02 61.75 93.86 2444.21 10652 6
numa02_SMT 46.79 23.13 977.94 1925 8
numa01 1769.09 262.00 74607.77 226677 1313
numa01_THREAD_ALLOC 433.25 365.12 21994.25 88597 773
numa02 55.64 89.52 2250.01 8848 210
numa02_SMT 49.39 19.81 938.86 1376 33
numa01 1841.82 407.73 78683.69 227428 1834
numa01_THREAD_ALLOC 517.07 465.71 26152.60 111689 978
numa02 55.95 103.26 2223.36 8471 158
numa02_SMT 53.45 19.73 962.08 1349 26
numa01 1760.41 474.74 76094.03 231278 2802
numa01_THREAD_ALLOC 456.80 395.35 23170.23 88049 835
numa02 57.18 87.31 2390.11 10804 3
numa02_SMT 44.78 26.48 944.28 1314 7
numa01 1711.91 421.49 77728.30 224185 2103
numa01_THREAD_ALLOC 452.09 430.88 22271.38 83418 2035
numa02 57.97 126.86 2354.34 8991 135
numa02_SMT 49.19 34.99 914.35 1308 22
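
As a cross-check, the numa02 summary line above can be reproduced from the
five per-iteration elapsed times listed here; the StdDev appears to be a
population (divide-by-N) standard deviation. A minimal user-space sketch
using those five values:

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		/* numa02 elapsed times from the five iterations above */
		double t[] = { 61.75, 55.64, 55.95, 57.18, 57.97 };
		int i, n = sizeof(t) / sizeof(t[0]);
		double sum = 0.0, mean, var = 0.0;

		for (i = 0; i < n; i++)
			sum += t[i];
		mean = sum / n;

		for (i = 0; i < n; i++)
			var += (t[i] - mean) * (t[i] - mean);

		/* prints approximately "avg 57.70 stddev 2.19" */
		printf("avg %.2f stddev %.2f\n", mean, sqrt(var / n));
		return 0;
	}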


> -- 
> Mel Gorman
> SUSE Labs
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-06-28 13:54   ` Srikar Dronamraju
@ 2013-07-02  7:46     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-02  7:46 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Fri, Jun 28, 2013 at 07:24:22PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-06-26 15:37:59]:
> 
> > It's several months overdue and everything was quiet after 3.8 came out
> > but I recently had a chance to revisit automatic NUMA balancing for a few
> > days. I looked at basic scheduler integration resulting in the following
> > small series. Much of the following is heavily based on the numacore series
> > which in itself takes part of the autonuma series from back in November. In
> > particular it borrows heavily from Peter Ziljstra's work in "sched, numa,
> > mm: Add adaptive NUMA affinity support" but deviates too much to preserve
> > Signed-off-bys. As before, if the relevant authors are ok with it I'll
> > add Signed-off-bys (or add them yourselves if you pick the patches up).
> 
> 
> Here is a snapshot of the results of running autonuma-benchmark running on 8
> node 64 cpu system with hyper threading disabled. Ran 5 iterations for each
> setup
> 
> 	KernelVersion: 3.9.0-mainline_v39+()
> 				Testcase:      Min      Max      Avg
> 				  numa01:  1784.16  1864.15  1800.16
> 				  numa02:    32.07    32.72    32.59
> 
> 	KernelVersion: 3.9.0-mainline_v39+() + mel's patches
> 				Testcase:      Min      Max      Avg  %Change
> 				  numa01:  1752.48  1859.60  1785.60    0.82%
> 				  numa02:    47.21    60.58    53.43  -39.00%

I had to go look at these benchmarks again: numa02 is the one that's purely
private and thus should run well with this patch set, while numa01 is the
purely shared one and should fare less well for now.


So on the biggest system I've got; 4 nodes 32 cpus:

 Performance counter stats for './numa02' (5 runs):

3.10.0+ - NO_NUMA		57.973118199 seconds time elapsed    ( +-  0.71% )
3.10.0+ -    NUMA		17.619811716 seconds time elapsed    ( +-  0.32% )

3.10.0+ + patches - NO_NUMA	58.235353126 seconds time elapsed    ( +-  0.45% )
3.10.0+ + patches -    NUMA     17.580963359 seconds time elapsed    ( +-  0.09% )


Which is a small to no improvement. We'd have to look at what makes the
8-node machine go funny, but I don't think it's realistic to hold off on the
patches for that system.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 0/6] Basic scheduler support for automatic NUMA balancing
  2013-07-02  7:46     ` Peter Zijlstra
@ 2013-07-02  8:55       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-02  8:55 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Tue, Jul 02, 2013 at 09:46:59AM +0200, Peter Zijlstra wrote:
> So on the biggest system I've got; 4 nodes 32 cpus:
> 
>  Performance counter stats for './numa02' (5 runs):
> 
> 3.10.0+ + patches - NO_NUMA	58.235353126 seconds time elapsed    ( +-  0.45% )
> 3.10.0+ + patches -    NUMA   17.580963359 seconds time elapsed    ( +-  0.09% )

I just 'noticed' that I included my migrate_degrades_locality patch -- the one
posted somewhere in this thread (+ compile fixes).

Let me re-run without that one to see if there's any difference.

NO_NUMA		57.961384751 seconds time elapsed    ( +-  0.64% )
   NUMA		17.482115801 seconds time elapsed    ( +-  0.15% )

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-06-26 14:38   ` Mel Gorman
@ 2013-07-02 12:06     ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-07-02 12:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

> A preferred node is selected based on the node the most NUMA hinting
> faults was incurred on. There is no guarantee that the task is running
> on that node at the time so this patch reschedules the task to run on
> the most idle CPU of the selected node when selected. This avoids
> waiting for the balancer to make a decision.
> 

Should we be making this decision just on the numa hinting faults alone?

How are we making sure that the preferred node selection is persistent?
i.e. due to memory access patterns, what if the preferred node
selection keeps moving.

If a large process having several threads were to allocate memory in one
node, then all threads will try to mark that node as their preferred
node. Till they get a chance those tasks will move pages over to the
local node. But if they get a chance to move to their preferred node
before moving enough number of pages, then it would have to fetch back
all the pages.

Can we look at accumulating process weights and using the process
weights to consolidate tasks to one node?

> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  kernel/sched/core.c  | 18 +++++++++++++++--
>  kernel/sched/fair.c  | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |  2 +-
>  3 files changed, 70 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ba9470e..b4722d6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5717,11 +5717,25 @@ struct sched_domain_topology_level;
>  
>  #ifdef CONFIG_NUMA_BALANCING
>  
> -/* Set a tasks preferred NUMA node */
> -void sched_setnuma(struct task_struct *p, int nid)
> +/* Set a tasks preferred NUMA node and reschedule to it */
> +void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
>  {
> +	int curr_cpu = task_cpu(p);
> +	struct migration_arg arg = { p, idlest_cpu };
> +
>  	p->numa_preferred_nid = nid;
>  	p->numa_migrate_seq = 0;
> +
> +	/* Do not reschedule if already running on the target CPU */
> +	if (idlest_cpu == curr_cpu)
> +		return;
> +
> +	/* Ensure the target CPU is eligible */
> +	if (!cpumask_test_cpu(idlest_cpu, tsk_cpus_allowed(p)))
> +		return;
> +
> +	/* Move current running task to idlest CPU on preferred node */
> +	stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);

Here, moving tasks this way doesn't update the schedstats at all.
So task migrations from perf stat and schedstats don't match.
I know migration_cpu_stop was used this way before, but we are making
schedstats more unreliable. Also I don't think migration_cpu_stop was
used all that much. But now it gets used pretty persistently.
Probably we need to make migration_cpu_stop schedstats aware.

migration_cpu_stop has the other drawback that it doesn't check for
CPU throttling. So we might move a task from the present cpu to a
different cpu and the task might end up being throttled instead of being
run.

>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5e7f728..99951a8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -800,6 +800,39 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
>   */
>  unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
>  
> +static unsigned long weighted_cpuload(const int cpu);
> +
> +static int
> +find_idlest_cpu_node(int this_cpu, int nid)
> +{
> +	unsigned long load, min_load = ULONG_MAX;
> +	int i, idlest_cpu = this_cpu;
> +
> +	BUG_ON(cpu_to_node(this_cpu) == nid);
> +
> +	for_each_cpu(i, cpumask_of_node(nid)) {
> +		load = weighted_cpuload(i);
> +
> +		if (load < min_load) {
> +			struct task_struct *p;
> +
> +			/* Do not preempt a task running on its preferred node */
> +			struct rq *rq = cpu_rq(i);
> +			local_irq_disable();
> +			raw_spin_lock(&rq->lock);
> +			p = rq->curr;
> +			if (p->numa_preferred_nid != nid) {
> +				min_load = load;
> +				idlest_cpu = i;
> +			}
> +			raw_spin_unlock(&rq->lock);
> +			local_irq_disable();
> +		}
> +	}
> +
> +	return idlest_cpu;

Here we are not checking whether the preferred node is already loaded. If the
preferred node is already more loaded than the current local node (either
because of task pinning or cpuset configurations), pushing the task to that
node might only end up with the task being pulled back in the next
balancing cycle.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-02 12:06     ` Srikar Dronamraju
@ 2013-07-02 16:29       ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-07-02 16:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Tue, Jul 02, 2013 at 05:36:55PM +0530, Srikar Dronamraju wrote:
> > A preferred node is selected based on the node the most NUMA hinting
> > faults was incurred on. There is no guarantee that the task is running
> > on that node at the time so this patch reschedules the task to run on
> > the most idle CPU of the selected node when selected. This avoids
> > waiting for the balancer to make a decision.
> > 
> 
> Should we be making this decision just on the numa hinting faults alone?
> 

No, we should not. More is required, which will expand the scope of this
series. If a task is not running on its preferred node then why? Probably
because that node was compute overloaded and the scheduler moved the task
off. Now this is trying to push it back on. Instead we should account for
how many "preferred placed" tasks are running on that node and, if it's
more than the number of CPUs, select the second most preferred node (or one
further down the list) instead. Alternatively, on the preferred node, find
the task with the fewest faults for that node and swap nodes with it.
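
Not code from this series, but a rough sketch of the accounting idea above
(the helper name and where it would be called from are made up here):

	/*
	 * Hypothetical sketch: count how many currently running tasks already
	 * treat @nid as their preferred node. Task placement could compare
	 * this against the number of CPUs on the node before pushing yet
	 * another task there. Locking, RCU and hotplug details are ignored.
	 */
	static int nr_preferred_running(int nid)
	{
		int cpu, nr = 0;

		for_each_cpu(cpu, cpumask_of_node(nid)) {
			struct task_struct *curr = cpu_rq(cpu)->curr;

			if (curr->numa_preferred_nid == nid)
				nr++;
		}

		return nr;
	}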

> How are we making sure that the preferred node selection is persistent?

We aren't. That's why, with the full series applied, we only stick to a node
for a number of PTE scans, using this check:

        if (*src_nid == *dst_nid ||
            p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
                return false;


> i.e. due to memory access patterns, what if the preferred node
> selection keeps moving.
> 

If the preferred node keeps moving we are certainly in trouble currently.

> If a large process having several threads were to allocate memory in one
> node, then all threads will try to mark that node as their preferred
> node. Till they get a chance those tasks will move pages over to the
> local node. But if they get a chance to move to their preferred node
> before moving enough pages, then they would have to fetch back
> all the pages.
> 
> Can we look at accumulating process weights and using the process
> weights to consolidate tasks to one node?
> 

Yes, that is ultimately required. Peter's original numacore series did
something like this, but I had not dissected which parts of it actually
matter in this round.

> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  kernel/sched/core.c  | 18 +++++++++++++++--
> >  kernel/sched/fair.c  | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  kernel/sched/sched.h |  2 +-
> >  3 files changed, 70 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index ba9470e..b4722d6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5717,11 +5717,25 @@ struct sched_domain_topology_level;
> >  
> >  #ifdef CONFIG_NUMA_BALANCING
> >  
> > -/* Set a tasks preferred NUMA node */
> > -void sched_setnuma(struct task_struct *p, int nid)
> > +/* Set a tasks preferred NUMA node and reschedule to it */
> > +void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
> >  {
> > +	int curr_cpu = task_cpu(p);
> > +	struct migration_arg arg = { p, idlest_cpu };
> > +
> >  	p->numa_preferred_nid = nid;
> >  	p->numa_migrate_seq = 0;
> > +
> > +	/* Do not reschedule if already running on the target CPU */
> > +	if (idlest_cpu == curr_cpu)
> > +		return;
> > +
> > +	/* Ensure the target CPU is eligible */
> > +	if (!cpumask_test_cpu(idlest_cpu, tsk_cpus_allowed(p)))
> > +		return;
> > +
> > +	/* Move current running task to idlest CPU on preferred node */
> > +	stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
> 
> Here, moving tasks this way doesn't update the schedstats at all.

I did not update the stats because the existing users did not either. Then
again, they are doing things like exec or updating the allowed mask, so
it's not "interesting" as such.

> So task migrations from perf stat and schedstats don't match.
> I know migration_cpu_stop was used this way before, but we are making
> schedstats more unreliable. Also I don't think migration_cpu_stop was
> used all that much. But now it gets used pretty persistently.

I know, this is going to be a concern, particularly when task swapping is
added to the mix. However, I'm not seeing a better way around it right now
other than waiting for the load balancer to kick in, which is far from optimal.

> Probably we need to make migration_cpu_stop schedstats aware.
> 

Due to a lack of deep familiarity with the scheduler, it's not obvious
what the appropriate stats are. Do you mean duplicating something like
what set_task_cpu does within migration_cpu_stop?

> migration_cpu_stop has the other drawback that it doesn't check for
> CPU throttling. So we might move a task from the present cpu to a
> different cpu and the task might end up being throttled instead of being
> run.
> 

find_idlest_cpu_node at least reduces the risk of this but sure, if even
the most idle CPU on the target node is overloaded then it's still a problem.

> >  }
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 5e7f728..99951a8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -800,6 +800,39 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
> >   */
> >  unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
> >  
> > +static unsigned long weighted_cpuload(const int cpu);
> > +
> > +static int
> > +find_idlest_cpu_node(int this_cpu, int nid)
> > +{
> > +	unsigned long load, min_load = ULONG_MAX;
> > +	int i, idlest_cpu = this_cpu;
> > +
> > +	BUG_ON(cpu_to_node(this_cpu) == nid);
> > +
> > +	for_each_cpu(i, cpumask_of_node(nid)) {
> > +		load = weighted_cpuload(i);
> > +
> > +		if (load < min_load) {
> > +			struct task_struct *p;
> > +
> > +			/* Do not preempt a task running on its preferred node */
> > +			struct rq *rq = cpu_rq(i);
> > +			local_irq_disable();
> > +			raw_spin_lock(&rq->lock);
> > +			p = rq->curr;
> > +			if (p->numa_preferred_nid != nid) {
> > +				min_load = load;
> > +				idlest_cpu = i;
> > +			}
> > +			raw_spin_unlock(&rq->lock);
> > +			local_irq_enable();
> > +		}
> > +	}
> > +
> > +	return idlest_cpu;
> 
> Here we are not checking whether the preferred node is already loaded.

Correct. Long term we would need to check load based on the number of
"preferred node" tasks running on it and also on what the absolute load is.
I had not planned on dealing with it in this cycle as this number of patches
is already quite a mouthful, but I'm aware the problem needs to be addressed.

> If the
> preferred node is already more loaded than the current local node (either
> because of task pinning or cpuset configurations), pushing the task to that
> node might only end up with the task being pulled back in the next
> balancing cycle.
> 

Yes, this is true.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-06-26 14:38   ` Mel Gorman
@ 2013-07-02 18:15     ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-02 18:15 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML



Something like this should avoid tasks being lumped back onto one node.

Compile-tested only; need food.

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1037,6 +1037,23 @@ struct migration_arg {
 
 static int migration_cpu_stop(void *data);
 
+#ifdef CONFIG_NUMA_BALANCING
+int migrate_curr_to(int cpu)
+{
+	struct task_struct *p = current;
+	struct migration_arg arg = { p, cpu };
+	int curr_cpu = task_cpu(p);
+
+	if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		return -EINVAL;
+
+	if (curr_cpu == cpu)
+		return 0;
+
+	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -5183,30 +5200,6 @@ enum s_alloc {
 
 struct sched_domain_topology_level;
 
-#ifdef CONFIG_NUMA_BALANCING
-
-/* Set a tasks preferred NUMA node and reschedule to it */
-void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
-{
-	int curr_cpu = task_cpu(p);
-	struct migration_arg arg = { p, idlest_cpu };
-
-	p->numa_preferred_nid = nid;
-	p->numa_migrate_seq = 0;
-
-	/* Do not reschedule if already running on the target CPU */
-	if (idlest_cpu == curr_cpu)
-		return;
-
-	/* Ensure the target CPU is eligible */
-	if (!cpumask_test_cpu(idlest_cpu, tsk_cpus_allowed(p)))
-		return;
-
-	/* Move current running task to idlest CPU on preferred node */
-	stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -838,34 +838,76 @@ unsigned int sysctl_numa_balancing_scan_
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
-static unsigned long weighted_cpuload(const int cpu);
 
-static int find_idlest_cpu_node(int this_cpu, int nid)
-{
-	unsigned long load, min_load = ULONG_MAX;
-	int i, idlest_cpu = this_cpu;
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
+
+static int task_numa_find_cpu(struct task_struct *p)
+{
+	int nid = p->numa_preferred_nid;
+	int node_cpu = cpumask_first(cpumask_of_node(nid));
+	int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
+	unsigned long src_load, dst_load, min_load = ULONG_MAX;
+	struct task_group *tg = task_group(p);
+	s64 src_eff_load, dst_eff_load;
+	struct sched_domain *sd;
+	unsigned long weight;
+	bool balanced;
+	int idx;
 
-	BUG_ON(cpu_to_node(this_cpu) == nid);
+	rcu_read_lock();
+	for_each_domain(src_cpu, sd) {
+		if (cpumask_test_cpu(src_cpu,  sched_domain_span(sd)) &&
+		    cpumask_test_cpu(node_cpu, sched_domain_span(sd)))
+			break;
+	}
 
-	for_each_cpu(i, cpumask_of_node(nid)) {
-		load = weighted_cpuload(i);
+	if (WARN_ON_ONCE(!sd)) {
+		rcu_read_unlock();
+		return dst_cpu;
+	}
 
-		if (load < min_load) {
-			struct task_struct *p;
-			struct rq *rq = cpu_rq(i);
+	/*
+	 * XXX the below is mostly nicked from wake_affine(); we should
+	 * see about sharing a bit if at all possible; also it might want
+	 * some per entity weight love.
+	 */
 
-			/* Do not preempt a task running on its preferred node */
-			raw_spin_lock_irq(&rq->lock);
-			p = rq->curr;
-			if (p->numa_preferred_nid != nid) {
-				min_load = load;
-				idlest_cpu = i;
-			}
-			raw_spin_unlock_irq(&rq->lock);
+	idx = sd->busy_idx; /* XXX do we want another idx? */
+	weight = p->se.load.weight;
+
+	src_load = source_load(src_cpu, idx);
+
+	src_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+	src_eff_load *= power_of(src_cpu);
+	src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
+
+	rcu_read_unlock();
+
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		dst_load = target_load(cpu, idx);
+
+		dst_eff_load = 100;
+		dst_eff_load *= power_of(cpu);
+		dst_eff_load *= dst_load + effective_load(tg, cpu, weight, weight);
+
+		balanced = (dst_eff_load <= src_eff_load);
+
+		/*
+		 * If the dst cpu wasn't idle; don't allow imbalances
+		 */
+		if (dst_load && !balanced)
+			continue;
+
+		if (dst_load < min_load) {
+			min_load = dst_load;
+			dst_cpu = cpu;
 		}
 	}
 
-	return idlest_cpu;
+	return dst_cpu;
 }
 
 static inline int task_faults_idx(int nid, int priv)
@@ -915,29 +957,31 @@ static void task_numa_placement(struct t
 	 * the working set placement.
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
-		int preferred_cpu;
 		int old_migrate_seq = p->numa_migrate_seq;
 
-		/*
-		 * If the task is not on the preferred node then find the most
-		 * idle CPU to migrate to.
-		 */
-		preferred_cpu = task_cpu(p);
-		if (cpu_to_node(preferred_cpu) != max_nid)
-			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
-							     max_nid);
-
-		sched_setnuma(p, max_nid, preferred_cpu);
+		p->numa_preferred_nid = max_nid;
+		p->numa_migrate_seq = 0;
 
 		/*
 		 * If preferred nodes changes frequently then the scan rate
 		 * will be continually high. Mitigate this by increaseing the
 		 * scan rate only if the task was settled.
 		 */
-		if (old_migrate_seq >= sysctl_numa_balancing_settle_count)
-			p->numa_scan_period = max(p->numa_scan_period >> 1,
-					sysctl_numa_balancing_scan_period_min);
+		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
+			p->numa_scan_period =
+				max(p->numa_scan_period >> 1,
+				    sysctl_numa_balancing_scan_period_min);
+		}
 	}
+
+	if (p->numa_preferred_nid == numa_node_id())
+		return;
+
+	/*
+	 * If the task is not on the preferred node then find the most
+	 * idle CPU to migrate to.
+	 */
+	migrate_curr_to(task_numa_find_cpu(p));
 }
 
 /*
@@ -956,7 +1000,7 @@ void task_numa_fault(int last_nid, int n
 		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
-		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
 
@@ -968,9 +1012,10 @@ void task_numa_fault(int last_nid, int n
 	 * If pages are properly placed (did not migrate) then scan slower.
 	 * This is reset periodically in case of phase changes
 	 */
-        if (!migrated)
+        if (!migrated) {
 		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
 			p->numa_scan_period + jiffies_to_msecs(10));
+	}
 
 	task_numa_placement(p);
 
@@ -3245,8 +3290,7 @@ static long effective_load(struct task_g
 }
 #else
 
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
-		unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	return wl;
 }
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -555,7 +555,7 @@ static inline u64 rq_clock_task(struct r
 }
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu);
+extern int migrate_curr_to(int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);




^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-02 12:06     ` Srikar Dronamraju
@ 2013-07-02 18:17       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-02 18:17 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Tue, Jul 02, 2013 at 05:36:55PM +0530, Srikar Dronamraju wrote:
> Here, moving tasks this way doesn't update the schedstats at all.

Do you actually use schedstats? 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-02 18:15     ` Peter Zijlstra
@ 2013-07-03  9:50       ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-03  9:50 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Tue, Jul 02, 2013 at 08:15:22PM +0200, Peter Zijlstra wrote:
> 
> 
> Something like this should avoid tasks being lumped back onto one node..
> 
> Compile tested only, need food.

OK, this one actually ran on my system and showed no negative effects on
numa02 -- then again, I didn't have the problem to begin with :/

Srikar, could you see what your 8-node does with this?

I'll go dig around to see where I left my SpecJBB.

---
Subject: sched, numa: Rework direct migration code to take load levels into account

Srikar mentioned he saw the direct migration code bounce all tasks
back to the first node only to be spread out by the regular balancer.

Rewrite the direct migration code to take load balance into account
such that we will not migrate to a cpu if the result is in direct
conflict with the load balance goals.

I removed the clause where we would not migrate towards a cpu that is
already running a task on the right node. If the balance allows it, it's
perfectly fine to run two tasks per cpu -- think overloaded scenarios.

There are a few XXXs in there that want consideration, but the code
compiles and runs.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/core.c  |   41 +++++++-----------
 kernel/sched/fair.c  |  115 ++++++++++++++++++++++++++++++++++-----------------
 kernel/sched/sched.h |    2 
 3 files changed, 95 insertions(+), 63 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1037,6 +1037,23 @@ struct migration_arg {
 
 static int migration_cpu_stop(void *data);
 
+#ifdef CONFIG_NUMA_BALANCING
+int migrate_curr_to(int cpu)
+{
+	struct task_struct *p = current;
+	struct migration_arg arg = { p, cpu };
+	int curr_cpu = task_cpu(p);
+
+	if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		return -EINVAL;
+
+	if (curr_cpu == cpu)
+		return 0;
+
+	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -5188,30 +5205,6 @@ enum s_alloc {
 
 struct sched_domain_topology_level;
 
-#ifdef CONFIG_NUMA_BALANCING
-
-/* Set a tasks preferred NUMA node and reschedule to it */
-void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
-{
-	int curr_cpu = task_cpu(p);
-	struct migration_arg arg = { p, idlest_cpu };
-
-	p->numa_preferred_nid = nid;
-	p->numa_migrate_seq = 0;
-
-	/* Do not reschedule if already running on the target CPU */
-	if (idlest_cpu == curr_cpu)
-		return;
-
-	/* Ensure the target CPU is eligible */
-	if (!cpumask_test_cpu(idlest_cpu, tsk_cpus_allowed(p)))
-		return;
-
-	/* Move current running task to idlest CPU on preferred node */
-	stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -838,34 +838,71 @@ unsigned int sysctl_numa_balancing_scan_
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
-static unsigned long weighted_cpuload(const int cpu);
 
-static int find_idlest_cpu_node(int this_cpu, int nid)
-{
-	unsigned long load, min_load = ULONG_MAX;
-	int i, idlest_cpu = this_cpu;
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
+
+static int task_numa_find_cpu(struct task_struct *p)
+{
+	int nid = p->numa_preferred_nid;
+	int node_cpu = cpumask_first(cpumask_of_node(nid));
+	int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
+	unsigned long src_load, dst_load, min_load = ULONG_MAX;
+	struct task_group *tg = task_group(p);
+	s64 src_eff_load, dst_eff_load;
+	struct sched_domain *sd;
+	unsigned long weight;
+	bool balanced;
+	int idx = 0, imbalance_pct = 125;
 
-	BUG_ON(cpu_to_node(this_cpu) == nid);
+	rcu_read_lock();
+	for_each_domain(src_cpu, sd) {
+		if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
+			idx = sd->busy_idx; /* XXX another idx? */
+			imbalance_pct = sd->imbalance_pct;
+			break;
+		}
+	}
+	rcu_read_unlock();
 
-	for_each_cpu(i, cpumask_of_node(nid)) {
-		load = weighted_cpuload(i);
+	/*
+	 * XXX the below is mostly nicked from wake_affine(); we should
+	 * see about sharing a bit if at all possible; also it might want
+	 * some per entity weight love.
+	 */
 
-		if (load < min_load) {
-			struct task_struct *p;
-			struct rq *rq = cpu_rq(i);
+	weight = p->se.load.weight;
 
-			/* Do not preempt a task running on its preferred node */
-			raw_spin_lock_irq(&rq->lock);
-			p = rq->curr;
-			if (p->numa_preferred_nid != nid) {
-				min_load = load;
-				idlest_cpu = i;
-			}
-			raw_spin_unlock_irq(&rq->lock);
+	src_load = source_load(src_cpu, idx);
+
+	src_eff_load = 100 + (imbalance_pct - 100) / 2;
+	src_eff_load *= power_of(src_cpu);
+	src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
+
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		dst_load = target_load(cpu, idx);
+
+		dst_eff_load = 100;
+		dst_eff_load *= power_of(cpu);
+		dst_eff_load *= dst_load + effective_load(tg, cpu, weight, weight);
+
+		balanced = (dst_eff_load <= src_eff_load);
+
+		/*
+		 * If the dst cpu wasn't idle; don't allow imbalances
+		 */
+		if (dst_load && !balanced)
+			continue;
+
+		if (dst_load < min_load) {
+			min_load = dst_load;
+			dst_cpu = cpu;
 		}
 	}
 
-	return idlest_cpu;
+	return dst_cpu;
 }
 
 static inline int task_faults_idx(int nid, int priv)
@@ -915,29 +952,31 @@ static void task_numa_placement(struct t
 	 * the working set placement.
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
-		int preferred_cpu;
 		int old_migrate_seq = p->numa_migrate_seq;
 
-		/*
-		 * If the task is not on the preferred node then find the most
-		 * idle CPU to migrate to.
-		 */
-		preferred_cpu = task_cpu(p);
-		if (cpu_to_node(preferred_cpu) != max_nid)
-			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
-							     max_nid);
-
-		sched_setnuma(p, max_nid, preferred_cpu);
+		p->numa_preferred_nid = max_nid;
+		p->numa_migrate_seq = 0;
 
 		/*
 		 * If preferred nodes changes frequently then the scan rate
 		 * will be continually high. Mitigate this by increaseing the
 		 * scan rate only if the task was settled.
 		 */
-		if (old_migrate_seq >= sysctl_numa_balancing_settle_count)
-			p->numa_scan_period = max(p->numa_scan_period >> 1,
-					sysctl_numa_balancing_scan_period_min);
+		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
+			p->numa_scan_period =
+				max(p->numa_scan_period >> 1,
+				    sysctl_numa_balancing_scan_period_min);
+		}
 	}
+
+	if (p->numa_preferred_nid == numa_node_id())
+		return;
+
+	/*
+	 * If the task is not on the preferred node then find the most
+	 * idle CPU to migrate to.
+	 */
+	migrate_curr_to(task_numa_find_cpu(p));
 }
 
 /*
@@ -956,7 +995,7 @@ void task_numa_fault(int last_nid, int n
 		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
-		p->numa_faults = kzalloc(size * 4, GFP_KERNEL);
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
 
@@ -968,9 +1007,10 @@ void task_numa_fault(int last_nid, int n
 	 * If pages are properly placed (did not migrate) then scan slower.
 	 * This is reset periodically in case of phase changes
 	 */
-        if (!migrated)
+        if (!migrated) {
 		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
 			p->numa_scan_period + jiffies_to_msecs(10));
+	}
 
 	task_numa_placement(p);
 
@@ -3263,8 +3303,7 @@ static long effective_load(struct task_g
 }
 #else
 
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
-		unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	return wl;
 }
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -555,7 +555,7 @@ static inline u64 rq_clock_task(struct r
 }
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu);
+extern int migrate_curr_to(int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-03  9:50       ` Peter Zijlstra
@ 2013-07-03 15:28         ` Mel Gorman
  -1 siblings, 0 replies; 124+ messages in thread
From: Mel Gorman @ 2013-07-03 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 11:50:59AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 02, 2013 at 08:15:22PM +0200, Peter Zijlstra wrote:
> > 
> > 
> > Something like this should avoid tasks being lumped back onto one node..
> > 
> > Compile tested only, need food.
> 
> OK, this one actually ran on my system and showed no negative effects on
> numa02 -- then again, I didn't have the problem to begin with :/
> 
> Srikar, could you see what your 8-node does with this?
> 
> I'll go dig around to see where I left my SpecJBB.
> 

I reshuffled the v2 series a bit to match your implied preference for
layout and rebased this on top of the end result. I may not have the
beans to absorb it before I quit for the evening, but I'll at least
queue it up overnight.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-03 15:28         ` Mel Gorman
@ 2013-07-03 18:46           ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-03 18:46 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 04:28:21PM +0100, Mel Gorman wrote:

> I reshuffled the v2 series a bit to match your implied preference for layout
> and rebased this on top of the end result. May not have the beans to
> absorb it before I quit for the evening but I'll at least queue it up
> overnight.

It probably caused that snafu that got you all tangled up with your v3 series
:-) Just my luck.

I couldn't find much difference on my SpecJBB runs -- in fact so little that
I'm beginning to think I'm doing something really wrong :/

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-02 18:17       ` Peter Zijlstra
@ 2013-07-06  6:44         ` Srikar Dronamraju
  -1 siblings, 0 replies; 124+ messages in thread
From: Srikar Dronamraju @ 2013-07-06  6:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Peter Zijlstra <peterz@infradead.org> [2013-07-02 20:17:32]:

> On Tue, Jul 02, 2013 at 05:36:55PM +0530, Srikar Dronamraju wrote:
> > Here, moving tasks this way doesn't update the schedstats at all.
> 
> Do you actually use schedstats? 
> 

Yes, I do use schedstats. Are there any plans to obsolete it?

It gave me good information about how many times we attempted load
balancing and how often the load balancing succeeded, especially across
domains.
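
As an aside, that kind of summary can be pulled out of /proc/schedstat
from userspace with something like the untested sketch below. The field
layout (eight load_balance() counters per idle type, three idle types,
immediately after the domain cpumask) is assumed from
Documentation/scheduler/sched-stats.txt and may differ between
schedstat versions:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[4096];
	FILE *f = fopen("/proc/schedstat", "r");

	if (!f) {
		perror("/proc/schedstat");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		char name[32], mask[256];
		unsigned long long v[24];
		unsigned long long calls, balanced;
		int n;

		/* Only the per-domain lines carry load_balance() stats */
		if (strncmp(line, "domain", 6))
			continue;

		n = sscanf(line, "%31s %255s "
			   "%llu %llu %llu %llu %llu %llu %llu %llu "
			   "%llu %llu %llu %llu %llu %llu %llu %llu "
			   "%llu %llu %llu %llu %llu %llu %llu %llu",
			   name, mask,
			   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5],
			   &v[6], &v[7], &v[8], &v[9], &v[10], &v[11],
			   &v[12], &v[13], &v[14], &v[15], &v[16], &v[17],
			   &v[18], &v[19], &v[20], &v[21], &v[22], &v[23]);
		if (n < 26)
			continue;

		/*
		 * lb_count is the 1st field of each idle-type group,
		 * lb_balanced the 2nd.
		 */
		calls = v[0] + v[8] + v[16];
		balanced = v[1] + v[9] + v[17];

		printf("%s %s: load_balance() calls=%llu no-op=%llu\n",
		       name, mask, calls, balanced);
	}

	fclose(f);
	return 0;
}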

-- 
Thanks and Regards
Srikar Dronamraju



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected
  2013-07-06  6:44         ` Srikar Dronamraju
@ 2013-07-06 10:47           ` Peter Zijlstra
  -1 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2013-07-06 10:47 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Sat, Jul 06, 2013 at 12:14:08PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2013-07-02 20:17:32]:
> 
> > On Tue, Jul 02, 2013 at 05:36:55PM +0530, Srikar Dronamraju wrote:
> > > Here, moving tasks this way doesn't update the schedstats at all.
> > 
> > Do you actually use schedstats? 
> > 
> 
> Yes, I do use schedstats. Are there any plans to obsolete it?

Not really, it's just something I've never used, and keeping the stats
correct made Mel's patch uglier, which makes me dislike them more ;-)

^ permalink raw reply	[flat|nested] 124+ messages in thread

end of thread, other threads:[~2013-07-06 10:48 UTC | newest]

Thread overview: 124+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-26 14:37 [PATCH 0/6] Basic scheduler support for automatic NUMA balancing Mel Gorman
2013-06-26 14:37 ` Mel Gorman
2013-06-26 14:38 ` [PATCH 1/8] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-26 14:38 ` [PATCH 2/8] sched: Track NUMA hinting faults on per-node basis Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-27 15:57   ` Peter Zijlstra
2013-06-27 15:57     ` Peter Zijlstra
2013-06-28 12:22     ` Mel Gorman
2013-06-28 12:22       ` Mel Gorman
2013-06-28  6:08   ` Srikar Dronamraju
2013-06-28  6:08     ` Srikar Dronamraju
2013-06-28  8:56     ` Peter Zijlstra
2013-06-28  8:56       ` Peter Zijlstra
2013-06-28 12:30     ` Mel Gorman
2013-06-28 12:30       ` Mel Gorman
2013-06-26 14:38 ` [PATCH 3/8] sched: Select a preferred node with the most numa hinting faults Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-28  6:14   ` Srikar Dronamraju
2013-06-28  6:14     ` Srikar Dronamraju
2013-06-28  8:59     ` Peter Zijlstra
2013-06-28  8:59       ` Peter Zijlstra
2013-06-28 10:24       ` Srikar Dronamraju
2013-06-28 10:24         ` Srikar Dronamraju
2013-06-28 12:33     ` Mel Gorman
2013-06-28 12:33       ` Mel Gorman
2013-06-26 14:38 ` [PATCH 4/8] sched: Update NUMA hinting faults once per scan Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-28  6:32   ` Srikar Dronamraju
2013-06-28  6:32     ` Srikar Dronamraju
2013-06-28  9:01     ` Peter Zijlstra
2013-06-28  9:01       ` Peter Zijlstra
2013-06-26 14:38 ` [PATCH 5/8] sched: Favour moving tasks towards the preferred node Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-27 14:52   ` Peter Zijlstra
2013-06-27 14:52     ` Peter Zijlstra
2013-06-27 14:53   ` Peter Zijlstra
2013-06-27 14:53     ` Peter Zijlstra
2013-06-28 13:00     ` Mel Gorman
2013-06-28 13:00       ` Mel Gorman
2013-06-27 16:01   ` Peter Zijlstra
2013-06-27 16:01     ` Peter Zijlstra
2013-06-28 13:01     ` Mel Gorman
2013-06-28 13:01       ` Mel Gorman
2013-06-27 16:11   ` Peter Zijlstra
2013-06-27 16:11     ` Peter Zijlstra
2013-06-28 13:45     ` Mel Gorman
2013-06-28 13:45       ` Mel Gorman
2013-06-28 15:10       ` Peter Zijlstra
2013-06-28 15:10         ` Peter Zijlstra
2013-06-28  8:11   ` Srikar Dronamraju
2013-06-28  8:11     ` Srikar Dronamraju
2013-06-28  9:04     ` Peter Zijlstra
2013-06-28  9:04       ` Peter Zijlstra
2013-06-28 10:07       ` Srikar Dronamraju
2013-06-28 10:07         ` Srikar Dronamraju
2013-06-28 10:24         ` Peter Zijlstra
2013-06-28 10:24           ` Peter Zijlstra
2013-06-28 13:51         ` Mel Gorman
2013-06-28 13:51           ` Mel Gorman
2013-06-28 17:14           ` Srikar Dronamraju
2013-06-28 17:14             ` Srikar Dronamraju
2013-06-28 17:34             ` Mel Gorman
2013-06-28 17:34               ` Mel Gorman
2013-06-28 17:44               ` Srikar Dronamraju
2013-06-28 17:44                 ` Srikar Dronamraju
2013-06-26 14:38 ` [PATCH 6/8] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-27 14:54   ` Peter Zijlstra
2013-06-27 14:54     ` Peter Zijlstra
2013-06-28 13:54     ` Mel Gorman
2013-06-28 13:54       ` Mel Gorman
2013-07-02 12:06   ` Srikar Dronamraju
2013-07-02 12:06     ` Srikar Dronamraju
2013-07-02 16:29     ` Mel Gorman
2013-07-02 16:29       ` Mel Gorman
2013-07-02 18:17     ` Peter Zijlstra
2013-07-02 18:17       ` Peter Zijlstra
2013-07-06  6:44       ` Srikar Dronamraju
2013-07-06  6:44         ` Srikar Dronamraju
2013-07-06 10:47         ` Peter Zijlstra
2013-07-06 10:47           ` Peter Zijlstra
2013-07-02 18:15   ` Peter Zijlstra
2013-07-02 18:15     ` Peter Zijlstra
2013-07-03  9:50     ` Peter Zijlstra
2013-07-03  9:50       ` Peter Zijlstra
2013-07-03 15:28       ` Mel Gorman
2013-07-03 15:28         ` Mel Gorman
2013-07-03 18:46         ` Peter Zijlstra
2013-07-03 18:46           ` Peter Zijlstra
2013-06-26 14:38 ` [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-27 14:56   ` Peter Zijlstra
2013-06-27 14:56     ` Peter Zijlstra
2013-06-28 14:00     ` Mel Gorman
2013-06-28 14:00       ` Mel Gorman
2013-06-28  7:00   ` Srikar Dronamraju
2013-06-28  7:00     ` Srikar Dronamraju
2013-06-28  9:36     ` Peter Zijlstra
2013-06-28  9:36       ` Peter Zijlstra
2013-06-28 10:12       ` Srikar Dronamraju
2013-06-28 10:12         ` Srikar Dronamraju
2013-06-28 10:33         ` Peter Zijlstra
2013-06-28 10:33           ` Peter Zijlstra
2013-06-28 14:29           ` Mel Gorman
2013-06-28 14:29             ` Mel Gorman
2013-06-28 15:12             ` Peter Zijlstra
2013-06-28 15:12               ` Peter Zijlstra
2013-06-26 14:38 ` [PATCH 8/8] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
2013-06-26 14:38   ` Mel Gorman
2013-06-27 14:59 ` [PATCH 0/6] Basic scheduler support for automatic NUMA balancing Peter Zijlstra
2013-06-27 14:59   ` Peter Zijlstra
2013-06-28 13:54 ` Srikar Dronamraju
2013-06-28 13:54   ` Srikar Dronamraju
2013-07-01  5:39   ` Srikar Dronamraju
2013-07-01  5:39     ` Srikar Dronamraju
2013-07-01  8:43     ` Mel Gorman
2013-07-01  8:43       ` Mel Gorman
2013-07-02  5:28       ` Srikar Dronamraju
2013-07-02  5:28         ` Srikar Dronamraju
2013-07-02  7:46   ` Peter Zijlstra
2013-07-02  7:46     ` Peter Zijlstra
2013-07-02  8:55     ` Peter Zijlstra
2013-07-02  8:55       ` Peter Zijlstra
