* [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3
@ 2013-07-05 23:08 Mel Gorman
  2013-07-05 23:08 ` [PATCH 01/15] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
                   ` (14 more replies)
  0 siblings, 15 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This continues to build on the previous feedback. The results are a mix of
gains and losses but when looking at the losses I think it's also important
to consider the reduced overhead when the patches are applied. I still
have not had the chance to closely review Peter's or Srikar's approach to
scheduling but the tests are queued to do a comparison.

Changelog since V2
o Reshuffle to match Peter's implied preference for layout
o Reshuffle to move private/shared split towards end of series to make it
  easier to evaluate the impact
o Use PID information to identify private accesses
o Set the floor for PTE scanning based on virtual address space scan rates
  instead of time
o Some locking improvements
o Do not preempt pinned tasks unless they are kernel threads

Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
  task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
  preferred node
o Laughably basic accounting of a compute overloaded node when selecting
  the preferred node.
o Applied review comments

This series integrates basic scheduler support for automatic NUMA balancing.
It borrows very heavily from Peter Zijlstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).

This is still far from complete and there are known performance gaps
between this series and manual binding (when that is possible). As before,
the intention is not to complete the work but to incrementally improve
mainline and preserve bisectability for any bug reports that crop up.
Unfortunately, in some cases performance may be worse. When that happens
it will have to be judged whether the system overhead is lower and, if so,
whether it is still an acceptable direction as a stepping stone to
something better.

Patch 1 adds sysctl documentation

Patch 2 tracks NUMA hinting faults per-task and per-node

Patches 3-5 select a preferred node at the end of a PTE scan based on which
	node incurred the highest number of NUMA faults. When the balancer
	is comparing two CPUs it will prefer to locate tasks on their
	preferred node.

Patch 6 reschedules a task when a preferred node is selected if it is not
	running on that node already. This avoids waiting for the scheduler
	to move the task slowly.

Patch 7 adds infrastructure to allow separate tracking of shared/private
	pages but treats all faults as if they are private accesses. Laying
	it out this way reduces churn later in the series when private
	fault detection is introduced

Patch 8 replaces the PTE scanning reset hammer and instead increases the
	scanning rate when an otherwise settled task changes its
	preferred node.

Patch 9 avoids some unnecessary allocation

Patch 10 sets the scan rate proportional to the size of the task being scanned.

Patches 11-12 kick away some training wheels and scan shared pages and small VMAs.

Patch 13 introduces private fault detection based on the PID of the faulting
	process and accounts for shared/private accesses differently

Patch 14 accounts for how many "preferred placed" tasks are running on a node
	and attempts to avoid overloading them. This patch is the primary
	candidate for replacement with proper load tracking of nodes. This patch
	is crude but acts as a basis for comparison

Patch 15 favours moving tasks towards nodes where more faults were incurred
	even if it is not the preferred node.

Testing on this is only partial as full tests take a long time to run. A
full specjbb for both single and multi takes over 4 hours. NPB D class
also takes a few hours. With all the kernels in question, it'll take a
weekend to churn through them so here are the shorter tests.

I tested 9 kernels using 3.9.0 as a baseline

o 3.9.0-vanilla			vanilla kernel with automatic numa balancing enabled
o 3.9.0-favorpref-v3   		Patches 1-9
o 3.9.0-scalescan-v3   		Patches 1-10
o 3.9.0-scanshared-v3   	Patches 1-12
o 3.9.0-splitprivate-v3   	Patches 1-13
o 3.9.0-accountpreferred-v3   	Patches 1-14
o 3.9.0-peterz-v3   		Patches 1-14 + Peter's scheduling patch
o 3.9.0-srikar-v3   		vanilla kernel + Srikar's scheduling patch
o 3.9.0-favorfaults-v3   	Patches 1-15

Note that Peter's patch has been rebased by me and acts as a replacement
for the crude per-node accounting. Srikar's patch was standalone and I made
no attempt to pick it apart and rebase it on top of the series.

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system. Only a limited number of clients are executed
to save on time.

specjbb
                        3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0
                      vanilla            favorpref-v3           scalescan-v3          scanshared-v3        splitprivate-v3    accountpreferred-v3              peterz-v3              srikar-v3         favorfaults-v3   
TPut 1      26099.00 (  0.00%)     23289.00 (-10.77%)     23343.00 (-10.56%)     24450.00 ( -6.32%)     24660.00 ( -5.51%)     24378.00 ( -6.59%)     23294.00 (-10.75%)     24990.00 ( -4.25%)     22938.00 (-12.11%)
TPut 7     187276.00 (  0.00%)    188696.00 (  0.76%)    188049.00 (  0.41%)    188734.00 (  0.78%)    189033.00 (  0.94%)    188507.00 (  0.66%)    187746.00 (  0.25%)    188660.00 (  0.74%)    189032.00 (  0.94%)
TPut 13    318028.00 (  0.00%)    337735.00 (  6.20%)    332076.00 (  4.42%)    325244.00 (  2.27%)    330248.00 (  3.84%)    338799.00 (  6.53%)    333955.00 (  5.01%)    303900.00 ( -4.44%)    340888.00 (  7.19%)
TPut 19    368547.00 (  0.00%)    427211.00 ( 15.92%)    416539.00 ( 13.02%)    383505.00 (  4.06%)    416156.00 ( 12.92%)    428810.00 ( 16.35%)    435828.00 ( 18.26%)    399560.00 (  8.41%)    444654.00 ( 20.65%)
TPut 25    377522.00 (  0.00%)    469175.00 ( 24.28%)    491030.00 ( 30.07%)    412740.00 (  9.33%)    475783.00 ( 26.03%)    463198.00 ( 22.69%)    504612.00 ( 33.66%)    419442.00 ( 11.10%)    524288.00 ( 38.88%)
TPut 31    347642.00 (  0.00%)    440729.00 ( 26.78%)    466510.00 ( 34.19%)    381921.00 (  9.86%)    453361.00 ( 30.41%)    408340.00 ( 17.46%)    476475.00 ( 37.06%)    410060.00 ( 17.95%)    501662.00 ( 44.30%)
TPut 37    313439.00 (  0.00%)    418485.00 ( 33.51%)    442592.00 ( 41.21%)    352373.00 ( 12.42%)    448875.00 ( 43.21%)    399340.00 ( 27.41%)    457167.00 ( 45.86%)    398125.00 ( 27.02%)    484381.00 ( 54.54%)
TPut 43    291958.00 (  0.00%)    385404.00 ( 32.01%)    386700.00 ( 32.45%)    336810.00 ( 15.36%)    412089.00 ( 41.15%)    366572.00 ( 25.56%)    418745.00 ( 43.43%)    335165.00 ( 14.80%)    438455.00 ( 50.18%)

First off, note what the shared/private split patch does. Once we start
scanning all pages there is a degradation in performance as the shared
page faults introduce noise to the statistics. Splitting the shared/private
faults restores the performance and the key task in the future is to use
this shared/private information for maximum benefit.
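
As a point of reference, here is a minimal sketch of the kind of check the
PID-based split in patch 13 relies on (illustrative only; the names and the
number of PID bits are assumptions, not the patch's code): a few bits of the
last faulting task's PID are remembered for the page and a later fault is
treated as private only if the same task faults again.

/*
 * Illustrative sketch: classify a NUMA hinting fault as private or shared
 * by comparing the low bits of the PID recorded at the last fault on the
 * page with the PID of the task faulting now. Names are hypothetical.
 */
#define LAST_PID_BITS	8
#define LAST_PID_MASK	((1 << LAST_PID_BITS) - 1)

static int fault_is_private(unsigned int last_pid_bits, unsigned int this_pid)
{
	/* Same (truncated) PID faulting again => likely a private access */
	return last_pid_bits == (this_pid & LAST_PID_MASK);
}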

Note that my account-preferred patch that limits the number of tasks that can
run on a node degrades performance in this case whereas Peter's patch improves
performance nicely.
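
For context, a rough sketch of what the crude accounting in patch 14 amounts
to (the names and threshold are assumptions, not the patch itself): each node
tracks how many running tasks prefer it and further preferred placements are
refused once that count reaches the number of CPUs on the node.

/*
 * Illustrative sketch of per-node "preferred placed" accounting. The
 * structure and helper below are assumptions for illustration only.
 */
struct node_compute {
	int nr_preferred_running;	/* tasks that prefer this node */
	int nr_cpus;			/* CPUs available on this node */
};

static int node_has_capacity(struct node_compute *nc)
{
	/* Crude: refuse new preferred placements once every CPU is claimed */
	return nc->nr_preferred_running < nc->nr_cpus;
}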

Note that favour-faults, which moves tasks towards nodes with more faults
or resists moving away from nodes with more faults, also improves
performance.
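
To make that concrete, a minimal sketch of the favour-faults comparison
(the helper name is an assumption, not the code in patch 15): during load
balancing a move is favoured when the task has recorded more faults on the
destination node than on the source node.

/*
 * Illustrative sketch: favour moving a task towards the node where it has
 * incurred more NUMA hinting faults, using the per-node numa_faults array
 * introduced earlier in the series. The helper name is hypothetical.
 */
static int move_gains_faults(unsigned long *numa_faults, int src_nid, int dst_nid)
{
	return numa_faults[dst_nid] > numa_faults[src_nid];
}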

Srikar's patch that considers just compute load does improve performance
from the vanilla kernel but not as much as the series does.

Results for this benchmark at least are very positive with indications
that I should ditch Patch 14 and work on Peter's version.

specjbb Peaks
                         3.9.0                      3.9.0               3.9.0               3.9.0               3.9.0               3.9.0               3.9.0               3.9.0               3.9.0
                       vanilla            favorpref-v3        scalescan-v3       scanshared-v3     splitprivate-v3    accountpreferred-v3        peterz-v3           srikar-v3      favorfaults-v3   
 Expctd Warehouse     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)
 Actual Warehouse     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)
 Actual Peak Bops 377522.00 (  0.00%) 469175.00 ( 24.28%) 491030.00 ( 30.07%) 412740.00 (  9.33%) 475783.00 ( 26.03%) 463198.00 ( 22.69%) 504612.00 ( 33.66%) 419442.00 ( 11.10%) 524288.00 ( 38.88%)

All kernels peaked at the same number of warehouses with the series
performing well overall with the same conclusion that Peter's version of
the compute node overload detection should be used.


               3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
             vanilla favorpref-v3   scalescan-v3   scanshared-v3   splitprivate-v3   accountpreferred-v3   peterz-v3   srikar-v3   favorfaults-v3   
User         5184.53     5210.17     5174.95     5166.97     5184.01     5185.70     5202.89     5197.41     5175.89
System         59.61       65.68       64.39       61.62       60.77       59.47       61.51       56.02       60.18
Elapsed       254.52      255.01      253.81      255.16      254.19      254.34      254.08      254.89      254.84

No major change.

                                 3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
                               vanilla favorpref-v3 scalescan-v3 scanshared-v3 splitprivate accountpref peterz-v3   srikar-v3 favorfaults-v3   
THP fault alloc                  33297       34087       33651       32943       35069       33473       34932       37053       32736
THP collapse alloc                   9          14          18          12          11          13          13          10          15
THP splits                           3           4           4           2           5           8           2           4           4
THP fault fallback                   0           0           0           0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0           0           0           0
Page migrate success           1773768     1769532     1420235     1360864     1310354     1423995     1367669     1927281     1327653
Page migrate failure                 0           0           0           0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0           0           0           0
Compaction cost                   1841        1836        1474        1412        1360        1478        1419        2000        1378
NUMA PTE updates              17461135    17386539    15022653    14480121    14335180    15379855    14428691    18202363    14282962
NUMA hint faults                 85873       77686       75782       79742       79048       90556       79064      178027       76533
NUMA hint local faults           27145       24279       24412       29548       31882       32952       29363      114382       29604
NUMA hint local percent             31          31          32          37          40          36          37          64          38
NUMA pages migrated            1773768     1769532     1420235     1360864     1310354     1423995     1367669     1927281     1327653
AutoNUMA cost                      585         543         511         525         520         587         522        1054         507

The series reduced the amount of PTE scanning and migrated less. Interestingly
the percentage of local faults is not changed much so even without comparing
it with an interleaved JVM, there is room for improvement there.

Srikar's patch behaviour is interesting. It updates roughly the same number
of PTEs but incurs more faults with a higher percentage of local faults
even though performance is worse overall. It does indicate that it might
have fared better if it had been rebased on top of the series and dealt with
just calculating compute node overloading as a potential alternative to
Peter's patch.


Next is the autonuma benchmark results. These were only run once so I have no
idea what the variance is. Obviously they could be run multiple times but with
this number of kernels we would die of old age waiting on the results.

autonumabench
                                          3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0
                                        vanilla       favorpref-v3          scalescan-v3         scanshared-v3       splitprivate-v3      accountpreferred-v3             peterz-v3             srikar-v3        favorfaults-v3   
User    NUMA01               52623.86 (  0.00%)    58607.67 (-11.37%)    56861.80 ( -8.05%)    51173.76 (  2.76%)    55995.75 ( -6.41%)    58891.91 (-11.91%)    53156.13 ( -1.01%)    42207.06 ( 19.79%)    59405.06 (-12.89%)
User    NUMA01_THEADLOCAL    17595.48 (  0.00%)    18613.09 ( -5.78%)    19832.77 (-12.72%)    19737.98 (-12.18%)    20600.88 (-17.08%)    18716.43 ( -6.37%)    18647.37 ( -5.98%)    17547.60 (  0.27%)    18129.29 ( -3.03%)
User    NUMA02                2043.84 (  0.00%)     2129.21 ( -4.18%)     2068.72 ( -1.22%)     2091.60 ( -2.34%)     1948.59 (  4.66%)     2072.05 ( -1.38%)     2035.36 (  0.41%)     2075.80 ( -1.56%)     2029.90 (  0.68%)
User    NUMA02_SMT            1057.11 (  0.00%)     1069.20 ( -1.14%)      992.14 (  6.15%)     1045.40 (  1.11%)      970.20 (  8.22%)     1021.87 (  3.33%)     1027.08 (  2.84%)      953.90 (  9.76%)      983.51 (  6.96%)
System  NUMA01                 414.17 (  0.00%)      377.36 (  8.89%)      338.80 ( 18.20%)      130.60 ( 68.47%)      115.62 ( 72.08%)      158.80 ( 61.66%)      116.45 ( 71.88%)      183.47 ( 55.70%)      404.15 (  2.42%)
System  NUMA01_THEADLOCAL      105.17 (  0.00%)       98.46 (  6.38%)       96.87 (  7.89%)      101.17 (  3.80%)      101.29 (  3.69%)       87.57 ( 16.73%)       94.89 (  9.77%)       95.30 (  9.38%)       77.63 ( 26.19%)
System  NUMA02                   9.36 (  0.00%)       11.21 (-19.76%)        8.92 (  4.70%)       10.64 (-13.68%)       10.02 ( -7.05%)        9.73 ( -3.95%)       10.57 (-12.93%)        6.46 ( 30.98%)       10.06 ( -7.48%)
System  NUMA02_SMT               3.54 (  0.00%)        4.04 (-14.12%)        2.59 ( 26.84%)        3.23 (  8.76%)        2.66 ( 24.86%)        3.19 (  9.89%)        3.70 ( -4.52%)        4.64 (-31.07%)        3.15 ( 11.02%)
Elapsed NUMA01                1201.52 (  0.00%)     1341.55 (-11.65%)     1304.61 ( -8.58%)     1173.59 (  2.32%)     1293.92 ( -7.69%)     1338.15 (-11.37%)     1258.95 ( -4.78%)     1008.45 ( 16.07%)     1356.31 (-12.88%)
Elapsed NUMA01_THEADLOCAL      393.91 (  0.00%)      416.46 ( -5.72%)      449.30 (-14.06%)      449.69 (-14.16%)      475.32 (-20.67%)      449.98 (-14.23%)      431.20 ( -9.47%)      399.82 ( -1.50%)      446.03 (-13.23%)
Elapsed NUMA02                  50.30 (  0.00%)       51.64 ( -2.66%)       49.70 (  1.19%)       52.03 ( -3.44%)       49.72 (  1.15%)       50.87 ( -1.13%)       49.59 (  1.41%)       50.65 ( -0.70%)       50.10 (  0.40%)
Elapsed NUMA02_SMT              58.48 (  0.00%)       54.57 (  6.69%)       61.05 ( -4.39%)       50.51 ( 13.63%)       59.38 ( -1.54%)       47.53 ( 18.72%)       55.17 (  5.66%)       50.95 ( 12.88%)       47.93 ( 18.04%)
CPU     NUMA01                4414.00 (  0.00%)     4396.00 (  0.41%)     4384.00 (  0.68%)     4371.00 (  0.97%)     4336.00 (  1.77%)     4412.00 (  0.05%)     4231.00 (  4.15%)     4203.00 (  4.78%)     4409.00 (  0.11%)
CPU     NUMA01_THEADLOCAL     4493.00 (  0.00%)     4492.00 (  0.02%)     4435.00 (  1.29%)     4411.00 (  1.83%)     4355.00 (  3.07%)     4178.00 (  7.01%)     4346.00 (  3.27%)     4412.00 (  1.80%)     4081.00 (  9.17%)
CPU     NUMA02                4081.00 (  0.00%)     4144.00 ( -1.54%)     4180.00 ( -2.43%)     4040.00 (  1.00%)     3939.00 (  3.48%)     4091.00 ( -0.25%)     4124.00 ( -1.05%)     4111.00 ( -0.74%)     4071.00 (  0.25%)
CPU     NUMA02_SMT            1813.00 (  0.00%)     1966.00 ( -8.44%)     1629.00 ( 10.15%)     2075.00 (-14.45%)     1638.00 (  9.65%)     2156.00 (-18.92%)     1868.00 ( -3.03%)     1881.00 ( -3.75%)     2058.00 (-13.51%)

numa01 had a rocky road through the series. On this machine it is an
adverse workload and interestingly favor faults fares worse with a large
increase in system CPU usage. Srikar's patch shows that this can be much
improved but as it is the adverse case, I am not inclined to condemn the
series and instead consider how the problem can be detected in the future.

numa01_threadlocal is interesting in that performance degraded. The
vanilla kernel was very likely running optimally already as this is an ideal
case. While it is possible this is a statistics error, it is far more
likely an impact due to the scan rate adaption because you can see the
bulk of the degradation was introduced in that patch.
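
For context, the adaption in patch 10 exists because with a fixed scan period
and scan size the time to cover a task's whole address space grows with its
size, so the effective sampling of the task depends on how big it is. A
trivial illustration of that arithmetic (names assumed, not kernel code):

/*
 * Illustrative arithmetic only: how long a fixed scan period and scan
 * size take to cover the whole of a task's address space.
 */
static unsigned long full_scan_time_ms(unsigned long task_size_mb,
				       unsigned long scan_size_mb,
				       unsigned long scan_period_ms)
{
	unsigned long windows = (task_size_mb + scan_size_mb - 1) / scan_size_mb;

	return windows * scan_period_ms;
}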

numa02 showed no improvement but it should also be already running close
to as quickly as possible.

numa02_smt is interesting though. Overall the series did very well. In the
single jvm specjbb case, Peter's scheduling patch did much better than mine.
In this test, mine performed better and it would be worthwhile figuring
out why that is and if both can be merged in some sensible fashion.

                                 3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
                               vanilla favorpref-v3   scalescan-v3   scanshared-v3   splitprivate-v3   accountpreferred-v3   peterz-v3   srikar-v3   favorfaults-v3   
THP fault alloc                  14325       13843       14457       14618       14165       14814       14629       16792       13308
THP collapse alloc                   6           8           2           6           3           8           4           7           7
THP splits                           4           5           2           2           2           2           3           4           2
THP fault fallback                   0           0           0           0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0           0           0           0
Page migrate success           9020528     5072181     4719346     5360917     5129210     4968068     4550697     7006284     4864309
Page migrate failure                 0           0           0           0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0           0           0           0
Compaction cost                   9363        5264        4898        5564        5324        5156        4723        7272        5049
NUMA PTE updates             119292401    71557939    70633856    71043501    83412737    77186984    80719110   118076970    84957883
NUMA hint faults                755901      452863      207502      216838      249153      207811      237083      608391      247585
NUMA hint local faults          595478      365390      125907      121476      136318      110254      140220      478856      137721
NUMA hint local percent             78          80          60          56          54          53          59          78          55
NUMA pages migrated            9020528     5072181     4719346     5360917     5129210     4968068     4550697     7006284     4864309
AutoNUMA cost                     4785        2861        1621        1683        1927        1673        1836        4001        1925

As all the tests are mashed together it is not possible to draw specific
conclusions on each testcase.  However, in general the series is doing a lot
less work with PTE updates, faults and so on. The percentage of local faults
varies a lot but this data does not indicate which test case is affected.


I also ran SpecJBB with THP enabled and one JVM running per NUMA node in
the system.

specjbb
                          3.9.0                 3.9.0                 3.9.0                 3.9.0                      3.9.0                  3.9.0                 3.9.0                 3.9.0                 3.9.0
                        vanilla       favorpref-v3          scalescan-v3         scanshared-v3               splitprivate-v3    accountpreferred-v3             peterz-v3             srikar-v3        favorfaults-v3   
Mean   1      30640.75 (  0.00%)     29752.00 ( -2.90%)     30475.00 ( -0.54%)     31206.50 (  1.85%)     31056.75 (  1.36%)     31131.75 (  1.60%)     31093.00 (  1.48%)     30659.25 (  0.06%)     31105.50 (  1.52%)
Mean   10    136983.25 (  0.00%)    140038.00 (  2.23%)    133589.75 ( -2.48%)    145615.50 (  6.30%)    143027.50 (  4.41%)    144137.25 (  5.22%)    129712.75 ( -5.31%)    138238.25 (  0.92%)    129383.00 ( -5.55%)
Mean   19    124005.25 (  0.00%)    119630.25 ( -3.53%)    125307.50 (  1.05%)    125454.50 (  1.17%)    124757.75 (  0.61%)    122126.50 ( -1.52%)    111949.75 ( -9.72%)    121013.25 ( -2.41%)    120418.25 ( -2.89%)
Mean   28    114672.00 (  0.00%)    106671.00 ( -6.98%)    115164.50 (  0.43%)    112532.25 ( -1.87%)    114629.50 ( -0.04%)    116116.00 (  1.26%)    105418.00 ( -8.07%)    112967.00 ( -1.49%)    108037.50 ( -5.79%)
Mean   37    110916.50 (  0.00%)    102696.50 ( -7.41%)    111580.50 (  0.60%)    107410.75 ( -3.16%)    104110.75 ( -6.14%)    106203.25 ( -4.25%)    108752.25 ( -1.95%)    108677.50 ( -2.02%)    104177.00 ( -6.08%)
Mean   46    110139.25 (  0.00%)    103473.75 ( -6.05%)    106920.75 ( -2.92%)    109062.00 ( -0.98%)    107684.50 ( -2.23%)    100882.75 ( -8.40%)    103070.50 ( -6.42%)    102208.50 ( -7.20%)    104402.50 ( -5.21%)
Stddev 1       1002.06 (  0.00%)      1151.12 (-14.88%)       948.37 (  5.36%)       714.89 ( 28.66%)      1455.54 (-45.25%)       697.63 ( 30.38%)      1082.10 ( -7.99%)      1507.51 (-50.44%)       737.14 ( 26.44%)
Stddev 10      4656.47 (  0.00%)      4974.97 ( -6.84%)      6502.35 (-39.64%)      6645.90 (-42.72%)      5881.13 (-26.30%)      3828.53 ( 17.78%)      5799.04 (-24.54%)      4297.12 (  7.72%)     10885.11 (-133.76%)
Stddev 19      2578.12 (  0.00%)      1975.51 ( 23.37%)      2563.47 (  0.57%)      6254.55 (-142.60%)      3401.11 (-31.92%)      2539.02 (  1.52%)      8162.13 (-216.59%)      1532.98 ( 40.54%)      8479.33 (-228.90%)
Stddev 28      4123.69 (  0.00%)      2562.60 ( 37.86%)      3188.89 ( 22.67%)      6831.77 (-65.67%)      1378.53 ( 66.57%)      5196.71 (-26.02%)      3942.17 (  4.40%)      8060.48 (-95.47%)      7675.13 (-86.12%)
Stddev 37      2301.94 (  0.00%)      4126.45 (-79.26%)      3255.11 (-41.41%)      5492.87 (-138.62%)      4489.53 (-95.03%)      5610.45 (-143.73%)      5047.08 (-119.25%)      1621.31 ( 29.57%)     10608.90 (-360.87%)
Stddev 46      8317.91 (  0.00%)      8073.31 (  2.94%)      7647.06 (  8.07%)      6361.55 ( 23.52%)      3940.12 ( 52.63%)      8185.37 (  1.59%)      8261.33 (  0.68%)      3822.28 ( 54.05%)     10296.79 (-23.79%)
TPut   1     122563.00 (  0.00%)    119008.00 ( -2.90%)    121900.00 ( -0.54%)    124826.00 (  1.85%)    124227.00 (  1.36%)    124527.00 (  1.60%)    124372.00 (  1.48%)    122637.00 (  0.06%)    124422.00 (  1.52%)
TPut   10    547933.00 (  0.00%)    560152.00 (  2.23%)    534359.00 ( -2.48%)    582462.00 (  6.30%)    572110.00 (  4.41%)    576549.00 (  5.22%)    518851.00 ( -5.31%)    552953.00 (  0.92%)    517532.00 ( -5.55%)
TPut   19    496021.00 (  0.00%)    478521.00 ( -3.53%)    501230.00 (  1.05%)    501818.00 (  1.17%)    499031.00 (  0.61%)    488506.00 ( -1.52%)    447799.00 ( -9.72%)    484053.00 ( -2.41%)    481673.00 ( -2.89%)
TPut   28    458688.00 (  0.00%)    426684.00 ( -6.98%)    460658.00 (  0.43%)    450129.00 ( -1.87%)    458518.00 ( -0.04%)    464464.00 (  1.26%)    421672.00 ( -8.07%)    451868.00 ( -1.49%)    432150.00 ( -5.79%)
TPut   37    443666.00 (  0.00%)    410786.00 ( -7.41%)    446322.00 (  0.60%)    429643.00 ( -3.16%)    416443.00 ( -6.14%)    424813.00 ( -4.25%)    435009.00 ( -1.95%)    434710.00 ( -2.02%)    416708.00 ( -6.08%)
TPut   46    440557.00 (  0.00%)    413895.00 ( -6.05%)    427683.00 ( -2.92%)    436248.00 ( -0.98%)    430738.00 ( -2.23%)    403531.00 ( -8.40%)    412282.00 ( -6.42%)    408834.00 ( -7.20%)    417610.00 ( -5.21%)

This shows a mix of gains and regressions with big differences in the
variation introduced by the favorfaults patch. The stddev is large enough
that the performance may be flat or at least comparable after the series
is applied.  I know that performance falls massively short of what is
achieved if the four JVMs are hard-bound to their nodes. Improving this
requires that groups of related threads be identified and moved towards the
same node. There are a variety of ways something like that could be
implemented although the devil will be in the details for any of them.

o When selecting node with most faults weight the faults by the number
  of tasks sharing the same address space. Would not work for multi-process
  applications sharing data though.

o If the pid is not matching on a given page then converge for memory as
  normal. However, in the load balancer favour moving related tasks with
  the task incurring more local faults having greater weight.

o When selecting a CPU on another node to run on, select a task B to swap
  with. Task B should not already be running on its preferred node and
  ideally it should improve its locality when migrated to the new node
  (a sketch of this idea follows below).

etc. Handling any part of the problem has different costs in storage
and complexity. It's a case of working through it and given the likely
complexity, I think it deserves a dedicated series.
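
As a rough sketch of the task-swap idea above (every name here is
hypothetical; this is an illustration, not proposed code): task A is only
swapped onto B's node if B is not already on its preferred node and both
tasks would see at least as many of their recorded faults locally after
the swap.

/*
 * Illustrative sketch of selecting a swap candidate. All names are
 * hypothetical and the structure is simplified for illustration.
 */
struct swap_candidate {
	int cur_nid;			/* node the task runs on now */
	int preferred_nid;		/* node with the most recorded faults */
	unsigned long *numa_faults;	/* per-node fault counts */
};

static int good_swap(struct swap_candidate *a, struct swap_candidate *b)
{
	if (b->cur_nid == b->preferred_nid)
		return 0;	/* B is already well placed, leave it alone */

	/* Both tasks should gain (or at least not lose) fault locality */
	return a->numa_faults[b->cur_nid] >= a->numa_faults[a->cur_nid] &&
	       b->numa_faults[a->cur_nid] >= b->numa_faults[b->cur_nid];
}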


               3.9.0       3.9.0      3.9.0      3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
             vanilla favorpref-v3 scalescan-v3 scanshared-v3 splpriv  accountpref   peterz-v3   srikar-v3   favorfaults-v3   
User        52899.04    53210.81    53042.21    53328.70    52918.56    53603.58    53063.66    52851.59    52829.96
System        250.42      224.78      201.53      193.12      205.82      214.38      209.86      228.30      211.12
Elapsed      1199.72     1204.36     1197.77     1208.94     1199.23     1223.66     1206.86     1198.51     1205.00

Interestingly, the performance is comparable but system CPU usage is
lower, which is something.

                                 3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
                               vanilla favorpref-v3 scalescan scanshared-v3 splitpriv-v3 accountpref peterz-v3   srikar-v3   favorfaults-v3   
THP fault alloc                  65188       66097       67667       66195       68326       69270       67150       60141       63869
THP collapse alloc                  97         104         103         101          95          91         104          99         103
THP splits                          38          34          35          29          38          39          33          36          31
THP fault fallback                   0           0           0           0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0           0           0           0
Page migrate success          14583860    10507899     8023771     7251275     8175290     8268183     8477546    12511430     8686134
Page migrate failure                 0           0           0           0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0           0           0           0
Compaction cost                  15138       10907        8328        7526        8485        8582        8799       12986        9016
NUMA PTE updates             128327468   102978689    76226351    74280333    75229069    77110305    78175561   128407433    80020924
NUMA hint faults               2103190     1745470     1039953     1342325     1344201     1448015     1328751     2068061     1499687
NUMA hint local faults          734136      641359      334299      452808      388403      417083      517108      875830      617246
NUMA hint local percent             34          36          32          33          28          28          38          42          41
NUMA pages migrated           14583860    10507899     8023771     7251275     8175290     8268183     8477546    12511430     8686134
AutoNUMA cost                    11691        9647        5885        7369        7402        7936        7352       11476        8223

PTE scan activity is much reduced by the series with comparable
percentages of local NUMA hinting faults.

Longer tests are running but this is already a tonne of data and it's well
past Beer O'Clock on a Friday. Based on this I think the series mostly
improves matters (exception being NUMA01_THEADLOCAL). The multi-jvm case
needs more work to identify groups of related tasks and migrate them together
but I think that is beyond the scope of this series and is a separate
issue with its own complexities to consider. There is a question whether to
replace Patch 14 with Peter's patch or mash them together. We could always
start with Patch 14 as a comparison point until Peter's version is complete.

Thoughts?

 Documentation/sysctl/kernel.txt   |  68 +++++++
 include/linux/migrate.h           |   7 +-
 include/linux/mm.h                |  59 +++---
 include/linux/mm_types.h          |   7 +-
 include/linux/page-flags-layout.h |  28 +--
 include/linux/sched.h             |  23 ++-
 include/linux/sched/sysctl.h      |   1 -
 kernel/sched/core.c               |  60 ++++++-
 kernel/sched/fair.c               | 368 ++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h              |  17 ++
 kernel/sysctl.c                   |  14 +-
 mm/huge_memory.c                  |   9 +-
 mm/memory.c                       |  17 +-
 mm/mempolicy.c                    |  10 +-
 mm/migrate.c                      |  21 +--
 mm/mm_init.c                      |  18 +-
 mm/mmzone.c                       |  12 +-
 mm/mprotect.c                     |   4 +-
 mm/page_alloc.c                   |   4 +-
 19 files changed, 610 insertions(+), 137 deletions(-)

-- 
1.8.1.4



* [PATCH 01/15] mm: numa: Document automatic NUMA balancing sysctls
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
@ 2013-07-05 23:08 ` Mel Gorman
  2013-07-05 23:08 ` [PATCH 02/15] sched: Track NUMA hinting faults on per-node basis Mel Gorman
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccd4258..0fe678c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,72 @@ utilize.
 
 ==============================================================
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running.  Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases.  The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases.  The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a tasks scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
 osrelease, ostype & version:
 
 # cat osrelease
-- 
1.8.1.4
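
As a quick illustration of how the sysctls documented above interact (an
editorial sketch, not from the patch): the effective scan rate is roughly
the scan size divided by the scan delay, so halving the delay doubles the
rate at which pages are unmapped for NUMA hinting faults.

/*
 * Illustrative only: approximate scan rate implied by the scan size and
 * scan delay sysctls. The helper name is hypothetical.
 */
static unsigned long scan_rate_mb_per_sec(unsigned long scan_size_mb,
					  unsigned long scan_delay_ms)
{
	return scan_delay_ms ? (scan_size_mb * 1000UL) / scan_delay_ms : 0;
}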



* [PATCH 02/15] sched: Track NUMA hinting faults on per-node basis
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
  2013-07-05 23:08 ` [PATCH 01/15] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
@ 2013-07-05 23:08 ` Mel Gorman
  2013-07-05 23:08 ` [PATCH 03/15] sched: Select a preferred node with the most numa hinting faults Mel Gorman
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch tracks which nodes NUMA hinting faults were incurred on.  Greater
weight is given if the pages were to be migrated on the understanding
that such faults cost significantly more. If a task has paid the cost of
migrating data to that node then in the future it would be preferred if the
task did not migrate the data again unnecessarily. This information is later
used to schedule a task on the node incurring the most NUMA hinting faults.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 12 +++++++++++-
 kernel/sched/sched.h  | 11 +++++++++++
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..72861b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1505,6 +1505,8 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+
+	unsigned long *numa_faults;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..f332ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
+	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
@@ -1853,6 +1854,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		task_numa_free(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..904fd6f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (!sched_feat_numa(NUMA))
 		return;
 
-	/* FIXME: Allocate task-specific structure for placement policy here */
+	/* Allocate buffer to track faults on a per-node basis */
+	if (unlikely(!p->numa_faults)) {
+		int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		if (!p->numa_faults)
+			return;
+	}
 
 	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
@@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
 			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
+
+	/* Record the fault, double the weight if pages were migrated */
+	p->numa_faults[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..c5f773d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -503,6 +503,17 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline void task_numa_free(struct task_struct *p)
+{
+	kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
1.8.1.4



* [PATCH 03/15] sched: Select a preferred node with the most numa hinting faults
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
  2013-07-05 23:08 ` [PATCH 01/15] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
  2013-07-05 23:08 ` [PATCH 02/15] sched: Track NUMA hinting faults on per-node basis Mel Gorman
@ 2013-07-05 23:08 ` Mel Gorman
  2013-07-05 23:08 ` [PATCH 04/15] sched: Update NUMA hinting faults once per scan Mel Gorman
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 17 +++++++++++++++--
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 72861b4..ba46a64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,7 @@ struct task_struct {
 	struct callback_head numa_work;
 
 	unsigned long *numa_faults;
+	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f332ec0..ed4e785 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 904fd6f..c0bee41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -793,7 +793,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq;
+	int seq, nid, max_nid = 0;
+	unsigned long max_faults = 0;
 
 	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
 		return;
@@ -802,7 +803,19 @@ static void task_numa_placement(struct task_struct *p)
 		return;
 	p->numa_scan_seq = seq;
 
-	/* FIXME: Scheduling placement policy hints go here */
+	/* Find the node with the highest number of faults */
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		unsigned long faults = p->numa_faults[nid];
+		p->numa_faults[nid] >>= 1;
+		if (faults > max_faults) {
+			max_faults = faults;
+			max_nid = nid;
+		}
+	}
+
+	/* Update the tasks preferred node if necessary */
+	if (max_faults && max_nid != p->numa_preferred_nid)
+		p->numa_preferred_nid = max_nid;
 }
 
 /*
-- 
1.8.1.4



* [PATCH 04/15] sched: Update NUMA hinting faults once per scan
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
                   ` (2 preceding siblings ...)
  2013-07-05 23:08 ` [PATCH 03/15] sched: Select a preferred node with the most numa hinting faults Mel Gorman
@ 2013-07-05 23:08 ` Mel Gorman
  2013-07-05 23:08 ` [PATCH 05/15] sched: Favour moving tasks towards the preferred node Mel Gorman
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

NUMA hinting faults counts and placement decisions are both recorded in the
same array which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artifically high
count due to very recent faulting activity and may create a bouncing effect.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h | 13 +++++++++++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 16 +++++++++++++---
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba46a64..42f9818 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1506,7 +1506,20 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	/*
+	 * Exponential decaying average of faults on a per-node basis.
+	 * Scheduling placement decisions are made based on the these counts.
+	 * The values remain static for the duration of a PTE scan
+	 */
 	unsigned long *numa_faults;
+
+	/*
+	 * numa_faults_buffer records faults per node during the current
+	 * scan window. When the scan completes, the counts in numa_faults
+	 * decay and these values are copied.
+	 */
+	unsigned long *numa_faults_buffer;
+
 	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ed4e785..0bd541c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1596,6 +1596,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
+	p->numa_faults_buffer = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0bee41..8dc9ff9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -805,8 +805,14 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
-		unsigned long faults = p->numa_faults[nid];
+		unsigned long faults;
+
+		/* Decay existing window and copy faults since last scan */
 		p->numa_faults[nid] >>= 1;
+		p->numa_faults[nid] += p->numa_faults_buffer[nid];
+		p->numa_faults_buffer[nid] = 0;
+
+		faults = p->numa_faults[nid];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -832,9 +838,13 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * nr_node_ids;
 
-		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		/* numa_faults and numa_faults_buffer share the allocation */
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
+
+		BUG_ON(p->numa_faults_buffer);
+		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
 	}
 
 	/*
@@ -848,7 +858,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults[node] += pages << migrated;
+	p->numa_faults_buffer[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
-- 
1.8.1.4



* [PATCH 05/15] sched: Favour moving tasks towards the preferred node
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
                   ` (3 preceding siblings ...)
  2013-07-05 23:08 ` [PATCH 04/15] sched: Update NUMA hinting faults once per scan Mel Gorman
@ 2013-07-05 23:08 ` Mel Gorman
  2013-07-05 23:08 ` [PATCH 06/15] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch favours moving tasks towards the preferred NUMA node when it
has just been selected. Ideally this is self-reinforcing as the longer
the task runs on that node, the more faults it should incur causing
task_numa_placement to keep the task running on that node. In reality a
big weakness is that the node's CPUs can be overloaded and it would be more
efficient to queue tasks on an idle node and migrate to the new node. This
would require additional smarts in the balancer so for now the balancer
will simply prefer to place the task on the preferred node for a number of
PTE scans, which is controlled by the numa_balancing_settle_count sysctl.
Once the settle_count number of scans has completed the scheduler is free
to place the task on an alternative node if the load is imbalanced.

[srikar@linux.vnet.ibm.com: Fixed statistics]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt |  8 +++++-
 include/linux/sched.h           |  1 +
 kernel/sched/core.c             |  3 ++-
 kernel/sched/fair.c             | 60 ++++++++++++++++++++++++++++++++++++++---
 kernel/sysctl.c                 |  7 +++++
 5 files changed, 73 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 0fe678c..246b128 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,7 +374,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
 
 ==============================================================
 
@@ -418,6 +419,11 @@ scanned for a given scan.
 numa_balancing_scan_period_reset is a blunt instrument that controls how
 often a tasks scan delay is reset to detect sudden changes in task behaviour.
 
+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42f9818..82a6136 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0bd541c..5e02507 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1591,7 +1591,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_migrate_seq = 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
@@ -6141,6 +6141,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8dc9ff9..5055bf9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -791,6 +791,15 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -802,6 +811,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_migrate_seq++;
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
@@ -820,8 +830,10 @@ static void task_numa_placement(struct task_struct *p)
 	}
 
 	/* Update the tasks preferred node if necessary */
-	if (max_faults && max_nid != p->numa_preferred_nid)
+	if (max_faults && max_nid != p->numa_preferred_nid) {
 		p->numa_preferred_nid = max_nid;
+		p->numa_migrate_seq = 0;
+	}
 }
 
 /*
@@ -3898,6 +3910,35 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (p->numa_preferred_nid == dst_nid)
+		return true;
+
+	return false;
+}
+#else
+static inline bool migrate_improves_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -3946,11 +3987,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) task is cache cold, or
-	 * 2) too many balance attempts have failed.
+	 * 1) destination numa is preferred
+	 * 2) task is cache cold, or
+	 * 3) too many balance attempts have failed.
 	 */
-
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+
+	if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+		if (tsk_cache_hot) {
+			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+			schedstat_inc(p, se.statistics.nr_forced_migrations);
+		}
+#endif
+		return 1;
+	}
+
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..263486f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -393,6 +393,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_settle_count",
+		.data           = &sysctl_numa_balancing_settle_count,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.8.1.4



* [PATCH 06/15] sched: Reschedule task on preferred NUMA node once selected
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
                   ` (4 preceding siblings ...)
  2013-07-05 23:08 ` [PATCH 05/15] sched: Favour moving tasks towards the preferred node Mel Gorman
@ 2013-07-05 23:08 ` Mel Gorman
  2013-07-06 10:38   ` Peter Zijlstra
  2013-07-05 23:08 ` [PATCH 07/15] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults Mel Gorman
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

A preferred node is selected based on the node that incurred the most
NUMA hinting faults. There is no guarantee that the task is running on
that node at the time, so this patch reschedules the task to run on the
most idle CPU of the preferred node as soon as it is selected. This
avoids waiting for the load balancer to make a decision.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  | 17 ++++++++++++++++
 kernel/sched/fair.c  | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  1 +
 3 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e02507..e4c1832 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -992,6 +992,23 @@ struct migration_arg {
 
 static int migration_cpu_stop(void *data);
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Migrate current task p to target_cpu */
+int migrate_task_to(struct task_struct *p, int target_cpu)
+{
+	struct migration_arg arg = { p, target_cpu };
+	int curr_cpu = task_cpu(p);
+
+	if (curr_cpu == target_cpu)
+		return 0;
+
+	if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
+		return -EINVAL;
+
+	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5055bf9..5a01dcb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -800,6 +800,40 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static unsigned long weighted_cpuload(const int cpu);
+
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int i, idlest_cpu = this_cpu;
+
+	BUG_ON(cpu_to_node(this_cpu) == nid);
+
+	rcu_read_lock();
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		load = weighted_cpuload(i);
+
+		if (load < min_load) {
+			/*
+			 * Kernel threads can be preempted. For others, do
+			 * not preempt if running on their preferred node
+			 * or pinned.
+			 */
+			struct task_struct *p = cpu_rq(i)->curr;
+			if ((p->flags & PF_KTHREAD) ||
+			    (p->numa_preferred_nid != nid && p->nr_cpus_allowed > 1)) {
+				min_load = load;
+				idlest_cpu = i;
+			}
+		}
+	}
+	rcu_read_unlock();
+
+	return idlest_cpu;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -829,10 +863,29 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/* Update the tasks preferred node if necessary */
+	/*
+	 * Record the preferred node as the node with the most faults,
+	 * requeue the task to be running on the idlest CPU on the
+	 * preferred node and reset the scanning rate to recheck
+	 * the working set placement.
+	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
+		int preferred_cpu;
+
+		/*
+		 * If the task is not on the preferred node then find the most
+		 * idle CPU to migrate to.
+		 */
+		preferred_cpu = task_cpu(p);
+		if (cpu_to_node(preferred_cpu) != max_nid) {
+			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+							     max_nid);
+		}
+
+		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 0;
+		migrate_task_to(p, preferred_cpu);
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5f773d..795346d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,6 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
+extern int migrate_task_to(struct task_struct *p, int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.1.4



* [PATCH 07/15] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
                   ` (5 preceding siblings ...)
  2013-07-05 23:08 ` [PATCH 06/15] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
@ 2013-07-05 23:08 ` Mel Gorman
  2013-07-05 23:08 ` [PATCH 08/15] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared.  This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.
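
To illustrate the layout, here is a minimal userspace sketch of the
two-counters-per-node scheme and the decay applied at each placement pass
(the fault counts are made-up sample values; the kernel equivalent is
task_faults_idx() and the loop in task_numa_placement() below):

#include <stdio.h>

#define NR_NODES 4

/* Two counters per node: index 0 is shared, index 1 is private */
static int task_faults_idx(int nid, int priv)
{
	return 2 * nid + priv;
}

int main(void)
{
	unsigned long faults[2 * NR_NODES] = { 0 };
	/* Faults recorded since the last scan: 60 private on node 1, 10 on node 3 */
	unsigned long buffer[2 * NR_NODES] = { 0, 0, 0, 60, 0, 0, 0, 10 };
	int nid, priv;

	for (nid = 0; nid < NR_NODES; nid++) {
		for (priv = 0; priv < 2; priv++) {
			int i = task_faults_idx(nid, priv);

			/* Decay the old window and fold in the new samples */
			faults[i] >>= 1;
			faults[i] += buffer[i];
			buffer[i] = 0;
		}
		printf("node %d: private=%lu shared=%lu\n", nid,
		       faults[task_faults_idx(nid, 1)],
		       faults[task_faults_idx(nid, 0)]);
	}
	return 0;
}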

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  5 +++--
 kernel/sched/fair.c   | 33 ++++++++++++++++++++++++---------
 mm/huge_memory.c      |  7 ++++---
 mm/memory.c           |  9 ++++++---
 4 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 82a6136..b81195e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1600,10 +1600,11 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages,
+				   bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5a01dcb..0f3f01c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -834,6 +834,11 @@ find_idlest_cpu_node(int this_cpu, int nid)
 	return idlest_cpu;
 }
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -850,13 +855,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
 
-		faults = p->numa_faults[nid];
+			/* Decay existing window, copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
+
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -892,16 +903,20 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv;
 
 	if (!sched_feat_numa(NUMA))
 		return;
 
+	/* For now, do not attempt to detect private/shared accesses */
+	priv = 1;
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
 		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
@@ -909,7 +924,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -923,7 +938,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults_buffer[node] += pages << migrated;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..7cd7114 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int target_nid;
+	int target_nid, last_nid;
 	int current_nid = -1;
 	bool migrated;
 
@@ -1307,6 +1307,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
+	last_nid = page_nid_last(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		put_page(page);
@@ -1332,7 +1333,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!migrated)
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
 	return 0;
 
 check_same:
@@ -1347,7 +1348,7 @@ clear_pmdnuma:
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+		task_numa_fault(last_nid, current_nid, HPAGE_PMD_NR, false);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index ba94dec..c28bf52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int current_nid = -1, last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3566,6 +3566,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
+	last_nid = page_nid_last(page);
 	current_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3586,7 +3587,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+		task_numa_fault(last_nid, current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3602,6 +3603,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3654,6 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * migrated to.
 		 */
 		curr_nid = local_nid;
+		last_nid = page_nid_last(page);
 		target_nid = numa_migrate_prep(page, vma, addr,
 					       page_to_nid(page));
 		if (target_nid == -1) {
@@ -3666,7 +3669,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		migrated = migrate_misplaced_page(page, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		task_numa_fault(last_nid, curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.1.4



* [PATCH 08/15] sched: Increase NUMA PTE scanning when a new preferred node is selected
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
                   ` (6 preceding siblings ...)
  2013-07-05 23:08 ` [PATCH 07/15] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults Mel Gorman
@ 2013-07-05 23:08 ` Mel Gorman
  2013-07-05 23:08 ` [PATCH 09/15] sched: Check current->mm before allocating NUMA faults Mel Gorman
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

The NUMA PTE scan period is reset every sysctl_numa_balancing_scan_period_reset
milliseconds in case the workload changes phase. This is crude and the resets
are clearly visible in graphs even when the workload is already balanced. This
patch removes the reset and instead increases the scan rate when the preferred
node of an otherwise settled task changes, so that the new placement decision
is rechecked. In the optimistic expectation that the placement decisions will
be correct, the maximum period between scans is also increased to reduce the
overhead of automatic NUMA balancing.
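
As a rough illustration of the new behaviour, a standalone sketch (not
kernel code) of the adjustment made when a new preferred node is chosen;
the settle threshold of 3 and the 100ms minimum scan period are the
defaults at this point in the series, the 800ms starting period is an
arbitrary example:

#include <stdio.h>

#define SETTLE_COUNT	3
#define SCAN_PERIOD_MIN	100	/* ms */

/* New preferred node chosen: speed up scanning only for a settled task */
static unsigned int on_new_preferred_node(unsigned int scan_period,
					  int migrate_seq)
{
	if (migrate_seq >= SETTLE_COUNT)
		scan_period = scan_period / 2 > SCAN_PERIOD_MIN ?
			      scan_period / 2 : SCAN_PERIOD_MIN;
	return scan_period;
}

int main(void)
{
	printf("settled task, 800ms period   -> %ums\n",
	       on_new_preferred_node(800, 5));	/* halved to 400ms */
	printf("unsettled task, 800ms period -> %ums\n",
	       on_new_preferred_node(800, 1));	/* unchanged */
	return 0;
}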

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 +++--------
 include/linux/mm_types.h        |  3 ---
 include/linux/sched/sysctl.h    |  1 -
 kernel/sched/core.c             |  1 -
 kernel/sched/fair.c             | 27 ++++++++++++---------------
 kernel/sysctl.c                 |  7 -------
 6 files changed, 15 insertions(+), 35 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 246b128..a275042 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -373,15 +373,13 @@ guarantee. If the target workload is already bound to NUMA nodes then this
 feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
-numa_balancing_settle_count sysctls.
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
 
 ==============================================================
 
 numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_size_mb
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
 
 Automatic NUMA balancing scans tasks address space and unmaps pages to
 detect if pages are properly placed or if the data should be migrated to a
@@ -416,9 +414,6 @@ effectively controls the minimum scanning rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
 
-numa_balancing_scan_period_reset is a blunt instrument that controls how
-often a tasks scan delay is reset to detect sudden changes in task behaviour.
-
 numa_balancing_settle_count is how many scan periods must complete before
 the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..de70964 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -421,9 +421,6 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
-	/* numa_next_reset is when the PTE scanner period will be reset */
-	unsigned long numa_next_reset;
-
 	/* Restart point for scanning and setting pte_numa */
 	unsigned long numa_scan_offset;
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..10d16c4f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -47,7 +47,6 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
-extern unsigned int sysctl_numa_balancing_scan_period_reset;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_settle_count;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e4c1832..02db92a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1602,7 +1602,6 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_next_scan = jiffies;
-		p->mm->numa_next_reset = jiffies;
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f3f01c..3c69b599 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -782,8 +782,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * numa task sample period in ms
  */
 unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
-unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -882,6 +881,7 @@ static void task_numa_placement(struct task_struct *p)
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		int preferred_cpu;
+		int old_migrate_seq = p->numa_migrate_seq;
 
 		/*
 		 * If the task is not on the preferred node then find the most
@@ -897,6 +897,16 @@ static void task_numa_placement(struct task_struct *p)
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 0;
 		migrate_task_to(p, preferred_cpu);
+
+		/*
+		 * If preferred nodes changes frequently then the scan rate
+		 * will be continually high. Mitigate this by increasing the
+		 * scan rate only if the task was settled.
+		 */
+		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
+			p->numa_scan_period = max(p->numa_scan_period >> 1,
+					sysctl_numa_balancing_scan_period_min);
+		}
 	}
 }
 
@@ -993,19 +1003,6 @@ void task_numa_work(struct callback_head *work)
 	}
 
 	/*
-	 * Reset the scan period if enough time has gone by. Objective is that
-	 * scanning will be reduced if pages are properly placed. As tasks
-	 * can enter different phases this needs to be re-examined. Lacking
-	 * proper tracking of reference behaviour, this blunt hammer is used.
-	 */
-	migrate = mm->numa_next_reset;
-	if (time_after(now, migrate)) {
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
-		next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
-		xchg(&mm->numa_next_reset, next_scan);
-	}
-
-	/*
 	 * Enforce maximal scan/migration frequency..
 	 */
 	migrate = mm->numa_next_scan;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 263486f..1fcbc68 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -373,13 +373,6 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname	= "numa_balancing_scan_period_reset",
-		.data		= &sysctl_numa_balancing_scan_period_reset,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
-	{
 		.procname	= "numa_balancing_scan_period_max_ms",
 		.data		= &sysctl_numa_balancing_scan_period_max,
 		.maxlen		= sizeof(unsigned int),
-- 
1.8.1.4



* [PATCH 09/15] sched: Check current->mm before allocating NUMA faults
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
                   ` (7 preceding siblings ...)
  2013-07-05 23:08 ` [PATCH 08/15] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
@ 2013-07-05 23:08 ` Mel Gorman
  2013-07-05 23:08 ` [PATCH 10/15] sched: Set the scan rate proportional to the size of the task being scanned Mel Gorman
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

task_numa_placement checks current->mm, but only after the buffers for
NUMA hinting faults have already been uselessly allocated. Move the check
earlier.

[peterz@infradead.org: Identified the problem]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c69b599..aee3e0b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -843,8 +843,6 @@ static void task_numa_placement(struct task_struct *p)
 	int seq, nid, max_nid = 0;
 	unsigned long max_faults = 0;
 
-	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
-		return;
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
 		return;
@@ -921,6 +919,10 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!sched_feat_numa(NUMA))
 		return;
 
+	/* for example, ksmd faulting in a user's mm */
+	if (!p->mm)
+		return;
+
 	/* For now, do not attempt to detect private/shared accesses */
 	priv = 1;
 
-- 
1.8.1.4



* [PATCH 10/15] sched: Set the scan rate proportional to the size of the task being scanned
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
                   ` (8 preceding siblings ...)
  2013-07-05 23:08 ` [PATCH 09/15] sched: Check current->mm before allocating NUMA faults Mel Gorman
@ 2013-07-05 23:08 ` Mel Gorman
  2013-07-05 23:08 ` [PATCH 11/15] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

The NUMA PTE scan rate is controlled by a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size sysctls. This scan rate is independent of the
size of the task and, as an aside, it is further complicated by the fact
that numa_balancing_scan_size controls how many pages are marked pte_numa
and not how much virtual memory is scanned.

In combination, it is almost impossible to meaningfully tune the min and
max scan periods, and reasoning about performance is complex when the time
to complete a full scan is partially a function of the task's memory size.
This patch alters the semantics of the min and max tunables so that they
control the length of time it takes to complete a scan of a task's virtual
address space. Conceptually this is a lot easier to understand. There is a
"sanity" check to ensure the scan rate is never extremely fast, based on
the amount of virtual memory that should be scanned in a second. The
default of 2.5GB per second seems arbitrary but it was chosen so that the
maximum scan rate after the patch roughly matches the maximum scan rate
before the patch was applied.
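
As a worked example, here is a standalone sketch of the arithmetic
introduced below in task_nr_scan_windows(), task_scan_min() and
task_scan_max(); the 10GB task size is an arbitrary assumption, the other
constants are the defaults in this patch:

#include <stdio.h>

#define SCAN_SIZE_MB	256	/* numa_balancing_scan_size_mb default */
#define SCAN_MIN_MS	1000	/* time to scan the whole task, minimum */
#define SCAN_MAX_MS	600000	/* time to scan the whole task, maximum */
#define MAX_SCAN_WINDOW	2560	/* never scan more than 2.5GB/sec */

int main(void)
{
	unsigned long task_mb = 10 * 1024;	/* assumed 10GB of virtual memory */
	unsigned int windows = (task_mb + SCAN_SIZE_MB - 1) / SCAN_SIZE_MB;
	unsigned int floor = 1000 / (MAX_SCAN_WINDOW / SCAN_SIZE_MB);
	unsigned int scan_min = SCAN_MIN_MS / windows;

	if (scan_min < floor)
		scan_min = floor;	/* the 2.5GB/sec sanity check */

	/* 40 windows of 256MB, scanned at least every 100ms, at most every 15s */
	printf("windows=%u min period=%ums max period=%ums\n",
	       windows, scan_min, SCAN_MAX_MS / windows);
	return 0;
}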

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 ++++---
 include/linux/sched.h           |  1 +
 kernel/sched/fair.c             | 72 +++++++++++++++++++++++++++++++++++------
 3 files changed, 70 insertions(+), 14 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a275042..f38d4f4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -401,15 +401,16 @@ workload pattern changes and minimises performance impact due to remote
 memory accesses. These sysctls control the thresholds for scan delays and
 the number of pages scanned.
 
-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a task's virtual memory. It effectively controls the maximum scanning
+rate for each task.
 
 numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
 when it initially forks.
 
-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a task's virtual memory. It effectively controls the minimum scanning
+rate for each task.
 
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b81195e..d44fbc6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1504,6 +1504,7 @@ struct task_struct {
 	int numa_scan_seq;
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
+	unsigned int numa_scan_period_max;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aee3e0b..66306c7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -779,10 +779,12 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the task's virtual memory size and
+ * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_max = 600000;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -790,6 +792,46 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+	unsigned long nr_vm_pages = 0;
+	unsigned long nr_scan_pages;
+
+	nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+	nr_vm_pages = p->mm->total_vm;
+	if (!nr_vm_pages)
+		nr_vm_pages = nr_scan_pages;
+
+	nr_vm_pages = round_up(nr_vm_pages, nr_scan_pages);
+	return nr_vm_pages / nr_scan_pages;
+}
+
+/* For sanity's sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */
+#define MAX_SCAN_WINDOW 2560
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+	unsigned int scan, floor;
+	unsigned int windows = 1;
+
+	if (sysctl_numa_balancing_scan_size < MAX_SCAN_WINDOW)
+		windows = MAX_SCAN_WINDOW / sysctl_numa_balancing_scan_size;
+	floor = 1000 / windows;
+
+	scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+	return max_t(unsigned int, floor, scan);
+}
+
+static unsigned int task_scan_max(struct task_struct *p)
+{
+	unsigned int smin = task_scan_min(p);
+	unsigned int smax;
+
+	/* Watch for min being lower than max due to floor calculations */
+	smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+	return max(smin, smax);
+}
+
 /*
  * Once a preferred node is selected the scheduler balancer will prefer moving
  * a task to that node for sysctl_numa_balancing_settle_count number of PTE
@@ -848,6 +890,7 @@ static void task_numa_placement(struct task_struct *p)
 		return;
 	p->numa_scan_seq = seq;
 	p->numa_migrate_seq++;
+	p->numa_scan_period_max = task_scan_max(p);
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
@@ -903,7 +946,7 @@ static void task_numa_placement(struct task_struct *p)
 		 */
 		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
 			p->numa_scan_period = max(p->numa_scan_period >> 1,
-					sysctl_numa_balancing_scan_period_min);
+					task_scan_min(p));
 		}
 	}
 }
@@ -944,7 +987,7 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	 * This is reset periodically in case of phase changes
 	 */
         if (!migrated)
-		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
+		p->numa_scan_period = min(p->numa_scan_period_max,
 			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
@@ -970,6 +1013,7 @@ void task_numa_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
+	unsigned long nr_pte_updates = 0;
 	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -1011,8 +1055,10 @@ void task_numa_work(struct callback_head *work)
 	if (time_before(now, migrate))
 		return;
 
-	if (p->numa_scan_period == 0)
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+	if (p->numa_scan_period == 0) {
+		p->numa_scan_period_max = task_scan_max(p);
+		p->numa_scan_period = task_scan_min(p);
+	}
 
 	next_scan = now + msecs_to_jiffies(p->numa_scan_period);
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
@@ -1051,7 +1097,15 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			pages -= change_prot_numa(vma, start, end);
+			nr_pte_updates += change_prot_numa(vma, start, end);
+
+			/*
+			 * Scan sysctl_numa_balancing_scan_size but ensure that
+			 * at least one PTE is updated so that unused virtual
+			 * address space is quickly skipped.
+			 */
+			if (nr_pte_updates)
+				pages -= (end - start) >> PAGE_SHIFT;
 
 			start = end;
 			if (pages <= 0)
@@ -1098,7 +1152,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 
 	if (now - curr->node_stamp > period) {
 		if (!curr->node_stamp)
-			curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+			curr->numa_scan_period = task_scan_min(curr);
 		curr->node_stamp = now;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
-- 
1.8.1.4



* [PATCH 11/15] mm: numa: Scan pages with elevated page_mapcount
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
                   ` (9 preceding siblings ...)
  2013-07-05 23:08 ` [PATCH 10/15] sched: Set the scan rate proportional to the size of the task being scanned Mel Gorman
@ 2013-07-05 23:08 ` Mel Gorman
  2013-07-05 23:08 ` [PATCH 12/15] sched: Remove check that skips small VMAs Mel Gorman
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Currently automatic NUMA balancing is unable to distinguish between
false-shared and private pages except by ignoring pages with an elevated
page_mapcount entirely. This avoids shared pages bouncing between the
nodes of the tasks using them, but it also throws away quite a lot of data.

This patch kicks away the training wheels in preparation for the
shared/private fault detection added later in the series. The ordering is
deliberate so that the impact of the shared/private detection can be easily
measured. Note that the patch does not migrate shared, file-backed pages
within VMAs marked VM_EXEC as these are generally shared library pages.
Migrating such pages is not beneficial as there is an expectation that they
are read-shared between caches and that iTLB and iCache pressure is
generally low.
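
The new restriction reduces to a single predicate; a minimal standalone
sketch of that decision follows (the struct fields are simplified stand-ins
for the real page and vma state inspected in migrate_misplaced_page()):

#include <stdbool.h>
#include <stdio.h>

struct fake_page { int mapcount; bool file_backed; };
struct fake_vma  { bool vm_exec; };

/*
 * Only pages that look like shared library text (file-backed, mapped by
 * several processes, in an executable mapping) are refused migration.
 */
static bool skip_migration(const struct fake_page *page,
			   const struct fake_vma *vma)
{
	return page->mapcount != 1 && page->file_backed && vma->vm_exec;
}

int main(void)
{
	struct fake_page lib  = { .mapcount = 12, .file_backed = true };
	struct fake_page anon = { .mapcount = 12, .file_backed = false };
	struct fake_vma exec  = { .vm_exec = true };

	printf("shared library text: %s\n",
	       skip_migration(&lib, &exec) ? "skip" : "migrate");
	printf("shared anonymous:    %s\n",
	       skip_migration(&anon, &exec) ? "skip" : "migrate");
	return 0;
}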

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |  7 ++++---
 mm/memory.c             |  4 ++--
 mm/migrate.c            | 17 ++++++-----------
 mm/mprotect.c           |  4 +---
 4 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..e7e26af 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -92,11 +92,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+				  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+					 struct vm_area_struct *vma, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index c28bf52..b06022a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3581,7 +3581,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, target_nid);
+	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated)
 		current_nid = target_nid;
 
@@ -3666,7 +3666,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		/* Migrate to the requested node */
 		pte_unmap_unlock(pte, ptl);
-		migrated = migrate_misplaced_page(page, target_nid);
+		migrated = migrate_misplaced_page(page, vma, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
 		task_numa_fault(last_nid, curr_nid, 1, migrated);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3bbaf5d..23f8122 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1579,7 +1579,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+			   int node)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
@@ -1587,10 +1588,11 @@ int migrate_misplaced_page(struct page *page, int node)
 	LIST_HEAD(migratepages);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
+	 * Don't migrate file pages that are mapped in multiple processes
+	 * with execute permissions as they are probably shared libraries.
 	 */
-	if (page_mapcount(page) != 1)
+	if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+	    (vma->vm_flags & VM_EXEC))
 		goto out;
 
 	/*
@@ -1641,13 +1643,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int page_lru = page_is_file_cache(page);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
-	 */
-	if (page_mapcount(page) != 1)
-		goto out_dropref;
-
-	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..cacc64a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					if (last_nid != this_nid)
 						all_same_node = false;
 
-					/* only check non-shared pages */
-					if (!pte_numa(oldpte) &&
-					    page_mapcount(page) == 1) {
+					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
 					}
-- 
1.8.1.4



* [PATCH 12/15] sched: Remove check that skips small VMAs
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
                   ` (10 preceding siblings ...)
  2013-07-05 23:08 ` [PATCH 11/15] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
@ 2013-07-05 23:08 ` Mel Gorman
  2013-07-05 23:09 ` [PATCH 13/15] sched: Set preferred NUMA node based on number of private faults Mel Gorman
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:08 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

task_numa_work skips small VMAs. At the time the logic was introduced it
reduced the scanning overhead, which was considerable. It is a dubious hack
at best. It would make much more sense to cache where faults have been
observed and only rescan those regions during subsequent PTE scans. Remove
this hack as motivation to do it properly in the future.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 66306c7..47276a3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1089,10 +1089,6 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma))
 			continue;
 
-		/* Skip small VMAs. They are not likely to be of relevance */
-		if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
-			continue;
-
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
1.8.1.4



* [PATCH 13/15] sched: Set preferred NUMA node based on number of private faults
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
                   ` (11 preceding siblings ...)
  2013-07-05 23:08 ` [PATCH 12/15] sched: Remove check that skips small VMAs Mel Gorman
@ 2013-07-05 23:09 ` Mel Gorman
  2013-07-06 10:41   ` Peter Zijlstra
  2013-07-06 10:44   ` Peter Zijlstra
  2013-07-05 23:09 ` [PATCH 14/15] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
  2013-07-05 23:09 ` [PATCH 15/15] sched: Favour moving tasks towards nodes that incurred more faults Mel Gorman
  14 siblings, 2 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:09 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If they are treated
identically there is a risk that shared pages bounce between nodes depending
on the order in which they are referenced by tasks. Ultimately what is
desirable is that task-private pages remain local to the task while shared
pages are interleaved between sharing tasks running on different nodes to
give good average performance. This patch assumes that multi-threaded or
multi-process applications partition their data and that, in the general
case, private accesses are more important for cpu->memory locality. Also,
no new infrastructure is required to treat private pages properly, but
interleaving for shared pages requires additional infrastructure.

To detect private accesses the pid of the last accessing task is required,
but the storage requirements are high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits of the last accessing task's pid in the page flags
alongside the node information. Collisions will occur but it is better than
depending on the node information alone. The node information is used to
determine if a page needs to migrate while the pid information is used to
detect private/shared accesses. The preferred NUMA node is selected based
on where the maximum number of approximately private faults were measured.
Shared faults are not taken into consideration for a few reasons.

First, if there are many tasks sharing the page then they will all move
towards the same node. The node will become compute overloaded and the
tasks will be scheduled away later, only to bounce back again. Alternatively
the sharing tasks would simply bounce around nodes because the fault
information is effectively noise. Either way, accounting for shared faults
the same as private faults can result in lower performance overall.

The second reason is based on a hypothetical workload with a small number
of very important, heavily accessed private pages but a large shared array.
The shared array would dominate the number of faults and its node would be
selected as the preferred node even though it is the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page, making the fault information unreliable.
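
Returning to the encoding, a minimal standalone sketch of the nid+pid
packing and the private/shared classification used by this patch follows
(the 2-bit NID width assumes a 4-node machine; the kernel derives it from
NODES_SHIFT, and the helper names mirror the ones added below):

#include <stdio.h>

#define LAST__PID_SHIFT	8
#define LAST__PID_MASK	((1 << LAST__PID_SHIFT) - 1)
#define LAST__NID_SHIFT	2			/* assumed 4-node machine */
#define LAST__NID_MASK	((1 << LAST__NID_SHIFT) - 1)

static int nid_pid_to_nidpid(int nid, int pid)
{
	return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) |
	       (pid & LAST__PID_MASK);
}

static int nidpid_to_pid(int nidpid) { return nidpid & LAST__PID_MASK; }
static int nidpid_to_nid(int nidpid)
{
	return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
}

int main(void)
{
	int last = nid_pid_to_nidpid(2, 4135);	/* stored on the previous fault */
	int this_pid = 4135, other_pid = 9001;

	/* Private if the low pid bits of the faulting task match the stored bits */
	printf("nid=%d same task  -> %s\n", nidpid_to_nid(last),
	       (this_pid & LAST__PID_MASK) == nidpid_to_pid(last) ?
	       "private" : "shared");
	printf("nid=%d other task -> %s\n", nidpid_to_nid(last),
	       (other_pid & LAST__PID_MASK) == nidpid_to_pid(last) ?
	       "private" : "shared");
	return 0;
}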

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h                | 59 ++++++++++++++++++++++++---------------
 include/linux/mm_types.h          |  4 +--
 include/linux/page-flags-layout.h | 28 +++++++++++--------
 kernel/sched/fair.c               | 12 +++++---
 mm/huge_memory.c                  | 10 +++----
 mm/memory.c                       | 16 +++++------
 mm/mempolicy.c                    | 10 +++++--
 mm/migrate.c                      |  4 +--
 mm/mm_init.c                      | 18 ++++++------
 mm/mmzone.c                       | 12 ++++----
 mm/page_alloc.c                   |  4 +--
 11 files changed, 103 insertions(+), 74 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e2091b8..569beec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -582,11 +582,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
  * sets it, so none of the operations on it need to be atomic.
  */
 
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NID_PGOFF		(ZONES_PGOFF - LAST_NID_WIDTH)
+#define LAST_NIDPID_PGOFF	(ZONES_PGOFF - LAST_NIDPID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -596,7 +596,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NID_PGSHIFT	(LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
+#define LAST_NIDPID_PGSHIFT	(LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -618,7 +618,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NID_MASK		((1UL << LAST_NID_WIDTH) - 1)
+#define LAST_NIDPID_MASK	((1UL << LAST_NIDPID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -662,48 +662,63 @@ static inline int page_to_nid(const struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int nid_pid_to_nidpid(int nid, int pid)
 {
-	return xchg(&page->_last_nid, nid);
+	return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
 }
 
-static inline int page_nid_last(struct page *page)
+static inline int nidpid_to_pid(int nidpid)
 {
-	return page->_last_nid;
+	return nidpid & LAST__PID_MASK;
 }
-static inline void page_nid_reset_last(struct page *page)
+
+static inline int nidpid_to_nid(int nidpid)
+{
+	return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+}
+
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+static inline int page_nidpid_xchg_last(struct page *page, int nid)
+{
+	return xchg(&page->_last_nidpid, nid);
+}
+
+static inline int page_nidpid_last(struct page *page)
+{
+	return page->_last_nidpid;
+}
+static inline void page_nidpid_reset_last(struct page *page)
 {
-	page->_last_nid = -1;
+	page->_last_nidpid = -1;
 }
 #else
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
 {
-	return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+	return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
 }
 
-extern int page_nid_xchg_last(struct page *page, int nid);
+extern int page_nidpid_xchg_last(struct page *page, int nidpid);
 
-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
 {
-	int nid = (1 << LAST_NID_SHIFT) - 1;
+	int nid = (1 << LAST_NIDPID_SHIFT) - 1;
 
-	page->flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
-	page->flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+	page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+	page->flags |= (nid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
 }
-#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
 #else
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
 {
 	return page_to_nid(page);
 }
 
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
 {
 	return page_to_nid(page);
 }
 
-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
 {
 }
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de70964..4137f67 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
 	void *shadow;
 #endif
 
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-	int _last_nid;
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+	int _last_nidpid;
 #endif
 }
 /*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 93506a1..02bc918 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -38,10 +38,10 @@
  * The last is when there is insufficient space in page->flags and a separate
  * lookup is necessary.
  *
- * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |          ... | FLAGS |
- *         " plus space for last_nid: |       NODE     | ZONE | LAST_NID ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE |          ... | FLAGS |
- *         " plus space for last_nid: | SECTION | NODE | ZONE | LAST_NID ... | FLAGS |
+ * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |             ... | FLAGS |
+ *      " plus space for last_nidpid: |       NODE     | ZONE | LAST_NIDPID ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
+ *      " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
  * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
  */
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -62,15 +62,21 @@
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#define LAST_NID_SHIFT NODES_SHIFT
+#define LAST__PID_SHIFT 8
+#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)
+
+#define LAST__NID_SHIFT NODES_SHIFT
+#define LAST__NID_MASK  ((1 << LAST__NID_SHIFT)-1)
+
+#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
 #else
-#define LAST_NID_SHIFT 0
+#define LAST_NIDPID_SHIFT 0
 #endif
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NID_WIDTH LAST_NID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
 #else
-#define LAST_NID_WIDTH 0
+#define LAST_NIDPID_WIDTH 0
 #endif
 
 /*
@@ -81,8 +87,8 @@
 #define NODE_NOT_IN_PAGE_FLAGS
 #endif
 
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NID_WIDTH == 0
-#define LAST_NID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
+#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
 #endif
 
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 47276a3..5933e24 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -954,10 +954,10 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_nid, int node, int pages, bool migrated)
+void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
-	int priv;
+	int priv, last_pid;
 
 	if (!sched_feat_numa(NUMA))
 		return;
@@ -966,8 +966,12 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!p->mm)
 		return;
 
-	/* For now, do not attempt to detect private/shared accesses */
-	priv = 1;
+	/*
+	 * First accesses are treated as private, otherwise consider accesses
+	 * to be private if the accessing pid has not changed
+	 */
+	last_pid = nidpid_to_pid(last_nidpid);
+	priv = (last_pid == -1) ? 1 : ((p->pid & LAST__PID_MASK) == last_pid);
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7cd7114..efded83 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int target_nid, last_nid;
+	int target_nid, last_nidpid;
 	int current_nid = -1;
 	bool migrated;
 
@@ -1307,7 +1307,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
-	last_nid = page_nid_last(page);
+	last_nidpid = page_nidpid_last(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		put_page(page);
@@ -1333,7 +1333,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!migrated)
 		goto check_same;
 
-	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
+	task_numa_fault(last_nidpid, target_nid, HPAGE_PMD_NR, true);
 	return 0;
 
 check_same:
@@ -1348,7 +1348,7 @@ clear_pmdnuma:
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (current_nid != -1)
-		task_numa_fault(last_nid, current_nid, HPAGE_PMD_NR, false);
+		task_numa_fault(last_nidpid, current_nid, HPAGE_PMD_NR, false);
 	return 0;
 }
 
@@ -1640,7 +1640,7 @@ static void __split_huge_page_refcount(struct page *page)
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
-		page_nid_xchg_last(page_tail, page_nid_last(page));
+		page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index b06022a..9ebad7e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@
 
 #include "internal.h"
 
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nid.
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1, last_nid;
+	int current_nid = -1, last_nidpid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3566,7 +3566,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
-	last_nid = page_nid_last(page);
+	last_nidpid = page_nidpid_last(page);
 	current_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3587,7 +3587,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(last_nid, current_nid, 1, migrated);
+		task_numa_fault(last_nidpid, current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3603,7 +3603,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
-	int last_nid;
+	int last_nidpid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3656,7 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * migrated to.
 		 */
 		curr_nid = local_nid;
-		last_nid = page_nid_last(page);
+		last_nidpid = page_nidpid_last(page);
 		target_nid = numa_migrate_prep(page, vma, addr,
 					       page_to_nid(page));
 		if (target_nid == -1) {
@@ -3669,7 +3669,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		migrated = migrate_misplaced_page(page, vma, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(last_nid, curr_nid, 1, migrated);
+		task_numa_fault(last_nidpid, curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..1eaccd2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2288,9 +2288,12 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int last_nid;
+		int last_nidpid;
+		int last_pid;
+		int this_nidpid;
 
 		polnid = numa_node_id();
+		this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
 
 		/*
 		 * Multi-stage node selection is used in conjunction
@@ -2313,8 +2316,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * it less likely we act on an unlikely task<->page
 		 * relation.
 		 */
-		last_nid = page_nid_xchg_last(page, polnid);
-		if (last_nid != polnid)
+		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
+		last_pid = nidpid_to_pid(last_nidpid);
+		if (last_pid != -1 && nidpid_to_nid(last_nidpid) != polnid)
 			goto out;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 23f8122..01d653d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1478,7 +1478,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
 	if (newpage)
-		page_nid_xchg_last(newpage, page_nid_last(page));
+		page_nidpid_xchg_last(newpage, page_nidpid_last(page));
 
 	return newpage;
 }
@@ -1655,7 +1655,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	if (!new_page)
 		goto out_fail;
 
-	page_nid_xchg_last(new_page, page_nid_last(page));
+	page_nidpid_xchg_last(new_page, page_nidpid_last(page));
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index c280a02..eecdc64 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -69,26 +69,26 @@ void __init mminit_verify_pageflags_layout(void)
 	unsigned long or_mask, add_mask;
 
 	shift = 8 * sizeof(unsigned long);
-	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NID_SHIFT;
+	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastnid %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
-		LAST_NID_WIDTH,
+		LAST_NIDPID_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
-		"Section %d Node %d Zone %d Lastnid %d\n",
+		"Section %d Node %d Zone %d Lastnidpid %d\n",
 		SECTIONS_SHIFT,
 		NODES_SHIFT,
 		ZONES_SHIFT,
-		LAST_NID_SHIFT);
+		LAST_NIDPID_SHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
-		"Section %lu Node %lu Zone %lu Lastnid %lu\n",
+		"Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
 		(unsigned long)SECTIONS_PGSHIFT,
 		(unsigned long)NODES_PGSHIFT,
 		(unsigned long)ZONES_PGSHIFT,
-		(unsigned long)LAST_NID_PGSHIFT);
+		(unsigned long)LAST_NIDPID_PGSHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
 		"Node/Zone ID: %lu -> %lu\n",
 		(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -100,9 +100,9 @@ void __init mminit_verify_pageflags_layout(void)
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
 		"Node not in page flags");
 #endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
-		"Last nid not in page flags");
+		"Last nidpid not in page flags");
 #endif
 
 	if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 2ac0afb..89b3b7e 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -98,19 +98,19 @@ void lruvec_init(struct lruvec *lruvec)
 }
 
 #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NID_NOT_IN_PAGE_FLAGS)
-int page_nid_xchg_last(struct page *page, int nid)
+int page_nidpid_xchg_last(struct page *page, int nidpid)
 {
 	unsigned long old_flags, flags;
-	int last_nid;
+	int last_nidpid;
 
 	do {
 		old_flags = flags = page->flags;
-		last_nid = page_nid_last(page);
+		last_nidpid = page_nidpid_last(page);
 
-		flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
-		flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+		flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+		flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
 	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
 
-	return last_nid;
+	return last_nidpid;
 }
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..f7c9c0f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -613,7 +613,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	page_nid_reset_last(page);
+	page_nidpid_reset_last(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3910,7 +3910,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		page_mapcount_reset(page);
-		page_nid_reset_last(page);
+		page_nidpid_reset_last(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 14/15] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
                   ` (12 preceding siblings ...)
  2013-07-05 23:09 ` [PATCH 13/15] sched: Set preferred NUMA node based on number of private faults Mel Gorman
@ 2013-07-05 23:09 ` Mel Gorman
  2013-07-06 10:46   ` Peter Zijlstra
  2013-07-05 23:09 ` [PATCH 15/15] sched: Favour moving tasks towards nodes that incurred more faults Mel Gorman
  14 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:09 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

It is preferred that tasks always run local to their memory but that is
not optimal if the node is compute overloaded and the task is failing to
get access to a CPU. In that case NUMA balancing would compete with the
load balancer, with one moving tasks off the node and the other moving
them back.

Ultimately the compute load of each node will have to be calculated and
minimised alongside the number of remote accesses until the optimal
balance point is reached. Begin this process by simply accounting for the
number of tasks that are running on their preferred node. When deciding
where to place a task, do not place it on a node that is already running
at least twice as many preferred-placement tasks as it has CPUs.
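
As a concrete illustration of that rule, a standalone sketch of the
threshold this patch applies (twice the CPU count); the helper name and
the numbers in the comment are made up for the example:

/*
 * Sketch only: a node counts as compute overloaded once the tasks
 * running on their preferred node there reach twice its CPU count,
 * e.g. a 4-CPU node with 7 such tasks is fine, with 8 it is skipped.
 */
static bool example_node_overloaded(int nr_cpus, int nr_preferred_running)
{
	return nr_preferred_running >= nr_cpus * 2;
}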

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  | 34 ++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c  | 49 +++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  5 +++++
 3 files changed, 82 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 02db92a..13b9068 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6112,6 +6112,40 @@ static struct sched_domain_topology_level default_topology[] = {
 
 static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+#ifdef CONFIG_NUMA_BALANCING
+void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
+{
+	struct rq *rq;
+	unsigned long flags;
+	bool on_rq, running;
+
+	/*
+	 * Dequeue task before updating preferred_nid so
+	 * rq->nr_preferred_running is accurate
+	 */
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	/* Update the preferred nid and migrate task if possible */
+	p->numa_preferred_nid = nid;
+	p->numa_migrate_seq = 0;
+
+	/* Requeue task if necessary */
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
+
+#endif
+
 #ifdef CONFIG_NUMA
 
 static int sched_domains_numa_levels;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5933e24..c303ba6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -777,6 +777,18 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_preferred_running +=
+			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_preferred_running -=
+			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
@@ -880,6 +892,21 @@ static inline int task_faults_idx(int nid, int priv)
 	return 2 * nid + priv;
 }
 
+/* Returns true if the given node is compute overloaded */
+static bool sched_numa_overloaded(int nid)
+{
+	int nr_cpus = 0;
+	int nr_preferred = 0;
+	int i;
+
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		nr_cpus++;
+		nr_preferred += cpu_rq(i)->nr_preferred_running;
+	}
+
+	return nr_preferred >= nr_cpus << 1;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -908,7 +935,7 @@ static void task_numa_placement(struct task_struct *p)
 
 		/* Find maximum private faults */
 		faults = p->numa_faults[task_faults_idx(nid, 1)];
-		if (faults > max_faults) {
+		if (faults > max_faults && !sched_numa_overloaded(nid)) {
 			max_faults = faults;
 			max_nid = nid;
 		}
@@ -934,9 +961,7 @@ static void task_numa_placement(struct task_struct *p)
 							     max_nid);
 		}
 
-		/* Update the preferred nid and migrate task if possible */
-		p->numa_preferred_nid = max_nid;
-		p->numa_migrate_seq = 0;
+		sched_setnuma(p, max_nid, preferred_cpu);
 		migrate_task_to(p, preferred_cpu);
 
 		/*
@@ -1165,6 +1190,14 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1174,8 +1207,10 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_enqueue(rq_of(cfs_rq), task_of(se));
 		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	}
 #endif
 	cfs_rq->nr_running++;
 }
@@ -1186,8 +1221,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
+	}
 	cfs_rq->nr_running--;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 795346d..1d7c0fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -433,6 +433,10 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long nr_preferred_running;
+#endif
+
 	u64 rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
@@ -504,6 +508,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu);
 extern int migrate_task_to(struct task_struct *p, int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 15/15] sched: Favour moving tasks towards nodes that incurred more faults
  2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
                   ` (13 preceding siblings ...)
  2013-07-05 23:09 ` [PATCH 14/15] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
@ 2013-07-05 23:09 ` Mel Gorman
  14 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-05 23:09 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

The scheduler already favours moving tasks towards their preferred node
but does nothing special when the destination is any other node. This
patch favours moving a task towards a destination node if more NUMA
hinting faults were recorded on it than on the source node. Similarly,
if migrating to the destination node would degrade locality based on
NUMA hinting faults then the move is resisted.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c303ba6..1a4af96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4069,24 +4069,65 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
-/* Returns true if the destination node has incurred more faults */
-static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+
+static bool migrate_locality_prepare(struct task_struct *p, struct lb_env *env,
+			int *src_nid, int *dst_nid,
+			unsigned long *src_faults, unsigned long *dst_faults)
 {
-	int src_nid, dst_nid;
+	int priv;
 
 	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
 		return false;
 
-	src_nid = cpu_to_node(env->src_cpu);
-	dst_nid = cpu_to_node(env->dst_cpu);
+	*src_nid = cpu_to_node(env->src_cpu);
+	*dst_nid = cpu_to_node(env->dst_cpu);
 
-	if (src_nid == dst_nid ||
+	if (*src_nid == *dst_nid ||
 	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
 		return false;
 
+	/* Calculate private/shared faults on the two nodes */
+	*src_faults = 0;
+	*dst_faults = 0;
+	for (priv = 0; priv < 2; priv++) {
+		*src_faults += p->numa_faults[task_faults_idx(*src_nid, priv)];
+		*dst_faults += p->numa_faults[task_faults_idx(*dst_nid, priv)];
+	}
+
+	return true;
+}
+
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+	unsigned long src, dst;
+
+	if (!migrate_locality_prepare(p, env, &src_nid, &dst_nid, &src, &dst))
+		return false;
+
+	/* Move towards node if it is the preferred node */
 	if (p->numa_preferred_nid == dst_nid)
 		return true;
 
+	/* Move towards node if there were more NUMA hinting faults recorded */
+	if (dst > src)
+		return true;
+
+	return false;
+}
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+	unsigned long src, dst;
+
+	if (!migrate_locality_prepare(p, env, &src_nid, &dst_nid, &src, &dst))
+		return false;
+
+	if (src > dst)
+		return true;
+
 	return false;
 }
 #else
@@ -4095,6 +4136,14 @@ static inline bool migrate_improves_locality(struct task_struct *p,
 {
 	return false;
 }
+
+
+static inline bool migrate_degrades_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
+
 #endif
 
 /*
@@ -4150,6 +4199,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) too many balance attempts have failed.
 	 */
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+	if (!tsk_cache_hot)
+		tsk_cache_hot = migrate_degrades_locality(p, env);
 
 	if (migrate_improves_locality(p, env)) {
 #ifdef CONFIG_SCHEDSTATS
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 06/15] sched: Reschedule task on preferred NUMA node once selected
  2013-07-05 23:08 ` [PATCH 06/15] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
@ 2013-07-06 10:38   ` Peter Zijlstra
  2013-07-08  8:34     ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-07-06 10:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Sat, Jul 06, 2013 at 12:08:53AM +0100, Mel Gorman wrote:
> +static int
> +find_idlest_cpu_node(int this_cpu, int nid)
> +{
> +	unsigned long load, min_load = ULONG_MAX;
> +	int i, idlest_cpu = this_cpu;
> +
> +	BUG_ON(cpu_to_node(this_cpu) == nid);
> +
> +	rcu_read_lock();
> +	for_each_cpu(i, cpumask_of_node(nid)) {
> +		load = weighted_cpuload(i);
> +
> +		if (load < min_load) {
> +			/*
> +			 * Kernel threads can be preempted. For others, do
> +			 * not preempt if running on their preferred node
> +			 * or pinned.
> +			 */
> +			struct task_struct *p = cpu_rq(i)->curr;
> +			if ((p->flags & PF_KTHREAD) ||
> +			    (p->numa_preferred_nid != nid && p->nr_cpus_allowed > 1)) {
> +				min_load = load;
> +				idlest_cpu = i;
> +			}

So I really don't get this stuff.. if it is indeed the idlest cpu, preempting
others shouldn't matter. Also, migrating a task there doesn't actually mean it
will get preempted either.

In overloaded scenarios it is expected that multiple tasks will run on the
same cpu. So this condition will also explicitly make overloaded scenarios
work less well.

> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	return idlest_cpu;
> +}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 13/15] sched: Set preferred NUMA node based on number of private faults
  2013-07-05 23:09 ` [PATCH 13/15] sched: Set preferred NUMA node based on number of private faults Mel Gorman
@ 2013-07-06 10:41   ` Peter Zijlstra
  2013-07-08  9:23     ` Mel Gorman
  2013-07-06 10:44   ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-07-06 10:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Sat, Jul 06, 2013 at 12:09:00AM +0100, Mel Gorman wrote:
> +++ b/include/linux/mm.h
> @@ -582,11 +582,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   * sets it, so none of the operations on it need to be atomic.
>   */
>  
> -/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
> +/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
>  #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
>  #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
>  #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
> -#define LAST_NID_PGOFF		(ZONES_PGOFF - LAST_NID_WIDTH)
> +#define LAST_NIDPID_PGOFF	(ZONES_PGOFF - LAST_NIDPID_WIDTH)

I saw the same with Ingo's patch doing a similar thing. But why do we fuse
these two into a single field? Would it not make more sense to have them be
separate fields?

Yes I get we update and read them together, and we could still do that with
appropriate helper functions, but they are two independent values stored in the
page flags.

It's not something I care too much about, just something that strikes me as weird.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 13/15] sched: Set preferred NUMA node based on number of private faults
  2013-07-05 23:09 ` [PATCH 13/15] sched: Set preferred NUMA node based on number of private faults Mel Gorman
  2013-07-06 10:41   ` Peter Zijlstra
@ 2013-07-06 10:44   ` Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-07-06 10:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Sat, Jul 06, 2013 at 12:09:00AM +0100, Mel Gorman wrote:
> The third reason is that multiple threads in a process will race each
> other to fault the shared page making the fault information unreliable.

Ingo and I played around with that particular issue for a while and we had a
patch that worked fairly well for cpu-bound threads and made sure the
task_numa_work() thing indeed interleaved between the threads and wasn't done
by the same thread every time.

I don't know what the current code does or whether that is indeed still an issue.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 14/15] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-05 23:09 ` [PATCH 14/15] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
@ 2013-07-06 10:46   ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-07-06 10:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Sat, Jul 06, 2013 at 12:09:01AM +0100, Mel Gorman wrote:
> +/* Returns true if the given node is compute overloaded */
> +static bool sched_numa_overloaded(int nid)
> +{
> +	int nr_cpus = 0;
> +	int nr_preferred = 0;
> +	int i;
> +
> +	for_each_cpu(i, cpumask_of_node(nid)) {
> +		nr_cpus++;
> +		nr_preferred += cpu_rq(i)->nr_preferred_running;
> +	}
> +
> +	return nr_preferred >= nr_cpus << 1;
> +}
> +
>  static void task_numa_placement(struct task_struct *p)
>  {
>  	int seq, nid, max_nid = 0;
> @@ -908,7 +935,7 @@ static void task_numa_placement(struct task_struct *p)
>  
>  		/* Find maximum private faults */
>  		faults = p->numa_faults[task_faults_idx(nid, 1)];
> -		if (faults > max_faults) {
> +		if (faults > max_faults && !sched_numa_overloaded(nid)) {
>  			max_faults = faults;
>  			max_nid = nid;
>  		}

This again very explicitly breaks for overloaded scenarios.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 06/15] sched: Reschedule task on preferred NUMA node once selected
  2013-07-06 10:38   ` Peter Zijlstra
@ 2013-07-08  8:34     ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-08  8:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Sat, Jul 06, 2013 at 12:38:13PM +0200, Peter Zijlstra wrote:
> On Sat, Jul 06, 2013 at 12:08:53AM +0100, Mel Gorman wrote:
> > +static int
> > +find_idlest_cpu_node(int this_cpu, int nid)
> > +{
> > +	unsigned long load, min_load = ULONG_MAX;
> > +	int i, idlest_cpu = this_cpu;
> > +
> > +	BUG_ON(cpu_to_node(this_cpu) == nid);
> > +
> > +	rcu_read_lock();
> > +	for_each_cpu(i, cpumask_of_node(nid)) {
> > +		load = weighted_cpuload(i);
> > +
> > +		if (load < min_load) {
> > +			/*
> > +			 * Kernel threads can be preempted. For others, do
> > +			 * not preempt if running on their preferred node
> > +			 * or pinned.
> > +			 */
> > +			struct task_struct *p = cpu_rq(i)->curr;
> > +			if ((p->flags & PF_KTHREAD) ||
> > +			    (p->numa_preferred_nid != nid && p->nr_cpus_allowed > 1)) {
> > +				min_load = load;
> > +				idlest_cpu = i;
> > +			}
> 
> So I really don't get this stuff.. if it is indeed the idlest cpu, preempting
> others shouldn't matter. Also, migrating a task there doesn't actually mean it
> will get preempted either.
> 

At one point this was part of a patch that swapped tasks on the target
node, where it really was preempting the running task as the comment
describes. Swapping was premature because it did not evaluate whether the
swap would improve performance overall. You're right, this check should
be removed entirely and it will be in the next update.
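
For illustration, a rough sketch of how find_idlest_cpu_node() might look
with that check dropped; this is inferred from the reply above rather than
taken from the actual follow-up patch:

static int
find_idlest_cpu_node(int this_cpu, int nid)
{
	unsigned long load, min_load = ULONG_MAX;
	int i, idlest_cpu = this_cpu;

	BUG_ON(cpu_to_node(this_cpu) == nid);

	rcu_read_lock();
	for_each_cpu(i, cpumask_of_node(nid)) {
		/* Simply track the least loaded CPU on the target node */
		load = weighted_cpuload(i);
		if (load < min_load) {
			min_load = load;
			idlest_cpu = i;
		}
	}
	rcu_read_unlock();

	return idlest_cpu;
}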

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 13/15] sched: Set preferred NUMA node based on number of private faults
  2013-07-06 10:41   ` Peter Zijlstra
@ 2013-07-08  9:23     ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2013-07-08  9:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Sat, Jul 06, 2013 at 12:41:07PM +0200, Peter Zijlstra wrote:
> On Sat, Jul 06, 2013 at 12:09:00AM +0100, Mel Gorman wrote:
> > +++ b/include/linux/mm.h
> > @@ -582,11 +582,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> >   * sets it, so none of the operations on it need to be atomic.
> >   */
> >  
> > -/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
> > +/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
> >  #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
> >  #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
> >  #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
> > -#define LAST_NID_PGOFF		(ZONES_PGOFF - LAST_NID_WIDTH)
> > +#define LAST_NIDPID_PGOFF	(ZONES_PGOFF - LAST_NIDPID_WIDTH)
> 
> I saw the same with Ingo's patch doing a similar thing. But why do we fuse
> these two into a single field? Would it not make more sense to have them be
> separate fields?
> 
> Yes I get we update and read them together, and we could still do that with
> appropriate helper functions, but they are two independent values stored in the
> page flags.
> 

There were two reasons. First, we update and read them together. Second,
it is all or nothing whether this field is included in page->flags or
not. I know this could also be done with helpers and other tricks but I
did not think it would be any easier to understand.
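
To make that concrete, a minimal sketch of what the fused encoding could
look like. The helper names mirror nid_pid_to_nidpid(), nidpid_to_nid()
and nidpid_to_pid() used in the series, but the bit width here is an
assumption for illustration only, and the -1 check in mpol_misplaced()
suggests the real helpers also reserve a "no pid recorded" value, which
this sketch omits:

/* Illustrative only: pack the node id and the low bits of the pid */
#define EX_PID_SHIFT	8
#define EX_PID_MASK	((1 << EX_PID_SHIFT) - 1)

static inline int ex_nid_pid_to_nidpid(int nid, int pid)
{
	return (nid << EX_PID_SHIFT) | (pid & EX_PID_MASK);
}

static inline int ex_nidpid_to_nid(int nidpid)
{
	return nidpid >> EX_PID_SHIFT;
}

static inline int ex_nidpid_to_pid(int nidpid)
{
	return nidpid & EX_PID_MASK;
}

Keeping both values in one field means the single cmpxchg loop in
page_nidpid_xchg_last() updates them together, which is the "update and
read them together" point.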

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2013-07-08  9:23 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-05 23:08 [PATCH 0/15] Basic scheduler support for automatic NUMA balancing V3 Mel Gorman
2013-07-05 23:08 ` [PATCH 01/15] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
2013-07-05 23:08 ` [PATCH 02/15] sched: Track NUMA hinting faults on per-node basis Mel Gorman
2013-07-05 23:08 ` [PATCH 03/15] sched: Select a preferred node with the most numa hinting faults Mel Gorman
2013-07-05 23:08 ` [PATCH 04/15] sched: Update NUMA hinting faults once per scan Mel Gorman
2013-07-05 23:08 ` [PATCH 05/15] sched: Favour moving tasks towards the preferred node Mel Gorman
2013-07-05 23:08 ` [PATCH 06/15] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
2013-07-06 10:38   ` Peter Zijlstra
2013-07-08  8:34     ` Mel Gorman
2013-07-05 23:08 ` [PATCH 07/15] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults Mel Gorman
2013-07-05 23:08 ` [PATCH 08/15] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
2013-07-05 23:08 ` [PATCH 09/15] sched: Check current->mm before allocating NUMA faults Mel Gorman
2013-07-05 23:08 ` [PATCH 10/15] sched: Set the scan rate proportional to the size of the task being scanned Mel Gorman
2013-07-05 23:08 ` [PATCH 11/15] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
2013-07-05 23:08 ` [PATCH 12/15] sched: Remove check that skips small VMAs Mel Gorman
2013-07-05 23:09 ` [PATCH 13/15] sched: Set preferred NUMA node based on number of private faults Mel Gorman
2013-07-06 10:41   ` Peter Zijlstra
2013-07-08  9:23     ` Mel Gorman
2013-07-06 10:44   ` Peter Zijlstra
2013-07-05 23:09 ` [PATCH 14/15] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
2013-07-06 10:46   ` Peter Zijlstra
2013-07-05 23:09 ` [PATCH 15/15] sched: Favour moving tasks towards nodes that incurred more faults Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).