linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2
@ 2013-07-03 14:21 Mel Gorman
  2013-07-03 14:21 ` [PATCH 01/13] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
                   ` (14 more replies)
  0 siblings, 15 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This builds a bit further on the V1 series. The performance still needs to
be pinned down but it brings in a few more essential basics. Note that Peter
has posted another patch related to avoiding overloading compute nodes but I
have not had the chance to examine it yet. I'll be doing that after this
is posted as I decided not to postpone releasing this series any further;
I'm already two days overdue with an update.

Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
  task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
  preferred node
o Laughably basic accounting of a compute overloaded node when selecting
  the preferred node.
o Applied review comments

This series integrates basic scheduler support for automatic NUMA balancing.
It borrows very heavily from Peter Zijlstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).

This is still far from complete and there are known performance gaps
between this series and manual binding (when that is possible). As before,
the intention is not to complete the work but to incrementally improve
mainline and preserve bisectability for any bug reports that crop up. In
some cases performance may unfortunately be worse and when that happens
it will have to be judged whether the system overhead is lower and, if so,
whether it is still an acceptable direction as a stepping stone to something better.

Patch 1 adds sysctl documentation

Patch 2 tracks NUMA hinting faults per-task and per-node

Patches 3-5 select a preferred node at the end of a PTE scan based on which
	node incurred the highest number of NUMA faults. When the balancer
	is comparing two CPUs it will prefer to locate tasks on their
	preferred node (a simplified sketch of this idea follows this list).

Patch 6 reschedules a task when a preferred node is selected if it is not
	running on that node already. This avoids waiting for the scheduler
	to move the task slowly.

Patch 7 splits the accounting of faults between those that passed the
	two-stage filter and those that did not. Task placement favours
	the filtered faults initially although ultimately this will need
	more smarts when node-local faults do not dominate.

Patch 8 replaces the PTE scanning reset hammer and instead increases the
	scanning rate when an otherwise settled task changes its
	preferred node.

Patch 9 favours moving tasks towards nodes where more faults were incurred
	even if it is not the preferred node

Patch 10 sets the scan rate proportional to the size of the task being scanned.

Patch 11 avoids some unnecessary allocation

Patch 12 kicks away some training wheels and scans shared pages

Patch 13 accounts for how many "preferred placed" tasks are running on a node
	 and attempts to avoid overloading them
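
As a point of reference, the core idea behind patches 2-5 can be modelled
in a few lines of userspace C. This is only an illustrative sketch, not
code from the series; the node count, fault samples and helper names below
are invented, while the kernel implementation operates on p->numa_faults
and nr_node_ids in kernel/sched/fair.c.

/* toy_preferred_node.c: standalone model, not kernel code */
#include <stdio.h>

#define NR_NODES 4			/* invented node count */

static unsigned long numa_faults[NR_NODES];

/* Record a hinting fault; faults that required a migration count double */
static void record_fault(int node, int pages, int migrated)
{
        numa_faults[node] += pages << migrated;
}

/* Return the node with the most accumulated faults, decaying old samples */
static int preferred_node(void)
{
        unsigned long max_faults = 0;
        int nid, max_nid = 0;

        for (nid = 0; nid < NR_NODES; nid++) {
                unsigned long faults = numa_faults[nid];

                numa_faults[nid] >>= 1;		/* decay for the next window */
                if (faults > max_faults) {
                        max_faults = faults;
                        max_nid = nid;
                }
        }
        return max_nid;
}

int main(void)
{
        record_fault(1, 10, 0);	/* ten faults on node 1, no migration */
        record_fault(2, 4, 1);	/* four faults on node 2, pages migrated */
        printf("preferred node: %d\n", preferred_node());	/* prints 1 */
        return 0;
}

The real series further splits this into a per-scan buffer (patch 4) and
private/shared fault classes (patch 7).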

Testing on this is only partial as full tests take a long time to run.

I tested 5 kernels using 3.9.0 as a baseline

o 3.9.0-vanilla		vanilla kernel with automatic numa balancing enabled
o 3.9.0-morefaults	Patches 1-9
o 3.9.0-scalescan	Patches 1-10
o 3.9.0-scanshared	Patches 1-12
o 3.9.0-accountpreferred Patches 1-13

autonumabench
                                          3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0
                                        vanilla       morefaults            scalescan            scanshared      accountpreferred      
User    NUMA01               52623.86 (  0.00%)    53408.85 ( -1.49%)    52042.73 (  1.10%)    60404.32 (-14.79%)    57403.32 ( -9.08%)
User    NUMA01_THEADLOCAL    17595.48 (  0.00%)    17902.64 ( -1.75%)    18070.07 ( -2.70%)    18937.22 ( -7.63%)    17675.13 ( -0.45%)
User    NUMA02                2043.84 (  0.00%)     2029.40 (  0.71%)     2183.84 ( -6.85%)     2173.80 ( -6.36%)     2259.45 (-10.55%)
User    NUMA02_SMT            1057.11 (  0.00%)      999.71 (  5.43%)     1045.10 (  1.14%)     1046.01 (  1.05%)     1048.58 (  0.81%)
System  NUMA01                 414.17 (  0.00%)      328.68 ( 20.64%)      326.08 ( 21.27%)      155.69 ( 62.41%)      144.53 ( 65.10%)
System  NUMA01_THEADLOCAL      105.17 (  0.00%)       93.22 ( 11.36%)       97.63 (  7.17%)       95.46 (  9.23%)      102.47 (  2.57%)
System  NUMA02                   9.36 (  0.00%)        9.39 ( -0.32%)        9.25 (  1.18%)        8.42 ( 10.04%)       10.46 (-11.75%)
System  NUMA02_SMT               3.54 (  0.00%)        3.32 (  6.21%)        4.27 (-20.62%)        3.41 (  3.67%)        3.72 ( -5.08%)
Elapsed NUMA01                1201.52 (  0.00%)     1238.04 ( -3.04%)     1220.85 ( -1.61%)     1385.58 (-15.32%)     1335.06 (-11.11%)
Elapsed NUMA01_THEADLOCAL      393.91 (  0.00%)      410.64 ( -4.25%)      414.33 ( -5.18%)      434.54 (-10.31%)      406.84 ( -3.28%)
Elapsed NUMA02                  50.30 (  0.00%)       50.30 (  0.00%)       54.49 ( -8.33%)       52.14 ( -3.66%)       56.81 (-12.94%)
Elapsed NUMA02_SMT              58.48 (  0.00%)       52.91 (  9.52%)       58.71 ( -0.39%)       53.12 (  9.17%)       60.82 ( -4.00%)
CPU     NUMA01                4414.00 (  0.00%)     4340.00 (  1.68%)     4289.00 (  2.83%)     4370.00 (  1.00%)     4310.00 (  2.36%)
CPU     NUMA01_THEADLOCAL     4493.00 (  0.00%)     4382.00 (  2.47%)     4384.00 (  2.43%)     4379.00 (  2.54%)     4369.00 (  2.76%)
CPU     NUMA02                4081.00 (  0.00%)     4052.00 (  0.71%)     4024.00 (  1.40%)     4184.00 ( -2.52%)     3995.00 (  2.11%)
CPU     NUMA02_SMT            1813.00 (  0.00%)     1895.00 ( -4.52%)     1787.00 (  1.43%)     1975.00 ( -8.94%)     1730.00 (  4.58%)

               3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
             vanilla  morefaults   scalescan  scanshared  accountpreferred
User        73328.02    74347.84    73349.21    82568.87    78393.27
System        532.89      435.24      437.90      263.61      261.79
Elapsed      1714.18     1763.03     1759.02     1936.17     1869.51

numa01 suffers a bit here but numa01 is also an adverse workload on this
machine. The result is poor but I'm not concentrating on it right now.

Patches 1-9 alone (morefaults) perform ok. numa02 is flat and numa02_smt
sees a small performance gain. I do not have variance data to establish
whether this is significant. After that, altering the scanning had a
large impact and I'll re-examine whether the default scan rate is simply
too slow.

It's worth noting the impact on system CPU time. Overall it is much
reduced and I think we need to keep pushing to keep the overhead as low
as possible, particularly as memory sizes grow in the future.

                                 3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
                               vanilla  morefaults   scalescan  scanshared  accountpreferred
THP fault alloc                  14325       14293       14103       14259       14081
THP collapse alloc                   6           3           1          10           5
THP splits                           4           5           5           3           2
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success           9020528     5227450     5355703     5597558     5637844
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                   9363        5426        5559        5810        5852
NUMA PTE updates             119292401    79765854    73441393    76125744    75857594
NUMA hint faults                755901      384660      206195      214063      193969
NUMA hint local faults          595478      292221      120436      113812      109472
NUMA pages migrated            9020528     5227450     5355703     5597558     5637844
AutoNUMA cost                     4785        2580        1646        1709        1607

The primary take-away point is the reduction in NUMA hinting faults and
PTE updates.  Patches 1-9 incur about half the number of faults for
comparable overall performance.

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system. Only a limited number of clients are executed
to save on time. The full set is queued.

specjbb
                        3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0
                      vanilla       morefaults            scalescan            scanshared      accountpreferred      
TPut 1      26099.00 (  0.00%)     24848.00 ( -4.79%)     23990.00 ( -8.08%)     24350.00 ( -6.70%)     24248.00 ( -7.09%)
TPut 7     187276.00 (  0.00%)    189731.00 (  1.31%)    189065.00 (  0.96%)    188680.00 (  0.75%)    189774.00 (  1.33%)
TPut 13    318028.00 (  0.00%)    337374.00 (  6.08%)    339016.00 (  6.60%)    329143.00 (  3.49%)    338743.00 (  6.51%)
TPut 19    368547.00 (  0.00%)    429440.00 ( 16.52%)    423973.00 ( 15.04%)    403563.00 (  9.50%)    430941.00 ( 16.93%)
TPut 25    377522.00 (  0.00%)    497621.00 ( 31.81%)    488108.00 ( 29.29%)    437813.00 ( 15.97%)    485013.00 ( 28.47%)
TPut 31    347642.00 (  0.00%)    487253.00 ( 40.16%)    466366.00 ( 34.15%)    386972.00 ( 11.31%)    437104.00 ( 25.73%)
TPut 37    313439.00 (  0.00%)    478601.00 ( 52.69%)    443415.00 ( 41.47%)    379081.00 ( 20.94%)    425452.00 ( 35.74%)
TPut 43    291958.00 (  0.00%)    458614.00 ( 57.08%)    398195.00 ( 36.39%)    349661.00 ( 19.76%)    393102.00 ( 34.64%)

Patches 1-9 again perform extremely well here. Reducing the scan rate
had an impact, as did scanning shared pages, which may indicate that the
shared/private identification is insufficient. Reducing the scan rate
might be the dominant factor as the tests are very short lived -- 30
seconds each, which is just 10 PTE scan windows. Basic accounting of
compute load helped again and overall the series was competitive.

specjbb Peaks
                         3.9.0                      3.9.0        3.9.0                3.9.0                      3.9.0
                       vanilla            morefaults          scalescan          scanshared           accountpreferred      
 Expctd Warehouse     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)     48.00 (  0.00%)
 Actual Warehouse     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)     26.00 (  0.00%)
 Actual Peak Bops 377522.00 (  0.00%) 497621.00 ( 31.81%) 488108.00 ( 29.29%) 437813.00 ( 15.97%) 485013.00 ( 28.47%)

At least peak bops always improved.

               3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
             vanilla  morefaults   scalescan  scanshared  accountpreferred
User         5184.53     5190.44     5195.01     5173.33     5185.88
System         59.61       58.91       61.47       73.84       64.81
Elapsed       254.52      254.17      254.12      254.80      254.55

Interestingly system CPU times were mixed. Scan shared incurred fewer
faults, migrated fewer pages and updated fewer PTEs so the time is being
lost elsewhere.

                                 3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
                               vanilla morefaults    scalescan  scanshared accountpreferred      
THP fault alloc                  33297       33251       34306       35144       33898
THP collapse alloc                   9           8          14           8          16
THP splits                           3           4           5           4           4
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success           1773768     1716596     2075251     1815999     1858598
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                   1841        1781        2154        1885        1929
NUMA PTE updates              17461135    17268534    18766637    17602518    17406295
NUMA hint faults                 85873      170092       86195       80027       84052
NUMA hint local faults           27145      116070       30651       28293       29919
NUMA pages migrated            1773768     1716596     2075251     1815999     1858598
AutoNUMA cost                      585        1003         601         557         577

Not much of note there, other than that patches 1-9 had a very high number
of hinting faults and it's not immediately obvious why.

I also ran SpecJBB with THP enabled and one JVM running per NUMA node
in the system. Similar to the other test, only a limited number of
clients are executed to save time.

specjbb
                          3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0
                        vanilla       morefaults-v2r2       scalescan-v2r12      scanshared-v2r12   accountpreferred-v2r12
Mean   1      30331.00 (  0.00%)     31076.75 (  2.46%)     30813.50 (  1.59%)     30612.50 (  0.93%)     30411.00 (  0.26%)
Mean   7     150487.75 (  0.00%)    153060.00 (  1.71%)    155117.50 (  3.08%)    152602.25 (  1.41%)    151431.25 (  0.63%)
Mean   13    130513.00 (  0.00%)    135521.25 (  3.84%)    136205.50 (  4.36%)    135635.25 (  3.92%)    130575.50 (  0.05%)
Mean   19    123404.75 (  0.00%)    131505.75 (  6.56%)    126020.75 (  2.12%)    127171.25 (  3.05%)    119632.75 ( -3.06%)
Mean   25    116276.00 (  0.00%)    120041.75 (  3.24%)    117053.25 (  0.67%)    121249.75 (  4.28%)    112591.75 ( -3.17%)
Mean   31    108080.00 (  0.00%)    113237.00 (  4.77%)    113738.00 (  5.24%)    114078.00 (  5.55%)    106955.00 ( -1.04%)
Mean   37    102704.00 (  0.00%)    107246.75 (  4.42%)    113435.75 ( 10.45%)    111945.50 (  9.00%)    106184.75 (  3.39%)
Mean   43     98132.00 (  0.00%)    105014.75 (  7.01%)    109398.75 ( 11.48%)    106662.75 (  8.69%)    103322.75 (  5.29%)
Stddev 1        792.83 (  0.00%)      1127.16 (-42.17%)      1321.59 (-66.69%)      1356.36 (-71.08%)       715.51 (  9.75%)
Stddev 7       4080.34 (  0.00%)       526.84 ( 87.09%)      3153.16 ( 22.72%)      3781.85 (  7.32%)      2863.35 ( 29.83%)
Stddev 13      6614.16 (  0.00%)      2086.04 ( 68.46%)      4139.26 ( 37.42%)      2486.95 ( 62.40%)      4066.48 ( 38.52%)
Stddev 19      2835.73 (  0.00%)      1928.86 ( 31.98%)      4097.14 (-44.48%)       591.59 ( 79.14%)      3182.51 (-12.23%)
Stddev 25      3608.71 (  0.00%)      3198.96 ( 11.35%)      5391.60 (-49.41%)      1606.37 ( 55.49%)      3326.21 (  7.83%)
Stddev 31      2778.25 (  0.00%)       784.02 ( 71.78%)      6802.53 (-144.85%)      1738.20 ( 37.44%)      1126.27 ( 59.46%)
Stddev 37      4069.13 (  0.00%)      5009.93 (-23.12%)      5022.13 (-23.42%)      4191.94 ( -3.02%)      1031.05 ( 74.66%)
Stddev 43      9215.73 (  0.00%)      5589.12 ( 39.35%)      8915.80 (  3.25%)      8042.72 ( 12.73%)      3113.04 ( 66.22%)
TPut   1     121324.00 (  0.00%)    124307.00 (  2.46%)    123254.00 (  1.59%)    122450.00 (  0.93%)    121644.00 (  0.26%)
TPut   7     601951.00 (  0.00%)    612240.00 (  1.71%)    620470.00 (  3.08%)    610409.00 (  1.41%)    605725.00 (  0.63%)
TPut   13    522052.00 (  0.00%)    542085.00 (  3.84%)    544822.00 (  4.36%)    542541.00 (  3.92%)    522302.00 (  0.05%)
TPut   19    493619.00 (  0.00%)    526023.00 (  6.56%)    504083.00 (  2.12%)    508685.00 (  3.05%)    478531.00 ( -3.06%)
TPut   25    465104.00 (  0.00%)    480167.00 (  3.24%)    468213.00 (  0.67%)    484999.00 (  4.28%)    450367.00 ( -3.17%)
TPut   31    432320.00 (  0.00%)    452948.00 (  4.77%)    454952.00 (  5.24%)    456312.00 (  5.55%)    427820.00 ( -1.04%)
TPut   37    410816.00 (  0.00%)    428987.00 (  4.42%)    453743.00 ( 10.45%)    447782.00 (  9.00%)    424739.00 (  3.39%)
TPut   43    392528.00 (  0.00%)    420059.00 (  7.01%)    437595.00 ( 11.48%)    426651.00 (  8.69%)    413291.00 (  5.29%)

These are the mean throughput figures between JVMs and the standard
deviation. Note that with the patches applied there is a lot less
deviation between JVMs in many cases. As the number of clients increases
the performance improves. This is still far short of the theoretical best
performance but it's a step in the right direction.

specjbb Peaks
                         3.9.0                      3.9.0               3.9.0              3.9.0                3.9.0
                       vanilla            morefaults-v2r2     scalescan-v2r12    scanshared-v2r12 accountpreferred-v2r12
 Expctd Warehouse     12.00 (  0.00%)     12.00 (  0.00%)     12.00 (  0.00%)     12.00 (  0.00%)     12.00 (  0.00%)
 Actual Warehouse      8.00 (  0.00%)      8.00 (  0.00%)      8.00 (  0.00%)      8.00 (  0.00%)      8.00 (  0.00%)
 Actual Peak Bops 601951.00 (  0.00%) 612240.00 (  1.71%) 620470.00 (  3.08%) 610409.00 (  1.41%) 605725.00 (  0.63%)

Peaks are only marginally improved even though many of the individual
throughput figures look ok.

               3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
             vanilla  morefaults-v2r2  scalescan-v2r12  scanshared-v2r12  accountpreferred-v2r12
User        78020.94    77250.13    78334.69    78027.78    77752.27
System        305.94      261.17      228.74      234.28      240.04
Elapsed      1744.52     1717.79     1744.31     1742.62     1730.79

And the performance is improved with a healthy reduction in system CPU time.

                                 3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
                               vanilla  morefaults-v2r2  scalescan-v2r12  scanshared-v2r12  accountpreferred-v2r12
THP fault alloc                  65433       64779       64234       65547       63519
THP collapse alloc                  51          54          58          55          55
THP splits                          55          49          46          51          56
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success          20348847    15323475    11375529    11777597    12110444
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                  21122       15905       11807       12225       12570
NUMA PTE updates             180124094   145320534   109608785   108346894   107390100
NUMA hint faults               2358728     1623277     1489903     1472556     1378097
NUMA hint local faults          835051      603375      585183      516949      425342
NUMA pages migrated           20348847    15323475    11375529    11777597    12110444
AutoNUMA cost                    13441        9424        8432        8344        7872

Far fewer PTE updates and faults.

The performance is still a mixed bag. Patches 1-9 are generally
good. Conceptually I think the other patches make sense but need a bit
more love. The last patch in particular will be replaced with more of
Peter's work.

 Documentation/sysctl/kernel.txt |  68 +++++++++
 include/linux/migrate.h         |   7 +-
 include/linux/mm_types.h        |   3 -
 include/linux/sched.h           |  21 ++-
 include/linux/sched/sysctl.h    |   1 -
 kernel/sched/core.c             |  33 +++-
 kernel/sched/fair.c             | 330 ++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h            |  16 ++
 kernel/sysctl.c                 |  14 +-
 mm/huge_memory.c                |   7 +-
 mm/memory.c                     |  13 +-
 mm/migrate.c                    |  17 +--
 mm/mprotect.c                   |   4 +-
 13 files changed, 462 insertions(+), 72 deletions(-)

-- 
1.8.1.4


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 01/13] mm: numa: Document automatic NUMA balancing sysctls
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 02/13] sched: Track NUMA hinting faults on per-node basis Mel Gorman
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccd4258..0fe678c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,72 @@ utilize.
 
 ==============================================================
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running.  Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases.  The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases.  The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a tasks scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
 osrelease, ostype & version:
 
 # cat osrelease
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 02/13] sched: Track NUMA hinting faults on per-node basis
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
  2013-07-03 14:21 ` [PATCH 01/13] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 03/13] sched: Select a preferred node with the most numa hinting faults Mel Gorman
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch tracks which nodes NUMA hinting faults were incurred on.
Greater weight is given if the pages had to be migrated, on the
understanding that such faults cost significantly more. If a task has
paid the cost of migrating data to a node then in future it would be
preferable if the task did not migrate the data away again unnecessarily.
This information is later used to schedule a task on the node incurring
the most NUMA hinting faults.
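
To illustrate just the weighting (with invented numbers, not anything
measured), a fault that required a migration is recorded with double the
weight of one that did not:

#include <stdio.h>

int main(void)
{
        int pages = 512;	/* invented sample size */

        /* same form as the hunk below: migrated faults count double */
        printf("not migrated: %d\n", pages << 0);	/* 512 */
        printf("migrated:     %d\n", pages << 1);	/* 1024 */
        return 0;
}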

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 12 +++++++++++-
 kernel/sched/sched.h  | 11 +++++++++++
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..72861b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1505,6 +1505,8 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+
+	unsigned long *numa_faults;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..f332ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
+	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
@@ -1853,6 +1854,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		task_numa_free(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..904fd6f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (!sched_feat_numa(NUMA))
 		return;
 
-	/* FIXME: Allocate task-specific structure for placement policy here */
+	/* Allocate buffer to track faults on a per-node basis */
+	if (unlikely(!p->numa_faults)) {
+		int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		if (!p->numa_faults)
+			return;
+	}
 
 	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
@@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
 			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
+
+	/* Record the fault, double the weight if pages were migrated */
+	p->numa_faults[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..c5f773d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -503,6 +503,17 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline void task_numa_free(struct task_struct *p)
+{
+	kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 03/13] sched: Select a preferred node with the most numa hinting faults
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
  2013-07-03 14:21 ` [PATCH 01/13] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
  2013-07-03 14:21 ` [PATCH 02/13] sched: Track NUMA hinting faults on per-node basis Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 04/13] sched: Update NUMA hinting faults once per scan Mel Gorman
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 10 ++++++++++
 kernel/sched/fair.c   | 16 ++++++++++++++--
 kernel/sched/sched.h  |  1 +
 4 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 72861b4..ba46a64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,7 @@ struct task_struct {
 	struct callback_head numa_work;
 
 	unsigned long *numa_faults;
+	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f332ec0..019baae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -5713,6 +5714,15 @@ enum s_alloc {
 
 struct sched_domain_topology_level;
 
+#ifdef CONFIG_NUMA_BALANCING
+
+/* Set a tasks preferred NUMA node */
+void sched_setnuma(struct task_struct *p, int nid)
+{
+	p->numa_preferred_nid = nid;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 904fd6f..f8c3f61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -793,7 +793,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq;
+	int seq, nid, max_nid = 0;
+	unsigned long max_faults = 0;
 
 	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
 		return;
@@ -802,7 +803,18 @@ static void task_numa_placement(struct task_struct *p)
 		return;
 	p->numa_scan_seq = seq;
 
-	/* FIXME: Scheduling placement policy hints go here */
+	/* Find the node with the highest number of faults */
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		unsigned long faults = p->numa_faults[nid];
+		p->numa_faults[nid] >>= 1;
+		if (faults > max_faults) {
+			max_faults = faults;
+			max_nid = nid;
+		}
+	}
+
+	if (max_faults && max_nid != p->numa_preferred_nid)
+		sched_setnuma(p, max_nid);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5f773d..65a0cf0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,6 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int nid);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 04/13] sched: Update NUMA hinting faults once per scan
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (2 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 03/13] sched: Select a preferred node with the most numa hinting faults Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 05/13] sched: Favour moving tasks towards the preferred node Mel Gorman
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

NUMA hinting fault counts and placement decisions are both recorded in the
same array which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay, creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artificially high
count due to very recent faulting activity and may create a bouncing effect.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.
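
As an illustrative userspace sketch only (the sample values are invented),
the decay-and-copy step at the end of a scan window behaves as follows;
the kernel equivalent is the loop added to task_numa_placement() below:

#include <stdio.h>

#define NR_NODES 2			/* invented node count */

static unsigned long numa_faults[NR_NODES];		/* decaying history */
static unsigned long numa_faults_buffer[NR_NODES];	/* current window */

/* At the end of a scan window: decay the history and fold the buffer in */
static void scan_complete(void)
{
        int nid;

        for (nid = 0; nid < NR_NODES; nid++) {
                numa_faults[nid] >>= 1;
                numa_faults[nid] += numa_faults_buffer[nid];
                numa_faults_buffer[nid] = 0;
        }
}

int main(void)
{
        numa_faults_buffer[0] = 100;
        scan_complete();	/* history is now 100, 0 */
        numa_faults_buffer[1] = 40;
        scan_complete();	/* history is now 50, 40: old samples decay */
        printf("node0=%lu node1=%lu\n", numa_faults[0], numa_faults[1]);
        return 0;
}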

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h | 13 +++++++++++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 16 +++++++++++++---
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba46a64..42f9818 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1506,7 +1506,20 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	/*
+	 * Exponential decaying average of faults on a per-node basis.
+	 * Scheduling placement decisions are made based on the these counts.
+	 * The values remain static for the duration of a PTE scan
+	 */
 	unsigned long *numa_faults;
+
+	/*
+	 * numa_faults_buffer records faults per node during the current
+	 * scan window. When the scan completes, the counts in numa_faults
+	 * decay and these values are copied.
+	 */
+	unsigned long *numa_faults_buffer;
+
 	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 019baae..b00b81a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1596,6 +1596,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
+	p->numa_faults_buffer = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8c3f61..5893399 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -805,8 +805,14 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
-		unsigned long faults = p->numa_faults[nid];
+		unsigned long faults;
+
+		/* Decay existing window and copy faults since last scan */
 		p->numa_faults[nid] >>= 1;
+		p->numa_faults[nid] += p->numa_faults_buffer[nid];
+		p->numa_faults_buffer[nid] = 0;
+
+		faults = p->numa_faults[nid];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -831,9 +837,13 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * nr_node_ids;
 
-		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		/* numa_faults and numa_faults_buffer share the allocation */
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
+
+		BUG_ON(p->numa_faults_buffer);
+		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
 	}
 
 	/*
@@ -847,7 +857,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults[node] += pages << migrated;
+	p->numa_faults_buffer[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 05/13] sched: Favour moving tasks towards the preferred node
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (3 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 04/13] sched: Update NUMA hinting faults once per scan Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch favours moving tasks towards the preferred NUMA node when it
has just been selected. Ideally this is self-reinforcing as the longer
the task runs on that node, the more faults it should incur, causing
task_numa_placement to keep the task running on that node. In reality a
big weakness is that the node's CPUs can be overloaded and it would be more
efficient to queue tasks on an idle node and migrate to the new node. This
would require additional smarts in the balancer so for now the balancer
will simply prefer to place the task on the preferred node for a number of
PTE scans which is controlled by the numa_balancing_settle_count sysctl.
Once the settle_count number of scans has completed the scheduler is free
to place the task on an alternative node if the load is imbalanced.
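
A simplified standalone model of that bias is sketched below. The settle
count and node numbers are invented; the in-kernel check added to
can_migrate_task() in the hunk below additionally requires SD_NUMA and a
valid p->numa_faults array.

#include <stdbool.h>
#include <stdio.h>

#define SETTLE_COUNT 3	/* stand-in for sysctl_numa_balancing_settle_count */

/*
 * Toy model: prefer pulling a task to dst_nid only while it is still
 * "settling" on a newly selected preferred node.
 */
static bool migrate_improves_locality(int src_nid, int dst_nid,
                                      int preferred_nid, int migrate_seq)
{
        if (src_nid == dst_nid || migrate_seq >= SETTLE_COUNT)
                return false;
        return preferred_nid == dst_nid;
}

int main(void)
{
        /* Freshly selected preferred node 1, task currently on node 0 */
        printf("%d\n", migrate_improves_locality(0, 1, 1, 0)); /* 1: pull it */
        printf("%d\n", migrate_improves_locality(0, 1, 1, 5)); /* 0: settled */
        return 0;
}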

[srikar@linux.vnet.ibm.com: Fixed statistics]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt |  8 +++++-
 include/linux/sched.h           |  1 +
 kernel/sched/core.c             |  4 ++-
 kernel/sched/fair.c             | 56 ++++++++++++++++++++++++++++++++++++++---
 kernel/sysctl.c                 |  7 ++++++
 5 files changed, 71 insertions(+), 5 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 0fe678c..246b128 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,7 +374,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
 
 ==============================================================
 
@@ -418,6 +419,11 @@ scanned for a given scan.
 numa_balancing_scan_period_reset is a blunt instrument that controls how
 often a tasks scan delay is reset to detect sudden changes in task behaviour.
 
+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42f9818..82a6136 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b00b81a..ba9470e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1591,7 +1591,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_migrate_seq = 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
@@ -5721,6 +5721,7 @@ struct sched_domain_topology_level;
 void sched_setnuma(struct task_struct *p, int nid)
 {
 	p->numa_preferred_nid = nid;
+	p->numa_migrate_seq = 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
@@ -6150,6 +6151,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5893399..2a0bbc2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -791,6 +791,15 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -802,6 +811,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_migrate_seq++;
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
@@ -3897,6 +3907,35 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (p->numa_preferred_nid == dst_nid)
+		return true;
+
+	return false;
+}
+#else
+static inline bool migrate_improves_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -3945,11 +3984,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) task is cache cold, or
-	 * 2) too many balance attempts have failed.
+	 * 1) destination numa is preferred
+	 * 2) task is cache cold, or
+	 * 3) too many balance attempts have failed.
 	 */
-
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+
+	if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+		if (tsk_cache_hot) {
+			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+			schedstat_inc(p, se.statistics.nr_forced_migrations);
+		}
+#endif
+		return 1;
+	}
+
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..263486f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -393,6 +393,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_settle_count",
+		.data           = &sysctl_numa_balancing_settle_count,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (4 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 05/13] sched: Favour moving tasks towards the preferred node Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-04 12:26   ` Srikar Dronamraju
  2013-07-03 14:21 ` [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter Mel Gorman
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

A preferred node is selected based on the node the most NUMA hinting
faults were incurred on. There is no guarantee that the task is running
on that node at the time so this patch reschedules the task to run on
the most idle CPU of the selected node when the preferred node is
selected. This avoids waiting for the balancer to make a decision.
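
A toy model of the CPU selection follows, with invented load figures;
the kernel version (find_idlest_cpu_node() in the hunk below) uses
weighted_cpuload() instead and also skips CPUs whose current task already
prefers the target node.

#include <limits.h>
#include <stdio.h>

#define NR_CPUS_ON_NODE 4	/* invented CPU count on the target node */

/* Return the index of the least loaded CPU among nr_cpus candidates */
static int find_idlest_cpu(const unsigned long load[], int nr_cpus)
{
        unsigned long min_load = ULONG_MAX;
        int i, idlest = 0;

        for (i = 0; i < nr_cpus; i++) {
                if (load[i] < min_load) {
                        min_load = load[i];
                        idlest = i;
                }
        }
        return idlest;
}

int main(void)
{
        unsigned long load[NR_CPUS_ON_NODE] = { 2048, 1024, 3072, 512 };

        printf("idlest cpu: %d\n", find_idlest_cpu(load, NR_CPUS_ON_NODE));
        return 0;
}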

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  | 18 ++++++++++++++++--
 kernel/sched/fair.c  | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  2 +-
 3 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ba9470e..b4722d6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5717,11 +5717,25 @@ struct sched_domain_topology_level;
 
 #ifdef CONFIG_NUMA_BALANCING
 
-/* Set a tasks preferred NUMA node */
-void sched_setnuma(struct task_struct *p, int nid)
+/* Set a tasks preferred NUMA node and reschedule to it */
+void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
 {
+	int curr_cpu = task_cpu(p);
+	struct migration_arg arg = { p, idlest_cpu };
+
 	p->numa_preferred_nid = nid;
 	p->numa_migrate_seq = 0;
+
+	/* Do not reschedule if already running on the target CPU */
+	if (idlest_cpu == curr_cpu)
+		return;
+
+	/* Ensure the target CPU is eligible */
+	if (!cpumask_test_cpu(idlest_cpu, tsk_cpus_allowed(p)))
+		return;
+
+	/* Move current running task to idlest CPU on preferred node */
+	stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2a0bbc2..b9139be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -800,6 +800,37 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static unsigned long weighted_cpuload(const int cpu);
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int i, idlest_cpu = this_cpu;
+
+	BUG_ON(cpu_to_node(this_cpu) == nid);
+
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		load = weighted_cpuload(i);
+
+		if (load < min_load) {
+			struct task_struct *p;
+
+			/* Do not preempt a task running on its preferred node */
+			struct rq *rq = cpu_rq(i);
+			raw_spin_lock_irq(&rq->lock);
+			p = rq->curr;
+			if (p->numa_preferred_nid != nid) {
+				min_load = load;
+				idlest_cpu = i;
+			}
+			raw_spin_unlock_irq(&rq->lock);
+		}
+	}
+
+	return idlest_cpu;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -829,8 +860,27 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	if (max_faults && max_nid != p->numa_preferred_nid)
-		sched_setnuma(p, max_nid);
+	/*
+	 * Record the preferred node as the node with the most faults,
+	 * requeue the task to be running on the idlest CPU on the
+	 * preferred node and reset the scanning rate to recheck
+	 * the working set placement.
+	 */
+	if (max_faults && max_nid != p->numa_preferred_nid) {
+		int preferred_cpu;
+
+		/*
+		 * If the task is not on the preferred node then find the most
+		 * idle CPU to migrate to.
+		 */
+		preferred_cpu = task_cpu(p);
+		if (cpu_to_node(preferred_cpu) != max_nid) {
+			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+							     max_nid);
+		}
+
+		sched_setnuma(p, max_nid, preferred_cpu);
+	}
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 65a0cf0..64c37a3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,7 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void sched_setnuma(struct task_struct *p, int nid);
+extern void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (5 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 21:56   ` Johannes Weiner
  2013-07-03 14:21 ` [PATCH 08/13] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared. This would require
that the last task that accessed a page for a hinting fault would be
recorded which would increase the size of struct page. Instead this patch
approximates private pages by assuming that faults that pass the two-stage
filter are private pages and all others are shared. The preferred NUMA
node is then selected based on where the maximum number of approximately
private faults were measured. Shared faults are not taken into
consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and the tasks
will be scheduled away later, only to bounce back again. Alternatively
the shared tasks would just bounce around nodes because the fault
information is effectively noise. Either way, accounting for shared faults
the same as private faults may result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and its node
would be selected as the preferred node even though it is the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.
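
A standalone sketch with invented values may help illustrate the layout:
the per-task fault array doubles in size and is indexed as 2 * nid + priv,
where a fault is approximated as private when the faulting CPU's node
matches the node that last touched the page.

#include <stdio.h>

#define NR_NODES 4	/* invented node count */

/* Faults are stored as numa_faults[2 * nid + priv] */
static unsigned long numa_faults[2 * NR_NODES];

static int task_faults_idx(int nid, int priv)
{
        return 2 * nid + priv;
}

/* this_nid: node of the faulting CPU, last_nid: node that last touched */
static void record_fault(int last_nid, int this_nid, int node, int pages,
                         int migrated)
{
        int priv = (this_nid == last_nid);

        numa_faults[task_faults_idx(node, priv)] += pages << migrated;
}

int main(void)
{
        record_fault(1, 1, 1, 4, 0);	/* repeat access: counted as private */
        record_fault(0, 1, 1, 4, 0);	/* last touched elsewhere: shared */
        printf("node1 private=%lu shared=%lu\n",
               numa_faults[task_faults_idx(1, 1)],
               numa_faults[task_faults_idx(1, 0)]);
        return 0;
}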

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  4 ++--
 kernel/sched/fair.c   | 30 +++++++++++++++++++++---------
 mm/huge_memory.c      |  7 ++++---
 mm/memory.c           |  9 ++++++---
 4 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 82a6136..a41edea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1600,10 +1600,10 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages, bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9139be..a66f2bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -831,6 +831,11 @@ find_idlest_cpu_node(int this_cpu, int nid)
 	return idlest_cpu;
 }
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -847,13 +852,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
+
+			/* Decay existing window and copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
 
-		faults = p->numa_faults[nid];
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -886,16 +897,17 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv = (cpu_to_node(task_cpu(p)) == last_nid);
 
 	if (!sched_feat_numa(NUMA))
 		return;
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
 		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
@@ -903,7 +915,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -917,7 +929,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults_buffer[node] += pages << migrated;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..7cd7114 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int target_nid;
+	int target_nid, last_nid;
 	int current_nid = -1;
 	bool migrated;
 
@@ -1307,6 +1307,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
+	last_nid = page_nid_last(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		put_page(page);
@@ -1332,7 +1333,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!migrated)
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
 	return 0;
 
 check_same:
@@ -1347,7 +1348,7 @@ clear_pmdnuma:
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+		task_numa_fault(last_nid, current_nid, HPAGE_PMD_NR, false);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index ba94dec..c28bf52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int current_nid = -1, last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3566,6 +3566,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
+	last_nid = page_nid_last(page);
 	current_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3586,7 +3587,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+		task_numa_fault(last_nid, current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3602,6 +3603,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3654,6 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * migrated to.
 		 */
 		curr_nid = local_nid;
+		last_nid = page_nid_last(page);
 		target_nid = numa_migrate_prep(page, vma, addr,
 					       page_to_nid(page));
 		if (target_nid == -1) {
@@ -3666,7 +3669,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		migrated = migrate_misplaced_page(page, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		task_numa_fault(last_nid, curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 08/13] sched: Increase NUMA PTE scanning when a new preferred node is selected
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (6 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults Mel Gorman
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

The NUMA PTE scan is reset every sysctl_numa_balancing_scan_period_reset
milliseconds in case of phase changes. This is crude and the resets are
clearly visible in graphs even when the workload is already balanced. This
patch instead increases the scan rate if the preferred node is updated and
the task is currently running on that node, so that the placement decision
is rechecked quickly. In the optimistic expectation that the placement
decisions will be correct, the maximum period between scans is also
increased to reduce the overhead of automatic NUMA balancing.
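
As a rough illustration of the new behaviour, using only the default values
from this patch (scan_period_min = 100ms, scan_period_max = 100*600 =
60000ms):

	settled task with numa_scan_period backed off to 60000ms
	preferred node changes              -> period halved to 30000ms
	changes again while still settled   -> 15000ms and so on, but
	                                       never below the 100ms minimum
	task not yet settled                -> period left untouched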

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 +++--------
 include/linux/mm_types.h        |  3 ---
 include/linux/sched/sysctl.h    |  1 -
 kernel/sched/core.c             |  1 -
 kernel/sched/fair.c             | 27 ++++++++++++---------------
 kernel/sysctl.c                 |  7 -------
 6 files changed, 15 insertions(+), 35 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 246b128..a275042 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -373,15 +373,13 @@ guarantee. If the target workload is already bound to NUMA nodes then this
 feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
-numa_balancing_settle_count sysctls.
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
 
 ==============================================================
 
 numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_size_mb
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
 
 Automatic NUMA balancing scans tasks address space and unmaps pages to
 detect if pages are properly placed or if the data should be migrated to a
@@ -416,9 +414,6 @@ effectively controls the minimum scanning rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
 
-numa_balancing_scan_period_reset is a blunt instrument that controls how
-often a tasks scan delay is reset to detect sudden changes in task behaviour.
-
 numa_balancing_settle_count is how many scan periods must complete before
 the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..de70964 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -421,9 +421,6 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
-	/* numa_next_reset is when the PTE scanner period will be reset */
-	unsigned long numa_next_reset;
-
 	/* Restart point for scanning and setting pte_numa */
 	unsigned long numa_scan_offset;
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..10d16c4f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -47,7 +47,6 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
-extern unsigned int sysctl_numa_balancing_scan_period_reset;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_settle_count;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b4722d6..2d1fd93 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1585,7 +1585,6 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_next_scan = jiffies;
-		p->mm->numa_next_reset = jiffies;
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a66f2bb..e8d9b3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -782,8 +782,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * numa task sample period in ms
  */
 unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
-unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -879,6 +878,7 @@ static void task_numa_placement(struct task_struct *p)
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		int preferred_cpu;
+		int old_migrate_seq = p->numa_migrate_seq;
 
 		/*
 		 * If the task is not on the preferred node then find the most
@@ -891,6 +891,16 @@ static void task_numa_placement(struct task_struct *p)
 		}
 
 		sched_setnuma(p, max_nid, preferred_cpu);
+
+		/*
+		 * If the preferred node changes frequently then the scan rate
+		 * will be continually high. Mitigate this by increasing the
+		 * scan rate only if the task was settled.
+		 */
+		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
+			p->numa_scan_period = max(p->numa_scan_period >> 1,
+					sysctl_numa_balancing_scan_period_min);
+		}
 	}
 }
 
@@ -984,19 +994,6 @@ void task_numa_work(struct callback_head *work)
 	}
 
 	/*
-	 * Reset the scan period if enough time has gone by. Objective is that
-	 * scanning will be reduced if pages are properly placed. As tasks
-	 * can enter different phases this needs to be re-examined. Lacking
-	 * proper tracking of reference behaviour, this blunt hammer is used.
-	 */
-	migrate = mm->numa_next_reset;
-	if (time_after(now, migrate)) {
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
-		next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
-		xchg(&mm->numa_next_reset, next_scan);
-	}
-
-	/*
 	 * Enforce maximal scan/migration frequency..
 	 */
 	migrate = mm->numa_next_scan;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 263486f..1fcbc68 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -373,13 +373,6 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname	= "numa_balancing_scan_period_reset",
-		.data		= &sysctl_numa_balancing_scan_period_reset,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
-	{
 		.procname	= "numa_balancing_scan_period_max_ms",
 		.data		= &sysctl_numa_balancing_scan_period_max,
 		.maxlen		= sizeof(unsigned int),
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (7 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 08/13] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 18:27   ` Peter Zijlstra
  2013-07-03 14:21 ` [PATCH 10/13] sched: Set the scan rate proportional to the size of the task being scanned Mel Gorman
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

The scheduler already favours moving tasks towards their preferred node but
does nothing special if the destination node is anything else. This patch
favours moving tasks towards a destination node if more NUMA hinting faults
were recorded on it. Similarly, if migrating to a destination node would
degrade locality based on NUMA hinting faults then the migration will be
resisted.
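
As a concrete example of the comparison being added (the fault counts are
made up for illustration), suppose a not-yet-settled task has recorded
private hinting faults of 120 on node0 and 40 on node1:

	moving it from a node1 CPU to a node0 CPU improves locality
	(120 > 40) and is favoured even if node0 is not the preferred node

	moving it the other way degrades locality and is treated the same
	as migrating a cache-hot task in can_migrate_task()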

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8d9b3e..e451859 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3967,22 +3967,54 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+
+static bool migrate_locality_prepare(struct task_struct *p, struct lb_env *env,
+				int *src_nid, int *dst_nid)
+{
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	*src_nid = cpu_to_node(env->src_cpu);
+	*dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (*src_nid == *dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	return true;
+}
+
 /* Returns true if the destination node has incurred more faults */
 static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 {
 	int src_nid, dst_nid;
 
-	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+	if (!migrate_locality_prepare(p, env, &src_nid, &dst_nid))
 		return false;
 
-	src_nid = cpu_to_node(env->src_cpu);
-	dst_nid = cpu_to_node(env->dst_cpu);
+	/* Move towards node if it is the preferred node */
+	if (p->numa_preferred_nid == dst_nid)
+		return true;
 
-	if (src_nid == dst_nid ||
-	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+	/*
+	 * Move towards node if there were a higher number of private
+	 * NUMA hinting faults recorded on it
+	 */
+	if (p->numa_faults[task_faults_idx(dst_nid, 1)] >
+	    p->numa_faults[task_faults_idx(src_nid, 1)])
+		return true;
+
+	return false;
+}
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!migrate_locality_prepare(p, env, &src_nid, &dst_nid))
 		return false;
 
-	if (p->numa_preferred_nid == dst_nid)
+	if (p->numa_faults[src_nid] > p->numa_faults[dst_nid])
 		return true;
 
 	return false;
@@ -3993,6 +4025,14 @@ static inline bool migrate_improves_locality(struct task_struct *p,
 {
 	return false;
 }
+
+
+static inline bool migrate_degrades_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
+
 #endif
 
 /*
@@ -4048,6 +4088,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) too many balance attempts have failed.
 	 */
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+	if (!tsk_cache_hot)
+		tsk_cache_hot = migrate_degrades_locality(p, env);
 
 	if (migrate_improves_locality(p, env)) {
 #ifdef CONFIG_SCHEDSTATS
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 10/13] sched: Set the scan rate proportional to the size of the task being scanned
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (8 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 14:21 ` [PATCH 11/13] sched: Check current->mm before allocating NUMA faults Mel Gorman
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size. This scan rate is independent of the size
of the task and as an aside it is further complicated by the fact that
numa_balancing_scan_size controls how many pages are marked pte_numa and
not how much virtual memory is scanned.

In combination, it is almost impossible to meaningfully tune the min and
max scan periods and reasoning about performance is complex when the time
to complete a full scan is partially a function of the task's memory
size. This patch alters the semantics of the min and max tunables to be
about tuning the length of time it takes to complete a scan of a task's
virtual address space. Conceptually this is a lot easier to understand.
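
A worked example of the new semantics, using only the default values in
this patch (a full pass in 3000ms at the fastest and 300000ms at the
slowest, scanning 256MB per window):

	task with 2GB of address space  -> 2048MB / 256MB  = 8 scan windows
	minimum period per 256MB window -> 3000ms / 8      = 375ms
	maximum period per 256MB window -> 300000ms / 8    = 37500ms

A larger task is therefore scanned in more, and more frequent, windows while
the time to cover its whole address space stays roughly constant.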

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 ++++----
 kernel/sched/fair.c             | 56 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a275042..f38d4f4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -401,15 +401,16 @@ workload pattern changes and minimises performance impact due to remote
 memory accesses. These sysctls control the thresholds for scan delays and
 the number of pages scanned.
 
-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
 
 numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
 when it initially forks.
 
-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
 
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e451859..336074f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -779,10 +779,12 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the tasks virtual memory size and
+ * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_min = 3000;
+unsigned int sysctl_numa_balancing_scan_period_max = 300000;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -790,6 +792,34 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+	unsigned long nr_vm_pages = 0;
+	unsigned long nr_scan_pages;
+
+	nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+	nr_vm_pages = p->mm->total_vm;
+	if (!nr_vm_pages)
+		nr_vm_pages = nr_scan_pages;
+
+	nr_vm_pages = round_up(nr_vm_pages, nr_scan_pages);
+	return nr_vm_pages / nr_scan_pages;
+}
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+	unsigned int period;
+
+	/* For scanning sanity sake, never scan faster than 100ms */
+	period = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+	return max_t(unsigned int, 100, period);
+}
+
+static unsigned int task_scan_max(struct task_struct *p)
+{
+	return sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+}
+
 /*
  * Once a preferred node is selected the scheduler balancer will prefer moving
  * a task to that node for sysctl_numa_balancing_settle_count number of PTE
@@ -899,7 +929,7 @@ static void task_numa_placement(struct task_struct *p)
 		 */
 		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
 			p->numa_scan_period = max(p->numa_scan_period >> 1,
-					sysctl_numa_balancing_scan_period_min);
+					task_scan_min(p));
 		}
 	}
 }
@@ -933,7 +963,7 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	 * This is reset periodically in case of phase changes
 	 */
         if (!migrated)
-		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
+		p->numa_scan_period = min(task_scan_max(p),
 			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
@@ -959,6 +989,7 @@ void task_numa_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
+	unsigned long nr_pte_updates = 0;
 	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -1001,7 +1032,7 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	if (p->numa_scan_period == 0)
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+		p->numa_scan_period = task_scan_min(p);
 
 	next_scan = now + msecs_to_jiffies(p->numa_scan_period);
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
@@ -1040,10 +1071,17 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			pages -= change_prot_numa(vma, start, end);
+			nr_pte_updates += change_prot_numa(vma, start, end);
+			pages -= (end - start) >> PAGE_SHIFT;
 
 			start = end;
-			if (pages <= 0)
+
+			/*
+			 * Scan sysctl_numa_balancing_scan_size but ensure that
+			 * least one PTE is updated so that unused virtual
+			 * address space is quickly skipped
+			 */
+			if (pages <= 0 && nr_pte_updates)
 				goto out;
 		} while (end != vma->vm_end);
 	}
@@ -1087,7 +1125,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 
 	if (now - curr->node_stamp > period) {
 		if (!curr->node_stamp)
-			curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+			curr->numa_scan_period = task_scan_min(curr);
 		curr->node_stamp = now;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 11/13] sched: Check current->mm before allocating NUMA faults
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (9 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 10/13] sched: Set the scan rate proportional to the size of the task being scanned Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 15:33   ` Mel Gorman
  2013-07-04 12:48   ` Srikar Dronamraju
  2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
                   ` (3 subsequent siblings)
  14 siblings, 2 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

task_numa_placement checks current->mm but only after the buffers for
tracking faults have already been uselessly allocated. Move the check
earlier.

[peterz@infradead.org: Identified the problem]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 336074f..3c796b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -870,8 +870,6 @@ static void task_numa_placement(struct task_struct *p)
 	int seq, nid, max_nid = 0;
 	unsigned long max_faults = 0;
 
-	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
-		return;
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
 		return;
@@ -945,6 +943,12 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!sched_feat_numa(NUMA))
 		return;
 
+	/* for example, ksmd faulting in a user's mm */
+	if (!p->mm) {
+		p->numa_scan_period = sysctl_numa_balancing_scan_period_max;
+		return;
+	}
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
@@ -1072,16 +1076,18 @@ void task_numa_work(struct callback_head *work)
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
 			nr_pte_updates += change_prot_numa(vma, start, end);
-			pages -= (end - start) >> PAGE_SHIFT;
-
-			start = end;
 
 			/*
 			 * Scan sysctl_numa_balancing_scan_size but ensure that
-			 * least one PTE is updated so that unused virtual
-			 * address space is quickly skipped
+			 * at least one PTE is updated so that unused virtual
+			 * address space is quickly skipped.
 			 */
-			if (pages <= 0 && nr_pte_updates)
+			if (nr_pte_updates)
+				pages -= (end - start) >> PAGE_SHIFT;
+
+			start = end;
+
+			if (pages <= 0)
 				goto out;
 		} while (end != vma->vm_end);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (10 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 11/13] sched: Check current->mm before allocating NUMA faults Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 18:35   ` Peter Zijlstra
                     ` (2 more replies)
  2013-07-03 14:21 ` [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
                   ` (2 subsequent siblings)
  14 siblings, 3 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Initial support for automatic NUMA balancing was unable to distinguish
between false shared versus private pages except by ignoring pages with an
elevated page_mapcount entirely. This patch kicks away the training wheels
as initial support for identifying shared/private pages is now in place.
Note that the patch still leaves shared, file-backed in VM_EXEC vmas in
place guessing that these are shared library pages. Migrating them are
likely to be of major benefit as generally the expectation would be that
these are read-shared between caches and that iTLB and iCache pressure is
generally low.
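
For clarity, this is a sketch of the resulting decision table for the base
page case in migrate_misplaced_page below:

	page_mapcount == 1                            -> may migrate
	page_mapcount > 1, anonymous                  -> may migrate (new)
	page_mapcount > 1, file-backed, not VM_EXEC   -> may migrate (new)
	page_mapcount > 1, file-backed, VM_EXEC vma   -> skipped as a probable
	                                                 shared library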

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |  7 ++++---
 mm/memory.c             |  4 ++--
 mm/migrate.c            | 17 ++++++-----------
 mm/mprotect.c           |  4 +---
 4 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..e7e26af 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -92,11 +92,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+				  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+					 struct vm_area_struct *vma, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index c28bf52..b06022a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3581,7 +3581,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, target_nid);
+	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated)
 		current_nid = target_nid;
 
@@ -3666,7 +3666,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		/* Migrate to the requested node */
 		pte_unmap_unlock(pte, ptl);
-		migrated = migrate_misplaced_page(page, target_nid);
+		migrated = migrate_misplaced_page(page, vma, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
 		task_numa_fault(last_nid, curr_nid, 1, migrated);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3bbaf5d..23f8122 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1579,7 +1579,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+			   int node)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
@@ -1587,10 +1588,11 @@ int migrate_misplaced_page(struct page *page, int node)
 	LIST_HEAD(migratepages);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
+	 * Don't migrate file pages that are mapped in multiple processes
+	 * with execute permissions as they are probably shared libraries.
 	 */
-	if (page_mapcount(page) != 1)
+	if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+	    (vma->vm_flags & VM_EXEC))
 		goto out;
 
 	/*
@@ -1641,13 +1643,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int page_lru = page_is_file_cache(page);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
-	 */
-	if (page_mapcount(page) != 1)
-		goto out_dropref;
-
-	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..cacc64a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					if (last_nid != this_nid)
 						all_same_node = false;
 
-					/* only check non-shared pages */
-					if (!pte_numa(oldpte) &&
-					    page_mapcount(page) == 1) {
+					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
 					}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (11 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
@ 2013-07-03 14:21 ` Mel Gorman
  2013-07-03 18:32   ` Peter Zijlstra
  2013-07-03 16:19 ` [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
  2013-07-04 18:02 ` [PATCH RFC WIP] Process weights based scheduling for better consolidation Srikar Dronamraju
  14 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

It is preferred that tasks always run local to their memory but it is
not optimal if that node is compute overloaded and the task is failing
to get access to a CPU. In that case NUMA balancing moving the task back
would compete with the load balancer trying to move tasks off the node.

Ultimately it will be necessary to calculate the compute load of each
node and minimise that, as well as minimising the number of remote
accesses, until the optimal balance point is reached. Begin this process
by simply accounting for the number of tasks that are running on their
preferred node. When deciding what node to place a task on, do not place
it on a node that already has more preferred-placement tasks running
than it has CPUs.
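
As implemented in the fair.c hunk below, the check is actually against twice
the CPU count rather than the CPU count itself. Illustrative numbers:

	node with 8 CPUs, 15 tasks preferring it running there  -> considered
	node with 8 CPUs, 16+ such tasks (>= nr_cpus << 1)      -> skipped when
	                                                           selecting a
	                                                           preferred node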

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c  | 45 ++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h |  4 ++++
 2 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c796b0..9ffdff3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -777,6 +777,18 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_preferred_running +=
+			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_preferred_running -=
+			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
@@ -865,6 +877,21 @@ static inline int task_faults_idx(int nid, int priv)
 	return 2 * nid + priv;
 }
 
+/* Returns true if the given node is compute overloaded */
+static bool sched_numa_overloaded(int nid)
+{
+	int nr_cpus = 0;
+	int nr_preferred = 0;
+	int i;
+
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		nr_cpus++;
+		nr_preferred += cpu_rq(i)->nr_preferred_running;
+	}
+
+	return nr_preferred >= nr_cpus << 1;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -892,7 +919,7 @@ static void task_numa_placement(struct task_struct *p)
 
 		/* Find maximum private faults */
 		faults = p->numa_faults[task_faults_idx(nid, 1)];
-		if (faults > max_faults) {
+		if (faults > max_faults && !sched_numa_overloaded(nid)) {
 			max_faults = faults;
 			max_nid = nid;
 		}
@@ -1144,6 +1171,14 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1153,8 +1188,10 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_enqueue(rq_of(cfs_rq), task_of(se));
 		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	}
 #endif
 	cfs_rq->nr_running++;
 }
@@ -1165,8 +1202,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
+	}
 	cfs_rq->nr_running--;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 64c37a3..f05b31b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -433,6 +433,10 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long nr_preferred_running;
+#endif
+
 	u64 rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 11/13] sched: Check current->mm before allocating NUMA faults
  2013-07-03 14:21 ` [PATCH 11/13] sched: Check current->mm before allocating NUMA faults Mel Gorman
@ 2013-07-03 15:33   ` Mel Gorman
  2013-07-04 12:48   ` Srikar Dronamraju
  1 sibling, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 15:33 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:38PM +0100, Mel Gorman wrote:
> @@ -1072,16 +1076,18 @@ void task_numa_work(struct callback_head *work)
>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>  			end = min(end, vma->vm_end);
>  			nr_pte_updates += change_prot_numa(vma, start, end);
> -			pages -= (end - start) >> PAGE_SHIFT;
> -
> -			start = end;
>  
>  			/*
>  			 * Scan sysctl_numa_balancing_scan_size but ensure that
> -			 * least one PTE is updated so that unused virtual
> -			 * address space is quickly skipped
> +			 * at least one PTE is updated so that unused virtual
> +			 * address space is quickly skipped.
>  			 */
> -			if (pages <= 0 && nr_pte_updates)
> +			if (nr_pte_updates)
> +				pages -= (end - start) >> PAGE_SHIFT;
> +
> +			start = end;
> +
> +			if (pages <= 0)
>  				goto out;
>  		} while (end != vma->vm_end);

This hunk is a rebasing error that should have been in the previous
patch. Fixed now.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (12 preceding siblings ...)
  2013-07-03 14:21 ` [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
@ 2013-07-03 16:19 ` Mel Gorman
  2013-07-03 16:26   ` Mel Gorman
  2013-07-04 18:02 ` [PATCH RFC WIP] Process weights based scheduling for better consolidation Srikar Dronamraju
  14 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 16:19 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:27PM +0100, Mel Gorman wrote:
> o 3.9.0-vanilla		vanilla kernel with automatic numa balancing enabled
> o 3.9.0-morefaults	Patches 1-9
> o 3.9.0-scalescan	Patches 1-10
> o 3.9.0-scanshared	Patches 1-12
> o 3.9.0-accountpreferred Patches 1-13
> 

I screwed up the testing as 3.9.0-morefaults is not patches 1-9 at all and
I only noticed when examining an anomaly. It's a unreleased series that
I screwed up the patch generation for. The conclusions about patches 1-9
are invalid. I'll redo the testing.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2
  2013-07-03 16:19 ` [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
@ 2013-07-03 16:26   ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-03 16:26 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 05:19:15PM +0100, Mel Gorman wrote:
> On Wed, Jul 03, 2013 at 03:21:27PM +0100, Mel Gorman wrote:
> > o 3.9.0-vanilla		vanilla kernel with automatic numa balancing enabled
> > o 3.9.0-morefaults	Patches 1-9
> > o 3.9.0-scalescan	Patches 1-10
> > o 3.9.0-scanshared	Patches 1-12
> > o 3.9.0-accountpreferred Patches 1-13
> > 
> 
> I screwed up the testing as 3.9.0-morefaults is not patches 1-9 at all and
> I only noticed when examining an anomaly. It's a unreleased series that
> I screwed up the patch generation for. The conclusions about patches 1-9
> are invalid. I'll redo the testing.
> 

Wow..... I'm a double idiot. 3.9.0-morefaults really is patches 1-9 that
was released. The conclusions are ok. I was looking at the unreleased v3
of the series just there and the anomaly is in there.  Long day, giving
up before I manage to stick my foot in it for a third time.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults
  2013-07-03 14:21 ` [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults Mel Gorman
@ 2013-07-03 18:27   ` Peter Zijlstra
  2013-07-04  9:25     ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-03 18:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:36PM +0100, Mel Gorman wrote:
>  static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
>  {

> +	if (p->numa_faults[task_faults_idx(dst_nid, 1)] >
> +	    p->numa_faults[task_faults_idx(src_nid, 1)])
> +		return true;

> +}

> +static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
> +{

> +	if (p->numa_faults[src_nid] > p->numa_faults[dst_nid])
>  		return true;

I bet you wanted to use task_faults_idx() there too ;-)



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-03 14:21 ` [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
@ 2013-07-03 18:32   ` Peter Zijlstra
  2013-07-04  9:37     ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-03 18:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:40PM +0100, Mel Gorman wrote:
> ---
>  kernel/sched/fair.c  | 45 ++++++++++++++++++++++++++++++++++++++++++---
>  kernel/sched/sched.h |  4 ++++
>  2 files changed, 46 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3c796b0..9ffdff3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -777,6 +777,18 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   * Scheduling class queueing methods:
>   */
>  
> +static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
> +{
> +	rq->nr_preferred_running +=
> +			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
> +}
> +
> +static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
> +{
> +	rq->nr_preferred_running -=
> +			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
> +}

Ah doing this requires you dequeue before changing ->numa_preferred_nid. I
don't remember seeing that change in this series.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount
  2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
@ 2013-07-03 18:35   ` Peter Zijlstra
  2013-07-04  9:27     ` Mel Gorman
  2013-07-03 18:41   ` Peter Zijlstra
  2013-07-03 18:42   ` Peter Zijlstra
  2 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-03 18:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:39PM +0100, Mel Gorman wrote:
> Initial support for automatic NUMA balancing was unable to distinguish
> between false shared versus private pages except by ignoring pages with an
> elevated page_mapcount entirely. This patch kicks away the training wheels
> as initial support for identifying shared/private pages is now in place.
> Note that the patch still leaves shared, file-backed in VM_EXEC vmas in
> place guessing that these are shared library pages. Migrating them are
> likely to be of major benefit as generally the expectation would be that
> these are read-shared between caches and that iTLB and iCache pressure is
> generally low.

This reminds me; there a clause in task_numa_work() that skips 'small' VMAs. I
don't see the point of that.

In fact; when using things like electric fence this might mean skipping most
memory.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount
  2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
  2013-07-03 18:35   ` Peter Zijlstra
@ 2013-07-03 18:41   ` Peter Zijlstra
  2013-07-04  9:32     ` Mel Gorman
  2013-07-03 18:42   ` Peter Zijlstra
  2 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-03 18:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:39PM +0100, Mel Gorman wrote:

> Note that the patch still leaves shared, file-backed in VM_EXEC vmas in
> place guessing that these are shared library pages. Migrating them are
> likely to be of major benefit as generally the expectation would be that
> these are read-shared between caches and that iTLB and iCache pressure is
> generally low.

I'm failing to grasp.. we don't migrate them because migrating them would
likely be beneficial?

Missing a negative somewhere?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount
  2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
  2013-07-03 18:35   ` Peter Zijlstra
  2013-07-03 18:41   ` Peter Zijlstra
@ 2013-07-03 18:42   ` Peter Zijlstra
  2 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-03 18:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:39PM +0100, Mel Gorman wrote:
> @@ -1587,10 +1588,11 @@ int migrate_misplaced_page(struct page *page, int node)
>  	LIST_HEAD(migratepages);
>  
>  	/*
> +	 * Don't migrate file pages that are mapped in multiple processes
> +	 * with execute permissions as they are probably shared libraries.
>  	 */
> +	if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
> +	    (vma->vm_flags & VM_EXEC))
>  		goto out;

So we will migrate DSOs that are mapped but once. That's fair enough I suppose.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-03 14:21 ` [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter Mel Gorman
@ 2013-07-03 21:56   ` Johannes Weiner
  2013-07-04  9:23     ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2013-07-03 21:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Linux-MM, LKML

On Wed, Jul 03, 2013 at 03:21:34PM +0100, Mel Gorman wrote:
> Ideally it would be possible to distinguish between NUMA hinting faults
> that are private to a task and those that are shared. This would require
> that the last task that accessed a page for a hinting fault would be
> recorded which would increase the size of struct page. Instead this patch
> approximates private pages by assuming that faults that pass the two-stage
> filter are private pages and all others are shared. The preferred NUMA
> node is then selected based on where the maximum number of approximately
> private faults were measured. Shared faults are not taken into
> consideration for a few reasons.

Ingo had a patch that would just encode a few bits of the PID along
with the last_nid (last_cpu in his case) member of struct page.  No
extra space required and should be accurate enough.

Otherwise this is blind to sharedness within the node the task is
currently running on, right?

> First, if there are many tasks sharing the page then they'll all move
> towards the same node. The node will be compute overloaded and then
> scheduled away later only to bounce back again. Alternatively the shared
> tasks would just bounce around nodes because the fault information is
> effectively noise. Either way accounting for shared faults the same as
> private faults may result in lower performance overall.

When the node with many shared pages is compute overloaded then there
is arguably not an optimal node for the tasks and moving them off is
inevitable.  However, the node with the most page accesses, private or
shared, is still the preferred node from a memory stand point.
Compute load being equal, the task should go to the node with 2GB of
shared memory and not to the one with 2 private pages.

If the load balancer moves the task off due to cpu load reasons,
wouldn't the settle count mechanism prevent it from bouncing back?

Likewise, if the cpu load situation changes, the balancer could move
the task back to its truly preferred node.

> The second reason is based on a hypothetical workload that has a small
> number of very important, heavily accessed private pages but a large shared
> array. The shared array would dominate the number of faults and be selected
> as a preferred node even though it's the wrong decision.

That's a scan granularity problem and I can't see how you solve it
with ignoring the shared pages.  What if the situation is opposite
with a small, heavily used shared set and many rarely used private
pages?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-03 21:56   ` Johannes Weiner
@ 2013-07-04  9:23     ` Mel Gorman
  2013-07-04 14:24       ` Rik van Riel
  2013-07-04 19:36       ` Johannes Weiner
  0 siblings, 2 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-04  9:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Linux-MM, LKML

On Wed, Jul 03, 2013 at 05:56:54PM -0400, Johannes Weiner wrote:
> On Wed, Jul 03, 2013 at 03:21:34PM +0100, Mel Gorman wrote:
> > Ideally it would be possible to distinguish between NUMA hinting faults
> > that are private to a task and those that are shared. This would require
> > that the last task that accessed a page for a hinting fault would be
> > recorded which would increase the size of struct page. Instead this patch
> > approximates private pages by assuming that faults that pass the two-stage
> > filter are private pages and all others are shared. The preferred NUMA
> > node is then selected based on where the maximum number of approximately
> > private faults were measured. Shared faults are not taken into
> > consideration for a few reasons.
> 
> Ingo had a patch that would just encode a few bits of the PID along
> with the last_nid (last_cpu in his case) member of struct page.  No
> extra space required and should be accurate enough.
> 

Yes, I'm aware of it. I noted in the changelog that ideally we'd record
the task both to remind myself and so that the patch that introduces it
could refer to this changelog so there is some sort of logical progression
for reviewers.

I was not keen on the use of last_cpu because I felt there was an implicit
assumption that scanning would always be fast enough to record hinting
faults before a task got moved to another CPU for any reason. I feared this
would be worse as memory and task sizes increased. That's why I stayed
with tracking the nid for the two-stage filter until it could be proven
it was insufficient for some reason.

The reason there is nothing resembling pid tracking yet is that the series
is already a bit of a mouthful and I thought the other parts were more
important for now.
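
Purely to illustrate the sort of packing being talked about, something like
the below (the names and the bit split are made up, this is not what Ingo's
patch actually did):

/* hypothetical: pack the low bits of the pid next to the nid in one int */
#define NIDPID_PID_BITS		8
#define NIDPID_PID_MASK		((1 << NIDPID_PID_BITS) - 1)

static inline int nidpid_encode(int nid, int pid)
{
	return (nid << NIDPID_PID_BITS) | (pid & NIDPID_PID_MASK);
}

/* a hinting fault would only be treated as private if both halves match */
static inline bool nidpid_match(int last_nidpid, int nid, int pid)
{
	return last_nidpid == nidpid_encode(nid, pid);
}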

> Otherwise this is blind to sharedness within the node the task is
> currently running on, right?
> 

Yes, it is.

> > First, if there are many tasks sharing the page then they'll all move
> > towards the same node. The node will be compute overloaded and then
> > scheduled away later only to bounce back again. Alternatively the shared
> > tasks would just bounce around nodes because the fault information is
> > effectively noise. Either way accounting for shared faults the same as
> > private faults may result in lower performance overall.
> 
> When the node with many shared pages is compute overloaded then there
> is arguably not an optimal node for the tasks and moving them off is
> inevitable. 

Yes. If such an event occurs then the ideal is that the task interleaves
between a subset of nodes. The situation could be partially detected by
tracking whether the number of historical faults is appreciably larger
than those counted on the preferred node and then interleaving between the
top N most-faulted nodes until the working set fits. Starting the
interleave should just be a matter of coding. The difficulty is correctly
backing that off if there is a phase change.
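
A rough, untested sketch of how the interleave set could be selected, just
to show the shape of it (entirely hypothetical, nothing like this is in the
series):

/*
 * Build a nodemask of the nr nodes with the most private hinting faults.
 * A task whose working set does not fit on one node could then interleave
 * its memory between them.
 */
static void task_numa_top_nodes(struct task_struct *p, int nr,
				nodemask_t *mask)
{
	nodes_clear(*mask);

	while (nr--) {
		unsigned long best_faults = 0;
		int nid, best_nid = -1;

		for_each_online_node(nid) {
			unsigned long faults;

			if (node_isset(nid, *mask))
				continue;

			faults = p->numa_faults[task_faults_idx(nid, 1)];
			if (faults > best_faults) {
				best_faults = faults;
				best_nid = nid;
			}
		}

		if (best_nid == -1)
			break;
		node_set(best_nid, *mask);
	}
}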

> However, the node with the most page accesses, private or
> shared, is still the preferred node from a memory stand point.
> Compute load being equal, the task should go to the node with 2GB of
> shared memory and not to the one with 2 private pages.
> 

Agreed. The level of shared vs private needs to be detected. The problem
here is that detecting private dominated workloads is not straight-forward,
particularly as the scan rate slows as we've already discussed.

> If the load balancer moves the task off due to cpu load reasons,
> wouldn't the settle count mechanism prevent it from bouncing back?
> 
> Likewise, if the cpu load situation changes, the balancer could move
> the task back to its truly preferred node.
> 
> > The second reason is based on a hypothetical workload that has a small
> > number of very important, heavily accessed private pages but a large shared
> > array. The shared array would dominate the number of faults and be selected
> > as a preferred node even though it's the wrong decision.
> 
> That's a scan granularity problem and I can't see how you solve it
> with ignoring the shared pages. 

I acknowledge it's a problem and basically I'm making a big assumption
that private-dominated workloads are going to be the common case. Threaded
applications on UMA with heavy amounts of shared data (within cache lines)
already suck in terms of performance so I'm expecting programmers already
try to avoid this sort of sharing. Obviously we are at page granularity
here so the assumption will depend entirely on alignments and buffer sizes
and it might still fall apart.

I think that dealing with this specific problem is a series all on its
own and treating it on its own in isolation would be best.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults
  2013-07-03 18:27   ` Peter Zijlstra
@ 2013-07-04  9:25     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-04  9:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 08:27:48PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2013 at 03:21:36PM +0100, Mel Gorman wrote:
> >  static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
> >  {
> 
> > +	if (p->numa_faults[task_faults_idx(dst_nid, 1)] >
> > +	    p->numa_faults[task_faults_idx(src_nid, 1)])
> > +		return true;
> 
> > +}
> 
> > +static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
> > +{
> 
> > +	if (p->numa_faults[src_nid] > p->numa_faults[dst_nid])
> >  		return true;
> 
> I bet you wanted to use task_faults_idx() there too ;-)
> 

You won that bet. Fixed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount
  2013-07-03 18:35   ` Peter Zijlstra
@ 2013-07-04  9:27     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-04  9:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 08:35:17PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2013 at 03:21:39PM +0100, Mel Gorman wrote:
> > Initial support for automatic NUMA balancing was unable to distinguish
> > between false shared versus private pages except by ignoring pages with an
> > elevated page_mapcount entirely. This patch kicks away the training wheels
> > as initial support for identifying shared/private pages is now in place.
> > Note that the patch still leaves shared, file-backed in VM_EXEC vmas in
> > place guessing that these are shared library pages. Migrating them are
> > likely to be of major benefit as generally the expectation would be that
> > these are read-shared between caches and that iTLB and iCache pressure is
> > generally low.
> 
> This reminds me; there a clause in task_numa_work() that skips 'small' VMAs. I
> don't see the point of that.
> 

It was a stupid hack initially to keep scan rates down and it was on the
TODO list to get rid of it and replace it with something else. I'll just
get rid of it for now without the replacement. Patch looks like this.

---8<---
sched: Remove check that skips small VMAs

task_numa_work skips small VMAs. At the time the logic was to reduce the
scanning overhead which was considerable. It is a dubious hack at best. It
would make much more sense to cache where faults have been observed and
only rescan those regions during subsequent PTE scans. Remove this hack
as motivation to do it properly in the future.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d34c6e..921265b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1075,10 +1075,6 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma))
 			continue;
 
-		/* Skip small VMAs. They are not likely to be of relevance */
-		if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
-			continue;
-
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount
  2013-07-03 18:41   ` Peter Zijlstra
@ 2013-07-04  9:32     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-04  9:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 08:41:24PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2013 at 03:21:39PM +0100, Mel Gorman wrote:
> 
> > Note that the patch still leaves shared, file-backed in VM_EXEC vmas in
> > place guessing that these are shared library pages. Migrating them are
> > likely to be of major benefit as generally the expectation would be that
> > these are read-shared between caches and that iTLB and iCache pressure is
> > generally low.
> 
> I'm failing to grasp.. we don't migrate them because migrating them would
> likely be beneficial?
> 
> Missing a negative somewhere?

Yes.

Note that the patch does not migrate shared, file-backed pages within vmas
marked VM_EXEC as these are generally shared library pages. Migrating such
pages is not beneficial as there is an expectation they are read-shared
between caches and iTLB and iCache pressure is generally low.
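
For illustration only, the kind of check described above, placed in the VMA
walk of task_numa_work(), might look something like this (the exact test in
the posted patch may differ):

		/*
		 * Illustrative sketch: skip file-backed, executable,
		 * non-writable mappings on the assumption that they are
		 * shared library text, which is read-shared between caches
		 * and not worth migrating.
		 */
		if (vma->vm_file && (vma->vm_flags & VM_EXEC) &&
		    !(vma->vm_flags & VM_WRITE))
			continue;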

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-03 18:32   ` Peter Zijlstra
@ 2013-07-04  9:37     ` Mel Gorman
  2013-07-04 13:07       ` Srikar Dronamraju
  0 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-04  9:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Jul 03, 2013 at 08:32:43PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2013 at 03:21:40PM +0100, Mel Gorman wrote:
> > ---
> >  kernel/sched/fair.c  | 45 ++++++++++++++++++++++++++++++++++++++++++---
> >  kernel/sched/sched.h |  4 ++++
> >  2 files changed, 46 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3c796b0..9ffdff3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -777,6 +777,18 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >   * Scheduling class queueing methods:
> >   */
> >  
> > +static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
> > +{
> > +	rq->nr_preferred_running +=
> > +			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
> > +}
> > +
> > +static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
> > +{
> > +	rq->nr_preferred_running -=
> > +			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
> > +}
> 
> Ah doing this requires you dequeue before changing ->numa_preferred_nid. I
> don't remember seeing that change in this series.

That's because it never happened. This is what the patch currently looks
like after the shuffling.

---8<---

sched: Account for the number of preferred tasks running on a node when selecting a preferred node

It is preferred that tasks always run local to their memory but it is
not optimal if that node is compute-overloaded and the task is failing
to get access to a CPU. That would have the load balancer trying to move
tasks off the node while NUMA balancing moves them back.

Ultimately, it will be required that the compute load of each node be
calculated and minimised along with the number of remote accesses until
the optimal balance point is reached. Begin this process by simply
accounting for the number of tasks that are running on their preferred
node. When deciding what node to place a task on, do not place a task
on a node that already has more preferred-placement tasks than it has
CPUs.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  | 34 ++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c  | 49 +++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  5 +++++
 3 files changed, 82 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 02db92a..13b9068 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6112,6 +6112,40 @@ static struct sched_domain_topology_level default_topology[] = {
 
 static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+#ifdef CONFIG_NUMA_BALANCING
+void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu)
+{
+	struct rq *rq;
+	unsigned long flags;
+	bool on_rq, running;
+
+	/*
+	 * Dequeue task before updating preferred_nid so
+	 * rq->nr_preferred_running is accurate
+	 */
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	/* Update the preferred nid and migrate task if possible */
+	p->numa_preferred_nid = nid;
+	p->numa_migrate_seq = 0;
+
+	/* Requeue task if necessary */
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
+
+#endif
+
 #ifdef CONFIG_NUMA
 
 static int sched_domains_numa_levels;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 921265b..99f3cad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -777,6 +777,18 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_preferred_running +=
+			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_preferred_running -=
+			(cpu_to_node(task_cpu(p)) == p->numa_preferred_nid);
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
@@ -870,6 +882,21 @@ static inline int task_faults_idx(int nid, int priv)
 	return 2 * nid + priv;
 }
 
+/* Returns true if the given node is compute overloaded */
+static bool sched_numa_overloaded(int nid)
+{
+	int nr_cpus = 0;
+	int nr_preferred = 0;
+	int i;
+
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		nr_cpus++;
+		nr_preferred += cpu_rq(i)->nr_preferred_running;
+	}
+
+	return nr_preferred >= nr_cpus << 1;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -897,7 +924,7 @@ static void task_numa_placement(struct task_struct *p)
 
 		/* Find maximum private faults */
 		faults = p->numa_faults[task_faults_idx(nid, 1)];
-		if (faults > max_faults) {
+		if (faults > max_faults && !sched_numa_overloaded(nid)) {
 			max_faults = faults;
 			max_nid = nid;
 		}
@@ -923,9 +950,7 @@ static void task_numa_placement(struct task_struct *p)
 							     max_nid);
 		}
 
-		/* Update the preferred nid and migrate task if possible */
-		p->numa_preferred_nid = max_nid;
-		p->numa_migrate_seq = 0;
+		sched_setnuma(p, max_nid, preferred_cpu);
 		migrate_task_to(p, preferred_cpu);
 
 		/*
@@ -1148,6 +1173,14 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1157,8 +1190,10 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_enqueue(rq_of(cfs_rq), task_of(se));
 		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	}
 #endif
 	cfs_rq->nr_running++;
 }
@@ -1169,8 +1204,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
+	}
 	cfs_rq->nr_running--;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 795346d..1d7c0fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -433,6 +433,10 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long nr_preferred_running;
+#endif
+
 	u64 rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
@@ -504,6 +508,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int nid, int idlest_cpu);
 extern int migrate_task_to(struct task_struct *p, int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected
  2013-07-03 14:21 ` [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
@ 2013-07-04 12:26   ` Srikar Dronamraju
  2013-07-04 13:29     ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Srikar Dronamraju @ 2013-07-04 12:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-07-03 15:21:33]:

> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2a0bbc2..b9139be 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -800,6 +800,37 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
>   */
>  unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
> 
> +static unsigned long weighted_cpuload(const int cpu);
> +
> +static int
> +find_idlest_cpu_node(int this_cpu, int nid)
> +{
> +	unsigned long load, min_load = ULONG_MAX;
> +	int i, idlest_cpu = this_cpu;
> +
> +	BUG_ON(cpu_to_node(this_cpu) == nid);
> +
> +	for_each_cpu(i, cpumask_of_node(nid)) {
> +		load = weighted_cpuload(i);
> +
> +		if (load < min_load) {
> +			struct task_struct *p;
> +
> +			/* Do not preempt a task running on its preferred node */
> +			struct rq *rq = cpu_rq(i);
> +			raw_spin_lock_irq(&rq->lock);

Not sure why we need this spin_lock. Can't this be done in an RCU block
instead?


-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 11/13] sched: Check current->mm before allocating NUMA faults
  2013-07-03 14:21 ` [PATCH 11/13] sched: Check current->mm before allocating NUMA faults Mel Gorman
  2013-07-03 15:33   ` Mel Gorman
@ 2013-07-04 12:48   ` Srikar Dronamraju
  2013-07-05 10:07     ` Mel Gorman
  1 sibling, 1 reply; 43+ messages in thread
From: Srikar Dronamraju @ 2013-07-04 12:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2013-07-03 15:21:38]:

> task_numa_placement checks current->mm but after buffers for faults
> have already been uselessly allocated. Move the check earlier.
> 
> [peterz@infradead.org: Identified the problem]
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  kernel/sched/fair.c | 22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 336074f..3c796b0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -870,8 +870,6 @@ static void task_numa_placement(struct task_struct *p)
>  	int seq, nid, max_nid = 0;
>  	unsigned long max_faults = 0;
> 
> -	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
> -		return;
>  	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
>  	if (p->numa_scan_seq == seq)
>  		return;
> @@ -945,6 +943,12 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
>  	if (!sched_feat_numa(NUMA))
>  		return;
> 
> +	/* for example, ksmd faulting in a user's mm */
> +	if (!p->mm) {
> +		p->numa_scan_period = sysctl_numa_balancing_scan_period_max;

Naive question:
Why are we resetting the scan_period?

> +		return;
> +	}
> +
>  	/* Allocate buffer to track faults on a per-node basis */
>  	if (unlikely(!p->numa_faults)) {
>  		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
> @@ -1072,16 +1076,18 @@ void task_numa_work(struct callback_head *work)
>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>  			end = min(end, vma->vm_end);
>  			nr_pte_updates += change_prot_numa(vma, start, end);
> -			pages -= (end - start) >> PAGE_SHIFT;
> -
> -			start = end;
> 
>  			/*
>  			 * Scan sysctl_numa_balancing_scan_size but ensure that
> -			 * least one PTE is updated so that unused virtual
> -			 * address space is quickly skipped
> +			 * at least one PTE is updated so that unused virtual
> +			 * address space is quickly skipped.
>  			 */
> -			if (pages <= 0 && nr_pte_updates)
> +			if (nr_pte_updates)
> +				pages -= (end - start) >> PAGE_SHIFT;
> +
> +			start = end;
> +
> +			if (pages <= 0)
>  				goto out;
>  		} while (end != vma->vm_end);
>  	}
> -- 
> 1.8.1.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-04  9:37     ` Mel Gorman
@ 2013-07-04 13:07       ` Srikar Dronamraju
  2013-07-04 13:54         ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Srikar Dronamraju @ 2013-07-04 13:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

>  static void task_numa_placement(struct task_struct *p)
>  {
>  	int seq, nid, max_nid = 0;
> @@ -897,7 +924,7 @@ static void task_numa_placement(struct task_struct *p)
> 
>  		/* Find maximum private faults */
>  		faults = p->numa_faults[task_faults_idx(nid, 1)];
> -		if (faults > max_faults) {
> +		if (faults > max_faults && !sched_numa_overloaded(nid)) {

Should we take the other approach of setting the preferred nid but not 
moving the task to the node?

So if some task moves out of the preferred node, then we should still be
able to move this task there. 

However, your current approach has an advantage in that it at least runs on
the second preferred choice if not the first.

Also should sched_numa_overloaded() also consider pinned tasks?

>  			max_faults = faults;
>  			max_nid = nid;
>  		}
> @@ -923,9 +950,7 @@ static void task_numa_placement(struct task_struct *p)
>  							     max_nid);
>  		}
> 
> -		/* Update the preferred nid and migrate task if possible */
> -		p->numa_preferred_nid = max_nid;
> -		p->numa_migrate_seq = 0;
> +		sched_setnuma(p, max_nid, preferred_cpu);
>  		migrate_task_to(p, preferred_cpu);
> 
>  		/*


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected
  2013-07-04 12:26   ` Srikar Dronamraju
@ 2013-07-04 13:29     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-04 13:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 05:56:44PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-07-03 15:21:33]:
> 
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2a0bbc2..b9139be 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -800,6 +800,37 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
> >   */
> >  unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
> > 
> > +static unsigned long weighted_cpuload(const int cpu);
> > +
> > +static int
> > +find_idlest_cpu_node(int this_cpu, int nid)
> > +{
> > +	unsigned long load, min_load = ULONG_MAX;
> > +	int i, idlest_cpu = this_cpu;
> > +
> > +	BUG_ON(cpu_to_node(this_cpu) == nid);
> > +
> > +	for_each_cpu(i, cpumask_of_node(nid)) {
> > +		load = weighted_cpuload(i);
> > +
> > +		if (load < min_load) {
> > +			struct task_struct *p;
> > +
> > +			/* Do not preempt a task running on its preferred node */
> > +			struct rq *rq = cpu_rq(i);
> > +			raw_spin_lock_irq(&rq->lock);
> 
> Not sure why we need this spin_lock. Can't this be done in an RCU block
> instead?
> 

Judging by how find_idlest_cpu works it would appear you are correct.
Thanks very much, I'm still pretty much a scheduler wuss. I know what I
want but not always how to get it :)
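
For illustration, a lock-free variant along the lines Srikar suggests might
look roughly like this (a sketch only, not the code that was posted):

static int find_idlest_cpu_node(int this_cpu, int nid)
{
	unsigned long load, min_load = ULONG_MAX;
	int i, idlest_cpu = this_cpu;

	BUG_ON(cpu_to_node(this_cpu) == nid);

	rcu_read_lock();
	for_each_cpu(i, cpumask_of_node(nid)) {
		load = weighted_cpuload(i);

		if (load < min_load) {
			/*
			 * A stale ->curr is acceptable; this is only a
			 * heuristic to avoid preempting a task that is
			 * already running on its preferred node.
			 */
			struct task_struct *p = ACCESS_ONCE(cpu_rq(i)->curr);

			if (p->numa_preferred_nid != nid) {
				min_load = load;
				idlest_cpu = i;
			}
		}
	}
	rcu_read_unlock();

	return idlest_cpu;
}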

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-04 13:07       ` Srikar Dronamraju
@ 2013-07-04 13:54         ` Mel Gorman
  2013-07-04 14:06           ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2013-07-04 13:54 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 06:37:19PM +0530, Srikar Dronamraju wrote:
> >  static void task_numa_placement(struct task_struct *p)
> >  {
> >  	int seq, nid, max_nid = 0;
> > @@ -897,7 +924,7 @@ static void task_numa_placement(struct task_struct *p)
> > 
> >  		/* Find maximum private faults */
> >  		faults = p->numa_faults[task_faults_idx(nid, 1)];
> > -		if (faults > max_faults) {
> > +		if (faults > max_faults && !sched_numa_overloaded(nid)) {
> 
> Should we take the other approach of setting the preferred nid but not 
> moving the task to the node?
> 

Why would that be better?

> So if some task moves out of the preferred node, then we should still be
> able to move this task there. 
> 

I think if we were to do that then I'd revisit the "task swap" logic from
autonuma (numacore had something similar) and search for pairs of tasks
that both benefit from a swap. I prototyped something basic along these
lines but it was premature. It's a more directed approach but one that
should be done only when the private/shared and load logic is solidified.

> However, your current approach has an advantage in that it at least runs on
> the second preferred choice if not the first.
> 

That was the intention.

> Also should sched_numa_overloaded() also consider pinned tasks?
> 

I don't think sched_numa_overloaded() needs to as such, at least I don't see
how it would do it sensibly right now. However, you still make an important
point in that find_idlest_cpu_node should take it into account. How about this?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9247345..387f28d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -863,9 +863,13 @@ find_idlest_cpu_node(int this_cpu, int nid)
 		load = weighted_cpuload(i);
 
 		if (load < min_load) {
-			/* Do not preempt a task running on a preferred node */
+			/*
+			 * Do not preempt a task running on a preferred node or
+			 * a task that is pinned to its current CPU
+			 */
 			struct task_struct *p = cpu_rq(i)->curr;
-			if (p->numa_preferred_nid != nid) {
+			if (p->numa_preferred_nid != nid &&
+			    cpumask_weight(tsk_cpus_allowed(p)) > 1) {
 				min_load = load;
 				idlest_cpu = i;
 			}

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-04 13:54         ` Mel Gorman
@ 2013-07-04 14:06           ` Peter Zijlstra
  2013-07-04 14:40             ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-04 14:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 04, 2013 at 02:54:15PM +0100, Mel Gorman wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9247345..387f28d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -863,9 +863,13 @@ find_idlest_cpu_node(int this_cpu, int nid)
>  		load = weighted_cpuload(i);
>  
>  		if (load < min_load) {
> -			/* Do not preempt a task running on a preferred node */
> +			/*
> +			 * Do not preempt a task running on a preferred node or
> +			 * a task that is pinned to its current CPU
> +			 */
>  			struct task_struct *p = cpu_rq(i)->curr;
> -			if (p->numa_preferred_nid != nid) {
> +			if (p->numa_preferred_nid != nid &&
> +			    cpumask_weight(tsk_cpus_allowed(p)) > 1) {

We have p->nr_cpus_allowed for that.

>  				min_load = load;
>  				idlest_cpu = i;
>  			}

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-04  9:23     ` Mel Gorman
@ 2013-07-04 14:24       ` Rik van Riel
  2013-07-04 19:36       ` Johannes Weiner
  1 sibling, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2013-07-04 14:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Peter Zijlstra, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Linux-MM, LKML

On 07/04/2013 05:23 AM, Mel Gorman wrote:

> I think that dealing with this specific problem is a series all on its
> own and treating it on its own in isolation would be best.

Agreed, let's tackle one thing at a time, otherwise we will
(once again) end up with a patch series that is too large
to merge.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node
  2013-07-04 14:06           ` Peter Zijlstra
@ 2013-07-04 14:40             ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-04 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 04, 2013 at 04:06:13PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 04, 2013 at 02:54:15PM +0100, Mel Gorman wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9247345..387f28d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -863,9 +863,13 @@ find_idlest_cpu_node(int this_cpu, int nid)
> >  		load = weighted_cpuload(i);
> >  
> >  		if (load < min_load) {
> > -			/* Do not preempt a task running on a preferred node */
> > +			/*
> > +			 * Do not preempt a task running on a preferred node or
> > +			 * a task that is pinned to its current CPU
> > +			 */
> >  			struct task_struct *p = cpu_rq(i)->curr;
> > -			if (p->numa_preferred_nid != nid) {
> > +			if (p->numa_preferred_nid != nid &&
> > +			    cpumask_weight(tsk_cpus_allowed(p)) > 1) {
> 
> We have p->nr_cpus_allowed for that.
> 

/me slaps self

That's a bit easier to calculate, thanks.
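
i.e. the condition would end up along these lines (illustrative fragment from
inside the loop):

			/*
			 * Do not preempt a task running on its preferred node
			 * or a task that is pinned to its current CPU.
			 */
			struct task_struct *p = cpu_rq(i)->curr;
			if (p->numa_preferred_nid != nid && p->nr_cpus_allowed > 1) {
				min_load = load;
				idlest_cpu = i;
			}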

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC WIP] Process weights based scheduling for better consolidation
  2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
                   ` (13 preceding siblings ...)
  2013-07-03 16:19 ` [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
@ 2013-07-04 18:02 ` Srikar Dronamraju
  2013-07-05 10:16   ` Peter Zijlstra
  14 siblings, 1 reply; 43+ messages in thread
From: Srikar Dronamraju @ 2013-07-04 18:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

Here is an approach to look at NUMA-balanced scheduling from a non-NUMA-fault
angle. This approach uses process weights instead of faults as the basis to
move or bring tasks together.

Here are the advantages of this approach.
1. Provides excellent consolidation of tasks.
	- I have verified this with sched_autonuma_dump_mm(), which was part
	  of Andrea's autonuma patches; refer to commit id:

	commit aba373d04251691b5e0987a0fff2fa7007311810
	Author: Andrea Arcangeli <aarcange@redhat.com>
	Date:   Fri Mar 23 20:35:07 2012 +0100

	    autonuma: CPU follows memory algorithm
 
 From limited experiments, I have found that the better the task
 consolidation, the better the memory layout we achieve, which results
 in better performance.

2. Provides good benefit in whatever limited testing I have done so
  far. For example, it provides a _20+%_ improvement for numa01
  (autonuma-benchmark).

3. Since it doesn't depend on NUMA faulting, it doesn't have the overhead
  of having to get the scanning rate right.

4. Aims to extend the load balancer, especially when the CPUs are idling.

5. Code looks much simpler and naive to me. (But yes this is relative!!!)

Results on a 2 node 12 core system:

KernelVersion: 3.9.0 (with hyper threading)
		Testcase:      Min      Max      Avg
		  numa01:   220.12   246.96   239.18
		  numa02:    41.85    43.02    42.43

KernelVersion: 3.9.0 + code (with hyper threading)
		Testcase:      Min      Max      Avg  %Change
		  numa01:   174.97   219.38   198.99   20.20%
		  numa02:    38.24    38.50    38.38   10.55%

KernelVersion: 3.9.0 (noht)
		Testcase:      Min      Max      Avg
		  numa01:   118.72   121.04   120.23
		  numa02:    36.64    37.56    36.99

KernelVersion: 3.9.0 + code (noht)
		Testcase:      Min      Max      Avg  %Change
		  numa01:    92.95   113.36   108.21   11.11%
		  numa02:    36.76    38.91    37.34   -0.94%


/usr/bin/time -f %e %S %U %c %w 
i.e. elapsed, sys, user, involuntary and voluntary context switches
Best case performance for v3.9

numa01 		220.12 17.14 5041.27 522147 1273
numa02		 41.91 2.47 887.46 92079 8

Best case performance for v3.9 + code.
numa01			 174.97 17.46 4102.64 433804 1846
numa01_THREAD_ALLOC	 288.04 15.76 6783.86 718220 174
numa02			 38.41 0.75 905.65 95364 5
numa02_SMT		 46.43 0.55 487.30 66416 7

Best case memory layout for v3.9
9	416.44		5728.73	
19	356.42		5788.75	
30	722.49		5422.68	
40	1936.50		4208.67	
50	1372.40		4772.77	
60	1354.39		4790.78	
71	1512.39		4632.78	
81	1598.40		4546.77	
91	2242.40		3902.77	
101	2242.40		3902.78	
111	2654.41		3490.77	
122	2654.40		3490.77	
132	2976.30		3168.87	
142	2956.30		3188.87	
152	2956.30		3188.87	
162	2956.30		3188.87	
173	3044.30		3100.87	
183	3058.30		3086.87	
193	3204.20		2942.87	
203	3262.20		2884.89	
213	3262.18		2884.91	

Best case memory layout for v3.9 + code
10	6140.55		4.64	
20	3728.99		2416.18	
30	3066.45		3078.73	
40	3072.46		3072.73	
51	3072.46		3072.73	
61	3072.46		3072.73	
71	3072.46		3072.73	
81	3072.46		3072.73	
91	3072.46		3072.73	
102	3072.46		3072.73	
112	3072.46		3072.73	
122	3072.46		3072.73	
132	3072.46		3072.73	
142	3072.46		3072.73	
152	3072.46		3072.73	
163	3072.46		3072.73	
173	3072.44		3072.74	


Having said that, I am sure the experts have already thought of this
approach and might have reasons to discard it. Hence the code is not yet
in patchset format, nor do I have the extensive analysis that Mel has for
his patchset. I thought of posting the code out in some form so that I
know if there are any obvious pitfalls with this approach.

Here is the outline of the approach.

- Every process has a per-node array where we store the weight of all
  its tasks running on that node. This array gets updated on task
  enqueue/dequeue.

- Added a two-pass mechanism (somewhat taken from numacore but not
  exactly) when choosing tasks to move across nodes.

  In the first pass, choose only tasks that are ideal to be moved.
  While choosing a task, look at the per-node process arrays to see if
  moving the task helps.
  If the first pass fails to move a task, any task can be chosen on the
  second pass.
 
- If the regular load balancer (rebalance_domains()) fails to balance the
  load (or finds no imbalance) and the CPU is idle, use that CPU to
  consolidate tasks to the nodes by using the information in the per-node
  process arrays.

  Every idle CPU, if it doesn't have tasks queued after load balance,
  - will walk through the CPUs in its node and check if there are buddy
    tasks that are not part of the node but ideally should have been
    part of this node.
  - To make sure that we don't pull all buddy tasks and create an
    imbalance, we look at the load on the node, pinned tasks and the
    process's contribution to the load for this node.
  - Each CPU looks at the node which has the least number of buddy tasks
    running and tries to pull tasks from such nodes.

  - Once it finds the CPU from which to pull the tasks, it triggers
    active balancing. This type of active balancing makes just one
    pass, i.e. it only fetches tasks that increase NUMA locality.

Thanks for taking a look and providing your valuable inputs.

---8<---

sched: Using process weights to consolidate tasks

If we consolidate related tasks to one node, memory tends to follow to
that node. If the memory and tasks end up in one node, it results in
better performance. 

To achieve this, the code below tries to extend the current load
balancing while idling to move tasks in such a way that related tasks
end up based on the same node. Care is taken not to overload a node
while moving the tasks.

This code also adds iterations logic to the regular move task logic to
further consolidate tasks while performing the regular load balancing.

Not-yet-signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 fs/exec.c                |    4 +
 include/linux/migrate.h  |    1 -
 include/linux/mm_types.h |    1 +
 include/linux/sched.h    |    2 +
 kernel/fork.c            |   10 +-
 kernel/sched/core.c      |    2 +
 kernel/sched/fair.c      |  338 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h     |    4 +
 8 files changed, 344 insertions(+), 18 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a96a488..54589d0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -833,6 +833,10 @@ static int exec_mmap(struct mm_struct *mm)
 	activate_mm(active_mm, mm);
 	task_unlock(tsk);
 	arch_pick_mmap_layout(mm);
+#ifdef CONFIG_NUMA_BALANCING
+	mm->numa_weights = kzalloc(sizeof(unsigned long) * (nr_node_ids + 1), GFP_KERNEL);
+	tsk->task_load = 0;
+#endif
 	if (old_mm) {
 		up_read(&old_mm->mmap_sem);
 		BUG_ON(active_mm != old_mm);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3d..086bd33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -93,7 +93,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 
 #ifdef CONFIG_NUMA_BALANCING
 extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
 extern bool migrate_ratelimited(int node);
 #else
 static inline int migrate_misplaced_page(struct page *page, int node)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..bb402d3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -435,6 +435,7 @@ struct mm_struct {
 	 * a different node than Make PTE Scan Go Now.
 	 */
 	int first_nid;
+	unsigned long *numa_weights;
 #endif
 	struct uprobes_state uprobes_state;
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..2736ec6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
@@ -1505,6 +1506,7 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+	unsigned long task_load;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/fork.c b/kernel/fork.c
index 1766d32..14c7aea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -617,6 +617,9 @@ void mmput(struct mm_struct *mm)
 		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
+#ifdef CONFIG_NUMA_BALANCING
+		kfree(mm->numa_weights);
+#endif
 		if (!list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
 			list_del(&mm->mmlist);
@@ -823,9 +826,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
-#ifdef CONFIG_NUMA_BALANCING
-	mm->first_nid = NUMA_PTE_SCAN_INIT;
-#endif
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
@@ -844,6 +844,10 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 	if (mm->binfmt && !try_module_get(mm->binfmt->module))
 		goto free_pt;
 
+#ifdef CONFIG_NUMA_BALANCING
+	mm->first_nid = NUMA_PTE_SCAN_INIT;
+	mm->numa_weights = kzalloc(sizeof(unsigned long) * (nr_node_ids + 1), GFP_KERNEL);
+#endif
 	return mm;
 
 free_pt:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..82f8f79 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->task_load = 0;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_NUMA_BALANCING */
 }
@@ -6136,6 +6137,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..15d71a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -777,6 +777,8 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static unsigned long task_h_load(struct task_struct *p);
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * numa task sample period in ms
@@ -791,6 +793,60 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	unsigned long task_load = 0;
+	int curnode = cpu_to_node(cpu_of(rq));
+#ifdef CONFIG_SCHED_AUTOGROUP
+	struct sched_entity *se;
+
+	se = cfs_rq->tg->se[cpu_of(rq)];
+	if (!se)
+		return;
+
+	if (cfs_rq->load.weight) {
+		task_load =  p->se.load.weight * se->load.weight;
+		task_load /= cfs_rq->load.weight;
+	} else {
+		task_load = 0;
+	}
+#else
+	task_load = p->se.load.weight;
+#endif
+	p->task_load = 0;
+	if (!task_load)
+		return;
+
+	if (p->mm && p->mm->numa_weights) {
+		p->mm->numa_weights[curnode] += task_load;
+		p->mm->numa_weights[nr_node_ids] += task_load;
+	}
+
+	if (p->nr_cpus_allowed != num_online_cpus())
+		rq->pinned_load += task_load;
+	p->task_load = task_load;
+}
+
+static void account_numa_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	unsigned long task_load = p->task_load;
+	int curnode = cpu_to_node(cpu_of(rq));
+
+	p->task_load = 0;
+	if (!task_load)
+		return;
+
+	if (p->mm && p->mm->numa_weights) {
+		p->mm->numa_weights[curnode] -= task_load;
+		p->mm->numa_weights[nr_node_ids] -= task_load;
+	}
+
+	if (p->nr_cpus_allowed != num_online_cpus())
+		rq->pinned_load -= task_load;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq;
@@ -999,6 +1055,12 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+}
+static void account_numa_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1008,8 +1070,11 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
-		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	if (entity_is_task(se)) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		list_add(&se->group_node, &rq->cfs_tasks);
+	}
 #endif
 	cfs_rq->nr_running++;
 }
@@ -1713,6 +1778,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
+	if (entity_is_task(se))
+		account_numa_enqueue(cfs_rq, task_of(se));
 
 	if (cfs_rq->nr_running == 1) {
 		list_add_leaf_cfs_rq(cfs_rq);
@@ -1810,6 +1877,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
+	if (entity_is_task(se))
+		account_numa_dequeue(cfs_rq, task_of(se));
 }
 
 /*
@@ -3292,6 +3361,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static int
+check_numa_affinity(struct task_struct *p, int cpu, int prev_cpu)
+{
+	struct mm_struct *mm = p->mm;
+	struct rq *rq = cpu_rq(prev_cpu);
+	int source_node = cpu_to_node(prev_cpu);
+	int target_node = cpu_to_node(cpu);
+
+	if (mm && mm->numa_weights) {
+		unsigned long *weights = mm->numa_weights;
+
+		if (weights[target_node] > weights[source_node]) {
+			if (!rq->ab_node_load || weights[target_node] < rq->ab_node_load)
+				return 1;
+		}
+	}
+	return 0;
+}
+#else
+static int
+check_numa_affinity(struct task_struct *p, int cpu, int prev_cpu)
+{
+	return 0;
+}
+#endif
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3317,7 +3413,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return prev_cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) && check_numa_affinity(p, cpu, prev_cpu))
 			want_affine = 1;
 		new_cpu = prev_cpu;
 	}
@@ -3819,6 +3915,7 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+	unsigned int		iterations;
 };
 
 /*
@@ -3865,6 +3962,37 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static bool force_migrate(struct lb_env *env, struct task_struct *p)
+{
+	struct mm_struct *mm = p->mm;
+	struct rq *rq = env->src_rq;
+	int source_node = cpu_to_node(env->src_cpu);
+	int target_node = cpu_to_node(env->dst_cpu);
+
+	if (env->sd->nr_balance_failed > env->sd->cache_nice_tries)
+		return true;
+
+	if (!(env->sd->flags & SD_NUMA))
+		return false;
+
+	if (mm && mm->numa_weights) {
+		unsigned long *weights = mm->numa_weights;
+
+		if (weights[target_node] > weights[source_node]) {
+			if (!rq->ab_node_load || weights[target_node] < rq->ab_node_load)
+				return true;
+		}
+	}
+	return false;
+}
+#else
+static bool force_migrate(struct lb_env *env, struct task_struct *p)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -3916,26 +4044,51 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 1) task is cache cold, or
 	 * 2) too many balance attempts have failed.
 	 */
-
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
-	if (!tsk_cache_hot ||
-		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
+	if (tsk_cache_hot) {
+		if (force_migrate(env, p)) {
 #ifdef CONFIG_SCHEDSTATS
-		if (tsk_cache_hot) {
 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
 			schedstat_inc(p, se.statistics.nr_forced_migrations);
-		}
 #endif
-		return 1;
-	}
-
-	if (tsk_cache_hot) {
+			return 1;
+		}
 		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
 		return 0;
 	}
 	return 1;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static int preferred_node(struct task_struct *p, struct lb_env *env)
+{
+	struct mm_struct *mm = p->mm;
+
+	if (!(env->sd->flags & SD_NUMA))
+		return false;
+
+	if (mm && mm->numa_weights) {
+		struct rq *rq = env->src_rq;
+		unsigned long *weights = mm->numa_weights;
+		int target_node = cpu_to_node(env->dst_cpu);
+		int source_node = cpu_to_node(env->src_cpu);
+
+		if (weights[target_node] > weights[source_node]) {
+			if (!rq->ab_node_load || weights[target_node] < rq->ab_node_load)
+				return 1;
+		}
+	}
+	if (env->iterations)
+		return 1;
+	return 0;
+}
+#else
+static int preferred_node(struct task_struct *p, struct lb_env *env)
+{
+	return 0;
+}
+#endif
+
 /*
  * move_one_task tries to move exactly one task from busiest to this_rq, as
  * part of active balancing operations within "domain".
@@ -3947,7 +4100,11 @@ static int move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
+again:
 	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+		if (!preferred_node(p, env))
+			continue;
+
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
 
@@ -3955,6 +4112,7 @@ static int move_one_task(struct lb_env *env)
 			continue;
 
 		move_task(p, env);
+
 		/*
 		 * Right now, this is only the second place move_task()
 		 * is called, so we can safely collect move_task()
@@ -3963,11 +4121,12 @@ static int move_one_task(struct lb_env *env)
 		schedstat_inc(env->sd, lb_gained[env->idle]);
 		return 1;
 	}
+	if (!env->iterations++  && env->src_rq->active_balance != 2)
+		goto again;
+
 	return 0;
 }
 
-static unsigned long task_h_load(struct task_struct *p);
-
 static const unsigned int sched_nr_migrate_break = 32;
 
 /*
@@ -4002,6 +4161,9 @@ static int move_tasks(struct lb_env *env)
 			break;
 		}
 
+		if (!preferred_node(p, env))
+			goto next;
+
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 			goto next;
 
@@ -5005,6 +5167,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+		.iterations	= 1,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -5047,6 +5210,11 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		env.src_rq    = busiest;
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
+		if (sd->flags & SD_NUMA) {
+			if (cpu_to_node(env.dst_cpu) != cpu_to_node(env.src_cpu))
+				env.iterations = 0;
+		}
+
 		update_h_load(env.src_cpu);
 more_balance:
 		local_irq_save(flags);
@@ -5066,6 +5234,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			goto more_balance;
 		}
 
+		if (!ld_moved && !env.iterations) {
+			env.iterations++;
+			env.loop	 = 0;
+			env.loop_break	 = sched_nr_migrate_break;
+			goto more_balance;
+		}
+
 		/*
 		 * some other cpu did the load balance for us.
 		 */
@@ -5152,6 +5327,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			if (!busiest->active_balance) {
 				busiest->active_balance = 1;
 				busiest->push_cpu = this_cpu;
+#ifdef CONFIG_NUMA_BALANCING
+				busiest->ab_node_load = 0;
+#endif
 				active_balance = 1;
 			}
 			raw_spin_unlock_irqrestore(&busiest->lock, flags);
@@ -5313,8 +5491,14 @@ static int active_load_balance_cpu_stop(void *data)
 			.src_cpu	= busiest_rq->cpu,
 			.src_rq		= busiest_rq,
 			.idle		= CPU_IDLE,
+			.iterations	= 1,
 		};
 
+		if ((sd->flags & SD_NUMA)) {
+			if (cpu_to_node(env.dst_cpu) != cpu_to_node(env.src_cpu))
+				env.iterations = 0;
+		}
+
 		schedstat_inc(sd, alb_count);
 
 		if (move_one_task(&env))
@@ -5326,6 +5510,9 @@ static int active_load_balance_cpu_stop(void *data)
 	double_unlock_balance(busiest_rq, target_rq);
 out_unlock:
 	busiest_rq->active_balance = 0;
+#ifdef CONFIG_NUMA_BALANCING
+	busiest_rq->ab_node_load = 0;
+#endif
 	raw_spin_unlock_irq(&busiest_rq->lock);
 	return 0;
 }
@@ -5464,6 +5651,59 @@ void update_max_interval(void)
 	max_load_balance_interval = HZ*num_online_cpus()/10;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static int migrate_from_cpu(struct mm_struct *this_mm, int this_cpu, int nid)
+{
+	struct mm_struct *mm;
+	struct rq *rq;
+	int cpu;
+
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		rq = cpu_rq(cpu);
+		mm = rq->curr->mm;
+
+		if (mm == this_mm) {
+			if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(rq->curr)))
+				return cpu;
+		}
+	}
+	return -1;
+}
+
+static int migrate_from_node(unsigned long *weights, unsigned long load, int nid)
+{
+	unsigned long least_weight = weights[nid];
+	unsigned long node_load;
+	int least_node = -1;
+	int node, cpu;
+
+	for_each_online_node(node) {
+		if (node == nid)
+			continue;
+		if (weights[node] == 0)
+			continue;
+
+		node_load = 0;
+		for_each_cpu(cpu, cpumask_of_node(node)) {
+			node_load += weighted_cpuload(cpu);
+		}
+
+		if (load > node_load) {
+			if (load * nr_node_ids >= node_load * (nr_node_ids + 1))
+				continue;
+			if (weights[node] == least_weight)
+				continue;
+		}
+
+		if (weights[node] <=  least_weight) {
+			least_weight = weights[node];
+			least_node = node;
+		}
+	}
+	return least_node;
+}
+#endif
+
 /*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
@@ -5529,6 +5769,76 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 		if (!balance)
 			break;
 	}
+#ifdef CONFIG_NUMA_BALANCING
+	if (!rq->nr_running) {
+		struct mm_struct *prev_mm = NULL;
+		unsigned long load = 0, pinned_load = 0;
+		unsigned long *weights = NULL;
+		int node, nid, dcpu;
+		int this_cpu = -1;
+
+		nid = cpu_to_node(cpu);
+
+		/* Traverse only the allowed CPUs */
+		for_each_cpu(dcpu, cpumask_of_node(nid)) {
+			load += weighted_cpuload(dcpu);
+			pinned_load += cpu_rq(dcpu)->pinned_load;
+		}
+		for_each_cpu(dcpu, cpumask_of_node(nid)) {
+			struct rq *rq = cpu_rq(dcpu);
+			struct mm_struct *mm = rq->curr->mm;
+
+			if (!mm || !mm->numa_weights)
+				continue;
+
+			weights = mm->numa_weights;
+			if (!weights[nr_node_ids] || !weights[nid])
+				continue;
+
+			if (weights[nid] + pinned_load >= load)
+				break;
+			if (weights[nr_node_ids]/weights[nid] > nr_node_ids)
+				continue;
+
+			if (mm == prev_mm)
+				continue;
+
+			prev_mm = mm;
+			node = migrate_from_node(weights, load, nid);
+			if (node == -1)
+				continue;
+			this_cpu = migrate_from_cpu(mm, cpu, node);
+			if (this_cpu != -1)
+				break;
+		}
+		if (this_cpu != -1) {
+			struct rq *this_rq;
+			unsigned long flags;
+			int active_balance;
+
+			this_rq = cpu_rq(this_cpu);
+			active_balance = 0;
+
+			/*
+			 * ->active_balance synchronizes accesses to
+			 * ->active_balance_work.  Once set, it's cleared
+			 * only after active load balance is finished.
+			 */
+			raw_spin_lock_irqsave(&this_rq->lock, flags);
+			if (!this_rq->active_balance) {
+				this_rq->active_balance = 2;
+				this_rq->push_cpu = cpu;
+				this_rq->ab_node_load = load - pinned_load;
+				active_balance = 1;
+			}
+			raw_spin_unlock_irqrestore(&this_rq->lock, flags);
+
+			if (active_balance) {
+				stop_one_cpu_nowait(this_cpu, active_load_balance_cpu_stop, this_rq, &this_rq->active_balance_work);
+			}
+		}
+	}
+#endif
 	rcu_read_unlock();
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..0011bba 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -484,6 +484,10 @@ struct rq {
 #endif
 
 	struct sched_avg avg;
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long pinned_load;
+	unsigned long ab_node_load;
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-04  9:23     ` Mel Gorman
  2013-07-04 14:24       ` Rik van Riel
@ 2013-07-04 19:36       ` Johannes Weiner
  2013-07-05  9:41         ` Mel Gorman
  2013-07-05 10:48         ` Peter Zijlstra
  1 sibling, 2 replies; 43+ messages in thread
From: Johannes Weiner @ 2013-07-04 19:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 10:23:56AM +0100, Mel Gorman wrote:
> On Wed, Jul 03, 2013 at 05:56:54PM -0400, Johannes Weiner wrote:
> > On Wed, Jul 03, 2013 at 03:21:34PM +0100, Mel Gorman wrote:
> > > Ideally it would be possible to distinguish between NUMA hinting faults
> > > that are private to a task and those that are shared. This would require
> > > that the last task that accessed a page for a hinting fault would be
> > > recorded which would increase the size of struct page. Instead this patch
> > > approximates private pages by assuming that faults that pass the two-stage
> > > filter are private pages and all others are shared. The preferred NUMA
> > > node is then selected based on where the maximum number of approximately
> > > private faults were measured. Shared faults are not taken into
> > > consideration for a few reasons.
> > 
> > Ingo had a patch that would just encode a few bits of the PID along
> > with the last_nid (last_cpu in his case) member of struct page.  No
> > extra space required and should be accurate enough.
> > 
> 
> Yes, I'm aware of it. I noted in the changelog that ideally we'd record
> the task both to remind myself and so that the patch that introduces it
> could refer to this changelog so there is some sort of logical progression
> for reviewers.
> 
> I was not keen on the use of last_cpu because I felt there was an implicit
> assumption that scanning would always be fast enough to record hinting
> faults before a task got moved to another CPU for any reason. I feared this
> would be worse as memory and task sizes increased. That's why I stayed
> with tracking the nid for the two-stage filter until it could be proven
> it was insufficient for some reason.
> 
> The lack of anything resembling pid tracking now is that the series is
> already a bit of a mouthful and I thought the other parts were more
> important for now.

Fair enough.

> > Otherwise this is blind to sharedness within the node the task is
> > currently running on, right?
> > 
> 
> Yes, it is.
> 
> > > First, if there are many tasks sharing the page then they'll all move
> > > towards the same node. The node will be compute overloaded and then
> > > scheduled away later only to bounce back again. Alternatively the shared
> > > tasks would just bounce around nodes because the fault information is
> > > effectively noise. Either way accounting for shared faults the same as
> > > private faults may result in lower performance overall.
> > 
> > When the node with many shared pages is compute overloaded then there
> > is arguably not an optimal node for the tasks and moving them off is
> > inevitable. 
> 
> Yes. If such an event occurs then the ideal is that the task interleaves
> between a subset of nodes. The situation could be partially detected by
> tracking if the number of historical faults is approximately larger than
> the preferred node and then interleave between the top N nodes most faulted
> nodes until the working set fits. Starting the interleave should just be
> a matter of coding. The difficulty is correctly backing off that if there
> is a phase change.

Agreed, optimizing second-best placement can be dealt with later.  I'm
worried about optimal placement, though.

And I'm worried about skewing memory access statistics in order to
steer future situations that the CPU load balancer should handle instead.

> > However, the node with the most page accesses, private or
> > shared, is still the preferred node from a memory stand point.
> > Compute load being equal, the task should go to the node with 2GB of
> > shared memory and not to the one with 2 private pages.
> > 
> 
> Agreed. The level of shared vs private needs to be detected. The problem
> here is that detecting private dominated workloads is not straight-forward,
> particularly as the scan rate slows as we've already discussed.

I was going for the opposite conclusion: that it does not matter
whether memory is accessed privately or in a shared fashion, because
there is no obvious connection to its access frequency, not to me at
least.  Short of accurate access frequency sampling and supportive
data, the node with the most accesses in general should be the preferred
node, not the one with the most private accesses because it's the
smaller assumption to make.

> > > The second reason is based on a hypothetical workload that has a small
> > > number of very important, heavily accessed private pages but a large shared
> > > array. The shared array would dominate the number of faults and be selected
> > > as a preferred node even though it's the wrong decision.
> > 
> > That's a scan granularity problem and I can't see how you solve it
> > with ignoring the shared pages. 
> 
> I acknowledge it's a problem and basically I'm making a big assumption
> that private-dominated workloads are going to be the common case. Threaded
> applications on UMA with heavy amounts of shared data (within cache lines)
> already suck in terms of performance so I'm expecting programmers already
> try to avoid this sort of sharing. Obviously we are at a page granularity
> here so the assumption will depend entirely on alignments and buffer sizes
> so it might still fall apart.

Don't basically all VM-based multithreaded programs have this usage
pattern?  The whole runtime (text, heap) is shared between threads.
If some thread-local memory spills over to another node, should the
scheduler move this thread off node from a memory standpoint?  I don't
think so at all.  I would expect it to always gravitate back towards
this node with the VM on it, only get moved off for CPU load reasons,
and get moved back as soon as the load situation permits.

Meanwhile, if there is little shared memory and private memory spills
over to other nodes, you still don't know which node's memory is more
frequently used beyond scan period granularity.  Ignoring shared
memory accesses would not actually help find the right node in this
case.  You might even make it worse: since private accesses are all
treated equally, you assume equal access frequency among them.  Shared
accesses could now be the distinguishing data point to tell which node
sees more memory accesses.

> I think that dealing with this specific problem is a series all on its
> own and treating it on its own in isolation would be best.

The scan granularity issue is indeed a separate issue; I'm just really
suspicious of the assumption you make to attempt working around it in
this patch, because of the risk of moving away from optimal placement
prematurely.

The null hypothesis is that there is no connection between accesses
being shared or private and their actual frequency.  I would be
interested if this assumption has had a positive effect in your
testing or if this is based on the theoretical cases you mentioned in
the changelog, i.e. why you chose to make the bigger assumption.
-ENODATA ;-)

I think answering this question is precisely the scope of this patch.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-04 19:36       ` Johannes Weiner
@ 2013-07-05  9:41         ` Mel Gorman
  2013-07-05 10:48         ` Peter Zijlstra
  1 sibling, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-05  9:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 03:36:38PM -0400, Johannes Weiner wrote:
> On Thu, Jul 04, 2013 at 10:23:56AM +0100, Mel Gorman wrote:
> > On Wed, Jul 03, 2013 at 05:56:54PM -0400, Johannes Weiner wrote:
> > > On Wed, Jul 03, 2013 at 03:21:34PM +0100, Mel Gorman wrote:
> > > > Ideally it would be possible to distinguish between NUMA hinting faults
> > > > that are private to a task and those that are shared. This would require
> > > > that the last task that accessed a page for a hinting fault would be
> > > > recorded which would increase the size of struct page. Instead this patch
> > > > approximates private pages by assuming that faults that pass the two-stage
> > > > filter are private pages and all others are shared. The preferred NUMA
> > > > node is then selected based on where the maximum number of approximately
> > > > private faults were measured. Shared faults are not taken into
> > > > consideration for a few reasons.
> > > 
> > > Ingo had a patch that would just encode a few bits of the PID along
> > > with the last_nid (last_cpu in his case) member of struct page.  No
> > > extra space required and should be accurate enough.
> > > 
> > 
> > Yes, I'm aware of it. I noted in the changelog that ideally we'd record
> > the task both to remind myself and so that the patch that introduces it
> > could refer to this changelog so there is some sort of logical progression
> > for reviewers.
> > 
> > I was not keen on the use of last_cpu because I felt there was an implicit
> > assumption that scanning would always be fast enough to record hinting
> > faults before a task got moved to another CPU for any reason. I feared this
> > would be worse as memory and task sizes increased. That's why I stayed
> > with tracking the nid for the two-stage filter until it could be proven
> > it was insufficient for some reason.
> > 
> > The lack of anything resembling pid tracking now is that the series is
> > already a bit of a mouthful and I thought the other parts were more
> > important for now.
> 
> Fair enough.
> 

I prototyped a node/pid tracker where the node is used for misplacement
detection and the pid is used for private/shared detection. Tests are
running. I'll include it in the next release if it works out.
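
For reference, the encoding itself is nothing fancy. Something along
these lines, although treat it as an illustrative sketch only; the
helper names and the number of pid bits are placeholders rather than
what is in the test queue:

#define NIDPID_PID_BITS		8
#define NIDPID_PID_MASK		((1 << NIDPID_PID_BITS) - 1)

/* Pack the node and a few low-order pid bits into the existing field */
static inline int nidpid_encode(int nid, int pid)
{
	return (nid << NIDPID_PID_BITS) | (pid & NIDPID_PID_MASK);
}

static inline int nidpid_to_nid(int nidpid)
{
	return nidpid >> NIDPID_PID_BITS;
}

/* Only good for an approximate "same task as last time?" check */
static inline int nidpid_to_pid(int nidpid)
{
	return nidpid & NIDPID_PID_MASK;
}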

> > > Otherwise this is blind to sharedness within the node the task is
> > > currently running on, right?
> > > 
> > 
> > Yes, it is.
> > 
> > > > First, if there are many tasks sharing the page then they'll all move
> > > > towards the same node. The node will be compute overloaded and then
> > > > scheduled away later only to bounce back again. Alternatively the shared
> > > > tasks would just bounce around nodes because the fault information is
> > > > effectively noise. Either way accounting for shared faults the same as
> > > > private faults may result in lower performance overall.
> > > 
> > > When the node with many shared pages is compute overloaded then there
> > > is arguably not an optimal node for the tasks and moving them off is
> > > inevitable. 
> > 
> > Yes. If such an event occurs then the ideal is that the task interleaves
> > between a subset of nodes. The situation could be partially detected by
> > tracking if the number of historical faults is approximately larger than
> > the preferred node and then interleave between the top N nodes most faulted
> > nodes until the working set fits. Starting the interleave should just be
> > a matter of coding. The difficulty is correctly backing off that if there
> > is a phase change.
> 
> Agreed, optimizing second-best placement can be dealt with later.  I'm
> worried about optimal placement, though.
> 
> And I'm worried about skewing memory access statistics in order to
> steer future situations the CPU load balancer should handle instead.
> 

And I'm worried about shared accesses drowning out any useful data. Ok,
at the very least I can test this:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1647a87..26b57b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4092,7 +4092,8 @@ static bool migrate_locality_prepare(struct task_struct *p, struct lb_env *env,
 /* Returns true if the destination node has incurred more faults */
 static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 {
-	int src_nid, dst_nid;
+	int src_nid, dst_nid, priv;
+	unsigned long src_faults = 0, dst_faults = 0;
 
 	if (!migrate_locality_prepare(p, env, &src_nid, &dst_nid))
 		return false;
@@ -4101,12 +4102,13 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 	if (p->numa_preferred_nid == dst_nid)
 		return true;
 
-	/*
-	 * Move towards node if there were a higher number of private
-	 * NUMA hinting faults recorded on it
-	 */
-	if (p->numa_faults[task_faults_idx(dst_nid, 1)] >
-	    p->numa_faults[task_faults_idx(src_nid, 1)])
+	/* Move towards node if there were more NUMA hinting faults recorded */
+	for (priv = 0; priv < 2; priv++) {
+		src_faults += p->numa_faults[task_faults_idx(src_nid, priv)];
+		dst_faults += p->numa_faults[task_faults_idx(dst_nid, priv)];
+	}
+
+	if (dst_faults > src_faults)
 		return true;
 
 	return false;

It still selects the preferred node based on private accesses, but when
considering a migration to any other node it takes the total number of
faults into account.

> > > However, the node with the most page accesses, private or
> > > shared, is still the preferred node from a memory stand point.
> > > Compute load being equal, the task should go to the node with 2GB of
> > > shared memory and not to the one with 2 private pages.
> > > 
> > 
> > Agreed. The level of shared vs private needs to be detected. The problem
> > here is that detecting private dominated workloads is not straight-forward,
> > particularly as the scan rate slows as we've already discussed.
> 
> I was going for the opposite conclusion: that it does not matter
> whether memory is accessed privately or in a shared fashion, because
> there is no obvious connection to its access frequency, not to me at
> least. 

We cannot accurately detect access frequency because we are limited to
receiving the hinting faults.

> Short of accurate access frequency sampling and supportive
> data, the node with most accesses in general should be the preferred
> node, not the one with the most private accesses because it's the
> smaller assumption to make.
> 

Workloads with many shared accesses (numa01 on > 2 node machines) will
bounce all over the place.

> > > > The second reason is based on a hypothetical workload that has a small
> > > > number of very important, heavily accessed private pages but a large shared
> > > > array. The shared array would dominate the number of faults and be selected
> > > > as a preferred node even though it's the wrong decision.
> > > 
> > > That's a scan granularity problem and I can't see how you solve it
> > > with ignoring the shared pages. 
> > 
> > I acknowledge it's a problem and basically I'm making a big assumption
> > that private-dominated workloads are going to be the common case. Threaded
> > application on UMA with heavy amounts of shared data (within cache lines)
> > already suck in terms of performance so I'm expecting programmers already
> > try and avoid this sort of sharing. Obviously we are at a page granularity
> > here so the assumption will depend entirely on alignments and buffer sizes
> > so it might still fall apart.
> 
> Don't basically all VM-based multithreaded programs have this usage
> pattern?  The whole runtime (text, heap) is shared between threads.

There are allocators with thread-local storage, and large allocations in
glibc are satisfied using mmap, which reduces the opportunity for false
sharing due to unaligned buffers.
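
As a quick userspace illustration of the glibc point (this assumes glibc
malloc and its default mmap threshold of 128K; it has nothing to do with
the kernel side of the series):

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/*
	 * Allocations above the mmap threshold get their own page-aligned
	 * mmap()ed region, so large buffers from different threads never
	 * land in the same page. Small allocations may still share pages
	 * within an arena.
	 */
	mallopt(M_MMAP_THRESHOLD, 128 * 1024);

	void *small = malloc(4096);
	void *large = malloc(1 << 20);

	printf("small=%p large=%p\n", small, large);

	free(large);
	free(small);
	return 0;
}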

> If some thread-local memory spills over to another node, should the
> scheduler move this thread off node from a memory standpoint?

If thread-local memory spills over to another node, there will be attempts
to migrate it towards where the task is running. If those migrations fail,
then what you'd expect is that faults would accumulate on the remote node
and eventually it becomes the preferred node, but that is not what currently
happens. The tracking in the failed migration case needs revisiting.

> I don't
> think so at all.  I would expect it to always gravitate back towards
> this node with the VM on it, only get moved off for CPU load reasons,
> and get moved back as soon as the load situation permits.
> 
> Meanwhile, if there is little shared memory and private memory spills
> over to other nodes, you still don't know which node's memory is more
> frequently used beyond scan period granularity. 

That's true whether we account private/shared separately or not.

> Ignoring shared
> memory accesses would not actually help finding the right node in this
> case.  You might even make it worse: since private accesses are all
> treated equally, you assume equal access frequency among them.  Shared
> accesses could now be the distinguishing data point to tell which node
> sees more memory accesses.
> 

Similar points can be made the other way as well. If the shared/private
distinction is ignored then infrequent accesses can dominate continual
accesses due to the limitations of the scan-based sampling.

> > I think that dealing with this specific problem is a series all on its
> > own and treating it on its own in isolation would be best.
> 
> The scan granularity issue is indeed a separate issue, I'm just really
> suspicious of the assumption you make to attempt working around it in
> this patch, because of the risk of moving away from optimal placement
> prematurely.
> 

And the crux of the matter is whether the optimal placement is hurt by
prioritising private accesses or not. Initially I had been testing for
this but the last data I have on this is now relatively old. It's simple
enough to test by patching to make priv = 0 in task_numa_fault.
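
Roughly the following, although the exact placement depends on how priv
is derived from the two-stage filter inside task_numa_fault() and the
accounting line is written from memory, so treat it as a sketch of the
test hack rather than a patch against any particular tree:

@@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
+	/* Test hack: account every hinting fault as shared */
+	priv = 0;
+
 	p->numa_faults[task_faults_idx(node, priv)] += pages;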

> The null hypothesis is that there is no connection between accesses
> being shared or private and their actual frequency.  I would be
> interested if this assumption has had a positive effect in your
> testing or if this is based on the theoretical cases you mentioned in
> the changelog, i.e. why you chose to make the bigger assumption.
> -ENODATA ;-)
> 

It was tested for in v1 but I did not publish the relevant data or every
patch series would be a paper. When I was testing for it, there was a
benefit to the patch, but the series has changed too much since then to
state with certainty that it's still correct.

I'll see if I can rework the series so that the private/shared split
happens later, making it easier to measure whether it benefits in
isolation.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 11/13] sched: Check current->mm before allocating NUMA faults
  2013-07-04 12:48   ` Srikar Dronamraju
@ 2013-07-05 10:07     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2013-07-05 10:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 06:18:23PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@suse.de> [2013-07-03 15:21:38]:
> 
> > task_numa_placement checks current->mm but after buffers for faults
> > have already been uselessly allocated. Move the check earlier.
> > 
> > [peterz@infradead.org: Identified the problem]
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  kernel/sched/fair.c | 22 ++++++++++++++--------
> >  1 file changed, 14 insertions(+), 8 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 336074f..3c796b0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -870,8 +870,6 @@ static void task_numa_placement(struct task_struct *p)
> >  	int seq, nid, max_nid = 0;
> >  	unsigned long max_faults = 0;
> > 
> > -	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
> > -		return;
> >  	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
> >  	if (p->numa_scan_seq == seq)
> >  		return;
> > @@ -945,6 +943,12 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
> >  	if (!sched_feat_numa(NUMA))
> >  		return;
> > 
> > +	/* for example, ksmd faulting in a user's mm */
> > +	if (!p->mm) {
> > +		p->numa_scan_period = sysctl_numa_balancing_scan_period_max;
> 
> Naive question:
> Why are we resetting the scan_period?
> 

At the time I wrote it I was thinking of tick times and meant to recheck
whether it was necessary, but then it slipped my mind. The reset is
unnecessary as curr->mm is already checked.
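
In other words, that hunk should reduce to just the early return. A
sketch of what I expect the next version to carry:

	/* for example, ksmd faulting in a user's mm */
	if (!p->mm)
		return;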

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC WIP] Process weights based scheduling for better consolidation
  2013-07-04 18:02 ` [PATCH RFC WIP] Process weights based scheduling for better consolidation Srikar Dronamraju
@ 2013-07-05 10:16   ` Peter Zijlstra
  2013-07-05 12:49     ` Srikar Dronamraju
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-05 10:16 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 11:32:27PM +0530, Srikar Dronamraju wrote:
> Here is an approach to look at numa balanced scheduling from a non numa fault
> angle. This approach uses process weights instead of faults as a basis to
> move or bring tasks together.

That doesn't make any sense..... how would weight be related to numa
placement?

What it appears to do is simply group tasks based on ->mm. And by
keeping them somewhat sticky to the same node it gets locality.

What about multi-process shared memory workloads? It's one of the things
I disliked about autonuma. It completely disregards the multi-process
scenario.

If you want to go without faults; you also won't migrate memory along
and if you just happen to place your workload elsewhere you've no idea
where your memory is. If you have the faults, you might as well account
them to get a notion of where the memory is at; it's nearly free at that
point anyway.

Load spikes/fluctuations can easily lead to transient task movement to
keep balance. If these movements are indeed transient you want to return
to where you came from; however if they are not.. you want the memory to
come to you.

> +static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
> +{
> +	struct rq *rq = rq_of(cfs_rq);
> +	unsigned long task_load = 0;
> +	int curnode = cpu_to_node(cpu_of(rq));
> +#ifdef CONFIG_SCHED_AUTOGROUP
> +	struct sched_entity *se;
> +
> +	se = cfs_rq->tg->se[cpu_of(rq)];
> +	if (!se)
> +		return;
> +
> +	if (cfs_rq->load.weight) {
> +		task_load =  p->se.load.weight * se->load.weight;
> +		task_load /= cfs_rq->load.weight;
> +	} else {
> +		task_load = 0;
> +	}
> +#else
> +	task_load = p->se.load.weight;
> +#endif

This looks broken; didn't you want to use task_h_load() here? There's
nothing autogroup specific about task_load. If anything you want to do
full cgroup which I think reduces to task_h_load() here.
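
To illustrate, something along these lines is what I had in mind -- only
a sketch on top of your posted code, with the rest of the bookkeeping
kept as-is:

static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
{
	struct rq *rq = rq_of(cfs_rq);
	int curnode = cpu_to_node(cpu_of(rq));
	/*
	 * task_h_load() already folds in the hierarchical group weights,
	 * so no autogroup (or any other cgroup) special case is needed.
	 */
	unsigned long task_load = task_h_load(p);

	p->task_load = 0;
	if (!task_load)
		return;

	if (p->mm && p->mm->numa_weights) {
		p->mm->numa_weights[curnode] += task_load;
		p->mm->numa_weights[nr_node_ids] += task_load;
	}

	if (p->nr_cpus_allowed != num_online_cpus())
		rq->pinned_load += task_load;
	p->task_load = task_load;
}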

> +	p->task_load = 0;
> +	if (!task_load)
> +		return;
> +
> +	if (p->mm && p->mm->numa_weights) {
> +		p->mm->numa_weights[curnode] += task_load;
> +		p->mm->numa_weights[nr_node_ids] += task_load;
> +	}
> +
> +	if (p->nr_cpus_allowed != num_online_cpus())
> +		rq->pinned_load += task_load;
> +	p->task_load = task_load;
> +}
> +

> @@ -5529,6 +5769,76 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
>  		if (!balance)
>  			break;
>  	}
> +#ifdef CONFIG_NUMA_BALANCING
> +	if (!rq->nr_running) {

This would only work for under utilized systems...

> +	}
> +#endif

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
  2013-07-04 19:36       ` Johannes Weiner
  2013-07-05  9:41         ` Mel Gorman
@ 2013-07-05 10:48         ` Peter Zijlstra
  1 sibling, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2013-07-05 10:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Linux-MM, LKML

On Thu, Jul 04, 2013 at 03:36:38PM -0400, Johannes Weiner wrote:

> I was going for the opposite conclusion: that it does not matter
> whether memory is accessed privately or in a shared fashion, because
> there is no obvious connection to its access frequency, not to me at
> least.  

There is a relation to access freq; however due to the low sample rate
(once every 100ms or so) we obviously miss all high freq data there.

> > I acknowledge it's a problem and basically I'm making a big assumption
> > that private-dominated workloads are going to be the common case. Threaded
> > application on UMA with heavy amounts of shared data (within cache lines)
> > already suck in terms of performance so I'm expecting programmers already
> > try and avoid this sort of sharing. Obviously we are at a page granularity
> > here so the assumption will depend entirely on alignments and buffer sizes
> > so it might still fall apart.
> 
> Don't basically all VM-based multithreaded programs have this usage
> pattern?  The whole runtime (text, heap) is shared between threads.
> If some thread-local memory spills over to another node, should the
> scheduler move this thread off node from a memory standpoint?  I don't
> think so at all.  I would expect it to always gravitate back towards
> this node with the VM on it, only get moved off for CPU load reasons,
> and get moved back as soon as the load situation permits.

All data being allocated on the same heap and being shared in the access
sense doesn't imply all threads will indeed use all data; even if TLS is
not used.

For a concurrent program to reach any useful level of concurrency gain
you need data partitioning. Threads must work on different data sets
otherwise they'd constantly be waiting on serialization -- which makes
your concurrency gain tank.

There are two main issues here:

Firstly, the question is whether there is much false sharing at page
granularity. Typically you want the compute time per data fragment to be
significantly higher than the demux + mux overhead, which favours larger
data units.

Secondly, you want your scan period to be at most half the compute time
per data fragment (with the current ~100ms scan period that means at
least ~200ms of compute per fragment). Otherwise you run the risk of
never seeing the data as local to that thread.

So for optimal benefit you want to minimize sharing pages between data
fragments and have your data fragment compute time as long as possible.
Luckily both are also goals for maximizing concurrency gain so we should
be good there.

This should cover all 'traditional' concurrent stuff; most of the 'new'
concurrency stuff can be different though -- some of it was simply never
thought out or designed for concurrency and just hopes it works. Others,
most notably the multi-core concurrency stuff, assume the demux+mux costs
are _very_ low and therefore the data fragment and associated compute
time shrink to useless levels :/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC WIP] Process weights based scheduling for better consolidation
  2013-07-05 10:16   ` Peter Zijlstra
@ 2013-07-05 12:49     ` Srikar Dronamraju
  0 siblings, 0 replies; 43+ messages in thread
From: Srikar Dronamraju @ 2013-07-05 12:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML

* Peter Zijlstra <peterz@infradead.org> [2013-07-05 12:16:54]:

> On Thu, Jul 04, 2013 at 11:32:27PM +0530, Srikar Dronamraju wrote:
> > Here is an approach to look at numa balanced scheduling from a non numa fault
> > angle. This approach uses process weights instead of faults as a basis to
> > move or bring tasks together.
> 
> That doesn't make any sense..... how would weight be related to numa
> placement?
> 

Since it groups tasks to a node, it makes sure that all the memory moves
to that node (courtesy of the existing numa balancing in the kernel). So
we have both the tasks and the memory on the same node.

> What it appears to do is simply group tasks based on ->mm. And by
> keeping them somewhat sticky to the same node it gets locality.
> 

Yes, that's the key thing it tries to achieve.

> What about multi-process shared memory workloads? It's one of the things
> I disliked about autonuma. It completely disregards the multi-process
> scenario.
> 

Yes, this approach doesn't work that well with multi-process shared
memory workloads. However, Mel's current proposal also disregards
shared pages for the preferred_node logic. Further, if we consider
multiple processes sharing memory, then they would probably be sharing
more memory within themselves, and that is one of the observations that
Mel made in defense of accounting private faults.

Also, processes that share data within themselves are probably far more
common than processes that share data with other processes.
Shouldn't we be optimizing for the majority case first?

With my suggested approach, it would be a problem if two processes share
data and are so big that they cannot fit on the same node.

I think numa faults should be part of scheduling and should solve these
cases, but it might/should kick in later. Do you agree that the case
where tasks share data within themselves is the more important problem
to solve now? (I too had the code for numa faults, but I thought we
needed to get this in first, so I moved it out. And I am happy that Mel
is taking care of that approach.)

> If you want to go without faults; you also won't migrate memory along
> and if you just happen to place your workload elsewhere you've no idea

Why? The memory moves to the workload because of numa faults; I am not
disabling numa faults. So if all or a majority of the tasks move to that
node, the memory obviously should follow to that node, and I am seeing
that happen. Do you see a reason why it wouldn't move?

> where your memory is. If you have the faults, you might as well account
> them to get a notion of where the memory is at; it's nearly free at that
> point anyway.
> 

And I am not against numa fault based scheduling; for now I think the
primary step should be grouping tasks based on mm and then on
fault stats.

> Load spikes/fluctuations can easily lead to transient task movement to
> keep balance. If these movements are indeed transient you want to return
> to where you came from; however if they are not.. you want the memory to
> come to you.
> 

Yes, this should be achieved because during a load spike not all the load
runs on that node and not all tasks from this mm get moved out of the
node; hence the node weights should still be in similar proportions. In
fact, with the checks and iterations in can_migrate_task(), it is most
likely that the tasks that have a numa weightage get a preference to
stay on their node.

> > +static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
> > +{
> > +	struct rq *rq = rq_of(cfs_rq);
> > +	unsigned long task_load = 0;
> > +	int curnode = cpu_to_node(cpu_of(rq));
> > +#ifdef CONFIG_SCHED_AUTOGROUP
> > +	struct sched_entity *se;
> > +
> > +	se = cfs_rq->tg->se[cpu_of(rq)];
> > +	if (!se)
> > +		return;
> > +
> > +	if (cfs_rq->load.weight) {
> > +		task_load =  p->se.load.weight * se->load.weight;
> > +		task_load /= cfs_rq->load.weight;
> > +	} else {
> > +		task_load = 0;
> > +	}
> > +#else
> > +	task_load = p->se.load.weight;
> > +#endif
> 
> This looks broken; didn't you want to use task_h_load() here? There's
> nothing autogroup specific about task_load. If anything you want to do
> full cgroup which I think reduces to task_h_load() here.
> 

Yes, I realize.
I actually tried task_h_load(). In the autogroup case the load on the cpu
was showing 83, while task_h_load() returned 1024; the cgroup load was
2048 and the cgroup's se load was 12.

So I concluded that the cgroup's load contribution to the total load is
12 out of 83 and the proportion of this se was 6, i.e. 1024 * 12 / 2048
= 6. Hence the equation. I will retry.

There are probably half-a-dozen such crap in my code which I
still need to fix. Thanks for pointing this one out.

One other easy to locate issue is some sort of missing synchronization
in migrate_from_cpu/migrate_from_node.

> > +	p->task_load = 0;
> > +	if (!task_load)
> > +		return;
> > +
> > +	if (p->mm && p->mm->numa_weights) {
> > +		p->mm->numa_weights[curnode] += task_load;
> > +		p->mm->numa_weights[nr_node_ids] += task_load;
> > +	}
> > +
> > +	if (p->nr_cpus_allowed != num_online_cpus())
> > +		rq->pinned_load += task_load;
> > +	p->task_load = task_load;
> > +}
> > +
> 
> > @@ -5529,6 +5769,76 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
> >  		if (!balance)
> >  			break;
> >  	}
> > +#ifdef CONFIG_NUMA_BALANCING
> > +	if (!rq->nr_running) {
> 
> This would only work for under utilized systems...
> 

Why? Even on 2x or 4x loaded machines, I see rebalance_domain being
called with NEWLY_IDLE and failing to do any balancing. I made this
observation based on schedstats. So unless we see 0% idle time, this
code should kick in. Right?

Further, if the machine is loaded, the checks introduced by
preferred_node/force_migrate will be more than useful for moving tasks.
We would ideally need active balance on lightly loaded machines, because
there the tasks that we want to move are more likely to be active on the
cpus and hence the regular scheduler cannot do the right thing.

> > +	}
> > +#endif
> 

And finally, thanks for taking a look.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2013-07-05 12:49 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
2013-07-03 14:21 ` [PATCH 01/13] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
2013-07-03 14:21 ` [PATCH 02/13] sched: Track NUMA hinting faults on per-node basis Mel Gorman
2013-07-03 14:21 ` [PATCH 03/13] sched: Select a preferred node with the most numa hinting faults Mel Gorman
2013-07-03 14:21 ` [PATCH 04/13] sched: Update NUMA hinting faults once per scan Mel Gorman
2013-07-03 14:21 ` [PATCH 05/13] sched: Favour moving tasks towards the preferred node Mel Gorman
2013-07-03 14:21 ` [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
2013-07-04 12:26   ` Srikar Dronamraju
2013-07-04 13:29     ` Mel Gorman
2013-07-03 14:21 ` [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter Mel Gorman
2013-07-03 21:56   ` Johannes Weiner
2013-07-04  9:23     ` Mel Gorman
2013-07-04 14:24       ` Rik van Riel
2013-07-04 19:36       ` Johannes Weiner
2013-07-05  9:41         ` Mel Gorman
2013-07-05 10:48         ` Peter Zijlstra
2013-07-03 14:21 ` [PATCH 08/13] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
2013-07-03 14:21 ` [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults Mel Gorman
2013-07-03 18:27   ` Peter Zijlstra
2013-07-04  9:25     ` Mel Gorman
2013-07-03 14:21 ` [PATCH 10/13] sched: Set the scan rate proportional to the size of the task being scanned Mel Gorman
2013-07-03 14:21 ` [PATCH 11/13] sched: Check current->mm before allocating NUMA faults Mel Gorman
2013-07-03 15:33   ` Mel Gorman
2013-07-04 12:48   ` Srikar Dronamraju
2013-07-05 10:07     ` Mel Gorman
2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
2013-07-03 18:35   ` Peter Zijlstra
2013-07-04  9:27     ` Mel Gorman
2013-07-03 18:41   ` Peter Zijlstra
2013-07-04  9:32     ` Mel Gorman
2013-07-03 18:42   ` Peter Zijlstra
2013-07-03 14:21 ` [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
2013-07-03 18:32   ` Peter Zijlstra
2013-07-04  9:37     ` Mel Gorman
2013-07-04 13:07       ` Srikar Dronamraju
2013-07-04 13:54         ` Mel Gorman
2013-07-04 14:06           ` Peter Zijlstra
2013-07-04 14:40             ` Mel Gorman
2013-07-03 16:19 ` [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
2013-07-03 16:26   ` Mel Gorman
2013-07-04 18:02 ` [PATCH RFC WIP] Process weights based scheduling for better consolidation Srikar Dronamraju
2013-07-05 10:16   ` Peter Zijlstra
2013-07-05 12:49     ` Srikar Dronamraju
