* [PATCH 0/16] Basic scheduler support for automatic NUMA balancing V4
@ 2013-07-11  9:46 ` Mel Gorman
  0 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This continues to build on the previous feedback and further testing. Peter
posted a patch that avoids overloading a destination node relative to a
source node by postponing the reschedule of tasks on a preferred node. I
took the load calculations but dropped the balancing part as it performed
badly in local tests. It was evident that false sharing within THP pages
is a problem and I think solving it first would alleviate the overloading
problem. Shared accesses are still not properly used for selecting
preferred nodes due to the impact of false sharing within THP pages.

Changelog since V3
o Correct detection of unset last nid/pid information
o Dropped nr_preferred_running and replaced it with Peter's load balancing
o Pass in correct node information for THP hinting faults
o Pressure tasks sharing a THP page to move towards same node
o Do not set pmd_numa if false sharing is detected

Changelog since V2
o Reshuffle to match Peter's implied preference for layout
o Reshuffle to move private/shared split towards end of series to make it
  easier to evaluate the impact
o Use PID information to identify private accesses
o Set the floor for PTE scanning based on virtual address space scan rates
  instead of time
o Some locking improvements
o Do not preempt pinned tasks unless they are kernel threads

Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
  task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
  preferred node
o Laughably basic accounting of a compute overloaded node when selecting
  the preferred node.
o Applied review comments

This series integrates basic scheduler support for automatic NUMA balancing.
It borrows very heavily from Peter Zijlstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).

This is still far from complete and there are known performance gaps
between this series and manual binding (when that is possible). As before,
the intention is not to complete the work but to incrementally improve
mainline and preserve bisectability for any bug reports that crop up. In
some cases performance may unfortunately be worse and when that happens it
will have to be judged whether the system overhead is lower and, if so,
whether it is still an acceptable direction as a stepping stone to
something better.

Patch 1 adds sysctl documentation

Patch 2 tracks NUMA hinting faults per-task and per-node

Patch 3 corrects a THP NUMA hint fault accounting bug

Patch 4 avoids trying to migrate the THP zero page

Patches 5-7 select a preferred node at the end of a PTE scan based on which
	node incurred the highest number of NUMA faults. When the balancer
	is comparing two CPUs it will prefer to locate tasks on their
	preferred node.

Patch 8 reschedules a task when a preferred node is selected if it is not
	running on that node already. This avoids waiting for the scheduler
	to move the task slowly.

Patch 9 adds infrastructure to allow separate tracking of shared/private
	pages but treats all faults as if they are private accesses. Laying
	it out this way reduces churn later in the series when private
	fault detection is introduced

Patch 10 replaces the PTE scanning reset hammer and instead increases the
	scanning rate when an otherwise settled task changes its
	preferred node.

Patch 11 avoids some unnecessary allocation

Patch 12 sets the scan rate proportional to the size of the task being scanned.

Patches 13-14 kick away some training wheels and scan shared pages and small VMAs.

Patch 15 introduces private fault detection based on the PID of the faulting
	process and accounts for shared/private accesses differently (a rough
	sketch of the idea follows this list).

Patch 16 picks the least loaded CPU on the preferred node based on a scheduling
	domain common to both the source and destination NUMA nodes.
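
To make the private fault detection in Patch 15 more concrete, here is a
minimal sketch of the idea in plain C. It is illustrative only and not the
patch itself: the names, the 8-bit width and the "unset" marker are
assumptions for the example; in the series the last faulting pid is packed
into a few page flag bits alongside the last NUMA node.

#include <stdbool.h>

#define LAST_PID_BITS	8
#define LAST_PID_MASK	((1 << LAST_PID_BITS) - 1)
#define LAST_PID_UNSET	LAST_PID_MASK		/* no pid recorded yet */

struct page_info {
	unsigned int last_pid;	/* low bits of the pid of the last faulting task */
};

/*
 * Classify a NUMA hinting fault. The low bits of the faulting task's pid
 * are compared with the pid recorded on the previous fault; a match (or no
 * previous record) is treated as a private access, anything else as a
 * shared access and accounted differently when selecting a preferred node.
 */
static bool numa_fault_is_private(struct page_info *page, int pid)
{
	unsigned int last = page->last_pid;
	unsigned int cur = pid & LAST_PID_MASK;

	page->last_pid = cur;
	return last == LAST_PID_UNSET || last == cur;
}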

Testing on this is only partial as full tests take a long time to run. A
full specjbb for both single and multi takes over 4 hours. NPB D class
also takes a few hours. With all the kernels in question, it still takes
a weekend to churn through them all.

Kernel 3.9 is still the testing baseline. The following kernels were tested

o vanilla		vanilla kernel with automatic numa balancing enabled
o favorpref-v4   	Patches 1-11
o scanshared-v4   	Patches 1-14
o splitprivate-v4   	Patches 1-15
o accountload-v4   	Patches 1-16

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system. Only a limited number of clients are executed
to save on time.

specjbb
                        3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0
                      vanilla       favorpref-v4         scanshared-v4       splitprivate-v4        accountload-v4   
TPut 1      26099.00 (  0.00%)     24726.00 ( -5.26%)     23924.00 ( -8.33%)     24788.00 ( -5.02%)     23692.00 ( -9.22%)
TPut 7     187276.00 (  0.00%)    190315.00 (  1.62%)    189450.00 (  1.16%)    185294.00 ( -1.06%)    183639.00 ( -1.94%)
TPut 13    318028.00 (  0.00%)    340088.00 (  6.94%)    330785.00 (  4.01%)    334663.00 (  5.23%)    333818.00 (  4.96%)
TPut 19    368547.00 (  0.00%)    422009.00 ( 14.51%)    401622.00 (  8.97%)    448669.00 ( 21.74%)    447950.00 ( 21.54%)
TPut 25    377522.00 (  0.00%)    442038.00 ( 17.09%)    413670.00 (  9.58%)    499595.00 ( 32.34%)    506872.00 ( 34.26%)
TPut 31    347642.00 (  0.00%)    425809.00 ( 22.48%)    382499.00 ( 10.03%)    487862.00 ( 40.33%)    468347.00 ( 34.72%)
TPut 37    313439.00 (  0.00%)    402418.00 ( 28.39%)    350941.00 ( 11.96%)    467847.00 ( 49.26%)    437945.00 ( 39.72%)
TPut 43    291958.00 (  0.00%)    363120.00 ( 24.37%)    313203.00 (  7.28%)    422984.00 ( 44.88%)    384563.00 ( 31.72%)

First off, note what the shared/private split patch does. Once we start
scanning all pages there is a degradation in performance as the shared page
faults introduce noise into the statistics. All indications are that this
is due to false sharing within THP pages, which needs to be addressed.
Splitting the shared/private faults restores the performance and the key
task in the future is to use this shared/private information for maximum
benefit.
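
A rough illustration of why false sharing within a THP page is so damaging
to the statistics, assuming x86-64 with 4K base pages and 2M transparent
hugepages and that a hinting fault on a THP is weighted by the number of
base pages it covers:

    base pages accounted per THP hinting fault = 2M / 4K = 512

With that weighting, two threads on different nodes touching disjoint 4K
ranges of the same huge page still look like heavily shared accesses, and
each such fault moves the per-node statistics by hundreds of pages.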


specjbb Peaks
                                  3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0
                                vanilla          favorpref-v4         scanshared-v4       splitprivate-v4        accountload-v4   
 Actual Warehouse       26.00 (  0.00%)       26.00 (  0.00%)       26.00 (  0.00%)       26.00 (  0.00%)       26.00 (  0.00%)
 Actual Peak Bops   377522.00 (  0.00%)   442038.00 ( 17.09%)   413670.00 (  9.58%)   499595.00 ( 32.34%)   506872.00 ( 34.26%)

Peak performance is improved overall.


               3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
             vanilla  favorpref-v4   scanshared-v4   splitprivate-v4   accountload-v4   
User         5184.53     5177.92     5178.37     5177.24     5181.78
System         59.61       65.77       60.97       67.21       67.43
Elapsed       254.52      254.14      254.06      254.24      254.33

There is an increase in system CPU overhead that needs to be watched.

                                 3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
                               vanilla  favorpref-v4   scanshared-v4   splitprivate-v4   accountload-v4   
THP fault alloc                  33297       34710       35229       34480       33510
THP collapse alloc                   9           6          14          11          12
THP splits                           3           3           3           4           1
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success           1773768     1949772     1407218     4253043     4218882
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                   1841        2023        1460        4414        4379
NUMA PTE updates              17461135    18458997    14255329    15856615    16071944
NUMA hint faults                 85873      172654       80923       91043       90465
NUMA hint local faults           27145      119972       32219       36020       34847
NUMA hint local percent             31          69          39          39          38
NUMA pages migrated            1773768     1949772     1407218     4253043     4218882
AutoNUMA cost                      585        1029         531         647         644

It's interesting to note how much scanning shared pages affects the
percentage of local NUMA hinting faults. There is a lot more work to do
there. There are fewer PTE scan updates but a much larger number of pages
being migrated, which will need examination. Given the overall performance,
the focus will still be on false THP sharing.

Next is the autonuma benchmark results. These were only run once so I have no
idea what the variance is. Obviously they could be run multiple times but with
this number of kernels we would die of old age waiting on the results.

autonumabench
                                          3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0
                                        vanilla       favorpref-v4         scanshared-v4       splitprivate-v4        accountload-v4   
User    NUMA01               52623.86 (  0.00%)    49514.41 (  5.91%)    53783.60 ( -2.20%)    51205.78 (  2.69%)    53501.03 ( -1.67%)
User    NUMA01_THEADLOCAL    17595.48 (  0.00%)    17620.51 ( -0.14%)    19734.74 (-12.16%)    16966.63 (  3.57%)    17113.31 (  2.74%)
User    NUMA02                2043.84 (  0.00%)     1993.04 (  2.49%)     2051.29 ( -0.36%)     1901.96 (  6.94%)     2035.80 (  0.39%)
User    NUMA02_SMT            1057.11 (  0.00%)     1005.61 (  4.87%)      980.19 (  7.28%)      977.65 (  7.52%)      972.60 (  7.99%)
System  NUMA01                 414.17 (  0.00%)      222.86 ( 46.19%)      145.79 ( 64.80%)      321.93 ( 22.27%)      344.93 ( 16.72%)
System  NUMA01_THEADLOCAL      105.17 (  0.00%)      102.35 (  2.68%)      117.22 (-11.46%)      105.35 ( -0.17%)      102.54 (  2.50%)
System  NUMA02                   9.36 (  0.00%)        9.96 ( -6.41%)       13.02 (-39.10%)        9.53 ( -1.82%)        6.73 ( 28.10%)
System  NUMA02_SMT               3.54 (  0.00%)        3.53 (  0.28%)        3.46 (  2.26%)        5.85 (-65.25%)        4.49 (-26.84%)
Elapsed NUMA01                1201.52 (  0.00%)     1143.59 (  4.82%)     1244.61 ( -3.59%)     1182.92 (  1.55%)     1208.74 ( -0.60%)
Elapsed NUMA01_THEADLOCAL      393.91 (  0.00%)      392.49 (  0.36%)      442.04 (-12.22%)      385.61 (  2.11%)      386.43 (  1.90%)
Elapsed NUMA02                  50.30 (  0.00%)       50.36 ( -0.12%)       49.53 (  1.53%)       48.91 (  2.76%)       49.23 (  2.13%)
Elapsed NUMA02_SMT              58.48 (  0.00%)       47.79 ( 18.28%)       51.56 ( 11.83%)       55.98 (  4.27%)       56.34 (  3.66%)
CPU     NUMA01                4414.00 (  0.00%)     4349.00 (  1.47%)     4333.00 (  1.84%)     4355.00 (  1.34%)     4454.00 ( -0.91%)
CPU     NUMA01_THEADLOCAL     4493.00 (  0.00%)     4515.00 ( -0.49%)     4490.00 (  0.07%)     4427.00 (  1.47%)     4455.00 (  0.85%)
CPU     NUMA02                4081.00 (  0.00%)     3977.00 (  2.55%)     4167.00 ( -2.11%)     3908.00 (  4.24%)     4148.00 ( -1.64%)
CPU     NUMA02_SMT            1813.00 (  0.00%)     2111.00 (-16.44%)     1907.00 ( -5.18%)     1756.00 (  3.14%)     1734.00 (  4.36%)

numa01 saw no major performance benefit with a mix of gains and losses
throughout the series for its system CPU usage. It is an adverse workload
for this machine so right now I'm not overly concerned with improving its
performance.

numa01_threadlocal saw a very small performance gain overall although
it is interesting to note that scanning shared pages hurt it badly. Again
I predict that better shared page detection will help here.

numa02 showed a small improvement but it should also be already running
close to as quickly as possible.

numa02_smt also shows a small improvement although again scanning shared
pages hurt and would benefit from improved handling there.

                                 3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
                               vanilla  favorpref-v4   scanshared-v4   splitprivate-v4   accountload-v4   
THP fault alloc                  14325       11724       14906       13553       14403
THP collapse alloc                   6           3           7          13          10
THP splits                           4           1           4           2           2
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success           9020528     9708110     6677767     6773951     6170746
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                   9363       10077        6931        7031        6405
NUMA PTE updates             119292401   114641446    85954812    74337906    75911999
NUMA hint faults                755901      499186      287825      237095      232126
NUMA hint local faults          595478      333483      152899      122210      128762
NUMA hint local percent             78          66          53          51          55
NUMA pages migrated            9020528     9708110     6677767     6773951     6170746
AutoNUMA cost                     4785        3482        2167        1834        1809

As all the tests are mashed together it is not possible to make specific
conclusions on each testcase. However, in general the series is doing a lot
less work with PTE updates, faults and so on. The percentage of local faults
suffers but a large part of this seems to be around where shared pages are
getting scanned.

I also ran SpecJBB with THP enabled and one JVM running per NUMA node in
the system.

specjbb
                          3.9.0                 3.9.0                 3.9.0                 3.9.0                 3.9.0
                        vanilla       favorpref-v4         scanshared-v4       splitprivate-v4        accountload-v4   
Mean   1      30640.75 (  0.00%)     31222.25 (  1.90%)     31275.50 (  2.07%)     30554.00 ( -0.28%)     30348.75 ( -0.95%)
Mean   10    136983.25 (  0.00%)    133072.00 ( -2.86%)    140022.00 (  2.22%)    119168.25 (-13.01%)    140998.00 (  2.93%)
Mean   19    124005.25 (  0.00%)    121016.25 ( -2.41%)    122189.00 ( -1.46%)    111813.75 ( -9.83%)    129100.75 (  4.11%)
Mean   28    114672.00 (  0.00%)    111643.00 ( -2.64%)    109175.75 ( -4.79%)    101199.50 (-11.75%)    116026.50 (  1.18%)
Mean   37    110916.50 (  0.00%)    105791.75 ( -4.62%)    103103.75 ( -7.04%)    100187.00 ( -9.67%)    108801.00 ( -1.91%)
Mean   46    110139.25 (  0.00%)    105383.25 ( -4.32%)     99454.75 ( -9.70%)     99762.00 ( -9.42%)    104239.25 ( -5.36%)
Stddev 1       1002.06 (  0.00%)      1125.30 (-12.30%)       959.60 (  4.24%)       960.28 (  4.17%)      1014.89 ( -1.28%)
Stddev 10      4656.47 (  0.00%)      6679.25 (-43.44%)      5946.78 (-27.71%)     10427.37 (-123.93%)      4039.93 ( 13.24%)
Stddev 19      2578.12 (  0.00%)      5261.94 (-104.10%)     3414.66 (-32.45%)      5070.00 (-96.65%)      1849.10 ( 28.28%)
Stddev 28      4123.69 (  0.00%)      4156.17 ( -0.79%)      6666.32 (-61.66%)      3899.89 (  5.43%)      3081.40 ( 25.28%)
Stddev 37      2301.94 (  0.00%)      5225.48 (-127.00%)     5444.18 (-136.50%)      3490.87 (-51.65%)      1795.72 ( 21.99%)
Stddev 46      8317.91 (  0.00%)      6759.04 ( 18.74%)      6587.32 ( 20.81%)      4458.49 ( 46.40%)      7387.32 ( 11.19%)
TPut   1     122563.00 (  0.00%)    124889.00 (  1.90%)    125102.00 (  2.07%)    122216.00 ( -0.28%)    121395.00 ( -0.95%)
TPut   10    547933.00 (  0.00%)    532288.00 ( -2.86%)    560088.00 (  2.22%)    476673.00 (-13.01%)    563992.00 (  2.93%)
TPut   19    496021.00 (  0.00%)    484065.00 ( -2.41%)    488756.00 ( -1.46%)    447255.00 ( -9.83%)    516403.00 (  4.11%)
TPut   28    458688.00 (  0.00%)    446572.00 ( -2.64%)    436703.00 ( -4.79%)    404798.00 (-11.75%)    464106.00 (  1.18%)
TPut   37    443666.00 (  0.00%)    423167.00 ( -4.62%)    412415.00 ( -7.04%)    400748.00 ( -9.67%)    435204.00 ( -1.91%)
TPut   46    440557.00 (  0.00%)    421533.00 ( -4.32%)    397819.00 ( -9.70%)    399048.00 ( -9.42%)    416957.00 ( -5.36%)

Performance here is more or less flat although it's interesting to note
how much scanning shared pages affects the performance differences between
the JVMs. Overall the series performance is more or less unchanged with
some improvements in variability. This should also benefit from false
sharing detection but it would also benefit if there was proper detection
of related tasks that share pages.

specjbb Peaks
                                3.9.0               3.9.0              3.9.0               3.9.0               3.9.0
                              vanilla        favorpref-v4      scanshared-v4     splitprivate-v4      accountload-v4   
 Actual Warehouse     11.00 (  0.00%)     11.00 (  0.00%)    11.00 (  0.00%)     11.00 (  0.00%)     11.00 (  0.00%)
 Actual Peak Bops 547933.00 (  0.00%) 532288.00 ( -2.86%) 560088.00 (  2.22%) 476673.00 (-13.01%) 563992.00 (  2.93%)

Accounting for load recovers the loss from splitting private/shared. Again,
proper false sharing detection is required.

               3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
             vanilla  favorpref-v4   scanshared-v4   splitprivate-v4   accountload-v4   
User        52899.04    53106.74    53245.67    52828.25    53162.02
System        250.42      254.20      203.97      222.28      230.85
Elapsed      1199.72     1208.35     1206.14     1197.28     1207.10

Small reduction in system CPU overhead.

                                 3.9.0       3.9.0       3.9.0       3.9.0       3.9.0
                               vanilla  favorpref-v4   scanshared-v4   splitprivate-v4   accountload-v4   
THP fault alloc                  65188       66217       68158       63283       65531
THP collapse alloc                  97         172          91         108         135
THP splits                          38          37          36          34          41
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success          14583860    14559261     7770770    10131560    10932731
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                  15138       15112        8066       10516       11348
NUMA PTE updates             128327468   129131539    74033679    72954561    72832728
NUMA hint faults               2103190     1712971     1488709     1362365     1292772
NUMA hint local faults          734136      640363      405816      471928      403028
NUMA hint local percent             34          37          27          34          31
NUMA pages migrated           14583860    14559261     7770770    10131560    10932731
AutoNUMA cost                    11691        9745        8109        7515        7181

Fewer PTE updates but the percentage of local hinting faults clearly
needs improvement.

Overall the series performs well even though the gaps are still evident.
This is likely to be my last update to this series for a while but I'd
like to see this treated as a standalone with a separate series focusing on
false sharing detection and reduction, shared accesses used for selecting
preferred nodes, shared accesses used for load balancing and reintroducing
Peter's patch that balances compute nodes relative to each other. This is
to keep each series a manageable size for review even if it's obvious that
more work is required.

 Documentation/sysctl/kernel.txt   |  68 ++++++++
 include/linux/migrate.h           |   7 +-
 include/linux/mm.h                |  69 +++++---
 include/linux/mm_types.h          |   7 +-
 include/linux/page-flags-layout.h |  28 ++--
 include/linux/sched.h             |  23 ++-
 include/linux/sched/sysctl.h      |   1 -
 kernel/sched/core.c               |  26 ++-
 kernel/sched/fair.c               | 321 +++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h              |  12 ++
 kernel/sysctl.c                   |  14 +-
 mm/huge_memory.c                  |  26 ++-
 mm/memory.c                       |  27 ++--
 mm/mempolicy.c                    |   8 +-
 mm/migrate.c                      |  21 +--
 mm/mm_init.c                      |  18 +--
 mm/mmzone.c                       |  12 +-
 mm/mprotect.c                     |  28 ++--
 mm/page_alloc.c                   |   4 +-
 19 files changed, 568 insertions(+), 152 deletions(-)

-- 
1.8.1.4


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 01/16] mm: numa: Document automatic NUMA balancing sysctls
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccd4258..0fe678c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,72 @@ utilize.
 
 ==============================================================
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running.  Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases.  The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases.  The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a tasks scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
 osrelease, ostype & version:
 
 # cat osrelease
-- 
1.8.1.4
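
As a worked example of the scan delay/size relationship described in the
documentation above (the numbers are purely illustrative and not a claim
about the defaults set by this series):

    effective maximum scan rate = scan size / scan delay
                                = 256MB / 1000ms
                                = 256MB of address space per second per task

Halving the delay to 500ms doubles that upper bound to 512MB/s, which is
why the scan period sysctls are the main knob for trading memory locality
against system CPU overhead.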


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 02/16] sched: Track NUMA hinting faults on per-node basis
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch tracks which nodes NUMA hinting faults were incurred on. Greater
weight is given if the pages were to be migrated on the understanding
that such faults cost significantly more. If a task has paid the cost of
migrating data to that node then in future it would be preferred if the
task did not migrate the data again unnecessarily. This information is later
used to schedule a task on the node incurring the most NUMA hinting faults.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 12 +++++++++++-
 kernel/sched/sched.h  | 11 +++++++++++
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..72861b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1505,6 +1505,8 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+
+	unsigned long *numa_faults;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..f332ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
+	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
@@ -1853,6 +1854,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		task_numa_free(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..904fd6f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (!sched_feat_numa(NUMA))
 		return;
 
-	/* FIXME: Allocate task-specific structure for placement policy here */
+	/* Allocate buffer to track faults on a per-node basis */
+	if (unlikely(!p->numa_faults)) {
+		int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		if (!p->numa_faults)
+			return;
+	}
 
 	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
@@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
 			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
+
+	/* Record the fault, double the weight if pages were migrated */
+	p->numa_faults[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..c5f773d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -503,6 +503,17 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline void task_numa_free(struct task_struct *p)
+{
+	kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread
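
As an aside on the weighting above: pages << migrated counts a fault that
left the page in place with weight pages and a fault that triggered a
migration with weight 2 * pages. A minimal userspace sketch of that
accounting, with a made-up node count and fault stream:

/*
 * Sketch of the per-node accounting introduced by this patch: faults that
 * caused a migration are counted with double weight (pages << migrated).
 */
#include <stdio.h>
#include <stdbool.h>

#define NR_NODES 4	/* hypothetical machine */

static unsigned long numa_faults[NR_NODES];

static void record_fault(int node, int pages, bool migrated)
{
	numa_faults[node] += (unsigned long)pages << migrated;
}

int main(void)
{
	record_fault(0, 1, false);	/* hint fault, page left in place: weight 1 */
	record_fault(1, 1, true);	/* fault that migrated a page: weight 2 */
	record_fault(1, 512, true);	/* THP-sized fault that migrated: weight 1024 */

	for (int n = 0; n < NR_NODES; n++)
		printf("node %d: %lu\n", n, numa_faults[n]);
	return 0;
}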

* [PATCH 03/16] mm: numa: Account for THP numa hinting faults on the correct node
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

THP NUMA hinting faults on pages that are not migrated are being
accounted for incorrectly. Currently the fault is counted as if the
task was running on a node local to the page, which is not necessarily
true.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..e4a79fa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1293,7 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int target_nid;
-	int current_nid = -1;
+	int src_nid = -1;
 	bool migrated;
 
 	spin_lock(&mm->page_table_lock);
@@ -1302,9 +1302,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(pmd);
 	get_page(page);
-	current_nid = page_to_nid(page);
+	src_nid = numa_node_id();
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (src_nid == page_to_nid(page))
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	target_nid = mpol_misplaced(page, vma, haddr);
@@ -1346,8 +1346,8 @@ clear_pmdnuma:
 	update_mmu_cache_pmd(vma, addr, pmdp);
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
-	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+	if (src_nid != -1)
+		task_numa_fault(src_nid, HPAGE_PMD_NR, false);
 	return 0;
 }
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread
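
To spell the fix out: the fault statistic is attributed to the node the
task is running on, while the page's node is only used to classify the
access as local or remote. A toy sketch of that distinction with made-up
node numbers:

/*
 * Sketch: a hinting fault is "local" only if the faulting CPU's node and
 * the page's node match; the fault itself is accounted to the CPU's node.
 */
#include <stdio.h>

struct hint_fault {
	int cpu_node;	/* node the task is running on (numa_node_id()) */
	int page_node;	/* node backing the page (page_to_nid()) */
};

int main(void)
{
	struct hint_fault f = { .cpu_node = 0, .page_node = 1 };	/* example */

	printf("account fault to node %d (%s access)\n",
	       f.cpu_node,	/* after the fix: the task's node */
	       f.cpu_node == f.page_node ? "local" : "remote");
	return 0;
}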

* [PATCH 04/16] mm: numa: Do not migrate or account for hinting faults on the zero page
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

The zero page is not replicated between nodes and is often shared
between processes. The data is read-only and likely to be cached in
local CPUs if heavily accessed, meaning that the remote memory access
cost is less of a concern. This patch stops accounting for NUMA hinting
faults on the zero page, both in terms of counting faults and of
scheduling tasks on nodes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 9 +++++++++
 mm/memory.c      | 7 ++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e4a79fa..ec938ed 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1302,6 +1302,15 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(pmd);
 	get_page(page);
+
+	/*
+	 * Do not account for faults against the huge zero page. The read-only
+	 * data is likely to be read-cached on the local CPUs and it is less
+	 * useful to know about local versus remote hits on the zero page.
+	 */
+	if (is_huge_zero_pfn(page_to_pfn(page)))
+		goto clear_pmdnuma;
+
 	src_nid = numa_node_id();
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (src_nid == page_to_nid(page))
diff --git a/mm/memory.c b/mm/memory.c
index ba94dec..422351c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3560,8 +3560,13 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
 
+	/*
+	 * Do not account for faults against the zero page. The read-only
+	 * data is likely to be read-cached on the local CPUs and it is less
+	 * useful to know about local versus remote hits on the zero page.
+	 */
 	page = vm_normal_page(vma, addr, pte);
-	if (!page) {
+	if (!page || is_zero_pfn(page_to_pfn(page))) {
 		pte_unmap_unlock(ptep, ptl);
 		return 0;
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread
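
In terms of the fault accounting introduced earlier in the series, this
amounts to filtering zero-page faults out before they reach the per-node
counters. A toy sketch, where ZERO_PFN is a made-up stand-in for the
kernel's zero page checks:

/*
 * Sketch: drop hinting faults against the (single, shared) zero page so
 * they neither bias placement nor trigger migration attempts.
 */
#include <stdio.h>
#include <stdbool.h>

#define ZERO_PFN 0UL	/* hypothetical sentinel for the shared zero page */

static unsigned long numa_faults[4];

static void account_fault(unsigned long pfn, int node, int pages, bool migrated)
{
	if (pfn == ZERO_PFN)	/* analogous to the is_zero_pfn() check */
		return;
	numa_faults[node] += (unsigned long)pages << migrated;
}

int main(void)
{
	account_fault(ZERO_PFN, 0, 1, false);	/* ignored */
	account_fault(1234, 1, 1, true);	/* counted with weight 2 */
	printf("node0=%lu node1=%lu\n", numa_faults[0], numa_faults[1]);
	return 0;
}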

* [PATCH 05/16] sched: Select a preferred node with the most numa hinting faults
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch selects a preferred node for a task to run on based on where
its NUMA hinting faults were incurred. This information is later used to
migrate tasks towards that node during load balancing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 17 +++++++++++++++--
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 72861b4..ba46a64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,7 @@ struct task_struct {
 	struct callback_head numa_work;
 
 	unsigned long *numa_faults;
+	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f332ec0..ed4e785 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 904fd6f..c0bee41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -793,7 +793,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq;
+	int seq, nid, max_nid = 0;
+	unsigned long max_faults = 0;
 
 	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
 		return;
@@ -802,7 +803,19 @@ static void task_numa_placement(struct task_struct *p)
 		return;
 	p->numa_scan_seq = seq;
 
-	/* FIXME: Scheduling placement policy hints go here */
+	/* Find the node with the highest number of faults */
+	for (nid = 0; nid < nr_node_ids; nid++) {
+		unsigned long faults = p->numa_faults[nid];
+		p->numa_faults[nid] >>= 1;
+		if (faults > max_faults) {
+			max_faults = faults;
+			max_nid = nid;
+		}
+	}
+
+	/* Update the tasks preferred node if necessary */
+	if (max_faults && max_nid != p->numa_preferred_nid)
+		p->numa_preferred_nid = max_nid;
 }
 
 /*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread
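
The placement rule is small enough to restate in userspace: halve every
node's count each scan and pick the node with the most faults as the
preferred node. A sketch with made-up fault counts (the kernel additionally
keeps the previous preference when no faults were recorded):

/*
 * Sketch of the task_numa_placement() selection: decay each node's count
 * and pick the node with the most faults as the preferred node.
 */
#include <stdio.h>

#define NR_NODES 4	/* hypothetical */

static int pick_preferred(unsigned long faults[NR_NODES])
{
	unsigned long max_faults = 0;
	int nid, max_nid = 0;

	for (nid = 0; nid < NR_NODES; nid++) {
		unsigned long f = faults[nid];	/* compare pre-decay value */

		faults[nid] >>= 1;		/* decay the existing window */
		if (f > max_faults) {
			max_faults = f;
			max_nid = nid;
		}
	}
	return max_faults ? max_nid : -1;	/* -1: no faults recorded */
}

int main(void)
{
	unsigned long faults[NR_NODES] = { 12, 600, 30, 0 };	/* example */

	printf("preferred node: %d\n", pick_preferred(faults));
	return 0;
}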

* [PATCH 06/16] sched: Update NUMA hinting faults once per scan
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

NUMA hinting fault counts and placement decisions are both recorded in
the same array, which distorts the samples in an unpredictable fashion.
The values linearly accumulate during the scan and then decay, creating
a sawtooth-like pattern in the per-node counts. It also means that
placement decisions are time sensitive. At best it means that it is very
difficult to state that the buffer holds a decaying average of past
faulting behaviour. At worst, it can confuse the load balancer if it
sees one node with an artificially high count due to very recent
faulting activity and may create a bouncing effect.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h | 13 +++++++++++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 16 +++++++++++++---
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba46a64..42f9818 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1506,7 +1506,20 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	/*
+	 * Exponential decaying average of faults on a per-node basis.
+	 * Scheduling placement decisions are made based on these counts.
+	 * The values remain static for the duration of a PTE scan
+	 */
 	unsigned long *numa_faults;
+
+	/*
+	 * numa_faults_buffer records faults per node during the current
+	 * scan window. When the scan completes, the counts in numa_faults
+	 * decay and these values are copied.
+	 */
+	unsigned long *numa_faults_buffer;
+
 	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ed4e785..0bd541c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1596,6 +1596,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
+	p->numa_faults_buffer = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0bee41..8dc9ff9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -805,8 +805,14 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
-		unsigned long faults = p->numa_faults[nid];
+		unsigned long faults;
+
+		/* Decay existing window and copy faults since last scan */
 		p->numa_faults[nid] >>= 1;
+		p->numa_faults[nid] += p->numa_faults_buffer[nid];
+		p->numa_faults_buffer[nid] = 0;
+
+		faults = p->numa_faults[nid];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -832,9 +838,13 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * nr_node_ids;
 
-		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		/* numa_faults and numa_faults_buffer share the allocation */
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
 		if (!p->numa_faults)
 			return;
+
+		BUG_ON(p->numa_faults_buffer);
+		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
 	}
 
 	/*
@@ -848,7 +858,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults[node] += pages << migrated;
+	p->numa_faults_buffer[node] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread
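
The effect of the split can be sketched in userspace: numa_faults_buffer
accumulates raw counts during a scan window and, when the window completes,
is folded into numa_faults as an exponentially decaying average. The window
data below is made up; the point is that a single burst of faults no longer
dominates the history:

/*
 * Sketch of the two-array scheme: fold the per-window buffer into the
 * decaying history once per completed scan, then clear the buffer.
 */
#include <stdio.h>

#define NR_NODES 2	/* hypothetical */

static unsigned long numa_faults[NR_NODES];		/* decaying history */
static unsigned long numa_faults_buffer[NR_NODES];	/* current window */

static void end_of_scan(void)
{
	for (int nid = 0; nid < NR_NODES; nid++) {
		numa_faults[nid] >>= 1;				/* decay */
		numa_faults[nid] += numa_faults_buffer[nid];	/* fold in */
		numa_faults_buffer[nid] = 0;
	}
}

int main(void)
{
	/* three scan windows with different fault patterns */
	unsigned long windows[3][NR_NODES] = { {100, 0}, {100, 0}, {0, 100} };

	for (int w = 0; w < 3; w++) {
		for (int nid = 0; nid < NR_NODES; nid++)
			numa_faults_buffer[nid] = windows[w][nid];
		end_of_scan();
		printf("after window %d: node0=%lu node1=%lu\n",
		       w, numa_faults[0], numa_faults[1]);
	}
	return 0;
}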

* [PATCH 07/16] sched: Favour moving tasks towards the preferred node
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch favours moving tasks towards the preferred NUMA node when it
has just been selected. Ideally this is self-reinforcing as the longer
the task runs on that node, the more faults it should incur there,
causing task_numa_placement to keep the task running on that node. In
reality a big weakness is that the node's CPUs can be overloaded and it
would be more efficient to queue tasks on an idle node and migrate to
the new node. This would require additional smarts in the balancer, so
for now the balancer simply prefers to place the task on the preferred
node for a number of PTE scans, which is controlled by the
numa_balancing_settle_count sysctl. Once settle_count scans have
completed, the scheduler is free to place the task on an alternative
node if the load is imbalanced.

[srikar@linux.vnet.ibm.com: Fixed statistics]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt |  8 +++++-
 include/linux/sched.h           |  1 +
 kernel/sched/core.c             |  3 ++-
 kernel/sched/fair.c             | 60 ++++++++++++++++++++++++++++++++++++++---
 kernel/sysctl.c                 |  7 +++++
 5 files changed, 73 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 0fe678c..246b128 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,7 +374,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
 
 ==============================================================
 
@@ -418,6 +419,11 @@ scanned for a given scan.
 numa_balancing_scan_period_reset is a blunt instrument that controls how
 often a tasks scan delay is reset to detect sudden changes in task behaviour.
 
+numa_balancing_settle_count is how many scan periods must complete before
+the scheduler stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42f9818..82a6136 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0bd541c..5e02507 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1591,7 +1591,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_migrate_seq = 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
@@ -6141,6 +6141,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8dc9ff9..5055bf9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -791,6 +791,15 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -802,6 +811,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_migrate_seq++;
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
@@ -820,8 +830,10 @@ static void task_numa_placement(struct task_struct *p)
 	}
 
 	/* Update the tasks preferred node if necessary */
-	if (max_faults && max_nid != p->numa_preferred_nid)
+	if (max_faults && max_nid != p->numa_preferred_nid) {
 		p->numa_preferred_nid = max_nid;
+		p->numa_migrate_seq = 0;
+	}
 }
 
 /*
@@ -3898,6 +3910,35 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (p->numa_preferred_nid == dst_nid)
+		return true;
+
+	return false;
+}
+#else
+static inline bool migrate_improves_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -3946,11 +3987,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) task is cache cold, or
-	 * 2) too many balance attempts have failed.
+	 * 1) destination numa is preferred
+	 * 2) task is cache cold, or
+	 * 3) too many balance attempts have failed.
 	 */
-
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+
+	if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+		if (tsk_cache_hot) {
+			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+			schedstat_inc(p, se.statistics.nr_forced_migrations);
+		}
+#endif
+		return 1;
+	}
+
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..263486f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -393,6 +393,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_settle_count",
+		.data           = &sysctl_numa_balancing_settle_count,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread
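
The balancer-side rule is compact enough to restate as a standalone
predicate. A sketch under simplifying assumptions (plain ints instead of
the scheduler's lb_env, and the settle count passed in by the caller):

/*
 * Sketch of the intent of migrate_improves_locality(): a cross-node move
 * only counts as a locality improvement while the task is still settling
 * and the destination is its preferred node.
 */
#include <stdio.h>
#include <stdbool.h>

static bool move_improves_locality(int src_nid, int dst_nid,
				   int preferred_nid, int migrate_seq,
				   int settle_count)
{
	/* Not a cross-node move */
	if (src_nid == dst_nid)
		return false;

	/* Task has settled; leave the decision to the normal balancer */
	if (migrate_seq >= settle_count)
		return false;

	return preferred_nid == dst_nid;
}

int main(void)
{
	/* Task prefers node 1 and was selected one scan ago: pull it */
	printf("%d\n", move_improves_locality(0, 1, 1, 1, 3));
	/* Same move after five scans: task has settled, do not force it */
	printf("%d\n", move_improves_locality(0, 1, 1, 5, 3));
	return 0;
}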

* [PATCH 08/16] sched: Reschedule task on preferred NUMA node once selected
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

A preferred node is selected based on the node on which the most NUMA
hinting faults were incurred. There is no guarantee that the task is
running on that node at the time, so this patch reschedules the task to
run on the idlest CPU of the selected node once it is selected. This
avoids waiting for the balancer to make a decision.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  | 17 +++++++++++++++++
 kernel/sched/fair.c  | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  1 +
 3 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e02507..e4c1832 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -992,6 +992,23 @@ struct migration_arg {
 
 static int migration_cpu_stop(void *data);
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Migrate current task p to target_cpu */
+int migrate_task_to(struct task_struct *p, int target_cpu)
+{
+	struct migration_arg arg = { p, target_cpu };
+	int curr_cpu = task_cpu(p);
+
+	if (curr_cpu == target_cpu)
+		return 0;
+
+	if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
+		return -EINVAL;
+
+	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5055bf9..c9ce879 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -800,6 +800,31 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static unsigned long weighted_cpuload(const int cpu);
+
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int i, idlest_cpu = this_cpu;
+
+	BUG_ON(cpu_to_node(this_cpu) == nid);
+
+	rcu_read_lock();
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		load = weighted_cpuload(i);
+
+		if (load < min_load) {
+			min_load = load;
+			idlest_cpu = i;
+		}
+	}
+	rcu_read_unlock();
+
+	return idlest_cpu;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -829,10 +854,29 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/* Update the tasks preferred node if necessary */
+	/*
+	 * Record the preferred node as the node with the most faults,
+	 * requeue the task to be running on the idlest CPU on the
+	 * preferred node and reset the scanning rate to recheck
+	 * the working set placement.
+	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
+		int preferred_cpu;
+
+		/*
+		 * If the task is not on the preferred node then find the most
+		 * idle CPU to migrate to.
+		 */
+		preferred_cpu = task_cpu(p);
+		if (cpu_to_node(preferred_cpu) != max_nid) {
+			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+							     max_nid);
+		}
+
+		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 0;
+		migrate_task_to(p, preferred_cpu);
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5f773d..795346d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,6 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
+extern int migrate_task_to(struct task_struct *p, int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread
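
The CPU selection is a plain minimum search over the CPUs of the target
node. A userspace sketch with a made-up CPU-to-node map and load table
standing in for cpumask_of_node() and weighted_cpuload():

/*
 * Sketch of find_idlest_cpu_node(): among the CPUs of the target node,
 * pick the one with the lowest load. Topology and loads are made up.
 */
#include <stdio.h>
#include <limits.h>

#define NR_CPUS 8

static const int cpu_node[NR_CPUS]           = { 0, 0, 0, 0, 1, 1, 1, 1 };
static const unsigned long cpu_load[NR_CPUS] = { 5, 9, 3, 7, 2, 8, 0, 6 };

static int idlest_cpu_on_node(int nid, int fallback_cpu)
{
	unsigned long min_load = ULONG_MAX;
	int cpu, idlest = fallback_cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu_node[cpu] != nid)
			continue;
		if (cpu_load[cpu] < min_load) {
			min_load = cpu_load[cpu];
			idlest = cpu;
		}
	}
	return idlest;
}

int main(void)
{
	printf("idlest CPU on node 1: %d\n", idlest_cpu_on_node(1, 0));
	return 0;
}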

* [PATCH 09/16] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared.  This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  5 +++--
 kernel/sched/fair.c   | 33 ++++++++++++++++++++++++---------
 mm/huge_memory.c      |  7 ++++---
 mm/memory.c           |  9 ++++++---
 4 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 82a6136..b81195e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1600,10 +1600,11 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages,
+				   bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9ce879..eb6be97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -825,6 +825,11 @@ find_idlest_cpu_node(int this_cpu, int nid)
 	return idlest_cpu;
 }
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = 0;
@@ -841,13 +846,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
 
-		faults = p->numa_faults[nid];
+			/* Decay existing window, copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
+
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -883,16 +894,20 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv;
 
 	if (!sched_feat_numa(NUMA))
 		return;
 
+	/* For now, do not attempt to detect private/shared accesses */
+	priv = 1;
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
 		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
@@ -900,7 +915,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -914,7 +929,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 	task_numa_placement(p);
 
 	/* Record the fault, double the weight if pages were migrated */
-	p->numa_faults_buffer[node] += pages << migrated;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec938ed..9462591 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int target_nid;
+	int target_nid, last_nid;
 	int src_nid = -1;
 	bool migrated;
 
@@ -1316,6 +1316,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (src_nid == page_to_nid(page))
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
+	last_nid = page_nid_last(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		put_page(page);
@@ -1341,7 +1342,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!migrated)
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
 	return 0;
 
 check_same:
@@ -1356,7 +1357,7 @@ clear_pmdnuma:
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (src_nid != -1)
-		task_numa_fault(src_nid, HPAGE_PMD_NR, false);
+		task_numa_fault(last_nid, src_nid, HPAGE_PMD_NR, false);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 422351c..189da75 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int current_nid = -1, last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3571,6 +3571,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
+	last_nid = page_nid_last(page);
 	current_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3591,7 +3592,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+		task_numa_fault(last_nid, current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3607,6 +3608,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3659,6 +3661,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * migrated to.
 		 */
 		curr_nid = local_nid;
+		last_nid = page_nid_last(page);
 		target_nid = numa_migrate_prep(page, vma, addr,
 					       page_to_nid(page));
 		if (target_nid == -1) {
@@ -3671,7 +3674,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		migrated = migrate_misplaced_page(page, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		task_numa_fault(last_nid, curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 10/16] sched: Increase NUMA PTE scanning when a new preferred node is selected
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

The NUMA PTE scan is reset every sysctl_numa_balancing_scan_period_reset
to catch phase changes. This is crude, and it is clearly visible in graphs
when the PTE scanner resets even though the workload is already balanced.
This patch instead increases the scan rate when the preferred node is
updated and the task is currently running on that node, so that the
placement decision is rechecked quickly. In the optimistic expectation
that the placement decisions will be correct, the maximum period between
scans is also increased to reduce the overhead of automatic NUMA balancing.
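
To make the intended rate change concrete, here is a small standalone C
model of the new behaviour. The constants and function names below are
illustrative assumptions (they roughly track the _min sysctl and the
settle count), not the kernel's own:

#include <stdio.h>

#define SCAN_PERIOD_MIN_MS 100  /* assumed minimum scan period in ms */
#define SETTLE_COUNT 3          /* assumed settle threshold */

static unsigned int max_uint(unsigned int a, unsigned int b)
{
        return a > b ? a : b;
}

/* halve the scan period only if the task had settled on its old node */
static unsigned int new_scan_period(unsigned int period, int old_migrate_seq)
{
        if (old_migrate_seq >= SETTLE_COUNT)
                period = max_uint(period >> 1, SCAN_PERIOD_MIN_MS);
        return period;
}

int main(void)
{
        /* a settled task rechecks placement sooner ... */
        printf("settled:   %u ms\n", new_scan_period(800, 5));
        /* ... while a task whose preferred node keeps flapping does not */
        printf("unsettled: %u ms\n", new_scan_period(800, 1));
        return 0;
}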

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 +++--------
 include/linux/mm_types.h        |  3 ---
 include/linux/sched/sysctl.h    |  1 -
 kernel/sched/core.c             |  1 -
 kernel/sched/fair.c             | 27 ++++++++++++---------------
 kernel/sysctl.c                 |  7 -------
 6 files changed, 15 insertions(+), 35 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 246b128..a275042 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -373,15 +373,13 @@ guarantee. If the target workload is already bound to NUMA nodes then this
 feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
-numa_balancing_settle_count sysctls.
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
 
 ==============================================================
 
 numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_size_mb
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
 
 Automatic NUMA balancing scans tasks address space and unmaps pages to
 detect if pages are properly placed or if the data should be migrated to a
@@ -416,9 +414,6 @@ effectively controls the minimum scanning rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
 
-numa_balancing_scan_period_reset is a blunt instrument that controls how
-often a tasks scan delay is reset to detect sudden changes in task behaviour.
-
 numa_balancing_settle_count is how many scan periods must complete before
 the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..de70964 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -421,9 +421,6 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
-	/* numa_next_reset is when the PTE scanner period will be reset */
-	unsigned long numa_next_reset;
-
 	/* Restart point for scanning and setting pte_numa */
 	unsigned long numa_scan_offset;
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..10d16c4f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -47,7 +47,6 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
-extern unsigned int sysctl_numa_balancing_scan_period_reset;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_settle_count;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e4c1832..02db92a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1602,7 +1602,6 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_next_scan = jiffies;
-		p->mm->numa_next_reset = jiffies;
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb6be97..0d4e516 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -782,8 +782,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * numa task sample period in ms
  */
 unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
-unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -873,6 +872,7 @@ static void task_numa_placement(struct task_struct *p)
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		int preferred_cpu;
+		int old_migrate_seq = p->numa_migrate_seq;
 
 		/*
 		 * If the task is not on the preferred node then find the most
@@ -888,6 +888,16 @@ static void task_numa_placement(struct task_struct *p)
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 0;
 		migrate_task_to(p, preferred_cpu);
+
+		/*
+		 * If preferred nodes changes frequently then the scan rate
+		 * will be continually high. Mitigate this by increasing the
+		 * scan rate only if the task was settled.
+		 */
+		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
+			p->numa_scan_period = max(p->numa_scan_period >> 1,
+					sysctl_numa_balancing_scan_period_min);
+		}
 	}
 }
 
@@ -984,19 +994,6 @@ void task_numa_work(struct callback_head *work)
 	}
 
 	/*
-	 * Reset the scan period if enough time has gone by. Objective is that
-	 * scanning will be reduced if pages are properly placed. As tasks
-	 * can enter different phases this needs to be re-examined. Lacking
-	 * proper tracking of reference behaviour, this blunt hammer is used.
-	 */
-	migrate = mm->numa_next_reset;
-	if (time_after(now, migrate)) {
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
-		next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
-		xchg(&mm->numa_next_reset, next_scan);
-	}
-
-	/*
 	 * Enforce maximal scan/migration frequency..
 	 */
 	migrate = mm->numa_next_scan;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 263486f..1fcbc68 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -373,13 +373,6 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname	= "numa_balancing_scan_period_reset",
-		.data		= &sysctl_numa_balancing_scan_period_reset,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
-	{
 		.procname	= "numa_balancing_scan_period_max_ms",
 		.data		= &sysctl_numa_balancing_scan_period_max,
 		.maxlen		= sizeof(unsigned int),
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 11/16] sched: Check current->mm before allocating NUMA faults
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

task_numa_placement checks current->mm, but only after the buffers for
faults have already been uselessly allocated. Move the check earlier so
the allocation is skipped for tasks without an mm (for example, ksmd
faulting in a user's mm).
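
The change is purely about ordering the cheap rejection test before the
allocation. A minimal userspace sketch of the idea follows; fake_task and
record_fault are made-up names standing in for the real structures:

#include <stdio.h>
#include <stdlib.h>

struct fake_task {
        void *mm;               /* NULL for kernel threads such as ksmd */
        unsigned long *faults;
};

static void record_fault(struct fake_task *t, int nr_counters)
{
        /* reject before allocating, mirroring the intent of the patch */
        if (!t->mm)
                return;

        if (!t->faults) {
                t->faults = calloc(nr_counters, sizeof(*t->faults));
                if (!t->faults)
                        return;
        }
        t->faults[0]++;
}

int main(void)
{
        struct fake_task kthread = { .mm = NULL };

        record_fault(&kthread, 8);
        printf("allocated: %s\n", kthread.faults ? "yes" : "no");
        return 0;
}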

[peterz@infradead.org: Identified the problem]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d4e516..54702c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -834,8 +834,6 @@ static void task_numa_placement(struct task_struct *p)
 	int seq, nid, max_nid = 0;
 	unsigned long max_faults = 0;
 
-	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
-		return;
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
 		return;
@@ -912,6 +910,10 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!sched_feat_numa(NUMA))
 		return;
 
+	/* for example, ksmd faulting in a user's mm */
+	if (!p->mm)
+		return;
+
 	/* For now, do not attempt to detect private/shared accesses */
 	priv = 1;
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 12/16] sched: Set the scan rate proportional to the size of the task being scanned
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size sysctls. This scan rate is independent of the
size of the task and, as an aside, it is further complicated by the fact
that numa_balancing_scan_size controls how many pages are marked pte_numa
and not how much virtual memory is scanned.

In combination, it is almost impossible to meaningfully tune the min and
max scan periods and reasoning about performance is complex when the time
to complete a full scan is partially a function of the task's memory
size. This patch alters the semantics of the min and max tunables so that
they tune the length of time it takes to complete a scan of a task's
virtual address space. Conceptually this is a lot easier to understand.
There is a "sanity" check to ensure the scan rate is never extremely fast
based on the amount of virtual memory that should be scanned in a second.
The default of 2.5G seems arbitrary but it was chosen so that the maximum
scan rate after the patch roughly matches the maximum scan rate before
the patch was applied.
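
For anyone who wants to sanity-check the arithmetic, the standalone C
program below reproduces the per-window calculation for a few task sizes.
The constants mirror the defaults introduced by this patch, but the
helper names and the program itself are only a sketch of the logic, not
the kernel implementation:

#include <stdio.h>

#define SCAN_SIZE_MB            256     /* numa_balancing_scan_size */
#define SCAN_PERIOD_MIN_MS      1000    /* scan whole task in >= 1s */
#define SCAN_PERIOD_MAX_MS      600000  /* scan whole task in <= 600s */
#define MAX_SCAN_WINDOW_MB      2560    /* never scan faster than 2.5G/sec */

static unsigned int nr_scan_windows(unsigned long task_size_mb)
{
        unsigned long windows =
                (task_size_mb + SCAN_SIZE_MB - 1) / SCAN_SIZE_MB;

        return windows ? windows : 1;
}

static unsigned int scan_min_ms(unsigned long task_size_mb)
{
        /* floor prevents scanning faster than MAX_SCAN_WINDOW_MB per sec */
        unsigned int floor = 1000 / (MAX_SCAN_WINDOW_MB / SCAN_SIZE_MB);
        unsigned int scan = SCAN_PERIOD_MIN_MS / nr_scan_windows(task_size_mb);

        return scan > floor ? scan : floor;
}

static unsigned int scan_max_ms(unsigned long task_size_mb)
{
        unsigned int smin = scan_min_ms(task_size_mb);
        unsigned int smax = SCAN_PERIOD_MAX_MS / nr_scan_windows(task_size_mb);

        /* watch for min being higher than max due to the floor */
        return smax > smin ? smax : smin;
}

int main(void)
{
        unsigned long sizes_mb[] = { 128, 1024, 65536 };
        unsigned int i;

        for (i = 0; i < sizeof(sizes_mb) / sizeof(sizes_mb[0]); i++)
                printf("%6lu MB task: %u windows, min %u ms, max %u ms per window\n",
                       sizes_mb[i], nr_scan_windows(sizes_mb[i]),
                       scan_min_ms(sizes_mb[i]), scan_max_ms(sizes_mb[i]));
        return 0;
}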

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 ++++---
 include/linux/sched.h           |  1 +
 kernel/sched/fair.c             | 72 +++++++++++++++++++++++++++++++++++------
 3 files changed, 70 insertions(+), 14 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a275042..f38d4f4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -401,15 +401,16 @@ workload pattern changes and minimises performance impact due to remote
 memory accesses. These sysctls control the thresholds for scan delays and
 the number of pages scanned.
 
-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
 
 numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
 when it initially forks.
 
-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
 
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b81195e..d44fbc6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1504,6 +1504,7 @@ struct task_struct {
 	int numa_scan_seq;
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
+	unsigned int numa_scan_period_max;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54702c9..3ffe097 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -779,10 +779,12 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the tasks virtual memory size and
+ * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_max = 600000;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -790,6 +792,46 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+	unsigned long nr_vm_pages = 0;
+	unsigned long nr_scan_pages;
+
+	nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+	nr_vm_pages = p->mm->total_vm;
+	if (!nr_vm_pages)
+		nr_vm_pages = nr_scan_pages;
+
+	nr_vm_pages = round_up(nr_vm_pages, nr_scan_pages);
+	return nr_vm_pages / nr_scan_pages;
+}
+
+/* For sanitys sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */
+#define MAX_SCAN_WINDOW 2560
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+	unsigned int scan, floor;
+	unsigned int windows = 1;
+
+	if (sysctl_numa_balancing_scan_size < MAX_SCAN_WINDOW)
+		windows = MAX_SCAN_WINDOW / sysctl_numa_balancing_scan_size;
+	floor = 1000 / windows;
+
+	scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+	return max_t(unsigned int, floor, scan);
+}
+
+static unsigned int task_scan_max(struct task_struct *p)
+{
+	unsigned int smin = task_scan_min(p);
+	unsigned int smax;
+
+	/* Watch for min being lower than max due to floor calculations */
+	smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+	return max(smin, smax);
+}
+
 /*
  * Once a preferred node is selected the scheduler balancer will prefer moving
  * a task to that node for sysctl_numa_balancing_settle_count number of PTE
@@ -839,6 +881,7 @@ static void task_numa_placement(struct task_struct *p)
 		return;
 	p->numa_scan_seq = seq;
 	p->numa_migrate_seq++;
+	p->numa_scan_period_max = task_scan_max(p);
 
 	/* Find the node with the highest number of faults */
 	for (nid = 0; nid < nr_node_ids; nid++) {
@@ -894,7 +937,7 @@ static void task_numa_placement(struct task_struct *p)
 		 */
 		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
 			p->numa_scan_period = max(p->numa_scan_period >> 1,
-					sysctl_numa_balancing_scan_period_min);
+					task_scan_min(p));
 		}
 	}
 }
@@ -935,7 +978,7 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	 * This is reset periodically in case of phase changes
 	 */
         if (!migrated)
-		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
+		p->numa_scan_period = min(p->numa_scan_period_max,
 			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
@@ -961,6 +1004,7 @@ void task_numa_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
+	unsigned long nr_pte_updates = 0;
 	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -1002,8 +1046,10 @@ void task_numa_work(struct callback_head *work)
 	if (time_before(now, migrate))
 		return;
 
-	if (p->numa_scan_period == 0)
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+	if (p->numa_scan_period == 0) {
+		p->numa_scan_period_max = task_scan_max(p);
+		p->numa_scan_period = task_scan_min(p);
+	}
 
 	next_scan = now + msecs_to_jiffies(p->numa_scan_period);
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
@@ -1042,7 +1088,15 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			pages -= change_prot_numa(vma, start, end);
+			nr_pte_updates += change_prot_numa(vma, start, end);
+
+			/*
+			 * Scan sysctl_numa_balancing_scan_size but ensure that
+			 * at least one PTE is updated so that unused virtual
+			 * address space is quickly skipped.
+			 */
+			if (nr_pte_updates)
+				pages -= (end - start) >> PAGE_SHIFT;
 
 			start = end;
 			if (pages <= 0)
@@ -1089,7 +1143,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 
 	if (now - curr->node_stamp > period) {
 		if (!curr->node_stamp)
-			curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+			curr->numa_scan_period = task_scan_min(curr);
 		curr->node_stamp = now;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 13/16] mm: numa: Scan pages with elevated page_mapcount
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Currently automatic NUMA balancing is unable to distinguish between
falsely shared and private pages except by ignoring pages with an
elevated page_mapcount entirely. This avoids shared pages bouncing
between the nodes whose tasks are using them, but it also ignores quite
a lot of data.

This patch kicks away the training wheels in preparation for adding
support for identifying shared/private pages. The ordering is such that
the impact of the shared/private detection can be easily measured. Note
that the patch does not migrate shared, file-backed pages within VMAs
marked VM_EXEC as these are generally shared library pages. Migrating
such pages is not beneficial as there is an expectation that they are
read-shared between caches, and iTLB and iCache pressure is generally low.
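
The new skip condition reduces to a small predicate. A hedged userspace
rendering of the check is below; FAKE_VM_EXEC and skip_migration are
placeholder names and values, not the kernel's:

#include <stdbool.h>
#include <stdio.h>

#define FAKE_VM_EXEC 0x4        /* illustrative flag value only */

/*
 * Mirror of the intent of the patch: only refuse to migrate a page that
 * is mapped by multiple processes, backed by a file and mapped executable,
 * i.e. what is almost certainly shared library text.
 */
static bool skip_migration(int mapcount, bool file_backed,
                           unsigned long vm_flags)
{
        return mapcount != 1 && file_backed && (vm_flags & FAKE_VM_EXEC);
}

int main(void)
{
        /* shared library text: skipped */
        printf("libc text:   %s\n",
               skip_migration(12, true, FAKE_VM_EXEC) ? "skip" : "migrate");
        /* anonymous memory shared by processes: now eligible for migration */
        printf("shared anon: %s\n",
               skip_migration(4, false, 0) ? "skip" : "migrate");
        return 0;
}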

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |  7 ++++---
 mm/memory.c             |  7 ++-----
 mm/migrate.c            | 17 ++++++-----------
 mm/mprotect.c           |  4 +---
 4 files changed, 13 insertions(+), 22 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..e7e26af 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -92,11 +92,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+				  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+					 struct vm_area_struct *vma, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index 189da75..f4e3ad5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3586,7 +3586,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, target_nid);
+	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated)
 		current_nid = target_nid;
 
@@ -3651,9 +3651,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pteval);
 		if (unlikely(!page))
 			continue;
-		/* only check non-shared pages */
-		if (unlikely(page_mapcount(page) != 1))
-			continue;
 
 		/*
 		 * Note that the NUMA fault is later accounted to either
@@ -3671,7 +3668,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		/* Migrate to the requested node */
 		pte_unmap_unlock(pte, ptl);
-		migrated = migrate_misplaced_page(page, target_nid);
+		migrated = migrate_misplaced_page(page, vma, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
 		task_numa_fault(last_nid, curr_nid, 1, migrated);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3bbaf5d..23f8122 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1579,7 +1579,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+			   int node)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
@@ -1587,10 +1588,11 @@ int migrate_misplaced_page(struct page *page, int node)
 	LIST_HEAD(migratepages);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
+	 * Don't migrate file pages that are mapped in multiple processes
+	 * with execute permissions as they are probably shared libraries.
 	 */
-	if (page_mapcount(page) != 1)
+	if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+	    (vma->vm_flags & VM_EXEC))
 		goto out;
 
 	/*
@@ -1641,13 +1643,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int page_lru = page_is_file_cache(page);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
-	 */
-	if (page_mapcount(page) != 1)
-		goto out_dropref;
-
-	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..cacc64a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					if (last_nid != this_nid)
 						all_same_node = false;
 
-					/* only check non-shared pages */
-					if (!pte_numa(oldpte) &&
-					    page_mapcount(page) == 1) {
+					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
 					}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 13/16] mm: numa: Scan pages with elevated page_mapcount
@ 2013-07-11  9:46   ` Mel Gorman
  0 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Currently automatic NUMA balancing is unable to distinguish between false
shared versus private pages except by ignoring pages with an elevated
page_mapcount entirely. This avoids shared pages bouncing between the
nodes whose task is using them but that is ignored quite a lot of data.

This patch kicks away the training wheels in preparation for adding support
for identifying shared/private pages is now in place. The ordering is so
that the impact of the shared/private detection can be easily measured. Note
that the patch does not migrate shared, file-backed within vmas marked
VM_EXEC as these are generally shared library pages. Migrating such pages
is not beneficial as there is an expectation they are read-shared between
caches and iTLB and iCache pressure is generally low.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |  7 ++++---
 mm/memory.c             |  7 ++-----
 mm/migrate.c            | 17 ++++++-----------
 mm/mprotect.c           |  4 +---
 4 files changed, 13 insertions(+), 22 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..e7e26af 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -92,11 +92,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+				  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+					 struct vm_area_struct *vma, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index 189da75..f4e3ad5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3586,7 +3586,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, target_nid);
+	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated)
 		current_nid = target_nid;
 
@@ -3651,9 +3651,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pteval);
 		if (unlikely(!page))
 			continue;
-		/* only check non-shared pages */
-		if (unlikely(page_mapcount(page) != 1))
-			continue;
 
 		/*
 		 * Note that the NUMA fault is later accounted to either
@@ -3671,7 +3668,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		/* Migrate to the requested node */
 		pte_unmap_unlock(pte, ptl);
-		migrated = migrate_misplaced_page(page, target_nid);
+		migrated = migrate_misplaced_page(page, vma, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
 		task_numa_fault(last_nid, curr_nid, 1, migrated);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3bbaf5d..23f8122 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1579,7 +1579,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+			   int node)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
@@ -1587,10 +1588,11 @@ int migrate_misplaced_page(struct page *page, int node)
 	LIST_HEAD(migratepages);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
+	 * Don't migrate file pages that are mapped in multiple processes
+	 * with execute permissions as they are probably shared libraries.
 	 */
-	if (page_mapcount(page) != 1)
+	if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+	    (vma->vm_flags & VM_EXEC))
 		goto out;
 
 	/*
@@ -1641,13 +1643,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int page_lru = page_is_file_cache(page);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
-	 */
-	if (page_mapcount(page) != 1)
-		goto out_dropref;
-
-	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..cacc64a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					if (last_nid != this_nid)
 						all_same_node = false;
 
-					/* only check non-shared pages */
-					if (!pte_numa(oldpte) &&
-					    page_mapcount(page) == 1) {
+					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
 					}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 14/16] sched: Remove check that skips small VMAs
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

task_numa_work skips small VMAs. At the time, the rationale was to reduce
the scanning overhead, which was considerable, but it is a dubious hack at
best. It would make much more sense to cache where faults have been observed
and only rescan those regions during subsequent PTE scans. Remove this
hack as motivation to do it properly in the future.
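
Purely as an illustration of the "do it properly" direction, and not something
this series implements, caching where hinting faults were observed per VMA
might look roughly like the following; every name here is hypothetical:

  /* Hypothetical per-VMA record of the range that recently faulted */
  struct numa_fault_window {
          unsigned long start;    /* first address that took a hinting fault */
          unsigned long end;      /* last address that took a hinting fault  */
  };

  /* Only rescan a candidate range if it overlaps the recorded window */
  static bool numa_should_rescan(struct numa_fault_window *w,
                                 unsigned long start, unsigned long end)
  {
          return w->end > start && w->start < end;
  }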

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ffe097..45dcf51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1080,10 +1080,6 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma))
 			continue;
 
-		/* Skip small VMAs. They are not likely to be of relevance */
-		if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
-			continue;
-
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread


* [PATCH 15/16] sched: Set preferred NUMA node based on number of private faults
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:46   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:46 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.

To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that, in general, private accesses
are more important for cpu->memory locality. Also, no new infrastructure
is required to treat private pages properly, whereas interleaving shared
pages requires additional infrastructure.

To detect private accesses the pid of the last accessing task is required,
but the storage requirements are high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults was measured. Shared
faults are not taken into consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will become compute overloaded and tasks will
then be scheduled away, only to bounce back again. Alternatively, the sharing
tasks would just bounce around nodes because the fault information is
effectively noise. Either way, accounting for shared faults the same as
private faults can result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and its node would
be selected as the preferred node even though that is the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page, making the fault information unreliable.
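
To make the encoding concrete, here is a standalone sketch of the nidpid
packing added below together with a worked example; the node shift is chosen
purely for illustration:

  #define LAST__PID_SHIFT 8
  #define LAST__PID_MASK  ((1 << LAST__PID_SHIFT) - 1)
  #define LAST__NID_SHIFT 10                      /* example NODES_SHIFT */
  #define LAST__NID_MASK  ((1 << LAST__NID_SHIFT) - 1)

  static inline int nid_pid_to_nidpid(int nid, int pid)
  {
          return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) |
                  (pid & LAST__PID_MASK);
  }

  /*
   * Example: nid 3, pid 4242 (0x1092). Only the low 8 bits of the pid
   * survive, so 4242 -> 0x92 and the encoded value is
   * (3 << 8) | 0x92 = 0x392. A later fault by pid 4498 (0x1192) also
   * encodes to 0x92; collisions of this kind are the price the
   * changelog accepts as still better than node information alone.
   */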

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h                | 69 ++++++++++++++++++++++++++-------------
 include/linux/mm_types.h          |  4 +--
 include/linux/page-flags-layout.h | 28 +++++++++-------
 kernel/sched/fair.c               | 12 +++++--
 mm/huge_memory.c                  | 10 +++---
 mm/memory.c                       | 16 ++++-----
 mm/mempolicy.c                    |  8 +++--
 mm/migrate.c                      |  4 +--
 mm/mm_init.c                      | 18 +++++-----
 mm/mmzone.c                       | 12 +++----
 mm/mprotect.c                     | 24 +++++++++-----
 mm/page_alloc.c                   |  4 +--
 12 files changed, 128 insertions(+), 81 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e2091b8..93f9feb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -582,11 +582,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
  * sets it, so none of the operations on it need to be atomic.
  */
 
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NID_PGOFF		(ZONES_PGOFF - LAST_NID_WIDTH)
+#define LAST_NIDPID_PGOFF	(ZONES_PGOFF - LAST_NIDPID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -596,7 +596,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NID_PGSHIFT	(LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
+#define LAST_NIDPID_PGSHIFT	(LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -618,7 +618,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NID_MASK		((1UL << LAST_NID_WIDTH) - 1)
+#define LAST_NIDPID_MASK	((1UL << LAST_NIDPID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -662,48 +662,73 @@ static inline int page_to_nid(const struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int nid_pid_to_nidpid(int nid, int pid)
 {
-	return xchg(&page->_last_nid, nid);
+	return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
 }
 
-static inline int page_nid_last(struct page *page)
+static inline int nidpid_to_pid(int nidpid)
 {
-	return page->_last_nid;
+	return nidpid & LAST__PID_MASK;
 }
-static inline void page_nid_reset_last(struct page *page)
+
+static inline int nidpid_to_nid(int nidpid)
+{
+	return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+	return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+}
+
+static inline bool nidpid_nid_unset(int nidpid)
+{
+	return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+}
+
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+static inline int page_nidpid_xchg_last(struct page *page, int nid)
+{
+	return xchg(&page->_last_nidpid, nid);
+}
+
+static inline int page_nidpid_last(struct page *page)
+{
+	return page->_last_nidpid;
+}
+static inline void page_nidpid_reset_last(struct page *page)
 {
-	page->_last_nid = -1;
+	page->_last_nidpid = -1;
 }
 #else
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
 {
-	return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+	return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
 }
 
-extern int page_nid_xchg_last(struct page *page, int nid);
+extern int page_nidpid_xchg_last(struct page *page, int nidpid);
 
-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
 {
-	int nid = (1 << LAST_NID_SHIFT) - 1;
+	int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
 
-	page->flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
-	page->flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+	page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+	page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
 }
-#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
 #else
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
 {
 	return page_to_nid(page);
 }
 
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
 {
 	return page_to_nid(page);
 }
 
-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
 {
 }
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de70964..4137f67 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
 	void *shadow;
 #endif
 
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-	int _last_nid;
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+	int _last_nidpid;
 #endif
 }
 /*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 93506a1..02bc918 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -38,10 +38,10 @@
  * The last is when there is insufficient space in page->flags and a separate
  * lookup is necessary.
  *
- * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |          ... | FLAGS |
- *         " plus space for last_nid: |       NODE     | ZONE | LAST_NID ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE |          ... | FLAGS |
- *         " plus space for last_nid: | SECTION | NODE | ZONE | LAST_NID ... | FLAGS |
+ * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |             ... | FLAGS |
+ *      " plus space for last_nidpid: |       NODE     | ZONE | LAST_NIDPID ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
+ *      " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
  * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
  */
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -62,15 +62,21 @@
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#define LAST_NID_SHIFT NODES_SHIFT
+#define LAST__PID_SHIFT 8
+#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)
+
+#define LAST__NID_SHIFT NODES_SHIFT
+#define LAST__NID_MASK  ((1 << LAST__NID_SHIFT)-1)
+
+#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
 #else
-#define LAST_NID_SHIFT 0
+#define LAST_NIDPID_SHIFT 0
 #endif
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NID_WIDTH LAST_NID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
 #else
-#define LAST_NID_WIDTH 0
+#define LAST_NIDPID_WIDTH 0
 #endif
 
 /*
@@ -81,8 +87,8 @@
 #define NODE_NOT_IN_PAGE_FLAGS
 #endif
 
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NID_WIDTH == 0
-#define LAST_NID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
+#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
 #endif
 
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 45dcf51..2ab8fa0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -945,7 +945,7 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_nid, int node, int pages, bool migrated)
+void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
 	int priv;
@@ -957,8 +957,14 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!p->mm)
 		return;
 
-	/* For now, do not attempt to detect private/shared accesses */
-	priv = 1;
+	/*
+	 * First accesses are treated as private, otherwise consider accesses
+	 * to be private if the accessing pid has not changed
+	 */
+	if (!nidpid_pid_unset(last_nidpid))
+		priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+	else
+		priv = 1;
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9462591..c7f79dd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int target_nid, last_nid;
+	int target_nid, last_nidpid;
 	int src_nid = -1;
 	bool migrated;
 
@@ -1316,7 +1316,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (src_nid == page_to_nid(page))
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
-	last_nid = page_nid_last(page);
+	last_nidpid = page_nidpid_last(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		put_page(page);
@@ -1342,7 +1342,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!migrated)
 		goto check_same;
 
-	task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
+	task_numa_fault(last_nidpid, target_nid, HPAGE_PMD_NR, true);
 	return 0;
 
 check_same:
@@ -1357,7 +1357,7 @@ clear_pmdnuma:
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (src_nid != -1)
-		task_numa_fault(last_nid, src_nid, HPAGE_PMD_NR, false);
+		task_numa_fault(last_nidpid, src_nid, HPAGE_PMD_NR, false);
 	return 0;
 }
 
@@ -1649,7 +1649,7 @@ static void __split_huge_page_refcount(struct page *page)
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
-		page_nid_xchg_last(page_tail, page_nid_last(page));
+		page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index f4e3ad5..015574f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@
 
 #include "internal.h"
 
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nid.
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1, last_nid;
+	int current_nid = -1, last_nidpid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3571,7 +3571,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
-	last_nid = page_nid_last(page);
+	last_nidpid = page_nidpid_last(page);
 	current_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3592,7 +3592,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(last_nid, current_nid, 1, migrated);
+		task_numa_fault(last_nidpid, current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3608,7 +3608,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
-	int last_nid;
+	int last_nidpid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3658,7 +3658,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * migrated to.
 		 */
 		curr_nid = local_nid;
-		last_nid = page_nid_last(page);
+		last_nidpid = page_nidpid_last(page);
 		target_nid = numa_migrate_prep(page, vma, addr,
 					       page_to_nid(page));
 		if (target_nid == -1) {
@@ -3671,7 +3671,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		migrated = migrate_misplaced_page(page, vma, target_nid);
 		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(last_nid, curr_nid, 1, migrated);
+		task_numa_fault(last_nidpid, curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..4669000 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2288,9 +2288,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int last_nid;
+		int last_nidpid;
+		int this_nidpid;
 
 		polnid = numa_node_id();
+		this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
 
 		/*
 		 * Multi-stage node selection is used in conjunction
@@ -2313,8 +2315,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * it less likely we act on an unlikely task<->page
 		 * relation.
 		 */
-		last_nid = page_nid_xchg_last(page, polnid);
-		if (last_nid != polnid)
+		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
+		if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
 			goto out;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 23f8122..01d653d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1478,7 +1478,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
 	if (newpage)
-		page_nid_xchg_last(newpage, page_nid_last(page));
+		page_nidpid_xchg_last(newpage, page_nidpid_last(page));
 
 	return newpage;
 }
@@ -1655,7 +1655,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	if (!new_page)
 		goto out_fail;
 
-	page_nid_xchg_last(new_page, page_nid_last(page));
+	page_nidpid_xchg_last(new_page, page_nidpid_last(page));
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index c280a02..eecdc64 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -69,26 +69,26 @@ void __init mminit_verify_pageflags_layout(void)
 	unsigned long or_mask, add_mask;
 
 	shift = 8 * sizeof(unsigned long);
-	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NID_SHIFT;
+	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastnid %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
-		LAST_NID_WIDTH,
+		LAST_NIDPID_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
-		"Section %d Node %d Zone %d Lastnid %d\n",
+		"Section %d Node %d Zone %d Lastnidpid %d\n",
 		SECTIONS_SHIFT,
 		NODES_SHIFT,
 		ZONES_SHIFT,
-		LAST_NID_SHIFT);
+		LAST_NIDPID_SHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
-		"Section %lu Node %lu Zone %lu Lastnid %lu\n",
+		"Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
 		(unsigned long)SECTIONS_PGSHIFT,
 		(unsigned long)NODES_PGSHIFT,
 		(unsigned long)ZONES_PGSHIFT,
-		(unsigned long)LAST_NID_PGSHIFT);
+		(unsigned long)LAST_NIDPID_PGSHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
 		"Node/Zone ID: %lu -> %lu\n",
 		(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -100,9 +100,9 @@ void __init mminit_verify_pageflags_layout(void)
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
 		"Node not in page flags");
 #endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
-		"Last nid not in page flags");
+		"Last nidpid not in page flags");
 #endif
 
 	if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 2ac0afb..89b3b7e 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -98,19 +98,19 @@ void lruvec_init(struct lruvec *lruvec)
 }
 
 #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NID_NOT_IN_PAGE_FLAGS)
-int page_nid_xchg_last(struct page *page, int nid)
+int page_nidpid_xchg_last(struct page *page, int nidpid)
 {
 	unsigned long old_flags, flags;
-	int last_nid;
+	int last_nidpid;
 
 	do {
 		old_flags = flags = page->flags;
-		last_nid = page_nid_last(page);
+		last_nidpid = page_nidpid_last(page);
 
-		flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
-		flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+		flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+		flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
 	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
 
-	return last_nid;
+	return last_nidpid;
 }
 #endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index cacc64a..726e615 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,15 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa, bool *ret_all_same_node)
+		int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
-	bool all_same_node = true;
+	bool all_same_nidpid = true;
 	int last_nid = -1;
+	int last_pid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -64,10 +65,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
 					int this_nid = page_to_nid(page);
+					int nidpid = page_nidpid_last(page);
+					int this_pid = nidpid_to_pid(nidpid);
+					
 					if (last_nid == -1)
 						last_nid = this_nid;
-					if (last_nid != this_nid)
-						all_same_node = false;
+					if (last_pid == -1)
+						last_pid = this_pid;
+					if (last_nid != this_nid ||
+					    last_pid != this_pid) {
+						all_same_nidpid = false;
+					}
 
 					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
@@ -106,7 +114,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-	*ret_all_same_node = all_same_node;
+	*ret_all_same_nidpid = all_same_nidpid;
 	return pages;
 }
 
@@ -133,7 +141,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
-	bool all_same_node;
+	bool all_same_nidpid;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -151,7 +159,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		pages += change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa, &all_same_node);
+				 dirty_accountable, prot_numa, &all_same_nidpid);
 
 		/*
 		 * If we are changing protections for NUMA hinting faults then
@@ -159,7 +167,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && all_same_node)
+		if (prot_numa && all_same_nidpid)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..f7c9c0f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -613,7 +613,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	page_nid_reset_last(page);
+	page_nidpid_reset_last(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3910,7 +3910,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		page_mapcount_reset(page);
-		page_nid_reset_last(page);
+		page_nidpid_reset_last(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread


* [PATCH 16/16] sched: Select least loaded CPU on preferred NUMA node
  2013-07-11  9:46 ` Mel Gorman
@ 2013-07-11  9:47   ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11  9:47 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Mel Gorman

This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
find_idlest_cpu_node has two critical limitations. It does not take the
scheduling class into account when calculating the load, and it is unsuitable
for use when comparing loads between NUMA nodes.

task_numa_find_cpu uses similar load calculations to wake_affine() when
selecting the least loaded CPU within a scheduling domain common to the
source and destination nodes. Using that information to avoid overloading the
destination node relative to the source node is not implemented in this patch,
but it is a possible follow-up.
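
As a worked example of the load scaling used below, with all numbers assumed
purely for illustration (imbalance_pct 125, power_of() 1024, source load 2048,
destination load 1024, task weight 1024, and effective_load() reduced to the
raw weight change as in the non-group-scheduling case):

  src_eff_load = (100 + (125 - 100) / 2) * 1024 * (2048 - 1024)
               = 112 * 1024 * 1024 = 117440512
  dst_eff_load = 100 * 1024 * (1024 + 1024)
               = 100 * 1024 * 2048 = 209715200

The (imbalance_pct - 100) / 2 margin gives the source side a factor of 112
against the destination's flat 100, i.e. the source load is weighted 12% more
heavily in this example.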

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 96 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 73 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ab8fa0..aadff22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -841,29 +841,81 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
-static unsigned long weighted_cpuload(const int cpu);
-
-
-static int
-find_idlest_cpu_node(int this_cpu, int nid)
-{
-	unsigned long load, min_load = ULONG_MAX;
-	int i, idlest_cpu = this_cpu;
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
+
+static int task_numa_find_cpu(struct task_struct *p, int nid)
+{
+	int node_cpu = cpumask_first(cpumask_of_node(nid));
+	int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
+	unsigned long src_load, dst_load;
+	unsigned long min_load = ULONG_MAX;
+	struct task_group *tg = task_group(p);
+	s64 src_eff_load, dst_eff_load;
+	struct sched_domain *sd;
+	unsigned long weight;
+	int imbalance_pct, idx = -1;
 
-	BUG_ON(cpu_to_node(this_cpu) == nid);
+	/* No harm being optimistic */
+	if (idle_cpu(node_cpu))
+		return node_cpu;
 
+	/*
+	 * Find the lowest common scheduling domain covering the nodes of both
+	 * the CPU the task is currently running on and the target NUMA node.
+	 */
 	rcu_read_lock();
-	for_each_cpu(i, cpumask_of_node(nid)) {
-		load = weighted_cpuload(i);
-
-		if (load < min_load) {
-			min_load = load;
-			idlest_cpu = i;
+	for_each_domain(src_cpu, sd) {
+		if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
+			/*
+			 * busy_idx is used for the load decision as it is the
+			 * same index used by the regular load balancer for an
+			 * active cpu.
+			 */
+			idx = sd->busy_idx;
+			imbalance_pct = sd->imbalance_pct;
+			break;
 		}
 	}
 	rcu_read_unlock();
 
-	return idlest_cpu;
+	if (WARN_ON_ONCE(idx == -1))
+		return src_cpu;
+
+	/*
+	 * XXX the below is mostly nicked from wake_affine(); we should
+	 * see about sharing a bit if at all possible; also it might want
+	 * some per entity weight love.
+	 */
+	weight = p->se.load.weight;
+ 
+	src_load = source_load(src_cpu, idx);
+
+	src_eff_load = 100 + (imbalance_pct - 100) / 2;
+	src_eff_load *= power_of(src_cpu);
+	src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
+
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		dst_load = target_load(cpu, idx);
+
+		/* If the CPU is idle, use it */
+		if (!dst_load)
+			return dst_cpu;
+
+		/* Otherwise check the target CPU load */
+		dst_eff_load = 100;
+		dst_eff_load *= power_of(cpu);
+		dst_eff_load *= dst_load + effective_load(tg, cpu, weight, weight);
+
+		if (dst_load < min_load) {
+			min_load = dst_load;
+			dst_cpu = cpu;
+		}
+ 	}
+
+	return dst_cpu;
 }
 
 static inline int task_faults_idx(int nid, int priv)
@@ -916,14 +968,12 @@ static void task_numa_placement(struct task_struct *p)
 		int old_migrate_seq = p->numa_migrate_seq;
 
 		/*
-		 * If the task is not on the preferred node then find the most
-		 * idle CPU to migrate to.
+		 * If the task is not on the preferred node then find 
+		 * a suitable CPU to migrate to.
 		 */
 		preferred_cpu = task_cpu(p);
-		if (cpu_to_node(preferred_cpu) != max_nid) {
-			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
-							     max_nid);
-		}
+		if (cpu_to_node(preferred_cpu) != max_nid)
+			preferred_cpu = task_numa_find_cpu(p, max_nid);
 
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
@@ -3238,7 +3288,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 }
 #else
 
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
+static unsigned long effective_load(struct task_group *tg, int cpu,
 		unsigned long wl, unsigned long wg)
 {
 	return wl;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 04/16] mm: numa: Do not migrate or account for hinting faults on the zero page
  2013-07-11  9:46   ` Mel Gorman
@ 2013-07-11 11:21     ` Peter Zijlstra
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2013-07-11 11:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 11, 2013 at 10:46:48AM +0100, Mel Gorman wrote:
> +++ b/mm/memory.c
> @@ -3560,8 +3560,13 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	set_pte_at(mm, addr, ptep, pte);
>  	update_mmu_cache(vma, addr, ptep);
>  
> +	/*
> +	 * Do not account for faults against the huge zero page. The read-only

s/huge //

> +	 * data is likely to be read-cached on the local CPUs and it is less
> +	 * useful to know about local versus remote hits on the zero page.
> +	 */
>  	page = vm_normal_page(vma, addr, pte);
> -	if (!page) {
> +	if (!page || is_zero_pfn(page_to_pfn(page))) {
>  		pte_unmap_unlock(ptep, ptl);
>  		return 0;
>  	}
> -- 
> 1.8.1.4
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/16] sched: Select a preferred node with the most numa hinting faults
  2013-07-11  9:46   ` Mel Gorman
@ 2013-07-11 11:23     ` Peter Zijlstra
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2013-07-11 11:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 11, 2013 at 10:46:49AM +0100, Mel Gorman wrote:
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
>  	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
>  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
>  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> +	p->numa_preferred_nid = -1;

(1)

>  	p->numa_work.next = &p->numa_work;
>  	p->numa_faults = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 904fd6f..c0bee41 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -793,7 +793,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
>  
>  static void task_numa_placement(struct task_struct *p)
>  {
> -	int seq;
> +	int seq, nid, max_nid = 0;

Should you not start with max_nid = -1?

> +	unsigned long max_faults = 0;
>  
>  	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
>  		return;
> @@ -802,7 +803,19 @@ static void task_numa_placement(struct task_struct *p)
>  		return;
>  	p->numa_scan_seq = seq;
>  
> -	/* FIXME: Scheduling placement policy hints go here */
> +	/* Find the node with the highest number of faults */
> +	for (nid = 0; nid < nr_node_ids; nid++) {
> +		unsigned long faults = p->numa_faults[nid];
> +		p->numa_faults[nid] >>= 1;
> +		if (faults > max_faults) {
> +			max_faults = faults;
> +			max_nid = nid;
> +		}
> +	}
> +

It is rather unlikely; but suppose the entire ->numa_faults[] array is 0, you'd
somehow end up selecting max_nid := 0. Which seems inconsistent with \1.

> +	/* Update the tasks preferred node if necessary */
> +	if (max_faults && max_nid != p->numa_preferred_nid)
> +		p->numa_preferred_nid = max_nid;
>  }
>  
>  /*
> -- 
> 1.8.1.4
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 08/16] sched: Reschedule task on preferred NUMA node once selected
  2013-07-11  9:46   ` Mel Gorman
@ 2013-07-11 12:30     ` Peter Zijlstra
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2013-07-11 12:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 11, 2013 at 10:46:52AM +0100, Mel Gorman wrote:
> @@ -829,10 +854,29 @@ static void task_numa_placement(struct task_struct *p)
>  		}
>  	}
>  
> -	/* Update the tasks preferred node if necessary */
> +	/*
> +	 * Record the preferred node as the node with the most faults,
> +	 * requeue the task to be running on the idlest CPU on the
> +	 * preferred node and reset the scanning rate to recheck
> +	 * the working set placement.
> +	 */
>  	if (max_faults && max_nid != p->numa_preferred_nid) {
> +		int preferred_cpu;
> +
> +		/*
> +		 * If the task is not on the preferred node then find the most
> +		 * idle CPU to migrate to.
> +		 */
> +		preferred_cpu = task_cpu(p);
> +		if (cpu_to_node(preferred_cpu) != max_nid) {
> +			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
> +							     max_nid);
> +		}
> +
> +		/* Update the preferred nid and migrate task if possible */
>  		p->numa_preferred_nid = max_nid;
>  		p->numa_migrate_seq = 0;
> +		migrate_task_to(p, preferred_cpu);
>  	}
>  }

Now what happens if the migration fails? We set numa_preferred_nid to max_nid
but then never re-try the migration. Should we not re-try the migration every
so often, regardless of whether max_nid changed?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/16] sched: Select least loaded CPU on preferred NUMA node
  2013-07-11  9:47   ` Mel Gorman
@ 2013-07-11 12:39     ` Peter Zijlstra
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2013-07-11 12:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 11, 2013 at 10:47:00AM +0100, Mel Gorman wrote:
> +++ b/kernel/sched/fair.c
> @@ -841,29 +841,81 @@ static unsigned int task_scan_max(struct task_struct *p)
>   */
>  unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
>  
> +static unsigned long source_load(int cpu, int type);
> +static unsigned long target_load(int cpu, int type);
> +static unsigned long power_of(int cpu);
> +static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
> +
> +static int task_numa_find_cpu(struct task_struct *p, int nid)
> +{
> +	int node_cpu = cpumask_first(cpumask_of_node(nid));
> +	int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
> +	unsigned long src_load, dst_load;
> +	unsigned long min_load = ULONG_MAX;
> +	struct task_group *tg = task_group(p);
> +	s64 src_eff_load, dst_eff_load;
> +	struct sched_domain *sd;
> +	unsigned long weight;
> +	int imbalance_pct, idx = -1;
>  
> +	/* No harm being optimistic */
> +	if (idle_cpu(node_cpu))
> +		return node_cpu;
>  
> +	/*
> +	 * Find the lowest common scheduling domain covering the nodes of both
> +	 * the CPU the task is currently running on and the target NUMA node.
> +	 */
>  	rcu_read_lock();
> +	for_each_domain(src_cpu, sd) {
> +		if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
> +			/*
> +			 * busy_idx is used for the load decision as it is the
> +			 * same index used by the regular load balancer for an
> +			 * active cpu.
> +			 */
> +			idx = sd->busy_idx;
> +			imbalance_pct = sd->imbalance_pct;
> +			break;
>  		}
>  	}
>  	rcu_read_unlock();
>  
> +	if (WARN_ON_ONCE(idx == -1))
> +		return src_cpu;
> +
> +	/*
> +	 * XXX the below is mostly nicked from wake_affine(); we should
> +	 * see about sharing a bit if at all possible; also it might want
> +	 * some per entity weight love.
> +	 */
> +	weight = p->se.load.weight;
> + 
> +	src_load = source_load(src_cpu, idx);
> +
> +	src_eff_load = 100 + (imbalance_pct - 100) / 2;
> +	src_eff_load *= power_of(src_cpu);
> +	src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
> +
> +	for_each_cpu(cpu, cpumask_of_node(nid)) {
> +		dst_load = target_load(cpu, idx);
> +
> +		/* If the CPU is idle, use it */
> +		if (!dst_load)
> +			return dst_cpu;
> +
> +		/* Otherwise check the target CPU load */
> +		dst_eff_load = 100;
> +		dst_eff_load *= power_of(cpu);
> +		dst_eff_load *= dst_load + effective_load(tg, cpu, weight, weight);

So the missing part is:

		/*
		 * Do not allow the destination CPU to be loaded significantly
		 * more than the CPU we came from.
		 */
		if (dst_eff_load <= src_eff_load)
			continue;

> +
> +		if (dst_load < min_load) {
> +			min_load = dst_load;
> +			dst_cpu = cpu;
> +		}
> + 	}
> +
> +	return dst_cpu;
>  }

This is almost a big fat NOP. It did a scan for the least loaded cpu and now it
still does. It also doesn't cure the problem Srikar saw where we kept migrating
all tasks back to the one node with all the memory.

Task migration must be subject to fairness limits; otherwise there's nothing
avoiding heaping all tasks on a single pile.

One thing we could do to maybe relax things a little bit is take away the
effective_load() term in the src_eff_load() computation. That way we compare
the current src load to the future dst load, instead of using the future load
for both.

But we must put a limit on the migration.

^ permalink raw reply	[flat|nested] 58+ messages in thread
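
A quick numeric aside: the src_eff_load/dst_eff_load comparison being debated above is easier to follow with numbers plugged in. The small standalone C program below reproduces the arithmetic from the patch with made-up inputs (imbalance_pct = 125, a power_of() value of 1024, a nice-0 task weight); effective_load() is stubbed to the !CONFIG_FAIR_GROUP_SCHED fallback quoted at the end of the patch, so this only sketches the arithmetic, not the scheduler's real accounting.

#include <stdio.h>

#define SCHED_POWER_SCALE 1024L	/* illustrative stand-in for power_of() */

/* Matches the quoted !CONFIG_FAIR_GROUP_SCHED fallback: the group term is wl */
static long effective_load(long wl, long wg)
{
	(void)wg;
	return wl;
}

int main(void)
{
	long imbalance_pct = 125;	/* assumed typical sd->imbalance_pct */
	long weight = 1024;		/* assumed p->se.load.weight for nice-0 */
	long src_load = 2048;		/* two tasks' worth on the source CPU */
	long dst_load = 1024;		/* one task's worth on the candidate CPU */
	long long src_eff_load, dst_eff_load;

	/* The source side gets half the imbalance_pct as headroom: 100 + 25/2 */
	src_eff_load = 100 + (imbalance_pct - 100) / 2;
	src_eff_load *= SCHED_POWER_SCALE;
	src_eff_load *= src_load + effective_load(-weight, -weight);

	/* The destination side is taken at face value */
	dst_eff_load = 100;
	dst_eff_load *= SCHED_POWER_SCALE;
	dst_eff_load *= dst_load + effective_load(weight, weight);

	printf("src_eff_load = %lld, dst_eff_load = %lld\n",
	       src_eff_load, dst_eff_load);
	printf("destination looks %s loaded once the task's weight moves over\n",
	       dst_eff_load > src_eff_load ? "more" : "less");
	return 0;
}

With these inputs the candidate CPU ends up looking busier than the headroom-adjusted source once the task's weight is added to it; whether such a destination should then be skipped is the fairness question raised above.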

* Re: [PATCH 04/16] mm: numa: Do not migrate or account for hinting faults on the zero page
  2013-07-11 11:21     ` Peter Zijlstra
@ 2013-07-11 12:42       ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11 12:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 11, 2013 at 01:21:02PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2013 at 10:46:48AM +0100, Mel Gorman wrote:
> > +++ b/mm/memory.c
> > @@ -3560,8 +3560,13 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	set_pte_at(mm, addr, ptep, pte);
> >  	update_mmu_cache(vma, addr, ptep);
> >  
> > +	/*
> > +	 * Do not account for faults against the huge zero page. The read-only
> 
> s/huge //
> 

Whoops, thanks. Guess which comment I wrote first?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/16] sched: Select a preferred node with the most numa hinting faults
  2013-07-11 11:23     ` Peter Zijlstra
@ 2013-07-11 12:53       ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11 12:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 11, 2013 at 01:23:59PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2013 at 10:46:49AM +0100, Mel Gorman wrote:
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
> >  	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> >  	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> >  	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> > +	p->numa_preferred_nid = -1;
> 
> (1)
> 
> >  	p->numa_work.next = &p->numa_work;
> >  	p->numa_faults = NULL;
> >  #endif /* CONFIG_NUMA_BALANCING */
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 904fd6f..c0bee41 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -793,7 +793,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
> >  
> >  static void task_numa_placement(struct task_struct *p)
> >  {
> > -	int seq;
> > +	int seq, nid, max_nid = 0;
> 
> Should you not start with max_nid = -1?
> 

For consistency, yes.

> > +	unsigned long max_faults = 0;
> >  
> >  	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
> >  		return;
> > @@ -802,7 +803,19 @@ static void task_numa_placement(struct task_struct *p)
> >  		return;
> >  	p->numa_scan_seq = seq;
> >  
> > -	/* FIXME: Scheduling placement policy hints go here */
> > +	/* Find the node with the highest number of faults */
> > +	for (nid = 0; nid < nr_node_ids; nid++) {
> > +		unsigned long faults = p->numa_faults[nid];
> > +		p->numa_faults[nid] >>= 1;
> > +		if (faults > max_faults) {
> > +			max_faults = faults;
> > +			max_nid = nid;
> > +		}
> > +	}
> > +
> 
> It is rather unlikely; but suppose the entire ->numa_faults[] array is 0, you'd
> somehow end up selecting max_nid := 0. Which seems inconsistent with \1.
> 

This happens more often than you might imagine due to THP false sharing
getting counted as shared. If the entire numa_faults[] array is 0 then
max_faults == 0 and it will not update preferred_nid due to the
"max_faults" check below.

> > +	/* Update the tasks preferred node if necessary */
> > +	if (max_faults && max_nid != p->numa_preferred_nid)
> > +		p->numa_preferred_nid = max_nid;
> >  }
> >  

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread
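
As an aside, the guard Mel points to is easy to see in a condensed, stand-alone form. The sketch below mirrors the two quoted hunks (the selection loop plus the max_faults check), with max_nid starting at -1 as agreed above; NR_NODES and the fault counts are made-up stand-ins so it compiles outside the kernel.

#include <stdio.h>

#define NR_NODES 4

/* Mimics the loop in task_numa_placement() plus the max_faults guard */
static void pick_preferred_node(unsigned long *faults, int *preferred_nid)
{
	unsigned long max_faults = 0;
	int nid, max_nid = -1;

	for (nid = 0; nid < NR_NODES; nid++) {
		unsigned long f = faults[nid];

		faults[nid] >>= 1;	/* decay, as in the patch */
		if (f > max_faults) {
			max_faults = f;
			max_nid = nid;
		}
	}

	/* An all-zero array leaves max_faults at 0, so the nid is untouched */
	if (max_faults && max_nid != *preferred_nid)
		*preferred_nid = max_nid;
}

int main(void)
{
	unsigned long no_faults[NR_NODES] = { 0, 0, 0, 0 };
	unsigned long some_faults[NR_NODES] = { 3, 0, 42, 7 };
	int nid = -1;

	pick_preferred_node(no_faults, &nid);
	printf("after an all-zero scan: preferred nid = %d\n", nid);
	pick_preferred_node(some_faults, &nid);
	printf("after real faults:      preferred nid = %d\n", nid);
	return 0;
}

The first call leaves the preferred nid at -1 and the second moves it to node 2, which is the behaviour described in the reply above.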

* Re: [PATCH 08/16] sched: Reschedule task on preferred NUMA node once selected
  2013-07-11 12:30     ` Peter Zijlstra
@ 2013-07-11 13:03       ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 11, 2013 at 02:30:38PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2013 at 10:46:52AM +0100, Mel Gorman wrote:
> > @@ -829,10 +854,29 @@ static void task_numa_placement(struct task_struct *p)
> >  		}
> >  	}
> >  
> > -	/* Update the tasks preferred node if necessary */
> > +	/*
> > +	 * Record the preferred node as the node with the most faults,
> > +	 * requeue the task to be running on the idlest CPU on the
> > +	 * preferred node and reset the scanning rate to recheck
> > +	 * the working set placement.
> > +	 */
> >  	if (max_faults && max_nid != p->numa_preferred_nid) {
> > +		int preferred_cpu;
> > +
> > +		/*
> > +		 * If the task is not on the preferred node then find the most
> > +		 * idle CPU to migrate to.
> > +		 */
> > +		preferred_cpu = task_cpu(p);
> > +		if (cpu_to_node(preferred_cpu) != max_nid) {
> > +			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
> > +							     max_nid);
> > +		}
> > +
> > +		/* Update the preferred nid and migrate task if possible */
> >  		p->numa_preferred_nid = max_nid;
> >  		p->numa_migrate_seq = 0;
> > +		migrate_task_to(p, preferred_cpu);
> >  	}
> >  }
> 
> Now what happens if the migration fails? We set numa_preferred_nid to max_nid
> but then never re-try the migration. Should we not re-try the migration every
> so often, regardless of whether max_nid changed?

We do this

load_balance
-> active_load_balance_cpu_stop
  -> move_one_task
    -> can_migrate_task
      -> migrate_improves_locality

If the conditions are right then it'll move the task to the preferred node
for a number of PTE scans. Of course there is no guarantee that the necessary
conditions will occur but I was wary of taking more drastic steps in the
scheduler such as retrying on every fault until the migration succeeds.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread
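
For context, the call chain above ends in migrate_improves_locality(), which is what eventually pulls the task over during a regular load balance. The fragment below is only a rough, compilable sketch of the decision that check is described as making in this series (prefer the task's preferred node, otherwise favour a node with more recorded hinting faults, and leave settled tasks alone); the helper name and parameter list are paraphrased for illustration, not the posted code.

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the locality test consulted from can_migrate_task(). Plain
 * values are passed in here; in the kernel they come from the task and
 * the load-balancing environment.
 */
static bool improves_locality(int src_nid, int dst_nid, int preferred_nid,
			      unsigned long src_faults, unsigned long dst_faults,
			      int migrate_seq, int settle_count)
{
	/* Nothing to gain from a move within the same node */
	if (src_nid == dst_nid)
		return false;

	/* Leave tasks alone once they have settled on a node for a while */
	if (migrate_seq >= settle_count)
		return false;

	/* Pulling the task onto its preferred node is treated as a win */
	if (dst_nid == preferred_nid)
		return true;

	/* Otherwise favour the node that has recorded more hinting faults */
	return dst_faults > src_faults;
}

int main(void)
{
	/* Task prefers node 1 and was recently rescheduled (migrate_seq 0) */
	printf("pull to preferred node:    %d\n",
	       improves_locality(0, 1, 1, 10, 50, 0, 3));
	printf("pull to higher-fault node: %d\n",
	       improves_locality(0, 2, 1, 10, 30, 0, 3));
	printf("task already settled:      %d\n",
	       improves_locality(0, 1, 1, 10, 50, 3, 3));
	return 0;
}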

* Re: [PATCH 08/16] sched: Reschedule task on preferred NUMA node once selected
  2013-07-11 13:03       ` Mel Gorman
@ 2013-07-11 13:11         ` Peter Zijlstra
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2013-07-11 13:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 11, 2013 at 02:03:22PM +0100, Mel Gorman wrote:
> On Thu, Jul 11, 2013 at 02:30:38PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 11, 2013 at 10:46:52AM +0100, Mel Gorman wrote:
> > > @@ -829,10 +854,29 @@ static void task_numa_placement(struct task_struct *p)
> > >  		}
> > >  	}
> > >  
> > > -	/* Update the tasks preferred node if necessary */
> > > +	/*
> > > +	 * Record the preferred node as the node with the most faults,
> > > +	 * requeue the task to be running on the idlest CPU on the
> > > +	 * preferred node and reset the scanning rate to recheck
> > > +	 * the working set placement.
> > > +	 */
> > >  	if (max_faults && max_nid != p->numa_preferred_nid) {
> > > +		int preferred_cpu;
> > > +
> > > +		/*
> > > +		 * If the task is not on the preferred node then find the most
> > > +		 * idle CPU to migrate to.
> > > +		 */
> > > +		preferred_cpu = task_cpu(p);
> > > +		if (cpu_to_node(preferred_cpu) != max_nid) {
> > > +			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
> > > +							     max_nid);
> > > +		}
> > > +
> > > +		/* Update the preferred nid and migrate task if possible */
> > >  		p->numa_preferred_nid = max_nid;
> > >  		p->numa_migrate_seq = 0;
> > > +		migrate_task_to(p, preferred_cpu);
> > >  	}
> > >  }
> > 
> > Now what happens if the migration fails? We set numa_preferred_nid to max_nid
> > but then never re-try the migration. Should we not re-try the migration every
> > so often, regardless of whether max_nid changed?
> 
> We do this
> 
> load_balance
> -> active_load_balance_cpu_stop

Note that active balance is rare to begin with.

>   -> move_one_task
>     -> can_migrate_task
>       -> migrate_improves_locality
> 
> If the conditions are right then it'll move the task to the preferred node
> for a number of PTE scans. Of course there is no guarantee that the necessary
> conditions will occur but I was wary of taking more drastic steps in the
> scheduler such as retrying on every fault until the migration succeeds.
> 

Ah, so task_numa_placement() is only called every full scan, not every fault.
Also one could throttle it.

So initially I did all the movement through the regular balancer, but Ingo
found that when the machine grows it quickly becomes unlikely we hit the right
conditions. Hence he also went to direct migrations in his series.

Another thing we might consider is counting the number of migration attempts
and settling for the n-th best node for the n'th attempt and giving up when n
surpasses the quality of the node we're currently on.

^ permalink raw reply	[flat|nested] 58+ messages in thread
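
The n-th-best-node idea is easiest to picture with a toy routine. The sketch below ranks nodes by fault count and, on the n-th attempt, hands back the n-th best candidate, giving up once that candidate would be no better than the node the task is already on. It is purely illustrative; nothing like it exists in the posted series.

#include <stdio.h>

#define NR_NODES 4

/*
 * Toy version of "settle for the n-th best node on the n-th attempt":
 * returns the node with the (attempt+1)-th highest fault count, or -1
 * once that candidate is no better than the node we are already on.
 */
static int nth_best_node(const unsigned long *faults, int cur_nid, int attempt)
{
	int order[NR_NODES];
	int i, j;

	/* Order node ids by descending fault count (tiny selection sort) */
	for (i = 0; i < NR_NODES; i++)
		order[i] = i;
	for (i = 0; i < NR_NODES; i++) {
		for (j = i + 1; j < NR_NODES; j++) {
			if (faults[order[j]] > faults[order[i]]) {
				int tmp = order[i];

				order[i] = order[j];
				order[j] = tmp;
			}
		}
	}

	if (attempt >= NR_NODES)
		return -1;

	/* Give up when the candidate is no better than where we already are */
	if (faults[order[attempt]] <= faults[cur_nid])
		return -1;

	return order[attempt];
}

int main(void)
{
	unsigned long faults[NR_NODES] = { 5, 40, 25, 10 };
	int attempt;

	for (attempt = 0; attempt < NR_NODES; attempt++)
		printf("attempt %d -> node %d\n", attempt,
		       nth_best_node(faults, 0, attempt));
	return 0;
}

Run against the sample counts, a task on node 0 would be offered nodes 1, 2 and 3 in turn and would stop trying on the fourth attempt.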

* Re: [PATCH 16/16] sched: Select least loaded CPU on preferred NUMA node
  2013-07-11 12:39     ` Peter Zijlstra
@ 2013-07-11 13:24       ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11 13:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 11, 2013 at 02:39:02PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2013 at 10:47:00AM +0100, Mel Gorman wrote:
> > +++ b/kernel/sched/fair.c
> > @@ -841,29 +841,81 @@ static unsigned int task_scan_max(struct task_struct *p)
> >   */
> >  unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
> >  
> > +static unsigned long source_load(int cpu, int type);
> > +static unsigned long target_load(int cpu, int type);
> > +static unsigned long power_of(int cpu);
> > +static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
> > +
> > +static int task_numa_find_cpu(struct task_struct *p, int nid)
> > +{
> > +	int node_cpu = cpumask_first(cpumask_of_node(nid));
> > +	int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
> > +	unsigned long src_load, dst_load;
> > +	unsigned long min_load = ULONG_MAX;
> > +	struct task_group *tg = task_group(p);
> > +	s64 src_eff_load, dst_eff_load;
> > +	struct sched_domain *sd;
> > +	unsigned long weight;
> > +	int imbalance_pct, idx = -1;
> >  
> > +	/* No harm being optimistic */
> > +	if (idle_cpu(node_cpu))
> > +		return node_cpu;
> >  
> > +	/*
> > +	 * Find the lowest common scheduling domain covering the nodes of both
> > +	 * the CPU the task is currently running on and the target NUMA node.
> > +	 */
> >  	rcu_read_lock();
> > +	for_each_domain(src_cpu, sd) {
> > +		if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
> > +			/*
> > +			 * busy_idx is used for the load decision as it is the
> > +			 * same index used by the regular load balancer for an
> > +			 * active cpu.
> > +			 */
> > +			idx = sd->busy_idx;
> > +			imbalance_pct = sd->imbalance_pct;
> > +			break;
> >  		}
> >  	}
> >  	rcu_read_unlock();
> >  
> > +	if (WARN_ON_ONCE(idx == -1))
> > +		return src_cpu;
> > +
> > +	/*
> > +	 * XXX the below is mostly nicked from wake_affine(); we should
> > +	 * see about sharing a bit if at all possible; also it might want
> > +	 * some per entity weight love.
> > +	 */
> > +	weight = p->se.load.weight;
> > + 
> > +	src_load = source_load(src_cpu, idx);
> > +
> > +	src_eff_load = 100 + (imbalance_pct - 100) / 2;
> > +	src_eff_load *= power_of(src_cpu);
> > +	src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
> > +
> > +	for_each_cpu(cpu, cpumask_of_node(nid)) {
> > +		dst_load = target_load(cpu, idx);
> > +
> > +		/* If the CPU is idle, use it */
> > +		if (!dst_load)
> > +			return dst_cpu;
> > +
> > +		/* Otherwise check the target CPU load */
> > +		dst_eff_load = 100;
> > +		dst_eff_load *= power_of(cpu);
> > +		dst_eff_load *= dst_load + effective_load(tg, cpu, weight, weight);
> 
> So the missing part is:
> 
> 		/*
> 		 * Do not allow the destination CPU to be loaded significantly
> 		 * more than the CPU we came from.
> 		 */
> 		if (dst_eff_load <= src_eff_load)
> 			continue;
> 

Yes, the results with the patch included. I decided to punt it for now as I
expected that fixing false sharing detection would mitigate the problem and
the requirement of the patch would be reduced. That said, the comparison
also had another patch in the middle that was dropped before release so
I'll retest in isolation.

> > +
> > +		if (dst_load < min_load) {
> > +			min_load = dst_load;
> > +			dst_cpu = cpu;
> > +		}
> > + 	}
> > +
> > +	return dst_cpu;
> >  }
> 
> This is almost a big fat NOP. It did a scan for the least loaded cpu and now it
> still does.

This version makes more sense and does not fall apart just because the
number of NUMA tasks running happens to exceed the number of available CPUs.

> It also doesn't cure the problem Srikar saw where we kept migrating
> all tasks back to the one node with all the memory.
> 

No, it doesn't. That problem is still there.

> Task migration must be subject to fairness limits; otherwise there's nothing
> avoiding heaping all tasks on a single pile.
> 
> One thing we could do to maybe relax things a little bit is take away the
> effective_load() term in the src_eff_load() computation. That way we compare
> the current src load to the future dst load, instead of using the future load
> for both.
> 

I'll try that and get back to you.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 08/16] sched: Reschedule task on preferred NUMA node once selected
  2013-07-11 13:11         ` Peter Zijlstra
@ 2013-07-11 14:09           ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-11 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 11, 2013 at 03:11:58PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2013 at 02:03:22PM +0100, Mel Gorman wrote:
> > On Thu, Jul 11, 2013 at 02:30:38PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 11, 2013 at 10:46:52AM +0100, Mel Gorman wrote:
> > > > @@ -829,10 +854,29 @@ static void task_numa_placement(struct task_struct *p)
> > > >  		}
> > > >  	}
> > > >  
> > > > -	/* Update the tasks preferred node if necessary */
> > > > +	/*
> > > > +	 * Record the preferred node as the node with the most faults,
> > > > +	 * requeue the task to be running on the idlest CPU on the
> > > > +	 * preferred node and reset the scanning rate to recheck
> > > > +	 * the working set placement.
> > > > +	 */
> > > >  	if (max_faults && max_nid != p->numa_preferred_nid) {
> > > > +		int preferred_cpu;
> > > > +
> > > > +		/*
> > > > +		 * If the task is not on the preferred node then find the most
> > > > +		 * idle CPU to migrate to.
> > > > +		 */
> > > > +		preferred_cpu = task_cpu(p);
> > > > +		if (cpu_to_node(preferred_cpu) != max_nid) {
> > > > +			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
> > > > +							     max_nid);
> > > > +		}
> > > > +
> > > > +		/* Update the preferred nid and migrate task if possible */
> > > >  		p->numa_preferred_nid = max_nid;
> > > >  		p->numa_migrate_seq = 0;
> > > > +		migrate_task_to(p, preferred_cpu);
> > > >  	}
> > > >  }
> > > 
> > > Now what happens if the migration fails? We set numa_preferred_nid to max_nid
> > > but then never re-try the migration. Should we not re-try the migration every
> > > so often, regardless of whether max_nid changed?
> > 
> > We do this
> > 
> > load_balance
> > -> active_load_balance_cpu_stop
> 
> Note that active balance is rare to begin with.
> 

Yeah. I was not sure how rare exactly but it is what motivated the
introduction of migrate_task_to in the first place. I actually have no
idea how often the migration fails. I did not check for it.

> >   -> move_one_task
> >     -> can_migrate_task
> >       -> migrate_improves_locality
> > 
> > If the conditions are right then it'll move the task to the preferred node
> > for a number of PTE scans. Of course there is no guarantee that the necessary
> > conditions will occur but I was wary of taking more drastic steps in the
> > scheduler such as retrying on every fault until the migration succeeds.
> > 
> 
> Ah, so task_numa_placement() is only called every full scan, not every fault.
> Also one could throttle it.
> 
> So initially I did all the movement through the regular balancer, but Ingo
> found that when the machine grows it quickly becomes unlikely we hit the right
> conditions. Hence he also went to direct migrations in his series.
> 

I wanted to avoid aggressive scheduling decisions until after the false
sharing detection stuff was solid.

> Another thing we might consider is counting the number of migration attempts
> and settling for the n-th best node for the n'th attempt and giving up when n
> surpasses the quality of the node we're currently on.

That might be necessary when the machine is overloaded. As a
starting point the following should retry the migration a number of times
until success. The retry is checked on every fault but should not fire
more than once every 100ms.

Compile tested only

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d44fbc6..454ad2e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1505,6 +1505,7 @@ struct task_struct {
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
 	unsigned int numa_scan_period_max;
+	unsigned long numa_migrate_retry;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f2b37e01..a5b6b01 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -934,6 +934,20 @@ static inline int task_faults_idx(int nid, int priv)
 	return 2 * nid + priv;
 }
 
+/* Attempt to migrate a task to a CPU on the preferred node */
+static void numa_migrate_preferred(struct task_struct *p)
+{
+	int preferred_cpu = task_cpu(p);
+
+	p->numa_migrate_retry = 0;
+	if (cpu_to_node(preferred_cpu) != p->numa_preferred_nid) {
+		preferred_cpu = task_numa_find_cpu(p, p->numa_preferred_nid);
+
+		if (!migrate_task_to(p, preferred_cpu))
+			p->numa_migrate_retry = jiffies + HZ/10;
+	}
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -975,21 +989,12 @@ static void task_numa_placement(struct task_struct *p)
 	 * the working set placement.
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
-		int preferred_cpu;
 		int old_migrate_seq = p->numa_migrate_seq;
 
-		/*
-		 * If the task is not on the preferred node then find 
-		 * a suitable CPU to migrate to.
-		 */
-		preferred_cpu = task_cpu(p);
-		if (cpu_to_node(preferred_cpu) != max_nid)
-			preferred_cpu = task_numa_find_cpu(p, max_nid);
-
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 0;
-		migrate_task_to(p, preferred_cpu);
+		numa_migrate_preferred(p);
 
 		/*
 		 * If preferred nodes changes frequently then the scan rate
@@ -1050,6 +1055,10 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 
 	task_numa_placement(p);
 
+	/* Retry task to preferred node migration if it previously failed */
+	if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+		numa_migrate_preferred(p);
+
 	/* Record the fault, double the weight if pages were migrated */
 	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
 }

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 08/16] sched: Reschedule task on preferred NUMA node once selected
  2013-07-11 14:09           ` Mel Gorman
@ 2013-07-12 10:14             ` Peter Zijlstra
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2013-07-12 10:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Thu, Jul 11, 2013 at 03:09:14PM +0100, Mel Gorman wrote:
> That might be necessary when the machine is overloaded. As a
> starting point the following should retry the migrate a number of times
> until success. The retry is checked on every fault but should not fire
> more than once every 100ms.
 
Yeah, something like that might work. But getting a working imbalance bound is
important. The current very weak direct migration is the only thing that keeps
the system from massively skewing load.
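
A bound of the sort being described might look roughly like the sketch
below. It is illustrative only and not code from this series; the src/dst
load figures and the imbalance_pct threshold are hypothetical stand-ins
for whatever the load balancer actually tracks.

static bool numa_migrate_within_bound(unsigned long src_load,
				      unsigned long dst_load,
				      unsigned long task_load,
				      unsigned int imbalance_pct)
{
	unsigned long new_src, new_dst;

	/* Stale or inconsistent numbers; refuse and be conservative */
	if (task_load > src_load)
		return false;

	/* Loads as they would look after moving the task */
	new_src = src_load - task_load;
	new_dst = dst_load + task_load;

	/*
	 * Allow the move only if the destination ends up no more than
	 * imbalance_pct percent busier than the source, e.g. with
	 * imbalance_pct == 125 the destination may carry at most 25%
	 * more load than the source.
	 */
	return new_dst * 100 <= new_src * imbalance_pct;
}

With imbalance_pct at something like 125, a migration toward the preferred
node gets refused once it would leave the destination noticeably busier
than the source, which is the kind of brake on load skew being asked for
here.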

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 08/16] sched: Reschedule task on preferred NUMA node once selected
  2013-07-12 10:14             ` Peter Zijlstra
@ 2013-07-12 10:28               ` Mel Gorman
  -1 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2013-07-12 10:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Fri, Jul 12, 2013 at 12:14:08PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2013 at 03:09:14PM +0100, Mel Gorman wrote:
> > That might be necessary when the machine is overloaded. As a
> > starting point the following should retry the migrate a number of times
> > until success. The retry is checked on every fault but should not fire
> > more than once every 100ms.
>  
> Yeah, something like that might work. But getting a working imbalance bound is
> important. The current very weak direct migration is the only thing that keeps
> the system from massively skewing load.

I've corrected a flaw in that proposed patch, pulled in the check for
dst load vs src load, and am adding another patch that will fall back to
less preferred nodes if task_numa_find_cpu fails to find a suitable CPU.
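
For illustration, the fallback could be built around a next-best-node
selector along these lines. This is a sketch only, not the actual
follow-up patch: task_faults() is a hypothetical helper summing a node's
private and shared fault counters, and the caller is assumed to loop,
handing each candidate to task_numa_find_cpu()/migrate_task_to() until
one of them succeeds.

/*
 * Sketch only: pick the untried node with the most recorded NUMA
 * hinting faults.  task_faults() is a hypothetical per-node fault
 * summing helper, not something defined in this series.
 */
static int numa_next_best_node(struct task_struct *p, nodemask_t *tried)
{
	int nid, best_nid = -1;
	unsigned long faults, best_faults = 0;

	for_each_online_node(nid) {
		if (node_isset(nid, *tried))
			continue;

		faults = task_faults(p, nid);
		if (faults > best_faults) {
			best_faults = faults;
			best_nid = nid;
		}
	}

	/* Mark the winner so the next call moves on to a worse node */
	if (best_nid != -1)
		node_set(best_nid, *tried);

	return best_nid;
}

Giving up once the best remaining node looks no better than the node the
task is currently on would cover the "n surpasses the quality of the node
we're currently on" part of the earlier suggestion.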

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2013-07-12 10:28 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-11  9:46 [PATCH 0/16] Basic scheduler support for automatic NUMA balancing V4 Mel Gorman
2013-07-11  9:46 ` Mel Gorman
2013-07-11  9:46 ` [PATCH 01/16] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11  9:46 ` [PATCH 02/16] sched: Track NUMA hinting faults on per-node basis Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11  9:46 ` [PATCH 03/16] mm: numa: Account for THP numa hinting faults on the correct node Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11  9:46 ` [PATCH 04/16] mm: numa: Do not migrate or account for hinting faults on the zero page Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11 11:21   ` Peter Zijlstra
2013-07-11 11:21     ` Peter Zijlstra
2013-07-11 12:42     ` Mel Gorman
2013-07-11 12:42       ` Mel Gorman
2013-07-11  9:46 ` [PATCH 05/16] sched: Select a preferred node with the most numa hinting faults Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11 11:23   ` Peter Zijlstra
2013-07-11 11:23     ` Peter Zijlstra
2013-07-11 12:53     ` Mel Gorman
2013-07-11 12:53       ` Mel Gorman
2013-07-11  9:46 ` [PATCH 06/16] sched: Update NUMA hinting faults once per scan Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11  9:46 ` [PATCH 07/16] sched: Favour moving tasks towards the preferred node Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11  9:46 ` [PATCH 08/16] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11 12:30   ` Peter Zijlstra
2013-07-11 12:30     ` Peter Zijlstra
2013-07-11 13:03     ` Mel Gorman
2013-07-11 13:03       ` Mel Gorman
2013-07-11 13:11       ` Peter Zijlstra
2013-07-11 13:11         ` Peter Zijlstra
2013-07-11 14:09         ` Mel Gorman
2013-07-11 14:09           ` Mel Gorman
2013-07-12 10:14           ` Peter Zijlstra
2013-07-12 10:14             ` Peter Zijlstra
2013-07-12 10:28             ` Mel Gorman
2013-07-12 10:28               ` Mel Gorman
2013-07-11  9:46 ` [PATCH 09/16] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11  9:46 ` [PATCH 10/16] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11  9:46 ` [PATCH 11/16] sched: Check current->mm before allocating NUMA faults Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11  9:46 ` [PATCH 12/16] sched: Set the scan rate proportional to the size of the task being scanned Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11  9:46 ` [PATCH 13/16] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11  9:46 ` [PATCH 14/16] sched: Remove check that skips small VMAs Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11  9:46 ` [PATCH 15/16] sched: Set preferred NUMA node based on number of private faults Mel Gorman
2013-07-11  9:46   ` Mel Gorman
2013-07-11  9:47 ` [PATCH 16/16] sched: Select least loaded CPU on preferred NUMA node Mel Gorman
2013-07-11  9:47   ` Mel Gorman
2013-07-11 12:39   ` Peter Zijlstra
2013-07-11 12:39     ` Peter Zijlstra
2013-07-11 13:24     ` Mel Gorman
2013-07-11 13:24       ` Mel Gorman
