* [RFC PATCH 00/43] Automatic NUMA Balancing V3
@ 2012-11-16 11:22 Mel Gorman
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

tldr: Benchmarkers, only test patches 1-35.

git tree: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v3r27

This is a large drop and a bit more rushed than I'd like, but delaying it
was not an option. It can be roughly considered to be in four major stages:

1. Basic foundation, very similar to what was in V1
2. Full PMD fault handling, rate limiting of migration, two-stage migration filter.
   This will migrate pages on a PTE or PMD level using just the current referencing
   CPU as a placement hint
3. TLB flush optimisations
4. CPU follows memory algorithm. Very broadly speaking the intention is that based
   on fault statistics a home node is identified and the process tries to remain
   on the home node. It's crude and a much more complete implementation is needed.

Very broadly speaking the most urgent TODOs that spring to mind are

1. Move change_prot_numa to be based on change_protection
2. Native THP migration
3. Mitigate TLB flush operations in try_to_unmap_one called from migration path
4. Tunable to enable/disable from command-line and at runtime. It should be completely
   disabled if the machine does not support NUMA.
5. Better load balancer integration (current is based on an old version of schednuma)
6. Fix/replace CPU follows algorithm. Current one is a broken port from autonuma, it's
   very expensive and migrations are excessive. Either autonuma, schednuma or something
   else needs to be rebased on top of this properly. The broken implementation gives
   an indication where all the different parts should be plumbed in.
7. Depending on what happens with 6, fold page struct additions into page->flags
8. Revisit MPOL_NOOP and MPOL_MF_LAZY
9. Other architecture support or at least validation that it could be made to work. I'm
   half-hoping that the PPC64 people are watching because they tend to be interested
   in this type of thing.
10. A review of all the conditionally compiled stuff. More of it could be compiled
   out if !CONFIG_NUMA or !CONFIG_BALANCE_NUMA.

In terms of benchmarking only patches 1-35 should be considered. Patches
36-43 implement a placement policy that I know is not working as planned at
the moment. Note that all my own benchmarking did *not* include patch 16
"mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now"
but that should not make a difference.

I'm leaving the RFC in place because patches 36-43 are incomplete.

Changelog since V2
  o Do not allocate from home node
  o Mostly remove pmd_numa handling for regular pmds
  o HOME policy will allocate from and migrate towards local node
  o Load balancer is more aggressive about moving tasks towards home node
  o Renames to sync up more with -tip version
  o Move pte handlers to generic code
  o Scanning rate starts at 100ms, system CPU usage expected to increase
  o Handle migration of PMD hinting faults
  o Rate limit migration on a per-node basis
  o Alter how the rate of PTE scanning is adapted
  o Rate limit setting of pte_numa if node is congested
  o Only flush local TLB if unmapping a pte_numa page
  o Only consider one CPU in cpu follow algorithm

Changelog since V1
  o Account for faults on the correct node after migration
  o Do not account for THP splits as faults.
  o Account THP faults on the node they occurred on
  o Ensure preferred_node_policy is initialised before use
  o Mitigate double faults
  o Add home-node logic
  o Add some tlb-flush mitigation patches
  o Add variation of CPU follows memory algorithm
  o Add last_nid and use it as a two-stage filter before migrating pages
  o Restart the PTE scanner when it reaches the end of the address space
  o Lots of stuff I did not note properly

There are currently two competing approaches to implement support for
automatically migrating pages to optimise NUMA locality. Performance results
are available for both but review highlighted different problems in both.
They are not compatible with each other even though some fundamental
mechanics should have been the same.  This series addresses part of the
integration and sharing problem by implementing a foundation that either
the policy for schednuma or autonuma can be rebased on.

The initial policy it implements is a very basic greedy policy called
"Migrate On Reference Of pte_numa Node (MORON)" and is later replaced by
a variation of the home-node policy and renamed.  I expect to build upon
this revised policy and rename it to something more sensible that reflects
what it means.

In terms of building on top of the foundation the ideal would be that
patches affect one of the following areas although obviously that will
not always be possible

1. The PTE update helper functions
2. The PTE scanning machinery driven from task_numa_tick
3. Task and process fault accounting and how that information is used
   to determine if a page is misplaced
4. Fault handling, migrating the page if misplaced, what information is
   provided to the placement policy
5. Scheduler and load balancing


Patches 1-3 move some vmstat counters so that migrated pages get accounted
	for. In the past the primary user of migration was compaction but
	if pages are to migrate for NUMA optimisation then the counters
	need to be generally useful.

Patch 4 defines an arch-specific PTE bit called _PAGE_NUMA that is used
	to trigger faults later in the series. A placement policy is expected
	to use these faults to determine if a page should migrate.  On x86,
	the bit is the same as _PAGE_PROTNONE but other architectures
	may differ.
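
	To make the intent concrete, a rough sketch of the x86 arrangement
	(illustrative only, not the literal patch):

	/*
	 * Sketch: on x86 the NUMA hinting bit reuses the PROTNONE bit. A
	 * pte_numa PTE has _PAGE_NUMA set and _PAGE_PRESENT cleared, so the
	 * kernel still knows where the page is but a userspace access traps
	 * into the fault handler where the hint can be acted on.
	 */
	#define _PAGE_NUMA	_PAGE_PROTNONE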

Patches 5-8 define pte_numa, pmd_numa, pte_mknuma, pte_mknonnuma and
	friends. They implement them for x86, handle GUP and preserve
	the _PAGE_NUMA bit across THP splits.
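
	Very roughly, the helpers boil down to something like the sketch below
	using the usual x86 pte_flags()/pte_set_flags()/pte_clear_flags()
	accessors (illustrative, not the literal patch):

	static inline int pte_numa(pte_t pte)
	{
		/* NUMA hinting PTE: _PAGE_NUMA set, _PAGE_PRESENT cleared */
		return (pte_flags(pte) & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
	}

	static inline pte_t pte_mknuma(pte_t pte)
	{
		/* Make the PTE trap on the next access */
		pte = pte_set_flags(pte, _PAGE_NUMA);
		return pte_clear_flags(pte, _PAGE_PRESENT);
	}

	static inline pte_t pte_mknonnuma(pte_t pte)
	{
		/* Restore a normal, present PTE after the hinting fault */
		pte = pte_clear_flags(pte, _PAGE_NUMA);
		return pte_set_flags(pte, _PAGE_PRESENT | _PAGE_ACCESSED);
	}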

Patch 9 creates the fault handler for p[te|md]_numa PTEs and just clears
	them again.

Patch 10 adds a MPOL_LOCAL policy so applications can explicitly request the
	historical behaviour.

Patch 11 is premature but adds a MPOL_NOOP policy that can be used in
	conjunction with the LAZY flags introduced later in the series.

Patch 12 adds migrate_misplaced_page which is responsible for migrating
	a page to a new location.

Patch 13 migrates the page on fault if mpol_misplaced() says to do so.

Patch 14 updates the page fault handlers. Transparent huge pages are split.
	Pages pointed to by PTEs are migrated. Pages pointed to by PMDs
	are not properly handled until later in the series.

Patch 15 adds a MPOL_MF_LAZY mempolicy that an interested application can use.
	On the next reference the memory should be migrated to the node that
	references the memory.
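
	For illustration, the intended usage from userspace would look roughly
	like the sketch below. MPOL_NOOP and MPOL_MF_LAZY come from this
	series' uapi header and are hidden again by patch 16, so treat this
	purely as a sketch of the interface:

	#include <numaif.h>

	/*
	 * Sketch: ask the kernel to lazily migrate [addr, addr + len) to
	 * whichever node next touches it, without changing the policy
	 * itself. Error handling omitted.
	 */
	static void mark_lazy(void *addr, unsigned long len)
	{
		mbind(addr, len, MPOL_NOOP, NULL, 0, MPOL_MF_LAZY | MPOL_MF_MOVE);
	}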

Patch 16 notes that the MPOL_MF_LAZY and MPOL_NOOP flags have not been properly
	reviewed and there are no manual pages. They are removed for now and
	need to be revisited.

Patch 17 adds an arch flag for supporting balance numa

Patch 18 sets pte_numa within the context of the scheduler.

Patch 19 tries to avoid double faulting after migrating a page

Patches 20-22 note that marking PTEs pte_numa in a single pass has a number
	of disadvantages and instead incrementally update a limited range of
	the address space each tick.
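
	The shape of the scanner is roughly the sketch below; numa_scan_offset
	and the scan size are stand-in names, and in reality this is driven
	from task_numa_tick with mmap_sem held:

	/* Sketch of the incremental scanner; names are approximate. */
	static void task_numa_scan(struct mm_struct *mm, unsigned long scan_size)
	{
		unsigned long start = mm->numa_scan_offset;	/* assumed field */
		unsigned long end = start + scan_size;
		struct vm_area_struct *vma;

		for (vma = find_vma(mm, start); vma && start < end; vma = vma->vm_next) {
			/* Mark a limited window prot_numa so later accesses fault */
			change_prot_numa(vma, max(start, vma->vm_start),
					 min(end, vma->vm_end));
			start = vma->vm_end;
		}

		/* Resume here next tick; wrap when the end of the space is hit */
		mm->numa_scan_offset = vma ? end : 0;
	}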

Patch 23 adds some vmstats that can be used to approximate the cost of the
	scheduling policy in a more fine-grained fashion than looking at
	the system CPU usage.

Patch 24 implements the MORON policy. This is roughly where V1 of the series was.

Patch 25 properly handles the migration of pages faulted when handling a pmd
	numa hinting fault. This could be improved as it's a bit tangled
	to follow.


Patch 26 will only mark a PMD pmd_numa if many of the pages underneath are on
	the same node.
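
	The check is along these lines (an illustrative sketch; the threshold
	for "many" is a guess and the real code also has to cope with special
	mappings):

	/* Sketch: check whether most resident pages under a PMD share a node. */
	static bool pages_under_pmd_on_one_node(pte_t *pte, int *nid)
	{
		int i, matched = 0, last_nid = -1;

		for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
			if (!pte_present(*pte))
				continue;
			if (last_nid == -1)
				last_nid = page_to_nid(pte_page(*pte));
			if (page_to_nid(pte_page(*pte)) == last_nid)
				matched++;
		}

		*nid = last_nid;
		return matched > PTRS_PER_PTE / 2;	/* "many" == more than half */
	}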

Patches 27-29 rate-limit the number of pages being migrated and marked as pte_numa

Patch 30 slowly decreases the pte_numa update scanning rate

Patches 31-32 introduce last_nid and use it to build a two-stage filter
	that delays when a page gets migrated to avoid a situation where
	a task running temporarily off its home node forces a migration.
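
	The filter itself is tiny. A sketch of the idea, with
	page_xchg_last_nid() standing in for whatever the series calls the
	last_nid accessor:

	/*
	 * Sketch: only treat a page as misplaced if the same node references
	 * it twice in a row, so a task briefly scheduled off its home node
	 * does not drag memory around with it.
	 */
	static bool numa_filter_allows_migration(struct page *page, int this_nid)
	{
		int last_nid = page_xchg_last_nid(page, this_nid);

		/* First reference from this node: record it, do not migrate yet */
		return last_nid == this_nid;
	}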

Patches 33-35 bring in some TLB flush reduction patches. It was pointed
	out that try_to_unmap_one still incurs a TLB flush and this is true.
	An initial patch to cover this looked promising but was suspected of
	causing a stability issue. That was likely triggered by another
	corruption bug that has since been fixed, so the patch needs to be
	revisited.

Patches 36-39 introduce the concept of a home-node that the scheduler tries
	to keep processes on. It's advisory only and not particularly strict.
	There may be a problem with this whereby the load balancer is not
	pushing processes back to their home node because there are no
	idle CPUs available. It might need to be more aggressive about
	swapping two tasks that are both running off their home node.

Patch 40 implements a CPU follows memory policy that is roughly based on what
	was in autonuma. It builds statistics on faults on a per-task and
	per-mm basis and decides if a task's home node should be updated
	on that basis. It is basically broken at the moment, is far too
	heavy and results in bouncing but it serves as an illustration.
	It needs to be reworked significantly or reimplemented.

Patch 41 makes patch 40 slightly less expensive but still way too heavy

Patch 42 adapts the pte_numa scanning rates based on the placement policy.
	This also needs to be redone as it was found while writing this
	changelog that the system CPU cost of reducing the scanning rate
	is SEVERE. I kept the patch because it serves as a reminder that
	we should do something like this.

Some notes.

This still is missing a mechanism for disabling from the command-line.

Documentation is sorely missing at this point.

In the past I noticed from profiles that mutex_spin_on_owner()
is very high in the list. I do not have recent profiles but will run
something over the weekend.  The old observation was that on autonumabench
NUMA01_THREADLOCAL, the patches spend more time spinning in there and more
time in intel_idle implying that other users are waiting for the pte_numa
updates to complete. In the autonumabench cases, the other contender
could be khugepaged. In the specjbb case there is also a lot of spinning
and it could be due to the JVM calling mprotect(). One way or the other,
it needs to be pinned down whether the pte_numa updates are the problem
and, if so, how we might work around the requirement to hold mmap_sem
while the pte_numa update takes place.

Now the usual round of benchmarking! 8 kernels were considered, all based
on 3.7-rc4.

schednuma-v2r3		tip/sched/core + latest patches from Peter and Ingo
autonuma-v28fast	rebased autonuma-v28fast branch from Andrea
stats-v2r34		Patches 1-3 of this series
moron-v3r27		Patches 1-24. MORON policy (similar to v1 of series) 
twostage-v3r27		Patches 1-32. PMD handling, rate limiting, two-stage filter
lessflush-v3r27		Patches 1-35. TLB flush fixes on top
cpuone-v3r27		Patches 1-42. CPU follows algorithm
adaptscan-v3r27		Patches 1-43. Adaptive scanning

AUTONUMA BENCH
                                          3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                                rc4-stats-v2r34    rc4-schednuma-v2r3  rc4-autonuma-v28fast       rc4-moron-v3r27    rc4-twostage-v3r27   rc4-lessflush-v3r27      rc4-cpuone-v3r27   rc4-adaptscan-v3r27
User    NUMA01               67351.66 (  0.00%)    47146.57 ( 30.00%)    30273.64 ( 55.05%)    23514.02 ( 65.09%)    62299.74 (  7.50%)    66947.87 (  0.60%)    55683.74 ( 17.32%)    40591.96 ( 39.73%)
User    NUMA01_THEADLOCAL    54788.28 (  0.00%)    17198.99 ( 68.61%)    17039.73 ( 68.90%)    20074.86 ( 63.36%)    22192.46 ( 59.49%)    21008.74 ( 61.65%)    18174.40 ( 66.83%)    17027.78 ( 68.92%)
User    NUMA02                7179.87 (  0.00%)     2096.07 ( 70.81%)     2099.85 ( 70.75%)     2902.95 ( 59.57%)     2140.49 ( 70.19%)     2208.52 ( 69.24%)     1125.91 ( 84.32%)     1329.20 ( 81.49%)
User    NUMA02_SMT            3028.11 (  0.00%)      998.22 ( 67.03%)     1052.97 ( 65.23%)     1051.16 ( 65.29%)     1053.06 ( 65.22%)      969.17 ( 67.99%)      778.44 ( 74.29%)      936.55 ( 69.07%)
System  NUMA01                  45.68 (  0.00%)     3531.04 (-7629.95%)      423.91 (-828.00%)      723.05 (-1482.86%)     1548.99 (-3290.96%)     1903.18 (-4066.33%)     3762.31 (-8136.23%)     9143.26 (-19915.89%)
System  NUMA01_THEADLOCAL       40.92 (  0.00%)      926.72 (-2164.71%)      188.15 (-359.80%)      460.77 (-1026.03%)      685.06 (-1574.14%)      586.56 (-1333.43%)     1317.25 (-3119.09%)     4091.30 (-9898.29%)
System  NUMA02                   1.72 (  0.00%)       23.64 (-1274.42%)       27.37 (-1491.28%)       33.15 (-1827.33%)       70.41 (-3993.60%)       72.02 (-4087.21%)      156.47 (-8997.09%)      158.89 (-9137.79%)
System  NUMA02_SMT               0.92 (  0.00%)        8.18 (-789.13%)       18.43 (-1903.26%)       22.31 (-2325.00%)       41.63 (-4425.00%)       38.06 (-4036.96%)      101.56 (-10939.13%)       65.32 (-7000.00%)
Elapsed NUMA01                1514.61 (  0.00%)     1122.78 ( 25.87%)      722.66 ( 52.29%)      534.56 ( 64.71%)     1419.97 (  6.25%)     1532.43 ( -1.18%)     1339.58 ( 11.56%)     1242.21 ( 17.98%)
Elapsed NUMA01_THEADLOCAL     1264.08 (  0.00%)      393.79 ( 68.85%)      391.48 ( 69.03%)      471.07 ( 62.73%)      508.68 ( 59.76%)      487.97 ( 61.40%)      460.43 ( 63.58%)      531.53 ( 57.95%)
Elapsed NUMA02                 181.88 (  0.00%)       49.44 ( 72.82%)       61.55 ( 66.16%)       77.55 ( 57.36%)       60.96 ( 66.48%)       60.10 ( 66.96%)       56.96 ( 68.68%)       57.11 ( 68.60%)
Elapsed NUMA02_SMT             168.41 (  0.00%)       47.49 ( 71.80%)       54.72 ( 67.51%)       66.98 ( 60.23%)       57.56 ( 65.82%)       54.06 ( 67.90%)       58.04 ( 65.54%)       53.99 ( 67.94%)
CPU     NUMA01                4449.00 (  0.00%)     4513.00 ( -1.44%)     4247.00 (  4.54%)     4534.00 ( -1.91%)     4496.00 ( -1.06%)     4492.00 ( -0.97%)     4437.00 (  0.27%)     4003.00 ( 10.02%)
CPU     NUMA01_THEADLOCAL     4337.00 (  0.00%)     4602.00 ( -6.11%)     4400.00 ( -1.45%)     4359.00 ( -0.51%)     4497.00 ( -3.69%)     4425.00 ( -2.03%)     4233.00 (  2.40%)     3973.00 (  8.39%)
CPU     NUMA02                3948.00 (  0.00%)     4287.00 ( -8.59%)     3455.00 ( 12.49%)     3785.00 (  4.13%)     3626.00 (  8.16%)     3794.00 (  3.90%)     2251.00 ( 42.98%)     2605.00 ( 34.02%)
CPU     NUMA02_SMT            1798.00 (  0.00%)     2118.00 (-17.80%)     1957.00 ( -8.84%)     1602.00 ( 10.90%)     1901.00 ( -5.73%)     1862.00 ( -3.56%)     1516.00 ( 15.68%)     1855.00 ( -3.17%)


For NUMA01 moron-v3r27 does well but largely because it places things well
initially and then gets out of the way. The later patches in the series
do not cope as well. NUMA01 is an adverse workload and needs to be handled
better. The System CPU usage is high reflecting the migration it is doing
and while it's lower than schednuma's, it's still far too high.

In general, the System CPU usage is too high for everyone. Note that the
cpu follows algorithm puts it sky high and the adaptive scanning makes it
worse. This needs addressing. A very large portion of this system CPU cost
is due to TLB flushes during migration when handling pte_numa faults.

In terms of Elapsed time things are not too bad. For NUMA01_THEADLOCAL,
NUMA02 and NUMA02_SMT, lessflush-v3r27 (the main series that I think should
be benchmarked) shows reasonable improvements. It's not as good as schednuma
and autonuma in general but it is still respectable and there is no proper
placement policy after all.

MMTests Statistics: duration
              stats-v2r34   schednuma-v2r3 autonuma-v28fast      moron-v3r27   twostage-v3r27  lessflush-v3r27     cpuone-v3r27  adaptscan-v3r27
                    3.7.0            3.7.0            3.7.0            3.7.0            3.7.0            3.7.0            3.7.0            3.7.0
User            132355.28         67445.10         50473.41         47550.20         87692.19         91141.69         75769.47         59892.77
System              89.90          4490.17           658.51          1239.92          2346.70          2600.48          5338.27         13459.43
Elapsed           3138.98          1621.73          1240.09          1159.42          2055.92          2144.42          1924.10          1893.29

The main take-away here is the System CPU cost. twostage-v3r27 and
lessflush-v3r27 are very high and the placement policy and adaptive scan
make it a lot worse. autonuma's looks really low but this could be due to
the fact it does a lot of work in kernel threads where the cost is not as
obvious.


MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                   rc4-stats-v2r34  rc4-schednuma-v2r3  rc4-autonuma-v28fast  rc4-moron-v3r27  rc4-twostage-v3r27  rc4-lessflush-v3r27  rc4-cpuone-v3r27  rc4-adaptscan-v3r27
Page Ins                         40180       36944       41824       43420       43432       43168       43424       43148
Page Outs                        29548       16996       13352       12864       20684       21628       17964       18984
Swap Ins                             0           0           0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0           0           0
THP fault alloc                  16688       12225       19232       17117       17828       17273       18272       18695
THP collapse alloc                   8           1        9743         484         918        1034        1097        1095
THP splits                           3           0       10654        7568        7453        7679        8051        8134
THP fault fallback                   0           0           0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0           0           0
Page migrate success                 0           0           0     3372219     9296248     9122453    19833353    42345720
Page migrate failure                 0           0           0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0           0           0
Compaction cost                      0           0           0        3500        9649        9469       20587       43954
NUMA PTE updates                     0           0           0   571770975   101066122   104353617   411858897  1471434472
NUMA hint faults                     0           0           0   573525212   103538510   106870459   415594465  1481522643
NUMA hint local faults               0           0           0   149965397    49345272    51268932   202046567   555527366
NUMA pages migrated                  0           0           0     3372219     9296248     9122453    19833353    42345720
AutoNUMA cost                        0           0           0     2871692      518576      535256     2081232     7418717

schednuma and autonuma do not have the stats so we cannot compare the
notional costs except to note that schednuma has no THP splits as it
supports native THP migration.

For balancenuma, the main thing to spot is that there are a LOT of pte
updates and migrations. Superficially this indicates that the workload is
not converging properly and that the scanning rate is not being reduced
when it does. This is where a proper placement policy, scheduling decisions
and scan rate adaption should come in to play.


SPECJBB BOPS

Cutting this one a bit short again to save space

                          3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                rc4-stats-v2r34    rc4-schednuma-v2r3  rc4-autonuma-v28fast       rc4-moron-v3r27    rc4-twostage-v3r27   rc4-lessflush-v3r27      rc4-cpuone-v3r27   rc4-adaptscan-v3r27
Mean   1      25034.25 (  0.00%)     20598.50 (-17.72%)     25192.25 (  0.63%)     25168.75 (  0.54%)     25525.75 (  1.96%)     25146.25 (  0.45%)     24270.25 ( -3.05%)     24703.75 ( -1.32%)
Mean   2      53176.00 (  0.00%)     43906.50 (-17.43%)     55508.25 (  4.39%)     52909.50 ( -0.50%)     49688.25 ( -6.56%)     50920.75 ( -4.24%)     51530.25 ( -3.09%)     47131.25 (-11.37%)
Mean   3      77350.50 (  0.00%)     60342.75 (-21.99%)     82122.50 (  6.17%)     76557.00 ( -1.03%)     75532.25 ( -2.35%)     73726.75 ( -4.68%)     74520.00 ( -3.66%)     63232.25 (-18.25%)
Mean   4      99919.50 (  0.00%)     80781.75 (-19.15%)    107233.25 (  7.32%)     98943.50 ( -0.98%)     97165.00 ( -2.76%)     96004.75 ( -3.92%)     95784.00 ( -4.14%)     67122.00 (-32.82%)
Mean   5     119797.00 (  0.00%)     97870.00 (-18.30%)    131016.00 (  9.37%)    118532.25 ( -1.06%)    117484.50 ( -1.93%)    116902.50 ( -2.42%)    116510.25 ( -2.74%)     69520.00 (-41.97%)
Mean   6     135858.00 (  0.00%)    123912.50 ( -8.79%)    152444.75 ( 12.21%)    133761.75 ( -1.54%)    133725.25 ( -1.57%)    134317.50 ( -1.13%)    132445.75 ( -2.51%)     42056.75 (-69.04%)
Mean   7     136074.00 (  0.00%)    126574.25 ( -6.98%)    157372.75 ( 15.65%)    133870.75 ( -1.62%)    135822.50 ( -0.18%)    137850.25 (  1.31%)    135727.75 ( -0.25%)     19630.75 (-85.57%)
Mean   8     132426.25 (  0.00%)    121766.00 ( -8.05%)    161655.25 ( 22.07%)    131605.50 ( -0.62%)    136697.25 (  3.23%)    135818.00 (  2.56%)    135559.00 (  2.37%)     27267.75 (-79.41%)
Mean   9     129432.75 (  0.00%)    114224.25 (-11.75%)    160530.50 ( 24.03%)    130498.50 (  0.82%)    134121.00 (  3.62%)    133703.25 (  3.30%)    134048.75 (  3.57%)     18777.00 (-85.49%)
Mean   10    118399.75 (  0.00%)    109040.50 ( -7.90%)    158692.00 ( 34.03%)    125355.50 (  5.87%)    131581.75 ( 11.13%)    129295.50 (  9.20%)    130685.25 ( 10.38%)      4565.00 (-96.14%)
Mean   11    119604.00 (  0.00%)    105566.50 (-11.74%)    154462.00 ( 29.14%)    126155.75 (  5.48%)    127086.00 (  6.26%)    125634.25 (  5.04%)    125426.00 (  4.87%)      4811.25 (-95.98%)
Mean   12    112742.25 (  0.00%)    101728.75 ( -9.77%)    149546.00 ( 32.64%)    111419.00 ( -1.17%)    118136.75 (  4.78%)    118625.75 (  5.22%)    120271.25 (  6.68%)      5029.25 (-95.54%)
Mean   13    109480.75 (  0.00%)    103737.50 ( -5.25%)    144929.25 ( 32.38%)    109388.25 ( -0.08%)    114351.25 (  4.45%)    115157.00 (  5.18%)    117752.00 (  7.55%)      3314.25 (-96.97%)
Mean   14    109724.00 (  0.00%)    103516.00 ( -5.66%)    143804.50 ( 31.06%)    108902.25 ( -0.75%)    115422.75 (  5.19%)    114151.75 (  4.04%)    115994.00 (  5.71%)      5312.50 (-95.16%)
Mean   15    109111.75 (  0.00%)    100817.00 ( -7.60%)    141878.00 ( 30.03%)    108213.25 ( -0.82%)    115640.00 (  5.98%)    112870.00 (  3.44%)    117017.00 (  7.25%)      5298.75 (-95.14%)
Mean   16    105385.75 (  0.00%)     99327.25 ( -5.75%)    140156.75 ( 32.99%)    105159.75 ( -0.21%)    113128.25 (  7.35%)    111836.50 (  6.12%)    115170.00 (  9.28%)      4091.75 (-96.12%)
Mean   17    101903.50 (  0.00%)     96464.50 ( -5.34%)    138402.00 ( 35.82%)    104582.75 (  2.63%)    112576.00 ( 10.47%)    112967.50 ( 10.86%)    113390.75 ( 11.27%)      5601.25 (-94.50%)
Mean   18    103632.50 (  0.00%)     95632.50 ( -7.72%)    137781.50 ( 32.95%)    103168.00 ( -0.45%)    110462.00 (  6.59%)    113622.75 (  9.64%)    113209.00 (  9.24%)      6216.75 (-94.00%)
Stddev 1       1195.76 (  0.00%)       358.07 ( 70.06%)       861.97 ( 27.91%)      1108.27 (  7.32%)       704.35 ( 41.10%)       738.31 ( 38.26%)       370.96 ( 68.98%)       858.14 ( 28.23%)
Stddev 2        883.39 (  0.00%)      1203.29 (-36.21%)       855.08 (  3.20%)       320.44 ( 63.73%)      1190.25 (-34.74%)       918.86 ( -4.02%)       720.67 ( 18.42%)      1831.94 (-107.38%)
Stddev 3        997.25 (  0.00%)      3755.67 (-276.60%)       545.50 ( 45.30%)       971.40 (  2.59%)      1444.69 (-44.87%)      1507.91 (-51.21%)      1227.37 (-23.08%)      4043.37 (-305.45%)
Stddev 4       1115.16 (  0.00%)      6390.65 (-473.07%)      1183.49 ( -6.13%)       679.74 ( 39.05%)      1320.08 (-18.38%)       897.64 ( 19.51%)      1525.30 (-36.78%)      8637.27 (-674.53%)
Stddev 5       1367.09 (  0.00%)      9710.70 (-610.32%)      1022.09 ( 25.24%)       944.31 ( 30.93%)      1003.82 ( 26.57%)       824.03 ( 39.72%)      1128.73 ( 17.44%)     13504.42 (-887.82%)
Stddev 6       1125.22 (  0.00%)      1097.83 (  2.43%)      1013.52 (  9.93%)      1170.85 ( -4.06%)      1971.57 (-75.22%)      1042.93 (  7.31%)      2416.06 (-114.72%)      9214.24 (-718.89%)
Stddev 7       3211.72 (  0.00%)      1533.62 ( 52.25%)       512.61 ( 84.04%)      4186.42 (-30.35%)      5832.10 (-81.59%)      4264.34 (-32.77%)      2886.05 ( 10.14%)      2628.35 ( 18.16%)
Stddev 8       4194.96 (  0.00%)      1518.26 ( 63.81%)       493.64 ( 88.23%)      2203.56 ( 47.47%)      1961.15 ( 53.25%)      2913.42 ( 30.55%)      3445.70 ( 17.86%)     13053.31 (-211.17%)
Stddev 9       6175.10 (  0.00%)      2648.75 ( 57.11%)      2109.83 ( 65.83%)      2732.83 ( 55.74%)      2205.91 ( 64.28%)      3808.45 ( 38.33%)      3246.22 ( 47.43%)      5511.26 ( 10.75%)
Stddev 10      4754.87 (  0.00%)      1941.47 ( 59.17%)      2948.98 ( 37.98%)      1533.87 ( 67.74%)      2395.65 ( 49.62%)      3207.51 ( 32.54%)      3564.21 ( 25.04%)       783.51 ( 83.52%)
Stddev 11      2706.18 (  0.00%)      1247.95 ( 53.89%)      5907.16 (-118.28%)      3030.54 (-11.99%)      2989.54 (-10.47%)      2983.44 (-10.25%)      3156.67 (-16.65%)       939.68 ( 65.28%)
Stddev 12      3607.76 (  0.00%)       663.63 ( 81.61%)      9063.28 (-151.22%)      3191.77 ( 11.53%)      2849.20 ( 21.03%)      1810.51 ( 49.82%)      3422.89 (  5.12%)       305.09 ( 91.54%)
Stddev 13      2771.67 (  0.00%)      1447.87 ( 47.76%)      8716.51 (-214.49%)      3516.13 (-26.86%)      1425.69 ( 48.56%)      2564.87 (  7.46%)      1667.33 ( 39.84%)       118.01 ( 95.74%)
Stddev 14      2522.18 (  0.00%)      1510.28 ( 40.12%)      9286.98 (-268.21%)      3144.22 (-24.66%)      1866.90 ( 25.98%)       784.45 ( 68.90%)       369.15 ( 85.36%)       764.26 ( 69.70%)
Stddev 15      2711.16 (  0.00%)      1719.54 ( 36.58%)      9895.88 (-265.01%)      2889.53 ( -6.58%)      1059.84 ( 60.91%)      2043.26 ( 24.64%)      1149.45 ( 57.60%)       297.90 ( 89.01%)
Stddev 16      2797.21 (  0.00%)       983.63 ( 64.84%)      9302.92 (-232.58%)      2734.35 (  2.25%)       817.51 ( 70.77%)       937.10 ( 66.50%)      1031.85 ( 63.11%)       223.38 ( 92.01%)
Stddev 17      4019.85 (  0.00%)      1927.25 ( 52.06%)      9998.34 (-148.72%)      2567.94 ( 36.12%)      1301.02 ( 67.64%)      1803.98 ( 55.12%)      1683.85 ( 58.11%)       697.06 ( 82.66%)
Stddev 18      3332.20 (  0.00%)      1401.68 ( 57.94%)     12056.08 (-261.80%)      2297.48 ( 31.05%)      1852.32 ( 44.41%)       675.02 ( 79.74%)      1190.98 ( 64.26%)       285.90 ( 91.42%)
TPut   1     100137.00 (  0.00%)     82394.00 (-17.72%)    100769.00 (  0.63%)    100675.00 (  0.54%)    102103.00 (  1.96%)    100585.00 (  0.45%)     97081.00 ( -3.05%)     98815.00 ( -1.32%)
TPut   2     212704.00 (  0.00%)    175626.00 (-17.43%)    222033.00 (  4.39%)    211638.00 ( -0.50%)    198753.00 ( -6.56%)    203683.00 ( -4.24%)    206121.00 ( -3.09%)    188525.00 (-11.37%)
TPut   3     309402.00 (  0.00%)    241371.00 (-21.99%)    328490.00 (  6.17%)    306228.00 ( -1.03%)    302129.00 ( -2.35%)    294907.00 ( -4.68%)    298080.00 ( -3.66%)    252929.00 (-18.25%)
TPut   4     399678.00 (  0.00%)    323127.00 (-19.15%)    428933.00 (  7.32%)    395774.00 ( -0.98%)    388660.00 ( -2.76%)    384019.00 ( -3.92%)    383136.00 ( -4.14%)    268488.00 (-32.82%)
TPut   5     479188.00 (  0.00%)    391480.00 (-18.30%)    524064.00 (  9.37%)    474129.00 ( -1.06%)    469938.00 ( -1.93%)    467610.00 ( -2.42%)    466041.00 ( -2.74%)    278080.00 (-41.97%)
TPut   6     543432.00 (  0.00%)    495650.00 ( -8.79%)    609779.00 ( 12.21%)    535047.00 ( -1.54%)    534901.00 ( -1.57%)    537270.00 ( -1.13%)    529783.00 ( -2.51%)    168227.00 (-69.04%)
TPut   7     544296.00 (  0.00%)    506297.00 ( -6.98%)    629491.00 ( 15.65%)    535483.00 ( -1.62%)    543290.00 ( -0.18%)    551401.00 (  1.31%)    542911.00 ( -0.25%)     78523.00 (-85.57%)
TPut   8     529705.00 (  0.00%)    487064.00 ( -8.05%)    646621.00 ( 22.07%)    526422.00 ( -0.62%)    546789.00 (  3.23%)    543272.00 (  2.56%)    542236.00 (  2.37%)    109071.00 (-79.41%)
TPut   9     517731.00 (  0.00%)    456897.00 (-11.75%)    642122.00 ( 24.03%)    521994.00 (  0.82%)    536484.00 (  3.62%)    534813.00 (  3.30%)    536195.00 (  3.57%)     75108.00 (-85.49%)
TPut   10    473599.00 (  0.00%)    436162.00 ( -7.90%)    634768.00 ( 34.03%)    501422.00 (  5.87%)    526327.00 ( 11.13%)    517182.00 (  9.20%)    522741.00 ( 10.38%)     18260.00 (-96.14%)
TPut   11    478416.00 (  0.00%)    422266.00 (-11.74%)    617848.00 ( 29.14%)    504623.00 (  5.48%)    508344.00 (  6.26%)    502537.00 (  5.04%)    501704.00 (  4.87%)     19245.00 (-95.98%)
TPut   12    450969.00 (  0.00%)    406915.00 ( -9.77%)    598184.00 ( 32.64%)    445676.00 ( -1.17%)    472547.00 (  4.78%)    474503.00 (  5.22%)    481085.00 (  6.68%)     20117.00 (-95.54%)
TPut   13    437923.00 (  0.00%)    414950.00 ( -5.25%)    579717.00 ( 32.38%)    437553.00 ( -0.08%)    457405.00 (  4.45%)    460628.00 (  5.18%)    471008.00 (  7.55%)     13257.00 (-96.97%)
TPut   14    438896.00 (  0.00%)    414064.00 ( -5.66%)    575218.00 ( 31.06%)    435609.00 ( -0.75%)    461691.00 (  5.19%)    456607.00 (  4.04%)    463976.00 (  5.71%)     21250.00 (-95.16%)
TPut   15    436447.00 (  0.00%)    403268.00 ( -7.60%)    567512.00 ( 30.03%)    432853.00 ( -0.82%)    462560.00 (  5.98%)    451480.00 (  3.44%)    468068.00 (  7.25%)     21195.00 (-95.14%)
TPut   16    421543.00 (  0.00%)    397309.00 ( -5.75%)    560627.00 ( 32.99%)    420639.00 ( -0.21%)    452513.00 (  7.35%)    447346.00 (  6.12%)    460680.00 (  9.28%)     16367.00 (-96.12%)
TPut   17    407614.00 (  0.00%)    385858.00 ( -5.34%)    553608.00 ( 35.82%)    418331.00 (  2.63%)    450304.00 ( 10.47%)    451870.00 ( 10.86%)    453563.00 ( 11.27%)     22405.00 (-94.50%)
TPut   18    414530.00 (  0.00%)    382530.00 ( -7.72%)    551126.00 ( 32.95%)    412672.00 ( -0.45%)    441848.00 (  6.59%)    454491.00 (  9.64%)    452836.00 (  9.24%)     24867.00 (-94.00%)

One JVM runs per NUMA node. Mean is the average ops/sec per JVM. TPut is
the overall throughput of all nodes.

lessflush-v3r27 does reasonably well here. It's slower for smaller numbers of
warehouses and sees 3-10% performance gains for larger numbers of warehouses.
This is quite encouraging. Note that moron-v3r27, which is roughly similar
to v1 of this series, is crap because of its brain-damaged handling of
PMD faults.

The cpu-follows policy does nothing useful here. If it's making better placement
decisions, it's losing all the gain.

The adaptive scan COMPLETELY wrecks everything. I was tempted to delete this patch
entirely and pretend it didn't exist but some sort of adaptive scan rate is required.
The patch at least acts as a "Don't Do What Donny Don't Did".

schednuma regresses badly here and it has to be established why, as Ingo
reports the exact opposite. It has been discussed elsewhere but it could be
down to the kernel, the machine, the JVM configuration or which specjbb
figures we are actually reporting.

schednuma and lessflush-v3r27 are reasonably good in terms of variation
across JVMs and are generally more consistent. autonuma has very variable
performance between JVMs.

autonuma dominates here.

SPECJBB PEAKS
                                       3.7.0                      3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                             rc4-stats-v2r34         rc4-schednuma-v2r3  rc4-autonuma-v28fast       rc4-moron-v3r27    rc4-twostage-v3r27   rc4-lessflush-v3r27      rc4-cpuone-v3r27   rc4-adaptscan-v3r27
 Expctd Warehouse                   12.00 (  0.00%)     12.00 (  0.00%)       12.00 (  0.00%)       12.00 (  0.00%)       12.00 (  0.00%)       12.00 (  0.00%)       12.00 (  0.00%)       12.00 (  0.00%)
 Expctd Peak Bops               450969.00 (  0.00%) 406915.00 ( -9.77%)   598184.00 ( 32.64%)   445676.00 ( -1.17%)   472547.00 (  4.78%)   474503.00 (  5.22%)   481085.00 (  6.68%)    20117.00 (-95.54%)
 Actual Warehouse                    7.00 (  0.00%)      7.00 (  0.00%)        8.00 ( 14.29%)        7.00 (  0.00%)        8.00 ( 14.29%)        7.00 (  0.00%)        7.00 (  0.00%)        5.00 (-28.57%)
 Actual Peak Bops               544296.00 (  0.00%) 506297.00 ( -6.98%)   646621.00 ( 18.80%)   535483.00 ( -1.62%)   546789.00 (  0.46%)   551401.00 (  1.31%)   542911.00 ( -0.25%)   278080.00 (-48.91%)

Other than autonuma, peak performance did not go well. balancenuma
sustains performance for greater numbers of warehouses but its actual
peak performance is not improved. As before, adaptive scan killed everything.


MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
         rc4-stats-v2r34  rc4-schednuma-v2r3  rc4-autonuma-v28fast  rc4-moron-v3r27  rc4-twostage-v3r27  rc4-lessflush-v3r27  rc4-cpuone-v3r27  rc4-adaptscan-v3r27
User       101949.84    86817.79   101748.80   100943.56    99799.41    99896.98    99813.11    12790.74
System         66.05    13094.99      191.40      948.00     1948.39     1939.91     1995.15    40647.38
Elapsed      2456.35     2459.16     2451.96     2456.83     2462.20     2462.01     2462.97     2502.24

schednuma's system CPU costs were high.

autonuma's were low but again, the cost could be hidden.

balancenuma's is relatively not too bad (other than adaptive scan which
kills the world) but it is still stupidly high. A proper placement policy
that reduced migrations would help a lot.


MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0       3.7.0
                   rc4-stats-v2r34  rc4-schednuma-v2r3  rc4-autonuma-v28fast  rc4-moron-v3r27  rc4-twostage-v3r27  rc4-lessflush-v3r27  rc4-cpuone-v3r27  rc4-adaptscan-v3r27
Page Ins                         34920       36128       37356       38264       38368       37952       38196       38236
Page Outs                        32116       34000       31140       31604       31152       32872       31592       33280
Swap Ins                             0           0           0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0           0           0
THP fault alloc                      1           1           1           2           1           2           1           2
THP collapse alloc                   0           0          23           0           0           0           0           2
THP splits                           0           0           7           0           3           1           7           5
THP fault fallback                   0           0           0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0           0           0
Page migrate success                 0           0           0      890168    53347569    53708970    53869395   381749347
Page migrate failure                 0           0           0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0           0           0
Compaction cost                      0           0           0         923       55374       55749       55916      396255
NUMA PTE updates                     0           0           0  2959462982   382203645   383516027   388145421  3175106653
NUMA hint faults                     0           0           0  2958118854   381790747   382914344   387738802  3202932515
NUMA hint local faults               0           0           0   771705175   102887391   104071278   102032245  1038500179
NUMA pages migrated                  0           0           0      890168    53347569    53708970    53869395   381749347
AutoNUMA cost                        0           0           0    14811327     1912642     1918276     1942434    16044141

THP is not really a factor for this workload but one thing to note is the
migration rate for lessflush-v3r27. It works out at migrating 85MB/s on
average throughout the entire test. Again, a proper placement policy
should reduce this.

So in summary, patches 1-35 are not perfect and need a proper placement
policy and scheduler smarts but out of the box it's not completely crap
either.

 arch/sh/mm/Kconfig                   |    1 +
 arch/x86/Kconfig                     |    1 +
 arch/x86/include/asm/pgtable.h       |   11 +-
 arch/x86/include/asm/pgtable_types.h |   20 +
 arch/x86/mm/pgtable.c                |    8 +-
 include/asm-generic/pgtable.h        |    7 +
 include/linux/huge_mm.h              |   10 +
 include/linux/init_task.h            |    8 +
 include/linux/mempolicy.h            |    8 +
 include/linux/migrate.h              |   27 +-
 include/linux/mm.h                   |   34 ++
 include/linux/mm_types.h             |   44 ++
 include/linux/mmzone.h               |   13 +
 include/linux/sched.h                |   52 +++
 include/linux/vm_event_item.h        |   12 +-
 include/linux/vmstat.h               |    8 +
 include/trace/events/migrate.h       |   51 +++
 include/uapi/linux/mempolicy.h       |   24 +-
 init/Kconfig                         |   22 +
 kernel/fork.c                        |   18 +
 kernel/sched/core.c                  |   60 ++-
 kernel/sched/debug.c                 |    3 +
 kernel/sched/fair.c                  |  764 ++++++++++++++++++++++++++++++++--
 kernel/sched/features.h              |   25 ++
 kernel/sched/sched.h                 |   36 ++
 kernel/sysctl.c                      |   38 +-
 mm/compaction.c                      |   15 +-
 mm/huge_memory.c                     |   53 +++
 mm/memory-failure.c                  |    3 +-
 mm/memory.c                          |  198 ++++++++-
 mm/memory_hotplug.c                  |    3 +-
 mm/mempolicy.c                       |  381 +++++++++++++++--
 mm/migrate.c                         |  178 +++++++-
 mm/page_alloc.c                      |   10 +-
 mm/pgtable-generic.c                 |   59 ++-
 mm/vmstat.c                          |   16 +-
 36 files changed, 2131 insertions(+), 90 deletions(-)
 create mode 100644 include/trace/events/migrate.h

-- 
1.7.9.2



* [PATCH 01/43] mm: compaction: Move migration fail/success stats to migrate.c
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

The compact_pages_moved and compact_pagemigrate_failed events are
convenient for determining if compaction is active and to what
degree migration is succeeding but they are at the wrong level. Other
users of migration may also want to know if migration is working
properly and this will be particularly true for any automated
NUMA migration. This patch moves the counters down to migration
with the new events called pgmigrate_success and pgmigrate_fail.
The compact_blocks_moved counter is removed because while it was
useful for debugging initially, it's worthless now as no meaningful
conclusions can be drawn from its value.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vm_event_item.h |    4 +++-
 mm/compaction.c               |    4 ----
 mm/migrate.c                  |    6 ++++++
 mm/vmstat.c                   |    7 ++++---
 4 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..8aa7cb9 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,8 +38,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_MIGRATION
+		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
+#endif
 #ifdef CONFIG_COMPACTION
-		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #endif
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/compaction.c b/mm/compaction.c
index 9eef558..00ad883 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -994,10 +994,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
-		count_vm_event(COMPACTBLOCKS);
-		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
-		if (nr_remaining)
-			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
 		trace_mm_compaction_migratepages(nr_migrate - nr_remaining,
 						nr_remaining);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..04687f6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -962,6 +962,7 @@ int migrate_pages(struct list_head *from,
 {
 	int retry = 1;
 	int nr_failed = 0;
+	int nr_succeeded = 0;
 	int pass = 0;
 	struct page *page;
 	struct page *page2;
@@ -988,6 +989,7 @@ int migrate_pages(struct list_head *from,
 				retry++;
 				break;
 			case 0:
+				nr_succeeded++;
 				break;
 			default:
 				/* Permanent failure */
@@ -998,6 +1000,10 @@ int migrate_pages(struct list_head *from,
 	}
 	rc = 0;
 out:
+	if (nr_succeeded)
+		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
+	if (nr_failed)
+		count_vm_events(PGMIGRATE_FAIL, nr_failed);
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..89a7fd6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -774,10 +774,11 @@ const char * const vmstat_text[] = {
 
 	"pgrotated",
 
+#ifdef CONFIG_MIGRATION
+	"pgmigrate_success",
+	"pgmigrate_fail",
+#endif
 #ifdef CONFIG_COMPACTION
-	"compact_blocks_moved",
-	"compact_pages_moved",
-	"compact_pagemigrate_failed",
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
-- 
1.7.9.2



* [PATCH 02/43] mm: migrate: Add a tracepoint for migrate_pages
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
about migration activity but not the type or the reason. This patch adds
a tracepoint to identify the type of page migration and why the page is
being migrated.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/migrate.h        |   13 ++++++++--
 include/trace/events/migrate.h |   51 ++++++++++++++++++++++++++++++++++++++++
 mm/compaction.c                |    3 ++-
 mm/memory-failure.c            |    3 ++-
 mm/memory_hotplug.c            |    3 ++-
 mm/mempolicy.c                 |    6 +++--
 mm/migrate.c                   |   10 ++++++--
 mm/page_alloc.c                |    3 ++-
 8 files changed, 82 insertions(+), 10 deletions(-)
 create mode 100644 include/trace/events/migrate.h

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..9d1c159 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -7,6 +7,15 @@
 
 typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
+enum migrate_reason {
+	MR_COMPACTION,
+	MR_MEMORY_FAILURE,
+	MR_MEMORY_HOTPLUG,
+	MR_SYSCALL,		/* also applies to cpusets */
+	MR_MEMPOLICY_MBIND,
+	MR_CMA
+};
+
 #ifdef CONFIG_MIGRATION
 
 extern void putback_lru_pages(struct list_head *l);
@@ -14,7 +23,7 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
-			enum migrate_mode mode);
+			enum migrate_mode mode, int reason);
 extern int migrate_huge_page(struct page *, new_page_t x,
 			unsigned long private, bool offlining,
 			enum migrate_mode mode);
@@ -35,7 +44,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
-		enum migrate_mode mode) { return -ENOSYS; }
+		enum migrate_mode mode, int reason) { return -ENOSYS; }
 static inline int migrate_huge_page(struct page *page, new_page_t x,
 		unsigned long private, bool offlining,
 		enum migrate_mode mode) { return -ENOSYS; }
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
new file mode 100644
index 0000000..ec2a6cc
--- /dev/null
+++ b/include/trace/events/migrate.h
@@ -0,0 +1,51 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM migrate
+
+#if !defined(_TRACE_MIGRATE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MIGRATE_H
+
+#define MIGRATE_MODE						\
+	{MIGRATE_ASYNC,		"MIGRATE_ASYNC"},		\
+	{MIGRATE_SYNC_LIGHT,	"MIGRATE_SYNC_LIGHT"},		\
+	{MIGRATE_SYNC,		"MIGRATE_SYNC"}		
+
+#define MIGRATE_REASON						\
+	{MR_COMPACTION,		"compaction"},			\
+	{MR_MEMORY_FAILURE,	"memory_failure"},		\
+	{MR_MEMORY_HOTPLUG,	"memory_hotplug"},		\
+	{MR_SYSCALL,		"syscall_or_cpuset"},		\
+	{MR_MEMPOLICY_MBIND,	"mempolicy_mbind"},		\
+	{MR_CMA,		"cma"}
+
+TRACE_EVENT(mm_migrate_pages,
+
+	TP_PROTO(unsigned long succeeded, unsigned long failed,
+		 enum migrate_mode mode, int reason),
+
+	TP_ARGS(succeeded, failed, mode, reason),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,		succeeded)
+		__field(	unsigned long,		failed)
+		__field(	enum migrate_mode,	mode)
+		__field(	int,			reason)
+	),
+
+	TP_fast_assign(
+		__entry->succeeded	= succeeded;
+		__entry->failed		= failed;
+		__entry->mode		= mode;
+		__entry->reason		= reason;
+	),
+
+	TP_printk("nr_succeeded=%lu nr_failed=%lu mode=%s reason=%s",
+		__entry->succeeded,
+		__entry->failed,
+		__print_symbolic(__entry->mode, MIGRATE_MODE),
+		__print_symbolic(__entry->reason, MIGRATE_REASON))
+);
+
+#endif /* _TRACE_MIGRATE_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/compaction.c b/mm/compaction.c
index 00ad883..2c077a7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -990,7 +990,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		nr_migrate = cc->nr_migratepages;
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				(unsigned long)cc, false,
-				cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);
+				cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC,
+				MR_COMPACTION);
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6c5899b..ddb68a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1558,7 +1558,8 @@ int soft_offline_page(struct page *page, int flags)
 					    page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
-							false, MIGRATE_SYNC);
+							false, MIGRATE_SYNC,
+							MR_MEMORY_FAILURE);
 		if (ret) {
 			putback_lru_pages(&pagelist);
 			pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 56b758a..af60ce7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -819,7 +819,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		 * migrate_pages returns # of failed pages.
 		 */
 		ret = migrate_pages(&source, alloc_migrate_target, 0,
-							true, MIGRATE_SYNC);
+							true, MIGRATE_SYNC,
+							MR_MEMORY_HOTPLUG);
 		if (ret)
 			putback_lru_pages(&source);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..66e90ec 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -961,7 +961,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_node_page, dest,
-							false, MIGRATE_SYNC);
+							false, MIGRATE_SYNC,
+							MR_SYSCALL);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
@@ -1202,7 +1203,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 		if (!list_empty(&pagelist)) {
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma,
-						false, MIGRATE_SYNC);
+						false, MIGRATE_SYNC,
+						MR_MEMPOLICY_MBIND);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index 04687f6..27be9c9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -38,6 +38,9 @@
 
 #include <asm/tlbflush.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/migrate.h>
+
 #include "internal.h"
 
 /*
@@ -958,7 +961,7 @@ out:
  */
 int migrate_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
-		enum migrate_mode mode)
+		enum migrate_mode mode, int reason)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -1004,6 +1007,8 @@ out:
 		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
 	if (nr_failed)
 		count_vm_events(PGMIGRATE_FAIL, nr_failed);
+	trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
+
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
@@ -1145,7 +1150,8 @@ set_status:
 	err = 0;
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_page_node,
-				(unsigned long)pm, 0, MIGRATE_SYNC);
+				(unsigned long)pm, 0, MIGRATE_SYNC,
+				MR_SYSCALL);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b74de6..4681fc4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5707,7 +5707,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 
 		ret = migrate_pages(&cc->migratepages,
 				    alloc_migrate_target,
-				    0, false, MIGRATE_SYNC);
+				    0, false, MIGRATE_SYNC,
+				    MR_CMA);
 	}
 
 	putback_lru_pages(&cc->migratepages);
-- 
1.7.9.2



* [PATCH 03/43] mm: compaction: Add scanned and isolated counters for compaction
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
  2012-11-16 11:22 ` [PATCH 01/43] mm: compaction: Move migration fail/success stats to migrate.c Mel Gorman
  2012-11-16 11:22 ` [PATCH 02/43] mm: migrate: Add a tracepoint for migrate_pages Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 04/43] mm: numa: define _PAGE_NUMA Mel Gorman
                   ` (40 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

Compaction already has tracepoints to count scanned and isolated pages,
but they require ftrace to be enabled and, if that information has to be
written to disk, the tracing itself can be disruptive. This patch adds
vmstat counters for compaction called compact_migrate_scanned,
compact_free_scanned and compact_isolated.

With these counters, it is possible to define a basic cost model for
compaction. This approximates how much work compaction is doing and can be
compared with an oprofile report showing TLB misses to see whether the cost
of compaction is being offset by THP, for example. Minimally, a compaction
patch can be evaluated in terms of whether it increases or decreases cost.
The basic cost model looks like this

Fundamental unit u:	a word	sizeof(void *)

Ca  = cost of struct page access = sizeof(struct page) / u

Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
Cmf = Cost migrate failure   = Ca * 2
Ci  = Cost page isolation    = (Ca + Wi)
	where Wi is a constant that should reflect the approximate
	cost of the locking operation.

Csm = Cost migrate scanning = Ca
Csf = Cost free    scanning = Ca

Overall cost =	(Csm * compact_migrate_scanned) +
	      	(Csf * compact_free_scanned)    +
	      	(Ci  * compact_isolated)	+
		(Cmc * pgmigrate_success)	+
		(Cmf * pgmigrate_failed)

Where the values are read from /proc/vmstat.

This is very basic and ignores certain costs, such as the allocation cost
of a migrate page copy, but any improvement to the model would still use
the same vmstat counters.
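
As a rough illustration only (not part of the patch), the model could be
evaluated from userspace along these lines; the weights are placeholder
assumptions, not values the patch defines:

/*
 * Hypothetical sketch: plug the /proc/vmstat counters into the cost
 * model above. The weights are illustrative assumptions for a 64-bit
 * machine with 4K pages: u = 8 bytes, Ca = sizeof(struct page)/u ~= 8,
 * Wi = 4, so Csm = Csf = Ca, Ci = Ca + Wi, Cmc = (Ca + 512) * 2 and
 * Cmf = Ca * 2.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	static const char *keys[] = {
		"compact_migrate_scanned", "compact_free_scanned",
		"compact_isolated", "pgmigrate_success", "pgmigrate_fail",
	};
	unsigned long long v[5] = { 0 }, val;
	double Ca = 8, Wi = 4, Cmc = (Ca + 512) * 2, Cmf = Ca * 2, cost;
	char name[64];
	FILE *fp = fopen("/proc/vmstat", "r");
	int i;

	if (!fp)
		return 1;
	while (fscanf(fp, "%63s %llu", name, &val) == 2)
		for (i = 0; i < 5; i++)
			if (!strcmp(name, keys[i]))
				v[i] = val;
	fclose(fp);

	cost = Ca * v[0] + Ca * v[1] + (Ca + Wi) * v[2] +
	       Cmc * v[3] + Cmf * v[4];
	printf("approximate compaction cost: %.0f units\n", cost);
	return 0;
}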

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vm_event_item.h |    2 ++
 mm/compaction.c               |    8 ++++++++
 mm/vmstat.c                   |    3 +++
 3 files changed, 13 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8aa7cb9..a1f750b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -42,6 +42,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
 #endif
 #ifdef CONFIG_COMPACTION
+		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
+		COMPACTISOLATED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #endif
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/compaction.c b/mm/compaction.c
index 2c077a7..aee7443 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -356,6 +356,10 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 	if (blockpfn == end_pfn)
 		update_pageblock_skip(cc, valid_page, total_isolated, false);
 
+	count_vm_events(COMPACTFREE_SCANNED, nr_scanned);
+	if (total_isolated)
+		count_vm_events(COMPACTISOLATED, total_isolated);
+
 	return total_isolated;
 }
 
@@ -646,6 +650,10 @@ next_pageblock:
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
+	count_vm_events(COMPACTMIGRATE_SCANNED, nr_scanned);
+	if (nr_isolated)
+		count_vm_events(COMPACTISOLATED, nr_isolated);
+
 	return low_pfn;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 89a7fd6..3a067fa 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -779,6 +779,9 @@ const char * const vmstat_text[] = {
 	"pgmigrate_fail",
 #endif
 #ifdef CONFIG_COMPACTION
+	"compact_migrate_scanned",
+	"compact_free_scanned",
+	"compact_isolated",
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
-- 
1.7.9.2



* [PATCH 04/43] mm: numa: define _PAGE_NUMA
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (2 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 03/43] mm: compaction: Add scanned and isolated counters for compaction Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 05/43] mm: numa: pte_numa() and pmd_numa() Mel Gorman
                   ` (39 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
faults to identify the per NUMA node working set of the thread at
runtime.

Arming the NUMA hinting page fault mechanism works similarly to
setting up a mprotect(PROT_NONE) virtual range: the present bit is
cleared at the same time that _PAGE_NUMA is set, so when the fault
triggers we can identify it as a NUMA hinting page fault.

_PAGE_NUMA on x86 shares the same bit number as _PAGE_PROTNONE (though it
could also use a different bitflag; it is up to the architecture to
decide).

It would be confusing to refer to the "NUMA hinting page faults" as
"do_prot_none faults". They're different events and _PAGE_NUMA doesn't
alter the semantics of mprotect(PROT_NONE) in any way.

Sharing the same bitflag with _PAGE_PROTNONE in fact complicates
things: it requires us to ensure that the code paths executed for
_PAGE_PROTNONE remain mutually exclusive with the code paths executed
for _PAGE_NUMA at all times, so that _PAGE_NUMA and _PAGE_PROTNONE do
not step on each other's toes.

Because we want to be able to set this bitflag in any established pte
or pmd (while clearing the present bit at the same time) without
losing information, this bitflag must never be set when the pte or
pmd is present, so the bit picked for _PAGE_NUMA must also not be
used by the swap entry format.
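
Conceptually, arming an established pte for a hinting fault boils down
to swapping _PAGE_PRESENT for _PAGE_NUMA in a single update. The sketch
below only mirrors the pte_mknuma() helper that a later patch in this
series adds; the name used here is purely illustrative:

/* Sketch: arm an established pte for a NUMA hinting fault by setting
 * _PAGE_NUMA and clearing _PAGE_PRESENT in one go. This mirrors the
 * pte_mknuma() helper added later in the series; the name is only
 * illustrative. */
static inline pte_t make_numa_pte(pte_t pte)
{
	pte = pte_set_flags(pte, _PAGE_NUMA);
	return pte_clear_flags(pte, _PAGE_PRESENT);
}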

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ec8a1fc..3c32db8 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -64,6 +64,26 @@
 #define _PAGE_FILE	(_AT(pteval_t, 1) << _PAGE_BIT_FILE)
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
+/*
+ * _PAGE_NUMA indicates that this page will trigger a numa hinting
+ * minor page fault to gather numa placement statistics (see
+ * pte_numa()). The bit picked (8) is within the range between
+ * _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't
+ * require changes to the swp entry format because that bit is always
+ * zero when the pte is not present.
+ *
+ * The bit picked must always be zero, both when the pmd is present and
+ * when it is not present, so that we don't lose information when we set
+ * it while atomically clearing the present bit.
+ *
+ * Because we shared the same bit (8) with _PAGE_PROTNONE this can be
+ * interpreted as _PAGE_NUMA only in places that _PAGE_PROTNONE
+ * couldn't reach, like handle_mm_fault() (see access_error in
+ * arch/x86/mm/fault.c, the vma protection must not be PROT_NONE for
+ * handle_mm_fault() to be invoked).
+ */
+#define _PAGE_NUMA	_PAGE_PROTNONE
+
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |	\
 			 _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
-- 
1.7.9.2



* [PATCH 05/43] mm: numa: pte_numa() and pmd_numa()
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (3 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 04/43] mm: numa: define _PAGE_NUMA Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation Mel Gorman
                   ` (38 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

Implement pte_numa and pmd_numa.

We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.

Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
a thread touches a virtual address in the corresponding virtual range,
a NUMA hinting page fault will trigger. The NUMA hinting page fault
will clear the NUMA bit and set the present bit again to resolve the
page fault.

The expectation is that a NUMA hinting page fault is used as part
of a placement policy that decides if a page should remain on the
current node or migrated to a different node.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/include/asm/pgtable.h |   65 ++++++++++++++++++++++++++++++++++++++--
 include/asm-generic/pgtable.h  |   12 ++++++++
 2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1f780d..e075d57 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -404,7 +404,8 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+			       _PAGE_NUMA);
 }
 
 static inline int pte_hidden(pte_t pte)
@@ -420,7 +421,63 @@ static inline int pmd_present(pmd_t pmd)
 	 * the _PAGE_PSE flag will remain set at all times while the
 	 * _PAGE_PRESENT bit is clear).
 	 */
-	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
+				 _PAGE_NUMA);
+}
+
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * _PAGE_NUMA works identically to _PAGE_PROTNONE (it's actually the
+ * same bit too). It's set only when _PAGE_PRESENT is not set and it's
+ * never set if _PAGE_PRESENT is set.
+ *
+ * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
+ * fault triggers on those regions if pte/pmd_numa returns true
+ * (because _PAGE_PRESENT is not set).
+ */
+static inline int pte_numa(pte_t pte)
+{
+	return (pte_flags(pte) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return (pmd_flags(pmd) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+#endif
+
+/*
+ * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
+ * because they're called by the NUMA hinting minor page fault. If we
+ * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
+ * would be forced to set it later while filling the TLB after we
+ * return to userland. That would trigger a second write to memory
+ * that we optimize away by setting _PAGE_ACCESSED here.
+ */
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_NUMA);
+	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
+	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+	pte = pte_set_flags(pte, _PAGE_NUMA);
+	return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+	pmd = pmd_set_flags(pmd, _PAGE_NUMA);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 
 static inline int pmd_none(pmd_t pmd)
@@ -479,6 +536,10 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
+#ifdef CONFIG_BALANCE_NUMA
+	if (pmd_numa(pmd))
+		return 0;
+#endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b36ce40..896667e 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -554,6 +554,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifndef CONFIG_BALANCE_NUMA
+static inline int pte_numa(pte_t pte)
+{
+	return 0;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return 0;
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 #endif /* CONFIG_MMU */
 
 #endif /* !__ASSEMBLY__ */
-- 
1.7.9.2



* [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (4 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 05/43] mm: numa: pte_numa() and pmd_numa() Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 14:09   ` Rik van Riel
  2012-11-16 11:22 ` [PATCH 07/43] mm: numa: Support NUMA hinting page faults from gup/gup_fast Mel Gorman
                   ` (37 subsequent siblings)
  43 siblings, 1 reply; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

It was pointed out by Ingo Molnar that the per-architecture definition of
the NUMA PTE helper functions means that each supporting architecture
will have to cut and paste them, which is unfortunate. He suggested instead
that the helpers should be weak functions that can be overridden by the
architecture.

This patch moves the helpers to mm/pgtable-generic.c and makes them weak
functions. Architectures wishing to use this will still be required to
define _PAGE_NUMA and potentially update their p[te|md]_present and
pmd_bad helpers if they choose to make _PAGE_NUMA similar to PROT_NONE.
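
For illustration only (not part of the patch), an architecture that needs
different semantics would provide a strong definition of the same symbol,
which overrides the __weak generic version at link time:

/* Sketch: a strong, arch-specific definition overriding the generic
 * __weak pte_numa(); the x86-style bit test is shown purely as an
 * example. */
int pte_numa(pte_t pte)
{
	return (pte_flags(pte) & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
}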

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/include/asm/pgtable.h |   56 +---------------------------------------
 include/asm-generic/pgtable.h  |   17 +++++-------
 mm/pgtable-generic.c           |   53 +++++++++++++++++++++++++++++++++++++
 3 files changed, 60 insertions(+), 66 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e075d57..4a4c11c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -425,61 +425,6 @@ static inline int pmd_present(pmd_t pmd)
 				 _PAGE_NUMA);
 }
 
-#ifdef CONFIG_BALANCE_NUMA
-/*
- * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
- * same bit too). It's set only when _PAGE_PRESET is not set and it's
- * never set if _PAGE_PRESENT is set.
- *
- * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
- * fault triggers on those regions if pte/pmd_numa returns true
- * (because _PAGE_PRESENT is not set).
- */
-static inline int pte_numa(pte_t pte)
-{
-	return (pte_flags(pte) &
-		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
-}
-
-static inline int pmd_numa(pmd_t pmd)
-{
-	return (pmd_flags(pmd) &
-		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
-}
-#endif
-
-/*
- * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
- * because they're called by the NUMA hinting minor page fault. If we
- * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
- * would be forced to set it later while filling the TLB after we
- * return to userland. That would trigger a second write to memory
- * that we optimize away by setting _PAGE_ACCESSED here.
- */
-static inline pte_t pte_mknonnuma(pte_t pte)
-{
-	pte = pte_clear_flags(pte, _PAGE_NUMA);
-	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
-}
-
-static inline pmd_t pmd_mknonnuma(pmd_t pmd)
-{
-	pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
-	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
-}
-
-static inline pte_t pte_mknuma(pte_t pte)
-{
-	pte = pte_set_flags(pte, _PAGE_NUMA);
-	return pte_clear_flags(pte, _PAGE_PRESENT);
-}
-
-static inline pmd_t pmd_mknuma(pmd_t pmd)
-{
-	pmd = pmd_set_flags(pmd, _PAGE_NUMA);
-	return pmd_clear_flags(pmd, _PAGE_PRESENT);
-}
-
 static inline int pmd_none(pmd_t pmd)
 {
 	/* Only check low word on 32-bit platforms, since it might be
@@ -534,6 +479,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 	return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
 }
 
+extern int pmd_numa(pmd_t pmd);
 static inline int pmd_bad(pmd_t pmd)
 {
 #ifdef CONFIG_BALANCE_NUMA
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 896667e..da3e761 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -554,17 +554,12 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
-#ifndef CONFIG_BALANCE_NUMA
-static inline int pte_numa(pte_t pte)
-{
-	return 0;
-}
-
-static inline int pmd_numa(pmd_t pmd)
-{
-	return 0;
-}
-#endif /* CONFIG_BALANCE_NUMA */
+extern int pte_numa(pte_t pte);
+extern int pmd_numa(pmd_t pmd);
+extern pte_t pte_mknonnuma(pte_t pte);
+extern pmd_t pmd_mknonnuma(pmd_t pmd);
+extern pte_t pte_mknuma(pte_t pte);
+extern pmd_t pmd_mknuma(pmd_t pmd);
 
 #endif /* CONFIG_MMU */
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..6b6507f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -170,3 +170,56 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
+
+/*
+ * _PAGE_NUMA works identically to _PAGE_PROTNONE (it's actually the
+ * same bit too). It's set only when _PAGE_PRESENT is not set and it's
+ * never set if _PAGE_PRESENT is set.
+ *
+ * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
+ * fault triggers on those regions if pte/pmd_numa returns true
+ * (because _PAGE_PRESENT is not set).
+ */
+__weak int pte_numa(pte_t pte)
+{
+	return (pte_flags(pte) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+
+__weak int pmd_numa(pmd_t pmd)
+{
+	return (pmd_flags(pmd) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+
+/*
+ * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
+ * because they're called by the NUMA hinting minor page fault. If we
+ * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
+ * would be forced to set it later while filling the TLB after we
+ * return to userland. That would trigger a second write to memory
+ * that we optimize away by setting _PAGE_ACCESSED here.
+ */
+__weak pte_t pte_mknonnuma(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_NUMA);
+	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+__weak pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
+	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+__weak pte_t pte_mknuma(pte_t pte)
+{
+	pte = pte_set_flags(pte, _PAGE_NUMA);
+	return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+__weak pmd_t pmd_mknuma(pmd_t pmd)
+{
+	pmd = pmd_set_flags(pmd, _PAGE_NUMA);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
-- 
1.7.9.2



* [PATCH 07/43] mm: numa: Support NUMA hinting page faults from gup/gup_fast
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (5 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 14:09   ` Rik van Riel
  2012-11-16 11:22 ` [PATCH 08/43] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte Mel Gorman
                   ` (36 subsequent siblings)
  43 siblings, 1 reply; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

Introduce FOLL_NUMA to tell follow_page to check
pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
so because it always invokes handle_mm_fault and retries the
follow_page later.

KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.

Other follow_page callers like KSM should not use FOLL_NUMA, or they
would fail to get the pages if they use follow_page instead of
get_user_pages.
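
As a sketch of the distinction (hypothetical caller, not from the patch),
such a follow_page() user simply leaves FOLL_NUMA out of its flag mask so
that pages behind _PAGE_NUMA mappings are still returned:

/* Sketch: a follow_page() caller that must not trip over _PAGE_NUMA
 * mappings omits FOLL_NUMA; get_user_pages() callers get it added
 * for them automatically. */
static struct page *peek_page(struct vm_area_struct *vma, unsigned long addr)
{
	return follow_page(vma, addr, FOLL_GET);
}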

[ This patch was picked up from the AutoNUMA tree. ]

Originally-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ ported to this tree. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h |    1 +
 mm/memory.c        |   17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..e64af99 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1569,6 +1569,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_MLOCK	0x40	/* mark page as mlocked */
 #define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
+#define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/memory.c b/mm/memory.c
index fb135ba..de8aa11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1517,6 +1517,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
 		goto out;
 	}
+	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+		goto no_page_table;
 	if (pmd_trans_huge(*pmd)) {
 		if (flags & FOLL_SPLIT) {
 			split_huge_page_pmd(mm, pmd);
@@ -1546,6 +1548,8 @@ split_fallthrough:
 	pte = *ptep;
 	if (!pte_present(pte))
 		goto no_page;
+	if ((flags & FOLL_NUMA) && pte_numa(pte))
+		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
 
@@ -1697,6 +1701,19 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
 	vm_flags &= (gup_flags & FOLL_FORCE) ?
 			(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+
+	/*
+	 * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
+	 * would be called on PROT_NONE ranges. We must never invoke
+	 * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
+	 * page faults would unprotect the PROT_NONE ranges if
+	 * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
+	 * bitflag. So to avoid that, don't set FOLL_NUMA if
+	 * FOLL_FORCE is set.
+	 */
+	if (!(gup_flags & FOLL_FORCE))
+		gup_flags |= FOLL_NUMA;
+
 	i = 0;
 
 	do {
-- 
1.7.9.2



* [PATCH 08/43] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (6 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 07/43] mm: numa: Support NUMA hinting page faults from gup/gup_fast Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 09/43] mm: numa: Create basic numa page hinting infrastructure Mel Gorman
                   ` (35 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

When we split a transparent hugepage, transfer the NUMA type from the
pmd to the pte if needed.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/huge_memory.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..3aaf242 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1363,6 +1363,8 @@ static int __split_huge_page_map(struct page *page,
 				BUG_ON(page_mapcount(page) != 1);
 			if (!pmd_young(*pmd))
 				entry = pte_mkold(entry);
+			if (pmd_numa(*pmd))
+				entry = pte_mknuma(entry);
 			pte = pte_offset_map(&_pmd, haddr);
 			BUG_ON(!pte_none(*pte));
 			set_pte_at(mm, haddr, pte, entry);
-- 
1.7.9.2



* [PATCH 09/43] mm: numa: Create basic numa page hinting infrastructure
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (7 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 08/43] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 10/43] mm: mempolicy: Make MPOL_LOCAL a real policy Mel Gorman
                   ` (34 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

Note: This patch started as "mm/mpol: Create special PROT_NONE
	infrastructure" and preserves the basic idea but steals *very*
	heavily from "autonuma: numa hinting page faults entry points" for
	the actual fault handlers without the migration parts.	The end
	result is barely recognisable as either patch so all Signed-off
	and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
	this version, I will re-add the signed-offs-by to reflect the history.

In order to facilitate a lazy -- fault driven -- migration of pages, create
a special transient PAGE_NUMA variant; we can then use the 'spurious'
protection faults to drive our migrations.

The meaning of PAGE_NUMA depends on the architecture, but on x86 it is
effectively PROT_NONE. Actual PROT_NONE mappings will not generate these
NUMA faults because the page fault code checks the permissions on the VMA
(and will deliver a segmentation fault on actual PROT_NONE mappings) before
it ever calls handle_mm_fault.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/huge_mm.h |   10 +++++
 mm/huge_memory.c        |   21 ++++++++++
 mm/memory.c             |   98 +++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 126 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..a13ebb1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -159,6 +159,10 @@ static inline struct page *compound_trans_head(struct page *page)
 	}
 	return page;
 }
+
+extern int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
+				  pmd_t pmd, pmd_t *pmdp);
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -195,6 +199,12 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
 {
 	return 0;
 }
+
+static inline int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
+					pmd_t pmd, pmd_t *pmdp)
+{
+}
+
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3aaf242..92a64d2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1017,6 +1017,27 @@ out:
 	return page;
 }
 
+/* NUMA hinting page fault entry point for trans huge pmds */
+int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
+				pmd_t pmd, pmd_t *pmdp)
+{
+	struct page *page;
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(pmd, *pmdp)))
+		goto out_unlock;
+
+	page = pmd_page(pmd);
+	pmd = pmd_mknonnuma(pmd);
+	set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+	VM_BUG_ON(pmd_numa(*pmdp));
+	update_mmu_cache_pmd(vma, addr, ptep);
+
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+	return 0;
+}
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
diff --git a/mm/memory.c b/mm/memory.c
index de8aa11..4291fa3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3450,6 +3450,89 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+{
+	struct page *page;
+	spinlock_t *ptl;
+
+	/*
+	* The "pte" at this point cannot be used safely without
+	* validation through pte_unmap_same(). It's of NUMA type but
+	* the pfn may be screwed if the read is non atomic.
+	*
+	* ptep_modify_prot_start is not called as this is clearing
+	* the _PAGE_NUMA bit and it is not really expected that there
+	* would be concurrent hardware modifications to the PTE.
+	*/
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*ptep, pte)))
+		goto out_unlock;
+	pte = pte_mknonnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
+	page = vm_normal_page(vma, addr, pte);
+	BUG_ON(!page);
+	update_mmu_cache(vma, addr, ptep);
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+	return 0;
+}
+
+/* NUMA hinting page fault entry point for regular pmds */
+int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t pmd;
+	pte_t *pte, *orig_pte;
+	unsigned long _addr = addr & PMD_MASK;
+	unsigned long offset;
+	spinlock_t *ptl;
+	bool numa = false;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = *pmdp;
+	if (pmd_numa(pmd)) {
+		set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
+		numa = true;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (!numa)
+		return 0;
+
+	/* we're in a page fault so some vma must be in the range */
+	BUG_ON(!vma);
+	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
+	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
+	VM_BUG_ON(offset >= PMD_SIZE);
+	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+	pte += offset >> PAGE_SHIFT;
+	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+		pte_t pteval = *pte;
+		struct page *page;
+		if (!pte_present(pteval))
+			continue;
+		if (addr >= vma->vm_end) {
+			vma = find_vma(mm, addr);
+			/* there's a pte present so there must be a vma */
+			BUG_ON(!vma);
+			BUG_ON(addr < vma->vm_start);
+		}
+		if (pte_numa(pteval)) {
+			pteval = pte_mknonnuma(pteval);
+			set_pte_at(mm, addr, pte, pteval);
+		}
+		page = vm_normal_page(vma, addr, pteval);
+		if (unlikely(!page))
+			continue;
+	}
+	pte_unmap_unlock(orig_pte, ptl);
+
+	return 0;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3488,6 +3571,9 @@ int handle_pte_fault(struct mm_struct *mm,
 					pte, pmd, flags, entry);
 	}
 
+	if (pte_numa(entry))
+		return do_numa_page(mm, vma, address, entry, pte, pmd);
+
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
@@ -3556,9 +3642,11 @@ retry:
 
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
-			if (flags & FAULT_FLAG_WRITE &&
-			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd)) {
+			if (pmd_numa(*pmd))
+				return do_huge_pmd_numa_page(mm, address,
+							     orig_pmd, pmd);
+
+			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
 				/*
@@ -3570,10 +3658,14 @@ retry:
 					goto retry;
 				return ret;
 			}
+
 			return 0;
 		}
 	}
 
+	if (pmd_numa(*pmd))
+		return do_pmd_numa_page(mm, vma, address, pmd);
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could
-- 
1.7.9.2



* [PATCH 10/43] mm: mempolicy: Make MPOL_LOCAL a real policy
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (8 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 09/43] mm: numa: Create basic numa page hinting infrastructure Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 11/43] mm: mempolicy: Add MPOL_MF_NOOP Mel Gorman
                   ` (33 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Make MPOL_LOCAL a real and exposed policy such that applications that
relied on the previous default behaviour can explicitly request it.
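
As an illustrative userspace sketch (assuming a numaif.h that already
exposes MPOL_LOCAL), the previously implicit default can now be requested
explicitly:

#include <stdio.h>
#include <numaif.h>

/* Sketch: explicitly request node-local allocation for the calling task. */
int main(void)
{
	if (set_mempolicy(MPOL_LOCAL, NULL, 0) != 0) {
		perror("set_mempolicy(MPOL_LOCAL)");
		return 1;
	}
	return 0;
}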

Requested-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |    9 ++++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 23e62e0..3e835c9 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -20,6 +20,7 @@ enum {
 	MPOL_PREFERRED,
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
+	MPOL_LOCAL,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 66e90ec..54bd3e5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -269,6 +269,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 			     (flags & MPOL_F_RELATIVE_NODES)))
 				return ERR_PTR(-EINVAL);
 		}
+	} else if (mode == MPOL_LOCAL) {
+		if (!nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		mode = MPOL_PREFERRED;
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -2399,7 +2403,6 @@ void numa_default_policy(void)
  * "local" is pseudo-policy:  MPOL_PREFERRED with MPOL_F_LOCAL flag
  * Used only for mpol_parse_str() and mpol_to_str()
  */
-#define MPOL_LOCAL MPOL_MAX
 static const char * const policy_modes[] =
 {
 	[MPOL_DEFAULT]    = "default",
@@ -2452,12 +2455,12 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 	if (flags)
 		*flags++ = '\0';	/* terminate mode string */
 
-	for (mode = 0; mode <= MPOL_LOCAL; mode++) {
+	for (mode = 0; mode < MPOL_MAX; mode++) {
 		if (!strcmp(str, policy_modes[mode])) {
 			break;
 		}
 	}
-	if (mode > MPOL_LOCAL)
+	if (mode >= MPOL_MAX)
 		goto out;
 
 	switch (mode) {
-- 
1.7.9.2



* [PATCH 11/43] mm: mempolicy: Add MPOL_MF_NOOP
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (9 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 10/43] mm: mempolicy: Make MPOL_LOCAL a real policy Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 12/43] mm: mempolicy: Check for misplaced page Mel Gorman
                   ` (32 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

NOTE: I have not yet addressed my own review feedback of this patch. At
	this point I'm trying to construct a baseline tree and will apply
	my own review feedback later and then fold it in.

This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
to mbind().  When the NOOP policy is used with the MOVE and LAZY
flags, mbind() will map the pages PROT_NONE so that they will be
migrated on the next touch.

This allows an application to prepare for a new phase of operation
where different regions of shared storage will be assigned to
worker threads, w/o changing policy.  Note that we could just use
"default" policy in this case.  However, this also allows an
application to request that pages be migrated, only if necessary,
to follow any arbitrary policy that might currently apply to a
range of pages, without knowing the policy, or without specifying
multiple mbind()s for ranges with different policies.
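
A hypothetical caller might use it along these lines (sketch only; assumes
headers that expose MPOL_NOOP and the MPOL_MF_LAZY flag added later in this
series):

#include <numaif.h>

/* Sketch: mark an already-policied range for migrate-on-next-touch
 * without changing whatever policy currently applies to it. */
static int mark_lazy(void *addr, unsigned long len)
{
	return mbind(addr, len, MPOL_NOOP, NULL, 0,
		     MPOL_MF_MOVE | MPOL_MF_LAZY);
}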

[ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   11 ++++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3e835c9..d23dca8 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -21,6 +21,7 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
+	MPOL_NOOP,		/* retain existing policy for range */
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 54bd3e5..c21e914 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 
-	if (mode == MPOL_DEFAULT) {
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
-		return NULL;	/* simply delete any existing policy */
+		return NULL;
 	}
 	VM_BUG_ON(!nodes);
 
@@ -1147,7 +1147,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -2409,7 +2409,8 @@ static const char * const policy_modes[] =
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
-	[MPOL_LOCAL]      = "local"
+	[MPOL_LOCAL]      = "local",
+	[MPOL_NOOP]	  = "noop",	/* should not actually be used */
 };
 
 
@@ -2460,7 +2461,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 			break;
 		}
 	}
-	if (mode >= MPOL_MAX)
+	if (mode >= MPOL_MAX || mode == MPOL_NOOP)
 		goto out;
 
 	switch (mode) {
-- 
1.7.9.2



* [PATCH 12/43] mm: mempolicy: Check for misplaced page
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (10 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 11/43] mm: mempolicy: Add MPOL_MF_NOOP Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 13/43] mm: migrate: Introduce migrate_misplaced_page() Mel Gorman
                   ` (31 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped.  This involves
looking up the node where the page belongs.  So, the function
returns that node so that it may be used to allocate the page
without consulting the policy again.

A subsequent patch will call this function from the fault path.
Because of this, I don't want to go ahead and allocate the page, e.g.,
via alloc_page_vma() only to have to free it if it has the correct
policy.  So, I just mimic the alloc_page_vma() node computation
logic--sort of.

Note:  we could use this function to implement a MPOL_MF_STRICT
behavior when migrating pages to match mbind() mempolicy--e.g.,
to ensure that pages in an interleaved range are reinterleaved
rather than left where they are when they reside on any page in
the interleave nodemask.
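
For illustration only, a fault-path caller would use the new function
roughly as follows (hypothetical helper; the real call sites are wired up
by later patches in the series):

/* Sketch: ask the policy layer where a faulting page should live and
 * fall back to its current node when it is not misplaced. */
static int numa_target_node(struct page *page, struct vm_area_struct *vma,
			    unsigned long address)
{
	int target_nid = mpol_misplaced(page, vma, address);

	return (target_nid == -1) ? page_to_nid(page) : target_nid;
}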

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ Added MPOL_F_LAZY to trigger migrate-on-fault;
  simplified code now that we don't have to bother
  with special crap for interleaved ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mempolicy.h      |    8 +++++
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   76 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..c511e25 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -198,6 +198,8 @@ static inline int vma_migratable(struct vm_area_struct *vma)
 	return 1;
 }
 
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
 #else
 
 struct mempolicy {};
@@ -323,5 +325,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
 	return 0;
 }
 
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	return -1; /* no node preference */
+}
+
 #endif /* CONFIG_NUMA */
 #endif
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index d23dca8..472de8a 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -61,6 +61,7 @@ enum mpol_rebind_step {
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
+#define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c21e914..df1466d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2181,6 +2181,82 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page   - page to be checked
+ * @vma    - vm area where page mapped
+ * @addr   - virtual address where page mapped
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ *	-1	- not misplaced, page is in the right node
+ *	node	- node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol;
+	struct zone *zone;
+	int curnid = page_to_nid(page);
+	unsigned long pgoff;
+	int polnid = -1;
+	int ret = -1;
+
+	BUG_ON(!vma);
+
+	pol = get_vma_policy(current, vma, addr);
+	if (!(pol->flags & MPOL_F_MOF))
+		goto out;
+
+	switch (pol->mode) {
+	case MPOL_INTERLEAVE:
+		BUG_ON(addr >= vma->vm_end);
+		BUG_ON(addr < vma->vm_start);
+
+		pgoff = vma->vm_pgoff;
+		pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+		polnid = offset_il_node(pol, vma, pgoff);
+		break;
+
+	case MPOL_PREFERRED:
+		if (pol->flags & MPOL_F_LOCAL)
+			polnid = numa_node_id();
+		else
+			polnid = pol->v.preferred_node;
+		break;
+
+	case MPOL_BIND:
+		/*
+		 * allows binding to multiple nodes.
+		 * use current page if in policy nodemask,
+		 * else select nearest allowed node, if any.
+		 * If no allowed nodes, use current [!misplaced].
+		 */
+		if (node_isset(curnid, pol->v.nodes))
+			goto out;
+		(void)first_zones_zonelist(
+				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+				gfp_zone(GFP_HIGHUSER),
+				&pol->v.nodes, &zone);
+		polnid = zone->node;
+		break;
+
+	default:
+		BUG();
+	}
+	if (curnid != polnid)
+		ret = polnid;
+out:
+	mpol_cond_put(pol);
+
+	return ret;
+}
+
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);
-- 
1.7.9.2



* [PATCH 13/43] mm: migrate: Introduce migrate_misplaced_page()
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (11 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 12/43] mm: mempolicy: Check for misplaced page Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-19 19:44   ` [tip:numa/core] mm/migration: Improve migrate_misplaced_page() tip-bot for Mel Gorman
  2012-11-16 11:22 ` [PATCH 14/43] mm: mempolicy: Use _PAGE_NUMA to migrate pages Mel Gorman
                   ` (30 subsequent siblings)
  43 siblings, 1 reply; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Note: This was originally based on Peter's patch "mm/migrate: Introduce
	migrate_misplaced_page()" but borrows extremely heavily from Andrea's
	"autonuma: memory follows CPU algorithm and task/mm_autonuma stats
	collection". The end result is barely recognisable so signed-offs
	had to be dropped. If original authors are ok with it, I'll
	re-add the signed-off-bys.

Add migrate_misplaced_page() which deals with migrating pages from
faults.
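
A fault-path caller is expected to use it along these lines (sketch only;
the real call site is added by the next patch in the series):

/* Sketch: try to move a misplaced page towards target_nid. The caller
 * takes a reference that, per the helper's documented contract, is
 * dropped before migrate_misplaced_page() returns. Returns non-zero
 * if the page was isolated and queued for migration. */
static int try_numa_migrate(struct page *page, int target_nid)
{
	get_page(page);
	return migrate_misplaced_page(page, target_nid);
}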

Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Based-on-work-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/migrate.h |    8 ++++
 mm/migrate.c            |  104 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9d1c159..69f60b5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -13,6 +13,7 @@ enum migrate_reason {
 	MR_MEMORY_HOTPLUG,
 	MR_SYSCALL,		/* also applies to cpusets */
 	MR_MEMPOLICY_MBIND,
+	MR_NUMA_MISPLACED,
 	MR_CMA
 };
 
@@ -39,6 +40,7 @@ extern int migrate_vmas(struct mm_struct *mm,
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
+extern int migrate_misplaced_page(struct page *page, int node);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
@@ -72,5 +74,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #define migrate_page NULL
 #define fail_migrate_page NULL
 
+static inline
+int migrate_misplaced_page(struct page *page, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_MIGRATION */
+
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 27be9c9..4a92808 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -282,7 +282,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
 		struct buffer_head *head, enum migrate_mode mode)
 {
-	int expected_count;
+	int expected_count = 0;
 	void **pslot;
 
 	if (!mapping) {
@@ -1415,4 +1415,104 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
  	}
  	return err;
 }
-#endif
+
+/*
+ * Returns true if this is a safe migration target node for misplaced NUMA
+ * pages. Currently it only checks the watermarks, which is crude.
+ */
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+				   int nr_migrate_pages)
+{
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static struct page *alloc_misplaced_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+
+	newpage = alloc_pages_exact_node(nid,
+					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
+					  __GFP_NOMEMALLOC | __GFP_NORETRY |
+					  __GFP_NOWARN) &
+					 ~GFP_IOFS, 0);
+	return newpage;
+}
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node. Caller is expected to have an elevated reference count on
+ * the page that will be dropped by this function before returning.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+	int isolated = 0;
+	LIST_HEAD(migratepages);
+
+	/*
+	 * Don't migrate pages that are mapped in multiple processes.
+	 * TODO: Handle false sharing detection instead of this hammer
+	 */
+	if (page_mapcount(page) != 1)
+		goto out;
+
+	/* Avoid migrating to a node that is nearly full */
+	if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+		int page_lru;
+
+		if (isolate_lru_page(page)) {
+			put_page(page);
+			goto out;
+		}
+		isolated = 1;
+
+		/*
+		 * Page is isolated which takes a reference count so now the
+		 * caller's reference can be safely dropped without the page
+		 * disappearing underneath us during migration
+		 */
+		put_page(page);
+
+		page_lru = page_is_file_cache(page);
+		inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+		list_add(&page->lru, &migratepages);
+	}
+
+	if (isolated) {
+		int nr_remaining;
+
+		nr_remaining = migrate_pages(&migratepages,
+				alloc_misplaced_dst_page,
+				node, false, MIGRATE_ASYNC,
+				MR_NUMA_MISPLACED);
+		if (nr_remaining) {
+			putback_lru_pages(&migratepages);
+			isolated = 0;
+		}
+	}
+	BUG_ON(!list_empty(&migratepages));
+out:
+	return isolated;
+}
+
+#endif /* CONFIG_NUMA */
-- 
1.7.9.2



* [PATCH 14/43] mm: mempolicy: Use _PAGE_NUMA to migrate pages
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (12 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 13/43] mm: migrate: Introduce migrate_misplaced_page() Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 16:08   ` Rik van Riel
  2012-11-16 11:22 ` [PATCH 15/43] mm: mempolicy: Add MPOL_MF_LAZY Mel Gorman
                   ` (29 subsequent siblings)
  43 siblings, 1 reply; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
	sufficiently different that the signed-off-bys were dropped

Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
pieces into an effective migrate on fault scheme.

Note that (on x86) we rely on PROT_NONE pages being !present and avoid
the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
page-migration performance.

Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/huge_mm.h |    8 ++++----
 mm/huge_memory.c        |   32 +++++++++++++++++++++++++++++---
 mm/memory.c             |   44 ++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a13ebb1..406f81c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -160,8 +160,8 @@ static inline struct page *compound_trans_head(struct page *page)
 	return page;
 }
 
-extern int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
-				  pmd_t pmd, pmd_t *pmdp);
+extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				unsigned long addr, pmd_t pmd, pmd_t *pmdp);
 
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
@@ -200,8 +200,8 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
 	return 0;
 }
 
-static inline int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
-					pmd_t pmd, pmd_t *pmdp)
+static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+					unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 92a64d2..1453c30 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
 #include <linux/freezer.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
+#include <linux/migrate.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -1018,16 +1019,39 @@ out:
 }
 
 /* NUMA hinting page fault entry point for trans huge pmds */
-int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
-				pmd_t pmd, pmd_t *pmdp)
+int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
-	struct page *page;
+	struct page *page = NULL;
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	int target_nid;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	target_nid = mpol_misplaced(page, vma, haddr);
+	if (target_nid == -1)
+		goto clear_pmdnuma;
+
+	/*
+	 * Due to lacking code to migrate thp pages, we'll split
+	 * (which preserves the special PROT_NONE) and re-take the
+	 * fault on the normal pages.
+	 */
+	split_huge_page(page);
+	put_page(page);
+	return 0;
+
+clear_pmdnuma:
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(pmd, *pmdp)))
+		goto out_unlock;
+
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
 	VM_BUG_ON(pmd_numa(*pmdp));
@@ -1035,6 +1059,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+	if (page)
+		put_page(page);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 4291fa3..d5dda73 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/migrate.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3453,8 +3454,9 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
 {
-	struct page *page;
+	struct page *page = NULL;
 	spinlock_t *ptl;
+	int current_nid, target_nid;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3469,14 +3471,48 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*ptep, pte)))
 		goto out_unlock;
-	pte = pte_mknonnuma(pte);
-	set_pte_at(mm, addr, ptep, pte);
+
 	page = vm_normal_page(vma, addr, pte);
 	BUG_ON(!page);
+
+	get_page(page);
+	current_nid = page_to_nid(page);
+	target_nid = mpol_misplaced(page, vma, addr);
+	if (target_nid == -1) {
+		/*
+		 * Account for the fault against the current node if it is not
+		 * being replaced, regardless of where the page is located.
+		 */
+		current_nid = numa_node_id();
+		goto clear_pmdnuma;
+	}
+	pte_unmap_unlock(ptep, ptl);
+
+	/* Migrate to the requested node */
+	if (migrate_misplaced_page(page, target_nid)) {
+		/*
+		 * If the page was migrated then the pte_same check below is
+		 * guaranteed to fail so just retry the entire fault.
+		 */
+		current_nid = target_nid;
+		goto out;
+	}
+	page = NULL;
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!pte_same(*ptep, pte))
+		goto out_unlock;
+
+clear_pmdnuma:
+	pte = pte_mknonnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
 
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
+	if (page)
+		put_page(page);
+out:
 	return 0;
 }
 
@@ -3643,7 +3679,7 @@ retry:
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
 			if (pmd_numa(*pmd))
-				return do_huge_pmd_numa_page(mm, address,
+				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);
 
 			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 15/43] mm: mempolicy: Add MPOL_MF_LAZY
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (13 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 14/43] mm: mempolicy: Use _PAGE_NUMA to migrate pages Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 16/43] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now Mel Gorman
                   ` (28 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

NOTE: Once again there is a lot of patch stealing and the end result
	is sufficiently different that I had to drop the signed-offs.
	Will re-add if the original authors are ok with that.

This patch adds another mbind() flag to request "lazy migration".  The
flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are marked PROT_NONE. The pages will be migrated in the fault
path on "first touch", if the policy dictates at that time.

"Lazy Migration" will allow testing of migrate-on-fault via mbind().
It also allows applications to specify that only subsequently touched
pages be migrated to obey the new policy, instead of all pages in the
range. This can be useful for multi-threaded applications working on a
large shared data area that is initialized by an initial thread, which
results in all pages residing on one [or a few, if overflowed] nodes.
After the PROT_NONE marking, the pages in regions assigned to the worker
threads will be automatically migrated local to the threads on first touch.
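
As an illustration of the intended usage only (this is a sketch, not part
of this patch, and a later patch in this series hides MPOL_MF_LAZY from
userspace again for now), a worker thread could request lazy migration of
its slice of the shared area along these lines, built against libnuma's
numaif.h and with made-up node numbers:

  #include <numaif.h>
  #include <sys/mman.h>
  #include <stdio.h>

  #ifndef MPOL_MF_LAZY
  #define MPOL_MF_LAZY	(1 << 3)	/* value added by this patch */
  #endif

  int main(void)
  {
  	size_t len = 64UL << 20;		/* the thread's 64M slice */
  	unsigned long nodemask = 1UL << 1;	/* migrate towards node 1 */
  	void *slice;

  	slice = mmap(NULL, len, PROT_READ | PROT_WRITE,
  		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  	if (slice == MAP_FAILED)
  		return 1;

  	/*
  	 * The pages are marked PROT_NONE immediately but are only
  	 * migrated towards node 1 as this thread actually touches them.
  	 */
  	if (mbind(slice, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
  		  MPOL_MF_MOVE | MPOL_MF_LAZY))
  		perror("mbind");

  	return 0;
  }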

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mm.h             |    3 +
 include/uapi/linux/mempolicy.h |   13 ++-
 mm/mempolicy.c                 |  177 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 175 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e64af99..a451a9f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1548,6 +1548,9 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
+void change_prot_numa(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
+
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 472de8a..6a1baae 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -49,9 +49,16 @@ enum mpol_rebind_step {
 
 /* Flags for mbind */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3)	/* Internal flags start here */
+#define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
+				   to policy */
+#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
+#define MPOL_MF_LAZY	 (1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+
+#define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
+			 MPOL_MF_MOVE     | 	\
+			 MPOL_MF_MOVE_ALL |	\
+			 MPOL_MF_LAZY)
 
 /*
  * Internal flags that share the struct mempolicy flags word with
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index df1466d..11052ea 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -90,6 +90,7 @@
 #include <linux/syscalls.h>
 #include <linux/ctype.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
@@ -566,6 +567,137 @@ static inline int check_pgd_range(struct vm_area_struct *vma,
 }
 
 /*
+ * Here we search for not shared page mappings (mapcount == 1) and we
+ * set up the pmd/pte_numa on those mappings so the very next access
+ * will fire a NUMA hinting page fault.
+ */
+static int
+change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	struct page *page;
+	unsigned long _address, end;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	VM_BUG_ON(address & ~PAGE_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		int page_nid;
+		ret = HPAGE_PMD_NR;
+
+		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+		if (pmd_numa(*pmd)) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		page = pmd_page(*pmd);
+
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		page_nid = page_to_nid(page);
+
+		if (pmd_numa(*pmd)) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		ret += HPAGE_PMD_NR;
+		/* defer TLB flush to lower the overhead */
+		spin_unlock(&mm->page_table_lock);
+		goto out;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		goto out;
+	VM_BUG_ON(!pmd_present(*pmd));
+
+	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _address < end;
+	     _pte++, _address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+		if (!pte_present(pteval))
+			continue;
+		if (pte_numa(pteval))
+			continue;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+
+		set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+		/* defer TLB flush to lower the overhead */
+		ret++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret && !pmd_numa(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		spin_unlock(&mm->page_table_lock);
+		/* defer TLB flush to lower the overhead */
+	}
+
+out:
+	return ret;
+}
+
+/* Assumes mmap_sem is held */
+void
+change_prot_numa(struct vm_area_struct *vma,
+			unsigned long address, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int progress = 0;
+
+	while (address < end) {
+		VM_BUG_ON(address < vma->vm_start ||
+			  address + PAGE_SIZE > vma->vm_end);
+
+		progress += change_prot_numa_range(mm, vma, address);
+		address = (address + PMD_SIZE) & PMD_MASK;
+	}
+
+	/*
+	 * Flush the TLB for the mm to start the NUMA hinting
+	 * page faults after we finish scanning this vma part
+	 * if there were any PTE updates
+	 */
+	if (progress) {
+		mmu_notifier_invalidate_range_start(vma->vm_mm, address, end);
+		flush_tlb_range(vma, address, end);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, address, end);
+	}
+}
+
+/*
  * Check if all pages in a range are on a set of nodes.
  * If pagelist != NULL then isolate pages from the LRU and
  * put them on the pagelist.
@@ -583,22 +715,32 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 		return ERR_PTR(-EFAULT);
 	prev = NULL;
 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+		unsigned long endvma = vma->vm_end;
+
+		if (endvma > end)
+			endvma = end;
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+
 		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
 			if (!vma->vm_next && vma->vm_end < end)
 				return ERR_PTR(-EFAULT);
 			if (prev && prev->vm_end < vma->vm_start)
 				return ERR_PTR(-EFAULT);
 		}
-		if (!is_vm_hugetlb_page(vma) &&
-		    ((flags & MPOL_MF_STRICT) ||
+
+		if (is_vm_hugetlb_page(vma))
+			goto next;
+
+		if (flags & MPOL_MF_LAZY) {
+			change_prot_numa(vma, start, endvma);
+			goto next;
+		}
+
+		if ((flags & MPOL_MF_STRICT) ||
 		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-				vma_migratable(vma)))) {
-			unsigned long endvma = vma->vm_end;
+		      vma_migratable(vma))) {
 
-			if (endvma > end)
-				endvma = end;
-			if (vma->vm_start > start)
-				start = vma->vm_start;
 			err = check_pgd_range(vma, start, endvma, nodes,
 						flags, private);
 			if (err) {
@@ -606,6 +748,7 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 				break;
 			}
 		}
+next:
 		prev = vma;
 	}
 	return first;
@@ -1138,8 +1281,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	int err;
 	LIST_HEAD(pagelist);
 
-	if (flags & ~(unsigned long)(MPOL_MF_STRICT |
-				     MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+	if (flags & ~(unsigned long)MPOL_MF_VALID)
 		return -EINVAL;
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
 		return -EPERM;
@@ -1162,6 +1304,9 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	if (flags & MPOL_MF_LAZY)
+		new->flags |= MPOL_F_MOF;
+
 	/*
 	 * If we are using the default policy then operation
 	 * on discontinuous address spaces is okay after all
@@ -1198,13 +1343,15 @@ static long do_mbind(unsigned long start, unsigned long len,
 	vma = check_range(mm, start, end, nmask,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
-	err = PTR_ERR(vma);
-	if (!IS_ERR(vma)) {
-		int nr_failed = 0;
-
+	err = PTR_ERR(vma);	/* maybe ... */
+	if (!IS_ERR(vma) && mode != MPOL_NOOP)
 		err = mbind_range(mm, start, end, new);
 
+	if (!err) {
+		int nr_failed = 0;
+
 		if (!list_empty(&pagelist)) {
+			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma,
 						false, MIGRATE_SYNC,
@@ -1213,7 +1360,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 				putback_lru_pages(&pagelist);
 		}
 
-		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+		if (nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
 	} else
 		putback_lru_pages(&pagelist);
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 16/43] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (14 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 15/43] mm: mempolicy: Add MPOL_MF_LAZY Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 16:22   ` Rik van Riel
  2012-11-16 11:22 ` [PATCH 17/43] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag Mel Gorman
                   ` (27 subsequent siblings)
  43 siblings, 1 reply; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to
explicitly request lazy migration is a good idea but the actual
API has not been well reviewed and, once released, it would have to be
supported indefinitely. For now this patch prevents an application from
using these services. This will need to be revisited.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    4 +---
 mm/mempolicy.c                 |    9 ++++-----
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 6a1baae..16fb4e6 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -21,7 +21,6 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
-	MPOL_NOOP,		/* retain existing policy for range */
 	MPOL_MAX,	/* always last member of enum */
 };
 
@@ -57,8 +56,7 @@ enum mpol_rebind_step {
 
 #define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
 			 MPOL_MF_MOVE     | 	\
-			 MPOL_MF_MOVE_ALL |	\
-			 MPOL_MF_LAZY)
+			 MPOL_MF_MOVE_ALL)
 
 /*
  * Internal flags that share the struct mempolicy flags word with
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 11052ea..09d477a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -252,7 +252,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 
-	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
+	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
 		return NULL;
@@ -1289,7 +1289,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
+	if (mode == MPOL_DEFAULT)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -1344,7 +1344,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
 	err = PTR_ERR(vma);	/* maybe ... */
-	if (!IS_ERR(vma) && mode != MPOL_NOOP)
+	if (!IS_ERR(vma))
 		err = mbind_range(mm, start, end, new);
 
 	if (!err) {
@@ -2633,7 +2633,6 @@ static const char * const policy_modes[] =
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
 	[MPOL_LOCAL]      = "local",
-	[MPOL_NOOP]	  = "noop",	/* should not actually be used */
 };
 
 
@@ -2684,7 +2683,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 			break;
 		}
 	}
-	if (mode >= MPOL_MAX || mode == MPOL_NOOP)
+	if (mode >= MPOL_MAX)
 		goto out;
 
 	switch (mode) {
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 17/43] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (15 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 16/43] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 18/43] mm: numa: Add fault driven placement and migration Mel Gorman
                   ` (26 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Ingo Molnar <mingo@kernel.org>

Allow architectures to opt-in to the adaptive affinity NUMA balancing code.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 init/Kconfig |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..17434ca 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -696,6 +696,13 @@ config LOG_BUF_SHIFT
 config HAVE_UNSTABLE_SCHED_CLOCK
 	bool
 
+#
+# For architectures that want to enable the PROT_NUMA driven,
+# NUMA-affine scheduler balancing logic:
+#
+config ARCH_SUPPORTS_NUMA_BALANCING
+	bool
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 18/43] mm: numa: Add fault driven placement and migration
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (16 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 17/43] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 19/43] mm: numa: Avoid double faulting after migrating misplaced page Mel Gorman
                   ` (25 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

NOTE: This patch is based on "sched, numa, mm: Add fault driven
	placement and migration policy" but as it throws away all the policy
	to just leave a basic foundation I had to drop the signed-offs-by.

This patch creates a bare-bones method for marking PTEs pte_numa from the
context of the scheduler so that, when later faulted, pages can be placed
on the node the CPU is running on.  In itself this does nothing useful but any
placement policy will fundamentally depend on receiving hints on placement
from fault context and doing something intelligent about it.
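
To make the eventual direction a little more concrete, the kind of
"something intelligent" a later placement policy could do with those hints
can be modelled in plain C as below. This is only a toy sketch and is not
what this patch implements (the hints are deliberately not used yet):
per-node fault counts are accumulated and the node receiving the most
hints would be the preferred node.

  #include <stdio.h>

  #define MAX_NODES	4

  static unsigned long faults[MAX_NODES];

  /* Toy model: account a hinting fault and pick a preferred node */
  static int toy_task_numa_fault(int node, int pages)
  {
  	int nid, preferred = 0;

  	faults[node] += pages;
  	for (nid = 1; nid < MAX_NODES; nid++)
  		if (faults[nid] > faults[preferred])
  			preferred = nid;
  	return preferred;
  }

  int main(void)
  {
  	toy_task_numa_fault(0, 1);
  	toy_task_numa_fault(2, 1);
  	printf("preferred node: %d\n", toy_task_numa_fault(2, 1));
  	return 0;
  }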

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 arch/sh/mm/Kconfig       |    1 +
 arch/x86/Kconfig         |    1 +
 include/linux/mm_types.h |   11 ++++
 include/linux/sched.h    |   20 ++++++++
 init/Kconfig             |   15 ++++++
 kernel/sched/core.c      |   13 +++++
 kernel/sched/fair.c      |  125 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h  |    7 +++
 kernel/sched/sched.h     |    6 +++
 kernel/sysctl.c          |   24 ++++++++-
 mm/huge_memory.c         |    5 +-
 mm/memory.c              |   15 +++++-
 12 files changed, 239 insertions(+), 4 deletions(-)

diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig
index cb8f992..0f7c852 100644
--- a/arch/sh/mm/Kconfig
+++ b/arch/sh/mm/Kconfig
@@ -111,6 +111,7 @@ config VSYSCALL
 config NUMA
 	bool "Non Uniform Memory Access (NUMA) Support"
 	depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
+	select ARCH_WANT_NUMA_VARIABLE_LOCALITY
 	default n
 	help
 	  Some SH systems have many various memories scattered around
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 46c3bff..02d0f2a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -22,6 +22,7 @@ config X86
 	def_bool y
 	select HAVE_AOUT if X86_32
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select ARCH_SUPPORTS_NUMA_BALANCING
 	select HAVE_IDE
 	select HAVE_OPROFILE
 	select HAVE_PCSPKR_PLATFORM
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3a..d82accb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -398,6 +398,17 @@ struct mm_struct {
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
 #endif
+#ifdef CONFIG_BALANCE_NUMA
+	/*
+	 * numa_next_scan is the next time when the PTEs will be marked
+	 * pte_numa to gather statistics and migrate pages to new nodes
+	 * if necessary
+	 */
+	unsigned long numa_next_scan;
+
+	/* numa_scan_seq prevents two threads setting pte_numa */
+	int numa_scan_seq;
+#endif
 	struct uprobes_state uprobes_state;
 };
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0dd42a0..ac71181 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1479,6 +1479,14 @@ struct task_struct {
 	short il_next;
 	short pref_node_fork;
 #endif
+#ifdef CONFIG_BALANCE_NUMA
+	int numa_scan_seq;
+	int numa_migrate_seq;
+	unsigned int numa_scan_period;
+	u64 node_stamp;			/* migration stamp  */
+	struct callback_head numa_work;
+#endif /* CONFIG_BALANCE_NUMA */
+
 	struct rcu_head rcu;
 
 	/*
@@ -1553,6 +1561,14 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+#ifdef CONFIG_BALANCE_NUMA
+extern void task_numa_fault(int node, int pages);
+#else
+static inline void task_numa_fault(int node, int pages)
+{
+}
+#endif
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
@@ -1990,6 +2006,10 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_balance_numa_scan_period_min;
+extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_settle_count;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
diff --git a/init/Kconfig b/init/Kconfig
index 17434ca..c15ae42 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -703,6 +703,21 @@ config HAVE_UNSTABLE_SCHED_CLOCK
 config ARCH_SUPPORTS_NUMA_BALANCING
 	bool
 
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config ARCH_WANT_NUMA_VARIABLE_LOCALITY
+	bool
+
+config BALANCE_NUMA
+	bool "Memory placement aware NUMA scheduler"
+	default n
+	depends on ARCH_SUPPORTS_NUMA_BALANCING
+	depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
+	depends on SMP && NUMA && MIGRATION
+	help
+	  This option adds support for automatic NUMA aware memory/task placement.
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..81fa185 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1533,6 +1533,19 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
+
+#ifdef CONFIG_BALANCE_NUMA
+	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+		p->mm->numa_next_scan = jiffies;
+		p->mm->numa_scan_seq = 0;
+	}
+
+	p->node_stamp = 0ULL;
+	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+	p->numa_work.next = &p->numa_work;
+#endif /* CONFIG_BALANCE_NUMA */
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..e8bdaef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,8 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/mempolicy.h>
+#include <linux/task_work.h>
 
 #include <trace/events/sched.h>
 
@@ -776,6 +778,126 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * numa task sample period in ms: 5s
+ */
+unsigned int sysctl_balance_numa_scan_period_min = 5000;
+unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
+
+static void task_numa_placement(struct task_struct *p)
+{
+	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+
+	if (p->numa_scan_seq == seq)
+		return;
+	p->numa_scan_seq = seq;
+
+	/* FIXME: Scheduling placement policy hints go here */
+}
+
+/*
+ * Got a PROT_NONE fault for a page on @node.
+ */
+void task_numa_fault(int node, int pages)
+{
+	struct task_struct *p = current;
+
+	/* FIXME: Allocate task-specific structure for placement policy here */
+
+	task_numa_placement(p);
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_work(struct callback_head *work)
+{
+	unsigned long migrate, next_scan, now = jiffies;
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+
+	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+
+	work->next = work; /* protect against double add */
+	/*
+	 * Who cares about NUMA placement when they're dying.
+	 *
+	 * NOTE: make sure not to dereference p->mm before this check,
+	 * exit_task_work() happens _after_ exit_mm() so we could be called
+	 * without p->mm even though we still had it when we enqueued this
+	 * work.
+	 */
+	if (p->flags & PF_EXITING)
+		return;
+
+	/*
+	 * Enforce maximal scan/migration frequency..
+	 */
+	migrate = mm->numa_next_scan;
+	if (time_before(now, migrate))
+		return;
+
+	if (WARN_ON_ONCE(p->numa_scan_period == 0))
+		p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+
+	next_scan = now + 2*msecs_to_jiffies(p->numa_scan_period);
+	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
+		return;
+
+	ACCESS_ONCE(mm->numa_scan_seq)++;
+	{
+		struct vm_area_struct *vma;
+
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (!vma_migratable(vma))
+				continue;
+			change_prot_numa(vma, vma->vm_start, vma->vm_end);
+		}
+		up_read(&mm->mmap_sem);
+	}
+}
+
+/*
+ * Drive the periodic memory faults..
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+	struct callback_head *work = &curr->numa_work;
+	u64 period, now;
+
+	/*
+	 * We don't care about NUMA placement if we don't have memory.
+	 */
+	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+		return;
+
+	/*
+	 * Using runtime rather than walltime has the dual advantage that
+	 * we (mostly) drive the selection from busy threads and that the
+	 * task needs to have done some actual work before we bother with
+	 * NUMA placement.
+	 */
+	now = curr->se.sum_exec_runtime;
+	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
+
+	if (now - curr->node_stamp > period) {
+		curr->node_stamp = now;
+
+		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
+			task_work_add(curr, work, true);
+		}
+	}
+}
+#else
+static void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -4954,6 +5076,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		cfs_rq = cfs_rq_of(se);
 		entity_tick(cfs_rq, se, queued);
 	}
+
+	if (sched_feat_numa(NUMA))
+		task_tick_numa(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index eebefca..7cfd289 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -61,3 +61,10 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
+
+/*
+ * Apply the automatic NUMA scheduling policy
+ */
+#ifdef CONFIG_BALANCE_NUMA
+SCHED_FEAT(NUMA,	true)
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7a7db09..9a43241 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -648,6 +648,12 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
 #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
+#ifdef CONFIG_BALANCE_NUMA
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 26f65ea..1359f51 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 100000;		/* 100 usecs */
 static int max_sched_granularity_ns = NSEC_PER_SEC;	/* 1 second */
 static int min_wakeup_granularity_ns;			/* 0 usecs */
 static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
+#ifdef CONFIG_SMP
 static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
 static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
-#endif
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHED_DEBUG */
 
 #ifdef CONFIG_COMPACTION
 static int min_extfrag_threshold;
@@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &min_wakeup_granularity_ns,
 		.extra2		= &max_wakeup_granularity_ns,
 	},
+#ifdef CONFIG_SMP
 	{
 		.procname	= "sched_tunable_scaling",
 		.data		= &sysctl_sched_tunable_scaling,
@@ -347,7 +350,24 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
-#endif
+#endif /* CONFIG_SMP */
+#ifdef CONFIG_BALANCE_NUMA
+	{
+		.procname	= "balance_numa_scan_period_min_ms",
+		.data		= &sysctl_balance_numa_scan_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "balance_numa_scan_period_max_ms",
+		.data		= &sysctl_balance_numa_scan_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif /* CONFIG_BALANCE_NUMA */
+#endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
 		.data		= &sysctl_sched_rt_period,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1453c30..ccff412 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1045,6 +1045,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	split_huge_page(page);
 	put_page(page);
+
 	return 0;
 
 clear_pmdnuma:
@@ -1059,8 +1060,10 @@ clear_pmdnuma:
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
-	if (page)
+	if (page) {
 		put_page(page);
+		task_numa_fault(numa_node_id(), HPAGE_PMD_NR);
+	}
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index d5dda73..ba5a7ff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3456,7 +3456,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid, target_nid;
+	int current_nid = -1;
+	int target_nid;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3486,6 +3487,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		current_nid = numa_node_id();
 		goto clear_pmdnuma;
 	}
+
 	pte_unmap_unlock(ptep, ptl);
 
 	/* Migrate to the requested node */
@@ -3513,6 +3515,7 @@ out_unlock:
 	if (page)
 		put_page(page);
 out:
+	task_numa_fault(current_nid, 1);
 	return 0;
 }
 
@@ -3548,6 +3551,7 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page *page;
+		int curr_nid;
 		if (!pte_present(pteval))
 			continue;
 		if (addr >= vma->vm_end) {
@@ -3563,6 +3567,15 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pteval);
 		if (unlikely(!page))
 			continue;
+		/* only check non-shared pages */
+		if (unlikely(page_mapcount(page) != 1))
+			continue;
+		pte_unmap_unlock(pte, ptl);
+
+		curr_nid = page_to_nid(page);
+		task_numa_fault(curr_nid, 1);
+
+		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 19/43] mm: numa: Avoid double faulting after migrating misplaced page
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (17 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 18/43] mm: numa: Add fault driven placement and migration Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 20/43] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Mel Gorman
                   ` (24 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

The pte_same check after a misplaced page is successfully migrated will
never succeed, forcing a double fault to fix it up, as pointed out by Rik
van Riel. This was the "safe" option but it's expensive.

This patch uses the migration allocation callback to record the location
of the newly migrated page. If the page is the same when the PTE lock is
reacquired it is assumed that it is safe to complete the pte_numa fault
without incurring a double fault.
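
The check the fault handler ends up making can be summarised with the
standalone sketch below (not kernel code, PTEs reduced to plain integers):
after the PTE lock is retaken, the fault is only completed if either
nothing changed (migration failed) or the PTE now maps exactly the page
that the allocation callback recorded.

  #include <stdbool.h>
  #include <stddef.h>

  struct page;

  /*
   * old/cur are the PTE values sampled before the lock was dropped and
   * after it was retaken, newpage is what the migration callback
   * allocated (NULL if migration failed) and curpage is the page the
   * current PTE maps. The real code also rechecks pte_numa.
   */
  static bool can_complete_fault(unsigned long old, unsigned long cur,
  			       struct page *newpage, struct page *curpage)
  {
  	if (!newpage)			/* migration failed or was skipped */
  		return cur == old;	/* PTE must be unchanged */

  	return curpage == newpage;	/* PTE must map the recorded page */
  }

  int main(void)
  {
  	/* migration failed, PTE unchanged: safe to finish the fault */
  	return can_complete_fault(1, 1, NULL, NULL) ? 0 : 1;
  }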

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/migrate.h |    4 ++--
 mm/memory.c             |   28 +++++++++++++++++-----------
 mm/migrate.c            |   27 ++++++++++++++++++---------
 3 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 69f60b5..e5ab5db 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -40,7 +40,7 @@ extern int migrate_vmas(struct mm_struct *mm,
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern struct page *migrate_misplaced_page(struct page *page, int node);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
@@ -75,7 +75,7 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #define fail_migrate_page NULL
 
 static inline
-int migrate_misplaced_page(struct page *page, int node)
+struct page *migrate_misplaced_page(struct page *page, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index ba5a7ff..2d13be92 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3454,7 +3454,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
 {
-	struct page *page = NULL;
+	struct page *page = NULL, *newpage = NULL;
 	spinlock_t *ptl;
 	int current_nid = -1;
 	int target_nid;
@@ -3491,19 +3491,26 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_unmap_unlock(ptep, ptl);
 
 	/* Migrate to the requested node */
-	if (migrate_misplaced_page(page, target_nid)) {
-		/*
-		 * If the page was migrated then the pte_same check below is
-		 * guaranteed to fail so just retry the entire fault.
-		 */
+	newpage = migrate_misplaced_page(page, target_nid);
+	if (newpage)
 		current_nid = target_nid;
-		goto out;
-	}
 	page = NULL;
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
-	if (!pte_same(*ptep, pte))
-		goto out_unlock;
+
+	/*
+	 * If we failed to migrate, check that the PTE has not changed during the
+	 * migration attempt; if it has, retry the fault. If the page did migrate,
+	 * relookup the ptep and confirm it maps the new page to avoid double faulting.
+	 */
+	if (!newpage) {
+		if (!pte_same(*ptep, pte))
+			goto out_unlock;
+	} else {
+		pte = *ptep;
+		if (!pte_numa(pte) || vm_normal_page(vma, addr, pte) != newpage)
+			goto out_unlock;
+	}
 
 clear_pmdnuma:
 	pte = pte_mknonnuma(pte);
@@ -3514,7 +3521,6 @@ out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 	if (page)
 		put_page(page);
-out:
 	task_numa_fault(current_nid, 1);
 	return 0;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 4a92808..631b2c5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1444,19 +1444,23 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 	return false;
 }
 
+struct misplaced_request
+{
+	int nid;		/* Node to migrate to */
+	struct page *newpage;	/* New location of page */
+};
+
 static struct page *alloc_misplaced_dst_page(struct page *page,
 					   unsigned long data,
 					   int **result)
 {
-	int nid = (int) data;
-	struct page *newpage;
-
-	newpage = alloc_pages_exact_node(nid,
+	struct misplaced_request *req = (struct misplaced_request *)data;
+	req->newpage = alloc_pages_exact_node(req->nid,
 					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
 					  __GFP_NOMEMALLOC | __GFP_NORETRY |
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
-	return newpage;
+	return req->newpage;
 }
 
 /*
@@ -1464,8 +1468,12 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+struct page *migrate_misplaced_page(struct page *page, int node)
 {
+	struct misplaced_request req = {
+		.nid = node,
+		.newpage = NULL,
+	};
 	int isolated = 0;
 	LIST_HEAD(migratepages);
 
@@ -1503,16 +1511,17 @@ int migrate_misplaced_page(struct page *page, int node)
 
 		nr_remaining = migrate_pages(&migratepages,
 				alloc_misplaced_dst_page,
-				node, false, MIGRATE_ASYNC,
+				(unsigned long)&req,
+				false, MIGRATE_ASYNC,
 				MR_NUMA_MISPLACED);
 		if (nr_remaining) {
 			putback_lru_pages(&migratepages);
-			isolated = 0;
+			req.newpage = NULL;
 		}
 	}
 	BUG_ON(!list_empty(&migratepages));
 out:
-	return isolated;
+	return req.newpage;
 }
 
 #endif /* CONFIG_NUMA */
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 20/43] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (18 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 19/43] mm: numa: Avoid double faulting after migrating misplaced page Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 21/43] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Mel Gorman
                   ` (23 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

 - it samples the working set at dissimilar rates,
   giving some tasks a sampling quality advantage
   over others.

 - creates performance problems for tasks with very
   large working sets

 - over-samples processes with large address spaces but
   which only very rarely execute

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it at a constant rate (proportional to CPU cycles
executed). If the offset reaches the last mapped address of the
mm, it starts over at the first address.

The per-task nature of the working set sampling functionality in this tree
allows such constant rate, per task, execution-weight proportional sampling
of the working set, with an adaptive sampling interval/frequency that
goes from once per 100ms up to just once per 8 seconds.  The current
sampling volume is 256 MB per interval.

As tasks mature and converge their working set, so does the
sampling rate slow down to just a trickle, 256 MB per 8
seconds of CPU time executed.

This, beyond being adaptive, also rate-limits rarely
executing systems and does not over-sample on overloaded
systems.
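
The arithmetic of the rotating window itself is simple enough to model
outside the kernel. The sketch below uses made-up sizes and ignores the
rate limiting, per-VMA filtering and page granularity of the real
task_numa_work():

  #include <stdio.h>

  /*
   * Minimal model of the rotating scan window: each invocation covers
   * scan_size bytes starting at *offset and advances the cursor,
   * wrapping back to the start of the address space at the end.
   */
  static void scan_window(unsigned long *offset, unsigned long scan_size,
  			unsigned long mm_end,
  			unsigned long *start, unsigned long *end)
  {
  	*start = *offset;
  	*end = *start + scan_size;
  	if (*end >= mm_end) {
  		*end = mm_end;
  		*offset = 0;		/* start over at the first address */
  	} else {
  		*offset = *end;
  	}
  }

  int main(void)
  {
  	unsigned long offset = 0, start, end;
  	unsigned long mm_end = 1UL << 30;	/* pretend 1G of mappings */
  	int i;

  	for (i = 0; i < 5; i++) {
  		scan_window(&offset, 256UL << 20, mm_end, &start, &end);
  		printf("scan [%lx-%lx)\n", start, end);
  	}
  	return 0;
  }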

[ In AutoNUMA speak, this patch deals with the effective sampling
  rate of the 'hinting page fault'. AutoNUMA's scanning is
  currently rate-limited, but it is also fundamentally
  single-threaded, executing in the knuma_scand kernel thread,
  so the limit in AutoNUMA is global and does not scale up with
  the number of CPUs, nor does it scan tasks in an execution
  proportional manner.

  So the idea of rate-limiting the scanning was first implemented
  in the AutoNUMA tree via a global rate limit. This patch goes
  beyond that by implementing an execution rate proportional
  working set sampling rate that is not implemented via a single
  global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the
  first version of this patch. ]

Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mm_types.h |    3 +++
 include/linux/sched.h    |    1 +
 kernel/sched/fair.c      |   65 ++++++++++++++++++++++++++++++++++++----------
 kernel/sysctl.c          |    7 +++++
 4 files changed, 63 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d82accb..b40f4ef 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -406,6 +406,9 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
+	/* Restart point for scanning and setting pte_numa */
+	unsigned long numa_scan_offset;
+
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ac71181..abb1c70 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2008,6 +2008,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_scan_size;
 extern unsigned int sysctl_balance_numa_settle_count;
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8bdaef..5f4382f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -780,10 +780,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_BALANCE_NUMA
 /*
- * numa task sample period in ms: 5s
+ * numa task sample period in ms
  */
-unsigned int sysctl_balance_numa_scan_period_min = 5000;
-unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
+unsigned int sysctl_balance_numa_scan_period_min = 100;
+unsigned int sysctl_balance_numa_scan_period_max = 100*16;
+
+/* Portion of address space to scan in MB */
+unsigned int sysctl_balance_numa_scan_size = 256;
 
 static void task_numa_placement(struct task_struct *p)
 {
@@ -808,6 +811,12 @@ void task_numa_fault(int node, int pages)
 	task_numa_placement(p);
 }
 
+static void reset_ptenuma_scan(struct task_struct *p)
+{
+	ACCESS_ONCE(p->mm->numa_scan_seq)++;
+	p->mm->numa_scan_offset = 0;
+}
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -817,6 +826,9 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	struct vm_area_struct *vma;
+	unsigned long offset, end;
+	long length;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -846,18 +858,45 @@ void task_numa_work(struct callback_head *work)
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-	ACCESS_ONCE(mm->numa_scan_seq)++;
-	{
-		struct vm_area_struct *vma;
+	offset = mm->numa_scan_offset;
+	length = sysctl_balance_numa_scan_size;
+	length <<= 20;
 
-		down_read(&mm->mmap_sem);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			if (!vma_migratable(vma))
-				continue;
-			change_prot_numa(vma, vma->vm_start, vma->vm_end);
-		}
-		up_read(&mm->mmap_sem);
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, offset);
+	if (!vma) {
+		reset_ptenuma_scan(p);
+		offset = 0;
+		vma = mm->mmap;
+	}
+	for (; vma && length > 0; vma = vma->vm_next) {
+		if (!vma_migratable(vma))
+			continue;
+
+		/* Skip small VMAs. They are not likely to be of relevance */
+		if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
+			continue;
+
+		offset = max(offset, vma->vm_start);
+		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+		length -= end - offset;
+
+		change_prot_numa(vma, offset, end);
+
+		offset = end;
 	}
+
+	/*
+	 * It is possible to reach the end of the VMA list but the last few VMAs are
+	 * not guaranteed to be vma_migratable. If they are not, we would find the
+	 * !migratable VMA on the next scan but not reset the scanner to the start
+	 * so check it now.
+	 */
+	if (vma)
+		mm->numa_scan_offset = offset;
+	else
+		reset_ptenuma_scan(p);
+	up_read(&mm->mmap_sem);
 }
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1359f51..d191203 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -366,6 +366,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "balance_numa_scan_size_mb",
+		.data		= &sysctl_balance_numa_scan_size,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #endif /* CONFIG_BALANCE_NUMA */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 21/43] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (19 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 20/43] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 22/43] mm: sched: numa: Implement slow start for working set sampling Mel Gorman
                   ` (22 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

By accounting against the present PTEs, scanning speed reflects the
actual present (mapped) memory.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h  |    2 +-
 kernel/sched/fair.c |   36 +++++++++++++++++++++---------------
 mm/mempolicy.c      |    6 ++++--
 3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a451a9f..34f8ce9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1548,7 +1548,7 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
-void change_prot_numa(struct vm_area_struct *vma,
+int change_prot_numa(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
 
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5f4382f..c673567 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -827,8 +827,8 @@ void task_numa_work(struct callback_head *work)
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
-	unsigned long offset, end;
-	long length;
+	unsigned long start, end;
+	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -858,18 +858,20 @@ void task_numa_work(struct callback_head *work)
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-	offset = mm->numa_scan_offset;
-	length = sysctl_balance_numa_scan_size;
-	length <<= 20;
+	start = mm->numa_scan_offset;
+	pages = sysctl_balance_numa_scan_size;
+	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+	if (!pages)
+		return;
 
 	down_read(&mm->mmap_sem);
-	vma = find_vma(mm, offset);
+	vma = find_vma(mm, start);
 	if (!vma) {
 		reset_ptenuma_scan(p);
-		offset = 0;
+		start = 0;
 		vma = mm->mmap;
 	}
-	for (; vma && length > 0; vma = vma->vm_next) {
+	for (; vma; vma = vma->vm_next) {
 		if (!vma_migratable(vma))
 			continue;
 
@@ -877,15 +879,19 @@ void task_numa_work(struct callback_head *work)
 		if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
 			continue;
 
-		offset = max(offset, vma->vm_start);
-		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
-		length -= end - offset;
-
-		change_prot_numa(vma, offset, end);
+		do {
+			start = max(start, vma->vm_start);
+			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
+			end = min(end, vma->vm_end);
+			pages -= change_prot_numa(vma, start, end);
 
-		offset = end;
+			start = end;
+			if (pages <= 0)
+				goto out;
+		} while (end != vma->vm_end);
 	}
 
+out:
 	/*
 	 * It is possible to reach the end of the VMA list but the last few VMAs are
 	 * not guaranteed to be vma_migratable. If they are not, we would find the
@@ -893,7 +899,7 @@ void task_numa_work(struct callback_head *work)
 	 * so check it now.
 	 */
 	if (vma)
-		mm->numa_scan_offset = offset;
+		mm->numa_scan_offset = start;
 	else
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 09d477a..e1534e3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -669,8 +669,8 @@ out:
 	return ret;
 }
 
-/* Assumes mmap_sem is held */
-void
+/* Assumes mmap_sem is held. Returns range of base ptes updated */
+int
 change_prot_numa(struct vm_area_struct *vma,
 			unsigned long address, unsigned long end)
 {
@@ -695,6 +695,8 @@ change_prot_numa(struct vm_area_struct *vma,
 		flush_tlb_range(vma, address, end);
 		mmu_notifier_invalidate_range_end(vma->vm_mm, address, end);
 	}
+
+	return progress;
 }
 
 /*
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 22/43] mm: sched: numa: Implement slow start for working set sampling
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (20 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 21/43] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 23/43] mm: numa: Add pte updates, hinting and migration stats Mel Gorman
                   ` (21 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Add a 1 second delay before starting to scan the working set of
a task and starting to balance it amongst nodes.

[ note that before the constant per task WSS sampling rate patch
  the initial scan would happen much later still, in effect that
  patch caused this regression. ]

The theory is that short-run tasks benefit very little from NUMA
placement: they come and go, and they better stick to the node
they were started on. As tasks mature and rebalance to other CPUs
and nodes, so does their NUMA placement have to change and so
does it start to matter more and more.

In practice this change fixes an observable kbuild regression:

   # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

   !NUMA:
   45.291088843 seconds time elapsed                                          ( +-  0.40% )
   45.154231752 seconds time elapsed                                          ( +-  0.36% )

   +NUMA, no slow start:
   46.172308123 seconds time elapsed                                          ( +-  0.30% )
   46.343168745 seconds time elapsed                                          ( +-  0.25% )

   +NUMA, 1 sec slow start:
   45.224189155 seconds time elapsed                                          ( +-  0.25% )
   45.160866532 seconds time elapsed                                          ( +-  0.17% )

and it also fixes an observable perf bench (hackbench) regression:

   # perf stat --null --repeat 10 perf bench sched messaging

   -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
   +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )

   +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )

The implementation is simple and straightforward, most of the patch
deals with adding the /proc/sys/kernel/balance_numa_scan_delay_ms tunable
knob.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote the changelog, ran measurements, tuned the default. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    2 +-
 kernel/sched/fair.c   |    5 +++++
 kernel/sysctl.c       |    7 +++++++
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index abb1c70..a2b06ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2006,6 +2006,7 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_balance_numa_scan_delay;
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
 extern unsigned int sysctl_balance_numa_scan_size;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81fa185..047e3c7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1543,7 +1543,7 @@ static void __sched_fork(struct task_struct *p)
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
-	p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+	p->numa_scan_period = sysctl_balance_numa_scan_delay;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_BALANCE_NUMA */
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c673567..1bf97b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -788,6 +788,9 @@ unsigned int sysctl_balance_numa_scan_period_max = 100*16;
 /* Portion of address space to scan in MB */
 unsigned int sysctl_balance_numa_scan_size = 256;
 
+/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
+unsigned int sysctl_balance_numa_scan_delay = 1000;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -929,6 +932,8 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
 
 	if (now - curr->node_stamp > period) {
+		if (!curr->node_stamp)
+			curr->numa_scan_period = sysctl_balance_numa_scan_period_min;
 		curr->node_stamp = now;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d191203..5ee587d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -353,6 +353,13 @@ static struct ctl_table kern_table[] = {
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_BALANCE_NUMA
 	{
+		.procname	= "balance_numa_scan_delay_ms",
+		.data		= &sysctl_balance_numa_scan_delay,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "balance_numa_scan_period_min_ms",
 		.data		= &sysctl_balance_numa_scan_period_min,
 		.maxlen		= sizeof(unsigned int),
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 23/43] mm: numa: Add pte updates, hinting and migration stats
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (21 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 22/43] mm: sched: numa: Implement slow start for working set sampling Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 24/43] mm: numa: Migrate on reference policy Mel Gorman
                   ` (20 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

It is tricky to quantify the basic cost of automatic NUMA placement in a
meaningful manner. This patch adds some vmstats that can be used as part
of a basic costing model.

u    = basic unit = sizeof(void *)
Ca   = cost of struct page access = sizeof(struct page) / u
Cpte = Cost PTE access = Ca
Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
	where Cpte is incurred twice for a read and a write and Wlock
	is a constant representing the cost of taking or releasing a
	lock
Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
Ci = Cost of page isolation = Ca + Wi
	where Wi is a constant that should reflect the approximate cost
	of the locking operation
Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
	would imply that remote accesses are 20% more expensive

Balancing cost = Cpte * numa_pte_updates +
		Cnumahint * numa_hint_faults +
		Ci * numa_pages_migrated +
		Cpagecopy * numa_pages_migrated

Note that numa_pages_migrated is used as a measure of how many pages
were isolated even though it would miss pages that failed to migrate. A
vmstat counter could have been added for it but the isolation cost is
pretty marginal in comparison to the overall cost so it seemed overkill.
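
As a rough illustration (not part of the patch), the model can be evaluated
with arbitrary weights. The constants below, the 4K page size, the 64-byte
struct page and the sample counter deltas are all assumptions chosen only to
show how the terms combine:

/* cost_model.c: illustrative only -- weights and counter deltas are made up */
#include <stdio.h>

int main(void)
{
	double u = sizeof(void *);		/* basic unit */
	double Ca = 64 / u;			/* struct page access */
	double Cpte = Ca;			/* PTE access */
	double Cnumahint = 1000;		/* minor fault */
	double Wi = 1, Wnuma = 1.2;		/* isolation lock, NUMA factor */
	double Cpagerw = Ca + 4096 / u;		/* read or write a full page */
	double Ci = Ca + Wi;			/* isolate a page */
	double Cpagecopy = Cpagerw + Cpagerw * Wnuma + Ci + Ci * Wnuma;

	/* sample /proc/vmstat deltas over a benchmark run */
	double pte_updates = 1000000, hint_faults = 200000, migrated = 50000;

	printf("balancing cost ~ %.3g units\n",
	       Cpte * pte_updates + Cnumahint * hint_faults +
	       Ci * migrated + Cpagecopy * migrated);
	return 0;
}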

The ideal way to measure automatic placement benefit would be to count
the number of remote accesses versus local accesses and do something like

	benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

but the information is not readily available. As a workload converges, the
expectation would be that the number of remote NUMA hinting faults reduces to 0.

	convergence = numa_hint_faults_local / numa_hint_faults
		where this is measured for the last N number of
		numa hints recorded. When the workload is fully
		converged the value is 1.

This can measure if the placement policy is converging and how fast it is
doing it.
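
For completeness, here is a rough userspace sketch (not part of the patch) of
how the convergence ratio could be sampled once these counters are exported.
Only the counter names come from this patch; the sampling interval and the
rest of the program are illustrative:

/* convergence.c: sample the hinting fault counters twice and report
 * delta(numa_hint_faults_local) / delta(numa_hint_faults). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long read_vmstat(const char *name)
{
	char key[64];
	unsigned long val, ret = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %lu", key, &val) == 2)
		if (!strcmp(key, name))
			ret = val;
	fclose(f);
	return ret;
}

int main(void)
{
	unsigned long f1 = read_vmstat("numa_hint_faults");
	unsigned long l1 = read_vmstat("numa_hint_faults_local");
	unsigned long f2, l2;

	sleep(10);	/* arbitrary sampling window */
	f2 = read_vmstat("numa_hint_faults");
	l2 = read_vmstat("numa_hint_faults_local");

	if (f2 > f1)
		printf("convergence = %.2f\n",
		       (double)(l2 - l1) / (double)(f2 - f1));
	return 0;
}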

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vm_event_item.h |    6 ++++++
 include/linux/vmstat.h        |    8 ++++++++
 mm/huge_memory.c              |    1 +
 mm/memory.c                   |   12 ++++++++++++
 mm/mempolicy.c                |    5 +++++
 mm/migrate.c                  |    3 ++-
 mm/vmstat.c                   |    6 ++++++
 7 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index a1f750b..dded0af 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,6 +38,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_BALANCE_NUMA
+		NUMA_PTE_UPDATES,
+		NUMA_HINT_FAULTS,
+		NUMA_HINT_FAULTS_LOCAL,
+		NUMA_PAGE_MIGRATE,
+#endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
 #endif
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 92a86b2..dffccfa 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -80,6 +80,14 @@ static inline void vm_events_fold_cpu(int cpu)
 
 #endif /* CONFIG_VM_EVENT_COUNTERS */
 
+#ifdef CONFIG_BALANCE_NUMA
+#define count_vm_numa_event(x)     count_vm_event(x)
+#define count_vm_numa_events(x, y) count_vm_events(x, y)
+#else
+#define count_vm_numa_event(x) do {} while (0)
+#define count_vm_numa_events(x, y) do {} while (0)
+#endif /* CONFIG_BALANCE_NUMA */
+
 #define __count_zone_vm_events(item, zone, delta) \
 		__count_vm_events(item##_NORMAL - ZONE_NORMAL + \
 		zone_idx(zone), delta)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ccff412..86a133c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1033,6 +1033,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
+	count_vm_numa_event(NUMA_HINT_FAULTS);
 
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1)
diff --git a/mm/memory.c b/mm/memory.c
index 2d13be92..8795a0a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3473,11 +3473,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pte_same(*ptep, pte)))
 		goto out_unlock;
 
+	count_vm_numa_event(NUMA_HINT_FAULTS);
 	page = vm_normal_page(vma, addr, pte);
 	BUG_ON(!page);
 
 	get_page(page);
 	current_nid = page_to_nid(page);
+	if (current_nid == numa_node_id())
+		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 	target_nid = mpol_misplaced(page, vma, addr);
 	if (target_nid == -1) {
 		/*
@@ -3535,6 +3538,9 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
+	int local_nid = numa_node_id();
+	unsigned long nr_faults = 0;
+	unsigned long nr_faults_local = 0;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3581,10 +3587,16 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		curr_nid = page_to_nid(page);
 		task_numa_fault(curr_nid, 1);
 
+		nr_faults++;
+		if (curr_nid == local_nid)
+			nr_faults_local++;
+
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
+	count_vm_numa_events(NUMA_HINT_FAULTS, nr_faults);
+	count_vm_numa_events(NUMA_HINT_FAULTS_LOCAL, nr_faults_local);
 	return 0;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e1534e3..045714d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -583,6 +583,7 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long _address, end;
 	spinlock_t *ptl;
 	int ret = 0;
+	int nr_pte_updates = 0;
 
 	VM_BUG_ON(address & ~PAGE_MASK);
 
@@ -626,6 +627,7 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
 		ret += HPAGE_PMD_NR;
+		nr_pte_updates++;
 		/* defer TLB flush to lower the overhead */
 		spin_unlock(&mm->page_table_lock);
 		goto out;
@@ -652,6 +654,7 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 			continue;
 
 		set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+		nr_pte_updates++;
 
 		/* defer TLB flush to lower the overhead */
 		ret++;
@@ -666,6 +669,8 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 out:
+	if (nr_pte_updates)
+		count_vm_numa_events(NUMA_PTE_UPDATES, nr_pte_updates);
 	return ret;
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 631b2c5..88b9a7e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1517,7 +1517,8 @@ struct page *migrate_misplaced_page(struct page *page, int node)
 		if (nr_remaining) {
 			putback_lru_pages(&migratepages);
 			req.newpage = NULL;
-		}
+		} else
+			count_vm_numa_event(NUMA_PAGE_MIGRATE);
 	}
 	BUG_ON(!list_empty(&migratepages));
 out:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3a067fa..cfa386da 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -774,6 +774,12 @@ const char * const vmstat_text[] = {
 
 	"pgrotated",
 
+#ifdef CONFIG_BALANCE_NUMA
+	"numa_pte_updates",
+	"numa_hint_faults",
+	"numa_hint_faults_local",
+	"numa_pages_migrated",
+#endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
 	"pgmigrate_fail",
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 24/43] mm: numa: Migrate on reference policy
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (22 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 23/43] mm: numa: Add pte updates, hinting and migration stats Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 25/43] mm: numa: Migrate pages handled during a pmd_numa hinting fault Mel Gorman
                   ` (19 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

This is the simplest possible policy that still does something of note.
When a pte_numa fault is handled, the page is migrated immediately to the
referencing node. Any replacement policy must at least do better than this
and, in all likelihood, this policy regresses normal workloads.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   38 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 16fb4e6..0d11c3d 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -67,6 +67,7 @@ enum mpol_rebind_step {
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
+#define MPOL_F_MORON	(1 << 4) /* Migrate On pte_numa Reference On Node */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 045714d..bcaa4fe 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -118,6 +118,26 @@ static struct mempolicy default_policy = {
 	.flags = MPOL_F_LOCAL,
 };
 
+static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+
+static struct mempolicy *get_task_policy(struct task_struct *p)
+{
+	struct mempolicy *pol = p->mempolicy;
+	int node;
+
+	if (!pol) {
+		node = numa_node_id();
+		if (node != -1)
+			pol = &preferred_node_policy[node];
+
+		/* preferred_node_policy is not initialised early in boot */
+		if (!pol->mode)
+			pol = NULL;
+	}
+
+	return pol;
+}
+
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
 	/*
@@ -1706,7 +1726,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = task->mempolicy;
+	struct mempolicy *pol = get_task_policy(task);
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
@@ -2129,7 +2149,7 @@ retry_cpuset:
  */
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = get_task_policy(current);
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
 
@@ -2403,6 +2423,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	default:
 		BUG();
 	}
+
+	/* Migrate the page towards the node whose CPU is referencing it */
+	if (pol->flags & MPOL_F_MORON)
+		polnid = numa_node_id();
+
 	if (curnid != polnid)
 		ret = polnid;
 out:
@@ -2591,6 +2616,15 @@ void __init numa_policy_init(void)
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL);
 
+	for_each_node(nid) {
+		preferred_node_policy[nid] = (struct mempolicy) {
+			.refcnt = ATOMIC_INIT(1),
+			.mode = MPOL_PREFERRED,
+			.flags = MPOL_F_MOF | MPOL_F_MORON,
+			.v = { .preferred_node = nid, },
+		};
+	}
+
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
 	 * enabled across suitably sized nodes (default is >= 16MB), or
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 25/43] mm: numa: Migrate pages handled during a pmd_numa hinting fault
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (23 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 24/43] mm: numa: Migrate on reference policy Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 26/43] mm: numa: Only mark a PMD pmd_numa if the pages are all on the same node Mel Gorman
                   ` (18 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

To say that the PMD handling code was incorrectly transferred from autonuma
is an understatement. The intention was to handle a PMD's worth of pages
in the same fault and effectively batch the taking of the PTL and the page
migrations. The copied version instead has the effect of clearing a number
of pte_numa PTE entries, and whether any page migration takes place depends
on how the races play out. It just happens to work in some cases.

This patch handles pte_numa faults in batch when a pmd_numa fault is
handled. The pages are migrated if they are currently misplaced.
Essentially this makes the assumption that NUMA locality is on a PMD
boundary, but that could be addressed by only setting pmd_numa if all
the pages within that PMD are on the same node, should it prove
necessary.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/memory.c |   54 +++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 19 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8795a0a..a498e8d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3451,6 +3451,18 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+				unsigned long addr, int current_nid)
+{
+	get_page(page);
+
+	count_vm_numa_event(NUMA_HINT_FAULTS);
+	if (current_nid == numa_node_id())
+		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+
+	return mpol_misplaced(page, vma, addr);
+}
+
 int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
 {
@@ -3473,15 +3485,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pte_same(*ptep, pte)))
 		goto out_unlock;
 
-	count_vm_numa_event(NUMA_HINT_FAULTS);
 	page = vm_normal_page(vma, addr, pte);
 	BUG_ON(!page);
 
-	get_page(page);
 	current_nid = page_to_nid(page);
-	if (current_nid == numa_node_id())
-		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
-	target_nid = mpol_misplaced(page, vma, addr);
+	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	if (target_nid == -1) {
 		/*
 		 * Account for the fault against the current node if it not
@@ -3491,9 +3499,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto clear_pmdnuma;
 	}
 
-	pte_unmap_unlock(ptep, ptl);
-
 	/* Migrate to the requested node */
+	pte_unmap_unlock(ptep, ptl);
 	newpage = migrate_misplaced_page(page, target_nid);
 	if (newpage)
 		current_nid = target_nid;
@@ -3524,7 +3531,8 @@ out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 	if (page)
 		put_page(page);
-	task_numa_fault(current_nid, 1);
+	if (current_nid != -1)
+		task_numa_fault(current_nid, 1);
 	return 0;
 }
 
@@ -3539,8 +3547,6 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
-	unsigned long nr_faults = 0;
-	unsigned long nr_faults_local = 0;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3563,7 +3569,8 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page *page;
-		int curr_nid;
+		int curr_nid = local_nid;
+		int target_nid;
 		if (!pte_present(pteval))
 			continue;
 		if (addr >= vma->vm_end) {
@@ -3582,21 +3589,30 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* only check non-shared pages */
 		if (unlikely(page_mapcount(page) != 1))
 			continue;
-		pte_unmap_unlock(pte, ptl);
 
-		curr_nid = page_to_nid(page);
-		task_numa_fault(curr_nid, 1);
+		/*
+		 * Note that the NUMA fault is later accounted to either
+		 * the node that is currently running or where the page is
+		 * migrated to.
+		 */
+		curr_nid = local_nid;
+		target_nid = numa_migrate_prep(page, vma, addr,
+					       page_to_nid(page));
+		if (target_nid == -1) {
+			put_page(page);
+			continue;
+		}
 
-		nr_faults++;
-		if (curr_nid == local_nid)
-			nr_faults_local++;
+		/* Migrate to the requested node */
+		pte_unmap_unlock(pte, ptl);
+		if (migrate_misplaced_page(page, target_nid))
+			curr_nid = target_nid;
+		task_numa_fault(curr_nid, 1);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
-	count_vm_numa_events(NUMA_HINT_FAULTS, nr_faults);
-	count_vm_numa_events(NUMA_HINT_FAULTS_LOCAL, nr_faults_local);
 	return 0;
 }
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 26/43] mm: numa: Only mark a PMD pmd_numa if the pages are all on the same node
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (24 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 25/43] mm: numa: Migrate pages handled during a pmd_numa hinting fault Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 27/43] mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting Mel Gorman
                   ` (17 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

When a pmd_numa fault is handled, all PTEs are treated as if the current
CPU had referenced them and the whole PMD is handled as one fault. This
effectively batches the taking of the PTL but loses precision. This patch
only sets the PMD pmd_numa if the examined pages are all on the same node.
If the workload has converged on a PMD boundary then the batch handling is
equivalent.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mempolicy.c |   21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bcaa4fe..ca201e9 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -604,6 +604,8 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = 0;
 	int nr_pte_updates = 0;
+	bool all_same_node = true;
+	int last_nid = -1;
 
 	VM_BUG_ON(address & ~PAGE_MASK);
 
@@ -662,6 +664,7 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (_address = address, _pte = pte; _address < end;
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
+		int this_nid;
 		if (!pte_present(pteval))
 			continue;
 		if (pte_numa(pteval))
@@ -669,6 +672,18 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page))
 			continue;
+
+		/*
+		 * Check if all pages within the PMD are on the same node. This
+		 * is an approximation as existing pte_numa pages are not
+		 * examined.
+		 */
+		this_nid = page_to_nid(page);
+		if (last_nid == -1)
+			last_nid = this_nid;
+		if (last_nid != this_nid)
+			all_same_node = false;
+
 		/* only check non-shared pages */
 		if (page_mapcount(page) != 1)
 			continue;
@@ -681,7 +696,11 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	pte_unmap_unlock(pte, ptl);
 
-	if (ret && !pmd_numa(*pmd)) {
+	/*
+	 * If all the pages within the PMD are on the same node then mark
+	 * the PMD so it is handled in one fault when next referenced.
+	 */
+	if (all_same_node && !pmd_numa(*pmd)) {
 		spin_lock(&mm->page_table_lock);
 		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
 		spin_unlock(&mm->page_table_lock);
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 27/43] mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (25 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 26/43] mm: numa: Only mark a PMD pmd_numa if the pages are all on the same node Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 28/43] mm: numa: Rate limit the amount of memory that is migrated between nodes Mel Gorman
                   ` (16 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

This defines the per-node data used by Migrate On Fault in order to
rate limit the migration. The rate limiting is applied independently
to each destination node.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |   13 +++++++++++++
 mm/page_alloc.c        |    5 +++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 50aaca8..abe9fea 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -717,6 +717,19 @@ typedef struct pglist_data {
 	struct task_struct *kswapd;	/* Protected by lock_memory_hotplug() */
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
+#ifdef CONFIG_BALANCE_NUMA
+	/*
+	 * Lock serializing the per destination node AutoNUMA memory
+	 * migration rate limiting data.
+	 */
+	spinlock_t balancenuma_migrate_lock;
+
+	/* Rate limiting time interval */
+	unsigned long balancenuma_migrate_next_window;
+
+	/* Number of pages migrated during the rate limiting time interval */
+	unsigned long balancenuma_migrate_nr_pages;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4681fc4..8827523 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4449,6 +4449,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int ret;
 
 	pgdat_resize_init(pgdat);
+#ifdef CONFIG_BALANCE_NUMA
+	spin_lock_init(&pgdat->balancenuma_migrate_lock);
+	pgdat->balancenuma_migrate_nr_pages = 0;
+	pgdat->balancenuma_migrate_next_window = jiffies;
+#endif
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 	pgdat_page_cgroup_init(pgdat);
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 28/43] mm: numa: Rate limit the amount of memory that is migrated between nodes
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (26 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 27/43] mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 29/43] mm: numa: Rate limit setting of pte_numa if node is saturated Mel Gorman
                   ` (15 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

NOTE: This is very heavily based on similar logic in autonuma. It should
	be signed off by Andrea but, because there was no standalone
	patch and it is sufficiently different from what he did, the
	signed-off-by is omitted. It will be added back if requested.

If a large number of pages are misplaced then the memory bus can be
saturated just migrating pages between nodes. This patch rate-limits
the amount of memory that can be migrated between nodes within a
given window of time.
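
The "1280M per second" figure in the comment below follows directly from the
defaults. A small standalone calculation (assuming 4K pages, which is not
something this patch dictates) shows how:

#include <stdio.h>

int main(void)
{
	unsigned int page_shift = 12;	/* assume 4K pages */
	unsigned int interval_ms = 100;	/* migrate_interval_millisecs */
	unsigned int ratelimit_pages = 128 << (20 - page_shift);
	unsigned int mb_per_window = ratelimit_pages >> (20 - page_shift);

	/* 32768 pages per 100ms window = 1280 MB/s per destination node */
	printf("%u pages per %ums window = %u MB/s per destination node\n",
	       ratelimit_pages, interval_ms,
	       mb_per_window * (1000 / interval_ms));
	return 0;
}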

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c |   30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 88b9a7e..dac5a43 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1464,12 +1464,21 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 }
 
 /*
+ * page migration rate limiting control.
+ * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs
+ * window of time. Default here says do not migrate more than 1280M per second.
+ */
+static unsigned int migrate_interval_millisecs __read_mostly = 100;
+static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
+
+/*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
 struct page *migrate_misplaced_page(struct page *page, int node)
 {
+	pg_data_t *pgdat = NODE_DATA(node);
 	struct misplaced_request req = {
 		.nid = node,
 		.newpage = NULL,
@@ -1484,8 +1493,26 @@ struct page *migrate_misplaced_page(struct page *page, int node)
 	if (page_mapcount(page) != 1)
 		goto out;
 
+	/*
+	 * Rate-limit the amount of data that is being migrated to a node.
+	 * Optimal placement is no good if the memory bus is saturated and
+	 * all the time is being spent migrating!
+	 */
+	spin_lock(&pgdat->balancenuma_migrate_lock);
+	if (time_after(jiffies, pgdat->balancenuma_migrate_next_window)) {
+		pgdat->balancenuma_migrate_nr_pages = 0;
+		pgdat->balancenuma_migrate_next_window = jiffies +
+			msecs_to_jiffies(migrate_interval_millisecs);
+	}
+	if (pgdat->balancenuma_migrate_nr_pages > ratelimit_pages) {
+		spin_unlock(&pgdat->balancenuma_migrate_lock);
+		goto out;
+	}
+	pgdat->balancenuma_migrate_nr_pages++;
+	spin_unlock(&pgdat->balancenuma_migrate_lock);
+
 	/* Avoid migrating to a node that is nearly full */
-	if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+	if (migrate_balanced_pgdat(pgdat, 1)) {
 		int page_lru;
 
 		if (isolate_lru_page(page)) {
@@ -1521,6 +1548,7 @@ struct page *migrate_misplaced_page(struct page *page, int node)
 			count_vm_numa_event(NUMA_PAGE_MIGRATE);
 	}
 	BUG_ON(!list_empty(&migratepages));
+
 out:
 	return req.newpage;
 }
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 29/43] mm: numa: Rate limit setting of pte_numa if node is saturated
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (27 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 28/43] mm: numa: Rate limit the amount of memory that is migrated between nodes Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 30/43] sched: numa: Slowly increase the scanning period as NUMA faults are handled Mel Gorman
                   ` (14 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

If there are a large number of NUMA hinting faults and all of them
result in migrations, it may indicate that memory is just bouncing
uselessly around and that the cost of NUMA balancing exceeds any
benefit from locality. Rate-limit the PTE updates if the node is
migration rate-limited. As noted in the comments, this distorts the
NUMA faulting statistics.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |    6 ++++++
 mm/mempolicy.c          |    9 +++++++++
 mm/migrate.c            |   22 ++++++++++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e5ab5db..08538ac 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -41,6 +41,7 @@ extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
 extern struct page *migrate_misplaced_page(struct page *page, int node);
+extern bool migrate_ratelimited(int node);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
@@ -79,6 +80,11 @@ struct page *migrate_misplaced_page(struct page *page, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline
+bool migrate_ratelimited(int node)
+{
+	return false;
+}
 #endif /* CONFIG_MIGRATION */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ca201e9..7acc97b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -688,6 +688,15 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (page_mapcount(page) != 1)
 			continue;
 
+		/*
+		 * Do not set pte_numa if migrate ratelimited. This
+		 * loses statistics on the fault but if we are
+		 * unwilling to migrate to this node, we cannot do
+		 * useful work anyway.
+		 */
+		if (migrate_ratelimited(page_to_nid(page)))
+			continue;
+
 		set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
 		nr_pte_updates++;
 
diff --git a/mm/migrate.c b/mm/migrate.c
index dac5a43..1654bb7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1467,10 +1467,32 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
  * page migration rate limiting control.
  * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs
  * window of time. Default here says do not migrate more than 1280M per second.
+ * If a node is rate-limited then PTE NUMA updates are also rate-limited. However
+ * as it is faults that reset the window, pte updates will happen unconditionally
+ * if there has not been a fault since @pteupdate_interval_millisecs after the
+ * throttle window closed.
  */
 static unsigned int migrate_interval_millisecs __read_mostly = 100;
+static unsigned int pteupdate_interval_millisecs __read_mostly = 1000;
 static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
 
+#ifdef CONFIG_BALANCE_NUMA
+/* Returns true if NUMA migration is currently rate limited */
+bool migrate_ratelimited(int node)
+{
+	pg_data_t *pgdat = NODE_DATA(node);
+
+	if (time_after(jiffies, pgdat->balancenuma_migrate_next_window +
+				msecs_to_jiffies(pteupdate_interval_millisecs)))
+		return false;
+
+	if (pgdat->balancenuma_migrate_nr_pages < ratelimit_pages)
+		return false;
+
+	return true;
+}
+#endif
+
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 30/43] sched: numa: Slowly increase the scanning period as NUMA faults are handled
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (28 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 29/43] mm: numa: Rate limit setting of pte_numa if node is saturated Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 31/43] mm: numa: Introduce last_nid to the page frame Mel Gorman
                   ` (13 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

Currently the rate of scanning for an address space is controlled
by the individual tasks. The next scan is simply determined by
2*p->numa_scan_period.

The 2*p->numa_scan_period is arbitrary and never changes. At this point
there is still no proper policy that decides if a task or process is
properly placed. It just scans and assumes the next NUMA fault will
place it properly. As it is assumed that pages will get properly placed
over time, increase the scan window each time a fault is incurred. This
is a big assumption as noted in the comments.
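
To put a number on the back-off (purely illustrative, and assuming HZ=1000 so
that jiffies_to_msecs(2) is 2ms), the walk from the 100ms minimum to the
1600ms maximum scan period looks like this:

#include <stdio.h>

int main(void)
{
	unsigned int period = 100;	/* sysctl_balance_numa_scan_period_min */
	unsigned int max = 100 * 16;	/* sysctl_balance_numa_scan_period_max */
	unsigned int step = 2;		/* jiffies_to_msecs(2) with HZ=1000 */
	unsigned long faults = 0;

	while (period < max) {
		period += step;
		if (period > max)
			period = max;
		faults++;
	}
	/* prints ~750 faults with the defaults above */
	printf("~%lu handled faults to back off to the %ums maximum\n",
	       faults, max);
	return 0;
}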

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1bf97b5..14bd61a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -811,6 +811,15 @@ void task_numa_fault(int node, int pages)
 
 	/* FIXME: Allocate task-specific structure for placement policy here */
 
+	/*
+	 * Assume that as faults occur that pages are getting properly placed
+	 * and fewer NUMA hints are required. Note that this is a big
+	 * assumption, it assumes processes reach a steady steady with no
+	 * further phase changes.
+	 */
+	p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
+				p->numa_scan_period + jiffies_to_msecs(2));
+
 	task_numa_placement(p);
 }
 
@@ -857,7 +866,7 @@ void task_numa_work(struct callback_head *work)
 	if (WARN_ON_ONCE(p->numa_scan_period) == 0)
 		p->numa_scan_period = sysctl_balance_numa_scan_period_min;
 
-	next_scan = now + 2*msecs_to_jiffies(p->numa_scan_period);
+	next_scan = now + msecs_to_jiffies(p->numa_scan_period);
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 31/43] mm: numa: Introduce last_nid to the page frame
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (29 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 30/43] sched: numa: Slowly increase the scanning period as NUMA faults are handled Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 32/43] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships Mel Gorman
                   ` (12 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

This patch introduces a last_nid field to the page struct. This is used
to build a two-stage filter in the next patch that is aimed at
mitigating a problem whereby pages migrate to the wrong node when
referenced by a process that was running off its home node.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h       |   30 ++++++++++++++++++++++++++++++
 include/linux/mm_types.h |    4 ++++
 mm/page_alloc.c          |    2 ++
 3 files changed, 36 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 34f8ce9..f290cc9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -693,6 +693,36 @@ static inline int page_to_nid(const struct page *page)
 }
 #endif
 
+#ifdef CONFIG_BALANCE_NUMA
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	return xchg(&page->_last_nid, nid);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return page->_last_nid;
+}
+static inline void reset_page_last_nid(struct page *page)
+{
+	page->_last_nid = -1;
+}
+#else
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	return page_to_nid(page);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return page_to_nid(page);
+}
+
+static inline void reset_page_last_nid(struct page *page)
+{
+}
+#endif
+
 static inline struct zone *page_zone(const struct page *page)
 {
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b40f4ef..6b478ff 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -175,6 +175,10 @@ struct page {
 	 */
 	void *shadow;
 #endif
+
+#ifdef CONFIG_BALANCE_NUMA
+	int _last_nid;
+#endif
 }
 /*
  * The struct page can be forced to be double word aligned so that atomic ops
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8827523..cc1ca7e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -608,6 +608,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
+	reset_page_last_nid(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3826,6 +3827,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		reset_page_mapcount(page);
+		reset_page_last_nid(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 32/43] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (30 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 31/43] mm: numa: Introduce last_nid to the page frame Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 33/43] x86: mm: only do a local tlb flush in ptep_set_access_flags() Mel Gorman
                   ` (11 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

While it is desirable that all threads in a process run on its home
node, this is not always possible or necessary. There may be more
threads than CPUs within the node, or the node might be over-subscribed
with unrelated processes.

This can cause a situation whereby a page gets migrated off its home
node because the threads clearing pte_numa were running off-node. This
patch uses page->last_nid to build a two-stage filter before pages get
migrated to avoid problems with short or unlikely task<->node
relationships.
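
As a worked example of the filter's effect (the 10% share below is made up
purely for illustration): if a task running off the page's node accounts for
only a small fraction of the hinting faults on that page, requiring the same
node to be observed twice in a row roughly squares that fraction:

#include <stdio.h>

int main(void)
{
	double share = 0.10;	/* assumed share of this page's hinting faults */

	printf("single-sample migrate probability: %.2f\n", share);
	printf("two-stage filter probability:      %.4f\n", share * share);
	return 0;
}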

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mempolicy.c |   30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7acc97b..648423a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2453,9 +2453,37 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	}
 
 	/* Migrate the page towards the node whose CPU is referencing it */
-	if (pol->flags & MPOL_F_MORON)
+	if (pol->flags & MPOL_F_MORON) {
+		int last_nid;
+
 		polnid = numa_node_id();
 
+		/*
+		 * Multi-stage node selection is used in conjunction
+		 * with a periodic migration fault to build a temporal
+		 * task<->page relation. By using a two-stage filter we
+		 * remove short/unlikely relations.
+		 *
+		 * Using P(p) ~ n_p / n_t as per frequentist
+		 * probability, we can equate a task's usage of a
+		 * particular page (n_p) per total usage of this
+		 * page (n_t) (in a given time-span) to a probability.
+		 *
+		 * Our periodic faults will sample this probability and
+		 * getting the same result twice in a row, given these
+		 * samples are fully independent, is then given by
+		 * P(n)^2, provided our sample period is sufficiently
+		 * short compared to the usage pattern.
+		 *
+		 * This quadric squishes small probabilities, making
+		 * it less likely we act on an unlikely task<->page
+		 * relation.
+		 */
+		last_nid = page_xchg_last_nid(page, polnid);
+		if (last_nid != polnid)
+			goto out;
+	}
+
 	if (curnid != polnid)
 		ret = polnid;
 out:
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 33/43] x86: mm: only do a local tlb flush in ptep_set_access_flags()
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (31 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 32/43] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 34/43] x86: mm: drop TLB flush from ptep_set_access_flags Mel Gorman
                   ` (10 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

The function ptep_set_access_flags() is only ever invoked to set access
flags or add write permission on a PTE.  The write bit is only ever set
together with the dirty bit.

Because we only ever upgrade a PTE, it is safe to skip flushing entries on
remote TLBs. The worst that can happen is a spurious page fault on other
CPUs, which would flush that TLB entry.

Lazily letting another CPU incur a spurious page fault occasionally is
(much!) cheaper than aggressively flushing everybody else's TLB.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..be3bb46 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long)pgd);
 }
 
+/*
+ * Used to set accessed or dirty bits in the page table entries
+ * on other architectures. On x86, the accessed and dirty bits
+ * are tracked by hardware. However, do_wp_page calls this function
+ * to also make the pte writeable at the same time the dirty bit is
+ * set. In that case we do actually need to write the PTE.
+ */
 int ptep_set_access_flags(struct vm_area_struct *vma,
 			  unsigned long address, pte_t *ptep,
 			  pte_t entry, int dirty)
@@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		flush_tlb_page(vma, address);
+		__flush_tlb_one(address);
 	}
 
 	return changed;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 34/43] x86: mm: drop TLB flush from ptep_set_access_flags
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (32 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 33/43] x86: mm: only do a local tlb flush in ptep_set_access_flags() Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 35/43] mm,generic: only flush the local TLB in ptep_set_access_flags Mel Gorman
                   ` (9 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

Intel has an architectural guarantee that the TLB entry causing
a page fault gets invalidated automatically. This means
we should be able to drop the local TLB invalidation.

Because of the way other areas of the page fault code work,
chances are good that all x86 CPUs do this.  However, if
someone somewhere has an x86 CPU that does not invalidate
the TLB entry causing a page fault, this one-liner should
be easy to revert.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 arch/x86/mm/pgtable.c |    1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index be3bb46..7353de3 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -317,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		__flush_tlb_one(address);
 	}
 
 	return changed;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 35/43] mm,generic: only flush the local TLB in ptep_set_access_flags
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (33 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 34/43] x86: mm: drop TLB flush from ptep_set_access_flags Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 36/43] sched: numa: Introduce tsk_home_node() Mel Gorman
                   ` (8 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

The function ptep_set_access_flags is only ever used to upgrade
access permissions to a page. That means the only negative side
effect of not flushing remote TLBs is that other CPUs may incur
spurious page faults, if they happen to access the same address,
and still have a PTE with the old permissions cached in their
TLB.

Having another CPU maybe incur a spurious page fault is faster
than always incurring the cost of a remote TLB flush, so replace
the remote TLB flush with a purely local one.

This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.

In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault to actually flush the TLB entry.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 mm/pgtable-generic.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 6b6507f..501be39 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -12,8 +12,8 @@
 
 #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 /*
- * Only sets the access flags (dirty, accessed, and
- * writable). Furthermore, we know it always gets set to a "more
+ * Only sets the access flags (dirty, accessed), as well as write 
+ * permission. Furthermore, we know it always gets set to a "more
  * permissive" setting, which allows most architectures to optimize
  * this. We return whether the PTE actually changed, which in turn
  * instructs the caller to do things like update__mmu_cache.  This
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	int changed = !pte_same(*ptep, entry);
 	if (changed) {
 		set_pte_at(vma->vm_mm, address, ptep, entry);
-		flush_tlb_page(vma, address);
+		flush_tlb_fix_spurious_fault(vma, address);
 	}
 	return changed;
 }
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 36/43] sched: numa: Introduce tsk_home_node()
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (34 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 35/43] mm,generic: only flush the local TLB in ptep_set_access_flags Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 37/43] sched: numa: Make find_busiest_queue() a method Mel Gorman
                   ` (7 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Introduce the home-node concept for tasks. In order to keep memory
locality we need to have something to stay local to; we define the
home-node of a task as the node we prefer to allocate memory from and
prefer to execute on.

These are not hard guarantees, merely soft preferences. This allows for
optimal resource usage: we can run a task away from the home-node, as the
remote memory hit -- while expensive -- is less expensive than not
running at all, or running very little, due to severe CPU overload.

Similarly, we can allocate memory from another node if our home-node
is depleted; again, some memory is better than no memory.

This patch merely introduces the basic infrastructure, all policy
comes later.
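
As a minimal sketch of how later policy code might consume this (the helper
below is hypothetical and not part of this series), a home node of -1 simply
means "no preference yet":

/* Hypothetical example: pick a node to allocate from, preferring the
 * task's home node when one has been assigned. */
static int preferred_alloc_node(struct task_struct *p)
{
	int nid = tsk_home_node(p);

	if (nid == -1)			/* no home node assigned yet */
		nid = numa_node_id();	/* fall back to the local node */

	return nid;
}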

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/init_task.h |    8 ++++++++
 include/linux/sched.h     |   10 ++++++++++
 kernel/sched/core.c       |   36 ++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6d087c5..fdf0692 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -143,6 +143,13 @@ extern struct task_group root_task_group;
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_BALANCE_NUMA
+# define INIT_TASK_NUMA(tsk)						\
+	.home_node = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -210,6 +217,7 @@ extern struct task_group root_task_group;
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
 	INIT_CPUSET_SEQ							\
+	INIT_TASK_NUMA(tsk)						\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a2b06ea..b8580f5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1480,6 +1480,7 @@ struct task_struct {
 	short pref_node_fork;
 #endif
 #ifdef CONFIG_BALANCE_NUMA
+	int home_node;
 	int numa_scan_seq;
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
@@ -1569,6 +1570,15 @@ static inline void task_numa_fault(int node, int pages)
 }
 #endif
 
+static inline int tsk_home_node(struct task_struct *p)
+{
+#ifdef CONFIG_BALANCE_NUMA
+	return p->home_node;
+#else
+	return -1;
+#endif
+}
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 047e3c7..55dcf53 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5972,6 +5972,42 @@ static struct sched_domain_topology_level default_topology[] = {
 
 static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+#ifdef CONFIG_BALANCE_NUMA
+
+/*
+ * Requeues a task ensuring its on the right load-balance list so
+ * that it might get migrated to its new home.
+ *
+ * Note that we cannot actively migrate ourselves since our callers
+ * can be from atomic context. We rely on the regular load-balance
+ * mechanisms to move us around -- its all preference anyway.
+ */
+void sched_setnode(struct task_struct *p, int node)
+{
+	unsigned long flags;
+	int on_rq, running;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	p->home_node = node;
+
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
+
+#endif /* CONFIG_BALANCE_NUMA */
+
 #ifdef CONFIG_NUMA
 
 static int sched_domains_numa_levels;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 37/43] sched: numa: Make find_busiest_queue() a method
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (35 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 36/43] sched: numa: Introduce tsk_home_node() Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 38/43] sched: numa: Implement home-node awareness Mel Gorman
                   ` (6 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

It's a bit awkward, but it was the least painful means of modifying the
queue selection. It is used in the next patch to conditionally select a queue.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c |   20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 14bd61a8..af71f94 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3244,6 +3244,9 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+
+	struct rq *		(*find_busiest_queue)(struct lb_env *,
+						      struct sched_group *);
 };
 
 /*
@@ -4417,13 +4420,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
 
 	struct lb_env env = {
-		.sd		= sd,
-		.dst_cpu	= this_cpu,
-		.dst_rq		= this_rq,
-		.dst_grpmask    = sched_group_cpus(sd->groups),
-		.idle		= idle,
-		.loop_break	= sched_nr_migrate_break,
-		.cpus		= cpus,
+		.sd		    = sd,
+		.dst_cpu	    = this_cpu,
+		.dst_rq		    = this_rq,
+		.dst_grpmask        = sched_group_cpus(sd->groups),
+		.idle		    = idle,
+		.loop_break	    = sched_nr_migrate_break,
+		.cpus		    = cpus,
+		.find_busiest_queue = find_busiest_queue,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -4442,7 +4446,7 @@ redo:
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(&env, group);
+	busiest = env.find_busiest_queue(&env, group);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 38/43] sched: numa: Implement home-node awareness
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (36 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 37/43] sched: numa: Make find_busiest_queue() a method Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 39/43] sched: numa: Introduce per-mm and per-task structures Mel Gorman
                   ` (5 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

NOTE: Entirely based on "sched, numa, mm: Implement home-node awareness"
	but only a subset of it. There was stuff in there that was disabled
	by default and generally did slightly more than what I felt was
	necessary at this stage. In particular, the random queue selection
	logic is gone because it looks broken, but that does mean the
	last CPU in a node may see increased scheduling pressure, which
	is almost certainly the wrong thing to do. It needs re-examination.
	Signed-offs were removed as a result but will be re-added if the
	authors are ok with that.

Implement home node preference in the scheduler's load-balancer.

- task_numa_hot(); make it harder to migrate tasks away from their
  home-node, controlled using the NUMA_HOMENODE_PREFERRED feature flag.

- load_balance(); during the regular pull load-balance pass, try
  pulling tasks that are on the wrong node first with a preference of
  moving them nearer to their home-node through task_numa_hot(), controlled
  through the NUMA_PULL feature flag.

- load_balance(); when the balancer finds no imbalance, introduce
  some such that it still prefers to move tasks towards their home-node,
  using active load-balance if needed, controlled through the NUMA_PULL_BIAS
  feature flag.

  In particular, only introduce this BIAS if the system is otherwise properly
  (weight) balanced and we either have an offnode or !numa task to trade
  for it.

In order to easily find off-node tasks, split the per-cpu task list
into two parts.
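
As a rough, compilable userspace model of that split (the home-node and
h_load lookups are stubbed out and the types are heavily trimmed; names
follow the diff rather than the real kernel structures):

	struct list_head { struct list_head *next, *prev; };

	struct rq {
		struct list_head cfs_tasks;	/* tasks on their home node */
		struct list_head offnode_tasks;	/* tasks away from their home node */
		unsigned long onnode_running;
		unsigned long offnode_running;
		unsigned long offnode_weight;
	};

	struct task {
		int home_node;			/* -1 if no home node yet */
		unsigned long numa_contrib;	/* cached load contribution */
	};

	/* stubs for cpu_to_node(task_cpu(p)) and task_h_load(p) */
	static int task_node_stub(struct task *p)		{ return 0; }
	static unsigned long task_h_load_stub(struct task *p)	{ return 1024; }

	/* decide which of the two per-cpu lists the task belongs on */
	static struct list_head *account_numa_enqueue(struct rq *rq, struct task *p)
	{
		struct list_head *tasks = &rq->cfs_tasks;

		if (p->home_node != task_node_stub(p)) {
			p->numa_contrib = task_h_load_stub(p);
			rq->offnode_weight += p->numa_contrib;
			rq->offnode_running++;
			tasks = &rq->offnode_tasks;
		} else {
			rq->onnode_running++;
		}

		return tasks;
	}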

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h   |    3 +
 kernel/sched/core.c     |   14 ++-
 kernel/sched/debug.c    |    3 +
 kernel/sched/fair.c     |  298 +++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/features.h |   18 +++
 kernel/sched/sched.h    |   16 +++
 6 files changed, 324 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b8580f5..1cccfc3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -823,6 +823,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
@@ -1481,6 +1482,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_BALANCE_NUMA
 	int home_node;
+	unsigned long numa_contrib;
 	int numa_scan_seq;
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
@@ -2104,6 +2106,7 @@ extern int sched_setscheduler(struct task_struct *, int,
 			      const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      const struct sched_param *);
+extern void sched_setnode(struct task_struct *p, int node);
 extern struct task_struct *idle_task(int cpu);
 /**
  * is_idle_task - is the specified task an idle task?
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 55dcf53..3d9fc26 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5978,9 +5978,9 @@ static struct sched_domain_topology_level *sched_domain_topology = default_topol
  * Requeues a task ensuring its on the right load-balance list so
  * that it might get migrated to its new home.
  *
- * Note that we cannot actively migrate ourselves since our callers
- * can be from atomic context. We rely on the regular load-balance
- * mechanisms to move us around -- its all preference anyway.
+ * Since the home node is a pure preference there is no hard migration to
+ * force us anywhere; this also allows us to call this from atomic context
+ * if required.
  */
 void sched_setnode(struct task_struct *p, int node)
 {
@@ -6053,6 +6053,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
@@ -6914,7 +6915,12 @@ void __init sched_init(void)
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
-
+#ifdef CONFIG_BALANCE_NUMA
+		INIT_LIST_HEAD(&rq->offnode_tasks);
+		rq->onnode_running = 0;
+		rq->offnode_running = 0;
+		rq->offnode_weight = 0;
+#endif
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
 		rq->nohz_flags = 0;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 6f79596..2474a02 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -132,6 +132,9 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 	SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld",
 		0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
 #endif
+#ifdef CONFIG_BALANCE_NUMA
+	SEQ_printf(m, " %d/%d", p->home_node, cpu_to_node(task_cpu(p)));
+#endif
 #ifdef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, " %s", task_group_path(task_group(p)));
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index af71f94..219158f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -775,6 +775,51 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 /**************************************************
+ * Scheduling class numa methods.
+ */
+
+#ifdef CONFIG_SMP
+static unsigned long task_h_load(struct task_struct *p);
+#endif
+
+#ifdef CONFIG_BALANCE_NUMA
+static struct list_head *account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	struct list_head *tasks = &rq->cfs_tasks;
+
+	if (tsk_home_node(p) != cpu_to_node(task_cpu(p))) {
+		p->numa_contrib = task_h_load(p);
+		rq->offnode_weight += p->numa_contrib;
+		rq->offnode_running++;
+		tasks = &rq->offnode_tasks;
+	} else
+		rq->onnode_running++;
+
+	return tasks;
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	if (tsk_home_node(p) != cpu_to_node(task_cpu(p))) {
+		rq->offnode_weight -= p->numa_contrib;
+		rq->offnode_running--;
+	} else
+		rq->onnode_running--;
+}
+#else
+#ifdef CONFIG_SMP
+static struct list_head *account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	return NULL;
+}
+#endif
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
+/**************************************************
  * Scheduling class queueing methods:
  */
 
@@ -964,9 +1009,17 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
-		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
-#endif
+	if (entity_is_task(se)) {
+		struct rq *rq = rq_of(cfs_rq);
+		struct task_struct *p = task_of(se);
+		struct list_head *tasks = &rq->cfs_tasks;
+
+		if (tsk_home_node(p) != -1)
+			tasks = account_numa_enqueue(rq, p);
+
+		list_add(&se->group_node, tasks);
+	}
+#endif /* CONFIG_SMP */
 	cfs_rq->nr_running++;
 }
 
@@ -976,8 +1029,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
+
 		list_del_init(&se->group_node);
+
+		if (tsk_home_node(p) != -1)
+			account_numa_dequeue(rq_of(cfs_rq), p);
+	}
 	cfs_rq->nr_running--;
 }
 
@@ -3241,6 +3300,8 @@ struct lb_env {
 
 	unsigned int		flags;
 
+	struct list_head	*tasks;
+
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
@@ -3262,10 +3323,32 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 }
 
 /*
+ * Returns true if the task should stay on the current node. The intent is
+ * that a task running on the node identified as its "home node" should
+ * stay there if possible.
+ */
+static bool task_numa_hot(struct task_struct *p, struct lb_env *env)
+{
+	int from_dist, to_dist;
+	int node = tsk_home_node(p);
+
+	if (!sched_feat_numa(NUMA_HOMENODE_PREFERRED) || node == -1)
+		return false; /* no node preference */
+
+	from_dist = node_distance(cpu_to_node(env->src_cpu), node);
+	to_dist = node_distance(cpu_to_node(env->dst_cpu), node);
+
+	if (to_dist < from_dist)
+		return false; /* getting closer is ok */
+
+	return true; /* stick to where we are */
+}
+
+/*
  * Is this task likely cache-hot:
  */
 static int
-task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
+task_hot(struct task_struct *p, struct lb_env *env)
 {
 	s64 delta;
 
@@ -3288,7 +3371,7 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
-	delta = now - p->se.exec_start;
+	delta = env->src_rq->clock_task - p->se.exec_start;
 
 	return delta < (s64)sysctl_sched_migration_cost;
 }
@@ -3345,7 +3428,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 2) too many balance attempts have failed.
 	 */
 
-	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+	tsk_cache_hot = task_hot(p, env);
+	if (env->idle == CPU_NOT_IDLE)
+		tsk_cache_hot |= task_numa_hot(p, env);
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 #ifdef CONFIG_SCHEDSTATS
@@ -3367,15 +3452,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 /*
  * move_one_task tries to move exactly one task from busiest to this_rq, as
  * part of active balancing operations within "domain".
- * Returns 1 if successful and 0 otherwise.
+ * Returns true if successful and false otherwise.
  *
  * Called with both runqueues locked.
  */
-static int move_one_task(struct lb_env *env)
+static bool __move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
-	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+	list_for_each_entry_safe(p, n, env->tasks, se.group_node) {
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
 
@@ -3389,12 +3474,25 @@ static int move_one_task(struct lb_env *env)
 		 * stats here rather than inside move_task().
 		 */
 		schedstat_inc(env->sd, lb_gained[env->idle]);
-		return 1;
+		return true;
 	}
-	return 0;
+	return false;
 }
 
-static unsigned long task_h_load(struct task_struct *p);
+static bool move_one_task(struct lb_env *env)
+{
+	if (sched_feat_numa(NUMA_HOMENODE_PULL)) {
+		env->tasks = offnode_tasks(env->src_rq);
+		if (__move_one_task(env))
+			return true;
+	}
+
+	env->tasks = &env->src_rq->cfs_tasks;
+	if (__move_one_task(env))
+		return true;
+
+	return false;
+}
 
 static const unsigned int sched_nr_migrate_break = 32;
 
@@ -3407,7 +3505,6 @@ static const unsigned int sched_nr_migrate_break = 32;
  */
 static int move_tasks(struct lb_env *env)
 {
-	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	struct task_struct *p;
 	unsigned long load;
 	int pulled = 0;
@@ -3415,8 +3512,9 @@ static int move_tasks(struct lb_env *env)
 	if (env->imbalance <= 0)
 		return 0;
 
-	while (!list_empty(tasks)) {
-		p = list_first_entry(tasks, struct task_struct, se.group_node);
+again:
+	while (!list_empty(env->tasks)) {
+		p = list_first_entry(env->tasks, struct task_struct, se.group_node);
 
 		env->loop++;
 		/* We've more or less seen every task there is, call it quits */
@@ -3427,7 +3525,7 @@ static int move_tasks(struct lb_env *env)
 		if (env->loop > env->loop_break) {
 			env->loop_break += sched_nr_migrate_break;
 			env->flags |= LBF_NEED_BREAK;
-			break;
+			goto out;
 		}
 
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
@@ -3455,7 +3553,7 @@ static int move_tasks(struct lb_env *env)
 		 * the critical section.
 		 */
 		if (env->idle == CPU_NEWLY_IDLE)
-			break;
+			goto out;
 #endif
 
 		/*
@@ -3463,13 +3561,20 @@ static int move_tasks(struct lb_env *env)
 		 * weighted load.
 		 */
 		if (env->imbalance <= 0)
-			break;
+			goto out;
 
 		continue;
 next:
-		list_move_tail(&p->se.group_node, tasks);
+		list_move_tail(&p->se.group_node, env->tasks);
 	}
 
+	if (env->tasks == offnode_tasks(env->src_rq)) {
+		env->tasks = &env->src_rq->cfs_tasks;
+		env->loop = 0;
+		goto again;
+	}
+
+out:
 	/*
 	 * Right now, this is one of only two places move_task() is called,
 	 * so we can safely collect move_task() stats here rather than
@@ -3588,12 +3693,13 @@ static inline void update_shares(int cpu)
 static inline void update_h_load(long cpu)
 {
 }
-
+#ifdef CONFIG_SMP
 static unsigned long task_h_load(struct task_struct *p)
 {
 	return p->se.load.weight;
 }
 #endif
+#endif
 
 /********** Helpers for find_busiest_group ************************/
 /*
@@ -3624,6 +3730,14 @@ struct sd_lb_stats {
 	unsigned int  busiest_group_weight;
 
 	int group_imb; /* Is there imbalance in this sd */
+#ifdef CONFIG_BALANCE_NUMA
+	struct sched_group *numa_group; /* group which has offnode_tasks */
+	unsigned long numa_group_weight;
+	unsigned long numa_group_running;
+
+	unsigned long this_offnode_running;
+	unsigned long this_onnode_running;
+#endif
 };
 
 /*
@@ -3639,6 +3753,11 @@ struct sg_lb_stats {
 	unsigned long group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
+#ifdef CONFIG_BALANCE_NUMA
+	unsigned long numa_offnode_weight;
+	unsigned long numa_offnode_running;
+	unsigned long numa_onnode_running;
+#endif
 };
 
 /**
@@ -3667,6 +3786,121 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 	return load_idx;
 }
 
+#ifdef CONFIG_BALANCE_NUMA
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+	sgs->numa_offnode_weight += rq->offnode_weight;
+	sgs->numa_offnode_running += rq->offnode_running;
+	sgs->numa_onnode_running += rq->onnode_running;
+}
+
+/*
+ * Since the offnode lists are indiscriminate (they contain tasks for all other
+ * nodes) it is impossible to say if there's any task on there that wants to
+ * move towards the pulling cpu. Therefore select a random offnode list to pull
+ * from such that eventually we'll try them all.
+ *
+ * Select a random group that has offnode tasks as sds->numa_group
+ */
+static inline void update_sd_numa_stats(struct sched_domain *sd,
+		struct sched_group *group, struct sd_lb_stats *sds,
+		int local_group, struct sg_lb_stats *sgs)
+{
+	if (!(sd->flags & SD_NUMA))
+		return;
+
+	if (local_group) {
+		sds->this_offnode_running = sgs->numa_offnode_running;
+		sds->this_onnode_running  = sgs->numa_onnode_running;
+		return;
+	}
+
+	if (!sgs->numa_offnode_running)
+		return;
+
+	if (!sds->numa_group) {
+		sds->numa_group = group;
+		sds->numa_group_weight = sgs->numa_offnode_weight;
+		sds->numa_group_running = sgs->numa_offnode_running;
+	}
+}
+
+/*
+ * Pick a random queue from the group that has offnode tasks.
+ */
+static struct rq *find_busiest_numa_queue(struct lb_env *env,
+					  struct sched_group *group)
+{
+	struct rq *busiest = NULL, *rq;
+	int cpu;
+
+	for_each_cpu_and(cpu, sched_group_cpus(group), env->cpus) {
+		rq = cpu_rq(cpu);
+		if (!rq->offnode_running)
+			continue;
+		if (!busiest)
+			busiest = rq;
+	}
+
+	return busiest;
+}
+
+/*
+ * Called in case of no other imbalance. Returns true if there is a queue
+ * running offnode tasks which pretends we are imbalanced anyway to nudge these
+ * tasks towards their home node.
+ */
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	if (!sched_feat(NUMA_HOMENODE_PULL_BIAS))
+		return false;
+
+	if (!sds->numa_group)
+		return false;
+
+	/*
+	 * Only pull an offnode task home if we've got offnode or !numa tasks to trade for it.
+	 */
+	if (!sds->this_offnode_running &&
+	    !(sds->this_nr_running - sds->this_onnode_running - sds->this_offnode_running))
+		return false;
+
+	env->imbalance = sds->numa_group_weight / sds->numa_group_running;
+	sds->busiest = sds->numa_group;
+	env->find_busiest_queue = find_busiest_numa_queue;
+	return true;
+}
+
+static inline bool need_active_numa_balance(struct lb_env *env)
+{
+	return env->find_busiest_queue == find_busiest_numa_queue &&
+			env->src_rq->offnode_running == 1 &&
+			env->src_rq->nr_running == 1;
+}
+
+#else /* CONFIG_BALANCE_NUMA */
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+}
+
+static inline void update_sd_numa_stats(struct sched_domain *sd,
+		struct sched_group *group, struct sd_lb_stats *sds,
+		int local_group, struct sg_lb_stats *sgs)
+{
+}
+
+static inline bool check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	return false;
+}
+
+static inline bool need_active_numa_balance(struct lb_env *env)
+{
+	return false;
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
 	return SCHED_POWER_SCALE;
@@ -3882,6 +4116,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
+
+		update_sg_numa_stats(sgs, rq);
 	}
 
 	/*
@@ -4035,6 +4271,8 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->group_imb = sgs.group_imb;
 		}
 
+		update_sd_numa_stats(env->sd, sg, sds, local_group, &sgs);
+
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -4265,7 +4503,7 @@ find_busiest_group(struct lb_env *env, int *balance)
 
 	/* There is no busy sibling group to pull tasks from */
 	if (!sds.busiest || sds.busiest_nr_running == 0)
-		goto out_balanced;
+		goto ret;
 
 	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
 
@@ -4287,14 +4525,14 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 * don't try and pull any tasks.
 	 */
 	if (sds.this_load >= sds.max_load)
-		goto out_balanced;
+		goto ret;
 
 	/*
 	 * Don't pull any tasks if this group is already above the domain
 	 * average load.
 	 */
 	if (sds.this_load >= sds.avg_load)
-		goto out_balanced;
+		goto ret;
 
 	if (env->idle == CPU_IDLE) {
 		/*
@@ -4321,6 +4559,9 @@ force_balance:
 	return sds.busiest;
 
 out_balanced:
+	if (check_numa_busiest_group(env, &sds))
+		return sds.busiest;
+
 ret:
 	env->imbalance = 0;
 	return NULL;
@@ -4399,6 +4640,9 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
+	if (need_active_numa_balance(env))
+		return 1;
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
 
@@ -4451,6 +4695,8 @@ redo:
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
 	}
+	env.src_rq  = busiest;
+	env.src_cpu = busiest->cpu;
 
 	BUG_ON(busiest == env.dst_rq);
 
@@ -4469,6 +4715,10 @@ redo:
 		env.src_cpu   = busiest->cpu;
 		env.src_rq    = busiest;
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
+		if (sched_feat_numa(NUMA_HOMENODE_PULL))
+			env.tasks = offnode_tasks(busiest);
+		else
+			env.tasks = &busiest->cfs_tasks;
 
 		update_h_load(env.src_cpu);
 more_balance:
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7cfd289..4ae02cb 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -67,4 +67,22 @@ SCHED_FEAT(LB_MIN, false)
  */
 #ifdef CONFIG_BALANCE_NUMA
 SCHED_FEAT(NUMA,	true)
+
+/* Keep tasks running on their home node if possible */
+SCHED_FEAT(NUMA_HOMENODE_PREFERRED, true)
+
+/*
+ * During the regular pull load-balance pass, try pulling tasks that are
+ * running off their home node first with a preference to moving them
+ * nearer their home node through task_numa_hot.
+ */
+SCHED_FEAT(NUMA_HOMENODE_PULL, true)
+
+/*
+ * When the balancer finds no imbalance, introduce some such that it
+ * still prefers to move tasks towards their home node, using active
+ * load-balance if needed.
+ */
+SCHED_FEAT(NUMA_HOMENODE_PULL_BIAS, true)
+
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a43241..3f0e5a1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -418,6 +418,13 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+#ifdef CONFIG_BALANCE_NUMA
+	unsigned long    onnode_running;
+	unsigned long    offnode_running;
+	unsigned long	 offnode_weight;
+	struct list_head offnode_tasks;
+#endif
+
 	u64 rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
@@ -469,6 +476,15 @@ struct rq {
 #endif
 };
 
+static inline struct list_head *offnode_tasks(struct rq *rq)
+{
+#ifdef CONFIG_BALANCE_NUMA
+	return &rq->offnode_tasks;
+#else
+	return NULL;
+#endif
+}
+
 static inline int cpu_of(struct rq *rq)
 {
 #ifdef CONFIG_SMP
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 39/43] sched: numa: Introduce per-mm and per-task structures
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (37 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 38/43] sched: numa: Implement home-node awareness Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 40/43] sched: numa: CPU follows memory Mel Gorman
                   ` (4 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

NOTE: This is heavily based on "autonuma: CPU follows memory algorithm"
	and "autonuma: mm_autonuma and task_autonuma data structures"

At the most basic level, any placement policy is going to make some
sort of smart decision based on per-mm and per-task statistics. This
patch simply introduces the structures with basic fault statistics
that can be expanded upon or replaced later. It may be that a placement
policy can approximate the statistics without needing both structures,
in which case the redundant structure can be safely deleted later while
still having a comparison point to ensure the approximation is accurate.
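
Because the per-nid counters are sized by nr_node_ids at runtime, both
structures end in a flexible array and the allocation adds one counter per
node. A minimal userspace model of that allocation pattern, with calloc
standing in for kzalloc and nr_node_ids assumed to be a plain variable:

	#include <stdlib.h>

	static int nr_node_ids = 4;		/* stand-in for the kernel's value */

	struct task_balancenuma {
		unsigned long task_numa_fault_tot;
		unsigned long task_numa_fault[];	/* one counter per nid */
	};

	static struct task_balancenuma *alloc_task_balancenuma(void)
	{
		size_t size = sizeof(struct task_balancenuma) +
			      sizeof(unsigned long) * nr_node_ids;

		/* kzalloc(size, GFP_KERNEL) in the patch; calloc zeroes too */
		return calloc(1, size);
	}

	static void record_fault(struct task_balancenuma *tb, int node)
	{
		tb->task_numa_fault_tot++;
		tb->task_numa_fault[node]++;
	}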

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h |   26 ++++++++++++++++++++++++++
 include/linux/sched.h    |   18 ++++++++++++++++++
 kernel/fork.c            |   18 ++++++++++++++++++
 kernel/sched/core.c      |    3 +++
 kernel/sched/fair.c      |   25 ++++++++++++++++++++++++-
 kernel/sched/sched.h     |   14 ++++++++++++++
 6 files changed, 103 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6b478ff..9588a91 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -312,6 +312,29 @@ struct mm_rss_stat {
 	atomic_long_t count[NR_MM_COUNTERS];
 };
 
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * Per-mm structure that contains the NUMA memory placement statistics
+ * generated by pte_numa faults.
+ */
+struct mm_balancenuma {
+	/*
+	 * Number of pages that will trigger NUMA faults for this mm. Total
+	 * decays each time whether the home node should change to keep
+	 * track only of recent events
+	 */
+	unsigned long mm_numa_fault_tot;
+
+	/*
+	 * Number of pages that will trigger NUMA faults for each [nid].
+	 * Also decays.
+	 */
+	unsigned long mm_numa_fault[0];
+
+	/* do not add more variables here, the above array size is dynamic */
+};
+#endif /* CONFIG_BALANCE_NUMA */
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -415,6 +438,9 @@ struct mm_struct {
 
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
+
+	/* this is used by the scheduler and the page allocator */
+	struct mm_balancenuma *mm_balancenuma;
 #endif
 	struct uprobes_state uprobes_state;
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1cccfc3..7b6625a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1188,6 +1188,23 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * Per-task structure that contains the NUMA memory placement statistics
+ * generated by pte_numa faults. This structure is dynamically allocated
+ * when the first pte_numa fault is handled.
+ */
+struct task_balancenuma {
+	/* Total number of eligible pages that triggered NUMA faults */
+	unsigned long task_numa_fault_tot;
+
+	/* Number of pages that triggered NUMA faults for each [nid] */
+	unsigned long task_numa_fault[0];
+
+	/* do not add more variables here, the above array size is dynamic */
+};
+#endif /* CONFIG_BALANCE_NUMA */
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1488,6 +1505,7 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+	struct task_balancenuma *task_balancenuma;
 #endif /* CONFIG_BALANCE_NUMA */
 
 	struct rcu_head rcu;
diff --git a/kernel/fork.c b/kernel/fork.c
index 8b20ab7..c8752f6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -525,6 +525,20 @@ static void mm_init_aio(struct mm_struct *mm)
 #endif
 }
 
+#ifdef CONFIG_BALANCE_NUMA
+static inline void free_mm_balancenuma(struct mm_struct *mm)
+{
+	if (mm->mm_balancenuma)
+		kfree(mm->mm_balancenuma);
+
+	mm->mm_balancenuma = NULL;
+}
+#else
+static inline void free_mm_balancenuma(struct mm_struct *mm)
+{
+}
+#endif
+
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 {
 	atomic_set(&mm->mm_users, 1);
@@ -539,6 +553,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	spin_lock_init(&mm->page_table_lock);
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
+	mm->mm_balancenuma = NULL;
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 
@@ -548,6 +563,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 		return mm;
 	}
 
+	free_mm_balancenuma(mm);
 	free_mm(mm);
 	return NULL;
 }
@@ -597,6 +613,7 @@ void __mmdrop(struct mm_struct *mm)
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
+	free_mm_balancenuma(mm);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -854,6 +871,7 @@ fail_nocontext:
 	 * If init_new_context() failed, we cannot use mmput() to free the mm
 	 * because it calls destroy_context()
 	 */
+	free_mm_balancenuma(mm);
 	mm_free_pgd(mm);
 	free_mm(mm);
 	return NULL;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3d9fc26..9472d5d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1543,6 +1543,7 @@ static void __sched_fork(struct task_struct *p)
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->task_balancenuma = NULL;
 	p->numa_scan_period = sysctl_balance_numa_scan_delay;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_BALANCE_NUMA */
@@ -1787,6 +1788,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		free_task_balancenuma(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 219158f..98c621c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -854,7 +854,30 @@ void task_numa_fault(int node, int pages)
 {
 	struct task_struct *p = current;
 
-	/* FIXME: Allocate task-specific structure for placement policy here */
+	if (!p->task_balancenuma) {
+		int size = sizeof(struct task_balancenuma) +
+				(sizeof(unsigned long) * nr_node_ids);
+		p->task_balancenuma = kzalloc(size, GFP_KERNEL);
+		if (!p->task_balancenuma)
+			return;
+	}
+
+	if (!p->mm->mm_balancenuma) {
+		int size = sizeof(struct mm_balancenuma) +
+				(sizeof(unsigned long) * nr_node_ids);
+		p->mm->mm_balancenuma = kzalloc(size, GFP_KERNEL);
+		if (!p->mm->mm_balancenuma) {
+			kfree(p->task_balancenuma);
+			p->task_balancenuma = NULL;
+			return;
+		}
+	}
+
+	/* Record fault statistics */
+	p->task_balancenuma->task_numa_fault_tot++;
+	p->task_balancenuma->task_numa_fault[node]++;
+	p->mm->mm_balancenuma->mm_numa_fault_tot++;
+	p->mm->mm_balancenuma->mm_numa_fault[node]++;
 
 	/*
 	 * Assume that as faults occur that pages are getting properly placed
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3f0e5a1..92df3d4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -502,6 +502,20 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+
+#ifdef CONFIG_BALANCE_NUMA
+static inline void free_task_balancenuma(struct task_struct *p)
+{
+	if (p->task_balancenuma)
+		kfree(p->task_balancenuma);
+	p->task_balancenuma = NULL;
+}
+#else
+static inline void free_task_balancenuma(struct task_struct *p)
+{
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 40/43] sched: numa: CPU follows memory
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (38 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 39/43] sched: numa: Introduce per-mm and per-task structures Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 41/43] sched: numa: Rename mempolicy to HOME Mel Gorman
                   ` (3 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

NOTE: This is heavily based on "autonuma: CPU follows memory algorithm"
	and "autonuma: mm_autonuma and task_autonuma data structures",
	with bits taken from both but reworked to fit within the scheduler
	hooks and home node mechanism as defined by schednuma.

This patch uses the per-mm and per-task fault statistics, tracked in
total and on a per-nid basis, to check on each NUMA fault whether the
system would benefit if the current task was migrated to another node.
If the task should be migrated, its home node is updated and the task
is requeued.
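
The core of the decision is plain integer arithmetic over the fault
counters. The fragment below restates it outside the kernel;
BALANCENUMA_SCALE and balancenuma_weight() follow the patch, while the
helper wrapping the comparison is invented purely for illustration:

	#define BALANCENUMA_SCALE 1000

	/* share of recent NUMA faults that hit this nid, in parts per thousand */
	static unsigned long balancenuma_weight(unsigned long nid_faults,
						unsigned long total_faults)
	{
		if (nid_faults > total_faults)
			nid_faults = total_faults;

		return nid_faults * BALANCENUMA_SCALE / total_faults;
	}

	/*
	 * A swap is only attractive if this task prefers the remote node more
	 * than its current node (this_diff > 0) and more than the task already
	 * running there prefers it (other_diff > 0).
	 */
	static int swap_improves_placement(long this_weight, long other_weight,
					   long p_weight)
	{
		long other_diff = this_weight - other_weight;
		long this_diff  = this_weight - p_weight;

		return other_diff > 0 && this_diff > 0;
	}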

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |    1 -
 kernel/sched/fair.c   |  228 ++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 226 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7b6625a..269ff7d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2040,7 +2040,6 @@ extern unsigned int sysctl_balance_numa_scan_delay;
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
 extern unsigned int sysctl_balance_numa_scan_size;
-extern unsigned int sysctl_balance_numa_settle_count;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98c621c..0f63743 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -836,15 +836,229 @@ unsigned int sysctl_balance_numa_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_balance_numa_scan_delay = 1000;
 
+#define BALANCENUMA_SCALE 1000
+static inline unsigned long balancenuma_weight(unsigned long nid_faults,
+					       unsigned long total_faults)
+{
+	if (nid_faults > total_faults)
+		nid_faults = total_faults;
+
+	return nid_faults * BALANCENUMA_SCALE / total_faults;
+}
+
+static inline unsigned long balancenuma_task_weight(struct task_struct *p,
+							int nid)
+{
+	struct task_balancenuma *task_balancenuma = p->task_balancenuma;
+	unsigned long nid_faults, total_faults;
+
+	nid_faults = task_balancenuma->task_numa_fault[nid];
+	total_faults = task_balancenuma->task_numa_fault_tot;
+	return balancenuma_weight(nid_faults, total_faults);
+}
+
+static inline unsigned long balancenuma_mm_weight(struct task_struct *p,
+							int nid)
+{
+	struct mm_balancenuma *mm_balancenuma = p->mm->mm_balancenuma;
+	unsigned long nid_faults, total_faults;
+
+	nid_faults = mm_balancenuma->mm_numa_fault[nid];
+	total_faults = mm_balancenuma->mm_numa_fault_tot;
+
+	/* It's possible for total_faults to decay to 0 in parallel so check */
+	return total_faults ? balancenuma_weight(nid_faults, total_faults) : 0;
+}
+
+/*
+ * Examine all other nodes, checking the tasks running there, to see if
+ * there would be fewer remote NUMA faults if two tasks swapped home nodes.
+ */
+static void task_numa_find_placement(struct task_struct *p)
+{
+	struct cpumask *allowed = tsk_cpus_allowed(p);
+	int this_cpu = smp_processor_id();
+	int this_nid = numa_node_id();
+	long p_task_weight, p_mm_weight;
+	long weight_diff_max = 0;
+	struct task_struct *selected_task = NULL;
+	int selected_nid = -1;
+	int nid;
+
+	p_task_weight = balancenuma_task_weight(p, this_nid);
+	p_mm_weight = balancenuma_mm_weight(p, this_nid);
+
+	/* Examine a task on every other node */
+	for_each_online_node(nid) {
+		int cpu;
+		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
+			struct rq *rq;
+			struct mm_struct *other_mm;
+			struct task_struct *other_task;
+			long this_weight, other_weight, p_weight;
+			long other_diff, this_diff;
+
+			if (!cpu_online(cpu) || idle_cpu(cpu))
+				continue;
+
+			/* Racy check if a task is running on the other rq */
+			rq = cpu_rq(cpu);
+			other_mm = rq->curr->mm;
+			if (!other_mm || !other_mm->mm_balancenuma)
+				continue;
+
+			/* Effectively pin the other task to get fault stats */
+			raw_spin_lock_irq(&rq->lock);
+			other_task = rq->curr;
+			other_mm = other_task->mm;
+
+			/* Ensure the other task has usable stats */
+			if (!other_task->task_balancenuma ||
+			    !other_task->task_balancenuma->task_numa_fault_tot ||
+			    !other_mm ||
+			    !other_mm->mm_balancenuma ||
+			    !other_mm->mm_balancenuma->mm_numa_fault_tot) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+
+			/* Ensure the other task can be swapped */
+			if (!cpumask_test_cpu(this_cpu,
+					      tsk_cpus_allowed(other_task))) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+
+			/*
+			 * Read the fault statistics. If the remote task is a
+			 * thread in the process then use the task statistics.
+			 * Otherwise use the per-mm statistics.
+			 */
+			if (other_mm == p->mm) {
+				this_weight = balancenuma_task_weight(p, nid);
+				other_weight = balancenuma_task_weight(other_task, nid);
+				p_weight = p_task_weight;
+			} else {
+				this_weight = balancenuma_mm_weight(p, nid);
+				other_weight = balancenuma_mm_weight(other_task, nid);
+				p_weight = p_mm_weight;
+			}
+
+			raw_spin_unlock_irq(&rq->lock);
+
+			/*
+			 * other_diff: How much does the current task prefer to
+			 * run on the remote node than the task that is
+			 * currently running there?
+			 */
+			other_diff = this_weight - other_weight;
+
+			/*
+			 * this_diff: How much does the current task prefer to
+			 * run on the remote NUMA node compared to the current
+			 * node?
+			 */
+			this_diff = this_weight - p_weight;
+
+			/*
+			 * Would swapping the tasks reduce the overall
+			 * cross-node NUMA faults?
+			 */
+			if (other_diff > 0 && this_diff > 0) {
+				long weight_diff = other_diff + this_diff;
+
+				/* Remember the best candidate. */
+				if (weight_diff > weight_diff_max) {
+					weight_diff_max = weight_diff;
+					selected_nid = nid;
+					selected_task = other_task;
+				}
+			}
+		}
+	}
+
+	/* Swap the task on the selected target node */
+	if (selected_nid != -1) {
+		sched_setnode(p, selected_nid);
+		sched_setnode(selected_task, this_nid);
+	}
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
+	unsigned long task_total, mm_total;
+	struct mm_balancenuma *mm_balancenuma;
+	struct task_balancenuma *task_balancenuma;
+	unsigned long mm_max_weight, task_max_weight;
+	int this_nid, nid, mm_selected_nid, task_selected_nid;
+
 	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
 
-	/* FIXME: Scheduling placement policy hints go here */
+	this_nid = numa_node_id();
+	mm_balancenuma = p->mm->mm_balancenuma;
+	task_balancenuma = p->task_balancenuma;
+
+	/* If the task has no NUMA hinting page faults, use current nid */
+	mm_total = ACCESS_ONCE(mm_balancenuma->mm_numa_fault_tot);
+	if (!mm_total)
+		return;
+	task_total = task_balancenuma->task_numa_fault_tot;
+	if (!task_total)
+		return;
+
+	/*
+	 * Identify the NUMA node where this thread (task_struct), and
+	 * the process (mm_struct) as a whole, has the largest number
+	 * of NUMA faults
+	 */
+	mm_selected_nid = task_selected_nid = -1;
+	mm_max_weight = task_max_weight = 0;
+	for_each_online_node(nid) {
+		unsigned long mm_nid_fault, task_nid_fault;
+		unsigned long mm_numa_weight, task_numa_weight;
+
+		/* Read the number of task and mm faults on node */
+		mm_nid_fault = ACCESS_ONCE(mm_balancenuma->mm_numa_fault[nid]);
+		task_nid_fault = task_balancenuma->task_numa_fault[nid];
+
+		/*
+		 * The weights are the relative number of pte_numa faults that
+		 * were handled on this node in comparison to all pte_numa faults
+		 * overall
+		 */
+		mm_numa_weight = balancenuma_weight(mm_nid_fault, mm_total);
+		task_numa_weight = balancenuma_weight(task_nid_fault, task_total);
+		if (mm_numa_weight > mm_max_weight) {
+			mm_max_weight = mm_numa_weight;
+			mm_selected_nid = nid;
+		}
+		if (task_numa_weight > task_max_weight) {
+			task_max_weight = task_numa_weight;
+			task_selected_nid = nid;
+		}
+
+		/* Decay the stats by a factor of 2 */
+		p->mm->mm_balancenuma->mm_numa_fault[nid] >>= 1;
+	}
+
+	/*
+	 * If this NUMA node is the selected one based on process
+	 * memory and task NUMA faults then set the home node.
+	 * There should be no need to requeue the task.
+	 */
+	if (task_selected_nid == this_nid && mm_selected_nid == this_nid) {
+		p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
+					  p->numa_scan_period * 2);
+		p->home_node = this_nid;
+		return;
+	}
+
+	p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+	task_numa_find_placement(p);
 }
 
 /*
@@ -895,6 +1109,16 @@ static void reset_ptenuma_scan(struct task_struct *p)
 {
 	ACCESS_ONCE(p->mm->numa_scan_seq)++;
 	p->mm->numa_scan_offset = 0;
+	
+	if (p->mm && p->mm->mm_balancenuma)
+		p->mm->mm_balancenuma->mm_numa_fault_tot >>= 1;
+	if (p->task_balancenuma) {
+		int nid;
+		p->task_balancenuma->task_numa_fault_tot >>= 1;
+		for_each_online_node(nid) {
+			p->task_balancenuma->task_numa_fault[nid] >>= 1;
+		}
+	}
 }
 
 /*
@@ -976,7 +1200,7 @@ out:
 	 * It is possible to reach the end of the VMA list but the last few VMAs are
 	 * not guaranteed to the vma_migratable. If they are not, we would find the
 	 * !migratable VMA on the next scan but not reset the scanner to the start
-	 * so check it now.
+	 * so we must check it now.
 	 */
 	if (vma)
 		mm->numa_scan_offset = start;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 41/43] sched: numa: Rename mempolicy to HOME
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (39 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 40/43] sched: numa: CPU follows memory Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 42/43] sched: numa: Consider only one CPU per node for CPU-follows-memory Mel Gorman
                   ` (2 subsequent siblings)
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

Rename the policy to reflect that, while allocations and migrations are
based on the referencing node, the home node is taken into account for
migration decisions.
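
For reference, the flag is just another bit in the mempolicy flags word and
is combined with MPOL_F_MOF for the default per-node policies. A small
restatement with the values copied from the diff; the helper is made up
for illustration:

	#define MPOL_F_MOF	(1 << 3)	/* policy wants migrate on fault */
	#define MPOL_F_HOME	(1 << 4)	/* migrate towards referencing node */

	/* illustrative helper, not in the patch */
	static int wants_home_node_migration(unsigned short flags)
	{
		/* numa_policy_init() sets both bits on the per-node policies */
		return (flags & (MPOL_F_MOF | MPOL_F_HOME)) ==
		       (MPOL_F_MOF | MPOL_F_HOME);
	}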

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    9 ++++++++-
 mm/mempolicy.c                 |    9 ++++++---
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 0d11c3d..4506772 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -67,7 +67,14 @@ enum mpol_rebind_step {
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
-#define MPOL_F_MORON	(1 << 4) /* Migrate On pte_numa Reference On Node */
+#define MPOL_F_HOME	(1 << 4) /*
+				  * Migrate towards referencing node.
+				  * By building up stats on faults, the
+				  * scheduler will reinforce the choice
+				  * by identifying a home node and
+				  * queueing the task on that node
+				  * where possible.
+				  */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 648423a..c6f85eb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2452,8 +2452,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		BUG();
 	}
 
-	/* Migrate the page towards the node whose CPU is referencing it */
-	if (pol->flags & MPOL_F_MORON) {
+	/*
+	 * Migrate pages towards their referencing node. Based on the fault
+	 * statistics a home node will be chosen by the scheduler
+	 */
+	if (pol->flags & MPOL_F_HOME) {
 		int last_nid;
 
 		polnid = numa_node_id();
@@ -2676,7 +2679,7 @@ void __init numa_policy_init(void)
 		preferred_node_policy[nid] = (struct mempolicy) {
 			.refcnt = ATOMIC_INIT(1),
 			.mode = MPOL_PREFERRED,
-			.flags = MPOL_F_MOF | MPOL_F_MORON,
+			.flags = MPOL_F_MOF | MPOL_F_HOME,
 			.v = { .preferred_node = nid, },
 		};
 	}
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 42/43] sched: numa: Consider only one CPU per node for CPU-follows-memory
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (40 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 41/43] sched: numa: Rename mempolicy to HOME Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 11:22 ` [PATCH 43/43] sched: numa: Increase and decrease a tasks scanning period based on task fault statistics Mel Gorman
  2012-11-16 14:56 ` [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

The implementation of CPU follows memory was intended to reflect
the considerations made by autonuma on the basis that it had the
best performance figures at the time of writing. However, a major
criticism was the use of kernel threads and the impact of the
cost of the load balancer paths. As a consequence, the cpu follows
memory algorithm moved to the task_numa_work() path where it would
be incurred directly by the process. Unfortunately, it's still very
heavy; it's just much easier to measure now.

This patch attempts to reduce the cost of the path. Only one CPU
per node is considered for tasks to swap. If there is a task running
on that CPU, the calculations will determine if the system would be
better overall if the tasks were swapped. If the CPU is idle, the check
is whether running on that node would be better than staying on the
current node.
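
A rough userspace model of the reduced scan, assuming one candidate CPU per
node and weights that are already known: an idle CPU competes with weight
zero, and only one candidate per node is examined.

	struct candidate {
		int cpu;			/* -1 if the node has no allowed CPU */
		int idle;
		long curr_task_weight;		/* weight of the task running there */
	};

	static int pick_target_node(const struct candidate *cand, int nr_nodes,
				    const long *my_weight_on, long my_weight_here)
	{
		long best_diff = 0;
		int best_nid = -1;
		int nid;

		for (nid = 0; nid < nr_nodes; nid++) {
			long other, other_diff, this_diff;

			if (cand[nid].cpu < 0)
				continue;

			/* an idle CPU only has to beat staying where we are */
			other = cand[nid].idle ? 0 : cand[nid].curr_task_weight;

			other_diff = my_weight_on[nid] - other;
			this_diff  = my_weight_on[nid] - my_weight_here;

			if (other_diff > 0 && this_diff > 0 &&
			    other_diff + this_diff > best_diff) {
				best_diff = other_diff + this_diff;
				best_nid  = nid;
			}
			/* examine just one candidate per node */
		}

		return best_nid;
	}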

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c |   21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f63743..6d2ccd3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -898,9 +898,18 @@ static void task_numa_find_placement(struct task_struct *p)
 			long this_weight, other_weight, p_weight;
 			long other_diff, this_diff;
 
-			if (!cpu_online(cpu) || idle_cpu(cpu))
+			if (!cpu_online(cpu))
 				continue;
 
+			/* Idle CPU, consider running this task on that node */
+			if (idle_cpu(cpu)) {
+				this_weight = balancenuma_task_weight(p, nid);
+				other_weight = 0;
+				other_task = NULL;
+				p_weight = p_task_weight;
+				goto compare_other;
+			}
+
 			/* Racy check if a task is running on the other rq */
 			rq = cpu_rq(cpu);
 			other_mm = rq->curr->mm;
@@ -946,6 +955,7 @@ static void task_numa_find_placement(struct task_struct *p)
 
 			raw_spin_unlock_irq(&rq->lock);
 
+compare_other:
 			/*
 			 * other_diff: How much does the current task prefer to
 			 * run on the remote node than the task that is
@@ -974,13 +984,20 @@ static void task_numa_find_placement(struct task_struct *p)
 					selected_task = other_task;
 				}
 			}
+
+			/*
+			 * Examine just one task per node. Examining all tasks
+			 * disrupts the system excessively.
+			 */
+			break;
 		}
 	}
 
 	/* Swap the task on the selected target node */
 	if (selected_nid != -1) {
 		sched_setnode(p, selected_nid);
-		sched_setnode(selected_task, this_nid);
+		if (selected_task)
+			sched_setnode(selected_task, this_nid);
 	}
 }
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 43/43] sched: numa: Increase and decrease a tasks scanning period based on task fault statistics
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (41 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 42/43] sched: numa: Consider only one CPU per node for CPU-follows-memory Mel Gorman
@ 2012-11-16 11:22 ` Mel Gorman
  2012-11-16 14:56 ` [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

Currently the rate of scanning for an address space is controlled by the
individual tasks. The time of the next scan is determined by p->numa_scan_period,
which slowly increases as NUMA faults are handled. This assumes there are
no phase changes.

Now that there is a policy in place that guesses if a task or process
is properly placed, use that information to grow/shrink the scanning
window on a per-task basis.
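
In effect the scan period now doubles while placement looks right and halves
when it does not, clamped by the existing sysctls. A sketch of that
arithmetic; the millisecond bounds are placeholders rather than the real
sysctl defaults:

	static const unsigned int scan_period_min = 100;	/* ms, placeholder */
	static const unsigned int scan_period_max = 100000;	/* ms, placeholder */

	static unsigned int next_scan_period(unsigned int period, int well_placed)
	{
		if (well_placed) {
			/* placement looks right: back off, scan less often */
			period *= 2;
			return period > scan_period_max ? scan_period_max : period;
		}

		/* placement looks wrong: scan more often to gather fresh stats */
		period /= 2;
		return period < scan_period_min ? scan_period_min : period;
	}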

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c |   22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d2ccd3..598f657 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1063,18 +1063,25 @@ static void task_numa_placement(struct task_struct *p)
 	}
 
 	/*
+	 * If this NUMA node is the selected one based on task NUMA
+	 * faults then increase the time before it scans again
+	 */
+	if (task_selected_nid == this_nid)
+		p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
+					  p->numa_scan_period * 2);
+
+	/*
 	 * If this NUMA node is the selected one based on process
 	 * memory and task NUMA faults then set the home node.
 	 * There should be no need to requeue the task.
 	 */
 	if (task_selected_nid == this_nid && mm_selected_nid == this_nid) {
-		p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
-					  p->numa_scan_period * 2);
 		p->home_node = this_nid;
 		return;
 	}
 
-	p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+	p->numa_scan_period = max(sysctl_balance_numa_scan_period_min,
+				p->numa_scan_period / 2);
 	task_numa_find_placement(p);
 }
 
@@ -1110,15 +1117,6 @@ void task_numa_fault(int node, int pages)
 	p->mm->mm_balancenuma->mm_numa_fault_tot++;
 	p->mm->mm_balancenuma->mm_numa_fault[node]++;
 
-	/*
-	 * Assume that as faults occur that pages are getting properly placed
-	 * and fewer NUMA hints are required. Note that this is a big
-	 * assumption, it assumes processes reach a steady steady with no
-	 * further phase changes.
-	 */
-	p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
-				p->numa_scan_period + jiffies_to_msecs(2));
-
 	task_numa_placement(p);
 }
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 11:22 ` [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation Mel Gorman
@ 2012-11-16 14:09   ` Rik van Riel
  2012-11-16 14:41     ` Mel Gorman
  0 siblings, 1 reply; 62+ messages in thread
From: Rik van Riel @ 2012-11-16 14:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On 11/16/2012 06:22 AM, Mel Gorman wrote:
> It was pointed out by Ingo Molnar that the per-architecture definition of
> the NUMA PTE helper functions means that each supporting architecture
> will have to cut and paste it which is unfortunate. He suggested instead
> that the helpers should be weak functions that can be overridden by the
> architecture.
>
> This patch moves the helpers to mm/pgtable-generic.c and makes them weak
> functions. Architectures wishing to use this will still be required to
> define _PAGE_NUMA and potentially update their p[te|md]_present and
> pmd_bad helpers if they choose to make PAGE_NUMA similar to PROT_NONE.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Is uninlining these simple tests really the right thing to do,
or would they be better off as inlines in asm-generic/pgtable.h ?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 07/43] mm: numa: Support NUMA hinting page faults from gup/gup_fast
  2012-11-16 11:22 ` [PATCH 07/43] mm: numa: Support NUMA hinting page faults from gup/gup_fast Mel Gorman
@ 2012-11-16 14:09   ` Rik van Riel
  0 siblings, 0 replies; 62+ messages in thread
From: Rik van Riel @ 2012-11-16 14:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On 11/16/2012 06:22 AM, Mel Gorman wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> Introduce FOLL_NUMA to tell follow_page to check
> pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
> so because it always invokes handle_mm_fault and retries the
> follow_page later.
>
> KVM secondary MMU page faults will trigger the NUMA hinting page
> faults through gup_fast -> get_user_pages -> follow_page ->
> handle_mm_fault.
>
> Other follow_page callers like KSM should not use FOLL_NUMA, or they
> would fail to get the pages if they use follow_page instead of
> get_user_pages.
>
> [ This patch was picked up from the AutoNUMA tree. ]
>
> Originally-by: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> [ ported to this tree. ]
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 14:09   ` Rik van Riel
@ 2012-11-16 14:41     ` Mel Gorman
  2012-11-16 15:32       ` Linus Torvalds
  0 siblings, 1 reply; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 14:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Fri, Nov 16, 2012 at 09:09:03AM -0500, Rik van Riel wrote:
> On 11/16/2012 06:22 AM, Mel Gorman wrote:
> >It was pointed out by Ingo Molnar that the per-architecture definition of
> >the NUMA PTE helper functions means that each supporting architecture
> >will have to cut and paste it which is unfortunate. He suggested instead
> >that the helpers should be weak functions that can be overridden by the
> >architecture.
> >
> >This patch moves the helpers to mm/pgtable-generic.c and makes them weak
> >functions. Architectures wishing to use this will still be required to
> >define _PAGE_NUMA and potentially update their p[te|md]_present and
> >pmd_bad helpers if they choose to make PAGE_NUMA similar to PROT_NONE.
> >
> >Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Is uninlining these simple tests really the right thing to do,
> or would they be better off as inlines in asm-generic/pgtable.h ?
> 

I would have preferred asm-generic/pgtable.h myself, using
__HAVE_ARCH_whatever tricks to keep the inlining, but Ingo's suggestion
was to use __weak (https://lkml.org/lkml/2012/11/13/134) and I did not
have a strong reason to disagree. Is there a compelling choice either
way or a preference?
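
For the sake of discussion, the two shapes look roughly like this; the
bodies are only indicative and the __HAVE_ARCH_PTE_NUMA guard name is made
up along the lines mentioned above:

	/* Option A: out-of-line weak function in mm/pgtable-generic.c that an
	 * architecture can override by supplying its own definition */
	int __weak pte_numa(pte_t pte)
	{
		return (pte_flags(pte) & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
	}

	/* Option B: static inline in asm-generic/pgtable.h behind an arch opt-out */
	#ifndef __HAVE_ARCH_PTE_NUMA
	static inline int pte_numa(pte_t pte)
	{
		return (pte_flags(pte) & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
	}
	#endif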

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 00/43] Automatic NUMA Balancing V3
  2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
                   ` (42 preceding siblings ...)
  2012-11-16 11:22 ` [PATCH 43/43] sched: numa: Increase and decrease a tasks scanning period based on task fault statistics Mel Gorman
@ 2012-11-16 14:56 ` Mel Gorman
  43 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 14:56 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML

On Fri, Nov 16, 2012 at 11:22:10AM +0000, Mel Gorman wrote:
> tldr: Benchmarkers, only test patches 1-35.
> 

A very basic sniff test of other benchmarks completed so I thought I'd
report them too. aim9 and hackbench (both pipes and sockets) were inconclusive.
STREAM showed no difference but it was not really configured for running
on NUMA machines.

KERNBENCH
                               3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                     rc4-stats-v2r34    rc4-schednuma-v2r3  rc4-autonuma-v28fast       rc4-moron-v3r27    rc4-twostage-v3r27   rc4-lessflush-v3r27      rc4-cpuone-v3r27   rc4-adaptscan-v3r27
User    min        1295.88 (  0.00%)     1402.92 ( -8.26%)     1290.31 (  0.43%)     1298.47 ( -0.20%)     1297.88 ( -0.15%)     1299.04 ( -0.24%)     1296.06 ( -0.01%)     1300.04 ( -0.32%)
User    mean       1298.20 (  0.00%)     1842.51 (-41.93%)     1294.05 (  0.32%)     1301.40 ( -0.25%)     1300.28 ( -0.16%)     1301.86 ( -0.28%)     1300.27 ( -0.16%)     1302.41 ( -0.32%)
User    stddev        1.53 (  0.00%)      223.82 (-14561.84%)        3.29 (-115.30%)        1.96 (-28.37%)        1.82 (-19.21%)        1.62 ( -5.92%)        2.84 (-85.83%)        2.19 (-43.19%)
User    max        1300.23 (  0.00%)     1999.76 (-53.80%)     1298.14 (  0.16%)     1303.91 ( -0.28%)     1302.36 ( -0.16%)     1303.48 ( -0.25%)     1303.67 ( -0.26%)     1306.42 ( -0.48%)
System  min         118.62 (  0.00%)      176.64 (-48.91%)      122.14 ( -2.97%)      122.22 ( -3.03%)      122.11 ( -2.94%)      122.23 ( -3.04%)      120.96 ( -1.97%)      122.88 ( -3.59%)
System  mean        118.96 (  0.00%)      261.27 (-119.63%)      123.23 ( -3.59%)      123.22 ( -3.58%)      122.52 ( -3.00%)      122.68 ( -3.13%)      121.48 ( -2.12%)      123.58 ( -3.88%)
System  stddev        0.20 (  0.00%)       42.66 (-21088.95%)        0.57 (-183.80%)        1.09 (-439.58%)        0.31 (-51.61%)        0.40 (-96.69%)        0.34 (-70.93%)        0.60 (-199.57%)
System  max         119.21 (  0.00%)      291.11 (-144.20%)      123.73 ( -3.79%)      125.24 ( -5.06%)      122.97 ( -3.15%)      123.20 ( -3.35%)      121.92 ( -2.27%)      124.62 ( -4.54%)
Elapsed min          40.26 (  0.00%)       50.31 (-24.96%)       40.63 ( -0.92%)       39.89 (  0.92%)       40.45 ( -0.47%)       40.92 ( -1.64%)       41.33 ( -2.66%)       40.60 ( -0.84%)
Elapsed mean         41.65 (  0.00%)       64.74 (-55.43%)       42.04 ( -0.94%)       41.57 (  0.18%)       41.72 ( -0.18%)       41.96 ( -0.75%)       42.13 ( -1.16%)       41.93 ( -0.67%)
Elapsed stddev        0.89 (  0.00%)        7.44 (-737.89%)        1.25 (-40.28%)        0.94 ( -5.80%)        0.90 ( -1.42%)        0.77 ( 13.48%)        0.57 ( 36.26%)        0.72 ( 19.00%)
Elapsed max          42.96 (  0.00%)       70.82 (-64.85%)       44.00 ( -2.42%)       42.63 (  0.77%)       42.89 (  0.16%)       42.78 (  0.42%)       42.81 (  0.35%)       42.78 (  0.42%)
CPU     min        3303.00 (  0.00%)     3139.00 (  4.97%)     3221.00 (  2.48%)     3332.00 ( -0.88%)     3310.00 ( -0.21%)     3322.00 ( -0.58%)     3310.00 ( -0.21%)     3331.00 ( -0.85%)
CPU     mean       3403.00 (  0.00%)     3243.60 (  4.68%)     3373.40 (  0.87%)     3428.00 ( -0.73%)     3411.00 ( -0.24%)     3395.00 (  0.24%)     3374.40 (  0.84%)     3401.40 (  0.05%)
CPU     stddev       70.24 (  0.00%)       60.02 ( 14.56%)      103.48 (-47.33%)       85.77 (-22.12%)       78.33 (-11.52%)       64.52 (  8.15%)       50.06 ( 28.73%)       64.99 (  7.47%)
CPU     max        3518.00 (  0.00%)     3310.00 (  5.91%)     3496.00 (  0.63%)     3582.00 ( -1.82%)     3522.00 ( -0.11%)     3482.00 (  1.02%)     3447.00 (  2.02%)     3524.00 ( -0.17%)
MMTests Statistics: duration (seconds)
Kernel (3.7.0)            User     System    Elapsed
rc4-stats-v2r34        7821.95     734.94     297.60
rc4-schednuma-v2r3    11152.89    1599.73     438.56
rc4-autonuma-v28fast   7792.64     757.50     299.92
rc4-moron-v3r27        7841.33     760.34     296.24
rc4-twostage-v3r27     7835.71     756.44     298.16
rc4-lessflush-v3r27    7840.44     756.76     298.33
rc4-cpuone-v3r27       7829.82     749.66     299.63
rc4-adaptscan-v3r27    7852.37     762.89     300.04

This is a plain kernel build benchmark, averaged over 5 builds.

schednuma regressed mean elapsed time by 55.43% with major deviations and
high system CPU overhead. Maybe there is breakage in the logic that starts
scanning? The problem could be in the tip/sched/core patches for all I
know; I didn't investigate. It's something to keep an eye on if the
series is rebased.

autonuma and balancenuma are ok. No major impact but none is expected.

PAGE FAULT TEST

This is a microbenchmark for page fault handling. The ordering of the
thread counts in the report is obviously screwed up (they are sorted
lexically rather than numerically); I'll fix that some other time. It
just makes the tables a bit trickier to read.

                              3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0                 3.7.0
                    rc4-stats-v2r34    rc4-schednuma-v2r3  rc4-autonuma-v28fast       rc4-moron-v3r27    rc4-twostage-v3r27   rc4-lessflush-v3r27      rc4-cpuone-v3r27   rc4-adaptscan-v3r27
User       1       0.6940 (  0.00%)      0.7065 ( -1.80%)      0.6535 (  5.84%)      0.6930 (  0.14%)      0.6985 ( -0.65%)      0.6950 ( -0.14%)      0.6930 (  0.14%)      0.6785 (  2.23%)
User       10      0.8245 (  0.00%)      0.8020 (  2.73%)      0.8650 ( -4.91%)      0.8305 ( -0.73%)      0.8335 ( -1.09%)      0.8110 (  1.64%)      0.8440 ( -2.37%)      0.8320 ( -0.91%)
User       11      0.8410 (  0.00%)      0.8105 (  3.63%)      0.8765 ( -4.22%)      0.8150 (  3.09%)      0.8350 (  0.71%)      0.8335 (  0.89%)      0.8300 (  1.31%)      0.8470 ( -0.71%)
User       12      0.8260 (  0.00%)      0.7370 ( 10.77%)      0.8975 ( -8.66%)      0.8120 (  1.69%)      0.7865 (  4.78%)      0.7880 (  4.60%)      0.7755 (  6.11%)      0.8065 (  2.36%)
User       13      0.8490 (  0.00%)      0.8445 (  0.53%)      0.9275 ( -9.25%)      0.8775 ( -3.36%)      0.8310 (  2.12%)      0.8620 ( -1.53%)      0.8380 (  1.30%)      0.8290 (  2.36%)
User       14      0.8515 (  0.00%)      0.8305 (  2.47%)      0.9170 ( -7.69%)      0.8460 (  0.65%)      0.8295 (  2.58%)      0.8525 ( -0.12%)      0.8550 ( -0.41%)      0.8615 ( -1.17%)
User       15      0.7785 (  0.00%)      0.7745 (  0.51%)      0.9530 (-22.41%)      0.7665 (  1.54%)      0.8185 ( -5.14%)      0.8250 ( -5.97%)      0.8445 ( -8.48%)      0.8035 ( -3.21%)
User       16      0.8750 (  0.00%)      0.8395 (  4.06%)      0.9345 ( -6.80%)      0.8705 (  0.51%)      0.8790 ( -0.46%)      0.8485 (  3.03%)      0.8720 (  0.34%)      0.7620 ( 12.91%)
User       17      0.8865 (  0.00%)      0.9235 ( -4.17%)      0.9585 ( -8.12%)      0.9105 ( -2.71%)      0.8995 ( -1.47%)      0.8800 (  0.73%)      0.8600 (  2.99%)      0.8755 (  1.24%)
User       18      0.9240 (  0.00%)      0.9020 (  2.38%)      0.9615 ( -4.06%)      0.9945 ( -7.63%)      0.9380 ( -1.52%)      1.0210 (-10.50%)      0.8790 (  4.87%)      0.8400 (  9.09%)
User       19      0.8335 (  0.00%)      0.9030 ( -8.34%)      0.9890 (-18.66%)      0.9480 (-13.74%)      0.9160 ( -9.90%)      0.9315 (-11.76%)      0.9575 (-14.88%)      0.8390 ( -0.66%)
User       2       0.6915 (  0.00%)      0.7105 ( -2.75%)      0.6890 (  0.36%)      0.6990 ( -1.08%)      0.7020 ( -1.52%)      0.7000 ( -1.23%)      0.6980 ( -0.94%)      0.7170 ( -3.69%)
User       20      0.8055 (  0.00%)      0.7515 (  6.70%)      0.9285 (-15.27%)      0.8135 ( -0.99%)      0.8900 (-10.49%)      0.7495 (  6.95%)      0.8910 (-10.61%)      0.7920 (  1.68%)
User       21      0.9490 (  0.00%)      0.9575 ( -0.90%)      0.9895 ( -4.27%)      0.9060 (  4.53%)      0.9640 ( -1.58%)      0.9990 ( -5.27%)      1.0165 ( -7.11%)      0.8760 (  7.69%)
User       22      0.9325 (  0.00%)      0.9545 ( -2.36%)      0.9820 ( -5.31%)      0.9205 (  1.29%)      0.9215 (  1.18%)      0.9285 (  0.43%)      0.9625 ( -3.22%)      0.8985 (  3.65%)
User       23      0.8595 (  0.00%)      0.8640 ( -0.52%)      0.9835 (-14.43%)      0.8820 ( -2.62%)      0.8450 (  1.69%)      0.8240 (  4.13%)      0.8750 ( -1.80%)      0.8765 ( -1.98%)
User       24      0.8060 (  0.00%)      0.8255 ( -2.42%)      0.9790 (-21.46%)      0.7695 (  4.53%)      0.8260 ( -2.48%)      0.8165 ( -1.30%)      0.8275 ( -2.67%)      0.8055 (  0.06%)
User       25      0.9570 (  0.00%)      1.1360 (-18.70%)      0.9685 ( -1.20%)      0.9785 ( -2.25%)      0.9655 ( -0.89%)      0.9510 (  0.63%)      0.9815 ( -2.56%)      0.9225 (  3.61%)
User       26      0.9535 (  0.00%)      1.0865 (-13.95%)      1.0100 ( -5.93%)      0.9640 ( -1.10%)      0.9405 (  1.36%)      0.9995 ( -4.82%)      0.9415 (  1.26%)      0.8970 (  5.93%)
User       27      0.8870 (  0.00%)      1.0045 (-13.25%)      0.9885 (-11.44%)      0.9405 ( -6.03%)      0.9375 ( -5.69%)      0.8770 (  1.13%)      0.8865 (  0.06%)      0.8580 (  3.27%)
User       28      0.8935 (  0.00%)      0.8965 ( -0.34%)      1.0015 (-12.09%)      0.8760 (  1.96%)      0.8595 (  3.81%)      0.8435 (  5.60%)      0.8715 (  2.46%)      0.8880 (  0.62%)
User       29      0.9450 (  0.00%)      0.9595 ( -1.53%)      1.0110 ( -6.98%)      0.9620 ( -1.80%)      0.8990 (  4.87%)      1.0160 ( -7.51%)      0.9530 ( -0.85%)      0.9230 (  2.33%)
User       3       0.7355 (  0.00%)      0.7275 (  1.09%)      0.7465 ( -1.50%)      0.7220 (  1.84%)      0.7180 (  2.38%)      0.7180 (  2.38%)      0.7300 (  0.75%)      0.7195 (  2.18%)
User       30      0.9245 (  0.00%)      0.9810 ( -6.11%)      1.0525 (-13.85%)      0.9480 ( -2.54%)      0.9805 ( -6.06%)      0.9885 ( -6.92%)      0.9405 ( -1.73%)      0.8875 (  4.00%)
User       31      0.9120 (  0.00%)      0.9950 ( -9.10%)      1.0360 (-13.60%)      0.9680 ( -6.14%)      0.9820 ( -7.68%)      0.9715 ( -6.52%)      0.9070 (  0.55%)      0.9345 ( -2.47%)
User       32      0.9220 (  0.00%)      0.8985 (  2.55%)      1.0345 (-12.20%)      0.8940 (  3.04%)      0.9725 ( -5.48%)      0.9200 (  0.22%)      1.0100 ( -9.54%)      0.9805 ( -6.34%)
User       33      1.0340 (  0.00%)      1.0530 ( -1.84%)      1.0750 ( -3.97%)      0.9800 (  5.22%)      0.9850 (  4.74%)      1.0520 ( -1.74%)      1.0115 (  2.18%)      0.9250 ( 10.54%)
User       34      1.0150 (  0.00%)      1.0495 ( -3.40%)      1.0415 ( -2.61%)      1.0615 ( -4.58%)      1.0060 (  0.89%)      1.0145 (  0.05%)      1.0180 ( -0.30%)      0.9760 (  3.84%)
User       35      1.0010 (  0.00%)      1.0865 ( -8.54%)      1.0520 ( -5.09%)      1.0420 ( -4.10%)      1.0105 ( -0.95%)      1.0410 ( -4.00%)      1.0015 ( -0.05%)      1.0430 ( -4.20%)
User       36      1.0250 (  0.00%)      1.0835 ( -5.71%)      1.0755 ( -4.93%)      1.0670 ( -4.10%)      1.0410 ( -1.56%)      1.0390 ( -1.37%)      1.0695 ( -4.34%)      1.0590 ( -3.32%)
User       37      1.0520 (  0.00%)      1.1555 ( -9.84%)      1.0565 ( -0.43%)      1.0800 ( -2.66%)      1.0640 ( -1.14%)      1.0600 ( -0.76%)      1.0800 ( -2.66%)      1.0825 ( -2.90%)
User       38      1.0760 (  0.00%)      1.1405 ( -5.99%)      1.0725 (  0.33%)      1.0910 ( -1.39%)      1.0815 ( -0.51%)      1.0935 ( -1.63%)      1.0625 (  1.25%)      1.0430 (  3.07%)
User       39      1.0805 (  0.00%)      1.1335 ( -4.91%)      1.0815 ( -0.09%)      1.1335 ( -4.91%)      1.1095 ( -2.68%)      1.1150 ( -3.19%)      1.0830 ( -0.23%)      1.0540 (  2.45%)
User       4       0.7455 (  0.00%)      0.7555 ( -1.34%)      0.7725 ( -3.62%)      0.7275 (  2.41%)      0.7490 ( -0.47%)      0.7460 ( -0.07%)      0.7330 (  1.68%)      0.7535 ( -1.07%)
User       40      1.0880 (  0.00%)      1.1400 ( -4.78%)      1.0795 (  0.78%)      1.0875 (  0.05%)      1.1345 ( -4.27%)      1.1115 ( -2.16%)      1.1405 ( -4.83%)      1.1125 ( -2.25%)
User       41      1.0750 (  0.00%)      1.1730 ( -9.12%)      1.1250 ( -4.65%)      1.1595 ( -7.86%)      1.1045 ( -2.74%)      1.1930 (-10.98%)      1.1220 ( -4.37%)      1.1385 ( -5.91%)
User       42      1.1400 (  0.00%)      1.2005 ( -5.31%)      1.1060 (  2.98%)      1.1405 ( -0.04%)      1.1720 ( -2.81%)      1.1455 ( -0.48%)      1.1685 ( -2.50%)      1.1305 (  0.83%)
User       43      1.1345 (  0.00%)      1.2520 (-10.36%)      1.0965 (  3.35%)      1.1170 (  1.54%)      1.1875 ( -4.67%)      1.1815 ( -4.14%)      1.1630 ( -2.51%)      1.1165 (  1.59%)
User       44      1.0975 (  0.00%)      1.1655 ( -6.20%)      1.1005 ( -0.27%)      1.1045 ( -0.64%)      1.1765 ( -7.20%)      1.1630 ( -5.97%)      1.1730 ( -6.88%)      1.1285 ( -2.82%)
User       45      1.1195 (  0.00%)      1.1700 ( -4.51%)      1.1365 ( -1.52%)      1.1225 ( -0.27%)      1.1685 ( -4.38%)      1.1325 ( -1.16%)      1.1245 ( -0.45%)      1.1185 (  0.09%)
User       46      1.1265 (  0.00%)      1.1740 ( -4.22%)      1.1390 ( -1.11%)      1.0720 (  4.84%)      1.1665 ( -3.55%)      1.1800 ( -4.75%)      1.1235 (  0.27%)      1.1595 ( -2.93%)
User       47      1.0555 (  0.00%)      1.1835 (-12.13%)      1.2075 (-14.40%)      1.1155 ( -5.68%)      1.0930 ( -3.55%)      1.0810 ( -2.42%)      1.0600 ( -0.43%)      1.0525 (  0.28%)
User       48      1.0190 (  0.00%)      1.1240 (-10.30%)      1.1795 (-15.75%)      1.0510 ( -3.14%)      1.0845 ( -6.43%)      1.0635 ( -4.37%)      1.0770 ( -5.69%)      1.0390 ( -1.96%)
User       5       0.7695 (  0.00%)      0.7935 ( -3.12%)      0.7620 (  0.97%)      0.7735 ( -0.52%)      0.7820 ( -1.62%)      0.7785 ( -1.17%)      0.7710 ( -0.19%)      0.7815 ( -1.56%)
User       6       0.8050 (  0.00%)      0.7950 (  1.24%)      0.8205 ( -1.93%)      0.8020 (  0.37%)      0.7980 (  0.87%)      0.8100 ( -0.62%)      0.7795 (  3.17%)      0.7850 (  2.48%)
User       7       0.8035 (  0.00%)      0.8025 (  0.12%)      0.8470 ( -5.41%)      0.8020 (  0.19%)      0.8325 ( -3.61%)      0.8155 ( -1.49%)      0.7985 (  0.62%)      0.8010 (  0.31%)
User       8       0.7800 (  0.00%)      0.7720 (  1.03%)      0.8165 ( -4.68%)      0.7750 (  0.64%)      0.7675 (  1.60%)      0.7635 (  2.12%)      0.7705 (  1.22%)      0.7590 (  2.69%)
User       9       0.8265 (  0.00%)      0.8205 (  0.73%)      0.8675 ( -4.96%)      0.8180 (  1.03%)      0.8410 ( -1.75%)      0.8325 ( -0.73%)      0.8440 ( -2.12%)      0.8380 ( -1.39%)

User times are okayish. The differences are small and go both ways.

System     1       8.0225 (  0.00%)      8.3030 ( -3.50%)      7.7995 (  2.78%)      8.1195 ( -1.21%)      8.8805 (-10.69%)      8.8315 (-10.08%)      8.9385 (-11.42%)      8.9270 (-11.27%)
System     10      9.3950 (  0.00%)      9.3540 (  0.44%)     12.4220 (-32.22%)      9.4675 ( -0.77%)      9.4720 ( -0.82%)      9.3825 (  0.13%)      9.4545 ( -0.63%)      9.4245 ( -0.31%)
System     11      9.6080 (  0.00%)      9.5930 (  0.16%)     12.8430 (-33.67%)      9.5195 (  0.92%)      9.6365 ( -0.30%)      9.5765 (  0.33%)      9.5655 (  0.44%)      9.6655 ( -0.60%)
System     12      9.6835 (  0.00%)      9.4295 (  2.62%)     13.8420 (-42.94%)      9.6550 (  0.29%)      9.5810 (  1.06%)      9.5840 (  1.03%)      9.5485 (  1.39%)      9.7285 ( -0.46%)
System     13     10.2280 (  0.00%)     10.2590 ( -0.30%)     14.6750 (-43.48%)     10.3410 ( -1.10%)     10.1735 (  0.53%)     10.1995 (  0.28%)     10.1645 (  0.62%)     10.1835 (  0.44%)
System     14     10.5900 (  0.00%)     10.5680 (  0.21%)     14.0160 (-32.35%)     10.5915 ( -0.01%)     10.4970 (  0.88%)     10.5605 (  0.28%)     10.5495 (  0.38%)     10.5625 (  0.26%)
System     15     10.6810 (  0.00%)     10.7840 ( -0.96%)     15.3735 (-43.93%)     10.6965 ( -0.15%)     10.7710 ( -0.84%)     10.7660 ( -0.80%)     10.9565 ( -2.58%)     10.7735 ( -0.87%)
System     16     11.2160 (  0.00%)     11.1445 (  0.64%)     15.7790 (-40.68%)     11.2480 ( -0.29%)     11.2270 ( -0.10%)     11.1350 (  0.72%)     11.1960 (  0.18%)     10.9570 (  2.31%)
System     17     11.8005 (  0.00%)     11.8740 ( -0.62%)     16.0135 (-35.70%)     11.8500 ( -0.42%)     11.8655 ( -0.55%)     11.8020 ( -0.01%)     11.8435 ( -0.36%)     11.7465 (  0.46%)
System     18     12.3135 (  0.00%)     12.3210 ( -0.06%)     17.7965 (-44.53%)     12.4070 ( -0.76%)     12.3475 ( -0.28%)     12.4035 ( -0.73%)     12.3925 ( -0.64%)     12.4900 ( -1.43%)
System     19     12.9255 (  0.00%)     12.8265 (  0.77%)     17.0350 (-31.79%)     12.9055 (  0.15%)     12.7370 (  1.46%)     12.9020 (  0.18%)     12.8535 (  0.56%)     12.7640 (  1.25%)
System     2       7.9740 (  0.00%)      8.1645 ( -2.39%)      8.0575 ( -1.05%)      8.0620 ( -1.10%)      8.3845 ( -5.15%)      8.4270 ( -5.68%)      8.3680 ( -4.94%)      8.4245 ( -5.65%)
System     20     13.1690 (  0.00%)     13.2255 ( -0.43%)     22.2560 (-69.00%)     13.1780 ( -0.07%)     13.4530 ( -2.16%)     13.1810 ( -0.09%)     13.4035 ( -1.78%)     13.1695 ( -0.00%)
System     21     13.9900 (  0.00%)     14.1500 ( -1.14%)     19.5060 (-39.43%)     14.0500 ( -0.43%)     13.8970 (  0.66%)     13.8255 (  1.18%)     13.8665 (  0.88%)     13.9995 ( -0.07%)
System     22     14.4825 (  0.00%)     14.6855 ( -1.40%)     17.4625 (-20.58%)     14.4845 ( -0.01%)     14.5035 ( -0.15%)     14.7185 ( -1.63%)     14.4410 (  0.29%)     14.3950 (  0.60%)
System     23     15.0315 (  0.00%)     15.1310 ( -0.66%)     19.5000 (-29.73%)     15.0550 ( -0.16%)     15.0435 ( -0.08%)     15.0830 ( -0.34%)     15.1505 ( -0.79%)     15.0270 (  0.03%)
System     24     15.5875 (  0.00%)     15.6670 ( -0.51%)     19.4715 (-24.92%)     15.5890 ( -0.01%)     15.5645 (  0.15%)     15.5870 (  0.00%)     15.5850 (  0.02%)     15.5945 ( -0.04%)
System     25     16.1680 (  0.00%)     16.2540 ( -0.53%)     19.1600 (-18.51%)     16.1870 ( -0.12%)     16.2020 ( -0.21%)     16.1655 (  0.02%)     16.1390 (  0.18%)     16.1530 (  0.09%)
System     26     16.7865 (  0.00%)     16.8390 ( -0.31%)     19.1500 (-14.08%)     17.0290 ( -1.44%)     17.0090 ( -1.33%)     16.6930 (  0.56%)     16.9675 ( -1.08%)     16.7230 (  0.38%)
System     27     17.3615 (  0.00%)     17.3930 ( -0.18%)     21.9550 (-26.46%)     17.2930 (  0.39%)     17.4680 ( -0.61%)     17.3075 (  0.31%)     17.3500 (  0.07%)     17.2985 (  0.36%)
System     28     17.8905 (  0.00%)     17.9420 ( -0.29%)     21.7615 (-21.64%)     17.8855 (  0.03%)     17.8505 (  0.22%)     17.8400 (  0.28%)     17.8895 (  0.01%)     17.8940 ( -0.02%)
System     29     18.4965 (  0.00%)     18.6105 ( -0.62%)     20.9715 (-13.38%)     18.5045 ( -0.04%)     18.6050 ( -0.59%)     18.7750 ( -1.51%)     18.4845 (  0.06%)     18.4235 (  0.39%)
System     3       8.0840 (  0.00%)      8.1965 ( -1.39%)      8.4280 ( -4.26%)      8.1570 ( -0.90%)      8.2880 ( -2.52%)      8.3125 ( -2.83%)      8.2720 ( -2.33%)      8.2870 ( -2.51%)
System     30     19.1560 (  0.00%)     19.2185 ( -0.33%)     20.8670 ( -8.93%)     19.1430 (  0.07%)     19.0980 (  0.30%)     19.3585 ( -1.06%)     19.1110 (  0.23%)     19.1110 (  0.23%)
System     31     19.7080 (  0.00%)     19.7565 ( -0.25%)     21.6420 ( -9.81%)     19.7960 ( -0.45%)     19.7590 ( -0.26%)     19.6950 (  0.07%)     19.6655 (  0.22%)     19.6855 (  0.11%)
System     32     20.3105 (  0.00%)     20.2960 (  0.07%)     22.1755 ( -9.18%)     20.2305 (  0.39%)     20.2545 (  0.28%)     20.6380 ( -1.61%)     20.3245 ( -0.07%)     20.6485 ( -1.66%)
System     33     20.9800 (  0.00%)     21.3255 ( -1.65%)     24.1000 (-14.87%)     21.2675 ( -1.37%)     21.0655 ( -0.41%)     20.8960 (  0.40%)     21.0060 ( -0.12%)     20.9885 ( -0.04%)
System     34     21.6850 (  0.00%)     21.6340 (  0.24%)     23.1505 ( -6.76%)     21.7575 ( -0.33%)     21.6265 (  0.27%)     21.6225 (  0.29%)     21.5930 (  0.42%)     21.8515 ( -0.77%)
System     35     22.3675 (  0.00%)     22.3655 (  0.01%)     24.6240 (-10.09%)     22.3405 (  0.12%)     22.3260 (  0.19%)     22.4570 ( -0.40%)     22.2735 (  0.42%)     22.3960 ( -0.13%)
System     36     22.9230 (  0.00%)     22.9350 ( -0.05%)     24.4070 ( -6.47%)     23.1835 ( -1.14%)     23.3975 ( -2.07%)     22.9810 ( -0.25%)     23.1720 ( -1.09%)     22.9655 ( -0.19%)
System     37     23.8260 (  0.00%)     23.6405 (  0.78%)     26.7380 (-12.22%)     23.6795 (  0.61%)     23.7115 (  0.48%)     23.7050 (  0.51%)     23.7755 (  0.21%)     23.7005 (  0.53%)
System     38     24.5010 (  0.00%)     24.4410 (  0.24%)     26.4575 ( -7.99%)     24.4370 (  0.26%)     24.5340 ( -0.13%)     24.4795 (  0.09%)     24.4630 (  0.16%)     24.3760 (  0.51%)
System     39     25.1385 (  0.00%)     25.0860 (  0.21%)     25.9555 ( -3.25%)     25.2515 ( -0.45%)     25.2305 ( -0.37%)     25.2435 ( -0.42%)     25.0690 (  0.28%)     25.0730 (  0.26%)
System     4       8.1825 (  0.00%)      8.2740 ( -1.12%)      9.0735 (-10.89%)      8.2220 ( -0.48%)      8.3215 ( -1.70%)      8.3500 ( -2.05%)      8.2945 ( -1.37%)      8.2665 ( -1.03%)
System     40     25.9080 (  0.00%)     25.8030 (  0.41%)     25.9760 ( -0.26%)     25.6900 (  0.84%)     25.6950 (  0.82%)     25.7610 (  0.57%)     25.9280 ( -0.08%)     25.7840 (  0.48%)
System     41     26.6045 (  0.00%)     26.6025 (  0.01%)     26.7530 ( -0.56%)     26.6155 ( -0.04%)     26.6660 ( -0.23%)     26.6985 ( -0.35%)     26.5565 (  0.18%)     26.6280 ( -0.09%)
System     42     27.3815 (  0.00%)     27.3430 (  0.14%)     27.1135 (  0.98%)     27.3435 (  0.14%)     27.3335 (  0.18%)     27.3950 ( -0.05%)     27.5040 ( -0.45%)     27.3690 (  0.05%)
System     43     28.1110 (  0.00%)     28.2565 ( -0.52%)     28.7255 ( -2.19%)     28.0770 (  0.12%)     28.1630 ( -0.18%)     28.0530 (  0.21%)     27.9880 (  0.44%)     28.0160 (  0.34%)
System     44     28.7445 (  0.00%)     28.9265 ( -0.63%)     27.9265 (  2.85%)     28.7380 (  0.02%)     28.7090 (  0.12%)     28.7205 (  0.08%)     28.7120 (  0.11%)     28.7355 (  0.03%)
System     45     29.4660 (  0.00%)     29.6940 ( -0.77%)     28.7570 (  2.41%)     29.4360 (  0.10%)     29.5085 ( -0.14%)     29.5505 ( -0.29%)     29.5690 ( -0.35%)     29.5495 ( -0.28%)
System     46     30.3190 (  0.00%)     30.4165 ( -0.32%)     29.2040 (  3.68%)     30.2740 (  0.15%)     30.2490 (  0.23%)     30.2530 (  0.22%)     30.2540 (  0.21%)     30.1500 (  0.56%)
System     47     30.9935 (  0.00%)     31.2000 ( -0.67%)     29.5720 (  4.59%)     30.9790 (  0.05%)     30.9705 (  0.07%)     30.9835 (  0.03%)     30.9770 (  0.05%)     30.9540 (  0.13%)
System     48     31.7365 (  0.00%)     31.9850 ( -0.78%)     29.8695 (  5.88%)     31.7240 (  0.04%)     31.6080 (  0.40%)     31.4910 (  0.77%)     31.7400 ( -0.01%)     31.6185 (  0.37%)
System     5       8.4450 (  0.00%)      8.6465 ( -2.39%)      9.7375 (-15.30%)      8.5655 ( -1.43%)      8.6400 ( -2.31%)      8.5890 ( -1.71%)      8.6010 ( -1.85%)      8.6535 ( -2.47%)
System     6       8.7270 (  0.00%)      8.7060 (  0.24%)     10.1565 (-16.38%)      8.7445 ( -0.20%)      8.8040 ( -0.88%)      8.8985 ( -1.97%)      8.7645 ( -0.43%)      8.7625 ( -0.41%)
System     7       8.7955 (  0.00%)      8.8210 ( -0.29%)      9.8575 (-12.07%)      8.8850 ( -1.02%)      8.9770 ( -2.06%)      8.9645 ( -1.92%)      8.9130 ( -1.34%)      8.9485 ( -1.74%)
System     8       8.7685 (  0.00%)      8.8625 ( -1.07%)     11.7395 (-33.88%)      8.8640 ( -1.09%)      8.8125 ( -0.50%)      8.7915 ( -0.26%)      8.8170 ( -0.55%)      8.8815 ( -1.29%)
System     9       9.2585 (  0.00%)      9.2545 (  0.04%)     11.5375 (-24.62%)      9.2480 (  0.11%)      9.3775 ( -1.29%)      9.2645 ( -0.06%)      9.3285 ( -0.76%)      9.2975 ( -0.42%)

autonuma severely regresses system CPU time here.

Elapsed    1       8.7345 (  0.00%)      9.0280 ( -3.36%)      8.4700 (  3.03%)      8.8290 ( -1.08%)      9.5940 ( -9.84%)      9.5440 ( -9.27%)      9.6495 (-10.48%)      9.6235 (-10.18%)
Elapsed    10      1.0810 (  0.00%)      1.0745 (  0.60%)      1.4245 (-31.78%)      1.0935 ( -1.16%)      1.0895 ( -0.79%)      1.0795 (  0.14%)      1.0990 ( -1.67%)      1.0875 ( -0.60%)
Elapsed    11      1.0140 (  0.00%)      1.0040 (  0.99%)      1.3640 (-34.52%)      1.0065 (  0.74%)      1.0135 (  0.05%)      1.0075 (  0.64%)      0.9965 (  1.73%)      1.0175 ( -0.35%)
Elapsed    12      0.9345 (  0.00%)      0.8745 (  6.42%)      1.4000 (-49.81%)      0.9300 (  0.48%)      0.9160 (  1.98%)      0.9155 (  2.03%)      0.8980 (  3.91%)      0.9375 ( -0.32%)
Elapsed    13      0.9370 (  0.00%)      0.9335 (  0.37%)      1.4165 (-51.17%)      0.9435 ( -0.69%)      0.9270 (  1.07%)      0.9360 (  0.11%)      0.9195 (  1.87%)      0.9195 (  1.87%)
Elapsed    14      0.8840 (  0.00%)      0.8800 (  0.45%)      1.2840 (-45.25%)      0.8870 ( -0.34%)      0.8785 (  0.62%)      0.8835 (  0.06%)      0.8865 ( -0.28%)      0.8770 (  0.79%)
Elapsed    15      0.8165 (  0.00%)      0.8195 ( -0.37%)      1.4100 (-72.69%)      0.8150 (  0.18%)      0.8395 ( -2.82%)      0.8375 ( -2.57%)      0.8420 ( -3.12%)      0.8220 ( -0.67%)
Elapsed    16      0.8090 (  0.00%)      0.7835 (  3.15%)      1.3955 (-72.50%)      0.7980 (  1.36%)      0.8120 ( -0.37%)      0.7780 (  3.83%)      0.8070 (  0.25%)      0.7520 (  7.05%)
Elapsed    17      0.8365 (  0.00%)      0.8340 (  0.30%)      1.3865 (-65.75%)      0.8280 (  1.02%)      0.8370 ( -0.06%)      0.8310 (  0.66%)      0.8280 (  1.02%)      0.8195 (  2.03%)
Elapsed    18      0.8070 (  0.00%)      0.7990 (  0.99%)      1.4900 (-84.63%)      0.8040 (  0.37%)      0.7985 (  1.05%)      0.8025 (  0.56%)      0.8065 (  0.06%)      0.8035 (  0.43%)
Elapsed    19      0.7760 (  0.00%)      0.7620 (  1.80%)      1.3920 (-79.38%)      0.7715 (  0.58%)      0.7665 (  1.22%)      0.7665 (  1.22%)      0.7765 ( -0.06%)      0.7755 (  0.06%)
Elapsed    2       4.3875 (  0.00%)      4.5205 ( -3.03%)      4.4680 ( -1.83%)      4.4385 ( -1.16%)      4.5830 ( -4.46%)      4.6285 ( -5.49%)      4.5885 ( -4.58%)      4.6315 ( -5.56%)
Elapsed    20      0.7320 (  0.00%)      0.7205 (  1.57%)      1.8875 (-157.86%)      0.7360 ( -0.55%)      0.7660 ( -4.64%)      0.7275 (  0.61%)      0.7700 ( -5.19%)      0.7250 (  0.96%)
Elapsed    21      0.8115 (  0.00%)      0.8150 ( -0.43%)      1.5500 (-91.00%)      0.8060 (  0.68%)      0.8000 (  1.42%)      0.7925 (  2.34%)      0.7885 (  2.83%)      0.8050 (  0.80%)
Elapsed    22      0.7865 (  0.00%)      0.7880 ( -0.19%)      1.2990 (-65.16%)      0.7820 (  0.57%)      0.7795 (  0.89%)      0.7995 ( -1.65%)      0.7740 (  1.59%)      0.7615 (  3.18%)
Elapsed    23      0.7550 (  0.00%)      0.7505 (  0.60%)      1.4240 (-88.61%)      0.7585 ( -0.46%)      0.7530 (  0.26%)      0.7620 ( -0.93%)      0.7760 ( -2.78%)      0.7535 (  0.20%)
Elapsed    24      0.7405 (  0.00%)      0.7315 (  1.22%)      1.4250 (-92.44%)      0.7485 ( -1.08%)      0.7420 ( -0.20%)      0.7705 ( -4.05%)      0.7495 ( -1.22%)      0.7475 ( -0.95%)
Elapsed    25      0.8405 (  0.00%)      0.8675 ( -3.21%)      1.3745 (-63.53%)      0.8430 ( -0.30%)      0.8460 ( -0.65%)      0.8380 (  0.30%)      0.8250 (  1.84%)      0.8280 (  1.49%)
Elapsed    26      0.8240 (  0.00%)      0.8400 ( -1.94%)      1.2865 (-56.13%)      0.8420 ( -2.18%)      0.8390 ( -1.82%)      0.8200 (  0.49%)      0.8235 (  0.06%)      0.8030 (  2.55%)
Elapsed    27      0.8095 (  0.00%)      0.8065 (  0.37%)      1.5595 (-92.65%)      0.8080 (  0.19%)      0.8280 ( -2.29%)      0.8070 (  0.31%)      0.8070 (  0.31%)      0.7985 (  1.36%)
Elapsed    28      0.7945 (  0.00%)      0.7805 (  1.76%)      1.5045 (-89.36%)      0.7935 (  0.13%)      0.7955 ( -0.13%)      0.7895 (  0.63%)      0.7955 ( -0.13%)      0.7870 (  0.94%)
Elapsed    29      0.8105 (  0.00%)      0.8015 (  1.11%)      1.2645 (-56.01%)      0.8055 (  0.62%)      0.8205 ( -1.23%)      0.8155 ( -0.62%)      0.8025 (  0.99%)      0.7915 (  2.34%)
Elapsed    3       2.9775 (  0.00%)      3.0160 ( -1.29%)      3.1050 ( -4.28%)      2.9970 ( -0.65%)      3.0270 ( -1.66%)      3.0450 ( -2.27%)      3.0395 ( -2.08%)      3.0295 ( -1.75%)
Elapsed    30      0.7980 (  0.00%)      0.7960 (  0.25%)      1.2380 (-55.14%)      0.8015 ( -0.44%)      0.8045 ( -0.81%)      0.8080 ( -1.25%)      0.8000 ( -0.25%)      0.7930 (  0.63%)
Elapsed    31      0.7895 (  0.00%)      0.7790 (  1.33%)      1.2440 (-57.57%)      0.8020 ( -1.58%)      0.7900 ( -0.06%)      0.7975 ( -1.01%)      0.7800 (  1.20%)      0.7965 ( -0.89%)
Elapsed    32      0.7875 (  0.00%)      0.7565 (  3.94%)      1.2395 (-57.40%)      0.7800 (  0.95%)      0.7850 (  0.32%)      0.7885 ( -0.13%)      0.7935 ( -0.76%)      0.7965 ( -1.14%)
Elapsed    33      0.7910 (  0.00%)      0.7885 (  0.32%)      1.3885 (-75.54%)      0.7965 ( -0.70%)      0.8030 ( -1.52%)      0.7840 (  0.88%)      0.7925 ( -0.19%)      0.7850 (  0.76%)
Elapsed    34      0.7965 (  0.00%)      0.7760 (  2.57%)      1.2035 (-51.10%)      0.8030 ( -0.82%)      0.7750 (  2.70%)      0.7950 (  0.19%)      0.7845 (  1.51%)      0.8015 ( -0.63%)
Elapsed    35      0.7960 (  0.00%)      0.7820 (  1.76%)      1.4090 (-77.01%)      0.7965 ( -0.06%)      0.7800 (  2.01%)      0.7900 (  0.75%)      0.7805 (  1.95%)      0.7975 ( -0.19%)
Elapsed    36      0.7690 (  0.00%)      0.7620 (  0.91%)      1.1845 (-54.03%)      0.8080 ( -5.07%)      0.8230 ( -7.02%)      0.7780 ( -1.17%)      0.7815 ( -1.63%)      0.7800 ( -1.43%)
Elapsed    37      0.8040 (  0.00%)      0.7740 (  3.73%)      1.5740 (-95.77%)      0.7815 (  2.80%)      0.7860 (  2.24%)      0.7890 (  1.87%)      0.7955 (  1.06%)      0.7760 (  3.48%)
Elapsed    38      0.7975 (  0.00%)      0.7755 (  2.76%)      1.3840 (-73.54%)      0.7885 (  1.13%)      0.7845 (  1.63%)      0.7790 (  2.32%)      0.7860 (  1.44%)      0.7620 (  4.45%)
Elapsed    39      0.7840 (  0.00%)      0.7650 (  2.42%)      1.2020 (-53.32%)      0.7850 ( -0.13%)      0.7875 ( -0.45%)      0.7855 ( -0.19%)      0.7755 (  1.08%)      0.7675 (  2.10%)
Elapsed    4       2.2725 (  0.00%)      2.2930 ( -0.90%)      2.5105 (-10.47%)      2.2700 (  0.11%)      2.3045 ( -1.41%)      2.3150 ( -1.87%)      2.2990 ( -1.17%)      2.2895 ( -0.75%)
Elapsed    40      0.7865 (  0.00%)      0.7585 (  3.56%)      1.1700 (-48.76%)      0.7675 (  2.42%)      0.7620 (  3.12%)      0.7720 (  1.84%)      0.7830 (  0.45%)      0.7750 (  1.46%)
Elapsed    41      0.7710 (  0.00%)      0.7730 ( -0.26%)      1.0885 (-41.18%)      0.7700 (  0.13%)      0.7850 ( -1.82%)      0.7895 ( -2.40%)      0.7660 (  0.65%)      0.7825 ( -1.49%)
Elapsed    42      0.7710 (  0.00%)      0.7620 (  1.17%)      1.1805 (-53.11%)      0.7760 ( -0.65%)      0.7835 ( -1.62%)      0.7965 ( -3.31%)      0.7895 ( -2.40%)      0.7750 ( -0.52%)
Elapsed    43      0.7760 (  0.00%)      0.7890 ( -1.68%)      1.4025 (-80.73%)      0.7660 (  1.29%)      0.7815 ( -0.71%)      0.7855 ( -1.22%)      0.7715 (  0.58%)      0.7800 ( -0.52%)
Elapsed    44      0.7620 (  0.00%)      0.7560 (  0.79%)      1.0490 (-37.66%)      0.7650 ( -0.39%)      0.7580 (  0.52%)      0.7635 ( -0.20%)      0.7630 ( -0.13%)      0.7575 (  0.59%)
Elapsed    45      0.7685 (  0.00%)      0.7555 (  1.69%)      1.0340 (-34.55%)      0.7520 (  2.15%)      0.7745 ( -0.78%)      0.7725 ( -0.52%)      0.7685 ( -0.00%)      0.7715 ( -0.39%)
Elapsed    46      0.7675 (  0.00%)      0.7670 (  0.07%)      1.0510 (-36.94%)      0.7530 (  1.89%)      0.7755 ( -1.04%)      0.7790 ( -1.50%)      0.7565 (  1.43%)      0.7690 ( -0.20%)
Elapsed    47      0.7655 (  0.00%)      0.7615 (  0.52%)      0.9980 (-30.37%)      0.7595 (  0.78%)      0.7555 (  1.31%)      0.7580 (  0.98%)      0.7485 (  2.22%)      0.7515 (  1.83%)
Elapsed    48      0.7565 (  0.00%)      0.7665 ( -1.32%)      1.0145 (-34.10%)      0.7550 (  0.20%)      0.7585 ( -0.26%)      0.7675 ( -1.45%)      0.7555 (  0.13%)      0.7630 ( -0.86%)
Elapsed    5       1.8960 (  0.00%)      1.9495 ( -2.82%)      2.1470 (-13.24%)      1.9300 ( -1.79%)      1.9435 ( -2.51%)      1.9385 ( -2.24%)      1.9345 ( -2.03%)      1.9715 ( -3.98%)
Elapsed    6       1.6435 (  0.00%)      1.6235 (  1.22%)      1.8870 (-14.82%)      1.6450 ( -0.09%)      1.6465 ( -0.18%)      1.6780 ( -2.10%)      1.6330 (  0.64%)      1.6335 (  0.61%)
Elapsed    7       1.4235 (  0.00%)      1.4280 ( -0.32%)      1.5900 (-11.70%)      1.4330 ( -0.67%)      1.4540 ( -2.14%)      1.4400 ( -1.16%)      1.4385 ( -1.05%)      1.4440 ( -1.44%)
Elapsed    8       1.2430 (  0.00%)      1.2560 ( -1.05%)      1.6855 (-35.60%)      1.2615 ( -1.49%)      1.2450 ( -0.16%)      1.2345 (  0.68%)      1.2455 ( -0.20%)      1.2505 ( -0.60%)
Elapsed    9       1.1960 (  0.00%)      1.1740 (  1.84%)      1.4805 (-23.79%)      1.1890 (  0.59%)      1.2175 ( -1.80%)      1.1960 (  0.00%)      1.2135 ( -1.46%)      1.1960 (  0.00%)

autonuma severely regresses elapsed time again.

Faults/cpu 1  379172.8815 (  0.00%) 366916.6749 ( -3.23%) 390865.4622 (  3.08%) 375062.6065 ( -1.08%) 344398.2334 ( -9.17%) 346316.7026 ( -8.67%) 342593.7671 ( -9.65%) 343478.2077 ( -9.41%)
Faults/cpu 10 323604.1493 (  0.00%) 325718.3813 (  0.65%) 258196.7306 (-20.21%) 321198.7019 ( -0.74%) 320441.2754 ( -0.98%) 324020.5574 (  0.13%) 320504.8353 ( -0.96%) 321797.4986 ( -0.56%)
Faults/cpu 11 316519.9065 (  0.00%) 317833.8690 (  0.42%) 243939.0320 (-22.93%) 320036.2189 (  1.11%) 315177.2430 ( -0.42%) 317178.5745 (  0.21%) 317748.8098 (  0.39%) 313869.3320 ( -0.84%)
Faults/cpu 12 314628.1208 (  0.00%) 325383.9891 (  3.42%) 230412.8991 (-26.77%) 315907.2186 (  0.41%) 318360.5441 (  1.19%) 318227.8914 (  1.14%) 320184.1807 (  1.77%) 313449.8158 ( -0.37%)
Faults/cpu 13 298592.4770 (  0.00%) 297979.0298 ( -0.21%) 220614.4681 (-26.12%) 294750.6269 ( -1.29%) 299824.4052 (  0.41%) 298329.0947 ( -0.09%) 300100.8660 (  0.51%) 299793.3858 (  0.40%)
Faults/cpu 14 288928.7491 (  0.00%) 290144.9339 (  0.42%) 223268.2541 (-22.73%) 289070.0442 (  0.05%) 291401.9587 (  0.86%) 289055.8228 (  0.04%) 289312.9895 (  0.13%) 288960.1284 (  0.01%)
Faults/cpu 15 288473.0724 (  0.00%) 285993.7526 ( -0.86%) 207111.4074 (-28.20%) 288432.1628 ( -0.01%) 284684.3744 ( -1.31%) 284659.4118 ( -1.32%) 279829.5199 ( -3.00%) 285039.1880 ( -1.19%)
Faults/cpu 16 273522.3548 (  0.00%) 276017.7447 (  0.91%) 203116.4171 (-25.74%) 273046.0156 ( -0.17%) 272635.1353 ( -0.32%) 275644.9327 (  0.78%) 273495.7977 ( -0.01%) 281660.0911 (  2.98%)
Faults/cpu 17 260585.8755 (  0.00%) 258319.0870 ( -0.87%) 197740.2319 (-24.12%) 259092.8477 ( -0.57%) 258662.7866 ( -0.74%) 260215.8817 ( -0.14%) 260070.3928 ( -0.20%) 261341.7127 (  0.29%)
Faults/cpu 18 249689.2296 (  0.00%) 249968.8075 (  0.11%) 182930.9296 (-26.74%) 246603.2969 ( -1.24%) 248314.5609 ( -0.55%) 245716.2923 ( -1.59%) 248976.3339 ( -0.29%) 248073.9163 ( -0.65%)
Faults/cpu 19 240750.3659 (  0.00%) 240789.6353 (  0.02%) 188771.5218 (-21.59%) 238684.0390 ( -0.86%) 241604.9375 (  0.35%) 238763.3031 ( -0.83%) 238906.5567 ( -0.77%) 242483.4568 (  0.72%)
Faults/cpu 2  381421.0777 (  0.00%) 372376.1903 ( -2.37%) 377934.7123 ( -0.91%) 377273.1157 ( -1.09%) 363045.3746 ( -4.82%) 361409.2242 ( -5.25%) 363834.2320 ( -4.61%) 360820.2154 ( -5.40%)
Faults/cpu 20 236477.3449 (  0.00%) 236457.1431 ( -0.01%) 157382.2381 (-33.45%) 236200.5157 ( -0.12%) 230414.6321 ( -2.56%) 236756.7760 (  0.12%) 231175.6010 ( -2.24%) 236282.0209 ( -0.08%)
Faults/cpu 21 221441.8099 (  0.00%) 219259.4023 ( -0.99%) 170488.7764 (-23.01%) 221341.7555 ( -0.05%) 221960.5560 (  0.23%) 222477.2287 (  0.47%) 221624.7875 (  0.08%) 222009.8368 (  0.26%)
Faults/cpu 22 214404.4785 (  0.00%) 211526.3457 ( -1.34%) 180914.1781 (-15.62%) 214533.1781 (  0.06%) 213834.6982 ( -0.27%) 211470.0826 ( -1.37%) 214114.3111 ( -0.14%) 215667.1744 (  0.59%)
Faults/cpu 23 207965.3607 (  0.00%) 206622.5181 ( -0.65%) 164690.9435 (-20.81%) 207352.5344 ( -0.29%) 207596.1677 ( -0.18%) 207356.1440 ( -0.29%) 205885.1082 ( -1.00%) 207364.0749 ( -0.29%)
Faults/cpu 24 201603.7131 (  0.00%) 200366.9546 ( -0.61%) 164329.3135 (-18.49%) 201995.6658 (  0.19%) 201208.3329 ( -0.20%) 201058.1035 ( -0.27%) 200970.7206 ( -0.31%) 201088.4438 ( -0.26%)
Faults/cpu 25 192982.6559 (  0.00%) 190032.9728 ( -1.53%) 165277.5793 (-14.36%) 192529.6572 ( -0.23%) 192126.4710 ( -0.44%) 192693.8739 ( -0.15%) 192660.3384 ( -0.17%) 193155.8498 (  0.09%)
Faults/cpu 26 186311.3491 (  0.00%) 184369.8079 ( -1.04%) 164639.6026 (-11.63%) 184378.1722 ( -1.04%) 184081.5062 ( -1.20%) 186415.4759 (  0.06%) 184932.4077 ( -0.74%) 187163.1699 (  0.46%)
Faults/cpu 27 181102.8388 (  0.00%) 179639.8924 ( -0.81%) 148296.2637 (-18.11%) 181244.6370 (  0.08%) 179313.3665 ( -0.99%) 181360.8913 (  0.14%) 180856.6142 ( -0.14%) 181653.5192 (  0.30%)
Faults/cpu 28 175945.1131 (  0.00%) 175424.0244 ( -0.30%) 150059.6742 (-14.71%) 176154.8663 (  0.12%) 176295.5024 (  0.20%) 176521.8006 (  0.33%) 175800.4190 ( -0.08%) 175615.2235 ( -0.19%)
Faults/cpu 29 169967.6343 (  0.00%) 168866.6936 ( -0.65%) 151317.4262 (-10.97%) 169771.0342 ( -0.12%) 169147.7679 ( -0.48%) 167103.7776 ( -1.68%) 169676.5522 ( -0.17%) 170481.0704 (  0.30%)
Faults/cpu 3  374694.3435 (  0.00%) 370320.2762 ( -1.17%) 360488.0454 ( -3.79%) 372186.8749 ( -0.67%) 366227.3210 ( -2.26%) 365202.5855 ( -2.53%) 366384.7088 ( -2.22%) 366173.4018 ( -2.27%)
Faults/cpu 30 164584.3989 (  0.00%) 163602.5225 ( -0.60%) 150945.3476 ( -8.29%) 164561.4904 ( -0.01%) 164276.7060 ( -0.19%) 162630.9288 ( -1.19%) 164498.4388 ( -0.05%) 164931.7620 (  0.21%)
Faults/cpu 31 160285.7571 (  0.00%) 159258.2232 ( -0.64%) 146029.2600 ( -8.89%) 159189.9619 ( -0.68%) 159106.5543 ( -0.74%) 159592.9926 ( -0.43%) 160311.9449 (  0.02%) 159957.6423 ( -0.20%)
Faults/cpu 32 155660.9421 (  0.00%) 155903.5774 (  0.16%) 142990.8439 ( -8.14%) 156425.1204 (  0.49%) 155383.8761 ( -0.18%) 153457.0020 ( -1.42%) 154651.0878 ( -0.65%) 153053.4511 ( -1.68%)
Faults/cpu 33 150131.4250 (  0.00%) 148205.0207 ( -1.28%) 133302.3457 (-11.21%) 148901.5659 ( -0.82%) 149606.0924 ( -0.35%) 150268.8294 (  0.09%) 149823.9355 ( -0.20%) 150523.0778 (  0.26%)
Faults/cpu 34 145610.6963 (  0.00%) 145687.3000 (  0.05%) 136800.0015 ( -6.05%) 144857.3713 ( -0.52%) 145742.6888 (  0.09%) 145700.6362 (  0.06%) 145889.6267 (  0.19%) 144569.4039 ( -0.72%)
Faults/cpu 35 141430.9804 (  0.00%) 140922.1273 ( -0.36%) 130391.6158 ( -7.81%) 141342.0054 ( -0.06%) 141339.5519 ( -0.06%) 140456.5555 ( -0.69%) 141719.6577 (  0.20%) 140756.6821 ( -0.48%)
Faults/cpu 36 138008.0201 (  0.00%) 137585.8607 ( -0.31%) 129745.2949 ( -5.99%) 136505.1727 ( -1.09%) 135117.1746 ( -2.09%) 137327.9664 ( -0.49%) 136222.3458 ( -1.29%) 137303.4438 ( -0.51%)
Faults/cpu 37 132858.3984 (  0.00%) 133274.9207 (  0.31%) 120230.8912 ( -9.50%) 133480.7639 (  0.47%) 133137.9432 (  0.21%) 133190.1657 (  0.25%) 132715.4547 ( -0.11%) 133098.7368 (  0.18%)
Faults/cpu 38 129215.3699 (  0.00%) 129181.6598 ( -0.03%) 121015.4440 ( -6.35%) 129457.7455 (  0.19%) 128772.2298 ( -0.34%) 129008.2496 ( -0.16%) 129241.0413 (  0.02%) 129763.2194 (  0.42%)
Faults/cpu 39 126052.2877 (  0.00%) 126028.8097 ( -0.02%) 122698.7134 ( -2.66%) 125348.3350 ( -0.56%) 125239.3263 ( -0.64%) 125155.4174 ( -0.71%) 126122.8165 (  0.06%) 126236.0479 (  0.15%)
Faults/cpu 4  370240.6003 (  0.00%) 366062.3825 ( -1.13%) 338351.9994 ( -8.61%) 369361.7956 ( -0.24%) 363663.5136 ( -1.78%) 362702.0305 ( -2.04%) 365453.3980 ( -1.29%) 365703.0736 ( -1.23%)
Faults/cpu 40 122435.0743 (  0.00%) 122651.8912 (  0.18%) 122193.1722 ( -0.20%) 123412.4231 (  0.80%) 122927.1480 (  0.40%) 122733.6456 (  0.24%) 121857.1499 ( -0.47%) 122625.6619 (  0.16%)
Faults/cpu 41 119399.1440 (  0.00%) 118978.1414 ( -0.35%) 118572.7506 ( -0.69%) 119004.8027 ( -0.33%) 118774.6690 ( -0.52%) 118295.3548 ( -0.92%) 119164.5404 ( -0.20%) 118799.4351 ( -0.50%)
Faults/cpu 42 115875.2469 (  0.00%) 115784.7633 ( -0.08%) 117239.6976 (  1.18%) 116017.0993 (  0.12%) 115713.0031 ( -0.14%) 115567.8722 ( -0.27%) 115062.3753 ( -0.70%) 115763.2412 ( -0.10%)
Faults/cpu 43 112999.9277 (  0.00%) 111988.1086 ( -0.90%) 111327.6586 ( -1.48%) 113200.3248 (  0.18%) 112383.9303 ( -0.55%) 112818.6976 ( -0.16%) 113141.1904 (  0.13%) 113210.6053 (  0.19%)
Faults/cpu 44 110740.1358 (  0.00%) 109819.0161 ( -0.83%) 113847.0028 (  2.81%) 110740.2330 (  0.00%) 110362.8114 ( -0.34%) 110357.7976 ( -0.35%) 110360.7290 ( -0.34%) 110439.5280 ( -0.27%)
Faults/cpu 45 108055.0128 (  0.00%) 107074.4961 ( -0.91%) 110522.1286 (  2.28%) 108136.1353 (  0.08%) 107516.7357 ( -0.50%) 107489.4668 ( -0.52%) 107458.7411 ( -0.55%) 107539.5220 ( -0.48%)
Faults/cpu 46 105092.0969 (  0.00%) 104611.8746 ( -0.46%) 108951.8847 (  3.67%) 105424.3434 (  0.32%) 104985.6988 ( -0.10%) 104923.7530 ( -0.16%) 105117.4772 (  0.02%) 105350.0949 (  0.25%)
Faults/cpu 47 103113.8932 (  0.00%) 102053.7667 ( -1.03%) 107375.4080 (  4.13%) 102964.4128 ( -0.14%) 102863.3429 ( -0.24%) 102859.6629 ( -0.25%) 102949.6548 ( -0.16%) 103048.4885 ( -0.06%)
Faults/cpu 48 100888.0329 (  0.00%)  99814.7029 ( -1.06%) 106438.5374 (  5.50%) 100824.3418 ( -0.06%) 100889.5606 (  0.00%) 101325.3091 (  0.43%) 100501.1339 ( -0.38%) 100997.9388 (  0.11%)
Faults/cpu 5  358736.2684 (  0.00%) 350166.8442 ( -2.39%) 321329.4860 (-10.43%) 353953.4999 ( -1.33%) 350121.1564 ( -2.40%) 352131.6663 ( -1.84%) 352003.8687 ( -1.88%) 349701.4718 ( -2.52%)
Faults/cpu 6  346831.9903 (  0.00%) 347879.3546 (  0.30%) 309546.5744 (-10.75%) 346268.1822 ( -0.16%) 343589.6859 ( -0.93%) 339778.4001 ( -2.03%) 345688.1163 ( -0.33%) 345496.5812 ( -0.39%)
Faults/cpu 7  344501.0568 (  0.00%) 343488.4476 ( -0.29%) 309529.8566 (-10.15%) 341283.5598 ( -0.93%) 336374.8748 ( -2.36%) 337359.2856 ( -2.07%) 339749.2184 ( -1.38%) 338309.3052 ( -1.80%)
Faults/cpu 8  346174.6199 (  0.00%) 343129.3801 ( -0.88%) 278357.0642 (-19.59%) 342992.8225 ( -0.92%) 344325.4495 ( -0.53%) 345312.7516 ( -0.25%) 344172.0340 ( -0.58%) 342254.9877 ( -1.13%)
Faults/cpu 9  328009.1948 (  0.00%) 328338.6341 (  0.10%) 271264.4886 (-17.30%) 328555.3156 (  0.17%) 323107.2113 ( -1.49%) 327109.9752 ( -0.27%) 324555.4670 ( -1.05%) 325690.7369 ( -0.71%)

Same story for the faults/cpu figures.

Faults/sec 1  378449.1966 (  0.00%) 366185.0651 ( -3.24%) 390125.0553 (  3.09%) 374336.9521 ( -1.09%) 343747.2006 ( -9.17%) 345658.1953 ( -8.66%) 341940.4609 ( -9.65%) 342823.8216 ( -9.41%)
Faults/sec 103057941.3484 (  0.00%)3086181.0691 (  0.92%)2409975.1283 (-21.19%)3028074.8136 ( -0.98%)3029815.4380 ( -0.92%)3064096.7082 (  0.20%)3003064.6477 ( -1.79%)3033433.7765 ( -0.80%)
Faults/sec 113261713.5772 (  0.00%)3294466.8662 (  1.00%)2469654.8367 (-24.28%)3294208.5261 (  1.00%)3257973.1112 ( -0.11%)3278781.5108 (  0.52%)3312117.8594 (  1.55%)3246393.6600 ( -0.47%)
Faults/sec 123536124.1581 (  0.00%)3797714.3716 (  7.40%)2476800.0595 (-29.96%)3556102.7891 (  0.56%)3608809.1736 (  2.06%)3611977.9089 (  2.15%)3683380.5918 (  4.16%)3523474.7009 ( -0.36%)
Faults/sec 133522134.0988 (  0.00%)3548599.2386 (  0.75%)2446087.8125 (-30.55%)3501014.1288 ( -0.60%)3557311.6279 (  1.00%)3522629.3381 (  0.01%)3593101.6671 (  2.01%)3585947.0037 (  1.81%)
Faults/sec 143741399.7322 (  0.00%)3756052.7614 (  0.39%)2619922.1868 (-29.97%)3727293.7004 ( -0.38%)3756884.1995 (  0.41%)3733097.2019 ( -0.22%)3719147.6690 ( -0.59%)3760976.2567 (  0.52%)
Faults/sec 154049357.6044 (  0.00%)4033305.0846 ( -0.40%)2457803.8113 (-39.30%)4059688.2409 (  0.26%)3935065.7173 ( -2.82%)3940383.4551 ( -2.69%)3928630.3240 ( -2.98%)4021120.7646 ( -0.70%)
Faults/sec 164097157.8975 (  0.00%)4217099.0091 (  2.93%)2455395.0021 (-40.07%)4148897.7305 (  1.26%)4079041.4563 ( -0.44%)4254698.5340 (  3.85%)4101174.5232 (  0.10%)4392298.9242 (  7.20%)
Faults/sec 173950363.6366 (  0.00%)3965738.5936 (  0.39%)2464431.7234 (-37.62%)4000612.2796 (  1.27%)3939263.8259 ( -0.28%)3971590.5182 (  0.54%)3992282.1709 (  1.06%)4021605.0297 (  1.80%)
Faults/sec 184103861.4744 (  0.00%)4134124.3588 (  0.74%)2344019.0093 (-42.88%)4110297.5479 (  0.16%)4128836.6183 (  0.61%)4111127.7293 (  0.18%)4103327.8307 ( -0.01%)4116154.7820 (  0.30%)
Faults/sec 194276053.9277 (  0.00%)4332901.1932 (  1.33%)2497192.6587 (-41.60%)4294323.6430 (  0.43%)4307490.7887 (  0.74%)4301598.1106 (  0.60%)4260088.0090 ( -0.37%)4263646.0497 ( -0.29%)
Faults/sec 2  753370.4894 (  0.00%) 731232.6896 ( -2.94%) 739970.8294 ( -1.78%) 744801.6031 ( -1.14%) 719833.6148 ( -4.45%) 712972.4722 ( -5.36%) 719078.6344 ( -4.55%) 712311.6191 ( -5.45%)
Faults/sec 204516401.2478 (  0.00%)4579569.6777 (  1.40%)2031136.9372 (-55.03%)4494195.4045 ( -0.49%)4346639.2787 ( -3.76%)4534616.2385 (  0.40%)4317538.8462 ( -4.40%)4547820.8464 (  0.70%)
Faults/sec 214087026.2936 (  0.00%)4068731.0651 ( -0.45%)2377079.8174 (-41.84%)4123663.5132 (  0.90%)4127628.7535 (  0.99%)4169241.3553 (  2.01%)4188156.6912 (  2.47%)4118111.1871 (  0.76%)
Faults/sec 224211993.0582 (  0.00%)4204705.8000 ( -0.17%)2591796.8264 (-38.47%)4227135.0022 (  0.36%)4232124.1497 (  0.48%)4164897.9655 ( -1.12%)4266884.4798 (  1.30%)4334746.2025 (  2.91%)
Faults/sec 234387001.3207 (  0.00%)4401482.1147 (  0.33%)2405782.9280 (-45.16%)4367309.2063 ( -0.45%)4385823.7866 ( -0.03%)4337430.0302 ( -1.13%)4262380.6674 ( -2.84%)4377168.7657 ( -0.22%)
Faults/sec 244470740.3575 (  0.00%)4515720.6916 (  1.01%)2418614.6959 (-45.90%)4423814.9355 ( -1.05%)4450908.8746 ( -0.44%)4295205.4302 ( -3.93%)4412036.4296 ( -1.31%)4421135.3125 ( -1.11%)
Faults/sec 253937957.1532 (  0.00%)3809703.8550 ( -3.26%)2439820.5747 (-38.04%)3925622.9027 ( -0.31%)3904169.7898 ( -0.86%)3942084.3654 (  0.10%)4004341.9529 (  1.69%)3988732.9759 (  1.29%)
Faults/sec 264016798.7416 (  0.00%)3934975.7249 ( -2.04%)2606188.8779 (-35.12%)3950491.6949 ( -1.65%)3939682.4112 ( -1.92%)4029254.7256 (  0.31%)4024343.2095 (  0.19%)4109742.7410 (  2.31%)
Faults/sec 274086906.0951 (  0.00%)4097020.1548 (  0.25%)2304217.5641 (-43.62%)4089202.5746 (  0.06%)3992381.3302 ( -2.31%)4094242.0021 (  0.18%)4088343.4903 (  0.04%)4133544.6102 (  1.14%)
Faults/sec 284167197.2625 (  0.00%)4244831.3753 (  1.86%)2437237.9013 (-41.51%)4174417.1522 (  0.17%)4149484.5269 ( -0.43%)4175318.3689 (  0.19%)4150715.1941 ( -0.40%)4195704.1843 (  0.68%)
Faults/sec 294084242.8855 (  0.00%)4133645.1554 (  1.21%)2648559.3047 (-35.15%)4101901.9212 (  0.43%)4034820.5255 ( -1.21%)4052956.8133 ( -0.77%)4113191.3654 (  0.71%)4177020.1760 (  2.27%)
Faults/sec 3 1109918.7564 (  0.00%)1096015.2945 ( -1.25%)1065097.6375 ( -4.04%)1102922.5624 ( -0.63%)1089698.3673 ( -1.82%)1084060.8798 ( -2.33%)1085531.7843 ( -2.20%)1088643.6050 ( -1.92%)
Faults/sec 304151304.2546 (  0.00%)4158133.0937 (  0.16%)2683984.9308 (-35.35%)4137216.5074 ( -0.34%)4114545.5170 ( -0.89%)4103882.9041 ( -1.14%)4127238.4153 ( -0.58%)4162033.7762 (  0.26%)
Faults/sec 314191217.6158 (  0.00%)4243033.9025 (  1.24%)2676235.5259 (-36.15%)4137476.7286 ( -1.28%)4186115.3991 ( -0.12%)4140639.1014 ( -1.21%)4233785.2070 (  1.02%)4151948.3967 ( -0.94%)
Faults/sec 324217316.7870 (  0.00%)4365955.9624 (  3.52%)2759893.2658 (-34.56%)4243468.3822 (  0.62%)4209952.8113 ( -0.17%)4202827.3816 ( -0.34%)4173003.7934 ( -1.05%)4153518.1495 ( -1.51%)
Faults/sec 334194428.3071 (  0.00%)4211766.2074 (  0.41%)2605358.9387 (-37.89%)4167864.3697 ( -0.63%)4126534.7061 ( -1.62%)4219846.7104 (  0.61%)4173198.7504 ( -0.51%)4222243.7872 (  0.66%)
Faults/sec 344162964.6453 (  0.00%)4271808.7841 (  2.61%)2812057.1083 (-32.45%)4130364.6776 ( -0.78%)4265449.5564 (  2.46%)4160577.0467 ( -0.06%)4219222.5480 (  1.35%)4130231.2312 ( -0.79%)
Faults/sec 354161390.4401 (  0.00%)4234213.1345 (  1.75%)2607427.3999 (-37.34%)4161469.3984 (  0.00%)4238638.5058 (  1.86%)4185086.8231 (  0.57%)4233991.6586 (  1.74%)4156311.6986 ( -0.12%)
Faults/sec 364309141.4009 (  0.00%)4344553.7007 (  0.82%)2823543.9475 (-34.48%)4122760.9818 ( -4.33%)4035348.2080 ( -6.35%)4254890.0613 ( -1.26%)4241646.1233 ( -1.57%)4236136.0794 ( -1.69%)
Faults/sec 374122489.8845 (  0.00%)4278165.5642 (  3.78%)2421649.8216 (-41.26%)4239905.9435 (  2.85%)4213085.0811 (  2.20%)4198176.2146 (  1.84%)4160112.1225 (  0.91%)4256401.1564 (  3.25%)
Faults/sec 384163428.6206 (  0.00%)4265362.0071 (  2.45%)2665193.5296 (-35.99%)4194590.2131 (  0.75%)4219610.6125 (  1.35%)4242228.8659 (  1.89%)4214551.3392 (  1.23%)4340591.0193 (  4.26%)
Faults/sec 394226045.5491 (  0.00%)4321204.5745 (  2.25%)2936765.1552 (-30.51%)4222799.9685 ( -0.08%)4203297.8809 ( -0.54%)4208847.4974 ( -0.41%)4269451.3378 (  1.03%)4299557.6689 (  1.74%)
Faults/sec 4 1455428.2533 (  0.00%)1441908.4873 ( -0.93%)1325768.8684 ( -8.91%)1455562.4808 (  0.01%)1432189.8288 ( -1.60%)1425444.8766 ( -2.06%)1436425.1946 ( -1.31%)1441507.8962 ( -0.96%)
Faults/sec 404214077.2062 (  0.00%)4360652.1975 (  3.48%)2888662.5408 (-31.45%)4319445.3297 (  2.50%)4337024.7790 (  2.92%)4272911.1494 (  1.40%)4222006.9619 (  0.19%)4261499.5141 (  1.13%)
Faults/sec 414291578.8817 (  0.00%)4288112.8915 ( -0.08%)3058310.1247 (-28.74%)4302334.8101 (  0.25%)4204185.8980 ( -2.04%)4196357.7441 ( -2.22%)4312545.3593 (  0.49%)4237388.6213 ( -1.26%)
Faults/sec 424293722.5431 (  0.00%)4346426.1052 (  1.23%)2952124.2363 (-31.25%)4264837.4512 ( -0.67%)4219652.2680 ( -1.73%)4157078.8993 ( -3.18%)4185778.6339 ( -2.51%)4267668.8611 ( -0.61%)
Faults/sec 434260488.0795 (  0.00%)4190510.2320 ( -1.64%)2805833.3144 (-34.14%)4319181.6684 (  1.38%)4224080.6393 ( -0.85%)4209214.3757 ( -1.20%)4281123.5734 (  0.48%)4234533.2835 ( -0.61%)
Faults/sec 444341635.8888 (  0.00%)4371324.7434 (  0.68%)3197034.6773 (-26.36%)4325882.5323 ( -0.36%)4358300.5382 (  0.38%)4322866.2609 ( -0.43%)4334350.3160 ( -0.17%)4355499.3787 (  0.32%)
Faults/sec 454306175.0810 (  0.00%)4366968.0190 (  1.41%)3207825.3608 (-25.51%)4395367.0756 (  2.07%)4267770.3357 ( -0.89%)4280666.3010 ( -0.59%)4294599.8207 ( -0.27%)4281532.0883 ( -0.57%)
Faults/sec 464307792.6910 (  0.00%)4315328.0769 (  0.17%)3205375.1650 (-25.59%)4385402.7649 (  1.80%)4251443.9562 ( -1.31%)4241371.4893 ( -1.54%)4366618.7751 (  1.37%)4295960.6074 ( -0.27%)
Faults/sec 474321312.4226 (  0.00%)4341256.5948 (  0.46%)3319058.0014 (-23.19%)4358965.6574 (  0.87%)4365015.3672 (  1.01%)4357240.4439 (  0.83%)4405570.9439 (  1.95%)4386073.9254 (  1.50%)
Faults/sec 484366778.2107 (  0.00%)4322882.0123 ( -1.01%)3273802.9460 (-25.03%)4373380.1336 (  0.15%)4343783.8913 ( -0.53%)4297699.9587 ( -1.58%)4365676.1788 ( -0.03%)4328938.3135 ( -0.87%)
Faults/sec 5 1743122.2118 (  0.00%)1695802.2374 ( -2.71%)1567664.0733 (-10.07%)1713142.7753 ( -1.72%)1697865.9386 ( -2.60%)1701956.0663 ( -2.36%)1706302.3236 ( -2.11%)1675439.2691 ( -3.88%)
Faults/sec 6 2011718.4281 (  0.00%)2035980.9355 (  1.21%)1794156.8192 (-10.81%)2008929.2145 ( -0.14%)2003938.3687 ( -0.39%)1967048.8061 ( -2.22%)2018971.6464 (  0.36%)2018509.6143 (  0.34%)
Faults/sec 7 2324635.7987 (  0.00%)2315685.9962 ( -0.38%)2085519.9592 (-10.29%)2308763.4444 ( -0.68%)2271854.7036 ( -2.27%)2294688.0258 ( -1.29%)2295619.0940 ( -1.25%)2286072.4376 ( -1.66%)
Faults/sec 8 2662426.3046 (  0.00%)2634764.5080 ( -1.04%)2116672.9963 (-20.50%)2620493.5231 ( -1.57%)2651049.1842 ( -0.43%)2673473.8717 (  0.41%)2652192.7971 ( -0.38%)2638043.5422 ( -0.92%)
Faults/sec 9 2769511.8307 (  0.00%)2820315.9462 (  1.83%)2290888.7061 (-17.28%)2781974.0535 (  0.45%)2716619.0152 ( -1.91%)2759963.7524 ( -0.34%)2720367.4176 ( -1.77%)2763007.0284 ( -0.23%)

Same story for the faults/sec figures.

I don't know why autonuma has such a major impact here.

MMTests Statistics: duration (seconds)
Kernel (3.7.0)            User     System    Elapsed
rc4-stats-v2r34        1067.83   18678.00    1348.03
rc4-schednuma-v2r3      986.85   18730.98    1355.33
rc4-autonuma-v28fast   1256.40   22240.00    1857.83
rc4-moron-v3r27        1077.54   18708.24    1352.56
rc4-twostage-v3r27     1088.63   18741.07    1376.40
rc4-lessflush-v3r27    1092.13   18742.99    1375.49
rc4-cpuone-v3r27       1082.83   18797.77    1380.79
rc4-adaptscan-v3r27    1064.14   18776.84    1375.69

autonuma has high system CPU usage and the elapsed time is high. Based
on the earlier results this is not a surprise.

Like the other tests, this was all driven from mmtests using
configs/config-global-dhp__pagealloc-performance with the pagealloc
test removed. pagealloc is a page allocator microbenchmark that depends
on systemtap, which is broken for 3.7 kernels.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 14:41     ` Mel Gorman
@ 2012-11-16 15:32       ` Linus Torvalds
  2012-11-16 16:08         ` Ingo Molnar
  2012-11-16 16:19         ` Mel Gorman
  0 siblings, 2 replies; 62+ messages in thread
From: Linus Torvalds @ 2012-11-16 15:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Peter Zijlstra, Andrea Arcangeli, Ingo Molnar,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Andrew Morton,
	Linux-MM, LKML

On Fri, Nov 16, 2012 at 6:41 AM, Mel Gorman <mgorman@suse.de> wrote:
>
> I would have preferred asm-generic/pgtable.h myself and use
> __HAVE_ARCH_whatever tricks

PLEASE NO!

Dammit, why is this disease still so prevalent, and why do people
continue to do this crap?

__HAVE_ARCH_xyzzy is a f*cking retarded thing to do, and that's
actually an insult to retarded people.

Use either:

 - Kconfig entries for bigger features where that makes sense, and
using the Kconfig files allows you to use the Kconfig logic for things
(ie there are dependencies etc, so you can avoid having to have
complicated conditionals in the #ifdef's, and instead introduce them
as rules in Kconfig files).

 - the SAME F*CKING NAME for the #ifdef, not some totally different
namespace with __HAVE_ARCH_xyzzy crap.

So if your architecture wants to override one (or more) of the
pte_*numa() functions, just make it do so. And do it with

  static inline pmd_t pmd_mknuma(pmd_t pmd)
  {
          pmd = pmd_set_flags(pmd, _PAGE_NUMA);
          return pmd_clear_flags(pmd, _PAGE_PRESENT);
  }
  #define pmd_mknuma pmd_mknuma

and then you can have the generic code have code like

   #ifndef pmd_mknuma
   .. generic version goes here ..
   #endif

and the advantage is two-fold:

 - none of the "let's make up another name to test for this"

 - "git grep" actually _works_, and the end results make sense, and
you can clearly see the logic of where things are declared, and which
one is used.

The __ARCH_HAVE_xyzzy (and some places call it __HAVE_ARCH_xyzzy)
thing is a disease.

That said, the __weak thing works too (and greps fine, as long as you
use the proper K&R C format, not the idiotic "let's put the name of
the function on a different line than the type of the function"
format), it just doesn't allow inlining.

In this case, I suspect the inlined function is generally a single
instruction, is it not? In which case I really do think that inlining
makes sense.
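
For the record, the generic fallback side of that pattern would look
something like the sketch below. It is illustrative only and assumes the
architecture provides pte_set_flags()/pte_clear_flags() and a _PAGE_NUMA
software bit, which is true on x86 but not necessarily elsewhere:

  /* asm-generic side: used only if the arch did not override it by
   * providing its own helper plus "#define pte_mknuma pte_mknuma"
   */
  #ifndef pte_mknuma
  static inline pte_t pte_mknuma(pte_t pte)
  {
          /* flag the pte as a NUMA hinting fault candidate ... */
          pte = pte_set_flags(pte, _PAGE_NUMA);
          /* ... and clear present so the next access traps */
          return pte_clear_flags(pte, _PAGE_PRESENT);
  }
  #endif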

                      Linus

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 14/43] mm: mempolicy: Use _PAGE_NUMA to migrate pages
  2012-11-16 11:22 ` [PATCH 14/43] mm: mempolicy: Use _PAGE_NUMA to migrate pages Mel Gorman
@ 2012-11-16 16:08   ` Rik van Riel
  0 siblings, 0 replies; 62+ messages in thread
From: Rik van Riel @ 2012-11-16 16:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On 11/16/2012 06:22 AM, Mel Gorman wrote:
> Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
> 	sufficiently different that the signed-off-bys were dropped
>
> Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
> pieces into an effective migrate on fault scheme.
>
> Note that (on x86) we rely on PROT_NONE pages being !present and avoid
> the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
> page-migration performance.
>
> Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

(this is getting easier, I must have reviewed this code 4x now)
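
For anyone coming to this cold, the combined pieces boil down to roughly
the shape below. This is a sketch only; the real handler in the series
also takes the pte lock, updates the MMU cache, handles the huge page
case and accounts the faults. Only mpol_misplaced() and
migrate_misplaced_page() are named in the changelog above, the rest of
the helper is illustrative:

  static void numa_hinting_fault(struct vm_area_struct *vma,
                                 unsigned long addr, pte_t *ptep)
  {
          struct page *page;
          int target_nid;

          /* Clear the hinting state so the faulting access can proceed */
          set_pte_at(vma->vm_mm, addr, ptep, pte_mknonnuma(*ptep));

          page = vm_normal_page(vma, addr, *ptep);
          if (!page)
                  return;

          /* Ask the memory policy whether the page is on the wrong node */
          target_nid = mpol_misplaced(page, vma, addr);
          if (target_nid == -1)
                  return;

          /* Try to move the page towards the node referencing it */
          migrate_misplaced_page(page, target_nid);
  }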

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 15:32       ` Linus Torvalds
@ 2012-11-16 16:08         ` Ingo Molnar
  2012-11-16 16:56           ` Mel Gorman
  2012-11-16 16:19         ` Mel Gorman
  1 sibling, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2012-11-16 16:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Rik van Riel, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Andrew Morton,
	Linux-MM, LKML


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Nov 16, 2012 at 6:41 AM, Mel Gorman <mgorman@suse.de> wrote:
> >
> > I would have preferred asm-generic/pgtable.h myself and use
> > __HAVE_ARCH_whatever tricks
> 
> PLEASE NO!
> 
> Dammit, why is this disease still so prevalent, and why do people
> continue to do this crap?

Also, why is this done in a weird, roundabout way of first 
picking up a bad patch and then modifying it and making it even 
worse?

Why not use something what we have in numa/core already:

  f05ea0948708 mm/mpol: Create special PROT_NONE infrastructure

AFAICS, this portion of numa/core:

c740b1cccdcb x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
02743c9c03f1 mm/mpol: Use special PROT_NONE to migrate pages
b33467764d8a mm/migrate: Introduce migrate_misplaced_page()
db4aa58db59a numa, mm: Support NUMA hinting page faults from gup/gup_fast
ca2ea0747a5b mm/mpol: Add MPOL_MF_LAZY
f05ea0948708 mm/mpol: Create special PROT_NONE infrastructure
37081a3de2bf mm/mpol: Check for misplaced page
cd203e33c39d mm/mpol: Add MPOL_MF_NOOP
88f4670789e3 mm/mpol: Make MPOL_LOCAL a real policy
83babc0d2944 mm/pgprot: Move the pgprot_modify() fallback definition to mm.h
536165ead34b sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation
6fe64360a759 mm: Only flush the TLB when clearing an accessible pte
e9df40bfeb25 x86/mm: Introduce pte_accessible()
3f2b613771ec mm/thp: Preserve pgprot across huge page split
a5a608d83e0e sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390
995334a2ee83 sched, numa, mm: Describe the NUMA scheduling problem formally
7ee9d9209c57 sched, numa, mm: Make find_busiest_queue() a method
4fd98847ba5c x86/mm: Only do a local tlb flush in ptep_set_access_flags()
d24fc0571afb mm/generic: Only flush the local TLB in ptep_set_access_flags()

is a good foundation already with no WIP policy bits in it.

Mel, could you please work on this basis, or point out the bits 
you don't agree with so I can fix it?

Since I'm working on improving the policy bits I essentially 
need and have done all the 'foundation' work already - you might 
as well reuse it as-is instead of rebasing it?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 15:32       ` Linus Torvalds
  2012-11-16 16:08         ` Ingo Molnar
@ 2012-11-16 16:19         ` Mel Gorman
  1 sibling, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 16:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Peter Zijlstra, Andrea Arcangeli, Ingo Molnar,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Andrew Morton,
	Linux-MM, LKML

On Fri, Nov 16, 2012 at 07:32:01AM -0800, Linus Torvalds wrote:
> On Fri, Nov 16, 2012 at 6:41 AM, Mel Gorman <mgorman@suse.de> wrote:
> >
> > I would have preferred asm-generic/pgtable.h myself and use
> > __HAVE_ARCH_whatever tricks
> 
> PLEASE NO!
> 
> Dammit, why is this disease still so prevalent, and why do people
> continue to do this crap?
> 

By personal experience because they read the header, see the other examples
and say "fair enough". I'm tempted to...

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index da3e761..572d3f1 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -7,6 +7,12 @@
 #include <linux/mm_types.h>
 #include <linux/bug.h>
 
+/*
+ * NOTE: Do NOT copy the __HAVE_ARCH convention when adding new generic
+ * helpers. You will have to wear a D hat and be called names
+ * https://lkml.org/lkml/2012/11/16/340
+ */
+
 #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 extern int ptep_set_access_flags(struct vm_area_struct *vma,
 				 unsigned long address, pte_t *ptep,


> __HAVE_ARCH_xyzzy is a f*cking retarded thing to do, and that's
> actually an insult to retarded people.
> 
> Use either:
> 
>  - Kconfig entries for bigger features where that makes sense, and
> using the Kconfig files allows you to use the Kconfig logic for things
> (ie there are dependencies etc, so you can avoid having to have
> complicated conditionals in the #ifdef's, and instead introduce them
> as rules in Kconfig files).
> 
>  - the SAME F*CKING NAME for the #ifdef, not some totally different
> namespace with __HAVE_ARCH_xyzzy crap.
> 
> So if your architecture wants to override one (or more) of the
> pte_*numa() functions, just make it do so. And do it with
> 
>   static inline pmd_t pmd_mknuma(pmd_t pmd)
>   {
>           pmd = pmd_set_flags(pmd, _PAGE_NUMA);
>           return pmd_clear_flags(pmd, _PAGE_PRESENT);
>   }
>   #define pmd_mknuma pmd_mknuma
> 
> and then you can have the generic code have code like
> 
>    #ifndef pmd_mknuma
>    .. generic version goes here ..
>    #endif
> 
> and the advantage is two-fold:
> 
>  - none of the "let's make up another name to test for this"
> 
>  - "git grep" actually _works_, and the end results make sense, and
> you can clearly see the logic of where things are declared, and which
> one is used.
> 

Understood; it makes sense and is a straightforward conversion. Now that I
read this, the explanation feels familiar. Clearly it did not sink in
with me when you shouted at the last person that tried.
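
Roughly, and only as a sketch of the direction rather than the eventual
patch, the conversion would put the override in the arch header and guard
the generic fallback with the same name, mirroring your pmd_mknuma example:

/* arch/x86/include/asm/pgtable.h: x86 override */
static inline pte_t pte_mknuma(pte_t pte)
{
	pte = pte_set_flags(pte, _PAGE_NUMA);
	return pte_clear_flags(pte, _PAGE_PRESENT);
}
#define pte_mknuma pte_mknuma

/* include/asm-generic/pgtable.h: fallback under the same name */
#ifndef pte_mknuma
static inline pte_t pte_mknuma(pte_t pte)
{
	/* no-op fallback, for illustration only */
	return pte;
}
#endif

That way "git grep pte_mknuma" finds both the override and the fallback
and there is no second __HAVE_ARCH namespace to keep in sync.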

> The __ARCH_HAVE_xyzzy (and some places call it __HAVE_ARCH_xyzzy)
> thing is a disease.
> 

And now I have been healed! I've had worse starts to a weekend.

> That said, the __weak thing works too (and greps fine, as long as you
> use the proper K&R C format, not the idiotic "let's put the name of
> the function on a different line than the type of the function"
> format), it just doesn't allow inlining.
> 
> In this case, I suspect the inlined function is generally a single
> instruction, is it not? In which case I really do think that inlining
> makes sense.
> 

I would expect a single instruction for the checks (pte_numa, pmd_numa).
It's probably two for the setters (pte_mknuma, pmd_mknuma, pte_mknonnuma,
pmd_mknonnuma) unless paravirt gets involved; paravirt might add a
function call in there, but it should be nothing crazy.
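
For illustration only (not code from either tree; the exact flag juggling
is an assumption about what the x86 version would do), a setter would be
in the same spirit as your pmd_mknuma example:

static inline pte_t pte_mknonnuma(pte_t pte)
{
	/* clear the hinting bit and make the pte present again */
	pte = pte_clear_flags(pte, _PAGE_NUMA);
	return pte_set_flags(pte, _PAGE_PRESENT | _PAGE_ACCESSED);
}

so two flag operations in the simple case.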

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH 16/43] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now
  2012-11-16 11:22 ` [PATCH 16/43] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now Mel Gorman
@ 2012-11-16 16:22   ` Rik van Riel
  0 siblings, 0 replies; 62+ messages in thread
From: Rik van Riel @ 2012-11-16 16:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On 11/16/2012 06:22 AM, Mel Gorman wrote:
> The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to
> explicitly request lazy migration is a good idea but the actual
> API has not been well reviewed and once released we have to support it.
> For now this patch prevents an application using the services. This
> will need to be revisited.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 16:08         ` Ingo Molnar
@ 2012-11-16 16:56           ` Mel Gorman
  2012-11-16 17:12             ` Ingo Molnar
                               ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 16:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Andrew Morton,
	Linux-MM, LKML

On Fri, Nov 16, 2012 at 05:08:52PM +0100, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Fri, Nov 16, 2012 at 6:41 AM, Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > I would have preferred asm-generic/pgtable.h myself and use
> > > __HAVE_ARCH_whatever tricks
> > 
> > PLEASE NO!
> > 
> > Dammit, why is this disease still so prevalent, and why do people
> > continue to do this crap?
> 
> Also, why is this done in a weird, roundabout way of first 
> picking up a bad patch and then modifying it and making it even 
> worse?
> 

To keep the history so that someone scanning mails by subject alone may
spot it. I'll collapse the patches together in the next release because
the difference will no longer be interesting.

> Why not use something what we have in numa/core already:
> 
>   f05ea0948708 mm/mpol: Create special PROT_NONE infrastructure
> 

Because it's hard-coded to PROT_NONE underneath which I've complained about
before. It also depends on being able to use vm_get_page_prot(vmflags),
which will always be a function call and therefore, to me, always heavier.

> AFAICS, this portion of numa/core:
> 
> c740b1cccdcb x86/mm: Completely drop the TLB flush from ptep_set_access_flags()

We share this.

> 02743c9c03f1 mm/mpol: Use special PROT_NONE to migrate pages

hard-codes prot_none

> b33467764d8a mm/migrate: Introduce migrate_misplaced_page()

bolts onto the side of migration and introduces MIGRATE_FAULT which
should not have been necessary. Already complained about.

The alternative uses the existing migrate_pages() function but has
different requirements for taking a reference to the page.

> db4aa58db59a numa, mm: Support NUMA hinting page faults from gup/gup_fast

We share this.

> ca2ea0747a5b mm/mpol: Add MPOL_MF_LAZY

We more or less share this except I backed out the userspace visible bits
in a separate patch because I didn't think it had been carefully reviewed
how an application should use it and if it was a good idea. Covered in an
earlier review.

> f05ea0948708 mm/mpol: Create special PROT_NONE infrastructure

hard-codes to prot_none.

I know I should convert change_prot_numa to wrap around change_protection
if _PAGE_NUMA == _PAGE_PROTNONE, but I wanted to make sure we had all the
requirements for change_prot_none correct before adapting it.

Otherwise from a high-level we are very similar here.

> 37081a3de2bf mm/mpol: Check for misplaced page

Think we share this.

> cd203e33c39d mm/mpol: Add MPOL_MF_NOOP

I have a patch that backs this out on the grounds that I don't think we
have adequately discussed if it was the correct userspace interface. I
know Peter put a lot of time into it so it's probably correct but
without man pages or spending time writing an example program that used
it, I played safe.

> 88f4670789e3 mm/mpol: Make MPOL_LOCAL a real policy

We share this.

> 83babc0d2944 mm/pgprot: Move the pgprot_modify() fallback definition to mm.h

Related to prot_none hard-coding

> 536165ead34b sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation

Same

> 6fe64360a759 mm: Only flush the TLB when clearing an accessible pte

I missed this. Stupid stupid stupid! It would reduce the TLB flushes from
migration context.

> e9df40bfeb25 x86/mm: Introduce pte_accessible()

prot_none.

> 3f2b613771ec mm/thp: Preserve pgprot across huge page split

A lot more churn in there than is necessary, which was covered in review.
Otherwise we functionally share this.

> a5a608d83e0e sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390

prot_none choice

> 995334a2ee83 sched, numa, mm: Describe the NUMA scheduling problem formally

I like this. I didn't pick it up until the policy stuff was more
advanced so it could be properly described in the same fashion.

> 7ee9d9209c57 sched, numa, mm: Make find_busiest_queue() a method

We share this. I introduce it much later when it becomes required.

> 4fd98847ba5c x86/mm: Only do a local tlb flush in ptep_set_access_flags()
> d24fc0571afb mm/generic: Only flush the local TLB in ptep_set_access_flags()
> 

We share this.

> is a good foundation already with no WIP policy bits in it.
> 
> Mel, could you please work on this basis, or point out the bits 
> you don't agree with so I can fix it?
> 

My main hangup is the prot_none choice and I know it's something we have
butted heads on without progress. I feel it is a lot cleaner to have
the _PAGE_NUMA bit (even if it's PROT_NONE underneath) and the helpers
avoid function calls where possible. It also made the PMD handling sort of
straightforward and allowed the batched taking of the PTL and migration
if the pages in the PMD were all on the same node. I liked this.

Yours is closer to what the architecture does and can use change_protection()
with very few changes, but on balance I did not find this a compelling
alternative.

Further, I took the time to put together a basic policy instead of jumping
straight to the end so that the logical progression from beginning to end is
obvious. This, to me at least, is a more incremental approach. It also
allows us to keep a close eye on the system CPU usage and know which parts
of it might be due to the underlying mechanics and which are due to poor
placement policy decisions. The series is longer as a result but it is
more tractable and will be bisectable.

> Since I'm working on improving the policy bits I essentially 
> need and have done all the 'foundation' work already - you might 
> as well reuse it as-is instead of rebasing it?
> 

I really have hangups about the prot_none thing. I also have a hangup
that the "policy" bit is one big patch doing all the changes at once,
making it harder to figure out whether it is the load balancer changes,
the scanning machinery or the placement policy that is making the big
difference.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 16:56           ` Mel Gorman
@ 2012-11-16 17:12             ` Ingo Molnar
  2012-11-16 17:48               ` Mel Gorman
  2012-11-16 17:26             ` Rik van Riel
  2012-11-16 17:37             ` Ingo Molnar
  2 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2012-11-16 17:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Rik van Riel, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Andrew Morton,
	Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> > Why not use something what we have in numa/core already:
> > 
> >   f05ea0948708 mm/mpol: Create special PROT_NONE infrastructure
> > 
> 
> Because it's hard-coded to PROT_NONE underneath which I've 
> complained about before. [...]

To which I replied that this is the current generic 
implementation, the moment some different architecture comes 
around we can accommodate it - on a strictly as-needed basis.

It is *better* and cleaner to not expose random arch hooks but 
to let the core kernel modification be documented in the very 
patch where the architecture support makes use of it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 16:56           ` Mel Gorman
  2012-11-16 17:12             ` Ingo Molnar
@ 2012-11-16 17:26             ` Rik van Riel
  2012-11-16 17:37             ` Ingo Molnar
  2 siblings, 0 replies; 62+ messages in thread
From: Rik van Riel @ 2012-11-16 17:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Andrew Morton,
	Linux-MM, LKML

On 11/16/2012 11:56 AM, Mel Gorman wrote:

>> b33467764d8a mm/migrate: Introduce migrate_misplaced_page()
>
> bolts onto the side of migration and introduces MIGRATE_FAULT which
> should not have been necessary. Already complained about.
>
> The alternative uses the existing migrate_pages() function but has
> different requirements for taking a reference to the page.

Indeed, NACK to b33467764d8a

Mel's tree implements this in a much cleaner way.

>> ca2ea0747a5b mm/mpol: Add MPOL_MF_LAZY
>
> We more or less share this except I backed out the userspace visible bits
> in a separate patch because I didn't think it had been carefully reviewed
> how an application should use it and if it was a good idea. Covered in an
> earlier review.

Agreed, these bits should not be userspace visible, at least
not for now.

>> cd203e33c39d mm/mpol: Add MPOL_MF_NOOP
>
> I have a patch that backs this out on the grounds that I don't think we
> have adequately discussed if it was the correct userspace interface. I
> know Peter put a lot of time into it so it's probably correct but
> without man pages or spending time writing an example program that used
> it, I played safe.

Ditto.

>> 6fe64360a759 mm: Only flush the TLB when clearing an accessible pte
>
> I missed this. Stupid stupid stupid! It would reduce the TLB flushes from
> migration context.

However, Ingo's tree still incurs the double page fault for
migrated pages. Both trees could use a little improvement in
this area :)

>> e9df40bfeb25 x86/mm: Introduce pte_accessible()
>
> prot_none.

This one is x86 specific, and would work as well with Andrea's
_PAGE_NUMA as it would with _PAGE_PROTNONE.

>> is a good foundation already with no WIP policy bits in it.
>>
>> Mel, could you please work on this basis, or point out the bits
>> you don't agree with so I can fix it?
>>
>
> My main hangup is the prot_none choice and I know it's something we have
> butted heads on without progress. I feel it is a lot cleaner to have
> the _PAGE_NUMA bit (even if it's PROT_NONE underneath) and the helpers
> avoid function calls where possible.

I am pretty neutral on whether we use _PAGE_NUMA with _PAGE_PROTNONE
underneath, or the slightly higher overhead actual prot_none stuff.

I can live with whichever of these Linus ends up merging.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 16:56           ` Mel Gorman
  2012-11-16 17:12             ` Ingo Molnar
  2012-11-16 17:26             ` Rik van Riel
@ 2012-11-16 17:37             ` Ingo Molnar
  2012-11-16 18:44               ` Mel Gorman
  2 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2012-11-16 17:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Rik van Riel, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Andrew Morton,
	Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> > AFAICS, this portion of numa/core:
> > 
> > c740b1cccdcb x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
> 
> We share this.
> 
> > 02743c9c03f1 mm/mpol: Use special PROT_NONE to migrate pages
> 
> hard-codes prot_none

I prefer any arch support extensions to be done in the patch 
that adds that specific arch support.

That way we can consider the pros and cons of abstraction. Also 
see further below.

> > cd203e33c39d mm/mpol: Add MPOL_MF_NOOP
> 
> I have a patch that backs this out on the grounds that I don't 
> think we have adequately discussed if it was the correct 
> userspace interface. I know Peter put a lot of time into it so 
> it's probably correct but without man pages or spending time 
> writing an example program that used it, I played safe.

I'm fine with not exposing it to user-space.

> > Mel, could you please work on this basis, or point out the 
> > bits you don't agree with so I can fix it?
> 
> My main hangup is the prot_none choice and I know it's 
> something we have butted heads on without progress. [...]

It's the basic KISS concept - I think you are over-designing 
this. An architecture opts in to the new, generic code via 
doing:

  select ARCH_SUPPORTS_NUMA_BALANCING

... if it cannot enable that then it will extend the core code 
in *very* visible ways.

> [...] I feel it is a lot cleaner to have the _PAGE_NUMA bit 
> (even if it's PROT_NONE underneath) and the helpers avoid 
> function calls where possible. It also made the PMD handling 
> sort of straightforward and allowed the batched taking of the 
> PTL and migration if the pages in the PMD were all on the same 
> node. I liked this.
> 
> Yours is closer to what the architecture does and can use 
> change_protection() with very few changes, but on balance I did 
> not find this a compelling alternative.

IMO here you are on the wrong side of history as well.

For example reusing change_protection() *already* uncovered 
useful optimizations to the generic code:

   http://comments.gmane.org/gmane.linux.kernel.mm/89707

(regardless of how this particular change_protection() 
 optimization ends up looking.)

that optimization would not have happened with your open-coded 
change_protection() variant plain and simple.

So, to put it bluntly, you are not only doing a stupid thing, 
you are doing an actively harmful thing here...

If you fix that then most of the differences between your tree 
and numa/core disappear. You'll end up very close to:

  - rebasing numa/core pretty much as-is
  + add your migrate_displaced() function
  - remove the user-facing lazy migration facilities.
  + inline pte_numa()/pmd_numa() if you think it's beneficial

If that works for you I'll test and backmerge all such deltas 
quickly and we can move on.

Then you could hack whatever policy and instrumentation bits you 
want, on top of that agreed upon base.

Would that approach work for you?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 17:12             ` Ingo Molnar
@ 2012-11-16 17:48               ` Mel Gorman
  2012-11-16 18:04                 ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 17:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Andrew Morton,
	Linux-MM, LKML

On Fri, Nov 16, 2012 at 06:12:43PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > > Why not use something what we have in numa/core already:
> > > 
> > >   f05ea0948708 mm/mpol: Create special PROT_NONE infrastructure
> > > 
> > 
> > Because it's hard-coded to PROT_NONE underneath which I've 
> > complained about before. [...]
> 
> To which I replied that this is the current generic 
> implementation, the moment some different architecture comes 
> around we can accommodate it - on a strictly as-needed basis.
> 

To which I responded that a new architecture would have to retrofit and
then change callers like change_prot_none(), which is more churn than should
be necessary to add architecture support.

> It is *better* and cleaner to not expose random arch hooks but 
> to let the core kernel modification be documented in the very 
> patch where the architecture support makes use of it.
> 

And yours requires that arches define pmd_pgprot, so there are additional
hooks anyway.

That said, your approach just ends up being heavier. Take this simple
case of what we need for pte_numa.

+static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
+{
+       /*
+        * obtain PROT_NONE by removing READ|WRITE|EXEC privs
+        */
+       vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
+       return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
+}

...

+static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
+       /*
+        * For NUMA page faults, we use PROT_NONE ptes in VMAs with
+        * "normal" vma->vm_page_prot protections.  Genuine PROT_NONE
+        * VMAs should never get here, because the fault handling code
+        * will notice that the VMA has no read or write permissions.
+        *
+        * This means we cannot get 'special' PROT_NONE faults from genuine
+        * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
+        * tracking.
+        *
+        * Neither case is really interesting for our current use though so we
+        * don't care.
+        */
+       if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
+               return false;
+
+       return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
+}

pte_numa requires a call to vma_prot_none which requires a function call
to vm_get_page_prot.

This is the _PAGE_NUMA equivalent.

+__weak int pte_numa(pte_t pte)
+{
+       return (pte_flags(pte) &
+               (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}

If that were made inline, as Linus suggests, it becomes one, maybe two,
instructions.
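
i.e. something like the following, as a sketch only (where exactly it
lands depends on the header discussion above):

static inline int pte_numa(pte_t pte)
{
	/* present is deliberately clear when the NUMA hinting bit is set */
	return (pte_flags(pte) &
		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
}
#define pte_numa pte_numa

which is a mask and a compare on the pte flags, with no function call
involved.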

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 17:48               ` Mel Gorman
@ 2012-11-16 18:04                 ` Ingo Molnar
  2012-11-16 18:55                   ` Mel Gorman
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2012-11-16 18:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Rik van Riel, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Andrew Morton,
	Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> That said, your approach just ends up being heavier. [...]

Well, it's more fundamental than just whether to inline or not 
(which I think should be a separate optimization and I won't 
object to two-instruction variants in the slightest) - but you 
ended up open-coding change_protection() 
via:

   change_prot_numa_range() et al

which is a far bigger problem...

Do you have valid technical arguments in favor of that 
duplication?

If you just embrace the PROT_NONE reuse approach of numa/core 
then 90% of the differences in your tree will disappear and 
you'll have a code base very close to where numa/core was 3 
weeks ago already, modulo a handful of renames.

It's not like PROT_NONE will go away anytime soon.

PROT_NONE is available on every architecture, and we use the 
exact semantics of it in the scheduler, we just happen to drive 
it from a special worklet instead of a syscall, and happen to 
have a callback to the faults when they happen...

Please stay open to that approach.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 17:37             ` Ingo Molnar
@ 2012-11-16 18:44               ` Mel Gorman
  0 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 18:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Andrew Morton,
	Linux-MM, LKML

On Fri, Nov 16, 2012 at 06:37:55PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > > AFAICS, this portion of numa/core:
> > > 
> > > c740b1cccdcb x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
> > 
> > We share this.
> > 
> > > 02743c9c03f1 mm/mpol: Use special PROT_NONE to migrate pages
> > 
> > hard-codes prot_none
> 
> I prefer any arch support extensions to be done in the patch 
> that adds that specific arch support.
> 
> That way we can consider the pros and cons of abstraction. Also 
> see further below.
> 

Using _PAGE_NUMA mapped onto _PAGE_PROTNONE does not prevent the same
abstraction.

> > > cd203e33c39d mm/mpol: Add MPOL_MF_NOOP
> > 
> > I have a patch that backs this out on the grounds that I don't 
> > think we have adequately discussed if it was the correct 
> > userspace interface. I know Peter put a lot of time into it so 
> > it's probably correct but without man pages or spending time 
> > writing an example program that used it, I played safe.
> 
> I'm fine with not exposing it to user-space.
> 

Ok.

> > > Mel, could you please work on this basis, or point out the 
> > > bits you don't agree with so I can fix it?
> > 
> > My main hangup is the prot_none choice and I know it's 
> > something we have butted heads on without progress. [...]
> 
> It's the basic KISS concept - I think you are over-designing 
> this. An architecture opts in to the new, generic code via 
> doing:
> 
>   select ARCH_SUPPORTS_NUMA_BALANCING
> 
> ... if it cannot enable that then it will extend the core code 
> in *very* visible ways.
> 

It won't kill them to decide if they really want to use _PAGE_NUMA or
not. The fact that the generic helpers end up being a few instructions
is nice, although you could probably do the same with some juggling.

> > [...] I feel it is a lot cleaner to have the _PAGE_NUMA bit 
> > (even if it's PROT_NONE underneath) and the helpers avoid 
> > function calls where possible. It also made the PMD handling 
> > sort of straightforward and allowed the batched taking of the 
> > PTL and migration if the pages in the PMD were all on the same 
> > node. I liked this.
> > 
> > Yours is closer to what the architecture does and can use 
> > change_protection() with very few changes, but on balance I did 
> > not find this a compelling alternative.
> 
> IMO here you are on the wrong side of history as well.
> 
> For example reusing change_protection() *already* uncovered 
> useful optimizations to the generic code:
> 
>    http://comments.gmane.org/gmane.linux.kernel.mm/89707
> 
> (regardless of how this particular change_protection() 
>  optimization ends up looking.)
> 
> that optimization would not have happened with your open-coded 
> change_protection() variant plain and simple.
> 

As I said before, very little actually stops me using change_protection if
_PAGE_NUMA == _PAGE_PROTNONE. The only reason I haven't converted yet is that
I wanted to see what the full set of requirements was. Right now they
are simple:

1. Something to avoid unnecessary TLB flushes if there are no updates
2. Return if all the pages underneath are on the same node or not so
   that pmd_numa can be set if desired
3. Collect stats on PTE updates

1 should already be there. 2 would be trivial. 3 should also be fairly
trivial with some jiggery pokery.
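
For (1) and (3) the conversion is little more than a thin wrapper -- a
sketch only, reusing the vma_prot_none() helper from your tree, and it
assumes change_protection() is extended to return the number of PTEs it
updated and that a NUMA_PTE_UPDATES vmstat counter exists; both of those
are assumptions at this point:

/* Sketch: mark a VMA range for NUMA hinting via change_protection() */
static unsigned long change_prot_numa(struct vm_area_struct *vma,
				      unsigned long start,
				      unsigned long end)
{
	unsigned long nr_updated;

	/* assumed extension: returns the number of PTEs it changed */
	nr_updated = change_protection(vma, start, end,
				       vma_prot_none(vma), 0);
	if (nr_updated)
		count_vm_events(NUMA_PTE_UPDATES, nr_updated);

	return nr_updated;
}

Requirement 2 would then be a small extra flag or out-parameter on
change_protection() to say whether every page it updated was on the
same node.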

A conversion is not a fundamental problem. If an arch cannot use
_PAGE_PROTNONE it will need to implement its own version of
change_prot_numa(), but that in itself should be sufficient discouragement.

If an arch cannot use _PAGE_PROTNONE in your case, it's a retrofit to find
all the places that use prot_none and see if they really mean prot_none
or if they meant prot_numa.

> So, to put it bluntly, you are not only doing a stupid thing, 
> you are doing an actively harmful thing here...
> 

Great, calling me stupid is going to help.

Is your major problem the change_prot_numa() part? If so, I can fix
that and adjust change_protection in the way I need.

> If you fix that then most of the differences between your tree 
> and numa/core disappear. You'll end up very close to:
> 

MIGRATE_FAULT is still there.

The lack of batch handling of a PMD fault may also be a problem. Right
now you only handle transparent hugepages and then depend on being able to
natively migrate them to avoid a big hit. In my case it is possible to mark
a PMD and deal with it as a single fault even if it's not a transparent
hugepage. This batches the taking of the PTL and migration of pages. This
will trap less although the guy that does trap takes a heavier hit. Maybe
this will work out best, maybe not, but it's possible.

An optimisation of this would be that if all pages in a PMD are on the same
node then only the PMD is marked. On the next fault, if the access is properly
placed, it's one PMD update and the fault is complete. If it's misplaced
then one page needs to migrate and pte_numa needs to be set on all the
pages below. On a fully converged workload this will be faster as we'll take
one PMD fault instead of 512 PTE faults, reducing overall system CPU usage.
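
To illustrate the shape of the batching -- a sketch of the idea only, not
the code in the series, and it glosses over how misplaced pages are handed
to the migration code -- a pmd_numa fault on a regular PMD looks roughly
like this:

/*
 * Sketch: handle a pmd_numa hinting fault on a regular (non-THP) PMD as
 * one batched fault. The PTL is taken once and the NUMA bit is cleared
 * on every PTE underneath, so one trap covers up to PTRS_PER_PTE pages.
 */
static void do_pmd_numa_page(struct mm_struct *mm,
			     struct vm_area_struct *vma,
			     unsigned long addr, pmd_t *pmdp)
{
	unsigned long _addr = addr & PMD_MASK;
	unsigned long end = _addr + PMD_SIZE;
	pte_t *pte, *orig_pte;
	spinlock_t *ptl;

	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
	for (; _addr < end; pte++, _addr += PAGE_SIZE) {
		pte_t pteval = *pte;
		struct page *page;

		if (!pte_numa(pteval))
			continue;

		/* clear the hinting bit under the single PTL hold */
		set_pte_at(mm, _addr, pte, pte_mknonnuma(pteval));
		update_mmu_cache(vma, _addr, pte);

		page = vm_normal_page(vma, _addr, pteval);
		if (!page)
			continue;
		/*
		 * The page's node feeds the fault statistics here; a
		 * misplaced page would be queued with a reference taken
		 * and migrated after the lock is dropped.
		 */
	}
	pte_unmap_unlock(orig_pte, ptl);
}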

There are also the stats that track PTE updates, faults and migrations,
which allow a user to estimate from /proc/vmstat how expensive automatic
balancing is. This will help when debugging user problems, possibly
without profiling.

>   - rebasing numa/core pretty much as-is
>   + add your migrate_displaced() function
>   - remove the user-facing lazy migration facilities.
>   + inline pte_numa()/pmd_numa() if you think it's beneficial
> 

+ regular pmd batch handling
+ stats on PTE updates and faults to estimate costs from /proc/vmstat

> If that works for you I'll test and backmerge all such deltas 
> quickly and we can move on.
> 

Or, if you're willing to backmerge, why not rebase the policy bits on
top of the basic migration policy, picking some point in the list below
depending on what you'd like to do?

 mm: numa: Rate limit setting of pte_numa if node is saturated
 sched: numa: Slowly increase the scanning period as NUMA faults are handled
 mm: numa: Introduce last_nid to the page frame
 mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
 sched: numa: Introduce tsk_home_node()
 sched: numa: Make find_busiest_queue() a method
 sched: numa: Implement home-node awareness
 sched: numa: Introduce per-mm and per-task structures

That way, not only can we see the logical progression of how your
stuff works, but we can also compare it to a basic policy that is not
particularly smart and make sure we are actually going in the right
direction.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
  2012-11-16 18:04                 ` Ingo Molnar
@ 2012-11-16 18:55                   ` Mel Gorman
  0 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2012-11-16 18:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Andrew Morton,
	Linux-MM, LKML

On Fri, Nov 16, 2012 at 07:04:04PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > That said, your approach just ends up being heavier. [...]
> 
> Well, it's more fundamental than just whether to inline or not 
> (which I think should be a separate optimization and I won't 
> object to two-instruction variants in the slightest) - but you 
> ended up open-coding change_protection() 
> via:
> 
>    change_prot_numa_range() et al
> 
> which is a far bigger problem...
> 
> Do you have valid technical arguments in favor of that 
> duplication?
> 

No, I don't, and I have not claimed that it *has* to exist. In fact I've
said multiple times that I can convert to change_protection as long as
_PAGE_NUMA == _PAGE_PROTNONE. This initial step was to build the list
of requirements without worrying about breaking existing users of
change_protection. Now that I know what the requirements are, I can convert.

> If you just embrace the PROT_NONE reuse approach of numa/core 
> then 90% of the differences in your tree will disappear and 
> you'll have a code base very close to where numa/core was 3 
> weeks ago already, modulo a handful of renames.
> 

I pointed out the differences in another mail already -- MIGRATE_FAULT,
batched PMD handling, stats, and a logical progression from a simple to
a complex policy.

> It's not like PROT_NONE will go away anytime soon.
> 
> PROT_NONE is available on every architecture, and we use the 
> exact semantics of it in the scheduler, we just happen to drive 
> it from a special worklet instead of a syscall, and happen to 
> have a callback to the faults when they happen...
> 
> Please stay open to that approach.
> 

I will.

If anything, me switching to prot_none would be a hell of a lot easier
than you trying to pick up the bits you're missing. I'll take a look
Monday and see what falls out.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [tip:numa/core] mm/migration: Improve migrate_misplaced_page()
  2012-11-16 11:22 ` [PATCH 13/43] mm: migrate: Introduce migrate_misplaced_page() Mel Gorman
@ 2012-11-19 19:44   ` tip-bot for Mel Gorman
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot for Mel Gorman @ 2012-11-19 19:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, a.p.zijlstra, hannes, hughd,
	riel, Lee.Schermerhorn, aarcange, mgorman, tglx, linux-mm

Commit-ID:  292c8cf52d4c65e1f8744e5c7ce774516d868ee8
Gitweb:     http://git.kernel.org/tip/292c8cf52d4c65e1f8744e5c7ce774516d868ee8
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Fri, 16 Nov 2012 11:22:23 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 19 Nov 2012 03:31:22 +0100

mm/migration: Improve migrate_misplaced_page()

Fix, improve and clean up migrate_misplaced_page() to
reuse migrate_pages() and to check for zone watermarks
to make sure we don't overload the node.

This was originally based on Peter's patch "mm/migrate: Introduce
migrate_misplaced_page()" but borrows extremely heavily from Andrea's
"autonuma: memory follows CPU algorithm and task/mm_autonuma stats
collection".

Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Based-on-work-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Link: http://lkml.kernel.org/r/1353064973-26082-14-git-send-email-mgorman@suse.de
[ Adapted to the numa/core tree. Kept Mel's patch separate to retain
  original authorship for the authors. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/migrate_mode.h |   3 -
 mm/memory.c                  |  13 ++--
 mm/migrate.c                 | 143 +++++++++++++++++++++++++++----------------
 3 files changed, 95 insertions(+), 64 deletions(-)

diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index 40b37dc..ebf3d89 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,14 +6,11 @@
  *	on most operations but not ->writepage as the potential stall time
  *	is too significant
  * MIGRATE_SYNC will block when migrating pages
- * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
- *	this path has an extra reference count
  */
 enum migrate_mode {
 	MIGRATE_ASYNC,
 	MIGRATE_SYNC_LIGHT,
 	MIGRATE_SYNC,
-	MIGRATE_FAULT,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/memory.c b/mm/memory.c
index 23ad2eb..52ad29d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3492,28 +3492,25 @@ out_pte_upgrade_unlock:
 
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
-out:
+
 	if (page) {
 		task_numa_fault(page_nid, last_cpu, 1);
 		put_page(page);
 	}
-
+out:
 	return 0;
 
 migrate:
 	pte_unmap_unlock(ptep, ptl);
 
-	if (!migrate_misplaced_page(page, node)) {
-		page_nid = node;
+	if (migrate_misplaced_page(page, node)) {
 		goto out;
 	}
+	page = NULL;
 
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (!pte_same(*ptep, entry)) {
-		put_page(page);
-		page = NULL;
+	if (!pte_same(*ptep, entry))
 		goto out_unlock;
-	}
 
 	goto out_pte_upgrade_unlock;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index b89062d..16a4709 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -225,7 +225,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
 	struct buffer_head *bh = head;
 
 	/* Simple case, sync compaction */
-	if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) {
+	if (mode != MIGRATE_ASYNC) {
 		do {
 			get_bh(bh);
 			lock_buffer(bh);
@@ -282,19 +282,9 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	int expected_count = 0;
 	void **pslot;
 
-	if (mode == MIGRATE_FAULT) {
-		/*
-		 * MIGRATE_FAULT has an extra reference on the page and
-		 * otherwise acts like ASYNC, no point in delaying the
-		 * fault, we'll try again next time.
-		 */
-		expected_count++;
-	}
-
 	if (!mapping) {
 		/* Anonymous page without mapping */
-		expected_count += 1;
-		if (page_count(page) != expected_count)
+		if (page_count(page) != 1)
 			return -EAGAIN;
 		return 0;
 	}
@@ -304,7 +294,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
 
-	expected_count += 2 + page_has_private(page);
+	expected_count = 2 + page_has_private(page);
 	if (page_count(page) != expected_count ||
 		radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
 		spin_unlock_irq(&mapping->tree_lock);
@@ -323,7 +313,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	 * the mapping back due to an elevated page count, we would have to
 	 * block waiting on other references to be dropped.
 	 */
-	if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head &&
+	if (mode == MIGRATE_ASYNC && head &&
 			!buffer_migrate_lock_buffers(head, mode)) {
 		page_unfreeze_refs(page, expected_count);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -531,7 +521,7 @@ int buffer_migrate_page(struct address_space *mapping,
 	 * with an IRQ-safe spinlock held. In the sync case, the buffers
 	 * need to be locked now
 	 */
-	if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT)
+	if (mode != MIGRATE_ASYNC)
 		BUG_ON(!buffer_migrate_lock_buffers(head, mode));
 
 	ClearPagePrivate(page);
@@ -697,7 +687,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
-		if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT)
+		if (!force || mode == MIGRATE_ASYNC)
 			goto out;
 
 		/*
@@ -1415,55 +1405,102 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
 }
 
 /*
+ * Returns true if this is a safe migration target node for misplaced NUMA
+ * pages. Currently it only checks the watermarks which is a bit crude.
+ */
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+				   int nr_migrate_pages)
+{
+	int z;
+
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static struct page *alloc_misplaced_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+
+	newpage = alloc_pages_exact_node(nid,
+					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
+					  __GFP_NOMEMALLOC | __GFP_NORETRY |
+					  __GFP_NOWARN) &
+					 ~GFP_IOFS, 0);
+	return newpage;
+}
+
+/*
  * Attempt to migrate a misplaced page to the specified destination
- * node.
+ * node. Caller is expected to have an elevated reference count on
+ * the page that will be dropped by this function before returning.
  */
 int migrate_misplaced_page(struct page *page, int node)
 {
-	struct address_space *mapping = page_mapping(page);
-	int page_lru = page_is_file_cache(page);
-	struct page *newpage;
-	int ret = -EAGAIN;
-	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+	int isolated = 0;
+	LIST_HEAD(migratepages);
 
 	/*
-	 * Never wait for allocations just to migrate on fault, but don't dip
-	 * into reserves. And, only accept pages from the specified node. No
-	 * sense migrating to a different "misplaced" page!
+	 * Don't migrate pages that are mapped in multiple processes.
+	 * TODO: Handle false sharing detection instead of this hammer
 	 */
-	if (mapping)
-		gfp = mapping_gfp_mask(mapping);
-	gfp &= ~__GFP_WAIT;
-	gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
-
-	newpage = alloc_pages_node(node, gfp, 0);
-	if (!newpage) {
-		ret = -ENOMEM;
+	if (page_mapcount(page) != 1)
 		goto out;
-	}
 
-	if (isolate_lru_page(page)) {
-		ret = -EBUSY;
-		goto put_new;
+	/* Avoid migrating to a node that is nearly full */
+	if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+		int page_lru;
+
+		if (isolate_lru_page(page)) {
+			put_page(page);
+			goto out;
+		}
+		isolated = 1;
+
+		/*
+		 * Page is isolated which takes a reference count so now the
+		 * callers reference can be safely dropped without the page
+		 * disappearing underneath us during migration
+		 */
+		put_page(page);
+
+		page_lru = page_is_file_cache(page);
+		inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+		list_add(&page->lru, &migratepages);
 	}
 
-	inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
-	ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT);
-	/*
-	 * A page that has been migrated has all references removed and will be
-	 * freed. A page that has not been migrated will have kepts its
-	 * references and be restored.
-	 */
-	dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
-	putback_lru_page(page);
-put_new:
-	/*
-	 * Move the new page to the LRU. If migration was not successful
-	 * then this will free the page.
-	 */
-	putback_lru_page(newpage);
+	if (isolated) {
+		int nr_remaining;
+
+		nr_remaining = migrate_pages(&migratepages,
+				alloc_misplaced_dst_page,
+				node, false, MIGRATE_ASYNC);
+		if (nr_remaining) {
+			putback_lru_pages(&migratepages);
+			isolated = 0;
+		}
+	}
+	BUG_ON(!list_empty(&migratepages));
 out:
-	return ret;
+	return isolated;
 }
 
 #endif /* CONFIG_NUMA */

^ permalink raw reply related	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2012-11-19 19:46 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-16 11:22 [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
2012-11-16 11:22 ` [PATCH 01/43] mm: compaction: Move migration fail/success stats to migrate.c Mel Gorman
2012-11-16 11:22 ` [PATCH 02/43] mm: migrate: Add a tracepoint for migrate_pages Mel Gorman
2012-11-16 11:22 ` [PATCH 03/43] mm: compaction: Add scanned and isolated counters for compaction Mel Gorman
2012-11-16 11:22 ` [PATCH 04/43] mm: numa: define _PAGE_NUMA Mel Gorman
2012-11-16 11:22 ` [PATCH 05/43] mm: numa: pte_numa() and pmd_numa() Mel Gorman
2012-11-16 11:22 ` [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation Mel Gorman
2012-11-16 14:09   ` Rik van Riel
2012-11-16 14:41     ` Mel Gorman
2012-11-16 15:32       ` Linus Torvalds
2012-11-16 16:08         ` Ingo Molnar
2012-11-16 16:56           ` Mel Gorman
2012-11-16 17:12             ` Ingo Molnar
2012-11-16 17:48               ` Mel Gorman
2012-11-16 18:04                 ` Ingo Molnar
2012-11-16 18:55                   ` Mel Gorman
2012-11-16 17:26             ` Rik van Riel
2012-11-16 17:37             ` Ingo Molnar
2012-11-16 18:44               ` Mel Gorman
2012-11-16 16:19         ` Mel Gorman
2012-11-16 11:22 ` [PATCH 07/43] mm: numa: Support NUMA hinting page faults from gup/gup_fast Mel Gorman
2012-11-16 14:09   ` Rik van Riel
2012-11-16 11:22 ` [PATCH 08/43] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte Mel Gorman
2012-11-16 11:22 ` [PATCH 09/43] mm: numa: Create basic numa page hinting infrastructure Mel Gorman
2012-11-16 11:22 ` [PATCH 10/43] mm: mempolicy: Make MPOL_LOCAL a real policy Mel Gorman
2012-11-16 11:22 ` [PATCH 11/43] mm: mempolicy: Add MPOL_MF_NOOP Mel Gorman
2012-11-16 11:22 ` [PATCH 12/43] mm: mempolicy: Check for misplaced page Mel Gorman
2012-11-16 11:22 ` [PATCH 13/43] mm: migrate: Introduce migrate_misplaced_page() Mel Gorman
2012-11-19 19:44   ` [tip:numa/core] mm/migration: Improve migrate_misplaced_page() tip-bot for Mel Gorman
2012-11-16 11:22 ` [PATCH 14/43] mm: mempolicy: Use _PAGE_NUMA to migrate pages Mel Gorman
2012-11-16 16:08   ` Rik van Riel
2012-11-16 11:22 ` [PATCH 15/43] mm: mempolicy: Add MPOL_MF_LAZY Mel Gorman
2012-11-16 11:22 ` [PATCH 16/43] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now Mel Gorman
2012-11-16 16:22   ` Rik van Riel
2012-11-16 11:22 ` [PATCH 17/43] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag Mel Gorman
2012-11-16 11:22 ` [PATCH 18/43] mm: numa: Add fault driven placement and migration Mel Gorman
2012-11-16 11:22 ` [PATCH 19/43] mm: numa: Avoid double faulting after migrating misplaced page Mel Gorman
2012-11-16 11:22 ` [PATCH 20/43] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Mel Gorman
2012-11-16 11:22 ` [PATCH 21/43] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Mel Gorman
2012-11-16 11:22 ` [PATCH 22/43] mm: sched: numa: Implement slow start for working set sampling Mel Gorman
2012-11-16 11:22 ` [PATCH 23/43] mm: numa: Add pte updates, hinting and migration stats Mel Gorman
2012-11-16 11:22 ` [PATCH 24/43] mm: numa: Migrate on reference policy Mel Gorman
2012-11-16 11:22 ` [PATCH 25/43] mm: numa: Migrate pages handled during a pmd_numa hinting fault Mel Gorman
2012-11-16 11:22 ` [PATCH 26/43] mm: numa: Only mark a PMD pmd_numa if the pages are all on the same node Mel Gorman
2012-11-16 11:22 ` [PATCH 27/43] mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting Mel Gorman
2012-11-16 11:22 ` [PATCH 28/43] mm: numa: Rate limit the amount of memory that is migrated between nodes Mel Gorman
2012-11-16 11:22 ` [PATCH 29/43] mm: numa: Rate limit setting of pte_numa if node is saturated Mel Gorman
2012-11-16 11:22 ` [PATCH 30/43] sched: numa: Slowly increase the scanning period as NUMA faults are handled Mel Gorman
2012-11-16 11:22 ` [PATCH 31/43] mm: numa: Introduce last_nid to the page frame Mel Gorman
2012-11-16 11:22 ` [PATCH 32/43] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships Mel Gorman
2012-11-16 11:22 ` [PATCH 33/43] x86: mm: only do a local tlb flush in ptep_set_access_flags() Mel Gorman
2012-11-16 11:22 ` [PATCH 34/43] x86: mm: drop TLB flush from ptep_set_access_flags Mel Gorman
2012-11-16 11:22 ` [PATCH 35/43] mm,generic: only flush the local TLB in ptep_set_access_flags Mel Gorman
2012-11-16 11:22 ` [PATCH 36/43] sched: numa: Introduce tsk_home_node() Mel Gorman
2012-11-16 11:22 ` [PATCH 37/43] sched: numa: Make find_busiest_queue() a method Mel Gorman
2012-11-16 11:22 ` [PATCH 38/43] sched: numa: Implement home-node awareness Mel Gorman
2012-11-16 11:22 ` [PATCH 39/43] sched: numa: Introduce per-mm and per-task structures Mel Gorman
2012-11-16 11:22 ` [PATCH 40/43] sched: numa: CPU follows memory Mel Gorman
2012-11-16 11:22 ` [PATCH 41/43] sched: numa: Rename mempolicy to HOME Mel Gorman
2012-11-16 11:22 ` [PATCH 42/43] sched: numa: Consider only one CPU per node for CPU-follows-memory Mel Gorman
2012-11-16 11:22 ` [PATCH 43/43] sched: numa: Increase and decrease a tasks scanning period based on task fault statistics Mel Gorman
2012-11-16 14:56 ` [RFC PATCH 00/43] Automatic NUMA Balancing V3 Mel Gorman
