* [PATCH 00/40] Automatic NUMA Balancing V5
@ 2012-11-22 19:25 Mel Gorman
  2012-11-22 19:25 ` [PATCH 01/40] x86: mm: only do a local tlb flush in ptep_set_access_flags() Mel Gorman
                   ` (40 more replies)
  0 siblings, 41 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

tldr: Benchmarkers, unlike earlier series, the full series is eligible
	for testing.

git tree: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v5r4

This series can be treated as 5 major stages.

1. TLB optimisations that we're likely to want unconditionally.
2. Basic foundation and core mechanics, initial policy that does very little
3. Full PMD fault handling, rate limiting of migration, two-stage migration
   filter to mitigate poor migration decisions.  This will migrate pages
   on a PTE or PMD level using just the current referencing CPU as a
   placement hint
4. Native THP migration
5. Scan rate adaption

Stages 4 and 5 should probably be swapped but my testing was stacked
like this.

Very broadly speaking the TODOs that spring to mind are

1. Revisit MPOL_NOOP and MPOL_MF_LAZY
2. Other architecture support or at least validation that it could be made to
   work. I'm half-hoping that the PPC64 people are watching because they tend
   to be interested in this type of thing.

I recognise that the series is quite large. In many cases I kept patches
split out so the progression can be seen and replacing individual components
may be easier.

Some advantages of the series are:

1. It handles regular PMDs which reduces overhead in the case where pages
   within a PMD are on the same node
2. It rate limits migrations to avoid saturating the bus and backs off
   PTE scanning (in a fairly heavy manner) if the node is rate-limited
3. It keeps major optimisations like THP towards the end to be sure I am
   not accidentally depending on them
4. It has some vmstats which allow a user to make a rough guess as to how
   much overhead the balancing is introducing
5. It implements a basic policy that acts as a second performance baseline.
   The three baselines become vanilla kernel, basic placement policy and
   complex placement policy. This allows like-with-like comparisons between
   implementations.

Changelog since V4
  o Allow enabling/disabling from the command line
  o Delay PTE scanning until tasks are running on a new node
  o THP migration bits needed for memcg
  o Adapt the scanning rate depending on whether pages need to migrate
  o Drop all the scheduler policy stuff on top, it was broken

Changelog since V3
  o Use change_protection
  o Architecture-hook twiddling
  o Port of the THP migration patch.
  o Additional TLB optimisations
  o Fixes from Hillf Danton

Changelog since V2
  o Do not allocate from home node
  o Mostly remove pmd_numa handling for regular pmds
  o HOME policy will allocate from and migrate towards local node
  o Load balancer is more aggressive about moving tasks towards home node
  o Renames to sync up more with -tip version
  o Move pte handlers to generic code
  o Scanning rate starts at 100ms, system CPU usage expected to increase
  o Handle migration of PMD hinting faults
  o Rate limit migration on a per-node basis
  o Alter how the rate of PTE scanning is adapted
  o Rate limit setting of pte_numa if node is congested
  o Only flush local TLB if unmapping a pte_numa page
  o Only consider one CPU in cpu follow algorithm

Changelog since V1
  o Account for faults on the correct node after migration
  o Do not account for THP splits as faults.
  o Account THP faults on the node they occurred
  o Ensure preferred_node_policy is initialised before use
  o Mitigate double faults
  o Add home-node logic
  o Add some tlb-flush mitigation patches
  o Add variation of CPU follows memory algorithm
  o Add last_nid and use it as a two-stage filter before migrating pages
  o Restart the PTE scanner when it reaches the end of the address space
  o Lots of stuff I did not note properly

There are currently two (three depending on how you look at it) competing
approaches to implement support for automatically migrating pages to
optimise NUMA locality. Performance results are available but review
highlighted different problems in both.  They are not compatible with each
other even though some fundamental mechanics should have been the same.
This series addresses part of the integration and sharing problem by
implementing a foundation that either the policy for schednuma or autonuma
can be rebased on.

The initial policy it implements is a very basic greedy policy called
"Migrate On Reference Of pte_numa Node (MORON)".  I expect people to
build upon this revised policy and rename it to something more sensible
that reflects what it means. The ideal *worst-case* behaviour is that
it is comparable to current mainline but for some workloads this is an
improvement over mainline.

In terms of building on top of the foundation, the ideal would be that
patches affect one of the following areas, although obviously that will
not always be possible

1. The PTE update helper functions
2. The PTE scanning machinery driven from task_numa_tick
3. Task and process fault accounting and how that information is used
   to determine if a page is misplaced
4. Fault handling, migrating the page if misplaced, what information is
   provided to the placement policy
5. Scheduler and load balancing

Patches 1-5 are some TLB optimisations that mostly make sense on their own.
	They are likely to make it into the tree either way

Patches 6-7 are an mprotect optimisation

Patches 8-10 move some vmstat counters so that migrated pages get accounted
	for. In the past the primary user of migration was compaction but
	if pages are to migrate for NUMA optimisation then the counters
	need to be generally useful.

Patch 11 defines an arch-specific PTE bit called _PAGE_NUMA that is used
	to trigger faults later in the series. A placement policy is expected
	to use these faults to determine if a page should migrate.  On x86,
	the bit is the same as _PAGE_PROTNONE but other architectures
	may differ. Note that it is also possible to avoid using this bit
	and go with plain PROT_NONE but the resulting helpers are then
	heavier.
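	On x86 this boils down to reusing the protnone bit. As a sketch of
	the idea only (not the exact hunk from patch 11):

	/* arch/x86/include/asm/pgtable_types.h -- illustrative sketch */
	/*
	 * _PAGE_NUMA shares a bit with _PAGE_PROTNONE and is only
	 * interpreted as _PAGE_NUMA when the pte is not present, which is
	 * how NUMA hinting faults are told apart from real PROT_NONE faults.
	 */
	#define _PAGE_NUMA	_PAGE_PROTNONE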

Patches 12-14 define pte_numa, pmd_numa, pte_mknuma, pte_mknonuma and
	friends, and update GUP and huge page splitting.
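	The helpers test and flip that bit together with the present bit,
	roughly along these lines (a sketch, not the exact definitions from
	the patches):

	static inline int pte_numa(pte_t pte)
	{
		/* NUMA hinting ptes have _PAGE_NUMA set, _PAGE_PRESENT clear */
		return (pte_flags(pte) & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
	}

	static inline pte_t pte_mknuma(pte_t pte)
	{
		pte = pte_set_flags(pte, _PAGE_NUMA);
		return pte_clear_flags(pte, _PAGE_PRESENT);
	}

	static inline pte_t pte_mknonuma(pte_t pte)
	{
		pte = pte_clear_flags(pte, _PAGE_NUMA);
		return pte_set_flags(pte, _PAGE_PRESENT | _PAGE_ACCESSED);
	}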

Patch 15 creates the fault handler for p[te|md]_numa PTEs and just clears
	them again.

Patch 16 adds an MPOL_LOCAL policy so applications can explicitly request the
	historical behaviour.
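	From userspace, requesting the historical behaviour would look
	something like this (a sketch, assuming the new MPOL_LOCAL constant
	is visible through numaif.h):

	#include <stdio.h>
	#include <numaif.h>	/* set_mempolicy(), MPOL_LOCAL */

	int main(void)
	{
		/* Allocate from the local node, i.e. the historical default */
		if (set_mempolicy(MPOL_LOCAL, NULL, 0) != 0) {
			perror("set_mempolicy(MPOL_LOCAL)");
			return 1;
		}
		return 0;
	}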

Patch 17 is premature but adds an MPOL_NOOP policy that can be used in
	conjunction with the LAZY flags introduced later in the series.

Patch 18 adds migrate_misplaced_page which is responsible for migrating
	a page to a new location.

Patch 19 migrates the page on fault if mpol_misplaced() says to do so.

Patch 20 updates the page fault handlers. Transparent huge pages are split.
	Pages pointed to by PTEs are migrated. Pages pointed to by PMDs
	are not properly handled until later in the series.

Patch 21 adds an MPOL_MF_LAZY mempolicy flag that an interested application can use.
	On the next reference the memory should be migrated to the node that
	references the memory.
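	As an illustration only (note that patch 23 removes these flags again
	pending review, so the userspace interface may change), the intended
	usage is roughly an mbind() call along these lines:

	#include <numaif.h>	/* mbind(), MPOL_NOOP, MPOL_MF_LAZY */

	/*
	 * Hypothetical usage: leave the policy of [addr, addr+len) alone but
	 * mark the range so that pages are migrated lazily on the next
	 * reference from a remote node.
	 */
	if (mbind(addr, len, MPOL_NOOP, NULL, 0, MPOL_MF_LAZY) != 0)
		perror("mbind(MPOL_NOOP, MPOL_MF_LAZY)");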

Patch 22 reimplements change_prot_numa in terms of change_protection. It could
	be collapsed with patch 21 but this might be easier to review.

Patch 23 notes that the MPOL_MF_LAZY and MPOL_NOOP flags have not been properly
	reviewed and there are no manual pages. They are removed for now and
	need to be revisited.

Patch 24 sets pte_numa within the context of the scheduler.

Patches 25-27 note that the marking of pte_numa has a number of disadvantages
	and instead incrementally update a limited range of the address space
	each tick.

Patch 28 adds some vmstats that can be used to approximate the cost of the
	scheduling policy in a more fine-grained fashion than looking at
	the system CPU usage.

Patch 29 implements the MORON policy.

Patch 30 properly handles the migration of pages faulted when handling a pmd
	numa hinting fault. This could be improved as it's a bit tangled
	to follow. PMDs are only marked if the PTEs underneath are expected
	to point to pages on the same node.

Patches 31-33 rate-limit the number of pages being migrated and marked as pte_numa
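	The rate limiting is conceptually simple: count the pages migrated to
	a node within a time window and refuse further NUMA migrations once a
	threshold is reached. A sketch of the idea (the field and variable
	names here are illustrative, not the actual ones):

	static bool numa_migrate_allowed(pg_data_t *pgdat, unsigned long nr_pages)
	{
		/* Start a new window if the current one has expired */
		if (time_after(jiffies, pgdat->numa_migrate_next_window)) {
			pgdat->numa_migrate_next_window = jiffies +
				msecs_to_jiffies(migrate_interval_millisecs);
			pgdat->numa_migrate_nr_pages = 0;
		}

		/* Over the limit for this window, skip the migration */
		if (pgdat->numa_migrate_nr_pages > ratelimit_pages)
			return false;

		pgdat->numa_migrate_nr_pages += nr_pages;
		return true;
	}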

Patch 34 slowly decreases the pte_numa update scanning rate

Patches 35-36 introduce last_nid and use it to build a two-stage filter
	that delays when a page gets migrated to avoid a situation where
	a task running temporarily off its home node forces a migration.
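	The filter amounts to remembering which node last saw a hinting fault
	on the page and only migrating on the second observation from the same
	node. Roughly (the helper names here are illustrative):

	int last_nid = page_last_nid(page);	/* illustrative accessor */

	if (last_nid != this_nid) {
		/*
		 * First fault from this node: record it but leave the page
		 * alone in case the task is only running here temporarily.
		 */
		page_set_last_nid(page, this_nid);
		return 0;
	}

	/* Second fault in a row from the same node: migrate the page */
	return migrate_misplaced_page(page, this_nid);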

Patch 37 implements native THP migration for NUMA hinting faults.

Patch 38 adapts the scanning rate if pages do not have to be migrated

Patch 39 allows the enabling/disabling from command line

Patch 40 delays PTE scanning until a task using an address space is
	scheduled on a new node

Documentation is sorely missing.

Kernels tested were

stats-v5r1	Patches 1-10. TLB optimisations, migration stats
thpmigrate-v5r1	Patches 1-37. Basic placement policy, PMD handling, THP migration etc.
adaptscan-v5r1	Patches 1-38. Heavy handed PTE scan reduction
delaystart-v5r1 Patches 1-40. Delay the PTE scan until running on a new node

By rights the series should be shuffled to move THP to the end but the
scan adaption stuff was developed later and I did not want to discard the
old results due to time.

AUTONUMA BENCH
                                          3.7.0                 3.7.0                 3.7.0                 3.7.0
                                 rc6-stats-v5r1   rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
User    NUMA01               75064.91 (  0.00%)    54454.75 ( 27.46%)    58561.99 ( 21.98%)    56747.85 ( 24.40%)
User    NUMA01_THEADLOCAL    62045.39 (  0.00%)    16906.80 ( 72.75%)    17813.47 ( 71.29%)    18021.32 ( 70.95%)
User    NUMA02                6921.18 (  0.00%)     2065.29 ( 70.16%)     2049.90 ( 70.38%)     2098.25 ( 69.68%)
User    NUMA02_SMT            2924.84 (  0.00%)      987.17 ( 66.25%)      995.65 ( 65.96%)     1000.24 ( 65.80%)
System  NUMA01                  48.75 (  0.00%)      696.82 (-1329.37%)      273.76 (-461.56%)      271.95 (-457.85%)
System  NUMA01_THEADLOCAL       46.05 (  0.00%)      156.85 (-240.61%)      135.24 (-193.68%)      122.13 (-165.21%)
System  NUMA02                   1.73 (  0.00%)        8.74 (-405.20%)        6.35 (-267.05%)        9.02 (-421.39%)
System  NUMA02_SMT              18.34 (  0.00%)        3.31 ( 81.95%)        3.53 ( 80.75%)        3.55 ( 80.64%)
Elapsed NUMA01                1666.60 (  0.00%)     1234.33 ( 25.94%)     1321.51 ( 20.71%)     1269.96 ( 23.80%)
Elapsed NUMA01_THEADLOCAL     1391.37 (  0.00%)      370.06 ( 73.40%)      396.18 ( 71.53%)      397.63 ( 71.42%)
Elapsed NUMA02                 176.41 (  0.00%)       48.89 ( 72.29%)       50.66 ( 71.28%)       50.34 ( 71.46%)
Elapsed NUMA02_SMT             163.88 (  0.00%)       46.83 ( 71.42%)       48.29 ( 70.53%)       47.63 ( 70.94%)
CPU     NUMA01                4506.00 (  0.00%)     4468.00 (  0.84%)     4452.00 (  1.20%)     4489.00 (  0.38%)
CPU     NUMA01_THEADLOCAL     4462.00 (  0.00%)     4610.00 ( -3.32%)     4530.00 ( -1.52%)     4562.00 ( -2.24%)
CPU     NUMA02                3924.00 (  0.00%)     4241.00 ( -8.08%)     4058.00 ( -3.41%)     4185.00 ( -6.65%)
CPU     NUMA02_SMT            1795.00 (  0.00%)     2114.00 (-17.77%)     2068.00 (-15.21%)     2107.00 (-17.38%)

numa01's elapsed time sucks. It's better than mainline but that's about
it. It's an adverse workload and the ideal would be that the policy
interleaves memory. This series cannot do that. It migrates some pages
so it gets some benefit from increased memory bandwidth but for the
most part it sets PTEs and traps faults.

The other workloads are much better with 70% gains in performance in
comparison to mainline. The System CPU overhead is higher than I'd like
but improved. For example, system CPU usage for numa01 has gone from 489.09
seconds in V4 of this series to 271.95 seconds.

MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0       3.7.0
        rc6-stats-v5r1  rc6-thpmigrate-v5r1  rc6-adaptscan-v5r1  rc6-delaystart-v5r4
User       274653.21   130223.93   142154.84   146804.10
System       1329.11     2773.99     1453.79     1814.66
Elapsed      6827.56     3508.55     3757.51     3843.07

Reduced elapsed time and higher system CPU usage as you'd expect. Again,
placement policies should help reduce this overhead further by dialing
back the PTE scanner.


MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0
                          rc6-stats-v5r1  rc6-thpmigrate-v5r1  rc6-adaptscan-v5r1  rc6-delaystart-v5r4
Page Ins                        195440      169788      167656      168860
Page Outs                       355400      246860      264276      269304
Swap Ins                             0           0           0           0
Swap Outs                            0           0           0           0
Direct pages scanned                 0           0           0           0
Kswapd pages scanned                 0           0           0           0
Kswapd pages reclaimed               0           0           0           0
Direct pages reclaimed               0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%
Page writes by reclaim               0           0           0           0
Page writes file                     0           0           0           0
Page writes anon                     0           0           0           0
Page reclaim immediate               0           0           0           0
Page rescued immediate               0           0           0           0
Slabs scanned                        0           0           0           0
Direct inode steals                  0           0           0           0
Kswapd inode steals                  0           0           0           0
Kswapd skipped wait                  0           0           0           0
THP fault alloc                  42264       47486       32077       34343
THP collapse alloc                  23          23          26          22
THP splits                           5           6           5           4
THP fault fallback                   0           0           0           0
THP collapse fail                    0           0           0           0
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success                 0      523123      180790      209771
Page migrate failure                 0           0           0           0
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                      0         543         187         217
NUMA PTE updates                     0   842347410   295302723   301160396
NUMA hint faults                     0     6924258     3277126     3189624
NUMA hint local faults               0     3757418     1824546     1872917
NUMA pages migrated                  0      523123      180790      209771
AutoNUMA cost                        0       40527       18456       18060

Note what the scan adaption does to the number of PTE updates and the
number of faults incurred. A policy may not necessarily like this. It
depends on its requirements but if it wants higher PTE scan rates it
will have to compensate for it.

SPECJBB Multiple JVM instances one per node, THP disabled

                          3.7.0                 3.7.0                 3.7.0                 3.7.0
                 rc6-stats-v5r1   rc6-thpmigrate-v5r1    rc6-adaptscan-v5r1   rc6-delaystart-v5r4
Mean   1      25269.25 (  0.00%)     25138.00 ( -0.52%)     25539.25 (  1.07%)     25193.00 ( -0.30%)
Mean   2      53467.00 (  0.00%)     50813.00 ( -4.96%)     52803.50 ( -1.24%)     52637.50 ( -1.55%)
Mean   3      77112.50 (  0.00%)     75274.25 ( -2.38%)     76097.00 ( -1.32%)     76324.25 ( -1.02%)
Mean   4      99928.75 (  0.00%)     97444.75 ( -2.49%)     99426.75 ( -0.50%)     99767.25 ( -0.16%)
Mean   5     119616.75 (  0.00%)    117350.00 ( -1.90%)    118417.25 ( -1.00%)    118298.50 ( -1.10%)
Mean   6     133944.75 (  0.00%)    133565.75 ( -0.28%)    135268.75 (  0.99%)    137512.50 (  2.66%)
Mean   7     137063.00 (  0.00%)    136744.75 ( -0.23%)    139218.25 (  1.57%)    138919.25 (  1.35%)
Mean   8     130814.25 (  0.00%)    137088.25 (  4.80%)    139649.50 (  6.75%)    138273.00 (  5.70%)
Mean   9     124815.00 (  0.00%)    135275.50 (  8.38%)    137494.50 ( 10.16%)    137386.25 ( 10.07%)
Mean   10    123741.00 (  0.00%)    131418.00 (  6.20%)    132662.00 (  7.21%)    132379.25 (  6.98%)
Mean   11    116966.25 (  0.00%)    125246.00 (  7.08%)    124420.25 (  6.37%)    128132.00 (  9.55%)
Mean   12    106682.00 (  0.00%)    118489.50 ( 11.07%)    119624.25 ( 12.13%)    121050.75 ( 13.47%)
Mean   13    106395.00 (  0.00%)    118143.75 ( 11.04%)    116799.25 (  9.78%)    121032.25 ( 13.76%)
Mean   14    104384.25 (  0.00%)    119562.75 ( 14.54%)    117898.75 ( 12.95%)    114255.25 (  9.46%)
Mean   15    103699.00 (  0.00%)    115845.50 ( 11.71%)    117527.25 ( 13.33%)    109329.50 (  5.43%)
Mean   16    100955.00 (  0.00%)    113216.75 ( 12.15%)    114046.50 ( 12.97%)    108669.75 (  7.64%)
Mean   17     99528.25 (  0.00%)    112736.50 ( 13.27%)    115917.00 ( 16.47%)    113464.50 ( 14.00%)
Mean   18     97694.00 (  0.00%)    108930.00 ( 11.50%)    114137.50 ( 16.83%)    114161.25 ( 16.86%)
Stddev 1        898.91 (  0.00%)       786.81 ( 12.47%)       756.10 ( 15.89%)      1061.69 (-18.11%)
Stddev 2        676.51 (  0.00%)      1591.35 (-135.23%)       968.21 (-43.12%)       919.08 (-35.86%)
Stddev 3        629.58 (  0.00%)       291.72 ( 53.66%)      1181.68 (-87.69%)       701.90 (-11.49%)
Stddev 4        363.04 (  0.00%)      1288.56 (-254.94%)      1757.87 (-384.21%)      2050.94 (-464.94%)
Stddev 5        437.02 (  0.00%)      1148.94 (-162.90%)      1294.70 (-196.26%)       861.14 (-97.05%)
Stddev 6       1484.12 (  0.00%)       860.24 ( 42.04%)      1703.57 (-14.79%)      1367.56 (  7.85%)
Stddev 7       3856.79 (  0.00%)      1517.99 ( 60.64%)      2676.34 ( 30.61%)      1818.15 ( 52.86%)
Stddev 8       4910.41 (  0.00%)      5022.25 ( -2.28%)      3113.14 ( 36.60%)      3958.06 ( 19.39%)
Stddev 9       2107.95 (  0.00%)      2932.34 (-39.11%)      6568.79 (-211.62%)      7450.20 (-253.43%)
Stddev 10      2012.98 (  0.00%)      4649.56 (-130.98%)      2703.19 (-34.29%)      4193.34 (-108.31%)
Stddev 11      5263.81 (  0.00%)      1647.81 ( 68.70%)      4683.05 ( 11.03%)      3702.45 ( 29.66%)
Stddev 12      4316.09 (  0.00%)      2202.13 ( 48.98%)      2520.73 ( 41.60%)      3572.75 ( 17.22%)
Stddev 13      4116.97 (  0.00%)      3042.07 ( 26.11%)      1705.18 ( 58.58%)       464.36 ( 88.72%)
Stddev 14      4711.12 (  0.00%)      1597.01 ( 66.10%)      1983.88 ( 57.89%)      1513.32 ( 67.88%)
Stddev 15      4582.30 (  0.00%)      1966.56 ( 57.08%)       420.63 ( 90.82%)      1049.66 ( 77.09%)
Stddev 16      3805.96 (  0.00%)      1493.18 ( 60.77%)      2524.84 ( 33.66%)      2030.46 ( 46.65%)
Stddev 17      4560.83 (  0.00%)      1709.65 ( 62.51%)      2449.37 ( 46.30%)      1259.00 ( 72.40%)
Stddev 18      4503.57 (  0.00%)      1334.37 ( 70.37%)      1693.93 ( 62.39%)       975.71 ( 78.33%)
TPut   1     101077.00 (  0.00%)    100552.00 ( -0.52%)    102157.00 (  1.07%)    100772.00 ( -0.30%)
TPut   2     213868.00 (  0.00%)    203252.00 ( -4.96%)    211214.00 ( -1.24%)    210550.00 ( -1.55%)
TPut   3     308450.00 (  0.00%)    301097.00 ( -2.38%)    304388.00 ( -1.32%)    305297.00 ( -1.02%)
TPut   4     399715.00 (  0.00%)    389779.00 ( -2.49%)    397707.00 ( -0.50%)    399069.00 ( -0.16%)
TPut   5     478467.00 (  0.00%)    469400.00 ( -1.90%)    473669.00 ( -1.00%)    473194.00 ( -1.10%)
TPut   6     535779.00 (  0.00%)    534263.00 ( -0.28%)    541075.00 (  0.99%)    550050.00 (  2.66%)
TPut   7     548252.00 (  0.00%)    546979.00 ( -0.23%)    556873.00 (  1.57%)    555677.00 (  1.35%)
TPut   8     523257.00 (  0.00%)    548353.00 (  4.80%)    558598.00 (  6.75%)    553092.00 (  5.70%)
TPut   9     499260.00 (  0.00%)    541102.00 (  8.38%)    549978.00 ( 10.16%)    549545.00 ( 10.07%)
TPut   10    494964.00 (  0.00%)    525672.00 (  6.20%)    530648.00 (  7.21%)    529517.00 (  6.98%)
TPut   11    467865.00 (  0.00%)    500984.00 (  7.08%)    497681.00 (  6.37%)    512528.00 (  9.55%)
TPut   12    426728.00 (  0.00%)    473958.00 ( 11.07%)    478497.00 ( 12.13%)    484203.00 ( 13.47%)
TPut   13    425580.00 (  0.00%)    472575.00 ( 11.04%)    467197.00 (  9.78%)    484129.00 ( 13.76%)
TPut   14    417537.00 (  0.00%)    478251.00 ( 14.54%)    471595.00 ( 12.95%)    457021.00 (  9.46%)
TPut   15    414796.00 (  0.00%)    463382.00 ( 11.71%)    470109.00 ( 13.33%)    437318.00 (  5.43%)
TPut   16    403820.00 (  0.00%)    452867.00 ( 12.15%)    456186.00 ( 12.97%)    434679.00 (  7.64%)
TPut   17    398113.00 (  0.00%)    450946.00 ( 13.27%)    463668.00 ( 16.47%)    453858.00 ( 14.00%)
TPut   18    390776.00 (  0.00%)    435720.00 ( 11.50%)    456550.00 ( 16.83%)    456645.00 ( 16.86%)

By and large with THP disabled, balancenuma sees performance gains and the
variation between JVMs is reduced. There is little gained by the adaptive
scan in terms of throughput for larger numbers of warehouses but note it
helps for low numbers. This is because the expected savings are in system
CPU time and this cost is spent by one thread per JVM. Up to 4 warehouses
(possibly 1 thread per JVM active), all you can see is the system CPU cost
and the throughput is lower.  As the number of warehouses grow, the system
CPU cost is less obvious as the other threads make up the difference.

SPECJBB PEAKS
                                       3.7.0                      3.7.0                      3.7.0                      3.7.0
                              rc6-stats-v5r1        rc6-thpmigrate-v5r1         rc6-adaptscan-v5r1        rc6-delaystart-v5r4
 Expctd Warehouse            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)
 Expctd Peak Bops        426728.00 (  0.00%)        473958.00 ( 11.07%)        478497.00 ( 12.13%)        484203.00 ( 13.47%)
 Actual Warehouse             7.00 (  0.00%)             8.00 ( 14.29%)             8.00 ( 14.29%)             7.00 (  0.00%)
 Actual Peak Bops        548252.00 (  0.00%)        548353.00 (  0.02%)        558598.00 (  1.89%)        555677.00 (  1.35%)
 SpecJBB Bops            221334.00 (  0.00%)        248285.00 ( 12.18%)        251062.00 ( 13.43%)        246759.00 ( 11.49%)
 SpecJBB Bops/JVM         55334.00 (  0.00%)         62071.00 ( 12.18%)         62766.00 ( 13.43%)         61690.00 ( 11.49%)

Balancenuma can sustain performance for large numbers of warehouses but
its peak performance is about the same. The specjbb benchmark itself
takes a range of warehouses into account around the peak and I've included
the figures it reports this time as "SpecJBB Bops" and "SpecJBB Bops/JVM".
Balancenuma sees about an 11-13% performance gain over the vanilla kernel
for the range of warehouses.

MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0       3.7.0
        rc6-stats-v5r1  rc6-thpmigrate-v5r1  rc6-adaptscan-v5r1  rc6-delaystart-v5r4
User       203906.38   200055.62   202076.09   201985.74
System        577.16     4114.76     2129.71     2177.70
Elapsed      5030.84     5019.25     5026.83     5017.79

Note what adaptscan does to the System CPU time. A placement policy should
focus on reducing the scan rate even further.

MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0       3.7.0
                          rc6-stats-v5r1  rc6-thpmigrate-v5r1  rc6-adaptscan-v5r1  rc6-delaystart-v5r4
Page Ins                        157624      163492      164776      163348
Page Outs                       322264      491668      401644      523684
Swap Ins                             0           0           0           0
Swap Outs                            0           0           0           0
Direct pages scanned                 0           0           0           0
Kswapd pages scanned                 0           0           0           0
Kswapd pages reclaimed               0           0           0           0
Direct pages reclaimed               0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%
Page writes by reclaim               0           0           0           0
Page writes file                     0           0           0           0
Page writes anon                     0           0           0           0
Page reclaim immediate               0           0           0           0
Page rescued immediate               0           0           0           0
Slabs scanned                        0           0           0           0
Direct inode steals                  0           0           0           0
Kswapd inode steals                  0           0           0           0
Kswapd skipped wait                  0           0           0           0
THP fault alloc                      2           2           1           3
THP collapse alloc                   0           0           0           5
THP splits                           0           0           0           0
THP fault fallback                   0           0           0           0
THP collapse fail                    0           0           0           0
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success                 0   100618401    47601498    49370903
Page migrate failure                 0           0           0           0
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                      0      104441       49410       51246
NUMA PTE updates                     0   783430956   381926529   389134805
NUMA hint faults                     0   730273702   352415076   360742428
NUMA hint local faults               0   191790656    92208827    93522412
NUMA pages migrated                  0   100618401    47601498    49370903
AutoNUMA cost                        0     3658764     1765653     1807374

Note the lack of THP activity due to it being disabled. There are quite a
large number of migrations. As THP is disabled we know all the migrations
are for base pages so we can work out how much copying we're doing. Without
the scan rate adaption migration is going at a rate of about 78MB/sec on
average!  With the rate adaption, that is still a huge 38MB/sec. A
good placement and scheduling policy should be able to reduce this and
gain higher throughput by avoiding wasting CPU cycles on copying.
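
For reference, the back-of-the-envelope arithmetic behind those figures,
assuming 4K base pages and the elapsed times from the duration table above:

	100618401 pages * 4096 bytes ~= 384GB over 5019s elapsed ~= 78MB/sec
	 47601498 pages * 4096 bytes ~= 182GB over 5027s elapsed ~= 37-38MB/sec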

These are all the figures I had at the time of writing. The rest of the
tests will be running overnight and should cover the multi JVM with THP
configuration and, if all goes well, single JVM figures as well as some
kernel benchmarks, hackbench and the page fault microbenchmark as snifftests.
While I would prefer to have a full set of results before release, I am
releasing now as I was seeing evidence that people were preparing to test
the full V4 series instead of the subset that should have been used.
Hopefully this will catch them in time!

 Documentation/kernel-parameters.txt  |    3 +
 arch/sh/mm/Kconfig                   |    1 +
 arch/x86/Kconfig                     |    2 +
 arch/x86/include/asm/pgtable.h       |   17 +-
 arch/x86/include/asm/pgtable_types.h |   20 +++
 arch/x86/mm/pgtable.c                |    8 +-
 include/asm-generic/pgtable.h        |   78 +++++++++
 include/linux/huge_mm.h              |   13 +-
 include/linux/hugetlb.h              |    8 +-
 include/linux/mempolicy.h            |    8 +
 include/linux/migrate.h              |   43 ++++-
 include/linux/mm.h                   |   39 +++++
 include/linux/mm_types.h             |   31 ++++
 include/linux/mmzone.h               |   13 ++
 include/linux/sched.h                |   27 +++
 include/linux/vm_event_item.h        |   12 +-
 include/linux/vmstat.h               |    8 +
 include/trace/events/migrate.h       |   51 ++++++
 include/uapi/linux/mempolicy.h       |   15 +-
 init/Kconfig                         |   41 +++++
 kernel/fork.c                        |    3 +
 kernel/sched/core.c                  |   62 +++++--
 kernel/sched/fair.c                  |  227 +++++++++++++++++++++++++
 kernel/sched/features.h              |   11 ++
 kernel/sched/sched.h                 |    6 +
 kernel/sysctl.c                      |   45 ++++-
 mm/compaction.c                      |   15 +-
 mm/huge_memory.c                     |   93 +++++++++-
 mm/hugetlb.c                         |   10 +-
 mm/internal.h                        |    7 +-
 mm/memcontrol.c                      |    7 +-
 mm/memory-failure.c                  |    3 +-
 mm/memory.c                          |  188 ++++++++++++++++++++-
 mm/memory_hotplug.c                  |    3 +-
 mm/mempolicy.c                       |  283 ++++++++++++++++++++++++++++---
 mm/migrate.c                         |  308 +++++++++++++++++++++++++++++++++-
 mm/mprotect.c                        |  124 +++++++++++---
 mm/page_alloc.c                      |   10 +-
 mm/pgtable-generic.c                 |    9 +-
 mm/vmstat.c                          |   16 +-
 40 files changed, 1759 insertions(+), 109 deletions(-)
 create mode 100644 include/trace/events/migrate.h

-- 
1.7.9.2



* [PATCH 01/40] x86: mm: only do a local tlb flush in ptep_set_access_flags()
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 02/40] x86: mm: drop TLB flush from ptep_set_access_flags Mel Gorman
                   ` (39 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Rik van Riel <riel@redhat.com>

The function ptep_set_access_flags() is only ever invoked to set access
flags or add write permission on a PTE.  The write bit is only ever set
together with the dirty bit.

Because we only ever upgrade a PTE, it is safe to skip flushing entries on
remote TLBs. The worst that can happen is a spurious page fault on other
CPUs, which would flush that TLB entry.

Lazily letting another CPU incur a spurious page fault occasionally is
(much!) cheaper than aggressively flushing everybody else's TLB.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..be3bb46 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long)pgd);
 }
 
+/*
+ * Used to set accessed or dirty bits in the page table entries
+ * on other architectures. On x86, the accessed and dirty bits
+ * are tracked by hardware. However, do_wp_page calls this function
+ * to also make the pte writeable at the same time the dirty bit is
+ * set. In that case we do actually need to write the PTE.
+ */
 int ptep_set_access_flags(struct vm_area_struct *vma,
 			  unsigned long address, pte_t *ptep,
 			  pte_t entry, int dirty)
@@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		flush_tlb_page(vma, address);
+		__flush_tlb_one(address);
 	}
 
 	return changed;
-- 
1.7.9.2



* [PATCH 02/40] x86: mm: drop TLB flush from ptep_set_access_flags
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
  2012-11-22 19:25 ` [PATCH 01/40] x86: mm: only do a local tlb flush in ptep_set_access_flags() Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 20:56   ` Alan Cox
  2012-11-22 19:25 ` [PATCH 03/40] mm,generic: only flush the local TLB in ptep_set_access_flags Mel Gorman
                   ` (38 subsequent siblings)
  40 siblings, 1 reply; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Rik van Riel <riel@redhat.com>

Intel has an architectural guarantee that the TLB entry causing
a page fault gets invalidated automatically. This means
we should be able to drop the local TLB invalidation.

Because of the way other areas of the page fault code work,
chances are good that all x86 CPUs do this.  However, if
someone somewhere has an x86 CPU that does not invalidate
the TLB entry causing a page fault, this one-liner should
be easy to revert.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 arch/x86/mm/pgtable.c |    1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index be3bb46..7353de3 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -317,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		__flush_tlb_one(address);
 	}
 
 	return changed;
-- 
1.7.9.2



* [PATCH 03/40] mm,generic: only flush the local TLB in ptep_set_access_flags
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
  2012-11-22 19:25 ` [PATCH 01/40] x86: mm: only do a local tlb flush in ptep_set_access_flags() Mel Gorman
  2012-11-22 19:25 ` [PATCH 02/40] x86: mm: drop TLB flush from ptep_set_access_flags Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 04/40] x86/mm: Introduce pte_accessible() Mel Gorman
                   ` (37 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Rik van Riel <riel@redhat.com>

The function ptep_set_access_flags is only ever used to upgrade
access permissions to a page. That means the only negative side
effect of not flushing remote TLBs is that other CPUs may incur
spurious page faults, if they happen to access the same address,
and still have a PTE with the old permissions cached in their
TLB.

Having another CPU maybe incur a spurious page fault is faster
than always incurring the cost of a remote TLB flush, so replace
the remote TLB flush with a purely local one.

This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.

In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault to actually flush the TLB entry.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 mm/pgtable-generic.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..d8397da 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -12,8 +12,8 @@
 
 #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 /*
- * Only sets the access flags (dirty, accessed, and
- * writable). Furthermore, we know it always gets set to a "more
+ * Only sets the access flags (dirty, accessed), as well as write 
+ * permission. Furthermore, we know it always gets set to a "more
  * permissive" setting, which allows most architectures to optimize
  * this. We return whether the PTE actually changed, which in turn
  * instructs the caller to do things like update__mmu_cache.  This
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	int changed = !pte_same(*ptep, entry);
 	if (changed) {
 		set_pte_at(vma->vm_mm, address, ptep, entry);
-		flush_tlb_page(vma, address);
+		flush_tlb_fix_spurious_fault(vma, address);
 	}
 	return changed;
 }
-- 
1.7.9.2



* [PATCH 04/40] x86/mm: Introduce pte_accessible()
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (2 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 03/40] mm,generic: only flush the local TLB in ptep_set_access_flags Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 05/40] mm: Only flush the TLB when clearing an accessible pte Mel Gorman
                   ` (36 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Rik van Riel <riel@redhat.com>

We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
the pte is associated with a page.

However, for TLB flushing purposes, we would like to know whether the pte
points to an actually accessible page.  This allows us to skip remote TLB
flushes for pages that are not actually accessible.

Fill in this method for x86 and provide a safe (but slower) method
on other architectures.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Fixed-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-66p11te4uj23gevgh4j987ip@git.kernel.org
[ Added Linus's review fixes. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h |    6 ++++++
 include/asm-generic/pgtable.h  |    4 ++++
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1f780d..5fe03aa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -407,6 +407,12 @@ static inline int pte_present(pte_t a)
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
+#define pte_accessible pte_accessible
+static inline int pte_accessible(pte_t a)
+{
+	return pte_flags(a) & _PAGE_PRESENT;
+}
+
 static inline int pte_hidden(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_HIDDEN;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b36ce40..48fc1dc 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #define move_pte(pte, prot, old_addr, new_addr)	(pte)
 #endif
 
+#ifndef pte_accessible
+# define pte_accessible(pte)		((void)(pte),1)
+#endif
+
 #ifndef flush_tlb_fix_spurious_fault
 #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
 #endif
-- 
1.7.9.2



* [PATCH 05/40] mm: Only flush the TLB when clearing an accessible pte
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (3 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 04/40] x86/mm: Introduce pte_accessible() Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 06/40] mm: Count the number of pages affected in change_protection() Mel Gorman
                   ` (35 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Rik van Riel <riel@redhat.com>

If ptep_clear_flush() is called to clear a page table entry that is
accessible anyway by the CPU, eg. a _PAGE_PROTNONE page table entry,
there is no need to flush the TLB on remote CPUs.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-vm3rkzevahelwhejx5uwm8ex@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/pgtable-generic.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d8397da..0c8323f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -88,7 +88,8 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
 {
 	pte_t pte;
 	pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
-	flush_tlb_page(vma, address);
+	if (pte_accessible(pte))
+		flush_tlb_page(vma, address);
 	return pte;
 }
 #endif
-- 
1.7.9.2



* [PATCH 06/40] mm: Count the number of pages affected in change_protection()
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (4 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 05/40] mm: Only flush the TLB when clearing an accessible pte Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 07/40] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users Mel Gorman
                   ` (34 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

This will be used for three kinds of purposes:

 - to optimize mprotect()

 - to speed up working set scanning for working set areas that
   have not been touched

 - to more accurately scan per real working set

No change in functionality from this patch.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/hugetlb.h |    8 +++++--
 include/linux/mm.h      |    3 +++
 mm/hugetlb.c            |   10 ++++++--
 mm/mprotect.c           |   58 +++++++++++++++++++++++++++++++++++------------
 4 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2251648..06e691b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -87,7 +87,7 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
 				pud_t *pud, int write);
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot);
 
 #else /* !CONFIG_HUGETLB_PAGE */
@@ -132,7 +132,11 @@ static inline void copy_huge_page(struct page *dst, struct page *src)
 {
 }
 
-#define hugetlb_change_protection(vma, address, end, newprot)
+static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+		unsigned long address, unsigned long end, pgprot_t newprot)
+{
+	return 0;
+}
 
 static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			struct vm_area_struct *vma, unsigned long start,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bcaab4e..1856c62 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1078,6 +1078,9 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 extern unsigned long do_mremap(unsigned long addr,
 			       unsigned long old_len, unsigned long new_len,
 			       unsigned long flags, unsigned long new_addr);
+extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end, pgprot_t newprot,
+			      int dirty_accountable);
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 59a0059..712895e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3014,7 +3014,7 @@ same_page:
 	return i ? i : -EFAULT;
 }
 
-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot)
 {
 	struct mm_struct *mm = vma->vm_mm;
@@ -3022,6 +3022,7 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
+	unsigned long pages = 0;
 
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
@@ -3032,12 +3033,15 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
 		ptep = huge_pte_offset(mm, address);
 		if (!ptep)
 			continue;
-		if (huge_pmd_unshare(mm, &address, ptep))
+		if (huge_pmd_unshare(mm, &address, ptep)) {
+			pages++;
 			continue;
+		}
 		if (!huge_pte_none(huge_ptep_get(ptep))) {
 			pte = huge_ptep_get_and_clear(mm, address, ptep);
 			pte = pte_mkhuge(pte_modify(pte, newprot));
 			set_huge_pte_at(mm, address, ptep, pte);
+			pages++;
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
@@ -3049,6 +3053,8 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+
+	return pages << h->order;
 }
 
 int hugetlb_reserve_pages(struct inode *inode,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..1e265be 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -35,12 +35,13 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 }
 #endif
 
-static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
+	unsigned long pages = 0;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -60,6 +61,7 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				ptent = pte_mkwrite(ptent);
 
 			ptep_modify_prot_commit(mm, addr, pte, ptent);
+			pages++;
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 
@@ -72,18 +74,22 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				set_pte_at(mm, addr, pte,
 					swp_entry_to_pte(entry));
 			}
+			pages++;
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
+
+	return pages;
 }
 
-static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
 	pmd_t *pmd;
 	unsigned long next;
+	unsigned long pages = 0;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -91,35 +97,42 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				split_huge_page_pmd(vma->vm_mm, pmd);
-			else if (change_huge_pmd(vma, pmd, addr, newprot))
+			else if (change_huge_pmd(vma, pmd, addr, newprot)) {
+				pages += HPAGE_PMD_NR;
 				continue;
+			}
 			/* fall through */
 		}
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
+		pages += change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
 				 dirty_accountable);
 	} while (pmd++, addr = next, addr != end);
+
+	return pages;
 }
 
-static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
 	pud_t *pud;
 	unsigned long next;
+	unsigned long pages = 0;
 
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		change_pmd_range(vma, pud, addr, next, newprot,
+		pages += change_pmd_range(vma, pud, addr, next, newprot,
 				 dirty_accountable);
 	} while (pud++, addr = next, addr != end);
+
+	return pages;
 }
 
-static void change_protection(struct vm_area_struct *vma,
+static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
@@ -127,6 +140,7 @@ static void change_protection(struct vm_area_struct *vma,
 	pgd_t *pgd;
 	unsigned long next;
 	unsigned long start = addr;
+	unsigned long pages = 0;
 
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
@@ -135,10 +149,30 @@ static void change_protection(struct vm_area_struct *vma,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		change_pud_range(vma, pgd, addr, next, newprot,
+		pages += change_pud_range(vma, pgd, addr, next, newprot,
 				 dirty_accountable);
 	} while (pgd++, addr = next, addr != end);
+
 	flush_tlb_range(vma, start, end);
+
+	return pages;
+}
+
+unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+		       unsigned long end, pgprot_t newprot,
+		       int dirty_accountable)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long pages;
+
+	mmu_notifier_invalidate_range_start(mm, start, end);
+	if (is_vm_hugetlb_page(vma))
+		pages = hugetlb_change_protection(vma, start, end, newprot);
+	else
+		pages = change_protection_range(vma, start, end, newprot, dirty_accountable);
+	mmu_notifier_invalidate_range_end(mm, start, end);
+
+	return pages;
 }
 
 int
@@ -213,12 +247,8 @@ success:
 		dirty_accountable = 1;
 	}
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
-	if (is_vm_hugetlb_page(vma))
-		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
-	else
-		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	perf_event_mmap(vma);
-- 
1.7.9.2



* [PATCH 07/40] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (5 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 06/40] mm: Count the number of pages affected in change_protection() Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 08/40] mm: compaction: Move migration fail/success stats to migrate.c Mel Gorman
                   ` (33 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Ingo Molnar <mingo@kernel.org>

Reuse the NUMA code's 'modified page protections' count that
change_protection() computes and skip the TLB flush if there's
no changes to a range that sys_mprotect() modifies.

Given that mprotect() already optimizes the same-flags case
I expected this optimization to dominantly trigger on
CONFIG_NUMA_BALANCING=y kernels - but even with that feature
disabled it triggers rather often.

There's two reasons for that:

1)

While sys_mprotect() already optimizes the same-flag case:

        if (newflags == oldflags) {
                *pprev = vma;
                return 0;
        }

and this test works in many cases, but it is too sharp in some
others, where it differentiates between protection values that the
underlying PTE format makes no distinction about, such as
PROT_EXEC == PROT_READ on x86.

2)

Even where the pte format over vma flag changes necessitates a
modification of the pagetables, there might be no pagetables
yet to modify: they might not be instantiated yet.

During a regular desktop bootup this optimization hits a couple
of hundred times. During a Java test I measured thousands of
hits.

So this optimization improves sys_mprotect() in general, not just
CONFIG_NUMA_BALANCING=y kernels.

[ We could further increase the efficiency of this optimization if
  change_pte_range() and change_huge_pmd() was a bit smarter about
  recognizing exact-same-value protection masks - when the hardware
  can do that safely. This would probably further speed up mprotect(). ]

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/mprotect.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1e265be..7c3628a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -153,7 +153,9 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 				 dirty_accountable);
 	} while (pgd++, addr = next, addr != end);
 
-	flush_tlb_range(vma, start, end);
+	/* Only flush the TLB if we actually modified any entries: */
+	if (pages)
+		flush_tlb_range(vma, start, end);
 
 	return pages;
 }
-- 
1.7.9.2



* [PATCH 08/40] mm: compaction: Move migration fail/success stats to migrate.c
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (6 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 07/40] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 09/40] mm: migrate: Add a tracepoint for migrate_pages Mel Gorman
                   ` (32 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

The compact_pages_moved and compact_pagemigrate_failed events are
convenient for determining if compaction is active and to what
degree migration is succeeding, but they are at the wrong level. Other
users of migration may also want to know if migration is working
properly and this will be particularly true for any automated
NUMA migration. This patch moves the counters down to migration
with the new events called pgmigrate_success and pgmigrate_fail.
The compact_blocks_moved counter is removed because while it was
useful for debugging initially, it's worthless now as no meaningful
conclusions can be drawn from its value.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vm_event_item.h |    4 +++-
 mm/compaction.c               |    4 ----
 mm/migrate.c                  |    6 ++++++
 mm/vmstat.c                   |    7 ++++---
 4 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..8aa7cb9 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,8 +38,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_MIGRATION
+		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
+#endif
 #ifdef CONFIG_COMPACTION
-		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #endif
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/compaction.c b/mm/compaction.c
index 9eef558..00ad883 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -994,10 +994,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
-		count_vm_event(COMPACTBLOCKS);
-		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
-		if (nr_remaining)
-			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
 		trace_mm_compaction_migratepages(nr_migrate - nr_remaining,
 						nr_remaining);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..04687f6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -962,6 +962,7 @@ int migrate_pages(struct list_head *from,
 {
 	int retry = 1;
 	int nr_failed = 0;
+	int nr_succeeded = 0;
 	int pass = 0;
 	struct page *page;
 	struct page *page2;
@@ -988,6 +989,7 @@ int migrate_pages(struct list_head *from,
 				retry++;
 				break;
 			case 0:
+				nr_succeeded++;
 				break;
 			default:
 				/* Permanent failure */
@@ -998,6 +1000,10 @@ int migrate_pages(struct list_head *from,
 	}
 	rc = 0;
 out:
+	if (nr_succeeded)
+		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
+	if (nr_failed)
+		count_vm_events(PGMIGRATE_FAIL, nr_failed);
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..89a7fd6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -774,10 +774,11 @@ const char * const vmstat_text[] = {
 
 	"pgrotated",
 
+#ifdef CONFIG_MIGRATION
+	"pgmigrate_success",
+	"pgmigrate_fail",
+#endif
 #ifdef CONFIG_COMPACTION
-	"compact_blocks_moved",
-	"compact_pages_moved",
-	"compact_pagemigrate_failed",
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
-- 
1.7.9.2



* [PATCH 09/40] mm: migrate: Add a tracepoint for migrate_pages
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (7 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 08/40] mm: compaction: Move migration fail/success stats to migrate.c Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 10/40] mm: compaction: Add scanned and isolated counters for compaction Mel Gorman
                   ` (31 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
about migration activity but not the type or the reason. This patch adds
a tracepoint to identify the type of page migration and why the page is
being migrated.
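
For reference, the event can be consumed through the usual ftrace
interface. The sketch below is illustrative only and assumes the tracing
directory is mounted at /sys/kernel/debug/tracing (adjust TRACING if it
lives elsewhere); each event line carries the nr_succeeded/nr_failed/
mode/reason fields from the TP_printk format in the patch.

/* Illustrative only: enable the mm_migrate_pages tracepoint and echo it. */
#include <stdio.h>

#define TRACING "/sys/kernel/debug/tracing"

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	char line[512];
	FILE *pipe;

	if (write_str(TRACING "/events/migrate/mm_migrate_pages/enable", "1")) {
		perror("enable tracepoint");
		return 1;
	}

	pipe = fopen(TRACING "/trace_pipe", "r");
	if (!pipe) {
		perror("trace_pipe");
		return 1;
	}

	/* Blocks until events arrive; interrupt with Ctrl-C. */
	while (fgets(line, sizeof(line), pipe))
		fputs(line, stdout);

	return 0;
}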

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/migrate.h        |   13 ++++++++--
 include/trace/events/migrate.h |   51 ++++++++++++++++++++++++++++++++++++++++
 mm/compaction.c                |    3 ++-
 mm/memory-failure.c            |    3 ++-
 mm/memory_hotplug.c            |    3 ++-
 mm/mempolicy.c                 |    6 +++--
 mm/migrate.c                   |   10 ++++++--
 mm/page_alloc.c                |    3 ++-
 8 files changed, 82 insertions(+), 10 deletions(-)
 create mode 100644 include/trace/events/migrate.h

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..9d1c159 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -7,6 +7,15 @@
 
 typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
+enum migrate_reason {
+	MR_COMPACTION,
+	MR_MEMORY_FAILURE,
+	MR_MEMORY_HOTPLUG,
+	MR_SYSCALL,		/* also applies to cpusets */
+	MR_MEMPOLICY_MBIND,
+	MR_CMA
+};
+
 #ifdef CONFIG_MIGRATION
 
 extern void putback_lru_pages(struct list_head *l);
@@ -14,7 +23,7 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
-			enum migrate_mode mode);
+			enum migrate_mode mode, int reason);
 extern int migrate_huge_page(struct page *, new_page_t x,
 			unsigned long private, bool offlining,
 			enum migrate_mode mode);
@@ -35,7 +44,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
-		enum migrate_mode mode) { return -ENOSYS; }
+		enum migrate_mode mode, int reason) { return -ENOSYS; }
 static inline int migrate_huge_page(struct page *page, new_page_t x,
 		unsigned long private, bool offlining,
 		enum migrate_mode mode) { return -ENOSYS; }
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
new file mode 100644
index 0000000..ec2a6cc
--- /dev/null
+++ b/include/trace/events/migrate.h
@@ -0,0 +1,51 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM migrate
+
+#if !defined(_TRACE_MIGRATE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MIGRATE_H
+
+#define MIGRATE_MODE						\
+	{MIGRATE_ASYNC,		"MIGRATE_ASYNC"},		\
+	{MIGRATE_SYNC_LIGHT,	"MIGRATE_SYNC_LIGHT"},		\
+	{MIGRATE_SYNC,		"MIGRATE_SYNC"}		
+
+#define MIGRATE_REASON						\
+	{MR_COMPACTION,		"compaction"},			\
+	{MR_MEMORY_FAILURE,	"memory_failure"},		\
+	{MR_MEMORY_HOTPLUG,	"memory_hotplug"},		\
+	{MR_SYSCALL,		"syscall_or_cpuset"},		\
+	{MR_MEMPOLICY_MBIND,	"mempolicy_mbind"},		\
+	{MR_CMA,		"cma"}
+
+TRACE_EVENT(mm_migrate_pages,
+
+	TP_PROTO(unsigned long succeeded, unsigned long failed,
+		 enum migrate_mode mode, int reason),
+
+	TP_ARGS(succeeded, failed, mode, reason),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,		succeeded)
+		__field(	unsigned long,		failed)
+		__field(	enum migrate_mode,	mode)
+		__field(	int,			reason)
+	),
+
+	TP_fast_assign(
+		__entry->succeeded	= succeeded;
+		__entry->failed		= failed;
+		__entry->mode		= mode;
+		__entry->reason		= reason;
+	),
+
+	TP_printk("nr_succeeded=%lu nr_failed=%lu mode=%s reason=%s",
+		__entry->succeeded,
+		__entry->failed,
+		__print_symbolic(__entry->mode, MIGRATE_MODE),
+		__print_symbolic(__entry->reason, MIGRATE_REASON))
+);
+
+#endif /* _TRACE_MIGRATE_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/compaction.c b/mm/compaction.c
index 00ad883..2c077a7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -990,7 +990,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		nr_migrate = cc->nr_migratepages;
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				(unsigned long)cc, false,
-				cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);
+				cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC,
+				MR_COMPACTION);
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6c5899b..ddb68a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1558,7 +1558,8 @@ int soft_offline_page(struct page *page, int flags)
 					    page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
-							false, MIGRATE_SYNC);
+							false, MIGRATE_SYNC,
+							MR_MEMORY_FAILURE);
 		if (ret) {
 			putback_lru_pages(&pagelist);
 			pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4eeaca..e598bd1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -812,7 +812,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		 * migrate_pages returns # of failed pages.
 		 */
 		ret = migrate_pages(&source, alloc_migrate_target, 0,
-							true, MIGRATE_SYNC);
+							true, MIGRATE_SYNC,
+							MR_MEMORY_HOTPLUG);
 		if (ret)
 			putback_lru_pages(&source);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..66e90ec 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -961,7 +961,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_node_page, dest,
-							false, MIGRATE_SYNC);
+							false, MIGRATE_SYNC,
+							MR_SYSCALL);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
@@ -1202,7 +1203,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 		if (!list_empty(&pagelist)) {
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma,
-						false, MIGRATE_SYNC);
+						false, MIGRATE_SYNC,
+						MR_MEMPOLICY_MBIND);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index 04687f6..27be9c9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -38,6 +38,9 @@
 
 #include <asm/tlbflush.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/migrate.h>
+
 #include "internal.h"
 
 /*
@@ -958,7 +961,7 @@ out:
  */
 int migrate_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
-		enum migrate_mode mode)
+		enum migrate_mode mode, int reason)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -1004,6 +1007,8 @@ out:
 		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
 	if (nr_failed)
 		count_vm_events(PGMIGRATE_FAIL, nr_failed);
+	trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
+
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
@@ -1145,7 +1150,8 @@ set_status:
 	err = 0;
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_page_node,
-				(unsigned long)pm, 0, MIGRATE_SYNC);
+				(unsigned long)pm, 0, MIGRATE_SYNC,
+				MR_SYSCALL);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7bb35ac..5953dc2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5707,7 +5707,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 
 		ret = migrate_pages(&cc->migratepages,
 				    alloc_migrate_target,
-				    0, false, MIGRATE_SYNC);
+				    0, false, MIGRATE_SYNC,
+				    MR_CMA);
 	}
 
 	putback_lru_pages(&cc->migratepages);
-- 
1.7.9.2



* [PATCH 10/40] mm: compaction: Add scanned and isolated counters for compaction
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (8 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 09/40] mm: migrate: Add a tracepoint for migrate_pages Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 11/40] mm: numa: define _PAGE_NUMA Mel Gorman
                   ` (30 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

Compaction already has tracepoints to count scanned and isolated pages,
but they require that ftrace be enabled, and if that information has to be
written to disk then it can be disruptive. This patch adds vmstat counters
for compaction called compact_migrate_scanned, compact_free_scanned and
compact_isolated.

With these counters, it is possible to define a basic cost model for
compaction. This approximates how much work compaction is doing and can
be compared with an oprofile showing TLB misses to see if the cost of
compaction is being offset by THP for example. Minimally a compaction patch
can be evaluated in terms of whether it increases or decreases cost. The
basic cost model looks like this

Fundamental unit u:	a word	sizeof(void *)

Ca  = cost of struct page access = sizeof(struct page) / u

Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
Cmf = Cost migrate failure   = Ca * 2
Ci  = Cost page isolation    = (Ca + Wi)
	where Wi is a constant that should reflect the approximate
	cost of the locking operation.

Csm = Cost migrate scanning = Ca
Csf = Cost free    scanning = Ca

Overall cost =	(Csm * compact_migrate_scanned) +
	      	(Csf * compact_free_scanned)    +
	      	(Ci  * compact_isolated)	+
		(Cmc * pgmigrate_success)	+
		(Cmf * pgmigrate_fail)

Where the values are read from /proc/vmstat.

This is very basic and ignores certain costs such as the allocation cost
to do a migrate page copy but any improvement to the model would still
use the same vmstat counters.
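
Purely as an illustration (not part of the patch), the model can be
evaluated with a trivial userspace program that reads the counters from
/proc/vmstat. The struct page size, PAGE_SIZE and Wi values below are
assumptions picked for the example, not measurements.

/* Illustrative only: evaluate the cost model above from /proc/vmstat. */
#include <stdio.h>
#include <string.h>

static unsigned long vmstat(const char *name)
{
	char key[64];
	unsigned long val = 0, v;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %lu", key, &v) == 2) {
		if (!strcmp(key, name)) {
			val = v;
			break;
		}
	}
	fclose(f);
	return val;	/* missing counters simply count as zero */
}

int main(void)
{
	double u = sizeof(void *);
	double ca = 64 / u;			/* assumes sizeof(struct page) == 64 */
	double cmc = (ca + 4096 / u) * 2;	/* assumes PAGE_SIZE == 4096 */
	double cmf = ca * 2;			/* migrate failure */
	double wi = 10;				/* arbitrary locking cost */
	double ci = ca + wi;			/* page isolation */
	double csm = ca, csf = ca;		/* scanning */

	double cost = csm * vmstat("compact_migrate_scanned") +
		      csf * vmstat("compact_free_scanned") +
		      ci  * vmstat("compact_isolated") +
		      cmc * vmstat("pgmigrate_success") +
		      cmf * vmstat("pgmigrate_fail");

	printf("approximate compaction cost: %.0f units\n", cost);
	return 0;
}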

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vm_event_item.h |    2 ++
 mm/compaction.c               |    8 ++++++++
 mm/vmstat.c                   |    3 +++
 3 files changed, 13 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8aa7cb9..a1f750b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -42,6 +42,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
 #endif
 #ifdef CONFIG_COMPACTION
+		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
+		COMPACTISOLATED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #endif
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/compaction.c b/mm/compaction.c
index 2c077a7..aee7443 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -356,6 +356,10 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 	if (blockpfn == end_pfn)
 		update_pageblock_skip(cc, valid_page, total_isolated, false);
 
+	count_vm_events(COMPACTFREE_SCANNED, nr_scanned);
+	if (total_isolated)
+		count_vm_events(COMPACTISOLATED, total_isolated);
+
 	return total_isolated;
 }
 
@@ -646,6 +650,10 @@ next_pageblock:
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
+	count_vm_events(COMPACTMIGRATE_SCANNED, nr_scanned);
+	if (nr_isolated)
+		count_vm_events(COMPACTISOLATED, nr_isolated);
+
 	return low_pfn;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 89a7fd6..3a067fa 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -779,6 +779,9 @@ const char * const vmstat_text[] = {
 	"pgmigrate_fail",
 #endif
 #ifdef CONFIG_COMPACTION
+	"compact_migrate_scanned",
+	"compact_free_scanned",
+	"compact_isolated",
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
-- 
1.7.9.2



* [PATCH 11/40] mm: numa: define _PAGE_NUMA
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (9 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 10/40] mm: compaction: Add scanned and isolated counters for compaction Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 12/40] mm: numa: pte_numa() and pmd_numa() Mel Gorman
                   ` (29 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
faults to identify the per NUMA node working set of the thread at
runtime.

Arming the NUMA hinting page fault mechanism works similarly to
setting up a mprotect(PROT_NONE) virtual range: the present bit is
cleared at the same time that _PAGE_NUMA is set, so when the fault
triggers we can identify it as a NUMA hinting page fault.

_PAGE_NUMA on x86 shares the same bit number as _PAGE_PROTNONE (but it
could also use a different bitflag; it's up to the architecture to
decide).

It would be confusing to call the "NUMA hinting page faults"
"do_prot_none faults": they are different events, and _PAGE_NUMA doesn't
alter the semantics of mprotect(PROT_NONE) in any way.

Sharing the same bitflag with _PAGE_PROTNONE in fact complicates
things: it requires us to ensure the code paths executed by
_PAGE_PROTNONE remain mutually exclusive from the code paths executed
by _PAGE_NUMA at all times, to avoid _PAGE_NUMA and _PAGE_PROTNONE
stepping on each other's toes.

Because we want to be able to set this bitflag in any established pte
or pmd (while clearing the present bit at the same time) without
losing information, this bitflag must never be set when the pte or
pmd is present, so the bitflag picked for _PAGE_NUMA usage must not
be used by the swap entry format.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ec8a1fc..3c32db8 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -64,6 +64,26 @@
 #define _PAGE_FILE	(_AT(pteval_t, 1) << _PAGE_BIT_FILE)
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
+/*
+ * _PAGE_NUMA indicates that this page will trigger a numa hinting
+ * minor page fault to gather numa placement statistics (see
+ * pte_numa()). The bit picked (8) is within the range between
+ * _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't
+ * require changes to the swp entry format because that bit is always
+ * zero when the pte is not present.
+ *
+ * The bit picked must be always zero when the pmd is present and not
+ * present, so that we don't lose information when we set it while
+ * atomically clearing the present bit.
+ *
+ * Because we shared the same bit (8) with _PAGE_PROTNONE this can be
+ * interpreted as _PAGE_NUMA only in places that _PAGE_PROTNONE
+ * couldn't reach, like handle_mm_fault() (see access_error in
+ * arch/x86/mm/fault.c, the vma protection must not be PROT_NONE for
+ * handle_mm_fault() to be invoked).
+ */
+#define _PAGE_NUMA	_PAGE_PROTNONE
+
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |	\
 			 _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
-- 
1.7.9.2



* [PATCH 12/40] mm: numa: pte_numa() and pmd_numa()
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (10 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 11/40] mm: numa: define _PAGE_NUMA Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 13/40] mm: numa: Support NUMA hinting page faults from gup/gup_fast Mel Gorman
                   ` (28 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

Implement pte_numa and pmd_numa.

We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.

Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
a thread touches a virtual address in the corresponding virtual range,
a NUMA hinting page fault will trigger. The NUMA hinting page fault
will clear the NUMA bit and set the present bit again to resolve the
page fault.

The expectation is that a NUMA hinting page fault is used as part
of a placement policy that decides if a page should remain on the
current node or migrated to a different node.
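
The state machine described above can be summarised with a tiny
standalone model (illustrative only; the bit positions are placeholders
rather than the real x86 values) that mirrors the pte_mknuma(),
pte_mknonnuma() and pte_numa() helpers added below.

/* Illustrative only: model the NUMA hinting pte lifecycle. */
#include <assert.h>
#include <stdio.h>

#define P_PRESENT	(1UL << 0)	/* placeholder for _PAGE_PRESENT */
#define P_ACCESSED	(1UL << 5)	/* placeholder for _PAGE_ACCESSED */
#define P_NUMA		(1UL << 8)	/* placeholder for _PAGE_NUMA/_PAGE_PROTNONE */

static unsigned long mknuma(unsigned long pte)
{
	return (pte | P_NUMA) & ~P_PRESENT;
}

static unsigned long mknonnuma(unsigned long pte)
{
	return (pte & ~P_NUMA) | P_PRESENT | P_ACCESSED;
}

static int is_numa(unsigned long pte)
{
	return (pte & (P_NUMA | P_PRESENT)) == P_NUMA;
}

int main(void)
{
	unsigned long pte = P_PRESENT | P_ACCESSED;	/* established mapping */

	pte = mknuma(pte);		/* scanner arms the hinting fault */
	assert(is_numa(pte));		/* hardware now faults on access */

	pte = mknonnuma(pte);		/* fault handler resolves it */
	assert(!is_numa(pte) && (pte & P_PRESENT));

	printf("NUMA hinting pte lifecycle holds\n");
	return 0;
}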

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/include/asm/pgtable.h |   11 ++++--
 include/asm-generic/pgtable.h  |   74 ++++++++++++++++++++++++++++++++++++++++
 init/Kconfig                   |   33 ++++++++++++++++++
 3 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5fe03aa..9cd7b72 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -404,7 +404,8 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+			       _PAGE_NUMA);
 }
 
 #define pte_accessible pte_accessible
@@ -426,7 +427,8 @@ static inline int pmd_present(pmd_t pmd)
 	 * the _PAGE_PSE flag will remain set at all times while the
 	 * _PAGE_PRESENT bit is clear).
 	 */
-	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
+				 _PAGE_NUMA);
 }
 
 static inline int pmd_none(pmd_t pmd)
@@ -485,6 +487,11 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
+#ifdef CONFIG_BALANCE_NUMA
+	/* pmd_numa check */
+	if ((pmd_flags(pmd) & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)
+		return 0;
+#endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 48fc1dc..1e236fe 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -558,6 +558,80 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+/*
+ * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
+ * same bit too). It's set only when _PAGE_PRESENT is not set and it's
+ * never set if _PAGE_PRESENT is set.
+ *
+ * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
+ * fault triggers on those regions if pte/pmd_numa returns true
+ * (because _PAGE_PRESENT is not set).
+ */
+#ifndef pte_numa
+static inline int pte_numa(pte_t pte)
+{
+	return (pte_flags(pte) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+#endif
+
+#ifndef pmd_numa
+static inline int pmd_numa(pmd_t pmd)
+{
+	return (pmd_flags(pmd) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+#endif
+
+/*
+ * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
+ * because they're called by the NUMA hinting minor page fault. If we
+ * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
+ * would be forced to set it later while filling the TLB after we
+ * return to userland. That would trigger a second write to memory
+ * that we optimize away by setting _PAGE_ACCESSED here.
+ */
+#ifndef pte_mknonnuma
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_NUMA);
+	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+#endif
+
+#ifndef pmd_mknonnuma
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
+	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+#endif
+
+#ifndef pte_mknuma
+static inline pte_t pte_mknuma(pte_t pte)
+{
+	pte = pte_set_flags(pte, _PAGE_NUMA);
+	return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+#endif
+
+#ifndef pmd_mknuma
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+	pmd = pmd_set_flags(pmd, _PAGE_NUMA);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
+#endif
+#else
+extern int pte_numa(pte_t pte);
+extern int pmd_numa(pmd_t pmd);
+extern pte_t pte_mknonnuma(pte_t pte);
+extern pmd_t pmd_mknonnuma(pmd_t pmd);
+extern pte_t pte_mknuma(pte_t pte);
+extern pmd_t pmd_mknuma(pmd_t pmd);
+#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
+
 #endif /* CONFIG_MMU */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..6897a05 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -696,6 +696,39 @@ config LOG_BUF_SHIFT
 config HAVE_UNSTABLE_SCHED_CLOCK
 	bool
 
+#
+# For architectures that want to enable the support for NUMA-affine scheduler
+# balancing logic:
+#
+config ARCH_SUPPORTS_NUMA_BALANCING
+	bool
+
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config ARCH_WANT_NUMA_VARIABLE_LOCALITY
+	bool
+
+#
+# For architectures that are willing to define _PAGE_NUMA as _PAGE_PROTNONE
+config ARCH_WANTS_PROT_NUMA_PROT_NONE
+	bool
+
+config ARCH_USES_NUMA_PROT_NONE
+	bool
+	default y
+	depends on ARCH_WANTS_PROT_NUMA_PROT_NONE
+	depends on BALANCE_NUMA
+
+config BALANCE_NUMA
+	bool "Memory placement aware NUMA scheduler"
+	default n
+	depends on ARCH_SUPPORTS_NUMA_BALANCING
+	depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
+	depends on SMP && NUMA && MIGRATION
+	help
+	  This option adds support for automatic NUMA aware memory/task placement.
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
-- 
1.7.9.2



* [PATCH 13/40] mm: numa: Support NUMA hinting page faults from gup/gup_fast
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (11 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 12/40] mm: numa: pte_numa() and pmd_numa() Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 14/40] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte Mel Gorman
                   ` (27 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

Introduce FOLL_NUMA to tell follow_page to check
pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
so because it always invokes handle_mm_fault and retries the
follow_page later.

KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.

Other follow_page callers like KSM should not use FOLL_NUMA, or they
would fail to get the pages if they use follow_page instead of
get_user_pages.

[ This patch was picked up from the AutoNUMA tree. ]

Originally-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ ported to this tree. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mm.h |    1 +
 mm/memory.c        |   17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1856c62..fa16152 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1572,6 +1572,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_MLOCK	0x40	/* mark page as mlocked */
 #define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
+#define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/memory.c b/mm/memory.c
index 221fc9f..73834e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1517,6 +1517,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
 		goto out;
 	}
+	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+		goto no_page_table;
 	if (pmd_trans_huge(*pmd)) {
 		if (flags & FOLL_SPLIT) {
 			split_huge_page_pmd(mm, pmd);
@@ -1546,6 +1548,8 @@ split_fallthrough:
 	pte = *ptep;
 	if (!pte_present(pte))
 		goto no_page;
+	if ((flags & FOLL_NUMA) && pte_numa(pte))
+		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
 
@@ -1697,6 +1701,19 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
 	vm_flags &= (gup_flags & FOLL_FORCE) ?
 			(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+
+	/*
+	 * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
+	 * would be called on PROT_NONE ranges. We must never invoke
+	 * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
+	 * page faults would unprotect the PROT_NONE ranges if
+	 * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
+	 * bitflag. So to avoid that, don't set FOLL_NUMA if
+	 * FOLL_FORCE is set.
+	 */
+	if (!(gup_flags & FOLL_FORCE))
+		gup_flags |= FOLL_NUMA;
+
 	i = 0;
 
 	do {
-- 
1.7.9.2



* [PATCH 14/40] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (12 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 13/40] mm: numa: Support NUMA hinting page faults from gup/gup_fast Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 15/40] mm: numa: Create basic numa page hinting infrastructure Mel Gorman
                   ` (26 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

When we split a transparent hugepage, transfer the NUMA type from the
pmd to the pte if needed.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/huge_memory.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..3aaf242 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1363,6 +1363,8 @@ static int __split_huge_page_map(struct page *page,
 				BUG_ON(page_mapcount(page) != 1);
 			if (!pmd_young(*pmd))
 				entry = pte_mkold(entry);
+			if (pmd_numa(*pmd))
+				entry = pte_mknuma(entry);
 			pte = pte_offset_map(&_pmd, haddr);
 			BUG_ON(!pte_none(*pte));
 			set_pte_at(mm, haddr, pte, entry);
-- 
1.7.9.2



* [PATCH 15/40] mm: numa: Create basic numa page hinting infrastructure
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (13 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 14/40] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 16/40] mm: mempolicy: Make MPOL_LOCAL a real policy Mel Gorman
                   ` (25 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

Note: This patch started as "mm/mpol: Create special PROT_NONE
	infrastructure" and preserves the basic idea but steals *very*
	heavily from "autonuma: numa hinting page faults entry points" for
	the actual fault handlers without the migration parts.	The end
	result is barely recognisable as either patch so all Signed-off
	and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
	this version, I will re-add the signed-offs-by to reflect the history.

In order to facilitate a lazy -- fault driven -- migration of pages, create
a special transient PAGE_NUMA variant; we can then use the 'spurious'
protection faults it generates to drive our migrations.

The meaning of PAGE_NUMA depends on the architecture but on x86 it is
effectively PROT_NONE. Actual PROT_NONE mappings will not generate these
NUMA faults because the page fault code checks the permissions on
the VMA (and will throw a segmentation fault on actual PROT_NONE mappings)
before it ever calls handle_mm_fault.
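
That distinction is easy to demonstrate from userspace; the sketch below
(illustrative only, not part of the patch) touches a genuine PROT_NONE
mapping and receives SIGSEGV from the VMA permission check, so such
accesses never show up as NUMA hinting faults.

/* Illustrative only: a real PROT_NONE access delivers SIGSEGV. */
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>

static sigjmp_buf env;

static void on_segv(int sig)
{
	(void)sig;
	siglongjmp(env, 1);
}

int main(void)
{
	volatile char *p;

	signal(SIGSEGV, on_segv);

	p = mmap(NULL, 4096, PROT_NONE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	if (sigsetjmp(env, 1) == 0) {
		char c = p[0];	/* faults: the VMA has no access rights */
		(void)c;
		printf("unexpected: read succeeded\n");
	} else {
		printf("SIGSEGV as expected for PROT_NONE access\n");
	}

	munmap((void *)p, 4096);
	return 0;
}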

[dhillf@gmail.com: Fix typo]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/huge_mm.h |   10 +++++
 mm/huge_memory.c        |   21 ++++++++++
 mm/memory.c             |  104 +++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 132 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..a13ebb1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -159,6 +159,10 @@ static inline struct page *compound_trans_head(struct page *page)
 	}
 	return page;
 }
+
+extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				  unsigned long addr, pmd_t pmd, pmd_t *pmdp);
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -195,6 +199,12 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
 {
 	return 0;
 }
+
+static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+					unsigned long addr, pmd_t pmd, pmd_t *pmdp)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3aaf242..7224efd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1017,6 +1017,27 @@ out:
 	return page;
 }
 
+/* NUMA hinting page fault entry point for trans huge pmds */
+int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
+{
+	struct page *page;
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(pmd, *pmdp)))
+		goto out_unlock;
+
+	page = pmd_page(pmd);
+	pmd = pmd_mknonnuma(pmd);
+	set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+	VM_BUG_ON(pmd_numa(*pmdp));
+	update_mmu_cache_pmd(vma, addr, pmdp);
+
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+	return 0;
+}
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 73834e7..277b6d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3448,6 +3448,95 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+{
+	struct page *page;
+	spinlock_t *ptl;
+
+	/*
+	* The "pte" at this point cannot be used safely without
+	* validation through pte_unmap_same(). It's of NUMA type but
+	* the pfn may be screwed if the read is non atomic.
+	*
+	* ptep_modify_prot_start is not called as this is clearing
+	* the _PAGE_NUMA bit and it is not really expected that there
+	* would be concurrent hardware modifications to the PTE.
+	*/
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*ptep, pte)))
+		goto out_unlock;
+	pte = pte_mknonnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
+	update_mmu_cache(vma, addr, ptep);
+
+	page = vm_normal_page(vma, addr, pte);
+	if (!page) {
+		pte_unmap_unlock(ptep, ptl);
+		return 0;
+	}
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+	return 0;
+}
+
+/* NUMA hinting page fault entry point for regular pmds */
+int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t pmd;
+	pte_t *pte, *orig_pte;
+	unsigned long _addr = addr & PMD_MASK;
+	unsigned long offset;
+	spinlock_t *ptl;
+	bool numa = false;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = *pmdp;
+	if (pmd_numa(pmd)) {
+		set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
+		numa = true;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (!numa)
+		return 0;
+
+	/* we're in a page fault so some vma must be in the range */
+	BUG_ON(!vma);
+	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
+	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
+	VM_BUG_ON(offset >= PMD_SIZE);
+	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+	pte += offset >> PAGE_SHIFT;
+	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+		pte_t pteval = *pte;
+		struct page *page;
+		if (!pte_present(pteval))
+			continue;
+		if (!pte_numa(pteval))
+			continue;
+		if (addr >= vma->vm_end) {
+			vma = find_vma(mm, addr);
+			/* there's a pte present so there must be a vma */
+			BUG_ON(!vma);
+			BUG_ON(addr < vma->vm_start);
+		}
+		if (pte_numa(pteval)) {
+			pteval = pte_mknonnuma(pteval);
+			set_pte_at(mm, addr, pte, pteval);
+		}
+		page = vm_normal_page(vma, addr, pteval);
+		if (unlikely(!page))
+			continue;
+	}
+	pte_unmap_unlock(orig_pte, ptl);
+
+	return 0;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3486,6 +3575,9 @@ int handle_pte_fault(struct mm_struct *mm,
 					pte, pmd, flags, entry);
 	}
 
+	if (pte_numa(entry))
+		return do_numa_page(mm, vma, address, entry, pte, pmd);
+
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
@@ -3554,9 +3646,11 @@ retry:
 
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
-			if (flags & FAULT_FLAG_WRITE &&
-			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd)) {
+			if (pmd_numa(*pmd))
+				return do_huge_pmd_numa_page(mm, vma, address,
+							     orig_pmd, pmd);
+
+			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
 				/*
@@ -3568,10 +3662,14 @@ retry:
 					goto retry;
 				return ret;
 			}
+
 			return 0;
 		}
 	}
 
+	if (pmd_numa(*pmd))
+		return do_pmd_numa_page(mm, vma, address, pmd);
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could
-- 
1.7.9.2



* [PATCH 16/40] mm: mempolicy: Make MPOL_LOCAL a real policy
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (14 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 15/40] mm: numa: Create basic numa page hinting infrastructure Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 17/40] mm: mempolicy: Add MPOL_MF_NOOP Mel Gorman
                   ` (24 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Make MPOL_LOCAL a real and exposed policy such that applications that
relied on the previous default behaviour can explicitly request it.
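
For completeness, a userspace sketch (illustrative only): with this
change an application can ask for local allocation explicitly.
set_mempolicy() comes from libnuma's numaif.h; the MPOL_LOCAL fallback
definition below uses the value it takes in the uapi enum above and is
only needed with headers that predate this change.

/* Illustrative only: request the newly exposed MPOL_LOCAL policy. */
#include <stdio.h>
#include <numaif.h>

#ifndef MPOL_LOCAL
#define MPOL_LOCAL 4	/* follows MPOL_INTERLEAVE in the uapi enum */
#endif

int main(void)
{
	/* No nodemask: MPOL_LOCAL rejects a non-empty one with -EINVAL. */
	if (set_mempolicy(MPOL_LOCAL, NULL, 0)) {
		perror("set_mempolicy(MPOL_LOCAL)");
		return 1;
	}

	printf("process policy set to local allocation\n");
	return 0;
}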

Requested-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |    9 ++++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 23e62e0..3e835c9 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -20,6 +20,7 @@ enum {
 	MPOL_PREFERRED,
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
+	MPOL_LOCAL,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 66e90ec..54bd3e5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -269,6 +269,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 			     (flags & MPOL_F_RELATIVE_NODES)))
 				return ERR_PTR(-EINVAL);
 		}
+	} else if (mode == MPOL_LOCAL) {
+		if (!nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		mode = MPOL_PREFERRED;
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -2399,7 +2403,6 @@ void numa_default_policy(void)
  * "local" is pseudo-policy:  MPOL_PREFERRED with MPOL_F_LOCAL flag
  * Used only for mpol_parse_str() and mpol_to_str()
  */
-#define MPOL_LOCAL MPOL_MAX
 static const char * const policy_modes[] =
 {
 	[MPOL_DEFAULT]    = "default",
@@ -2452,12 +2455,12 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 	if (flags)
 		*flags++ = '\0';	/* terminate mode string */
 
-	for (mode = 0; mode <= MPOL_LOCAL; mode++) {
+	for (mode = 0; mode < MPOL_MAX; mode++) {
 		if (!strcmp(str, policy_modes[mode])) {
 			break;
 		}
 	}
-	if (mode > MPOL_LOCAL)
+	if (mode >= MPOL_MAX)
 		goto out;
 
 	switch (mode) {
-- 
1.7.9.2



* [PATCH 17/40] mm: mempolicy: Add MPOL_MF_NOOP
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (15 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 16/40] mm: mempolicy: Make MPOL_LOCAL a real policy Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 18/40] mm: mempolicy: Check for misplaced page Mel Gorman
                   ` (23 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

NOTE: I have not yet addressed my own review feedback of this patch. At
	this point I'm trying to construct a baseline tree and will apply
	my own review feedback later and then fold it in.

This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
to mbind().  When the NOOP policy is used with the 'MOVE' and 'LAZY'
flags, mbind() will map the pages PROT_NONE so that they will be
migrated on the next touch.

This allows an application to prepare for a new phase of operation
where different regions of shared storage will be assigned to
worker threads, w/o changing policy.  Note that we could just use
"default" policy in this case.  However, this also allows an
application to request that pages be migrated, only if necessary,
to follow any arbitrary policy that might currently apply to a
range of pages, without knowing the policy, or without specifying
multiple mbind()s for ranges with different policies.
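
A hypothetical caller might look like the sketch below (illustrative
only). MPOL_NOOP is the mode added in this patch and MPOL_MF_LAZY is
introduced elsewhere in this series, so building it assumes the series'
uapi <linux/mempolicy.h> is installed; mbind() is invoked through
syscall() to avoid depending on libnuma headers.

/* Illustrative only: lazy migrate-on-touch without changing policy. */
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>	/* MPOL_NOOP, MPOL_MF_* from this series */

int main(void)
{
	size_t len = 16 * 4096;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Keep whatever policy already applies; just mark the pages so the
	 * next touch decides whether they need to move. */
	if (syscall(__NR_mbind, p, len, MPOL_NOOP, NULL, 0,
		    MPOL_MF_MOVE | MPOL_MF_LAZY))
		perror("mbind(MPOL_NOOP, MOVE|LAZY)");

	munmap(p, len);
	return 0;
}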

[ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   11 ++++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3e835c9..d23dca8 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -21,6 +21,7 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
+	MPOL_NOOP,		/* retain existing policy for range */
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 54bd3e5..c21e914 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 
-	if (mode == MPOL_DEFAULT) {
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
-		return NULL;	/* simply delete any existing policy */
+		return NULL;
 	}
 	VM_BUG_ON(!nodes);
 
@@ -1147,7 +1147,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -2409,7 +2409,8 @@ static const char * const policy_modes[] =
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
-	[MPOL_LOCAL]      = "local"
+	[MPOL_LOCAL]      = "local",
+	[MPOL_NOOP]	  = "noop",	/* should not actually be used */
 };
 
 
@@ -2460,7 +2461,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 			break;
 		}
 	}
-	if (mode >= MPOL_MAX)
+	if (mode >= MPOL_MAX || mode == MPOL_NOOP)
 		goto out;
 
 	switch (mode) {
-- 
1.7.9.2



* [PATCH 18/40] mm: mempolicy: Check for misplaced page
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (16 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 17/40] mm: mempolicy: Add MPOL_MF_NOOP Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 19/40] mm: migrate: Introduce migrate_misplaced_page() Mel Gorman
                   ` (22 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped.  This involves
looking up the node where the page belongs.  So, the function
returns that node so that it may be used to allocate the page
without consulting the policy again.

A subsequent patch will call this function from the fault path.
Because of this, I don't want to go ahead and allocate the page, e.g.,
via alloc_page_vma() only to have to free it if it has the correct
policy.  So, I just mimic the alloc_page_vma() node computation
logic--sort of.

Note:  we could use this function to implement a MPOL_MF_STRICT
behavior when migrating pages to match mbind() mempolicy--e.g.,
to ensure that pages in an interleaved range are reinterleaved
rather than left where they are when they reside on any page in
the interleave nodemask.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ Added MPOL_F_LAZY to trigger migrate-on-fault;
  simplified code now that we don't have to bother
  with special crap for interleaved ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mempolicy.h      |    8 +++++
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   76 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..c511e25 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -198,6 +198,8 @@ static inline int vma_migratable(struct vm_area_struct *vma)
 	return 1;
 }
 
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
 #else
 
 struct mempolicy {};
@@ -323,5 +325,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
 	return 0;
 }
 
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	return -1; /* no node preference */
+}
+
 #endif /* CONFIG_NUMA */
 #endif
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index d23dca8..472de8a 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -61,6 +61,7 @@ enum mpol_rebind_step {
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
+#define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c21e914..df1466d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2181,6 +2181,82 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page   - page to be checked
+ * @vma    - vm area where page mapped
+ * @addr   - virtual address where page mapped
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ *	-1	- not misplaced, page is in the right node
+ *	node	- node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol;
+	struct zone *zone;
+	int curnid = page_to_nid(page);
+	unsigned long pgoff;
+	int polnid = -1;
+	int ret = -1;
+
+	BUG_ON(!vma);
+
+	pol = get_vma_policy(current, vma, addr);
+	if (!(pol->flags & MPOL_F_MOF))
+		goto out;
+
+	switch (pol->mode) {
+	case MPOL_INTERLEAVE:
+		BUG_ON(addr >= vma->vm_end);
+		BUG_ON(addr < vma->vm_start);
+
+		pgoff = vma->vm_pgoff;
+		pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+		polnid = offset_il_node(pol, vma, pgoff);
+		break;
+
+	case MPOL_PREFERRED:
+		if (pol->flags & MPOL_F_LOCAL)
+			polnid = numa_node_id();
+		else
+			polnid = pol->v.preferred_node;
+		break;
+
+	case MPOL_BIND:
+		/*
+		 * allows binding to multiple nodes.
+		 * use current page if in policy nodemask,
+		 * else select nearest allowed node, if any.
+		 * If no allowed nodes, use current [!misplaced].
+		 */
+		if (node_isset(curnid, pol->v.nodes))
+			goto out;
+		(void)first_zones_zonelist(
+				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+				gfp_zone(GFP_HIGHUSER),
+				&pol->v.nodes, &zone);
+		polnid = zone->node;
+		break;
+
+	default:
+		BUG();
+	}
+	if (curnid != polnid)
+		ret = polnid;
+out:
+	mpol_cond_put(pol);
+
+	return ret;
+}
+
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);
-- 
1.7.9.2



* [PATCH 19/40] mm: migrate: Introduce migrate_misplaced_page()
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (17 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 18/40] mm: mempolicy: Check for misplaced page Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 20/40] mm: mempolicy: Use _PAGE_NUMA to migrate pages Mel Gorman
                   ` (21 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Note: This was originally based on Peter's patch "mm/migrate: Introduce
	migrate_misplaced_page()" but borrows extremely heavily from Andrea's
	"autonuma: memory follows CPU algorithm and task/mm_autonuma stats
	collection". The end result is barely recognisable so signed-offs
	had to be dropped. If original authors are ok with it, I'll
	re-add the signed-off-bys.

Add migrate_misplaced_page() which deals with migrating pages from
faults.

Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Based-on-work-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/migrate.h |    8 ++++
 mm/migrate.c            |  108 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 114 insertions(+), 2 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9d1c159..69f60b5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -13,6 +13,7 @@ enum migrate_reason {
 	MR_MEMORY_HOTPLUG,
 	MR_SYSCALL,		/* also applies to cpusets */
 	MR_MEMPOLICY_MBIND,
+	MR_NUMA_MISPLACED,
 	MR_CMA
 };
 
@@ -39,6 +40,7 @@ extern int migrate_vmas(struct mm_struct *mm,
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
+extern int migrate_misplaced_page(struct page *page, int node);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
@@ -72,5 +74,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #define migrate_page NULL
 #define fail_migrate_page NULL
 
+static inline
+int migrate_misplaced_page(struct page *page, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_MIGRATION */
+
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 27be9c9..a2c4567 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -282,7 +282,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
 		struct buffer_head *head, enum migrate_mode mode)
 {
-	int expected_count;
+	int expected_count = 0;
 	void **pslot;
 
 	if (!mapping) {
@@ -1415,4 +1415,108 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
  	}
  	return err;
 }
-#endif
+
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * Returns true if this is a safe migration target node for misplaced NUMA
+ * pages. Currently it only checks the watermarks, which is crude.
+ */
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+				   int nr_migrate_pages)
+{
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static struct page *alloc_misplaced_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+
+	newpage = alloc_pages_exact_node(nid,
+					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
+					  __GFP_NOMEMALLOC | __GFP_NORETRY |
+					  __GFP_NOWARN) &
+					 ~GFP_IOFS, 0);
+	return newpage;
+}
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node. Caller is expected to have an elevated reference count on
+ * the page that will be dropped by this function before returning.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+	int isolated = 0;
+	LIST_HEAD(migratepages);
+
+	/*
+	 * Don't migrate pages that are mapped in multiple processes.
+	 * TODO: Handle false sharing detection instead of this hammer
+	 */
+	if (page_mapcount(page) != 1) {
+		put_page(page);
+		goto out;
+	}
+
+	/* Avoid migrating to a node that is nearly full */
+	if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+		int page_lru;
+
+		if (isolate_lru_page(page)) {
+			put_page(page);
+			goto out;
+		}
+		isolated = 1;
+
+		/*
+		 * Page is isolated which takes a reference count so now the
+		 * callers reference can be safely dropped without the page
+		 * disappearing underneath us during migration
+		 */
+		put_page(page);
+
+		page_lru = page_is_file_cache(page);
+		inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+		list_add(&page->lru, &migratepages);
+	}
+
+	if (isolated) {
+		int nr_remaining;
+
+		nr_remaining = migrate_pages(&migratepages,
+				alloc_misplaced_dst_page,
+				node, false, MIGRATE_ASYNC,
+				MR_NUMA_MISPLACED);
+		if (nr_remaining) {
+			putback_lru_pages(&migratepages);
+			isolated = 0;
+		}
+	}
+	BUG_ON(!list_empty(&migratepages));
+out:
+	return isolated;
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
+#endif /* CONFIG_NUMA */
-- 
1.7.9.2



* [PATCH 20/40] mm: mempolicy: Use _PAGE_NUMA to migrate pages
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (18 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 19/40] mm: migrate: Introduce migrate_misplaced_page() Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 21/40] mm: mempolicy: Add MPOL_MF_LAZY Mel Gorman
                   ` (20 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
	sufficiently different that the signed-off-bys were dropped

Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
pieces into an effective migrate on fault scheme.

Note that (on x86) we rely on PROT_NONE pages being !present and avoid
the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
page-migration performance.
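
In condensed form (a sketch of the do_numa_page() hunk below with the
locking and error paths elided, not new code):

  pte = pte_mknonnuma(pte);                     /* clear the NUMA hint */
  set_pte_at(mm, addr, ptep, pte);
  update_mmu_cache(vma, addr, ptep);

  page = vm_normal_page(vma, addr, pte);
  target_nid = mpol_misplaced(page, vma, addr); /* where should it be? */
  if (target_nid != -1)
          migrate_misplaced_page(page, target_nid);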

Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/huge_mm.h |    8 ++++----
 mm/huge_memory.c        |   32 +++++++++++++++++++++++++++++---
 mm/memory.c             |   32 +++++++++++++++++++++++++++-----
 3 files changed, 60 insertions(+), 12 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a13ebb1..406f81c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -160,8 +160,8 @@ static inline struct page *compound_trans_head(struct page *page)
 	return page;
 }
 
-extern int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
-				  pmd_t pmd, pmd_t *pmdp);
+extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				unsigned long addr, pmd_t pmd, pmd_t *pmdp);
 
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
@@ -200,8 +200,8 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
 	return 0;
 }
 
-static inline int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
-					pmd_t pmd, pmd_t *pmdp);
+static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+					unsigned long addr, pmd_t pmd, pmd_t *pmdp);
 {
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7224efd..df1af09 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
 #include <linux/freezer.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
+#include <linux/migrate.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -1018,16 +1019,39 @@ out:
 }
 
 /* NUMA hinting page fault entry point for trans huge pmds */
-int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
-				pmd_t pmd, pmd_t *pmdp)
+int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
-	struct page *page;
+	struct page *page = NULL;
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	int target_nid;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	target_nid = mpol_misplaced(page, vma, haddr);
+	if (target_nid == -1)
+		goto clear_pmdnuma;
+
+	/*
+	 * Due to lacking code to migrate thp pages, we'll split
+	 * (which preserves the special PROT_NONE) and re-take the
+	 * fault on the normal pages.
+	 */
+	split_huge_page(page);
+	put_page(page);
+	return 0;
+
+clear_pmdnuma:
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(pmd, *pmdp)))
+		goto out_unlock;
+
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
 	VM_BUG_ON(pmd_numa(*pmdp));
@@ -1035,6 +1059,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+	if (page)
+		put_page(page);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 277b6d8..6bebb41 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/migrate.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3451,8 +3452,9 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
 {
-	struct page *page;
+	struct page *page = NULL;
 	spinlock_t *ptl;
+	int current_nid, target_nid;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3465,8 +3467,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	*/
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
-	if (unlikely(!pte_same(*ptep, pte)))
-		goto out_unlock;
+	if (unlikely(!pte_same(*ptep, pte))) {
+		pte_unmap_unlock(ptep, ptl);
+		goto out;
+	}
+
 	pte = pte_mknonnuma(pte);
 	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
@@ -3477,8 +3482,25 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
-out_unlock:
+	get_page(page);
+	current_nid = page_to_nid(page);
+	target_nid = mpol_misplaced(page, vma, addr);
 	pte_unmap_unlock(ptep, ptl);
+	if (target_nid == -1) {
+		/*
+		 * Account for the fault against the current node if it is not
+		 * being replaced, regardless of where the page is located.
+		 */
+		current_nid = numa_node_id();
+		put_page(page);
+		goto out;
+	}
+
+	/* Migrate to the requested node */
+	if (migrate_misplaced_page(page, target_nid))
+		current_nid = target_nid;
+
+out:
 	return 0;
 }
 
@@ -3647,7 +3669,7 @@ retry:
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
 			if (pmd_numa(*pmd))
-				return do_huge_pmd_numa_page(mm, address,
+				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);
 
 			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
-- 
1.7.9.2



* [PATCH 21/40] mm: mempolicy: Add MPOL_MF_LAZY
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (19 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 20/40] mm: mempolicy: Use _PAGE_NUMA to migrate pages Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 22/40] mm: mempolicy: Implement change_prot_numa() in terms of change_protection() Mel Gorman
                   ` (19 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

NOTE: Once again there is a lot of patch stealing and the end result
	is sufficiently different that I had to drop the signed-offs.
	Will re-add if the original authors are ok with that.

This patch adds another mbind() flag to request "lazy migration".  The
flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are marked PROT_NONE. The pages will be migrated in the fault
path on "first touch", if the policy dictates at that time.

"Lazy Migration" will allow testing of migrate-on-fault via mbind().
Also allows applications to specify that only subsequently touched
pages be migrated to obey new policy, instead of all pages in range.
This can be useful for multi-threaded applications working on a large
shared data area that is initialized by a single thread, leaving all
pages on one node [or a few, if that node overflowed]. Once the pages
are marked PROT_NONE, the regions assigned to the worker threads will
be migrated local to those threads on first touch.
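
Purely as an illustration of the intended usage (and assuming the
userspace headers export the new flag; note that a later patch in this
series hides MPOL_MF_LAZY from userspace again), a worker thread could
mark its slice of the shared area for lazy migration like this:

  #include <stdio.h>
  #include <stddef.h>
  #include <numaif.h>     /* assumed to expose the new MPOL_MF_LAZY */

  /* Bind a worker's slice to its local node but defer the copying:
   * the pages are only marked PROT_NONE now and migrate on first touch.
   */
  static void bind_lazy(void *slice, size_t len, int node)
  {
          unsigned long nodemask = 1UL << node;

          if (mbind(slice, len, MPOL_BIND, &nodemask,
                    sizeof(nodemask) * 8, MPOL_MF_MOVE | MPOL_MF_LAZY) < 0)
                  perror("mbind");
  }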

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mm.h             |    5 ++
 include/uapi/linux/mempolicy.h |   13 ++-
 mm/mempolicy.c                 |  185 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 185 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa16152..471185e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1551,6 +1551,11 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
+#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+void change_prot_numa(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
+#endif
+
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 472de8a..6a1baae 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -49,9 +49,16 @@ enum mpol_rebind_step {
 
 /* Flags for mbind */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3)	/* Internal flags start here */
+#define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
+				   to policy */
+#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
+#define MPOL_MF_LAZY	 (1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+
+#define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
+			 MPOL_MF_MOVE     | 	\
+			 MPOL_MF_MOVE_ALL |	\
+			 MPOL_MF_LAZY)
 
 /*
  * Internal flags that share the struct mempolicy flags word with
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index df1466d..51d3ebd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -90,6 +90,7 @@
 #include <linux/syscalls.h>
 #include <linux/ctype.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
@@ -565,6 +566,145 @@ static inline int check_pgd_range(struct vm_area_struct *vma,
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+/*
+ * Here we search for not shared page mappings (mapcount == 1) and we
+ * set up the pmd/pte_numa on those mappings so the very next access
+ * will fire a NUMA hinting page fault.
+ */
+static int
+change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	struct page *page;
+	unsigned long _address, end;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	VM_BUG_ON(address & ~PAGE_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		int page_nid;
+		ret = HPAGE_PMD_NR;
+
+		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+		if (pmd_numa(*pmd)) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		page = pmd_page(*pmd);
+
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		page_nid = page_to_nid(page);
+
+		if (pmd_numa(*pmd)) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		ret += HPAGE_PMD_NR;
+		/* defer TLB flush to lower the overhead */
+		spin_unlock(&mm->page_table_lock);
+		goto out;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		goto out;
+	VM_BUG_ON(!pmd_present(*pmd));
+
+	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _address < end;
+	     _pte++, _address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+		if (!pte_present(pteval))
+			continue;
+		if (pte_numa(pteval))
+			continue;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+
+		set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+		/* defer TLB flush to lower the overhead */
+		ret++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret && !pmd_numa(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		spin_unlock(&mm->page_table_lock);
+		/* defer TLB flush to lower the overhead */
+	}
+
+out:
+	return ret;
+}
+
+/* Assumes mmap_sem is held */
+void
+change_prot_numa(struct vm_area_struct *vma,
+			unsigned long address, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int progress = 0;
+
+	while (address < end) {
+		VM_BUG_ON(address < vma->vm_start ||
+			  address + PAGE_SIZE > vma->vm_end);
+
+		progress += change_prot_numa_range(mm, vma, address);
+		address = (address + PMD_SIZE) & PMD_MASK;
+	}
+
+	/*
+	 * Flush the TLB for the mm to start the NUMA hinting
+	 * page faults after we finish scanning this vma part
+	 * if there were any PTE updates
+	 */
+	if (progress) {
+		mmu_notifier_invalidate_range_start(vma->vm_mm, address, end);
+		flush_tlb_range(vma, address, end);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, address, end);
+	}
+}
+#else
+static unsigned long change_prot_numa(struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
+{
+	return 0;
+}
+#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
+
 /*
  * Check if all pages in a range are on a set of nodes.
  * If pagelist != NULL then isolate pages from the LRU and
@@ -583,22 +723,32 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 		return ERR_PTR(-EFAULT);
 	prev = NULL;
 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+		unsigned long endvma = vma->vm_end;
+
+		if (endvma > end)
+			endvma = end;
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+
 		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
 			if (!vma->vm_next && vma->vm_end < end)
 				return ERR_PTR(-EFAULT);
 			if (prev && prev->vm_end < vma->vm_start)
 				return ERR_PTR(-EFAULT);
 		}
-		if (!is_vm_hugetlb_page(vma) &&
-		    ((flags & MPOL_MF_STRICT) ||
+
+		if (is_vm_hugetlb_page(vma))
+			goto next;
+
+		if (flags & MPOL_MF_LAZY) {
+			change_prot_numa(vma, start, endvma);
+			goto next;
+		}
+
+		if ((flags & MPOL_MF_STRICT) ||
 		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-				vma_migratable(vma)))) {
-			unsigned long endvma = vma->vm_end;
+		      vma_migratable(vma))) {
 
-			if (endvma > end)
-				endvma = end;
-			if (vma->vm_start > start)
-				start = vma->vm_start;
 			err = check_pgd_range(vma, start, endvma, nodes,
 						flags, private);
 			if (err) {
@@ -606,6 +756,7 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 				break;
 			}
 		}
+next:
 		prev = vma;
 	}
 	return first;
@@ -1138,8 +1289,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	int err;
 	LIST_HEAD(pagelist);
 
-	if (flags & ~(unsigned long)(MPOL_MF_STRICT |
-				     MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+	if (flags & ~(unsigned long)MPOL_MF_VALID)
 		return -EINVAL;
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
 		return -EPERM;
@@ -1162,6 +1312,9 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	if (flags & MPOL_MF_LAZY)
+		new->flags |= MPOL_F_MOF;
+
 	/*
 	 * If we are using the default policy then operation
 	 * on discontinuous address spaces is okay after all
@@ -1198,13 +1351,15 @@ static long do_mbind(unsigned long start, unsigned long len,
 	vma = check_range(mm, start, end, nmask,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
-	err = PTR_ERR(vma);
-	if (!IS_ERR(vma)) {
-		int nr_failed = 0;
-
+	err = PTR_ERR(vma);	/* maybe ... */
+	if (!IS_ERR(vma) && mode != MPOL_NOOP)
 		err = mbind_range(mm, start, end, new);
 
+	if (!err) {
+		int nr_failed = 0;
+
 		if (!list_empty(&pagelist)) {
+			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma,
 						false, MIGRATE_SYNC,
@@ -1213,7 +1368,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 				putback_lru_pages(&pagelist);
 		}
 
-		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+		if (nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
 	} else
 		putback_lru_pages(&pagelist);
-- 
1.7.9.2



* [PATCH 22/40] mm: mempolicy: Implement change_prot_numa() in terms of change_protection()
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (20 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 21/40] mm: mempolicy: Add MPOL_MF_LAZY Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 23/40] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now Mel Gorman
                   ` (18 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

This patch converts change_prot_numa() to use change_protection(). As
pte_numa and friends check the PTE bits directly it is necessary for
change_protection() to use pmd_mknuma(). Hence the required
modifications to change_protection() are a little clumsy but the
end result is that most of the numa page table helpers are just one or
two instructions.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/huge_mm.h |    3 +-
 include/linux/mm.h      |    4 +-
 mm/huge_memory.c        |   14 ++++-
 mm/mempolicy.c          |  137 +++++------------------------------------------
 mm/mprotect.c           |   62 +++++++++++++++------
 5 files changed, 75 insertions(+), 145 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 406f81c..01f17a3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -27,7 +27,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 			 unsigned long new_addr, unsigned long old_end,
 			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-			unsigned long addr, pgprot_t newprot);
+			unsigned long addr, pgprot_t newprot,
+			int prot_numa);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 471185e..d04c2f0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1080,7 +1080,7 @@ extern unsigned long do_mremap(unsigned long addr,
 			       unsigned long flags, unsigned long new_addr);
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
-			      int dirty_accountable);
+			      int dirty_accountable, int prot_numa);
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
@@ -1552,7 +1552,7 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 #endif
 
 #ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
-void change_prot_numa(struct vm_area_struct *vma,
+unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
 #endif
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index df1af09..68e0412 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1146,7 +1146,7 @@ out:
 }
 
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, pgprot_t newprot)
+		unsigned long addr, pgprot_t newprot, int prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int ret = 0;
@@ -1154,7 +1154,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
 		pmd_t entry;
 		entry = pmdp_get_and_clear(mm, addr, pmd);
-		entry = pmd_modify(entry, newprot);
+		if (!prot_numa)
+			entry = pmd_modify(entry, newprot);
+		else {
+			struct page *page = pmd_page(*pmd);
+
+			/* only check non-shared pages */
+			if (page_mapcount(page) == 1 &&
+			    !pmd_numa(*pmd)) {
+				entry = pmd_mknuma(entry);
+			}
+		}
 		set_pmd_at(mm, addr, pmd, entry);
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		ret = 1;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 51d3ebd..75d4600 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -568,134 +568,23 @@ static inline int check_pgd_range(struct vm_area_struct *vma,
 
 #ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
 /*
- * Here we search for not shared page mappings (mapcount == 1) and we
- * set up the pmd/pte_numa on those mappings so the very next access
- * will fire a NUMA hinting page fault.
+ * This is used to mark a range of virtual addresses to be inaccessible.
+ * These are later cleared by a NUMA hinting fault. Depending on these
+ * faults, pages may be migrated for better NUMA placement.
+ *
+ * This is assuming that NUMA faults are handled using PROT_NONE. If
+ * an architecture makes a different choice, it will need further
+ * changes to the core.
  */
-static int
-change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address)
-{
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte, *_pte;
-	struct page *page;
-	unsigned long _address, end;
-	spinlock_t *ptl;
-	int ret = 0;
-
-	VM_BUG_ON(address & ~PAGE_MASK);
-
-	pgd = pgd_offset(mm, address);
-	if (!pgd_present(*pgd))
-		goto out;
-
-	pud = pud_offset(pgd, address);
-	if (!pud_present(*pud))
-		goto out;
-
-	pmd = pmd_offset(pud, address);
-	if (pmd_none(*pmd))
-		goto out;
-
-	if (pmd_trans_huge_lock(pmd, vma) == 1) {
-		int page_nid;
-		ret = HPAGE_PMD_NR;
-
-		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
-		if (pmd_numa(*pmd)) {
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-
-		page = pmd_page(*pmd);
-
-		/* only check non-shared pages */
-		if (page_mapcount(page) != 1) {
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-
-		page_nid = page_to_nid(page);
-
-		if (pmd_numa(*pmd)) {
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-
-		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
-		ret += HPAGE_PMD_NR;
-		/* defer TLB flush to lower the overhead */
-		spin_unlock(&mm->page_table_lock);
-		goto out;
-	}
-
-	if (pmd_trans_unstable(pmd))
-		goto out;
-	VM_BUG_ON(!pmd_present(*pmd));
-
-	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
-	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	for (_address = address, _pte = pte; _address < end;
-	     _pte++, _address += PAGE_SIZE) {
-		pte_t pteval = *_pte;
-		if (!pte_present(pteval))
-			continue;
-		if (pte_numa(pteval))
-			continue;
-		page = vm_normal_page(vma, _address, pteval);
-		if (unlikely(!page))
-			continue;
-		/* only check non-shared pages */
-		if (page_mapcount(page) != 1)
-			continue;
-
-		set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
-
-		/* defer TLB flush to lower the overhead */
-		ret++;
-	}
-	pte_unmap_unlock(pte, ptl);
-
-	if (ret && !pmd_numa(*pmd)) {
-		spin_lock(&mm->page_table_lock);
-		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
-		spin_unlock(&mm->page_table_lock);
-		/* defer TLB flush to lower the overhead */
-	}
-
-out:
-	return ret;
-}
-
-/* Assumes mmap_sem is held */
-void
-change_prot_numa(struct vm_area_struct *vma,
-			unsigned long address, unsigned long end)
+unsigned long change_prot_numa(struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
 {
-	struct mm_struct *mm = vma->vm_mm;
-	int progress = 0;
-
-	while (address < end) {
-		VM_BUG_ON(address < vma->vm_start ||
-			  address + PAGE_SIZE > vma->vm_end);
+	int nr_updated;
+	BUILD_BUG_ON(_PAGE_NUMA != _PAGE_PROTNONE);
 
-		progress += change_prot_numa_range(mm, vma, address);
-		address = (address + PMD_SIZE) & PMD_MASK;
-	}
+	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
 
-	/*
-	 * Flush the TLB for the mm to start the NUMA hinting
-	 * page faults after we finish scanning this vma part
-	 * if there were any PTE updates
-	 */
-	if (progress) {
-		mmu_notifier_invalidate_range_start(vma->vm_mm, address, end);
-		flush_tlb_range(vma, address, end);
-		mmu_notifier_invalidate_range_end(vma->vm_mm, address, end);
-	}
+	return nr_updated;
 }
 #else
 static unsigned long change_prot_numa(struct vm_area_struct *vma,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7c3628a..1b383b7 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -35,10 +35,11 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 }
 #endif
 
-static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable)
+		int dirty_accountable, int prot_numa)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
@@ -49,19 +50,39 @@ static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
+			bool updated = false;
 
 			ptent = ptep_modify_prot_start(mm, addr, pte);
-			ptent = pte_modify(ptent, newprot);
+			if (!prot_numa) {
+				ptent = pte_modify(ptent, newprot);
+				updated = true;
+			} else {
+				struct page *page;
+
+				page = vm_normal_page(vma, addr, oldpte);
+				if (page) {
+					/* only check non-shared pages */
+					if (!pte_numa(oldpte) &&
+					    page_mapcount(page) == 1) {
+						ptent = pte_mknuma(ptent);
+						updated = true;
+					}
+				}
+			}
 
 			/*
 			 * Avoid taking write faults for pages we know to be
 			 * dirty.
 			 */
-			if (dirty_accountable && pte_dirty(ptent))
+			if (dirty_accountable && pte_dirty(ptent)) {
 				ptent = pte_mkwrite(ptent);
+				updated = true;
+			}
+
+			if (updated)
+				pages++;
 
 			ptep_modify_prot_commit(mm, addr, pte, ptent);
-			pages++;
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 
@@ -85,7 +106,7 @@ static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable)
+		int dirty_accountable, int prot_numa)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -97,7 +118,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				split_huge_page_pmd(vma->vm_mm, pmd);
-			else if (change_huge_pmd(vma, pmd, addr, newprot)) {
+			else if (change_huge_pmd(vma, pmd, addr, newprot, prot_numa)) {
 				pages += HPAGE_PMD_NR;
 				continue;
 			}
@@ -105,8 +126,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 		}
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		pages += change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
-				 dirty_accountable);
+		pages += change_pte_range(vma, pmd, addr, next, newprot,
+				 dirty_accountable, prot_numa);
+
+		if (prot_numa) {
+			struct mm_struct *mm = vma->vm_mm;
+
+			spin_lock(&mm->page_table_lock);
+			set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
+			spin_unlock(&mm->page_table_lock);
+		}
+
 	} while (pmd++, addr = next, addr != end);
 
 	return pages;
@@ -114,7 +144,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 
 static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable)
+		int dirty_accountable, int prot_numa)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -126,7 +156,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
-				 dirty_accountable);
+				 dirty_accountable, prot_numa);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
@@ -134,7 +164,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *
 
 static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable)
+		int dirty_accountable, int prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -150,7 +180,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_pud_range(vma, pgd, addr, next, newprot,
-				 dirty_accountable);
+				 dirty_accountable, prot_numa);
 	} while (pgd++, addr = next, addr != end);
 
 	/* Only flush the TLB if we actually modified any entries: */
@@ -162,7 +192,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 
 unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end, pgprot_t newprot,
-		       int dirty_accountable)
+		       int dirty_accountable, int prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long pages;
@@ -171,7 +201,7 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot);
 	else
-		pages = change_protection_range(vma, start, end, newprot, dirty_accountable);
+		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
 	mmu_notifier_invalidate_range_end(mm, start, end);
 
 	return pages;
@@ -249,7 +279,7 @@ success:
 		dirty_accountable = 1;
 	}
 
-	change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+	change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable, 0);
 
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
-- 
1.7.9.2



* [PATCH 23/40] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (21 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 22/40] mm: mempolicy: Implement change_prot_numa() in terms of change_protection() Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 24/40] mm: numa: Add fault driven placement and migration Mel Gorman
                   ` (17 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to
explicitly request lazy migration is a good idea but the actual
API has not been well reviewed and once released we have to support it.
For now this patch prevents an application from using the services. This
will need to be revisited.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    4 +---
 mm/mempolicy.c                 |    9 ++++-----
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 6a1baae..16fb4e6 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -21,7 +21,6 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
-	MPOL_NOOP,		/* retain existing policy for range */
 	MPOL_MAX,	/* always last member of enum */
 };
 
@@ -57,8 +56,7 @@ enum mpol_rebind_step {
 
 #define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
 			 MPOL_MF_MOVE     | 	\
-			 MPOL_MF_MOVE_ALL |	\
-			 MPOL_MF_LAZY)
+			 MPOL_MF_MOVE_ALL)
 
 /*
  * Internal flags that share the struct mempolicy flags word with
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 75d4600..a7a62fe 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -252,7 +252,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 
-	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
+	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
 		return NULL;
@@ -1186,7 +1186,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
+	if (mode == MPOL_DEFAULT)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -1241,7 +1241,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
 	err = PTR_ERR(vma);	/* maybe ... */
-	if (!IS_ERR(vma) && mode != MPOL_NOOP)
+	if (!IS_ERR(vma))
 		err = mbind_range(mm, start, end, new);
 
 	if (!err) {
@@ -2530,7 +2530,6 @@ static const char * const policy_modes[] =
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
 	[MPOL_LOCAL]      = "local",
-	[MPOL_NOOP]	  = "noop",	/* should not actually be used */
 };
 
 
@@ -2581,7 +2580,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 			break;
 		}
 	}
-	if (mode >= MPOL_MAX || mode == MPOL_NOOP)
+	if (mode >= MPOL_MAX)
 		goto out;
 
 	switch (mode) {
-- 
1.7.9.2



* [PATCH 24/40] mm: numa: Add fault driven placement and migration
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (22 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 23/40] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 25/40] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Mel Gorman
                   ` (16 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

NOTE: This patch is based on "sched, numa, mm: Add fault driven
	placement and migration policy" but as it throws away all the policy
	to just leave a basic foundation I had to drop the signed-offs-by.

This patch creates a bare-bones method for setting PTEs pte_numa from
the context of the scheduler so that, when those PTEs are faulted on
later, the pages can be pulled towards the node of the CPU that takes
the fault.  In itself this does nothing useful but any placement policy
will fundamentally depend on receiving placement hints from fault
context and doing something intelligent with them.
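
Condensed from the hunks below (not new code; nid and nr_pages stand in
for the values seen at fault time), the plumbing has two ends:

  /* scan side: task_tick_fair() -> task_tick_numa() arms a task_work */
  init_task_work(work, task_numa_work);
  task_work_add(curr, work, true);

  /* task_numa_work(): walk the mm, mark PTEs for NUMA hinting faults */
  for (vma = mm->mmap; vma; vma = vma->vm_next) {
          if (!vma_migratable(vma))
                  continue;
          change_prot_numa(vma, vma->vm_start, vma->vm_end);
  }

  /* fault side: do_numa_page()/do_huge_pmd_numa_page() report back */
  task_numa_fault(nid, nr_pages);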

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 arch/sh/mm/Kconfig       |    1 +
 arch/x86/Kconfig         |    2 +
 include/linux/mm_types.h |   11 ++++
 include/linux/sched.h    |   20 ++++++++
 kernel/sched/core.c      |   13 +++++
 kernel/sched/fair.c      |  125 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h  |    7 +++
 kernel/sched/sched.h     |    6 +++
 kernel/sysctl.c          |   24 ++++++++-
 mm/huge_memory.c         |    5 +-
 mm/memory.c              |   14 +++++-
 11 files changed, 224 insertions(+), 4 deletions(-)

diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig
index cb8f992..0f7c852 100644
--- a/arch/sh/mm/Kconfig
+++ b/arch/sh/mm/Kconfig
@@ -111,6 +111,7 @@ config VSYSCALL
 config NUMA
 	bool "Non Uniform Memory Access (NUMA) Support"
 	depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
+	select ARCH_WANT_NUMA_VARIABLE_LOCALITY
 	default n
 	help
 	  Some SH systems have many various memories scattered around
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 46c3bff..1137028 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -22,6 +22,8 @@ config X86
 	def_bool y
 	select HAVE_AOUT if X86_32
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select ARCH_SUPPORTS_NUMA_BALANCING
+	select ARCH_WANTS_PROT_NUMA_PROT_NONE
 	select HAVE_IDE
 	select HAVE_OPROFILE
 	select HAVE_PCSPKR_PLATFORM
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3a..d82accb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -398,6 +398,17 @@ struct mm_struct {
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
 #endif
+#ifdef CONFIG_BALANCE_NUMA
+	/*
+	 * numa_next_scan is the next time when the PTEs will be marked
+	 * pte_numa to gather statistics and migrate pages to new nodes
+	 * if necessary
+	 */
+	unsigned long numa_next_scan;
+
+	/* numa_scan_seq prevents two threads setting pte_numa */
+	int numa_scan_seq;
+#endif
 	struct uprobes_state uprobes_state;
 };
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0dd42a0..ac71181 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1479,6 +1479,14 @@ struct task_struct {
 	short il_next;
 	short pref_node_fork;
 #endif
+#ifdef CONFIG_BALANCE_NUMA
+	int numa_scan_seq;
+	int numa_migrate_seq;
+	unsigned int numa_scan_period;
+	u64 node_stamp;			/* migration stamp  */
+	struct callback_head numa_work;
+#endif /* CONFIG_BALANCE_NUMA */
+
 	struct rcu_head rcu;
 
 	/*
@@ -1553,6 +1561,14 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+#ifdef CONFIG_BALANCE_NUMA
+extern void task_numa_fault(int node, int pages);
+#else
+static inline void task_numa_fault(int node, int pages)
+{
+}
+#endif
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
@@ -1990,6 +2006,10 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_balance_numa_scan_period_min;
+extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_settle_count;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..81fa185 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1533,6 +1533,19 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
+
+#ifdef CONFIG_BALANCE_NUMA
+	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+		p->mm->numa_next_scan = jiffies;
+		p->mm->numa_scan_seq = 0;
+	}
+
+	p->node_stamp = 0ULL;
+	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+	p->numa_work.next = &p->numa_work;
+#endif /* CONFIG_BALANCE_NUMA */
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..b6d3ed7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,8 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/mempolicy.h>
+#include <linux/task_work.h>
 
 #include <trace/events/sched.h>
 
@@ -776,6 +778,126 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * numa task sample period in ms: 5s
+ */
+unsigned int sysctl_balance_numa_scan_period_min = 5000;
+unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
+
+static void task_numa_placement(struct task_struct *p)
+{
+	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+
+	if (p->numa_scan_seq == seq)
+		return;
+	p->numa_scan_seq = seq;
+
+	/* FIXME: Scheduling placement policy hints go here */
+}
+
+/*
+ * Got a PROT_NONE fault for a page on @node.
+ */
+void task_numa_fault(int node, int pages)
+{
+	struct task_struct *p = current;
+
+	/* FIXME: Allocate task-specific structure for placement policy here */
+
+	task_numa_placement(p);
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_work(struct callback_head *work)
+{
+	unsigned long migrate, next_scan, now = jiffies;
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+
+	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+
+	work->next = work; /* protect against double add */
+	/*
+	 * Who cares about NUMA placement when they're dying.
+	 *
+	 * NOTE: make sure not to dereference p->mm before this check,
+	 * exit_task_work() happens _after_ exit_mm() so we could be called
+	 * without p->mm even though we still had it when we enqueued this
+	 * work.
+	 */
+	if (p->flags & PF_EXITING)
+		return;
+
+	/*
+	 * Enforce maximal scan/migration frequency..
+	 */
+	migrate = mm->numa_next_scan;
+	if (time_before(now, migrate))
+		return;
+
+	if (p->numa_scan_period == 0)
+		p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+
+	next_scan = now + 2*msecs_to_jiffies(p->numa_scan_period);
+	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
+		return;
+
+	ACCESS_ONCE(mm->numa_scan_seq)++;
+	{
+		struct vm_area_struct *vma;
+
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (!vma_migratable(vma))
+				continue;
+			change_prot_numa(vma, vma->vm_start, vma->vm_end);
+		}
+		up_read(&mm->mmap_sem);
+	}
+}
+
+/*
+ * Drive the periodic memory faults..
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+	struct callback_head *work = &curr->numa_work;
+	u64 period, now;
+
+	/*
+	 * We don't care about NUMA placement if we don't have memory.
+	 */
+	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+		return;
+
+	/*
+	 * Using runtime rather than walltime has the dual advantage that
+	 * we (mostly) drive the selection from busy threads and that the
+	 * task needs to have done some actual work before we bother with
+	 * NUMA placement.
+	 */
+	now = curr->se.sum_exec_runtime;
+	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
+
+	if (now - curr->node_stamp > period) {
+		curr->node_stamp = now;
+
+		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
+			task_work_add(curr, work, true);
+		}
+	}
+}
+#else
+static void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -4954,6 +5076,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		cfs_rq = cfs_rq_of(se);
 		entity_tick(cfs_rq, se, queued);
 	}
+
+	if (sched_feat_numa(NUMA))
+		task_tick_numa(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index eebefca..7cfd289 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -61,3 +61,10 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
+
+/*
+ * Apply the automatic NUMA scheduling policy
+ */
+#ifdef CONFIG_BALANCE_NUMA
+SCHED_FEAT(NUMA,	true)
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7a7db09..9a43241 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -648,6 +648,12 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
 #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
+#ifdef CONFIG_BALANCE_NUMA
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 26f65ea..1359f51 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 100000;		/* 100 usecs */
 static int max_sched_granularity_ns = NSEC_PER_SEC;	/* 1 second */
 static int min_wakeup_granularity_ns;			/* 0 usecs */
 static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
+#ifdef CONFIG_SMP
 static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
 static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
-#endif
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHED_DEBUG */
 
 #ifdef CONFIG_COMPACTION
 static int min_extfrag_threshold;
@@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &min_wakeup_granularity_ns,
 		.extra2		= &max_wakeup_granularity_ns,
 	},
+#ifdef CONFIG_SMP
 	{
 		.procname	= "sched_tunable_scaling",
 		.data		= &sysctl_sched_tunable_scaling,
@@ -347,7 +350,24 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
-#endif
+#endif /* CONFIG_SMP */
+#ifdef CONFIG_BALANCE_NUMA
+	{
+		.procname	= "balance_numa_scan_period_min_ms",
+		.data		= &sysctl_balance_numa_scan_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "balance_numa_scan_period_max_ms",
+		.data		= &sysctl_balance_numa_scan_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif /* CONFIG_BALANCE_NUMA */
+#endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
 		.data		= &sysctl_sched_rt_period,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 68e0412..b3d4c4b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1045,6 +1045,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	split_huge_page(page);
 	put_page(page);
+
 	return 0;
 
 clear_pmdnuma:
@@ -1059,8 +1060,10 @@ clear_pmdnuma:
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
-	if (page)
+	if (page) {
 		put_page(page);
+		task_numa_fault(numa_node_id(), HPAGE_PMD_NR);
+	}
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 6bebb41..b23b081 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3454,7 +3454,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid, target_nid;
+	int current_nid = -1;
+	int target_nid;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3501,6 +3502,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		current_nid = target_nid;
 
 out:
+	task_numa_fault(current_nid, 1);
 	return 0;
 }
 
@@ -3536,6 +3538,7 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page *page;
+		int curr_nid;
 		if (!pte_present(pteval))
 			continue;
 		if (!pte_numa(pteval))
@@ -3553,6 +3556,15 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pteval);
 		if (unlikely(!page))
 			continue;
+		/* only check non-shared pages */
+		if (unlikely(page_mapcount(page) != 1))
+			continue;
+		pte_unmap_unlock(pte, ptl);
+
+		curr_nid = page_to_nid(page);
+		task_numa_fault(curr_nid, 1);
+
+		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
-- 
1.7.9.2



* [PATCH 25/40] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (23 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 24/40] mm: numa: Add fault driven placement and migration Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 26/40] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Mel Gorman
                   ` (15 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

 - it samples the working set at dissimilar rates,
   giving some tasks a sampling quality advantage
   over others.

 - creates performance problems for tasks with very
   large working sets

 - over-samples processes with large address spaces but
   which only very rarely execute

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it at a constant rate (proportional to the CPU
cycles the task executes). If the offset reaches the last
mapped address of the mm then it starts over at the first
address.

The per-task nature of the working set sampling functionality in this tree
allows such constant rate, per task, execution-weight proportional sampling
of the working set, with an adaptive sampling interval/frequency that
goes from once per 100ms up to just once per 8 seconds.  The current
sampling volume is 256 MB per interval.

As tasks mature and their working set converges, the sampling
rate slows down to just a trickle, 256 MB per 8 seconds of CPU
time executed.
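
As a rough sanity check of what these defaults mean (illustrative
arithmetic only, derived from the sysctl defaults introduced below):

  /* illustrative arithmetic, not part of the patch */
  unsigned int scan_size_mb  = 256;  /* balance_numa_scan_size_mb       */
  unsigned int period_min_ms = 100;  /* balance_numa_scan_period_min_ms */

  /* Fastest case: at most 256MB marked per 100ms of CPU time consumed,
   * i.e. ~2.5GB of address space per second executed. Slower intervals
   * spread the same 256MB over a correspondingly longer period.
   */
  unsigned int max_mb_per_sec = scan_size_mb * (1000 / period_min_ms);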

This, beyond being adaptive, also rate-limits rarely
executing systems and does not over-sample on overloaded
systems.

[ In AutoNUMA speak, this patch deals with the effective sampling
  rate of the 'hinting page fault'. AutoNUMA's scanning is
  currently rate-limited, but it is also fundamentally
  single-threaded, executing in the knuma_scand kernel thread,
  so the limit in AutoNUMA is global and does not scale up with
  the number of CPUs, nor does it scan tasks in an execution
  proportional manner.

  So the idea of rate-limiting the scanning was first implemented
  in the AutoNUMA tree via a global rate limit. This patch goes
  beyond that by implementing an execution rate proportional
  working set sampling rate that is not implemented via a single
  global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the
  first version of this patch. ]

Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mm_types.h |    3 +++
 include/linux/sched.h    |    1 +
 kernel/sched/fair.c      |   65 ++++++++++++++++++++++++++++++++++++----------
 kernel/sysctl.c          |    7 +++++
 4 files changed, 63 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d82accb..b40f4ef 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -406,6 +406,9 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
+	/* Restart point for scanning and setting pte_numa */
+	unsigned long numa_scan_offset;
+
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ac71181..abb1c70 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2008,6 +2008,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_scan_size;
 extern unsigned int sysctl_balance_numa_settle_count;
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b6d3ed7..66d8bd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -780,10 +780,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_BALANCE_NUMA
 /*
- * numa task sample period in ms: 5s
+ * numa task sample period in ms
  */
-unsigned int sysctl_balance_numa_scan_period_min = 5000;
-unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
+unsigned int sysctl_balance_numa_scan_period_min = 100;
+unsigned int sysctl_balance_numa_scan_period_max = 100*16;
+
+/* Portion of address space to scan in MB */
+unsigned int sysctl_balance_numa_scan_size = 256;
 
 static void task_numa_placement(struct task_struct *p)
 {
@@ -808,6 +811,12 @@ void task_numa_fault(int node, int pages)
 	task_numa_placement(p);
 }
 
+static void reset_ptenuma_scan(struct task_struct *p)
+{
+	ACCESS_ONCE(p->mm->numa_scan_seq)++;
+	p->mm->numa_scan_offset = 0;
+}
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -817,6 +826,9 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	struct vm_area_struct *vma;
+	unsigned long offset, end;
+	long length;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -846,18 +858,45 @@ void task_numa_work(struct callback_head *work)
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-	ACCESS_ONCE(mm->numa_scan_seq)++;
-	{
-		struct vm_area_struct *vma;
+	offset = mm->numa_scan_offset;
+	length = sysctl_balance_numa_scan_size;
+	length <<= 20;
 
-		down_read(&mm->mmap_sem);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			if (!vma_migratable(vma))
-				continue;
-			change_prot_numa(vma, vma->vm_start, vma->vm_end);
-		}
-		up_read(&mm->mmap_sem);
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, offset);
+	if (!vma) {
+		reset_ptenuma_scan(p);
+		offset = 0;
+		vma = mm->mmap;
+	}
+	for (; vma && length > 0; vma = vma->vm_next) {
+		if (!vma_migratable(vma))
+			continue;
+
+		/* Skip small VMAs. They are not likely to be of relevance */
+		if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
+			continue;
+
+		offset = max(offset, vma->vm_start);
+		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+		length -= end - offset;
+
+		change_prot_numa(vma, offset, end);
+
+		offset = end;
 	}
+
+	/*
+	 * It is possible to reach the end of the VMA list but the last few VMAs are
+	 * not guaranteed to be vma_migratable. If they are not, we would find the
+	 * !migratable VMA on the next scan but not reset the scanner to the start
+	 * so check it now.
+	 */
+	if (vma)
+		mm->numa_scan_offset = offset;
+	else
+		reset_ptenuma_scan(p);
+	up_read(&mm->mmap_sem);
 }
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1359f51..d191203 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -366,6 +366,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "balance_numa_scan_size_mb",
+		.data		= &sysctl_balance_numa_scan_size,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #endif /* CONFIG_BALANCE_NUMA */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 26/40] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (24 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 25/40] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 27/40] mm: sched: numa: Implement slow start for working set sampling Mel Gorman
                   ` (14 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

By accounting against the present PTEs, scanning speed reflects the
actual present (mapped) memory.
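
A tiny illustration of the unit change (userspace arithmetic only;
assumes 4K pages, i.e. PAGE_SHIFT == 12):

#include <stdio.h>

int main(void)
{
	long pages = 256;		/* sysctl_balance_numa_scan_size in MB */
	unsigned int page_shift = 12;	/* assumed PAGE_SHIFT */

	pages <<= 20 - page_shift;	/* MB in pages, as in the hunk below */
	printf("a 256MB scan budget is %ld present PTEs per pass\n", pages);
	return 0;
}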

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c |   36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 66d8bd2..773ef97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -827,8 +827,8 @@ void task_numa_work(struct callback_head *work)
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
-	unsigned long offset, end;
-	long length;
+	unsigned long start, end;
+	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -858,18 +858,20 @@ void task_numa_work(struct callback_head *work)
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-	offset = mm->numa_scan_offset;
-	length = sysctl_balance_numa_scan_size;
-	length <<= 20;
+	start = mm->numa_scan_offset;
+	pages = sysctl_balance_numa_scan_size;
+	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+	if (!pages)
+		return;
 
 	down_read(&mm->mmap_sem);
-	vma = find_vma(mm, offset);
+	vma = find_vma(mm, start);
 	if (!vma) {
 		reset_ptenuma_scan(p);
-		offset = 0;
+		start = 0;
 		vma = mm->mmap;
 	}
-	for (; vma && length > 0; vma = vma->vm_next) {
+	for (; vma; vma = vma->vm_next) {
 		if (!vma_migratable(vma))
 			continue;
 
@@ -877,15 +879,19 @@ void task_numa_work(struct callback_head *work)
 		if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
 			continue;
 
-		offset = max(offset, vma->vm_start);
-		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
-		length -= end - offset;
-
-		change_prot_numa(vma, offset, end);
+		do {
+			start = max(start, vma->vm_start);
+			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
+			end = min(end, vma->vm_end);
+			pages -= change_prot_numa(vma, start, end);
 
-		offset = end;
+			start = end;
+			if (pages <= 0)
+				goto out;
+		} while (end != vma->vm_end);
 	}
 
+out:
 	/*
 	 * It is possible to reach the end of the VMA list but the last few VMAs are
 	 * not guaranteed to be vma_migratable. If they are not, we would find the
@@ -893,7 +899,7 @@ void task_numa_work(struct callback_head *work)
 	 * so check it now.
 	 */
 	if (vma)
-		mm->numa_scan_offset = offset;
+		mm->numa_scan_offset = start;
 	else
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 27/40] mm: sched: numa: Implement slow start for working set sampling
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (25 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 26/40] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 28/40] mm: numa: Add pte updates, hinting and migration stats Mel Gorman
                   ` (13 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Add a 1 second delay before starting to scan the working set of
a task and starting to balance it amongst nodes.

[ note that before the constant per task WSS sampling rate patch
  the initial scan would happen much later still, in effect that
  patch caused this regression. ]

The theory is that short-run tasks benefit very little from NUMA
placement: they come and go, and they better stick to the node
they were started on. As tasks mature and rebalance to other CPUs
and nodes, so does their NUMA placement have to change and so
does it start to matter more and more.

In practice this change fixes an observable kbuild regression:

   # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

   !NUMA:
   45.291088843 seconds time elapsed                                          ( +-  0.40% )
   45.154231752 seconds time elapsed                                          ( +-  0.36% )

   +NUMA, no slow start:
   46.172308123 seconds time elapsed                                          ( +-  0.30% )
   46.343168745 seconds time elapsed                                          ( +-  0.25% )

   +NUMA, 1 sec slow start:
   45.224189155 seconds time elapsed                                          ( +-  0.25% )
   45.160866532 seconds time elapsed                                          ( +-  0.17% )

and it also fixes an observable perf bench (hackbench) regression:

   # perf stat --null --repeat 10 perf bench sched messaging

   -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
   +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
   +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )

The implementation is simple and straightforward; most of the patch
deals with adding the /proc/sys/kernel/balance_numa_scan_delay_ms tunable
knob.
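
A minimal userspace sketch of adjusting the knob (assumes a kernel with
this series applied so the sysctl exists, and sufficient privileges to
write it):

#include <stdio.h>

int main(void)
{
	const char *knob = "/proc/sys/kernel/balance_numa_scan_delay_ms";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror(knob);
		return 1;
	}
	fprintf(f, "2000\n");	/* e.g. delay the first scan to 2 seconds */
	fclose(f);
	return 0;
}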

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote the changelog, ran measurements, tuned the default. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    2 +-
 kernel/sched/fair.c   |    5 +++++
 kernel/sysctl.c       |    7 +++++++
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index abb1c70..a2b06ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2006,6 +2006,7 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_balance_numa_scan_delay;
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
 extern unsigned int sysctl_balance_numa_scan_size;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81fa185..047e3c7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1543,7 +1543,7 @@ static void __sched_fork(struct task_struct *p)
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
-	p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+	p->numa_scan_period = sysctl_balance_numa_scan_delay;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_BALANCE_NUMA */
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 773ef97..2e65f44 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -788,6 +788,9 @@ unsigned int sysctl_balance_numa_scan_period_max = 100*16;
 /* Portion of address space to scan in MB */
 unsigned int sysctl_balance_numa_scan_size = 256;
 
+/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
+unsigned int sysctl_balance_numa_scan_delay = 1000;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -929,6 +932,8 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
 
 	if (now - curr->node_stamp > period) {
+		if (!curr->node_stamp)
+			curr->numa_scan_period = sysctl_balance_numa_scan_period_min;
 		curr->node_stamp = now;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d191203..5ee587d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -353,6 +353,13 @@ static struct ctl_table kern_table[] = {
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_BALANCE_NUMA
 	{
+		.procname	= "balance_numa_scan_delay_ms",
+		.data		= &sysctl_balance_numa_scan_delay,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "balance_numa_scan_period_min_ms",
 		.data		= &sysctl_balance_numa_scan_period_min,
 		.maxlen		= sizeof(unsigned int),
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 28/40] mm: numa: Add pte updates, hinting and migration stats
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (26 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 27/40] mm: sched: numa: Implement slow start for working set sampling Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 29/40] mm: numa: Migrate on reference policy Mel Gorman
                   ` (12 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

It is tricky to quantify the basic cost of automatic NUMA placement in a
meaningful manner. This patch adds some vmstats that can be used as part
of a basic costing model.

u    = basic unit = sizeof(void *)
Ca   = cost of struct page access = sizeof(struct page) / u
Cpte = Cost PTE access = Ca
Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
	where Cpte is incurred twice for a read and a write and Wlock
	is a constant representing the cost of taking or releasing a
	lock
Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
Ci = Cost of page isolation = Ca + Wi
	where Wi is a constant that should reflect the approximate cost
	of the locking operation
Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
	would imply that remote accesses are 20% more expensive

Balancing cost = Cpte * numa_pte_updates +
		Cnumahint * numa_hint_faults +
		Ci * numa_pages_migrated +
		Cpagecopy * numa_pages_migrated

Note that numa_pages_migrated is used as a measure of how many pages
were isolated even though it would miss pages that failed to migrate. A
vmstat counter could have been added for it but the isolation cost is
pretty marginal in comparison to the overall cost so it seemed overkill.

The ideal way to measure automatic placement benefit would be to count
the number of remote accesses versus local accesses and do something like

	benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

but the information is not readily available. As a workload converges, the
expectation would be that the number of remote numa hints would reduce to 0.

	convergence = numa_hint_faults_local / numa_hint_faults
		where this is measured for the last N number of
		numa hints recorded. When the workload is fully
		converged the value is 1.

This can measure if the placement policy is converging and how fast it is
doing it.
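
For illustration only, a userspace sketch that plugs hypothetical weights
and made-up /proc/vmstat samples into the model above (Wi, Wnuma,
Cnumahint and the counter values are assumptions, not measurements;
struct page is taken as 64 bytes):

#include <stdio.h>

int main(void)
{
	double u = sizeof(void *);
	double Ca = 64 / u;			/* sizeof(struct page) assumed 64 */
	double Cpte = Ca;
	double Cnumahint = 1000;
	double Wi = 10, Wnuma = 1.2;		/* hand-waved weights */
	double Cpagerw = Ca + 4096 / u;		/* PAGE_SIZE assumed 4K */
	double Ci = Ca + Wi;
	double Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma);

	/* made-up vmstat sample */
	double numa_pte_updates = 1000000, numa_hint_faults = 800000;
	double numa_hint_faults_local = 600000, numa_pages_migrated = 50000;

	double cost = Cpte * numa_pte_updates +
		      Cnumahint * numa_hint_faults +
		      Ci * numa_pages_migrated +
		      Cpagecopy * numa_pages_migrated;

	printf("balancing cost (units of u): %.0f\n", cost);
	printf("convergence: %.2f\n", numa_hint_faults_local / numa_hint_faults);
	return 0;
}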

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vm_event_item.h |    6 ++++++
 include/linux/vmstat.h        |    8 ++++++++
 mm/huge_memory.c              |    1 +
 mm/memory.c                   |   12 ++++++++++++
 mm/mempolicy.c                |    2 ++
 mm/migrate.c                  |    3 ++-
 mm/vmstat.c                   |    6 ++++++
 7 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index a1f750b..dded0af 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,6 +38,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_BALANCE_NUMA
+		NUMA_PTE_UPDATES,
+		NUMA_HINT_FAULTS,
+		NUMA_HINT_FAULTS_LOCAL,
+		NUMA_PAGE_MIGRATE,
+#endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
 #endif
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 92a86b2..dffccfa 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -80,6 +80,14 @@ static inline void vm_events_fold_cpu(int cpu)
 
 #endif /* CONFIG_VM_EVENT_COUNTERS */
 
+#ifdef CONFIG_BALANCE_NUMA
+#define count_vm_numa_event(x)     count_vm_event(x)
+#define count_vm_numa_events(x, y) count_vm_events(x, y)
+#else
+#define count_vm_numa_event(x) do {} while (0)
+#define count_vm_numa_events(x, y) do {} while (0)
+#endif /* CONFIG_BALANCE_NUMA */
+
 #define __count_zone_vm_events(item, zone, delta) \
 		__count_vm_events(item##_NORMAL - ZONE_NORMAL + \
 		zone_idx(zone), delta)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b3d4c4b..8f89a98 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1033,6 +1033,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
+	count_vm_numa_event(NUMA_HINT_FAULTS);
 
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1)
diff --git a/mm/memory.c b/mm/memory.c
index b23b081..8367142 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3477,6 +3477,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
 
+	count_vm_numa_event(NUMA_HINT_FAULTS);
 	page = vm_normal_page(vma, addr, pte);
 	if (!page) {
 		pte_unmap_unlock(ptep, ptl);
@@ -3485,6 +3486,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	get_page(page);
 	current_nid = page_to_nid(page);
+	if (current_nid == numa_node_id())
+		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 	target_nid = mpol_misplaced(page, vma, addr);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
@@ -3516,6 +3519,9 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
+	int local_nid = numa_node_id();
+	unsigned long nr_faults = 0;
+	unsigned long nr_faults_local = 0;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3564,10 +3570,16 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		curr_nid = page_to_nid(page);
 		task_numa_fault(curr_nid, 1);
 
+		nr_faults++;
+		if (curr_nid == local_nid)
+			nr_faults_local++;
+
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
+	count_vm_numa_events(NUMA_HINT_FAULTS, nr_faults);
+	count_vm_numa_events(NUMA_HINT_FAULTS_LOCAL, nr_faults_local);
 	return 0;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a7a62fe..516491f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -583,6 +583,8 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 	BUILD_BUG_ON(_PAGE_NUMA != _PAGE_PROTNONE);
 
 	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
+	if (nr_updated)
+		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
 
 	return nr_updated;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index a2c4567..4b876d2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1511,7 +1511,8 @@ int migrate_misplaced_page(struct page *page, int node)
 		if (nr_remaining) {
 			putback_lru_pages(&migratepages);
 			isolated = 0;
-		}
+		} else
+			count_vm_numa_event(NUMA_PAGE_MIGRATE);
 	}
 	BUG_ON(!list_empty(&migratepages));
 out:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3a067fa..cfa386da 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -774,6 +774,12 @@ const char * const vmstat_text[] = {
 
 	"pgrotated",
 
+#ifdef CONFIG_BALANCE_NUMA
+	"numa_pte_updates",
+	"numa_hint_faults",
+	"numa_hint_faults_local",
+	"numa_pages_migrated",
+#endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
 	"pgmigrate_fail",
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 29/40] mm: numa: Migrate on reference policy
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (27 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 28/40] mm: numa: Add pte updates, hinting and migration stats Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 30/40] mm: numa: Migrate pages handled during a pmd_numa hinting fault Mel Gorman
                   ` (11 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

This is the simplest possible policy that still does something of note.
When a pte_numa fault is handled, the page is migrated immediately. Any replacement
policy must at least do better than this and in all likelihood this
policy regresses normal workloads.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   38 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 16fb4e6..0d11c3d 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -67,6 +67,7 @@ enum mpol_rebind_step {
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
+#define MPOL_F_MORON	(1 << 4) /* Migrate On pte_numa Reference On Node */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 516491f..4c1c8d8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -118,6 +118,26 @@ static struct mempolicy default_policy = {
 	.flags = MPOL_F_LOCAL,
 };
 
+static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+
+static struct mempolicy *get_task_policy(struct task_struct *p)
+{
+	struct mempolicy *pol = p->mempolicy;
+	int node;
+
+	if (!pol) {
+		node = numa_node_id();
+		if (node != -1)
+			pol = &preferred_node_policy[node];
+
+		/* preferred_node_policy is not initialised early in boot */
+		if (!pol->mode)
+			pol = NULL;
+	}
+
+	return pol;
+}
+
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
 	/*
@@ -1598,7 +1618,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = task->mempolicy;
+	struct mempolicy *pol = get_task_policy(task);
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
@@ -2021,7 +2041,7 @@ retry_cpuset:
  */
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = get_task_policy(current);
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
 
@@ -2295,6 +2315,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	default:
 		BUG();
 	}
+
+	/* Migrate the page towards the node whose CPU is referencing it */
+	if (pol->flags & MPOL_F_MORON)
+		polnid = numa_node_id();
+
 	if (curnid != polnid)
 		ret = polnid;
 out:
@@ -2483,6 +2508,15 @@ void __init numa_policy_init(void)
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL);
 
+	for_each_node(nid) {
+		preferred_node_policy[nid] = (struct mempolicy) {
+			.refcnt = ATOMIC_INIT(1),
+			.mode = MPOL_PREFERRED,
+			.flags = MPOL_F_MOF | MPOL_F_MORON,
+			.v = { .preferred_node = nid, },
+		};
+	}
+
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
 	 * enabled across suitably sized nodes (default is >= 16MB), or
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 30/40] mm: numa: Migrate pages handled during a pmd_numa hinting fault
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (28 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 29/40] mm: numa: Migrate on reference policy Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 31/40] mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting Mel Gorman
                   ` (10 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

To say that the PMD handling code was incorrectly transferred from autonuma
is an understatement. The intention was to handle a PMD's worth of pages
in the same fault and effectively batch the taking of the PTL and page
migration. The copied version instead merely clears a number of pte_numa
PTE entries; whether any page migration takes place depends on how races
play out. This just happens to work in some cases.

This patch handles pte_numa faults in batch when a pmd_numa fault is
handled. The pages are migrated if they are currently misplaced.
Essentially this is making an assumption that NUMA locality is
on a PMD boundary but that could be addressed, if necessary, by
only setting pmd_numa when all the pages within that PMD are on
the same node.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/memory.c   |   51 ++++++++++++++++++++++++++++++++++-----------------
 mm/mprotect.c |   26 ++++++++++++++++++++------
 2 files changed, 54 insertions(+), 23 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8367142..0f0ce80 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3449,6 +3449,18 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+				unsigned long addr, int current_nid)
+{
+	get_page(page);
+
+	count_vm_numa_event(NUMA_HINT_FAULTS);
+	if (current_nid == numa_node_id())
+		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+
+	return mpol_misplaced(page, vma, addr);
+}
+
 int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
 {
@@ -3477,18 +3489,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
 
-	count_vm_numa_event(NUMA_HINT_FAULTS);
 	page = vm_normal_page(vma, addr, pte);
 	if (!page) {
 		pte_unmap_unlock(ptep, ptl);
 		return 0;
 	}
 
-	get_page(page);
 	current_nid = page_to_nid(page);
-	if (current_nid == numa_node_id())
-		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
-	target_nid = mpol_misplaced(page, vma, addr);
+	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
 		/*
@@ -3505,7 +3513,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		current_nid = target_nid;
 
 out:
-	task_numa_fault(current_nid, 1);
+	if (current_nid != -1)
+		task_numa_fault(current_nid, 1);
 	return 0;
 }
 
@@ -3520,8 +3529,6 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
-	unsigned long nr_faults = 0;
-	unsigned long nr_faults_local = 0;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3544,7 +3551,8 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page *page;
-		int curr_nid;
+		int curr_nid = local_nid;
+		int target_nid;
 		if (!pte_present(pteval))
 			continue;
 		if (!pte_numa(pteval))
@@ -3565,21 +3573,30 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* only check non-shared pages */
 		if (unlikely(page_mapcount(page) != 1))
 			continue;
-		pte_unmap_unlock(pte, ptl);
 
-		curr_nid = page_to_nid(page);
-		task_numa_fault(curr_nid, 1);
+		/*
+		 * Note that the NUMA fault is later accounted to either
+		 * the node that is currently running or where the page is
+		 * migrated to.
+		 */
+		curr_nid = local_nid;
+		target_nid = numa_migrate_prep(page, vma, addr,
+					       page_to_nid(page));
+		if (target_nid == -1) {
+			put_page(page);
+			continue;
+		}
 
-		nr_faults++;
-		if (curr_nid == local_nid)
-			nr_faults_local++;
+		/* Migrate to the requested node */
+		pte_unmap_unlock(pte, ptl);
+		if (migrate_misplaced_page(page, target_nid))
+			curr_nid = target_nid;
+		task_numa_fault(curr_nid, 1);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
-	count_vm_numa_events(NUMA_HINT_FAULTS, nr_faults);
-	count_vm_numa_events(NUMA_HINT_FAULTS_LOCAL, nr_faults_local);
 	return 0;
 }
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1b383b7..47335a9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,12 +37,14 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		int dirty_accountable, int prot_numa, bool *ret_all_same_node)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
+	bool all_same_node = true;
+	int last_nid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -61,6 +63,12 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
+					int this_nid = page_to_nid(page);
+					if (last_nid == -1)
+						last_nid = this_nid;
+					if (last_nid != this_nid)
+						all_same_node = false;
+
 					/* only check non-shared pages */
 					if (!pte_numa(oldpte) &&
 					    page_mapcount(page) == 1) {
@@ -81,7 +89,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 			if (updated)
 				pages++;
-
 			ptep_modify_prot_commit(mm, addr, pte, ptent);
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
@@ -101,6 +108,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
+	*ret_all_same_node = all_same_node;
 	return pages;
 }
 
@@ -111,6 +119,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
+	bool all_same_node;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -127,16 +136,21 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		pages += change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
-
-		if (prot_numa) {
+				 dirty_accountable, prot_numa, &all_same_node);
+
+		/*
+		 * If we are changing protections for NUMA hinting faults then
+		 * set pmd_numa if the examined pages were all on the same
+		 * node. This allows a regular PMD to be handled as one fault
+		 * and effectively batches the taking of the PTL
+		 */
+		if (prot_numa && all_same_node) {
 			struct mm_struct *mm = vma->vm_mm;
 
 			spin_lock(&mm->page_table_lock);
 			set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
 			spin_unlock(&mm->page_table_lock);
 		}
-
 	} while (pmd++, addr = next, addr != end);
 
 	return pages;
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 31/40] mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (29 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 30/40] mm: numa: Migrate pages handled during a pmd_numa hinting fault Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 32/40] mm: numa: Rate limit the amount of memory that is migrated between nodes Mel Gorman
                   ` (9 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

This defines the per-node data used by Migrate On Fault in order to
rate limit the migration. The rate limiting is applied independently
to each destination node.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |   13 +++++++++++++
 mm/page_alloc.c        |    5 +++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a23923b..1ed16e5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -717,6 +717,19 @@ typedef struct pglist_data {
 	struct task_struct *kswapd;	/* Protected by lock_memory_hotplug() */
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
+#ifdef CONFIG_BALANCE_NUMA
+	/*
+	 * Lock serializing the per destination node AutoNUMA memory
+	 * migration rate limiting data.
+	 */
+	spinlock_t balancenuma_migrate_lock;
+
+	/* Rate limiting time interval */
+	unsigned long balancenuma_migrate_next_window;
+
+	/* Number of pages migrated during the rate limiting time interval */
+	unsigned long balancenuma_migrate_nr_pages;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5953dc2..df58654 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4449,6 +4449,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int ret;
 
 	pgdat_resize_init(pgdat);
+#ifdef CONFIG_BALANCE_NUMA
+	spin_lock_init(&pgdat->balancenuma_migrate_lock);
+	pgdat->balancenuma_migrate_nr_pages = 0;
+	pgdat->balancenuma_migrate_next_window = jiffies;
+#endif
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 	pgdat_page_cgroup_init(pgdat);
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 32/40] mm: numa: Rate limit the amount of memory that is migrated between nodes
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (30 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 31/40] mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 33/40] mm: numa: Rate limit setting of pte_numa if node is saturated Mel Gorman
                   ` (8 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

NOTE: This is very heavily based on similar logic in autonuma. It should
	be signed off by Andrea but, because there was no standalone
	patch and it's sufficiently different from what he did, the
	signed-off is omitted. Will be added back if requested.

If a large number of pages are misplaced then the memory bus can be
saturated just migrating pages between nodes. This patch rate-limits
the amount of memory that can be migrated between nodes.
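
The default cap in the hunk below works out as follows (userspace sketch,
assumes 4K pages):

#include <stdio.h>

int main(void)
{
	unsigned int page_shift = 12;		/* assumed PAGE_SHIFT */
	unsigned int interval_ms = 100;		/* migrate_interval_millisecs */
	unsigned int ratelimit_pages = 128 << (20 - page_shift);

	printf("cap: %u pages per %u ms window -> %u MB/s\n",
	       ratelimit_pages, interval_ms,
	       (ratelimit_pages >> (20 - page_shift)) * (1000 / interval_ms));
	return 0;
}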

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c |   31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 4b876d2..cf0970f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1461,12 +1461,21 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 }
 
 /*
+ * page migration rate limiting control.
+ * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs
+ * window of time. Default here says do not migrate more than 1280M per second.
+ */
+static unsigned int migrate_interval_millisecs __read_mostly = 100;
+static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
+
+/*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
 int migrate_misplaced_page(struct page *page, int node)
 {
+	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	LIST_HEAD(migratepages);
 
@@ -1479,8 +1488,27 @@ int migrate_misplaced_page(struct page *page, int node)
 		goto out;
 	}
 
+	/*
+	 * Rate-limit the amount of data that is being migrated to a node.
+	 * Optimal placement is no good if the memory bus is saturated and
+	 * all the time is being spent migrating!
+	 */
+	spin_lock(&pgdat->balancenuma_migrate_lock);
+	if (time_after(jiffies, pgdat->balancenuma_migrate_next_window)) {
+		pgdat->balancenuma_migrate_nr_pages = 0;
+		pgdat->balancenuma_migrate_next_window = jiffies +
+			msecs_to_jiffies(migrate_interval_millisecs);
+	}
+	if (pgdat->balancenuma_migrate_nr_pages > ratelimit_pages) {
+		spin_unlock(&pgdat->balancenuma_migrate_lock);
+		put_page(page);
+		goto out;
+	}
+	pgdat->balancenuma_migrate_nr_pages++;
+	spin_unlock(&pgdat->balancenuma_migrate_lock);
+
 	/* Avoid migrating to a node that is nearly full */
-	if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+	if (migrate_balanced_pgdat(pgdat, 1)) {
 		int page_lru;
 
 		if (isolate_lru_page(page)) {
@@ -1515,6 +1543,7 @@ int migrate_misplaced_page(struct page *page, int node)
 			count_vm_numa_event(NUMA_PAGE_MIGRATE);
 	}
 	BUG_ON(!list_empty(&migratepages));
+
 out:
 	return isolated;
 }
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 33/40] mm: numa: Rate limit setting of pte_numa if node is saturated
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (31 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 32/40] mm: numa: Rate limit the amount of memory that is migrated between nodes Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 34/40] sched: numa: Slowly increase the scanning period as NUMA faults are handled Mel Gorman
                   ` (7 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

If there are a large number of NUMA hinting faults and all of them
are resulting in migrations it may indicate that memory is just
bouncing uselessly around. NUMA balancing cost is likely exceeding
any benefit from locality. Rate limit the PTE updates if the node
is migration rate-limited. As noted in the comments, this distorts
the NUMA faulting statistics.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |    6 ++++++
 kernel/sched/fair.c     |    9 +++++++++
 mm/migrate.c            |   20 ++++++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 69f60b5..0d4ee94 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -41,6 +41,7 @@ extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
 extern int migrate_misplaced_page(struct page *page, int node);
+extern bool migrate_ratelimited(int node);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
@@ -79,6 +80,11 @@ int migrate_misplaced_page(struct page *page, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline
+bool migrate_ratelimited(int node)
+{
+	return false;
+}
 #endif /* CONFIG_MIGRATION */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2e65f44..357057c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -27,6 +27,7 @@
 #include <linux/profile.h>
 #include <linux/interrupt.h>
 #include <linux/mempolicy.h>
+#include <linux/migrate.h>
 #include <linux/task_work.h>
 
 #include <trace/events/sched.h>
@@ -861,6 +862,14 @@ void task_numa_work(struct callback_head *work)
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
+	/*
+	 * Do not set pte_numa if the current running node is rate-limited.
+	 * This loses statistics on the fault but if we are unwilling to
+	 * migrate to this node, it is less likely we can do useful work
+	 */
+	if (migrate_ratelimited(numa_node_id()))
+		return;
+
 	start = mm->numa_scan_offset;
 	pages = sysctl_balance_numa_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
diff --git a/mm/migrate.c b/mm/migrate.c
index cf0970f..3033669 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1464,10 +1464,30 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
  * page migration rate limiting control.
  * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs
  * window of time. Default here says do not migrate more than 1280M per second.
+ * If a node is rate-limited then PTE NUMA updates are also rate-limited. However
+ * as it is faults that reset the window, pte updates will happen unconditionally
+ * if there has not been a fault since @pteupdate_interval_millisecs after the
+ * throttle window closed.
  */
 static unsigned int migrate_interval_millisecs __read_mostly = 100;
+static unsigned int pteupdate_interval_millisecs __read_mostly = 1000;
 static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
 
+/* Returns true if NUMA migration is currently rate limited */
+bool migrate_ratelimited(int node)
+{
+	pg_data_t *pgdat = NODE_DATA(node);
+
+	if (time_after(jiffies, pgdat->balancenuma_migrate_next_window +
+				msecs_to_jiffies(pteupdate_interval_millisecs)))
+		return false;
+
+	if (pgdat->balancenuma_migrate_nr_pages < ratelimit_pages)
+		return false;
+
+	return true;
+}
+
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 34/40] sched: numa: Slowly increase the scanning period as NUMA faults are handled
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (32 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 33/40] mm: numa: Rate limit setting of pte_numa if node is saturated Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 35/40] mm: numa: Introduce last_nid to the page frame Mel Gorman
                   ` (6 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

Currently the rate of scanning for an address space is controlled
by the individual tasks. The next scan is simply determined by
2*p->numa_scan_period.

The 2*p->numa_scan_period is arbitrary and never changes. At this point
there is still no proper policy that decides if a task or process is
properly placed. It just scans and assumes the next NUMA fault will
place it properly. As it is assumed that pages will get properly placed
over time, increase the scan window each time a fault is incurred. This
is a big assumption as noted in the comments.

It should be noted that changing from 2*p->numa_scan_period to
p->numa_scan_period will increase system CPU usage because the scanning
rate has effectively doubled. If that is a problem then the minimum scan
period should be made 200ms instead of restoring the 2* logic.
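
A toy model of how quickly the period saturates with the defaults
(userspace only; assumes HZ=1000 so jiffies_to_msecs(2) is 2ms):

#include <stdio.h>

int main(void)
{
	unsigned int period = 100;	/* scan_period_min in ms */
	unsigned int max = 100 * 16;	/* scan_period_max in ms */
	unsigned int faults = 0;

	while (period < max) {
		period = period + 2 < max ? period + 2 : max;
		faults++;
	}
	printf("period reaches %u ms after %u hinting faults\n", period, faults);
	return 0;
}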

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 357057c..3c632448 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -812,6 +812,15 @@ void task_numa_fault(int node, int pages)
 
 	/* FIXME: Allocate task-specific structure for placement policy here */
 
+	/*
+	 * Assume that as faults occur that pages are getting properly placed
+	 * and fewer NUMA hints are required. Note that this is a big
+	 * assumption, it assumes processes reach a steady state with no
+	 * further phase changes.
+	 */
+	p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
+				p->numa_scan_period + jiffies_to_msecs(2));
+
 	task_numa_placement(p);
 }
 
@@ -858,7 +867,7 @@ void task_numa_work(struct callback_head *work)
 	if (p->numa_scan_period == 0)
 		p->numa_scan_period = sysctl_balance_numa_scan_period_min;
 
-	next_scan = now + 2*msecs_to_jiffies(p->numa_scan_period);
+	next_scan = now + msecs_to_jiffies(p->numa_scan_period);
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 35/40] mm: numa: Introduce last_nid to the page frame
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (33 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 34/40] sched: numa: Slowly increase the scanning period as NUMA faults are handled Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 36/40] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships Mel Gorman
                   ` (5 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

This patch introduces a last_nid field to the page struct. This is used
to build a two-stage filter in the next patch that is aimed at
mitigating a problem whereby pages migrate to the wrong node when
referenced by a process that was running off its home node.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h       |   30 ++++++++++++++++++++++++++++++
 include/linux/mm_types.h |    4 ++++
 mm/page_alloc.c          |    2 ++
 3 files changed, 36 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d04c2f0..a0834e1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -693,6 +693,36 @@ static inline int page_to_nid(const struct page *page)
 }
 #endif
 
+#ifdef CONFIG_BALANCE_NUMA
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	return xchg(&page->_last_nid, nid);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return page->_last_nid;
+}
+static inline void reset_page_last_nid(struct page *page)
+{
+	page->_last_nid = -1;
+}
+#else
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	return page_to_nid(page);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return page_to_nid(page);
+}
+
+static inline void reset_page_last_nid(struct page *page)
+{
+}
+#endif
+
 static inline struct zone *page_zone(const struct page *page)
 {
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b40f4ef..6b478ff 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -175,6 +175,10 @@ struct page {
 	 */
 	void *shadow;
 #endif
+
+#ifdef CONFIG_BALANCE_NUMA
+	int _last_nid;
+#endif
 }
 /*
  * The struct page can be forced to be double word aligned so that atomic ops
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df58654..fd6a073 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -608,6 +608,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
+	reset_page_last_nid(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3826,6 +3827,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		reset_page_mapcount(page);
+		reset_page_last_nid(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 36/40] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (34 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 35/40] mm: numa: Introduce last_nid to the page frame Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 37/40] mm: numa: Add THP migration for the NUMA working set scanning fault case Mel Gorman
                   ` (4 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

Note: This two-stage filter was taken directly from the sched/numa patch
	"sched, numa, mm: Add the scanning page fault machinery" but is
	only a partial extraction. As the end result is not necessarily
	recognisable, the signed-offs-by had to be removed. Will be added
	back if requested.

While it is desirable that all threads in a process run on its home
node, this is not always possible or necessary. There may be more
threads than exist within the node or the node might over-subscribed
with unrelated processes.

This can cause a situation whereby a page gets migrated off its home
node because the threads clearing pte_numa were running off-node. This
patch uses page->last_nid to build a two-stage filter before pages get
migrated to avoid problems with short or unlikely task<->node
relationships.
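
A userspace toy of the filter's effect with a hypothetical access
probability (it only demonstrates the p^2 squishing described in the
comment added below, nothing more):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	double p = 0.1;		/* chance a sample comes from this node */
	int samples = 1000000, last_nid = -1, migrations = 0, i;

	srand(1);
	for (i = 0; i < samples; i++) {
		int nid = ((double)rand() / RAND_MAX) < p ? 1 : 0;

		/* act only when two consecutive samples name the same node */
		if (nid == 1 && last_nid == 1)
			migrations++;
		last_nid = nid;
	}
	printf("acted on %.4f of samples, p^2 = %.4f\n",
	       (double)migrations / samples, p * p);
	return 0;
}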

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mempolicy.c |   30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4c1c8d8..fd20e28 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2317,9 +2317,37 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	}
 
 	/* Migrate the page towards the node whose CPU is referencing it */
-	if (pol->flags & MPOL_F_MORON)
+	if (pol->flags & MPOL_F_MORON) {
+		int last_nid;
+
 		polnid = numa_node_id();
 
+		/*
+		 * Multi-stage node selection is used in conjunction
+		 * with a periodic migration fault to build a temporal
+		 * task<->page relation. By using a two-stage filter we
+		 * remove short/unlikely relations.
+		 *
+		 * Using P(p) ~ n_p / n_t as per frequentist
+		 * probability, we can equate a task's usage of a
+		 * particular page (n_p) per total usage of this
+		 * page (n_t) (in a given time-span) to a probability.
+		 *
+		 * Our periodic faults will sample this probability and
+		 * getting the same result twice in a row, given these
+		 * samples are fully independent, is then given by
+		 * P(n)^2, provided our sample period is sufficiently
+		 * short compared to the usage pattern.
+		 *
+		 * This quadric squishes small probabilities, making
+		 * it less likely we act on an unlikely task<->page
+		 * relation.
+		 */
+		last_nid = page_xchg_last_nid(page, polnid);
+		if (last_nid != polnid)
+			goto out;
+	}
+
 	if (curnid != polnid)
 		ret = polnid;
 out:
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 37/40] mm: numa: Add THP migration for the NUMA working set scanning fault case.
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (35 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 36/40] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-23 10:43   ` [PATCH] mm: numa: Add THP migration for the NUMA working set scanning fault case -fixes Mel Gorman
  2012-11-22 19:25 ` [PATCH 38/40] mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate Mel Gorman
                   ` (3 subsequent siblings)
  40 siblings, 1 reply; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

Note: This is very heavily based on a patch from Peter Zijlstra with
	fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner.  That patch
	put a lot of migration logic into mm/huge_memory.c where it does
	not belong. This version tries to share some of the migration
	logic with migrate_misplaced_page.  However, it should be noted
	that now migrate.c is doing more with the pagetable manipulation
	than is preferred. The end result is barely recognisable so, as
	before, the signed-offs had to be removed but will be re-added if
	the original authors are ok with it.

Add THP migration for the NUMA working set scanning fault case.

It uses the page lock to serialize. No migration pte dance is
necessary because the pte is already unmapped when we decide
to migrate.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |   16 ++++
 mm/huge_memory.c        |   55 +++++++-----
 mm/internal.h           |    7 +-
 mm/memcontrol.c         |    7 +-
 mm/migrate.c            |  212 ++++++++++++++++++++++++++++++++++++++---------
 5 files changed, 234 insertions(+), 63 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0d4ee94..23dc324 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -41,6 +41,11 @@ extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
 extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pmd_t *pmd, pmd_t entry,
+			unsigned long address,
+			struct page *page, int node);
 extern bool migrate_ratelimited(int node);
 #else
 
@@ -80,6 +85,17 @@ int migrate_misplaced_page(struct page *page, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+
+static inline
+int migrate_misplaced_transhuge_page(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pmd_t *pmd, pmd_t entry,
+			unsigned long address,
+			struct page *page, int node)
+{
+	return -EAGAIN;
+}
+
 static inline
 bool migrate_ratelimited(int node)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f89a98..e74cb93 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -600,7 +600,7 @@ out:
 }
 __setup("transparent_hugepage=", setup_transparent_hugepage);
 
-static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
 		pmd = pmd_mkwrite(pmd);
@@ -1022,9 +1022,11 @@ out:
 int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
-	struct page *page = NULL;
+	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int target_nid;
+	bool migrated;
+	bool page_locked = false;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1032,39 +1034,54 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(pmd);
 	get_page(page);
-	spin_unlock(&mm->page_table_lock);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 
 	target_nid = mpol_misplaced(page, vma, haddr);
-	if (target_nid == -1)
+	if (target_nid == -1) {
+		put_page(page);
 		goto clear_pmdnuma;
+	}
 
-	/*
-	 * Due to lacking code to migrate thp pages, we'll split
-	 * (which preserves the special PROT_NONE) and re-take the
-	 * fault on the normal pages.
-	 */
-	split_huge_page(page);
-	put_page(page);
-
-	return 0;
+	/* Acquire the page lock to serialise THP migrations */
+	spin_unlock(&mm->page_table_lock);
+	lock_page(page);
+	page_locked = true;
 
-clear_pmdnuma:
+	/* Confirm the PMD did not change while the page was locked */
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(pmd, *pmdp)))
+	if (unlikely(!pmd_same(pmd, *pmdp))) {
+		unlock_page(page);
+		put_page(page);
 		goto out_unlock;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	/* Migrate the THP to the requested node */
+	migrated = migrate_misplaced_transhuge_page(mm, vma,
+				pmdp, pmd, addr,
+				page, target_nid);
+	if (!migrated) {
+		spin_lock(&mm->page_table_lock);
+		if (unlikely(!pmd_same(pmd, *pmdp))) {
+			unlock_page(page);
+			goto out_unlock;
+		}
+		goto clear_pmdnuma;
+	}
+
+	task_numa_fault(target_nid, HPAGE_PMD_NR);
+	return 0;
 
+clear_pmdnuma:
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
 	VM_BUG_ON(pmd_numa(*pmdp));
 	update_mmu_cache_pmd(vma, addr, pmdp);
+	if (page_locked)
+		unlock_page(page);
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
-	if (page) {
-		put_page(page);
-		task_numa_fault(numa_node_id(), HPAGE_PMD_NR);
-	}
 	return 0;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index a4fa284..7e60ac8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -212,15 +212,18 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 {
 	if (TestClearPageMlocked(page)) {
 		unsigned long flags;
+		int nr_pages = hpage_nr_pages(page);
 
 		local_irq_save(flags);
-		__dec_zone_page_state(page, NR_MLOCK);
+		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 		SetPageMlocked(newpage);
-		__inc_zone_page_state(newpage, NR_MLOCK);
+		__mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
 		local_irq_restore(flags);
 	}
 }
 
+extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern unsigned long vma_address(struct page *page,
 				 struct vm_area_struct *vma);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dd39ba0..d97af96 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3288,15 +3288,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 				  struct mem_cgroup **memcgp)
 {
 	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
 	enum charge_type ctype;
 
 	*memcgp = NULL;
 
-	VM_BUG_ON(PageTransHuge(page));
 	if (mem_cgroup_disabled())
 		return;
 
+	if (PageTransHuge(page))
+		nr_pages <<= compound_order(page);
+
 	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
@@ -3358,7 +3361,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 	 * charged to the res_counter since we plan on replacing the
 	 * old one and only one page is going to be left afterwards.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
 }
 
 /* remove redundant charge if migration failed*/
diff --git a/mm/migrate.c b/mm/migrate.c
index 3033669..d7c5bdf 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -410,7 +410,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
  */
 void migrate_page_copy(struct page *newpage, struct page *page)
 {
-	if (PageHuge(page))
+	if (PageHuge(page) || PageTransHuge(page))
 		copy_huge_page(newpage, page);
 	else
 		copy_highpage(newpage, page);
@@ -1488,25 +1488,10 @@ bool migrate_ratelimited(int node)
 	return true;
 }
 
-/*
- * Attempt to migrate a misplaced page to the specified destination
- * node. Caller is expected to have an elevated reference count on
- * the page that will be dropped by this function before returning.
- */
-int migrate_misplaced_page(struct page *page, int node)
+/* Returns true if the node is migrate rate-limited after the update */
+bool numamigrate_update_ratelimit(pg_data_t *pgdat)
 {
-	pg_data_t *pgdat = NODE_DATA(node);
-	int isolated = 0;
-	LIST_HEAD(migratepages);
-
-	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
-	 */
-	if (page_mapcount(page) != 1) {
-		put_page(page);
-		goto out;
-	}
+	bool rate_limited = false;
 
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
@@ -1519,23 +1504,25 @@ int migrate_misplaced_page(struct page *page, int node)
 		pgdat->balancenuma_migrate_next_window = jiffies +
 			msecs_to_jiffies(migrate_interval_millisecs);
 	}
-	if (pgdat->balancenuma_migrate_nr_pages > ratelimit_pages) {
-		spin_unlock(&pgdat->balancenuma_migrate_lock);
-		put_page(page);
-		goto out;
-	}
-	pgdat->balancenuma_migrate_nr_pages++;
+	if (pgdat->balancenuma_migrate_nr_pages > ratelimit_pages)
+		rate_limited = true;
+	else
+		pgdat->balancenuma_migrate_nr_pages++;
 	spin_unlock(&pgdat->balancenuma_migrate_lock);
+	
+	return rate_limited;
+}
 
+int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
+{
 	/* Avoid migrating to a node that is nearly full */
 	if (migrate_balanced_pgdat(pgdat, 1)) {
 		int page_lru;
 
 		if (isolate_lru_page(page)) {
 			put_page(page);
-			goto out;
+			return 0;
 		}
-		isolated = 1;
 
 		/*
 		 * Page is isolated which takes a reference count so now the
@@ -1546,27 +1533,172 @@ int migrate_misplaced_page(struct page *page, int node)
 
 		page_lru = page_is_file_cache(page);
 		inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
-		list_add(&page->lru, &migratepages);
 	}
 
-	if (isolated) {
-		int nr_remaining;
-
-		nr_remaining = migrate_pages(&migratepages,
-				alloc_misplaced_dst_page,
-				node, false, MIGRATE_ASYNC,
-				MR_NUMA_MISPLACED);
-		if (nr_remaining) {
-			putback_lru_pages(&migratepages);
-			isolated = 0;
-		} else
-			count_vm_numa_event(NUMA_PAGE_MIGRATE);
+	return 1;
+}
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node. Caller is expected to have an elevated reference count on
+ * the page that will be dropped by this function before returning.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+	pg_data_t *pgdat = NODE_DATA(node);
+	int isolated = 0;
+	int nr_remaining;
+	LIST_HEAD(migratepages);
+
+	/*
+	 * Don't migrate pages that are mapped in multiple processes.
+	 * TODO: Handle false sharing detection instead of this hammer
+	 */
+	if (page_mapcount(page) != 1) {
+		put_page(page);
+		goto out;
 	}
+
+	/*
+	 * Rate-limit the amount of data that is being migrated to a node.
+	 * Optimal placement is no good if the memory bus is saturated and
+	 * all the time is being spent migrating!
+	 */
+	if (numamigrate_update_ratelimit(pgdat)) {
+		put_page(page);
+		goto out;
+	}
+
+	isolated = numamigrate_isolate_page(pgdat, page);
+	if (!isolated)
+		goto out;
+
+	list_add(&page->lru, &migratepages);
+	nr_remaining = migrate_pages(&migratepages,
+			alloc_misplaced_dst_page,
+			node, false, MIGRATE_ASYNC,
+			MR_NUMA_MISPLACED);
+	if (nr_remaining) {
+		putback_lru_pages(&migratepages);
+		isolated = 0;
+	} else
+		count_vm_numa_event(NUMA_PAGE_MIGRATE);
 	BUG_ON(!list_empty(&migratepages));
 
 out:
 	return isolated;
 }
+
+int migrate_misplaced_transhuge_page(struct mm_struct *mm,
+				struct vm_area_struct *vma,
+				pmd_t *pmd, pmd_t entry,
+				unsigned long address,
+				struct page *page, int node)
+{
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	pg_data_t *pgdat = NODE_DATA(node);
+	int isolated = 0;
+	LIST_HEAD(migratepages);
+	struct page *new_page = NULL;
+	struct mem_cgroup *memcg = NULL;
+	int page_lru = page_is_file_cache(page);
+
+	/*
+	 * Don't migrate pages that are mapped in multiple processes.
+	 * TODO: Handle false sharing detection instead of this hammer
+	 */
+	if (page_mapcount(page) != 1)
+		goto out_dropref;
+
+	/*
+	 * Rate-limit the amount of data that is being migrated to a node.
+	 * Optimal placement is no good if the memory bus is saturated and
+	 * all the time is being spent migrating!
+	 */
+	if (numamigrate_update_ratelimit(pgdat))
+		goto out_dropref;
+
+	new_page = alloc_pages_node(node,
+		(GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
+	if (!new_page)
+		goto out_dropref;
+
+	isolated = numamigrate_isolate_page(pgdat, page);
+	if (!isolated)
+		goto out_keep_locked;
+	list_add(&page->lru, &migratepages);
+
+	/* Prepare a page as a migration target */
+	__set_page_locked(new_page);
+	SetPageSwapBacked(new_page);
+
+	/* anon mapping, we can simply copy page->mapping to the new page: */
+	new_page->mapping = page->mapping;
+	new_page->index = page->index;
+	migrate_page_copy(new_page, page);
+	WARN_ON(PageLRU(new_page));
+
+	/* Recheck the target PMD */
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry))) {
+		spin_unlock(&mm->page_table_lock);
+
+		/* Reverse changes made by migrate_page_copy() */
+		if (TestClearPageActive(new_page))
+			SetPageActive(page);
+		if (TestClearPageUnevictable(new_page))
+			SetPageUnevictable(page);
+		mlock_migrate_page(page, new_page);
+
+		unlock_page(new_page);
+		put_page(new_page);		/* Free it */
+
+		unlock_page(page);
+		putback_lru_page(page);
+		goto out;
+	}
+
+	/*
+	 * Traditional migration needs to prepare the memcg charge
+	 * transaction early to prevent the old page from being
+	 * uncharged when installing migration entries.  Here we can
+	 * save the potential rollback and start the charge transfer
+	 * only when migration is already known to end successfully.
+	 */
+	mem_cgroup_prepare_migration(page, new_page, &memcg);
+
+	entry = mk_pmd(new_page, vma->vm_page_prot);
+	entry = pmd_mknonnuma(entry);
+	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+	entry = pmd_mkhuge(entry);
+
+	page_add_new_anon_rmap(new_page, vma, haddr);
+
+	set_pmd_at(mm, haddr, pmd, entry);
+	update_mmu_cache_pmd(vma, address, entry);
+	page_remove_rmap(page);
+	/*
+	 * Finish the charge transaction under the page table lock to
+	 * prevent split_huge_page() from dividing up the charge
+	 * before it's fully transferred to the new page.
+	 */
+	mem_cgroup_end_migration(memcg, page, new_page, true);
+	spin_unlock(&mm->page_table_lock);
+
+	unlock_page(new_page);
+	unlock_page(page);
+	put_page(page);			/* Drop the rmap reference */
+	put_page(page);			/* Drop the LRU isolation reference */
+
+out:
+	dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+	return isolated;
+
+out_dropref:
+	put_page(page);
+out_keep_locked:
+	return 0;
+}
 #endif /* CONFIG_BALANCE_NUMA */
 
 #endif /* CONFIG_NUMA */
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread
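
The rate limiting that this patch moves into the new
numamigrate_update_ratelimit() helper is a per-node time window with a page
budget. A stand-alone sketch of that shape (not from the patch; WINDOW_MS
and RATELIMIT_PAGES are arbitrary stand-ins for migrate_interval_millisecs
and ratelimit_pages):

#include <stdbool.h>
#include <stdio.h>

#define WINDOW_MS	100	/* stand-in for migrate_interval_millisecs */
#define RATELIMIT_PAGES	32	/* stand-in for ratelimit_pages */

static unsigned long next_window;
static unsigned long nr_pages_this_window;

/* Returns true if the node would be rate-limited for this migration */
static bool update_ratelimit(unsigned long now_ms)
{
	if (now_ms > next_window) {
		nr_pages_this_window = 0;
		next_window = now_ms + WINDOW_MS;
	}
	if (nr_pages_this_window > RATELIMIT_PAGES)
		return true;
	nr_pages_this_window++;
	return false;
}

int main(void)
{
	unsigned long t;
	int allowed = 0, limited = 0;

	/* One migration attempt per millisecond for half a second */
	for (t = 0; t < 500; t++) {
		if (update_ratelimit(t))
			limited++;
		else
			allowed++;
	}
	printf("allowed %d, rate-limited %d\n", allowed, limited);
	return 0;
}

Each window admits a bounded number of pages and refuses the rest until the
window rolls over.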

* [PATCH 38/40] mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (36 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 37/40] mm: numa: Add THP migration for the NUMA working set scanning fault case Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 39/40] mm: sched: numa: Control enabling and disabling of NUMA balancing Mel Gorman
                   ` (2 subsequent siblings)
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

The PTE scanning rate and fault rates are two of the biggest sources of
system CPU overhead with automatic NUMA placement.  Ideally a proper policy
would detect if a workload was properly placed and scheduled, and adjust the
PTE scanning rate accordingly. We do not track the necessary information
to do that but we at least know if we migrated or not.

This patch scans more slowly if a page was not migrated as the result of a
NUMA hinting fault, backing off as far as
sysctl_balance_numa_scan_period_max, which is now higher than the previous
default. Once every minute the scan period is reset in case the workload
changes phase.

This is hilariously crude and the times are arbitrary. Workloads will
converge slowly in comparison to what a proper policy should be able to
do. On the plus side, we will chew up less CPU for workloads that have no
need for automatic balancing.
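
A rough model of the resulting behaviour, using the defaults from this patch
(minimum 100ms, maximum 100*50ms, reset every 100*600ms) and assuming
jiffies_to_msecs(10) comes to 10ms (HZ=1000) and that every scan pass
produces only non-migrating faults:

#include <stdio.h>

#define SCAN_PERIOD_MIN		100		/* ms, patch default */
#define SCAN_PERIOD_MAX		(100 * 50)	/* ms, patch default */
#define SCAN_PERIOD_RESET	(100 * 600)	/* ms, patch default */
#define FAULT_BACKOFF		10		/* assumed jiffies_to_msecs(10), HZ=1000 */

int main(void)
{
	unsigned int period = SCAN_PERIOD_MIN;
	unsigned long now = 0, next_reset = SCAN_PERIOD_RESET;
	int i = 0;

	while (now < 2UL * SCAN_PERIOD_RESET) {
		if (now > next_reset) {
			/* once a minute: assume a phase change, start over */
			period = SCAN_PERIOD_MIN;
			next_reset = now + SCAN_PERIOD_RESET;
		}

		/* every pass produced only non-migrating faults: back off */
		if (period + FAULT_BACKOFF < SCAN_PERIOD_MAX)
			period += FAULT_BACKOFF;
		else
			period = SCAN_PERIOD_MAX;

		now += period;		/* next scan pass one period later */
		if (++i % 20 == 0)
			printf("t=%6lums  scan period %4ums\n", now, period);
	}
	return 0;
}

The period climbs towards the maximum and snaps back to the minimum once a
minute, which is the crude back-off described above.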

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h |    3 +++
 include/linux/sched.h    |    5 +++--
 kernel/sched/core.c      |    1 +
 kernel/sched/fair.c      |   29 +++++++++++++++++++++--------
 kernel/sysctl.c          |    7 +++++++
 mm/huge_memory.c         |   11 +++++++++--
 mm/memory.c              |   12 ++++++++----
 7 files changed, 52 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6b478ff..62d18a9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -410,6 +410,9 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
+	/* numa_next_reset is when the PTE scanner period will be reset */
+	unsigned long numa_next_reset;
+
 	/* Restart point for scanning and setting pte_numa */
 	unsigned long numa_scan_offset;
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a2b06ea..1068afd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1562,9 +1562,9 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_BALANCE_NUMA
-extern void task_numa_fault(int node, int pages);
+extern void task_numa_fault(int node, int pages, bool migrated);
 #else
-static inline void task_numa_fault(int node, int pages)
+static inline void task_numa_fault(int node, int pages, bool migrated)
 {
 }
 #endif
@@ -2009,6 +2009,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_balance_numa_scan_delay;
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_scan_period_reset;
 extern unsigned int sysctl_balance_numa_scan_size;
 extern unsigned int sysctl_balance_numa_settle_count;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 047e3c7..a59d869 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1537,6 +1537,7 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_BALANCE_NUMA
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_next_scan = jiffies;
+		p->mm->numa_next_reset = jiffies;
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c632448..c1be907 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -784,7 +784,8 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * numa task sample period in ms
  */
 unsigned int sysctl_balance_numa_scan_period_min = 100;
-unsigned int sysctl_balance_numa_scan_period_max = 100*16;
+unsigned int sysctl_balance_numa_scan_period_max = 100*50;
+unsigned int sysctl_balance_numa_scan_period_reset = 100*600;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_balance_numa_scan_size = 256;
@@ -806,20 +807,19 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages)
+void task_numa_fault(int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
 
 	/* FIXME: Allocate task-specific structure for placement policy here */
 
 	/*
-	 * Assume that as faults occur that pages are getting properly placed
-	 * and fewer NUMA hints are required. Note that this is a big
-	 * assumption, it assumes processes reach a steady steady with no
-	 * further phase changes.
+	 * If pages are properly placed (did not migrate) then scan slower.
+	 * This is reset periodically in case of phase changes
 	 */
-	p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
-				p->numa_scan_period + jiffies_to_msecs(2));
+	if (!migrated)
+		p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
+			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
 }
@@ -858,6 +858,19 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	/*
+	 * Reset the scan period if enough time has gone by. Objective is that
+	 * scanning will be reduced if pages are properly placed. As tasks
+	 * can enter different phases this needs to be re-examined. Lacking
+	 * proper tracking of reference behaviour, this blunt hammer is used.
+	 */
+	migrate = mm->numa_next_reset;
+	if (time_after(now, migrate)) {
+		p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+		next_scan = now + msecs_to_jiffies(sysctl_balance_numa_scan_period_reset);
+		xchg(&mm->numa_next_reset, next_scan);
+	}
+
+	/*
 	 * Enforce maximal scan/migration frequency..
 	 */
 	migrate = mm->numa_next_scan;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5ee587d..c335f426 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,6 +367,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "balance_numa_scan_period_reset",
+		.data		= &sysctl_balance_numa_scan_period_reset,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "balance_numa_scan_period_max_ms",
 		.data		= &sysctl_balance_numa_scan_period_max,
 		.maxlen		= sizeof(unsigned int),
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e74cb93..b4e1431 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1027,6 +1027,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int target_nid;
 	bool migrated;
 	bool page_locked = false;
+	int current_nid = -1;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1035,9 +1036,11 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	get_page(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
+	current_nid = page_to_nid(page);
 
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
+		current_nid = page_to_nid(page);
 		put_page(page);
 		goto clear_pmdnuma;
 	}
@@ -1060,7 +1063,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr,
 				page, target_nid);
-	if (!migrated) {
+	if (migrated)
+		current_nid = target_nid;
+	else {
 		spin_lock(&mm->page_table_lock);
 		if (unlikely(!pmd_same(pmd, *pmdp))) {
 			unlock_page(page);
@@ -1069,7 +1074,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto clear_pmdnuma;
 	}
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR);
+	task_numa_fault(current_nid, HPAGE_PMD_NR, migrated);
 	return 0;
 
 clear_pmdnuma:
@@ -1082,6 +1087,8 @@ clear_pmdnuma:
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+	if (current_nid != -1)
+		task_numa_fault(current_nid, HPAGE_PMD_NR, migrated);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 0f0ce80..d0789cc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3468,6 +3468,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int current_nid = -1;
 	int target_nid;
+	bool migrated = false;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3509,12 +3510,13 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Migrate to the requested node */
-	if (migrate_misplaced_page(page, target_nid))
+	migrated = migrate_misplaced_page(page, target_nid);
+	if (migrated)
 		current_nid = target_nid;
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(current_nid, 1);
+		task_numa_fault(current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3553,6 +3555,7 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct page *page;
 		int curr_nid = local_nid;
 		int target_nid;
+		bool migrated;
 		if (!pte_present(pteval))
 			continue;
 		if (!pte_numa(pteval))
@@ -3589,9 +3592,10 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		/* Migrate to the requested node */
 		pte_unmap_unlock(pte, ptl);
-		if (migrate_misplaced_page(page, target_nid))
+		migrated = migrate_misplaced_page(page, target_nid);
+		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1);
+		task_numa_fault(curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 39/40] mm: sched: numa: Control enabling and disabling of NUMA balancing
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (37 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 38/40] mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-22 19:25 ` [PATCH 40/40] mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node Mel Gorman
  2012-11-26 14:58 ` [PATCH 00/41] Automatic NUMA Balancing V6 Mel Gorman
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

This patch adds Kconfig options and kernel parameters to allow the
enabling and disabling of automatic NUMA balancing. The existence
of such a switch was and is very important when debugging problems
related to transparent hugepages and we should have the same for
automatic NUMA placement.
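
For reference, the resulting controls, all taken from this patch: the
build-time default is CONFIG_BALANCE_NUMA_DEFAULT_ENABLED, and it can be
overridden at boot with the new kernel parameter, e.g.

	balancenuma=disable

Only "enable" and "disable" are recognised; any other value prints the
"Unable to parse balancenuma=" warning.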

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/kernel-parameters.txt |    3 +++
 include/linux/sched.h               |    4 +++
 init/Kconfig                        |    8 ++++++
 kernel/sched/core.c                 |   48 ++++++++++++++++++++++++-----------
 kernel/sched/fair.c                 |    3 +++
 kernel/sched/features.h             |    6 +++--
 mm/mempolicy.c                      |   46 +++++++++++++++++++++++++++++++++
 7 files changed, 101 insertions(+), 17 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9776f06..d984acb 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -403,6 +403,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	atkbd.softrepeat= [HW]
 			Use software keyboard repeat
 
+	balancenuma=	[KNL,X86] Enable or disable automatic NUMA balancing.
+			Allowed values are enable and disable
+
 	baycom_epp=	[HW,AX25]
 			Format: <io>,<mode>
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1068afd..2669bdd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1563,10 +1563,14 @@ struct task_struct {
 
 #ifdef CONFIG_BALANCE_NUMA
 extern void task_numa_fault(int node, int pages, bool migrated);
+extern void set_balancenuma_state(bool enabled);
 #else
 static inline void task_numa_fault(int node, int pages, bool migrated)
 {
 }
+static inline void set_balancenuma_state(bool enabled)
+{
+}
 #endif
 
 /*
diff --git a/init/Kconfig b/init/Kconfig
index 6897a05..4cccc00f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -720,6 +720,14 @@ config ARCH_USES_NUMA_PROT_NONE
 	depends on ARCH_WANTS_PROT_NUMA_PROT_NONE
 	depends on BALANCE_NUMA
 
+config BALANCE_NUMA_DEFAULT_ENABLED
+	bool "Automatically enable NUMA aware memory/task placement"
+	default y
+	depends on BALANCE_NUMA
+	help
+	  If set, automatic NUMA balancing will be enabled if running on a NUMA
+	  machine.
+
 config BALANCE_NUMA
 	bool "Memory placement aware NUMA scheduler"
 	default n
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a59d869..4841f4f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -192,23 +192,10 @@ static void sched_feat_disable(int i) { };
 static void sched_feat_enable(int i) { };
 #endif /* HAVE_JUMP_LABEL */
 
-static ssize_t
-sched_feat_write(struct file *filp, const char __user *ubuf,
-		size_t cnt, loff_t *ppos)
+static int sched_feat_set(char *cmp)
 {
-	char buf[64];
-	char *cmp;
-	int neg = 0;
 	int i;
-
-	if (cnt > 63)
-		cnt = 63;
-
-	if (copy_from_user(&buf, ubuf, cnt))
-		return -EFAULT;
-
-	buf[cnt] = 0;
-	cmp = strstrip(buf);
+	int neg = 0;
 
 	if (strncmp(cmp, "NO_", 3) == 0) {
 		neg = 1;
@@ -228,6 +215,27 @@ sched_feat_write(struct file *filp, const char __user *ubuf,
 		}
 	}
 
+	return i;
+}
+
+static ssize_t
+sched_feat_write(struct file *filp, const char __user *ubuf,
+		size_t cnt, loff_t *ppos)
+{
+	char buf[64];
+	char *cmp;
+	int i;
+
+	if (cnt > 63)
+		cnt = 63;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+
+	buf[cnt] = 0;
+	cmp = strstrip(buf);
+
+	i = sched_feat_set(cmp);
 	if (i == __SCHED_FEAT_NR)
 		return -EINVAL;
 
@@ -1549,6 +1557,16 @@ static void __sched_fork(struct task_struct *p)
 #endif /* CONFIG_BALANCE_NUMA */
 }
 
+#ifdef CONFIG_BALANCE_NUMA
+void set_balancenuma_state(bool enabled)
+{
+	if (enabled)
+		sched_feat_set("NUMA");
+	else
+		sched_feat_set("NO_NUMA");
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 /*
  * fork()/clone()-time setup:
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1be907..b4bc459 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -811,6 +811,9 @@ void task_numa_fault(int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
 
+	if (!sched_feat_numa(NUMA))
+		return;
+
 	/* FIXME: Allocate task-specific structure for placement policy here */
 
 	/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7cfd289..d402368 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,8 +63,10 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 
 /*
- * Apply the automatic NUMA scheduling policy
+ * Apply the automatic NUMA scheduling policy. Enabled automatically
+ * at runtime if running on a NUMA machine. Can be controlled via
+ * balancenuma=
  */
 #ifdef CONFIG_BALANCE_NUMA
-SCHED_FEAT(NUMA,	true)
+SCHED_FEAT(NUMA,	false)
 #endif
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fd20e28..56ad9bf 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2521,6 +2521,50 @@ void mpol_free_shared_policy(struct shared_policy *p)
 	mutex_unlock(&p->mutex);
 }
 
+#ifdef CONFIG_BALANCE_NUMA
+static bool __initdata balancenuma_override;
+
+static void __init check_balancenuma_enable(void)
+{
+	bool balancenuma_default = false;
+
+	if (IS_ENABLED(CONFIG_BALANCE_NUMA_DEFAULT_ENABLED))
+		balancenuma_default = true;
+
+	if (nr_node_ids > 1 && !balancenuma_override) {
+		printk(KERN_INFO "Enabling automatic NUMA balancing. "
+			"Configure with balancenuma= or sysctl\n");
+		set_balancenuma_state(balancenuma_default);
+	}
+}
+
+static int __init setup_balancenuma(char *str)
+{
+	int ret = 0;
+	if (!str)
+		goto out;
+	balancenuma_override = true;
+
+	if (!strcmp(str, "enable")) {
+		set_balancenuma_state(true);
+		ret = 1;
+	} else if (!strcmp(str, "disable")) {
+		set_balancenuma_state(false);
+		ret = 1;
+	}
+out:
+	if (!ret)
+		printk(KERN_WARNING "Unable to parse balancenuma=\n");
+
+	return ret;
+}
+__setup("balancenuma=", setup_balancenuma);
+#else
+static inline void __init check_balancenuma_enable(void)
+{
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 /* assumes fs == KERNEL_DS */
 void __init numa_policy_init(void)
 {
@@ -2571,6 +2615,8 @@ void __init numa_policy_init(void)
 
 	if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes))
 		printk("numa_policy_init: interleaving failed\n");
+
+	check_balancenuma_enable();
 }
 
 /* Reset policy of current process to default */
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 40/40] mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (38 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 39/40] mm: sched: numa: Control enabling and disabling of NUMA balancing Mel Gorman
@ 2012-11-22 19:25 ` Mel Gorman
  2012-11-26 14:58 ` [PATCH 00/41] Automatic NUMA Balancing V6 Mel Gorman
  40 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-22 19:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML,
	Mel Gorman

Due to the fact that migrations are driven by the CPU a task is running
on there is no point tracking NUMA faults until one task runs on a new
node. This patch tracks the first node used by an address space. Until
it changes, PTE scanning is disabled and no NUMA hinting faults are
trapped. This should help workloads that are short-lived, do not care
about NUMA placement or have bound themselves to a single node.

This takes advantage of the logic in "mm: sched: numa: Implement slow
start for working set sampling" to delay when the checks are made. This
will take advantage of processes that set their CPU and node bindings
early in their lifetime. It will also potentially allow any initial load
balancing to take place.
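
The gate this adds to task_numa_work() is a small per-mm state machine. A
user-space sketch (not kernel code; should_scan() is a hypothetical helper
and the NUMA_FORCE debugging override is ignored):

#include <stdio.h>

#define NUMA_PTE_SCAN_INIT	-1
#define NUMA_PTE_SCAN_ACTIVE	-2

static int first_nid = NUMA_PTE_SCAN_INIT;

/* Returns 1 if the PTE scanner should run for a task now on @nid */
static int should_scan(int nid)
{
	if (first_nid == NUMA_PTE_SCAN_INIT)
		first_nid = nid;
	if (first_nid != NUMA_PTE_SCAN_ACTIVE) {
		if (nid == first_nid)
			return 0;	/* still on the first node: skip */
		first_nid = NUMA_PTE_SCAN_ACTIVE;
	}
	return 1;
}

int main(void)
{
	int nodes[] = { 0, 0, 0, 1, 0, 1 };	/* nodes the task runs on */
	unsigned int i;

	for (i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++)
		printf("running on node %d -> %s\n", nodes[i],
		       should_scan(nodes[i]) ? "scan" : "skip");
	return 0;
}

For the node sequence in main(), the first three passes on node 0 skip
scanning; the first pass on node 1 flips the state to NUMA_PTE_SCAN_ACTIVE
and scanning stays enabled from then on.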

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h |   10 ++++++++++
 kernel/fork.c            |    3 +++
 kernel/sched/fair.c      |   18 ++++++++++++++++++
 kernel/sched/features.h  |    4 +++-
 4 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 62d18a9..e4551c1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -418,10 +418,20 @@ struct mm_struct {
 
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
+
+	/*
+	 * The first node a task was scheduled on. Once the task runs on
+	 * a different node, PTE scanning is switched on.
+	 */
+	int first_nid;
 #endif
 	struct uprobes_state uprobes_state;
 };
 
+/* first nid will either be a valid NID or one of these values */
+#define NUMA_PTE_SCAN_INIT	-1
+#define NUMA_PTE_SCAN_ACTIVE	-2
+
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
 #ifdef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/fork.c b/kernel/fork.c
index 8b20ab7..e39111a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -821,6 +821,9 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
+#ifdef CONFIG_BALANCE_NUMA
+	mm->first_nid = NUMA_PTE_SCAN_INIT;
+#endif
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4bc459..fd9c78c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -861,6 +861,24 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	/*
+	 * We do not care about task placement until a task runs on a node
+	 * other than the first one used by the address space. This is
+	 * largely because migrations are driven by what CPU the task
+	 * is running on. If it's never scheduled on another node, it'll
+	 * not migrate so why bother trapping the fault.
+	 */
+	if (mm->first_nid == NUMA_PTE_SCAN_INIT)
+		mm->first_nid = numa_node_id();
+	if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
+		/* Are we running on a new node yet? */
+		if (numa_node_id() == mm->first_nid &&
+		    !sched_feat_numa(NUMA_FORCE))
+			return;
+
+		mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
+	}
+
+	/*
 	 * Reset the scan period if enough time has gone by. Objective is that
 	 * scanning will be reduced if pages are properly placed. As tasks
 	 * can enter different phases this needs to be re-examined. Lacking
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d402368..c3c86fd 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -65,8 +65,10 @@ SCHED_FEAT(LB_MIN, false)
 /*
  * Apply the automatic NUMA scheduling policy. Enabled automatically
  * at runtime if running on a NUMA machine. Can be controlled via
- * balancenuma=
+ * balancenuma=. Allow PTE scanning to be forced on UMA machines
+ * for debugging the core machinery.
  */
 #ifdef CONFIG_BALANCE_NUMA
 SCHED_FEAT(NUMA,	false)
+SCHED_FEAT(NUMA_FORCE,	false)
 #endif
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/40] x86: mm: drop TLB flush from ptep_set_access_flags
  2012-11-22 19:25 ` [PATCH 02/40] x86: mm: drop TLB flush from ptep_set_access_flags Mel Gorman
@ 2012-11-22 20:56   ` Alan Cox
  2012-11-23  9:09     ` Mel Gorman
  0 siblings, 1 reply; 53+ messages in thread
From: Alan Cox @ 2012-11-22 20:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Lee Schermerhorn, Alex Shi, Srikar Dronamraju, Aneesh Kumar,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML

On Thu, 22 Nov 2012 19:25:15 +0000
Mel Gorman <mgorman@suse.de> wrote:

> From: Rik van Riel <riel@redhat.com>
> 
> Intel has an architectural guarantee that the TLB entry causing
> a page fault gets invalidated automatically. This means
> we should be able to drop the local TLB invalidation.

Can we get an AMD sign off on that ?

Alan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/40] x86: mm: drop TLB flush from ptep_set_access_flags
  2012-11-22 20:56   ` Alan Cox
@ 2012-11-23  9:09     ` Mel Gorman
  2012-11-23  9:53       ` Borislav Petkov
  0 siblings, 1 reply; 53+ messages in thread
From: Mel Gorman @ 2012-11-23  9:09 UTC (permalink / raw)
  To: Alan Cox
  Cc: Peter Zijlstra, Andrea Arcangeli, Borislav Petkov, Ingo Molnar,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

On Thu, Nov 22, 2012 at 08:56:37PM +0000, Alan Cox wrote:
> On Thu, 22 Nov 2012 19:25:15 +0000
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > From: Rik van Riel <riel@redhat.com>
> > 
> > Intel has an architectural guarantee that the TLB entry causing
> > a page fault gets invalidated automatically. This means
> > we should be able to drop the local TLB invalidation.
> 
> Can we get an AMD sign off on that ?
> 

Hi Alan,

You sort of can[1]. Borislav Petkov answered that they do
https://lkml.org/lkml/2012/11/17/85 and quoted the manual at
https://lkml.org/lkml/2012/10/29/414 saying that this should be ok.

[1] There is no delicate way of putting it. I've no idea what the
    current status of current and former AMD kernel developers is.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/40] x86: mm: drop TLB flush from ptep_set_access_flags
  2012-11-23  9:09     ` Mel Gorman
@ 2012-11-23  9:53       ` Borislav Petkov
  0 siblings, 0 replies; 53+ messages in thread
From: Borislav Petkov @ 2012-11-23  9:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Alan Cox, Peter Zijlstra, Andrea Arcangeli, Ingo Molnar,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

On Fri, Nov 23, 2012 at 09:09:09AM +0000, Mel Gorman wrote:
> You sortof can[1]. Borislav Petkov answered that they do
> https://lkml.org/lkml/2012/11/17/85 and quoted the manual at
> https://lkml.org/lkml/2012/10/29/414 saying that this should be ok.
> 
> [1] There is no delicate way of putting it. I've no idea what the
>     current status of current and former AMD kernel developers is.

All those based in Dresden don't work for AMD anymore.

But regardless, I've already confirmed with AMD design that this is
actually architectural and we're zapping the TLB entry on a #PF on all
relevant CPUs.

I'd still like to have some sort of an assertion there just in case but,
as Linus pointed out, that won't be easy. I'd guess it's up to you -mm
guys to think up something sick that works under CONFIG_DEBUG_VM :).

HTH.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH] mm: numa: Add THP migration for the NUMA working set scanning fault case -fixes
  2012-11-22 19:25 ` [PATCH 37/40] mm: numa: Add THP migration for the NUMA working set scanning fault case Mel Gorman
@ 2012-11-23 10:43   ` Mel Gorman
  0 siblings, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-23 10:43 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

Hugh pointed out some issues that needed addressing in the THP native
migration patch

o transhuge isolations should be accounted as HPAGE_PMD_NR, not 1
o the migratepages list is doing nothing and is garbage leftover
  from an attempt to mesh transhuge migration properly with normal
  migration. Looking again now, I think it would trigger errors if list
  debugging was enabled and the THP migration failed. When I had a bunch
  of debugging options set earlier in development, list debugging was not
  one of them. This potentially could take a long time to hit but if you
  see bugs that look like LRU list corruption then this could be it.

Additionally

o Account for transhuge pages that are migrated so we know roughly
  how many MB/sec are being migrated for a given workload.
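
For scale: on x86-64 with 4K base pages and 2M transparent hugepages,
HPAGE_PMD_ORDER is 9 and HPAGE_PMD_NR is 1 << 9 = 512, so accounting an
isolated THP as a single page under-counted NR_ISOLATED_ANON by a factor of
512; the first hunk below switches that accounting to HPAGE_PMD_NR.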

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c |   18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index d7c5bdf..b84fded 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1532,7 +1532,12 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 		put_page(page);
 
 		page_lru = page_is_file_cache(page);
-		inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+		if (!PageTransHuge(page))
+			inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+		else
+			mod_zone_page_state(page_zone(page),
+					NR_ISOLATED_ANON + page_lru,
+					HPAGE_PMD_NR);
 	}
 
 	return 1;
@@ -1598,7 +1603,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	unsigned long haddr = address & HPAGE_PMD_MASK;
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
-	LIST_HEAD(migratepages);
 	struct page *new_page = NULL;
 	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
@@ -1626,7 +1630,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated)
 		goto out_keep_locked;
-	list_add(&page->lru, &migratepages);
 
 	/* Prepare a page as a migration target */
 	__set_page_locked(new_page);
@@ -1655,6 +1658,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 		unlock_page(page);
 		putback_lru_page(page);
+
+		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 		goto out;
 	}
 
@@ -1690,8 +1695,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	put_page(page);			/* Drop the rmap reference */
 	put_page(page);			/* Drop the LRU isolation reference */
 
+	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
+	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
+
 out:
-	dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+	mod_zone_page_state(page_zone(page),
+			NR_ISOLATED_ANON + page_lru,
+			-HPAGE_PMD_NR);
 	return isolated;
 
 out_dropref:

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 00/41] Automatic NUMA Balancing V6
  2012-11-22 19:25 [PATCH 00/40] Automatic NUMA Balancing V5 Mel Gorman
                   ` (39 preceding siblings ...)
  2012-11-22 19:25 ` [PATCH 40/40] mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node Mel Gorman
@ 2012-11-26 14:58 ` Mel Gorman
  2012-11-28 13:49   ` [PATCH 00/45] Automatic NUMA Balancing V7 Mel Gorman
  40 siblings, 1 reply; 53+ messages in thread
From: Mel Gorman @ 2012-11-26 14:58 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

Due to recent email floods, I am not resending all 41 patches back out as
the bulk of the changes are related to being bisect and build safe and
shuffling the THP migration patch to the end of the series. There is an
important fix from Hillf Danton in there; it is arguably the biggest
difference between V5 and V6. I'll send the full patchbomb if people prefer.
This is all based against 3.7-rc6

git tree: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v6r15
git tag:  git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v6

This series can be treated as 5 major stages.

1. TLB optimisations that we're likely to want unconditionally.
2. Basic foundation and core mechanics, initial policy that does very little
3. Full PMD fault handling, rate limiting of migration, two-stage migration
   filter to mitigate poor migration decisions.  This will migrate pages
   on a PTE or PMD level using just the current referencing CPU as a
   placement hint
4. Scan rate adaption
5. Native THP migration

Very broadly speaking the TODOs that spring to mind are

1. Revisit MPOL_NOOP and MPOL_MF_LAZY
2. Other architecture support or at least validation that it could be made to work. I'm
   half-hoping that the PPC64 people are watching because they tend to be interested
   in this type of thing.

Some advantages of the series are;

1. It handles regular PMDs which reduces overhead in case where pages within
   a PMD are on the same node
2. It rate limits migrations to avoid saturating the bus and backs off
   PTE scanning (in a fairly heavy manner) if the node is rate-limited
3. It keeps major optimisations like THP towards the end to be sure I am
   not accidentally depending on them
4. It has some vmstats which allow a user to make a rough guess as to how
   much overhead the balancing is introducing
5. It implements a basic policy that acts as a second performance baseline.
   The three baselines become vanilla kernel, basic placement policy,
   complex placement policy. This allows like-with-like comparisons with
   implementations.

Changelog since V5
  o Fix build errors related to config options, make bisect-safe
  o Account for transhuge migrations
  o Count HPAGE_PMD_NR pages when isolating transhuge
  o Account for local transhuge faults

Changelog since V4
  o Allow enabling/disable from command line
  o Delay PTE scanning until tasks are running on a new node
  o THP migration bits needed for memcg
  o Adapt the scanning rate depending on whether pages need to migrate
  o Drop all the scheduler policy stuff on top, it was broken

Changelog since V3
  o Use change_protection
  o Architecture-hook twiddling
  o Port of the THP migration patch.
  o Additional TLB optimisations
  o Fixes from Hillf Danton

Changelog since V2
  o Do not allocate from home node
  o Mostly remove pmd_numa handling for regular pmds
  o HOME policy will allocate from and migrate towards local node
  o Load balancer is more aggressive about moving tasks towards home node
  o Renames to sync up more with -tip version
  o Move pte handlers to generic code
  o Scanning rate starts at 100ms, system CPU usage expected to increase
  o Handle migration of PMD hinting faults
  o Rate limit migration on a per-node basis
  o Alter how the rate of PTE scanning is adapted
  o Rate limit setting of pte_numa if node is congested
  o Only flush local TLB is unmapping a pte_numa page
  o Only consider one CPU in cpu follow algorithm

Changelog since V1
  o Account for faults on the correct node after migration
  o Do not account for THP splits as faults.
  o Account THP faults on the node they occurred
  o Ensure preferred_node_policy is initialised before use
  o Mitigate double faults
  o Add home-node logic
  o Add some tlb-flush mitigation patches
  o Add variation of CPU follows memory algorithm
  o Add last_nid and use it as a two-stage filter before migrating pages
  o Restart the PTE scanner when it reaches the end of the address space
  o Lots of stuff I did not note properly

There are currently two (three depending on how you look at it) competing
approaches to implement support for automatically migrating pages to
optimise NUMA locality. Performance results are available but review
highlighted different problems in both.  They are not compatible with each
other even though some fundamental mechanics should have been the same.
This series addresses part of the integration and sharing problem by
implementing a foundation that either the policy for schednuma or autonuma
can be rebased on.

The initial policy it implements is a very basic greedy policy called
"Migrate On Reference Of pte_numa Node (MORON)".  I expect people to
build upon this revised policy and rename it to something more sensible
that reflects what it means. The ideal *worst-case* behaviour is that
it is comparable to current mainline but for some workloads this is an
improvement over mainline.

In terms of building on top of the foundation the ideal would be that
patches affect one of the following areas although obviously that will
not always be possible

1. The PTE update helper functions
2. The PTE scanning machinery driven from task_numa_tick
3. Task and process fault accounting and how that information is used
   to determine if a page is misplaced
4. Fault handling, migrating the page if misplaced, what information is
   provided to the placement policy
5. Scheduler and load balancing

Patches 1-5 are some TLB optimisations that mostly make sense on their own.
	They are likely to make it into the tree either way

Patches 6-7 are an mprotect optimisation

Patches 8-10 move some vmstat counters so that migrated pages get accounted
	for. In the past the primary user of migration was compaction but
	if pages are to migrate for NUMA optimisation then the counters
	need to be generally useful.

Patch 11 defines an arch-specific PTE bit called _PAGE_NUMA that is used
	to trigger faults later in the series. A placement policy is expected
	to use these faults to determine if a page should migrate.  On x86,
	the bit is the same as _PAGE_PROTNONE but other architectures
	may differ. Note that it is also possible to avoid using this bit
	and go with plain PROT_NONE but the resulting helpers are then
	heavier.

Patch 12-14 defines pte_numa, pmd_numa, pte_mknuma, pte_mknonuma and
	friends, updated GUP and huge page splitting.

Patch 15 creates the fault handler for p[te|md]_numa PTEs and just clears
	them again.

Patch 16 adds a MPOL_LOCAL policy so applications can explicitly request the
	historical behaviour.

Patch 17 is premature but adds a MPOL_NOOP policy that can be used in
	conjunction with the LAZY flags introduced later in the series.

Patch 18 adds migrate_misplaced_page which is responsible for migrating
	a page to a new location.

Patch 19 migrates the page on fault if mpol_misplaced() says to do so.

Patch 20 updates the page fault handlers. Transparent huge pages are split.
	Pages pointed to by PTEs are migrated. Pages pointed to by PMDs
	are not properly handled until later in the series.

Patch 21 adds a MPOL_MF_LAZY mempolicy that an interested application can use.
	On the next reference the memory should be migrated to the node that
	references the memory.

Patch 22 reimplements change_prot_numa in terms of change_protection. It could
	be collapsed with patch 21 but this might be easier to review.

Patch 23 notes that the MPOL_MF_LAZY and MPOL_NOOP flags have not been properly
	reviewed and there are no manual pages. They are removed for now and
	need to be revisited.

Patch 24 sets pte_numa within the context of the scheduler.

Patches 25-27 note that the marking of pte_numa has a number of disadvantages and
	instead incrementally updates a limited range of the address space
	each tick.

Patch 28 adds some vmstats that can be used to approximate the cost of the
	scheduling policy in a more fine-grained fashion than looking at
	the system CPU usage.

Patch 29 implements the MORON policy.

Patch 30 properly handles the migration of pages faulted when handling a pmd
	numa hinting fault. This could be improved as it's a bit tangled
	to follow. PMDs are only marked if the PTEs underneath are expected
	to point to pages on the same node.

Patches 31-33 rate-limit the number of pages being migrated and marked as pte_numa

Patch 34 slowly decreases the pte_numa update scanning rate

Patch 35-36 introduces last_nid and uses it to build a two-stage filter
	that delays when a page gets migrated to avoid a situation where
	a task running temporarily off its home node forces a migration.

Patch 37 adapts the scanning rate if pages do not have to be migrated

Patch 38 allows the enabling/disabling from command line

Patch 39 allows balancenuma to be disabled even if !SCHED_DEBUG

Patch 40 delays PTE scanning until a task is scheduled on a new node

Patch 41 implements native THP migration for NUMA hinting faults.

 Documentation/kernel-parameters.txt  |    3 +
 arch/sh/mm/Kconfig                   |    1 +
 arch/x86/Kconfig                     |    2 +
 arch/x86/include/asm/pgtable.h       |   17 +-
 arch/x86/include/asm/pgtable_types.h |   20 +++
 arch/x86/mm/pgtable.c                |    8 +-
 include/asm-generic/pgtable.h        |  110 ++++++++++++
 include/linux/huge_mm.h              |   14 +-
 include/linux/hugetlb.h              |    8 +-
 include/linux/mempolicy.h            |    8 +
 include/linux/migrate.h              |   45 ++++-
 include/linux/mm.h                   |   39 +++++
 include/linux/mm_types.h             |   31 ++++
 include/linux/mmzone.h               |   13 ++
 include/linux/sched.h                |   27 +++
 include/linux/vm_event_item.h        |   12 +-
 include/linux/vmstat.h               |    8 +
 include/trace/events/migrate.h       |   51 ++++++
 include/uapi/linux/mempolicy.h       |   15 +-
 init/Kconfig                         |   41 +++++
 kernel/fork.c                        |    3 +
 kernel/sched/core.c                  |   71 ++++++--
 kernel/sched/fair.c                  |  227 ++++++++++++++++++++++++
 kernel/sched/features.h              |   11 ++
 kernel/sched/sched.h                 |   12 ++
 kernel/sysctl.c                      |   45 ++++-
 mm/compaction.c                      |   15 +-
 mm/huge_memory.c                     |   94 +++++++++-
 mm/hugetlb.c                         |   10 +-
 mm/internal.h                        |    7 +-
 mm/memcontrol.c                      |    7 +-
 mm/memory-failure.c                  |    3 +-
 mm/memory.c                          |  188 +++++++++++++++++++-
 mm/memory_hotplug.c                  |    3 +-
 mm/mempolicy.c                       |  283 +++++++++++++++++++++++++++---
 mm/migrate.c                         |  319 +++++++++++++++++++++++++++++++++-
 mm/mprotect.c                        |  124 ++++++++++---
 mm/page_alloc.c                      |   10 +-
 mm/pgtable-generic.c                 |    9 +-
 mm/vmstat.c                          |   16 +-
 40 files changed, 1821 insertions(+), 109 deletions(-)
 create mode 100644 include/trace/events/migrate.h

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 00/45] Automatic NUMA Balancing V7
  2012-11-26 14:58 ` [PATCH 00/41] Automatic NUMA Balancing V6 Mel Gorman
@ 2012-11-28 13:49   ` Mel Gorman
  2012-11-30 11:33     ` [PATCH 00/46] Automatic NUMA Balancing V8 Mel Gorman
  2012-12-07 10:45     ` [PATCH 00/45] Automatic NUMA Balancing V7 Srikar Dronamraju
  0 siblings, 2 replies; 53+ messages in thread
From: Mel Gorman @ 2012-11-28 13:49 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

Like V6, I'm only posting the git tree reference instead of sending out a
flood of emails as the differences are small. The v7 release is justified
by a page reference count bug identified and fixed by Hillf Danton in the
transhuge migration patch.

I'll send the full series if people would prefer that.

git tree: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v7r6
git tag:  git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v7

Changelog since V6
  o Transfer last_nid information during transhuge migration		(dhillf)
  o Transfer last_nid information during splits				(dhillf)
  o Drop page reference if target node is full				(dhillf)
  o Account for transhuge allocation failure as migration failure	(mel)

Changelog since V5
  o Fix build errors related to config options, make bisect-safe
  o Account for transhuge migrations
  o Count HPAGE_PMD_NR pages when isolating transhuge
  o Account for local transhuge faults
  o Fix a memory leak on isolation failure

Changelog since V4
  o Allow enabling/disable from command line
  o Delay PTE scanning until tasks are running on a new node
  o THP migration bits needed for memcg
  o Adapt the scanning rate depending on whether pages need to migrate
  o Drop all the scheduler policy stuff on top, it was broken

Changelog since V3
  o Use change_protection
  o Architecture-hook twiddling
  o Port of the THP migration patch.
  o Additional TLB optimisations
  o Fixes from Hillf Danton

Changelog since V2
  o Do not allocate from home node
  o Mostly remove pmd_numa handling for regular pmds
  o HOME policy will allocate from and migrate towards local node
  o Load balancer is more aggressive about moving tasks towards home node
  o Renames to sync up more with -tip version
  o Move pte handlers to generic code
  o Scanning rate starts at 100ms, system CPU usage expected to increase
  o Handle migration of PMD hinting faults
  o Rate limit migration on a per-node basis
  o Alter how the rate of PTE scanning is adapted
  o Rate limit setting of pte_numa if node is congested
  o Only flush local TLB is unmapping a pte_numa page
  o Only consider one CPU in cpu follow algorithm

Changelog since V1
  o Account for faults on the correct node after migration
  o Do not account for THP splits as faults.
  o Account THP faults on the node they occurred
  o Ensure preferred_node_policy is initialised before use
  o Mitigate double faults
  o Add home-node logic
  o Add some tlb-flush mitigation patches
  o Add variation of CPU follows memory algorithm
  o Add last_nid and use it as a two-stage filter before migrating pages
  o Restart the PTE scanner when it reaches the end of the address space
  o Lots of stuff I did not note properly

There are currently two (three depending on how you look at it) competing
approaches to implement support for automatically migrating pages to
optimise NUMA locality. Performance results are available but review
highlighted different problems in both.  They are not compatible with each
other even though some fundamental mechanics should have been the same.
This series addresses part of the integration and sharing problem by
implementing a foundation that either the policy for schednuma or autonuma
can be rebased on.

The initial policy it implements is a very basic greedy policy called
"Migrate On Reference Of pte_numa Node (MORON)".  I expect people to
build upon this revised policy and rename it to something more sensible
that reflects what it means. The ideal *worst-case* behaviour is that
it is comparable to current mainline but for some workloads this is an
improvement over mainline.

This series can be treated as 5 major stages.

1. TLB optimisations that we're likely to want unconditionally.
2. Basic foundation and core mechanics, initial policy that does very little
3. Full PMD fault handling, rate limiting of migration, two-stage migration
   filter to mitigate poor migration decisions.  This will migrate pages
   on a PTE or PMD level using just the current referencing CPU as a
   placement hint
4. Scan rate adaption
5. Native THP migration

Very broadly speaking the TODOs that spring to mind are

1. Revisit MPOL_NOOP and MPOL_MF_LAZY
2. Other architecture support or at least validation that it could be made to work. I'm
   half-hoping that the PPC64 people are watching because they tend to be interested
   in this type of thing.

Some advantages of the series are;

1. It handles regular PMDs which reduces overhead in the case where pages within
   a PMD are on the same node
2. It rate limits migrations to avoid saturating the bus and backs off
   PTE scanning (in a fairly heavy manner) if the node is rate-limited
3. It keeps major optimisations like THP towards the end to be sure I am
   not accidentally depending on them
4. It has some vmstats which allow a user to make a rough guess as to how
   much overhead the balancing is introducing
5. It implements a basic policy that acts as a second performance baseline.
   The three baselines become vanilla kernel, basic placement policy,
   complex placement policy. This allows like-with-like comparisons with
   implementations.

In terms of building on top of the foundation the ideal would be that
patches affect one of the following areas although obviously that will
not always be possible

1. The PTE update helper functions
2. The PTE scanning machinery driven from task_numa_tick
3. Task and process fault accounting and how that information is used
   to determine if a page is misplaced
4. Fault handling, migrating the page if misplaced, what information is
   provided to the placement policy
5. Scheduler and load balancing

Patches in this series are as follows.

Patches 1-5 are some TLB optimisations that mostly make sense on their own.
	They are likely to make it into the tree either way

Patches 6-7 are an mprotect optimisation

Patches 8-10 move some vmstat counters so that migrated pages get accounted
	for. In the past the primary user of migration was compaction but
	if pages are to migrate for NUMA optimisation then the counters
	need to be generally useful.

Patch 11 defines an arch-specific PTE bit called _PAGE_NUMA that is used
	to trigger faults later in the series. A placement policy is expected
	to use these faults to determine if a page should migrate.  On x86,
	the bit is the same as _PAGE_PROTNONE but other architectures
	may differ. Note that it is also possible to avoid using this bit
	and go with plain PROT_NONE but the resulting helpers are then
	heavier.
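
As an illustration of the shape of this (a sketch only, reusing the helper
names from patches 12-14 and the existing x86 pte_flags/pte_set_flags/
pte_clear_flags helpers; it is not necessarily the exact code in the series):

#define _PAGE_NUMA	_PAGE_PROTNONE

/* A pte_numa pte has the NUMA bit set and _PAGE_PRESENT clear */
static inline int pte_numa(pte_t pte)
{
	return (pte_flags(pte) & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
}

static inline pte_t pte_mknuma(pte_t pte)
{
	pte = pte_set_flags(pte, _PAGE_NUMA);
	return pte_clear_flags(pte, _PAGE_PRESENT);
}

static inline pte_t pte_mknonuma(pte_t pte)
{
	pte = pte_clear_flags(pte, _PAGE_NUMA);
	return pte_set_flags(pte, _PAGE_PRESENT);
}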

Patches 12-14 define pte_numa, pmd_numa, pte_mknuma, pte_mknonuma and
	friends, and update GUP and huge page splitting.

Patch 15 creates the fault handler for p[te|md]_numa PTEs and just clears
	them again.

Patch 16 adds a MPOL_LOCAL policy so applications can explicitly request the
	historical behaviour.

Patch 17 is premature but adds a MPOL_NOOP policy that can be used in
	conjunction with the LAZY flags introduced later in the series.

Patch 18 adds migrate_misplaced_page which is responsible for migrating
	a page to a new location.

Patches 19-20 migrate the page on fault if mpol_misplaced() says to do so.
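
Put together, the pte_numa fault side looks very roughly like the sketch
below. Only mpol_misplaced() and migrate_misplaced_page() are names from
the series; the handler name is illustrative and locking, reference
counting and error handling are omitted:

static int handle_pte_numa_fault(struct mm_struct *mm, struct vm_area_struct *vma,
				 unsigned long addr, pte_t pte, pte_t *ptep)
{
	struct page *page;
	int page_nid, target_nid;

	/* Restore the pte so the fault is not taken again */
	pte = pte_mknonuma(pte);
	set_pte_at(mm, addr, ptep, pte);
	update_mmu_cache(vma, addr, ptep);

	page = vm_normal_page(vma, addr, pte);
	if (!page)
		return 0;

	/* Ask the policy whether the page is misplaced and where it belongs */
	page_nid = page_to_nid(page);
	target_nid = mpol_misplaced(page, vma, addr);
	if (target_nid != -1 && target_nid != page_nid)
		migrate_misplaced_page(page, target_nid);

	/* The hinting fault would be accounted against the task here */
	return 0;
}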

Patch 21 updates the page fault handlers. Transparent huge pages are split.
	Pages pointed to by PTEs are migrated. Pages pointed to by PMDs
	are not properly handled until later in the series.

Patch 22 adds a MPOL_MF_LAZY mempolicy that an interested application can use.
	On the next reference the memory should be migrated to the node that
	references the memory.

Patch 23 reimplements change_prot_numa in terms of change_protection. It could
	be collapsed with patch 21 but this might be easier to review.

Patch 24 notes that the MPOL_MF_LAZY and MPOL_NOOP flags have not been properly
	reviewed and there are no manual pages. They are removed for now and
	need to be revisited.

Patch 25 sets pte_numa within the context of the scheduler.

Patches 26-28 note that marking the entire address space pte_numa at once has a
	number of disadvantages and instead incrementally update a limited range
	of the address space each tick.

Patch 29 adds some vmstats that can be used to approximate the cost of the
	scheduling policy in a more fine-grained fashion than looking at
	the system CPU usage.

Patch 30 implements the MORON policy.

Patch 31 properly handles the migration of pages faulted when handling a pmd
	numa hinting fault. This could be improved as it's a bit tangled
	to follow. PMDs are only marked if the PTEs underneath are expected
	to point to pages on the same node.

Patches 32-34 rate-limit the number of pages being migrated and marked as pte_numa
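
The rate limiting is conceptually a simple window-and-budget check per node.
A sketch of the idea only (the pgdat field names and the two tunables here
are illustrative, not necessarily what the patches call them):

static bool numa_migrate_ratelimited(pg_data_t *pgdat, unsigned long nr_pages)
{
	/* Open a new accounting window if the previous one has expired */
	if (time_after(jiffies, pgdat->numa_migrate_next_window)) {
		pgdat->numa_migrate_nr_pages = 0;
		pgdat->numa_migrate_next_window = jiffies +
			msecs_to_jiffies(migrate_interval_millisecs);
	}

	/* Refuse further migrations once this window's budget is spent */
	if (pgdat->numa_migrate_nr_pages > ratelimit_pages)
		return true;

	pgdat->numa_migrate_nr_pages += nr_pages;
	return false;
}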

Patch 35 slowly decreases the pte_numa update scanning rate

Patches 36-39 introduce last_nid and use it to build a two-stage filter
	that delays when a page gets migrated to avoid a situation where
	a task running temporarily off its home node forces a migration.
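
The filter itself is tiny. Roughly speaking (a sketch; the helper that
atomically swaps the nid stored in the page is illustrative), the first
fault from a node only records that node and only a second fault from the
same node allows the migration:

static bool two_stage_filter(struct page *page, int this_nid)
{
	/* Swap in the faulting node and look at what was recorded before */
	int last_nid = xchg_page_last_nid(page, this_nid);

	/*
	 * Migrate only if the previous hinting fault was from the same
	 * node. A task running temporarily off its home node records a
	 * mismatch on the first fault and does not drag the page with it.
	 */
	return last_nid == this_nid;
}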

Patch 40 adapts the scanning rate if pages do not have to be migrated
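
Again the idea is simple and the sketch below is only indicative (the
sysctl and task field names are assumptions): reset the scan period to the
minimum when a hinting fault needed a migration and back it off towards
the maximum while faults stay local:

static void numa_adapt_scan_period(struct task_struct *p, bool migrated)
{
	if (migrated)
		p->numa_scan_period = sysctl_numa_scan_period_min;
	else
		p->numa_scan_period = min(sysctl_numa_scan_period_max,
					  p->numa_scan_period + 10);
}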

Patch 41 allows the enabling/disabling from command line

Patch 42 allows balancenuma to be disabled even if !SCHED_DEBUG

Patch 43 delays PTE scanning until a task is scheduled on a new node

Patch 44 implements native THP migration for NUMA hinting faults.

Patch 45 accounts for transhuge allocation failures as migration failures.

 Documentation/kernel-parameters.txt  |    3 +
 arch/sh/mm/Kconfig                   |    1 +
 arch/x86/Kconfig                     |    2 +
 arch/x86/include/asm/pgtable.h       |   17 +-
 arch/x86/include/asm/pgtable_types.h |   20 ++
 arch/x86/mm/pgtable.c                |    8 +-
 include/asm-generic/pgtable.h        |  110 +++++++++++
 include/linux/huge_mm.h              |   14 +-
 include/linux/hugetlb.h              |    8 +-
 include/linux/mempolicy.h            |    8 +
 include/linux/migrate.h              |   45 ++++-
 include/linux/mm.h                   |   39 ++++
 include/linux/mm_types.h             |   31 ++++
 include/linux/mmzone.h               |   13 ++
 include/linux/sched.h                |   27 +++
 include/linux/vm_event_item.h        |   12 +-
 include/linux/vmstat.h               |    8 +
 include/trace/events/migrate.h       |   51 ++++++
 include/uapi/linux/mempolicy.h       |   15 +-
 init/Kconfig                         |   41 +++++
 kernel/fork.c                        |    3 +
 kernel/sched/core.c                  |   71 ++++++--
 kernel/sched/fair.c                  |  227 +++++++++++++++++++++++
 kernel/sched/features.h              |   11 ++
 kernel/sched/sched.h                 |   12 ++
 kernel/sysctl.c                      |   45 ++++-
 mm/compaction.c                      |   15 +-
 mm/huge_memory.c                     |   95 +++++++++-
 mm/hugetlb.c                         |   10 +-
 mm/internal.h                        |    7 +-
 mm/memcontrol.c                      |    7 +-
 mm/memory-failure.c                  |    3 +-
 mm/memory.c                          |  188 ++++++++++++++++++-
 mm/memory_hotplug.c                  |    3 +-
 mm/mempolicy.c                       |  283 ++++++++++++++++++++++++++---
 mm/migrate.c                         |  333 +++++++++++++++++++++++++++++++++-
 mm/mprotect.c                        |  124 ++++++++++---
 mm/page_alloc.c                      |   10 +-
 mm/pgtable-generic.c                 |    9 +-
 mm/vmstat.c                          |   16 +-
 40 files changed, 1836 insertions(+), 109 deletions(-)
 create mode 100644 include/trace/events/migrate.h

-- 
1.7.9.2



* [PATCH 00/46] Automatic NUMA Balancing V8
  2012-11-28 13:49   ` [PATCH 00/45] Automatic NUMA Balancing V7 Mel Gorman
@ 2012-11-30 11:33     ` Mel Gorman
  2012-11-30 11:41       ` Results for balancenuma v8, autonuma-v28fast and numacore-20121126 Mel Gorman
  2012-12-07 10:45     ` [PATCH 00/45] Automatic NUMA Balancing V7 Srikar Dronamraju
  1 sibling, 1 reply; 53+ messages in thread
From: Mel Gorman @ 2012-11-30 11:33 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

Like V7, I'm only posting the git tree reference instead of sending out
a flood of emails as the difference is just one patch related to migrate
rate limiting. I've not actually identified it as making a difference in
the tests I ran but the patch makes sense.

git tree: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v8r6
git tag:  git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v8

Changelog since V7
  o Account for transhuge migrations properly when migrate rate-limiting

Changelog since V6
  o Transfer last_nid information during transhuge migration		(dhillf)
  o Transfer last_nid information during splits				(dhillf)
  o Drop page reference if target node is full				(dhillf)
  o Account for transhuge allocation failure as migration failure	(mel)

Changelog since V5
  o Fix build errors related to config options, make bisect-safe
  o Account for transhuge migrations
  o Count HPAGE_PMD_NR pages when isolating transhuge
  o Account for local transhuge faults
  o Fix a memory leak on isolation failure

Changelog since V4
  o Allow enabling/disable from command line
  o Delay PTE scanning until tasks are running on a new node
  o THP migration bits needed for memcg
  o Adapt the scanning rate depending on whether pages need to migrate
  o Drop all the scheduler policy stuff on top, it was broken

Changelog since V3
  o Use change_protection
  o Architecture-hook twiddling
  o Port of the THP migration patch.
  o Additional TLB optimisations
  o Fixes from Hillf Danton

Changelog since V2
  o Do not allocate from home node
  o Mostly remove pmd_numa handling for regular pmds
  o HOME policy will allocate from and migrate towards local node
  o Load balancer is more aggressive about moving tasks towards home node
  o Renames to sync up more with -tip version
  o Move pte handlers to generic code
  o Scanning rate starts at 100ms, system CPU usage expected to increase
  o Handle migration of PMD hinting faults
  o Rate limit migration on a per-node basis
  o Alter how the rate of PTE scanning is adapted
  o Rate limit setting of pte_numa if node is congested
  o Only flush the local TLB if unmapping a pte_numa page
  o Only consider one CPU in cpu follow algorithm

Changelog since V1
  o Account for faults on the correct node after migration
  o Do not account for THP splits as faults.
  o Account THP faults on the node they occurred
  o Ensure preferred_node_policy is initialised before use
  o Mitigate double faults
  o Add home-node logic
  o Add some tlb-flush mitigation patches
  o Add variation of CPU follows memory algorithm
  o Add last_nid and use it as a two-stage filter before migrating pages
  o Restart the PTE scanner when it reaches the end of the address space
  o Lots of stuff I did not note properly

There are currently two (three depending on how you look at it) competing
approaches to implement support for automatically migrating pages to
optimise NUMA locality. Performance results are available but review
highlighted different problems in both.  They are not compatible with each
other even though some fundamental mechanics should have been the same.
This series addresses part of the integration and sharing problem by
implementing a foundation that either the policy for schednuma or autonuma
can be rebased on.

The initial policy it implements is a very basic greedy policy called
"Migrate On Reference Of pte_numa Node (MORON)".  I expect people to
build upon this revised policy and rename it to something more sensible
that reflects what it means. The ideal *worst-case* behaviour is that
it is comparable to current mainline but for some workloads this is an
improvement over mainline.

This series can be treated as 5 major stages.

1. TLB optimisations that we're likely to want unconditionally.
2. Basic foundation and core mechanics, initial policy that does very little
3. Full PMD fault handling, rate limiting of migration, two-stage migration
   filter to mitigate poor migration decisions.  This will migrate pages
   on a PTE or PMD level using just the current referencing CPU as a
   placement hint
4. Scan rate adaption
5. Native THP migration

Very broadly speaking the TODOs that spring to mind are

1. Revisit MPOL_NOOP and MPOL_MF_LAZY
2. Other architecture support or at least validation that it could be made to work. I'm
   half-hoping that the PPC64 people are watching because they tend to be interested
   in this type of thing.

Some advantages of the series are;

1. It handles regular PMDs which reduces overhead in the case where pages within
   a PMD are on the same node
2. It rate limits migrations to avoid saturating the bus and backs off
   PTE scanning (in a fairly heavy manner) if the node is rate-limited
3. It keeps major optimisations like THP towards the end to be sure I am
   not accidentally depending on them
4. It has some vmstats which allow a user to make a rough guess as to how
   much overhead the balancing is introducing
5. It implements a basic policy that acts as a second performance baseline.
   The three baselines become vanilla kernel, basic placement policy,
   complex placement policy. This allows like-with-like comparisons with
   implementations.

In terms of building on top of the foundation the ideal would be that
patches affect one of the following areas although obviously that will
not always be possible

1. The PTE update helper functions
2. The PTE scanning machinery driven from task_numa_tick
3. Task and process fault accounting and how that information is used
   to determine if a page is misplaced
4. Fault handling, migrating the page if misplaced, what information is
   provided to the placement policy
5. Scheduler and load balancing

Patches in this series are as follows.

Patches 1-5 are some TLB optimisations that mostly make sense on their own.
	They are likely to make it into the tree either way

Patches 6-7 are an mprotect optimisation

Patches 8-10 move some vmstat counters so that migrated pages get accounted
	for. In the past the primary user of migration was compaction but
	if pages are to migrate for NUMA optimisation then the counters
	need to be generally useful.

Patch 11 defines an arch-specific PTE bit called _PAGE_NUMA that is used
	to trigger faults later in the series. A placement policy is expected
	to use these faults to determine if a page should migrate.  On x86,
	the bit is the same as _PAGE_PROTNONE but other architectures
	may differ. Note that it is also possible to avoid using this bit
	and go with plain PROT_NONE but the resulting helpers are then
	heavier.

Patches 12-14 define pte_numa, pmd_numa, pte_mknuma, pte_mknonuma and
	friends, and update GUP and huge page splitting.

Patch 15 creates the fault handler for p[te|md]_numa PTEs and just clears
	them again.

Patch 16 adds a MPOL_LOCAL policy so applications can explicitly request the
	historical behaviour.

Patch 17 is premature but adds a MPOL_NOOP policy that can be used in
	conjunction with the LAZY flags introduced later in the series.

Patch 18 adds migrate_misplaced_page which is responsible for migrating
	a page to a new location.

Patches 19-20 migrate the page on fault if mpol_misplaced() says to do so.

Patch 21 updates the page fault handlers. Transparent huge pages are split.
	Pages pointed to by PTEs are migrated. Pages pointed to by PMDs
	are not properly handled until later in the series.

Patch 22 adds a MPOL_MF_LAZY mempolicy that an interested application can use.
	On the next reference the memory should be migrated to the node that
	references the memory.

Patch 23 reimplements change_prot_numa in terms of change_protection. It could
	be collapsed with patch 21 but this might be easier to review.

Patch 24 notes that the MPOL_MF_LAZY and MPOL_NOOP flags have not been properly
	reviewed and there are no manual pages. They are removed for now and
	need to be revisited.

Patch 25 sets pte_numa within the context of the scheduler.

Patches 26-28 note that marking the entire address space pte_numa at once has a
	number of disadvantages and instead incrementally update a limited range
	of the address space each tick.

Patch 29 adds some vmstats that can be used to approximate the cost of the
	scheduling policy in a more fine-grained fashion than looking at
	the system CPU usage.

Patch 30 implements the MORON policy.

Patch 31 properly handles the migration of pages faulted when handling a pmd
	numa hinting fault. This could be improved as it's a bit tangled
	to follow. PMDs are only marked if the PTEs underneath are expected
	to point to pages on the same node.

Patches 32-34 rate-limit the number of pages being migrated and marked as pte_numa

Patch 35 slowly decreases the pte_numa update scanning rate

Patches 36-39 introduce last_nid and use it to build a two-stage filter
	that delays when a page gets migrated to avoid a situation where
	a task running temporarily off its home node forces a migration.

Patch 40 adapts the scanning rate if pages do not have to be migrated

Patch 41 allows the enabling/disabling from command line

Patch 42 allows balancenuma to be disabled even if !SCHED_DEBUG

Patch 43 delays PTE scanning until a task is scheduled on a new node

Patch 44 implements native THP migration for NUMA hinting faults.

Patch 45 accounts for transhuge allocation failures as migration failures.

Patch 46 accounts for transhuge pages when migrate rate limiting

 Documentation/kernel-parameters.txt  |    3 +
 arch/sh/mm/Kconfig                   |    1 +
 arch/x86/Kconfig                     |    2 +
 arch/x86/include/asm/pgtable.h       |   17 +-
 arch/x86/include/asm/pgtable_types.h |   20 ++
 arch/x86/mm/pgtable.c                |    8 +-
 include/asm-generic/pgtable.h        |  110 +++++++++++
 include/linux/huge_mm.h              |   14 +-
 include/linux/hugetlb.h              |    8 +-
 include/linux/mempolicy.h            |    8 +
 include/linux/migrate.h              |   45 ++++-
 include/linux/mm.h                   |   39 ++++
 include/linux/mm_types.h             |   31 ++++
 include/linux/mmzone.h               |   13 ++
 include/linux/sched.h                |   27 +++
 include/linux/vm_event_item.h        |   12 +-
 include/linux/vmstat.h               |    8 +
 include/trace/events/migrate.h       |   51 ++++++
 include/uapi/linux/mempolicy.h       |   15 +-
 init/Kconfig                         |   41 +++++
 kernel/fork.c                        |    3 +
 kernel/sched/core.c                  |   71 ++++++--
 kernel/sched/fair.c                  |  227 +++++++++++++++++++++++
 kernel/sched/features.h              |   11 ++
 kernel/sched/sched.h                 |   12 ++
 kernel/sysctl.c                      |   45 ++++-
 mm/compaction.c                      |   15 +-
 mm/huge_memory.c                     |   95 +++++++++-
 mm/hugetlb.c                         |   10 +-
 mm/internal.h                        |    7 +-
 mm/memcontrol.c                      |    7 +-
 mm/memory-failure.c                  |    3 +-
 mm/memory.c                          |  188 ++++++++++++++++++-
 mm/memory_hotplug.c                  |    3 +-
 mm/mempolicy.c                       |  283 ++++++++++++++++++++++++++---
 mm/migrate.c                         |  333 +++++++++++++++++++++++++++++++++-
 mm/mprotect.c                        |  124 ++++++++++---
 mm/page_alloc.c                      |   10 +-
 mm/pgtable-generic.c                 |    9 +-
 mm/vmstat.c                          |   16 +-
 40 files changed, 1836 insertions(+), 109 deletions(-)
 create mode 100644 include/trace/events/migrate.h

-- 
1.7.9.2



* Results for balancenuma v8, autonuma-v28fast and numacore-20121126
  2012-11-30 11:33     ` [PATCH 00/46] Automatic NUMA Balancing V8 Mel Gorman
@ 2012-11-30 11:41       ` Mel Gorman
  2012-11-30 16:09         ` Rik van Riel
  0 siblings, 1 reply; 53+ messages in thread
From: Mel Gorman @ 2012-11-30 11:41 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

This is another insanely long mail. Short summary: based on the results
of what is in tip/master right now, I think if we're going to merge
anything for v3.8 it should be the "Automatic NUMA Balancing V8". It does
reasonably well for many of the workloads and AFAIK there is no reason why
numacore or autonuma could not be rebased on top with a view to merging
proper scheduling and placement policies in 3.9. That way we would have
a comparison between a do-nothing kernel, the most basic of migration
policies and something more complex with some sort of logical progression.

This time I added the NAS Parallel Benchmark running with MPI and OpenMP
to see how they fared. From the series "Automatic NUMA Balancing V8",
the kernels tested were

stats-v6r15	Patches 1-10. TLB optimisations, migration stats. This
		is based on the V6 release but the patches have not
		changed since.
balancenuma-v8r6 Patches 1-46. Full series

The other two kernels were

numacore-20121126 is a pull of tip/master on November 26th, 2012. It ends
	up being a 3.7-rc6 based kernel

autonuma-v28fast This is a rebased version of Andrea's autonuma-v28fast
	branch with Hugh's THP migration patch on top. Hopefully Andrea
	and Hugh will not mind but I took the liberty of publishing the
	result as the mm-autonuma-v28fastr4-mels-rebase branch in
	git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git

I'm treating stats-v6r15 as the baseline as it has the same TLB optimisations
shared between balancenuma and numacore. This may not be fair to autonuma
depending on how it avoids flushing the TLB.

All of these tests were run unattended via MMTests. Any errors in the
methodology would be applied evenly to all kernels tested. There were
monitors running but *not* profiling. The heaviest monitor would read
numa_maps every 10 seconds; it is read once per address space and the
result reused for all threads. This will affect peaks because it means the
monitors contend on some of the same locks the PTE scanner does, for example.

AUTONUMA BENCH
                                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7
                                    stats-v6r15     numacore-20121126    autonuma-v28fastr4       balancenuma-v8r6
User    NUMA01               66979.15 (  0.00%)    24590.05 ( 63.29%)    30815.06 ( 53.99%)    56701.65 ( 15.34%)
User    NUMA01_THEADLOCAL    61248.25 (  0.00%)    18607.40 ( 69.62%)    17124.49 ( 72.04%)    17344.99 ( 71.68%)
User    NUMA02                6645.34 (  0.00%)     2116.64 ( 68.15%)     2209.76 ( 66.75%)     2073.78 ( 68.79%)
User    NUMA02_SMT            2925.65 (  0.00%)      989.22 ( 66.19%)     1020.53 ( 65.12%)     1000.81 ( 65.79%)
System  NUMA01                  45.46 (  0.00%)     1038.13 (-2183.61%)      195.90 (-330.93%)      289.11 (-535.97%)
System  NUMA01_THEADLOCAL       46.15 (  0.00%)      556.78 (-1106.46%)       72.36 (-56.79%)      112.87 (-144.57%)
System  NUMA02                   1.66 (  0.00%)       25.38 (-1428.92%)        7.49 (-351.20%)        9.71 (-484.94%)
System  NUMA02_SMT               0.92 (  0.00%)       10.70 (-1063.04%)        2.41 (-161.96%)        3.40 (-269.57%)
Elapsed NUMA01                1513.72 (  0.00%)      571.78 ( 62.23%)      795.56 ( 47.44%)     1292.04 ( 14.64%)
Elapsed NUMA01_THEADLOCAL     1390.72 (  0.00%)      420.02 ( 69.80%)      380.84 ( 72.62%)      379.59 ( 72.71%)
Elapsed NUMA02                 167.65 (  0.00%)       50.52 ( 69.87%)       53.22 ( 68.26%)       49.17 ( 70.67%)
Elapsed NUMA02_SMT             164.38 (  0.00%)       48.26 ( 70.64%)       48.10 ( 70.74%)       46.91 ( 71.46%)
CPU     NUMA01                4427.00 (  0.00%)     4482.00 ( -1.24%)     3897.00 ( 11.97%)     4410.00 (  0.38%)
CPU     NUMA01_THEADLOCAL     4407.00 (  0.00%)     4562.00 ( -3.52%)     4515.00 ( -2.45%)     4599.00 ( -4.36%)
CPU     NUMA02                3964.00 (  0.00%)     4239.00 ( -6.94%)     4165.00 ( -5.07%)     4236.00 ( -6.86%)
CPU     NUMA02_SMT            1780.00 (  0.00%)     2071.00 (-16.35%)     2126.00 (-19.44%)     2140.00 (-20.22%)

numacore is the best at running the adverse numa01 workload. autonuma does
respectably but balancenuma does not cope with this case. It improves on the
baseline but it does not know how to interleave for this type of workload.

For the other workloads that are friendlier to NUMA, the three trees do
not differ by massive amounts.  There are not multiple runs because it
takes too long but there is a possibility the results are within the noise.

Where we differ is in system CPU usage. In all cases, numacore uses more
system CPU. It is likely compensating for this overhead with better
placement. Even with the higher overhead it ends up with a tie
on everything except the adverse workload. Take NUMA01_THREADLOCAL as an
example -- numacore uses roughly five to eight times more system CPU than
autonuma or balancenuma. autonuma's cost could be hidden in kernel threads
but that is not true for balancenuma.

MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
          stats-v6r15  numacore-20121126  autonuma-v28fastr4  balancenuma-v8r6
User       137805.34    46310.68    51177.02    77128.10
System         94.81     1631.75      278.81      415.74
Elapsed      3245.05     1101.08     1287.83     1776.42

The overall elapsed time differences come down to how well numa01 is handled.
There are large differences in the system CPU in the different trees. numacore
uses roughly four times the system CPU of balancenuma and almost six times
that of autonuma.


MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
                           stats-v6r15  numacore-20121126  autonuma-v28fastr4  balancenuma-v8r6
Page Ins                         42892       42804       42988       42616
Page Outs                        31156       12352       13980       19192
Swap Ins                             0           0           0           0
Swap Outs                            0           0           0           0
Direct pages scanned                 0           0           0           0
Kswapd pages scanned                 0           0           0           0
Kswapd pages reclaimed               0           0           0           0
Direct pages reclaimed               0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%
Page writes by reclaim               0           0           0           0
Page writes file                     0           0           0           0
Page writes anon                     0           0           0           0
Page reclaim immediate               0           0           0           0
Page rescued immediate               0           0           0           0
Slabs scanned                        0           0           0           0
Direct inode steals                  0           0           0           0
Kswapd inode steals                  0           0           0           0
Kswapd skipped wait                  0           0           0           0
THP fault alloc                  16022       13747       19639       17857
THP collapse alloc                   9           4          51           3
THP splits                           2           1           7           6
THP fault fallback                   0           0           0           0
THP collapse fail                    0           0           0           0
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success                 0           0           0    10303098
Page migrate failure                 0           0           0           0
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                      0           0           0       10694
NUMA PTE updates                     0           0           0   147254249
NUMA hint faults                     0           0           0      688568
NUMA hint local faults               0           0           0      542906
NUMA pages migrated                  0           0           0    10303098
AutoNUMA cost                        0           0           0        4669

Not much to usefully interpret here other than noting we generally avoid
splitting THP. For balancenuma, note what the scan adaption does to the
number of PTE updates and the number of faults incurred. A policy may
not necessarily like this. It depends on its requirements but if it wants
higher PTE scan rates it will have to compensate for it.

Next is specjbb. There are four separate configurations:

multiple JVMs, THP
multiple JVMs, no THP
single JVM, THP
single JVM, no THP

SPECJBB: Multiple JVMs (one per node, 4 nodes), THP is enabled
                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7
                    stats-v6r15     numacore-20121126    autonuma-v28fastr4       balancenuma-v8r6
Mean   1      31600.00 (  0.00%)     27467.75 (-13.08%)     31006.75 ( -1.88%)     31360.25 ( -0.76%)
Mean   2      62937.75 (  0.00%)     55240.00 (-12.23%)     65086.25 (  3.41%)     61924.00 ( -1.61%)
Mean   3      91147.25 (  0.00%)     81735.50 (-10.33%)     95839.00 (  5.15%)     90739.00 ( -0.45%)
Mean   4     114616.50 (  0.00%)     94354.75 (-17.68%)    124129.50 (  8.30%)    116105.25 (  1.30%)
Mean   5     136264.25 (  0.00%)    107829.25 (-20.87%)    150632.00 ( 10.54%)    139659.25 (  2.49%)
Mean   6     152161.75 (  0.00%)    123039.75 (-19.14%)    175110.25 ( 15.08%)    157911.25 (  3.78%)
Mean   7     150385.25 (  0.00%)    137133.00 ( -8.81%)    180693.25 ( 20.15%)    160335.50 (  6.62%)
Mean   8     146897.75 (  0.00%)     94324.75 (-35.79%)    184689.00 ( 25.73%)    159786.50 (  8.77%)
Mean   9     141853.25 (  0.00%)    103640.75 (-26.94%)    183592.75 ( 29.42%)    153544.25 (  8.24%)
Mean   10    145524.00 (  0.00%)    113260.25 (-22.17%)    179482.75 ( 23.34%)    145893.50 (  0.25%)
Mean   11    129652.25 (  0.00%)     98646.75 (-23.91%)    174891.50 ( 34.89%)    138897.75 (  7.13%)
Mean   12    123313.25 (  0.00%)    124340.75 (  0.83%)    168959.25 ( 37.02%)    138027.00 ( 11.93%)
Mean   13    122442.75 (  0.00%)    107168.25 (-12.47%)    164761.50 ( 34.56%)    135222.50 ( 10.44%)
Mean   14    120407.50 (  0.00%)    107057.00 (-11.09%)    163350.50 ( 35.66%)    132712.25 ( 10.22%)
Mean   15    118236.50 (  0.00%)    106874.00 ( -9.61%)    160638.75 ( 35.86%)    129598.75 (  9.61%)
Mean   16    115439.00 (  0.00%)    128464.75 ( 11.28%)    158838.00 ( 37.59%)    122542.50 (  6.15%)
Mean   17    111400.25 (  0.00%)    127869.50 ( 14.78%)    157191.25 ( 41.10%)    129454.50 ( 16.21%)
Mean   18    114168.50 (  0.00%)    121763.00 (  6.65%)    154828.75 ( 35.61%)    125674.25 ( 10.08%)
Mean   19    112622.25 (  0.00%)    114235.50 (  1.43%)    154380.25 ( 37.08%)    122692.00 (  8.94%)
Mean   20    109717.75 (  0.00%)    109561.50 ( -0.14%)    153291.75 ( 39.71%)    122799.25 ( 11.92%)
Mean   21    106640.00 (  0.00%)    103904.75 ( -2.56%)    151053.75 ( 41.65%)    118169.50 ( 10.81%)
Mean   22    105173.00 (  0.00%)    107866.00 (  2.56%)    149248.75 ( 41.91%)    120062.00 ( 14.16%)
Mean   23    104009.50 (  0.00%)     84539.25 (-18.72%)    147848.25 ( 42.15%)    119518.25 ( 14.91%)
Mean   24    102713.75 (  0.00%)     85635.25 (-16.63%)    145843.25 ( 41.99%)    120339.75 ( 17.16%)
Stddev 1       1366.60 (  0.00%)      1135.04 ( 16.94%)      1619.94 (-18.54%)      1370.51 ( -0.29%)
Stddev 2        918.86 (  0.00%)      3552.45 (-286.61%)      1024.58 (-11.51%)       813.06 ( 11.51%)
Stddev 3       1066.85 (  0.00%)       881.39 ( 17.38%)      1176.32 (-10.26%)      1356.60 (-27.16%)
Stddev 4       1493.03 (  0.00%)      5298.20 (-254.86%)      1587.00 ( -6.29%)      1271.82 ( 14.82%)
Stddev 5        877.10 (  0.00%)      7526.59 (-758.13%)      1298.12 (-48.00%)      1030.81 (-17.53%)
Stddev 6       2351.71 (  0.00%)     16420.61 (-598.24%)      1122.37 ( 52.27%)      1276.07 ( 45.74%)
Stddev 7       1259.53 (  0.00%)     11596.65 (-820.71%)      1777.67 (-41.14%)      3225.46 (-156.08%)
Stddev 8       2912.35 (  0.00%)     18376.73 (-530.99%)      2428.53 ( 16.61%)      2997.79 ( -2.93%)
Stddev 9       6512.12 (  0.00%)      3668.11 ( 43.67%)      3311.86 ( 49.14%)      5116.28 ( 21.43%)
Stddev 10      6096.83 (  0.00%)      6969.09 (-14.31%)      6918.63 (-13.48%)      4623.63 ( 24.16%)
Stddev 11      9487.80 (  0.00%)      8337.58 ( 12.12%)     10122.20 ( -6.69%)      4651.18 ( 50.98%)
Stddev 12      8235.94 (  0.00%)     12325.53 (-49.66%)     13754.33 (-67.00%)      3002.66 ( 63.54%)
Stddev 13      8345.11 (  0.00%)     12512.09 (-49.93%)     15335.24 (-83.76%)      2206.88 ( 73.55%)
Stddev 14      8752.13 (  0.00%)      1689.34 ( 80.70%)     15529.14 (-77.43%)      6095.85 ( 30.35%)
Stddev 15      7611.56 (  0.00%)      3735.24 ( 50.93%)     16501.90 (-116.80%)      4713.94 ( 38.07%)
Stddev 16      8223.93 (  0.00%)      3621.59 ( 55.96%)     16426.27 (-99.74%)      5322.68 ( 35.28%)
Stddev 17      8829.49 (  0.00%)       100.89 ( 98.86%)     16633.79 (-88.39%)      3884.20 ( 56.01%)
Stddev 18      7053.69 (  0.00%)      1390.26 ( 80.29%)     18474.77 (-161.92%)      4296.24 ( 39.09%)
Stddev 19      6775.02 (  0.00%)      1335.05 ( 80.29%)     18046.60 (-166.37%)      3698.15 ( 45.41%)
Stddev 20      7481.59 (  0.00%)      4460.51 ( 40.38%)     17890.82 (-139.13%)      3406.39 ( 54.47%)
Stddev 21      8100.05 (  0.00%)      2934.02 ( 63.78%)     19041.29 (-135.08%)      2966.54 ( 63.38%)
Stddev 22      6507.61 (  0.00%)      3128.61 ( 51.92%)     17399.30 (-167.37%)      4242.58 ( 34.81%)
Stddev 23      6113.03 (  0.00%)      4226.82 ( 30.86%)     18573.42 (-203.83%)      5575.06 (  8.80%)
Stddev 24      5128.26 (  0.00%)      1695.29 ( 66.94%)     18824.94 (-267.08%)      4011.27 ( 21.78%)
TPut   1     126400.00 (  0.00%)    109871.00 (-13.08%)    124027.00 ( -1.88%)    125441.00 ( -0.76%)
TPut   2     251751.00 (  0.00%)    220960.00 (-12.23%)    260345.00 (  3.41%)    247696.00 ( -1.61%)
TPut   3     364589.00 (  0.00%)    326942.00 (-10.33%)    383356.00 (  5.15%)    362956.00 ( -0.45%)
TPut   4     458466.00 (  0.00%)    377419.00 (-17.68%)    496518.00 (  8.30%)    464421.00 (  1.30%)
TPut   5     545057.00 (  0.00%)    431317.00 (-20.87%)    602528.00 ( 10.54%)    558637.00 (  2.49%)
TPut   6     608647.00 (  0.00%)    492159.00 (-19.14%)    700441.00 ( 15.08%)    631645.00 (  3.78%)
TPut   7     601541.00 (  0.00%)    548532.00 ( -8.81%)    722773.00 ( 20.15%)    641342.00 (  6.62%)
TPut   8     587591.00 (  0.00%)    377299.00 (-35.79%)    738756.00 ( 25.73%)    639146.00 (  8.77%)
TPut   9     567413.00 (  0.00%)    414563.00 (-26.94%)    734371.00 ( 29.42%)    614177.00 (  8.24%)
TPut   10    582096.00 (  0.00%)    453041.00 (-22.17%)    717931.00 ( 23.34%)    583574.00 (  0.25%)
TPut   11    518609.00 (  0.00%)    394587.00 (-23.91%)    699566.00 ( 34.89%)    555591.00 (  7.13%)
TPut   12    493253.00 (  0.00%)    497363.00 (  0.83%)    675837.00 ( 37.02%)    552108.00 ( 11.93%)
TPut   13    489771.00 (  0.00%)    428673.00 (-12.47%)    659046.00 ( 34.56%)    540890.00 ( 10.44%)
TPut   14    481630.00 (  0.00%)    428228.00 (-11.09%)    653402.00 ( 35.66%)    530849.00 ( 10.22%)
TPut   15    472946.00 (  0.00%)    427496.00 ( -9.61%)    642555.00 ( 35.86%)    518395.00 (  9.61%)
TPut   16    461756.00 (  0.00%)    513859.00 ( 11.28%)    635352.00 ( 37.59%)    490170.00 (  6.15%)
TPut   17    445601.00 (  0.00%)    511478.00 ( 14.78%)    628765.00 ( 41.10%)    517818.00 ( 16.21%)
TPut   18    456674.00 (  0.00%)    487052.00 (  6.65%)    619315.00 ( 35.61%)    502697.00 ( 10.08%)
TPut   19    450489.00 (  0.00%)    456942.00 (  1.43%)    617521.00 ( 37.08%)    490768.00 (  8.94%)
TPut   20    438871.00 (  0.00%)    438246.00 ( -0.14%)    613167.00 ( 39.71%)    491197.00 ( 11.92%)
TPut   21    426560.00 (  0.00%)    415619.00 ( -2.56%)    604215.00 ( 41.65%)    472678.00 ( 10.81%)
TPut   22    420692.00 (  0.00%)    431464.00 (  2.56%)    596995.00 ( 41.91%)    480248.00 ( 14.16%)
TPut   23    416038.00 (  0.00%)    338157.00 (-18.72%)    591393.00 ( 42.15%)    478073.00 ( 14.91%)
TPut   24    410855.00 (  0.00%)    342541.00 (-16.63%)    583373.00 ( 41.99%)    481359.00 ( 17.16%)

numacore is not handling the multiple JVM case well, with numerous regressions
at lower warehouse counts. It is a bit better around the expected peak
of 12 warehouses per JVM for this configuration. There are also large
variances between the different JVMs' throughput but note again that this
improves as the number of warehouses increases.

autonuma generally does very well in terms of throughput but the variance
between JVMs is massive.

balancenuma does reasonably well and improves upon the baseline kernel. It
shows regressions for small numbers of warehouses which were not evident in
V6, so it is known to vary a bit. However, as the number of warehouses increases, it
shows some performance improvement and the variances are not too bad. It's
far short of what autonuma achieved but it's respectable.

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7
                                 stats-v6r15          numacore-20121126         autonuma-v28fastr4            balancenuma-v8r6
 Expctd Warehouse            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)
 Expctd Peak Bops        493253.00 (  0.00%)        497363.00 (  0.83%)        675837.00 ( 37.02%)        552108.00 ( 11.93%)
 Actual Warehouse             6.00 (  0.00%)             7.00 ( 16.67%)             8.00 ( 33.33%)             7.00 ( 16.67%)
 Actual Peak Bops        608647.00 (  0.00%)        548532.00 ( -9.88%)        738756.00 ( 21.38%)        641342.00 (  5.37%)
 SpecJBB Bops            451164.00 (  0.00%)        439778.00 ( -2.52%)        624688.00 ( 38.46%)        503634.00 ( 11.63%)
 SpecJBB Bops/JVM        112791.00 (  0.00%)        109945.00 ( -2.52%)        156172.00 ( 38.46%)        125909.00 ( 11.63%)

Note the peak numbers for numacore. The peak performance regresses 9.88%
from the baseline kernel. In a previous 3.7-rc6 comparison it showed an
improvement in the specjbb score of 0.52% at the peak. This is not a fair
comparison any more because of the large differences in kernels but it's
still the case that the specjbb score looks better than the actual peak
throughput because of how the specjbb score is calculated.
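
(For reference, the SpecJBB Bops figure in these reports is the mean of the
per-warehouse throughputs from the expected peak to twice the expected peak,
i.e. warehouses 12 to 24 here. Averaging the stats-v6r15 TPut figures for
12-24 warehouses above gives the reported 451164, which is why the score is
dominated by the high warehouse counts rather than by the actual peak.)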

autonuma sees a 21.38% performance gain at its peak and a 38.46% gain in
its specjbb score.

balancenuma does reasonably well with a 5.37% gain at its peak and 11.63%
on its overall specjbb score. Not as good as autonuma, but respectable.

MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
          stats-v6r15  numacore-20121126  autonuma-v28fastr4  balancenuma-v8r6
User       177410.90   171382.97   177112.15   177078.17
System        175.57     5976.48      219.87      514.57
Elapsed      4035.05     4037.94     4037.14     4030.78

Note the system CPU usage. numacore is using 11 times more system CPU
than balancenuma is and 27 times more than autonuma (usual disclaimer
about threads).


MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
                           stats-v6r15  numacore-20121126  autonuma-v28fastr4  balancenuma-v8r6
Page Ins                         38092       37968       37632       66512
Page Outs                        50240       52836       48468       64196
Swap Ins                             0           0           0           0
Swap Outs                            0           0           0           0
Direct pages scanned                 0           0           0           0
Kswapd pages scanned                 0           0           0           0
Kswapd pages reclaimed               0           0           0           0
Direct pages reclaimed               0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%
Page writes by reclaim               0           0           0           0
Page writes file                     0           0           0           0
Page writes anon                     0           0           0           0
Page reclaim immediate               0           0           0           0
Page rescued immediate               0           0           0           0
Slabs scanned                        0           0           0           0
Direct inode steals                  0           0           0           0
Kswapd inode steals                  0           0           0           0
Kswapd skipped wait                  0           0           0           0
THP fault alloc                  65717       49223       56929       67137
THP collapse alloc                 125          55         462         122
THP splits                         370         211         383         367
THP fault fallback                   0           0           0           0
THP collapse fail                    0           0           0           0
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success                 0           0           0    51459156
Page migrate failure                 0           0           0           0
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                      0           0           0       53414
NUMA PTE updates                     0           0           0   415931339
NUMA hint faults                     0           0           0     3089027
NUMA hint local faults               0           0           0      936873
NUMA pages migrated                  0           0           0    51459156
AutoNUMA cost                        0           0           0       19334

The main takeaway here is that there were THP allocations and all the
trees split THPs at very roughly the same rate overall. Migration stats
are not available for numacore or autonuma but the migration stats for
balancenuma show that it is migrating at a rate of 49MB/sec on average. This
is far higher than I'd like and a proper policy on top should be able to
help get that down.
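
(That figure falls out of the counters above: 51,459,156 base pages migrated
at 4K each is roughly 196GB of data over the 4031 second elapsed time, or
about 49MB/sec, and roughly double that in memory traffic once both the read
and the write of the copy are counted.)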

SPECJBB: Multiple JVMs (one per node, 4 nodes), THP is disabled

                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7
                    stats-v6r15     numacore-20121126    autonuma-v28fastr4       balancenuma-v8r6
Mean   1      25460.75 (  0.00%)     19041.25 (-25.21%)     25538.50 (  0.31%)     25889.25 (  1.68%)
Mean   2      53520.75 (  0.00%)     36285.25 (-32.20%)     56045.00 (  4.72%)     52424.00 ( -2.05%)
Mean   3      77555.00 (  0.00%)     53221.25 (-31.38%)     83147.25 (  7.21%)     76898.75 ( -0.85%)
Mean   4     100030.00 (  0.00%)     65234.00 (-34.79%)    108965.25 (  8.93%)     98110.75 ( -1.92%)
Mean   5     120309.25 (  0.00%)     76315.25 (-36.57%)    132176.00 (  9.86%)    119555.75 ( -0.63%)
Mean   6     136112.50 (  0.00%)     89173.00 (-34.49%)    150532.75 ( 10.59%)    136993.00 (  0.65%)
Mean   7     135358.75 (  0.00%)     93026.00 (-31.27%)    159185.00 ( 17.60%)    138854.25 (  2.58%)
Mean   8     134319.50 (  0.00%)     97704.50 (-27.26%)    162122.25 ( 20.70%)    138954.25 (  3.45%)
Mean   9     132189.75 (  0.00%)     97305.75 (-26.39%)    161477.25 ( 22.16%)    135756.75 (  2.70%)
Mean   10    128023.25 (  0.00%)     86914.50 (-32.11%)    159014.25 ( 24.21%)    130314.75 (  1.79%)
Mean   11    119226.75 (  0.00%)     95627.25 (-19.79%)    155241.50 ( 30.21%)    123851.00 (  3.88%)
Mean   12    111769.50 (  0.00%)     88829.00 (-20.52%)    150002.75 ( 34.21%)    115657.25 (  3.48%)
Mean   13    110908.25 (  0.00%)    105153.00 ( -5.19%)    146769.75 ( 32.33%)    113916.00 (  2.71%)
Mean   14    109063.25 (  0.00%)    103905.50 ( -4.73%)    144350.50 ( 32.35%)    116530.75 (  6.85%)
Mean   15    105400.50 (  0.00%)    102274.25 ( -2.97%)    141991.50 ( 34.72%)    116928.50 ( 10.94%)
Mean   16    106195.50 (  0.00%)    100147.00 ( -5.70%)    141436.25 ( 33.18%)    114429.25 (  7.75%)
Mean   17    102077.00 (  0.00%)     98444.50 ( -3.56%)    139735.25 ( 36.89%)    113637.00 ( 11.32%)
Mean   18    101157.00 (  0.00%)     96963.25 ( -4.15%)    137867.50 ( 36.29%)    113728.75 ( 12.43%)
Mean   19     99892.75 (  0.00%)     95881.00 ( -4.02%)    135465.25 ( 35.61%)    112367.50 ( 12.49%)
Mean   20    100012.50 (  0.00%)     93851.50 ( -6.16%)    134840.25 ( 34.82%)    112712.25 ( 12.70%)
Mean   21     97157.25 (  0.00%)     92788.25 ( -4.50%)    133454.25 ( 37.36%)    107491.50 ( 10.64%)
Mean   22     97807.25 (  0.00%)     90831.25 ( -7.13%)    130811.00 ( 33.74%)    108284.00 ( 10.71%)
Mean   23     94287.00 (  0.00%)     88404.50 ( -6.24%)    129693.00 ( 37.55%)    106024.25 ( 12.45%)
Mean   24     94142.00 (  0.00%)     86549.00 ( -8.07%)    127417.25 ( 35.35%)    103483.00 (  9.92%)
Stddev 1        873.15 (  0.00%)       819.01 (  6.20%)       805.93 (  7.70%)       982.04 (-12.47%)
Stddev 2        828.04 (  0.00%)       151.51 ( 81.70%)       641.04 ( 22.58%)       504.12 ( 39.12%)
Stddev 3        824.92 (  0.00%)      3708.80 (-349.60%)      1092.76 (-32.47%)      2024.69 (-145.44%)
Stddev 4        607.86 (  0.00%)      1768.43 (-190.93%)      1422.30 (-133.99%)      1298.14 (-113.56%)
Stddev 5        836.75 (  0.00%)      1048.83 (-25.34%)      1656.67 (-97.99%)      2600.99 (-210.84%)
Stddev 6        641.16 (  0.00%)      1010.82 (-57.66%)       990.71 (-54.52%)      1832.47 (-185.81%)
Stddev 7       4556.68 (  0.00%)      2374.23 ( 47.90%)      1395.66 ( 69.37%)      3149.28 ( 30.89%)
Stddev 8       3770.88 (  0.00%)      5926.66 (-57.17%)      1017.86 ( 73.01%)      3213.00 ( 14.79%)
Stddev 9       2396.64 (  0.00%)      2946.42 (-22.94%)      1131.78 ( 52.78%)      5125.85 (-113.88%)
Stddev 10      2535.66 (  0.00%)      2827.47 (-11.51%)      2330.35 (  8.10%)      2662.72 ( -5.01%)
Stddev 11      2858.16 (  0.00%)      4522.90 (-58.25%)      5970.58 (-108.90%)      3843.01 (-34.46%)
Stddev 12      4084.30 (  0.00%)      2782.83 ( 31.87%)      9008.52 (-120.56%)      1062.12 ( 74.00%)
Stddev 13      3079.56 (  0.00%)      1107.30 ( 64.04%)      9118.81 (-196.11%)      3075.82 (  0.12%)
Stddev 14      2886.35 (  0.00%)      1497.39 ( 48.12%)      9084.67 (-214.75%)      3209.97 (-11.21%)
Stddev 15      3302.30 (  0.00%)      1942.68 ( 41.17%)     10684.80 (-223.56%)      1094.48 ( 66.86%)
Stddev 16      3868.79 (  0.00%)      2024.71 ( 47.67%)     10202.01 (-163.70%)      1389.86 ( 64.08%)
Stddev 17      3318.20 (  0.00%)      1031.66 ( 68.91%)     10295.90 (-210.29%)      1334.94 ( 59.77%)
Stddev 18      3926.91 (  0.00%)       976.39 ( 75.14%)     11497.98 (-192.80%)       914.90 ( 76.70%)
Stddev 19      3169.02 (  0.00%)       668.74 ( 78.90%)     10951.67 (-245.59%)      2192.84 ( 30.80%)
Stddev 20      3343.84 (  0.00%)       727.51 ( 78.24%)     10974.75 (-228.21%)       991.99 ( 70.33%)
Stddev 21      3253.04 (  0.00%)      1212.03 ( 62.74%)     11682.29 (-259.12%)       802.70 ( 75.32%)
Stddev 22      3320.18 (  0.00%)      1017.95 ( 69.34%)     11224.85 (-238.08%)       536.20 ( 83.85%)
Stddev 23      3160.77 (  0.00%)      1544.09 ( 51.15%)     11611.88 (-267.37%)      1076.64 ( 65.94%)
Stddev 24      3079.01 (  0.00%)       739.34 ( 75.99%)     13124.55 (-326.26%)      1311.96 ( 57.39%)
TPut   1     101843.00 (  0.00%)     76165.00 (-25.21%)    102154.00 (  0.31%)    103557.00 (  1.68%)
TPut   2     214083.00 (  0.00%)    145141.00 (-32.20%)    224180.00 (  4.72%)    209696.00 ( -2.05%)
TPut   3     310220.00 (  0.00%)    212885.00 (-31.38%)    332589.00 (  7.21%)    307595.00 ( -0.85%)
TPut   4     400120.00 (  0.00%)    260936.00 (-34.79%)    435861.00 (  8.93%)    392443.00 ( -1.92%)
TPut   5     481237.00 (  0.00%)    305261.00 (-36.57%)    528704.00 (  9.86%)    478223.00 ( -0.63%)
TPut   6     544450.00 (  0.00%)    356692.00 (-34.49%)    602131.00 ( 10.59%)    547972.00 (  0.65%)
TPut   7     541435.00 (  0.00%)    372104.00 (-31.27%)    636740.00 ( 17.60%)    555417.00 (  2.58%)
TPut   8     537278.00 (  0.00%)    390818.00 (-27.26%)    648489.00 ( 20.70%)    555817.00 (  3.45%)
TPut   9     528759.00 (  0.00%)    389223.00 (-26.39%)    645909.00 ( 22.16%)    543027.00 (  2.70%)
TPut   10    512093.00 (  0.00%)    347658.00 (-32.11%)    636057.00 ( 24.21%)    521259.00 (  1.79%)
TPut   11    476907.00 (  0.00%)    382509.00 (-19.79%)    620966.00 ( 30.21%)    495404.00 (  3.88%)
TPut   12    447078.00 (  0.00%)    355316.00 (-20.52%)    600011.00 ( 34.21%)    462629.00 (  3.48%)
TPut   13    443633.00 (  0.00%)    420612.00 ( -5.19%)    587079.00 ( 32.33%)    455664.00 (  2.71%)
TPut   14    436253.00 (  0.00%)    415622.00 ( -4.73%)    577402.00 ( 32.35%)    466123.00 (  6.85%)
TPut   15    421602.00 (  0.00%)    409097.00 ( -2.97%)    567966.00 ( 34.72%)    467714.00 ( 10.94%)
TPut   16    424782.00 (  0.00%)    400588.00 ( -5.70%)    565745.00 ( 33.18%)    457717.00 (  7.75%)
TPut   17    408308.00 (  0.00%)    393778.00 ( -3.56%)    558941.00 ( 36.89%)    454548.00 ( 11.32%)
TPut   18    404628.00 (  0.00%)    387853.00 ( -4.15%)    551470.00 ( 36.29%)    454915.00 ( 12.43%)
TPut   19    399571.00 (  0.00%)    383524.00 ( -4.02%)    541861.00 ( 35.61%)    449470.00 ( 12.49%)
TPut   20    400050.00 (  0.00%)    375406.00 ( -6.16%)    539361.00 ( 34.82%)    450849.00 ( 12.70%)
TPut   21    388629.00 (  0.00%)    371153.00 ( -4.50%)    533817.00 ( 37.36%)    429966.00 ( 10.64%)
TPut   22    391229.00 (  0.00%)    363325.00 ( -7.13%)    523244.00 ( 33.74%)    433136.00 ( 10.71%)
TPut   23    377148.00 (  0.00%)    353618.00 ( -6.24%)    518772.00 ( 37.55%)    424097.00 ( 12.45%)
TPut   24    376568.00 (  0.00%)    346196.00 ( -8.07%)    509669.00 ( 35.35%)    413932.00 (  9.92%)

numacore regresses without THP on multiple JVM configurations, particularly
for lower numbers of warehouses. Note that once again it improves as the
number of warehouses increases. SpecJBB reporting is based around the peak,
so these regressions will be missed if only the peak figures are quoted in
other benchmark reports.

autonuma again performs very well although its variance between JVMs
is nuts.

Without THP, balancenuma shows small regressions for small numbers of
warehouses but recovers to show decent performance gains. Note that the
gains vary between warehouses because it's completely at the mercy of the
default scheduler decisions which are getting no hints about NUMA placement.

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7
                                 stats-v6r15          numacore-20121126         autonuma-v28fastr4            balancenuma-v8r6
 Expctd Warehouse            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)
 Expctd Peak Bops        447078.00 (  0.00%)        355316.00 (-20.52%)        600011.00 ( 34.21%)        462629.00 (  3.48%)
 Actual Warehouse             6.00 (  0.00%)            13.00 (116.67%)             8.00 ( 33.33%)             8.00 ( 33.33%)
 Actual Peak Bops        544450.00 (  0.00%)        420612.00 (-22.75%)        648489.00 ( 19.11%)        555817.00 (  2.09%)
 SpecJBB Bops            409191.00 (  0.00%)        382775.00 ( -6.46%)        551949.00 ( 34.89%)        447750.00 (  9.42%)
 SpecJBB Bops/JVM        102298.00 (  0.00%)         95694.00 ( -6.46%)        137987.00 ( 34.89%)        111938.00 (  9.42%)

numacore regresses from the peak by 22.75% and the specjbb overall score is down 6.46%.

autonuma does well with a 19.11% gain on the peak and 34.89% overall.

balancenuma does reasonably well -- 2.09% gain at the peak and 9.42%
gain overall.

MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
          stats-v6r15  numacore-20121126  autonuma-v28fastr4  balancenuma-v8r6
User       177276.00   146602.11   176834.75   175649.50
System         91.09    27863.11      283.25     1455.39
Elapsed      4030.76     4042.32     4038.79     4038.06

numacore's system CPU usage is extremely high.

autonuma's is ok (with the usual disclaimer that some of its cost is hidden
in kernel threads).

balancenuma's is higher than I'd like. I want to describe it as "not crazy"
but it probably is to everybody else.

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
                           stats-v6r15  numacore-20121126  autonuma-v28fastr4  balancenuma-v8r6
Page Ins                         37836       37744       38072       37192
Page Outs                        49440       51944       49024       51384
Swap Ins                             0           0           0           0
Swap Outs                            0           0           0           0
Direct pages scanned                 0           0           0           0
Kswapd pages scanned                 0           0           0           0
Kswapd pages reclaimed               0           0           0           0
Direct pages reclaimed               0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%
Page writes by reclaim               0           0           0           0
Page writes file                     0           0           0           0
Page writes anon                     0           0           0           0
Page reclaim immediate               0           0           0           0
Page rescued immediate               0           0           0           0
Slabs scanned                        0           0           0           0
Direct inode steals                  0           0           0           0
Kswapd inode steals                  0           0           0           0
Kswapd skipped wait                  0           0           0           0
THP fault alloc                      2           1           1           3
THP collapse alloc                   2           0          20           0
THP splits                           0           0           0           0
THP fault fallback                   0           0           0           0
THP collapse fail                    0           0           0           0
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success                 0           0           0    37212252
Page migrate failure                 0           0           0           0
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                      0           0           0       38626
NUMA PTE updates                     0           0           0   290219318
NUMA hint faults                     0           0           0   267929465
NUMA hint local faults               0           0           0    69757534
NUMA pages migrated                  0           0           0    37212252
AutoNUMA cost                        0           0           0     1342385

First take-away is the lack of THP activity.

Here the stats balancenuma reports are useful because we're only dealing
with base pages. balancenuma migrates 36MB/second which is really high,
particularly when you bear in mind that with copying that's 72MB/sec of
data transferred. From earlier test results we know the scan rate adaption
helps keep this figure down and that the average migration rate is something
we should keep an eye on.
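
For anyone who wants to check it, that figure falls straight out of the
vmstat counters above. A minimal back-of-envelope sketch, assuming 4KB
base pages (reasonable given the lack of THP activity) and using the
balancenuma column:

	/*
	 * Back-of-envelope check of the migration bandwidth quoted above.
	 * Assumes 4KB base pages; the counters come from the vmstat table
	 * for balancenuma. The same arithmetic gives the per-second figures
	 * quoted for the later single JVM runs.
	 */
	#include <stdio.h>

	int main(void)
	{
		long long migrated = 37212252;	/* NUMA pages migrated */
		double elapsed = 4038.06;	/* Elapsed time in seconds */
		double mb = migrated * 4096.0 / (1024.0 * 1024.0);

		printf("%.1f MB/sec migrated\n", mb / elapsed);	/* ~36 MB/sec */
		printf("%.1f MB/sec transferred with the copy\n", 2.0 * mb / elapsed);
		return 0;
	}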

From here, we're onto the single JVM configuration. I suspect
this is tested much more commonly but note that it behaves very
differently to the multi JVM configuration as explained by Andrea
(http://choon.net/forum/read.php?21,1599976,page=4).

SPECJBB: Single JVM, THP is enabled
                    3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7
                  stats-v6r15     numacore-20121126    autonuma-v28fastr4       balancenuma-v8r6
TPut 1      25219.00 (  0.00%)     24994.00 ( -0.89%)     23003.00 ( -8.79%)     26876.00 (  6.57%)
TPut 2      56218.00 (  0.00%)     52603.00 ( -6.43%)     52412.00 ( -6.77%)     55372.00 ( -1.50%)
TPut 3      87560.00 (  0.00%)     78545.00 (-10.30%)     82769.00 ( -5.47%)     87351.00 ( -0.24%)
TPut 4     114877.00 (  0.00%)    110117.00 ( -4.14%)    109057.00 ( -5.07%)    116584.00 (  1.49%)
TPut 5     145249.00 (  0.00%)    126704.00 (-12.77%)    136402.00 ( -6.09%)    144194.00 ( -0.73%)
TPut 6     169591.00 (  0.00%)    147129.00 (-13.24%)    153711.00 ( -9.36%)    170627.00 (  0.61%)
TPut 7     194429.00 (  0.00%)    171652.00 (-11.71%)    185094.00 ( -4.80%)    197385.00 (  1.52%)
TPut 8     218492.00 (  0.00%)    167754.00 (-23.22%)    212731.00 ( -2.64%)    225145.00 (  3.04%)
TPut 9     242090.00 (  0.00%)    200709.00 (-17.09%)    233781.00 ( -3.43%)    250624.00 (  3.53%)
TPut 10    254513.00 (  0.00%)    236769.00 ( -6.97%)    256599.00 (  0.82%)    275834.00 (  8.38%)
TPut 11    283694.00 (  0.00%)    227999.00 (-19.63%)    281189.00 ( -0.88%)    300696.00 (  5.99%)
TPut 12    306679.00 (  0.00%)    263599.00 (-14.05%)    307239.00 (  0.18%)    325723.00 (  6.21%)
TPut 13    317050.00 (  0.00%)    281988.00 (-11.06%)    320474.00 (  1.08%)    346733.00 (  9.36%)
TPut 14    281122.00 (  0.00%)    306206.00 (  8.92%)    348007.00 ( 23.79%)    363974.00 ( 29.47%)
TPut 15    344584.00 (  0.00%)    327784.00 ( -4.88%)    370530.00 (  7.53%)    390804.00 ( 13.41%)
TPut 16    355251.00 (  0.00%)    325626.00 ( -8.34%)    388602.00 (  9.39%)    412690.00 ( 16.17%)
TPut 17    358785.00 (  0.00%)    372911.00 (  3.94%)    406725.00 ( 13.36%)    431710.00 ( 20.33%)
TPut 18    362037.00 (  0.00%)    358876.00 ( -0.87%)    423311.00 ( 16.92%)    447506.00 ( 23.61%)
TPut 19    366526.00 (  0.00%)    397926.00 (  8.57%)    434692.00 ( 18.60%)    454669.00 ( 24.05%)
TPut 20    365125.00 (  0.00%)    387871.00 (  6.23%)    441119.00 ( 20.81%)    475213.00 ( 30.15%)
TPut 21    367221.00 (  0.00%)    446595.00 ( 21.61%)    473582.00 ( 28.96%)    483085.00 ( 31.55%)
TPut 22    352732.00 (  0.00%)    436862.00 ( 23.85%)    479616.00 ( 35.97%)    494976.00 ( 40.33%)
TPut 23    358840.00 (  0.00%)    464554.00 ( 29.46%)    484157.00 ( 34.92%)    507236.00 ( 41.35%)
TPut 24    355426.00 (  0.00%)    474432.00 ( 33.48%)    477851.00 ( 34.44%)    503864.00 ( 41.76%)
TPut 25    354178.00 (  0.00%)    456845.00 ( 28.99%)    476411.00 ( 34.51%)    505628.00 ( 42.76%)
TPut 26    352844.00 (  0.00%)    477178.00 ( 35.24%)    474925.00 ( 34.60%)    496278.00 ( 40.65%)
TPut 27    351616.00 (  0.00%)    461061.00 ( 31.13%)    461218.00 ( 31.17%)    507777.00 ( 44.41%)
TPut 28    342442.00 (  0.00%)    458497.00 ( 33.89%)    442311.00 ( 29.16%)    495797.00 ( 44.78%)
TPut 29    330633.00 (  0.00%)    492795.00 ( 49.05%)    444804.00 ( 34.53%)    512545.00 ( 55.02%)
TPut 30    330202.00 (  0.00%)    503148.00 ( 52.38%)    428283.00 ( 29.70%)    494677.00 ( 49.81%)
TPut 31    318975.00 (  0.00%)    488421.00 ( 53.12%)    445121.00 ( 39.55%)    498506.00 ( 56.28%)
TPut 32    321422.00 (  0.00%)    469743.00 ( 46.15%)    437403.00 ( 36.08%)    490464.00 ( 52.59%)
TPut 33    322341.00 (  0.00%)    465564.00 ( 44.43%)    422936.00 ( 31.21%)    485365.00 ( 50.58%)
TPut 34    306767.00 (  0.00%)    462386.00 ( 50.73%)    407367.00 ( 32.79%)    467848.00 ( 52.51%)
TPut 35    304995.00 (  0.00%)    476963.00 ( 56.38%)    407555.00 ( 33.63%)    471954.00 ( 54.74%)
TPut 36    296795.00 (  0.00%)    455814.00 ( 53.58%)    403723.00 ( 36.03%)    467543.00 ( 57.53%)
TPut 37    295131.00 (  0.00%)    414467.00 ( 40.43%)    367104.00 ( 24.39%)    453145.00 ( 53.54%)
TPut 38    285609.00 (  0.00%)    418189.00 ( 46.42%)    357852.00 ( 25.29%)    436387.00 ( 52.79%)
TPut 39    288418.00 (  0.00%)    432818.00 ( 50.07%)    345127.00 ( 19.66%)    424866.00 ( 47.31%)
TPut 40    284779.00 (  0.00%)    416627.00 ( 46.30%)    330080.00 ( 15.91%)    429043.00 ( 50.66%)
TPut 41    275224.00 (  0.00%)    406106.00 ( 47.55%)    332766.00 ( 20.91%)    412042.00 ( 49.71%)
TPut 42    272301.00 (  0.00%)    387449.00 ( 42.29%)    330321.00 ( 21.31%)    409263.00 ( 50.30%)
TPut 43    261075.00 (  0.00%)    369755.00 ( 41.63%)    322081.00 ( 23.37%)    416906.00 ( 59.69%)
TPut 44    259570.00 (  0.00%)    383102.00 ( 47.59%)    310141.00 ( 19.48%)    401482.00 ( 54.67%)
TPut 45    268308.00 (  0.00%)    370866.00 ( 38.22%)    309946.00 ( 15.52%)    397084.00 ( 48.00%)
TPut 46    251641.00 (  0.00%)    371264.00 ( 47.54%)    308248.00 ( 22.50%)    367053.00 ( 45.86%)
TPut 47    248566.00 (  0.00%)    381703.00 ( 53.56%)    296089.00 ( 19.12%)    362150.00 ( 45.70%)
TPut 48    256403.00 (  0.00%)    392542.00 ( 53.10%)    302787.00 ( 18.09%)    368646.00 ( 43.78%)
TPut 49    252248.00 (  0.00%)    377276.00 ( 49.57%)    330756.00 ( 31.12%)    385558.00 ( 52.85%)
TPut 50    247856.00 (  0.00%)    351684.00 ( 41.89%)    344068.00 ( 38.82%)    373454.00 ( 50.67%)
TPut 51    251900.00 (  0.00%)    332813.00 ( 32.12%)    332706.00 ( 32.08%)    385786.00 ( 53.15%)
TPut 52    255247.00 (  0.00%)    373908.00 ( 46.49%)    338580.00 ( 32.65%)    357138.00 ( 39.92%)
TPut 53    254376.00 (  0.00%)    354872.00 ( 39.51%)    366606.00 ( 44.12%)    367391.00 ( 44.43%)
TPut 54    239804.00 (  0.00%)    375675.00 ( 56.66%)    347626.00 ( 44.96%)    387538.00 ( 61.61%)
TPut 55    243339.00 (  0.00%)    411901.00 ( 69.27%)    345700.00 ( 42.07%)    379513.00 ( 55.96%)
TPut 56    253604.00 (  0.00%)    379291.00 ( 49.56%)    366087.00 ( 44.35%)    367165.00 ( 44.78%)
TPut 57    238212.00 (  0.00%)    376023.00 ( 57.85%)    347698.00 ( 45.96%)    346641.00 ( 45.52%)
TPut 58    246397.00 (  0.00%)    399372.00 ( 62.08%)    372138.00 ( 51.03%)    377817.00 ( 53.34%)
TPut 59    244926.00 (  0.00%)    389607.00 ( 59.07%)    367619.00 ( 50.09%)    373928.00 ( 52.67%)
TPut 60    247249.00 (  0.00%)    382694.00 ( 54.78%)    339032.00 ( 37.12%)    377435.00 ( 52.65%)
TPut 61    249833.00 (  0.00%)    383316.00 ( 53.43%)    340934.00 ( 36.46%)    345885.00 ( 38.45%)
TPut 62    247309.00 (  0.00%)    390815.00 ( 58.03%)    345727.00 ( 39.80%)    359426.00 ( 45.33%)
TPut 63    246530.00 (  0.00%)    390800.00 ( 58.52%)    369327.00 ( 49.81%)    351243.00 ( 42.47%)
TPut 64    238954.00 (  0.00%)    404036.00 ( 69.09%)    359388.00 ( 50.40%)    354036.00 ( 48.16%)
TPut 65    245095.00 (  0.00%)    398807.00 ( 62.72%)    341462.00 ( 39.32%)    336288.00 ( 37.21%)
TPut 66    250698.00 (  0.00%)    387445.00 ( 54.55%)    352065.00 ( 40.43%)    374670.00 ( 49.45%)
TPut 67    235819.00 (  0.00%)    385050.00 ( 63.28%)    337617.00 ( 43.17%)    365777.00 ( 55.11%)
TPut 68    233949.00 (  0.00%)    372286.00 ( 59.13%)    365514.00 ( 56.24%)    344230.00 ( 47.14%)
TPut 69    229172.00 (  0.00%)    370092.00 ( 61.49%)    370106.00 ( 61.50%)    364038.00 ( 58.85%)
TPut 70    237174.00 (  0.00%)    375051.00 ( 58.13%)    366155.00 ( 54.38%)    351673.00 ( 48.28%)
TPut 71    235153.00 (  0.00%)    375629.00 ( 59.74%)    365557.00 ( 55.45%)    328308.00 ( 39.61%)
TPut 72    235747.00 (  0.00%)    356140.00 ( 51.07%)    378508.00 ( 60.56%)    334254.00 ( 41.79%)

numacore does not perform well here for low numbers of warehouses but rapidly
improves and by warehouse 18 is more or less level with the mainline kernel. After
that it improves quite dramatically. Note that specjbb reports on peak scores so
with THP enabled and a single JVM, numacore scores extremely well.

autonuma also regressed for lower numbers of warehouses in this run although
it is not clear why. In 3.7-rc6, the same patch showed very small gains
for lower numbers of warehouses. As with numacore it improves for larger
numbers of warehouses and starts improving from warehouse 12 as opposed
to 18 for numacore.

balancenuma regressed a little initially but improves sooner and shows
respectable performance gains similar to numacore and autonuma for larger
numbers of warehouses.

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7
                                 stats-v6r15          numacore-20121126         autonuma-v28fastr4            balancenuma-v8r6
 Expctd Warehouse            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)
 Expctd Peak Bops        256403.00 (  0.00%)        392542.00 ( 53.10%)        302787.00 ( 18.09%)        368646.00 ( 43.78%)
 Actual Warehouse            21.00 (  0.00%)            30.00 ( 42.86%)            23.00 (  9.52%)            29.00 ( 38.10%)
 Actual Peak Bops        367221.00 (  0.00%)        503148.00 ( 37.02%)        484157.00 ( 31.84%)        512545.00 ( 39.57%)
 SpecJBB Bops            124837.00 (  0.00%)        193615.00 ( 55.09%)        179465.00 ( 43.76%)        184854.00 ( 48.08%)
 SpecJBB Bops/JVM        124837.00 (  0.00%)        193615.00 ( 55.09%)        179465.00 ( 43.76%)        184854.00 ( 48.08%)

Here you can see that numacore scales to a higher number of warehouses
and sees a 37.02% performance gain at the peak and a 55.09% gain on the
specjbb score. The peaks are great but not the results for smaller numbers
of warehouses. As specjbb scores are based on the peak, be mindful of this.

autonuma sees a 31.84% performance gain at the peak and a 43.76%
performance gain on the specjbb score.

balancenuma gets a 39.57% performance gain at the peak and a 48.08%
gain on the specjbb score.

For larger numbers of warehouses, all three trees do extremely well.

MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
         stats-v6r15numacore-20121126autonuma-v28fastr4balancenuma-v8r6
User       317746.38   311465.45   316147.49   315667.42
System         99.42     3043.75      355.53      459.73
Elapsed      7433.93     7436.53     7435.53     7433.49

Same comments apply to the system CPU usage. numacore's is extremely high
and it is using over 6 times more CPU than balancenuma.

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
                           stats-v6r15numacore-20121126autonuma-v28fastr4balancenuma-v8r6
Page Ins                         37060       36916       37072       33400
Page Outs                        59220       63380       57804       54436
Swap Ins                             0           0           0           0
Swap Outs                            0           0           0           0
Direct pages scanned                 0           0           0           0
Kswapd pages scanned                 0           0           0           0
Kswapd pages reclaimed               0           0           0           0
Direct pages reclaimed               0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%
Page writes by reclaim               0           0           0           0
Page writes file                     0           0           0           0
Page writes anon                     0           0           0           0
Page reclaim immediate               0           0           0           0
Page rescued immediate               0           0           0           0
Slabs scanned                        0           0           0           0
Direct inode steals                  0           0           0           0
Kswapd inode steals                  0           0           0           0
Kswapd skipped wait                  0           0           0           0
THP fault alloc                  53004       43971       51386       50126
THP collapse alloc                  67           1         192          58
THP splits                          82          39         107          77
THP fault fallback                   0           0           0           0
THP collapse fail                    0           0           0           0
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success                 0           0           0    47488580
Page migrate failure                 0           0           0           0
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                      0           0           0       49293
NUMA PTE updates                     0           0           0   359807386
NUMA hint faults                     0           0           0     2024295
NUMA hint local faults               0           0           0      693439
NUMA pages migrated                  0           0           0    47488580
AutoNUMA cost                        0           0           0       13542

THP is in use. balancenuma migrated more than I'd like, at an average
of 24MB/sec.


SPECJBB: Single JVM, THP is disabled

                    3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7
                  stats-v6r15     numacore-20121126    autonuma-v28fastr4       balancenuma-v8r6
TPut 1      19264.00 (  0.00%)     17423.00 ( -9.56%)     18895.00 ( -1.92%)     19925.00 (  3.43%)
TPut 2      45341.00 (  0.00%)     38727.00 (-14.59%)     46448.00 (  2.44%)     47567.00 (  4.91%)
TPut 3      69495.00 (  0.00%)     58775.00 (-15.43%)     69639.00 (  0.21%)     72462.00 (  4.27%)
TPut 4      93336.00 (  0.00%)     71864.00 (-23.01%)     95667.00 (  2.50%)     97095.00 (  4.03%)
TPut 5     113997.00 (  0.00%)     98727.00 (-13.40%)    123262.00 (  8.13%)    121667.00 (  6.73%)
TPut 6     135278.00 (  0.00%)    111789.00 (-17.36%)    143619.00 (  6.17%)    144664.00 (  6.94%)
TPut 7     158037.00 (  0.00%)    119202.00 (-24.57%)    168299.00 (  6.49%)    169072.00 (  6.98%)
TPut 8     180282.00 (  0.00%)    124026.00 (-31.20%)    189608.00 (  5.17%)    186262.00 (  3.32%)
TPut 9     203033.00 (  0.00%)    128233.00 (-36.84%)    211492.00 (  4.17%)    207573.00 (  2.24%)
TPut 10    221732.00 (  0.00%)    139290.00 (-37.18%)    230843.00 (  4.11%)    232814.00 (  5.00%)
TPut 11    242479.00 (  0.00%)    127751.00 (-47.31%)    255217.00 (  5.25%)    255212.00 (  5.25%)
TPut 12    257236.00 (  0.00%)    149851.00 (-41.75%)    272681.00 (  6.00%)    259541.00 (  0.90%)
TPut 13    281727.00 (  0.00%)    163583.00 (-41.94%)    287647.00 (  2.10%)    299305.00 (  6.24%)
TPut 14    303538.00 (  0.00%)    142471.00 (-53.06%)    312506.00 (  2.95%)    316094.00 (  4.14%)
TPut 15    322025.00 (  0.00%)    127744.00 (-60.33%)    312595.00 ( -2.93%)    279241.00 (-13.29%)
TPut 16    336713.00 (  0.00%)    123808.00 (-63.23%)    335452.00 ( -0.37%)    307668.00 ( -8.63%)
TPut 17    356063.00 (  0.00%)    111864.00 (-68.58%)    225754.00 (-36.60%)    355818.00 ( -0.07%)
TPut 18    371661.00 (  0.00%)    147370.00 (-60.35%)    360233.00 ( -3.07%)    372634.00 (  0.26%)
TPut 19    379312.00 (  0.00%)    123923.00 (-67.33%)    387282.00 (  2.10%)    361767.00 ( -4.63%)
TPut 20    401692.00 (  0.00%)    138242.00 (-65.59%)    404094.00 (  0.60%)    423420.00 (  5.41%)
TPut 21    414513.00 (  0.00%)    130297.00 (-68.57%)    407778.00 ( -1.62%)    391592.00 ( -5.53%)
TPut 22    428844.00 (  0.00%)    137265.00 (-67.99%)    417451.00 ( -2.66%)    405080.00 ( -5.54%)
TPut 23    438020.00 (  0.00%)    142830.00 (-67.39%)    429879.00 ( -1.86%)    408552.00 ( -6.73%)
TPut 24    448953.00 (  0.00%)    134555.00 (-70.03%)    438014.00 ( -2.44%)    437712.00 ( -2.50%)
TPut 25    435304.00 (  0.00%)    139353.00 (-67.99%)    421593.00 ( -3.15%)    434468.00 ( -0.19%)
TPut 26    440650.00 (  0.00%)    138950.00 (-68.47%)    431110.00 ( -2.16%)    470865.00 (  6.86%)
TPut 27    450883.00 (  0.00%)    122023.00 (-72.94%)    363860.00 (-19.30%)    454628.00 (  0.83%)
TPut 28    443898.00 (  0.00%)    147767.00 (-66.71%)    432948.00 ( -2.47%)    435056.00 ( -1.99%)
TPut 29    441452.00 (  0.00%)    146533.00 (-66.81%)    424264.00 ( -3.89%)    428605.00 ( -2.91%)
TPut 30    441326.00 (  0.00%)    151533.00 (-65.66%)    422050.00 ( -4.37%)    460991.00 (  4.46%)
TPut 31    439690.00 (  0.00%)    153500.00 (-65.09%)    414679.00 ( -5.69%)    434294.00 ( -1.23%)
TPut 32    429590.00 (  0.00%)    157455.00 (-63.35%)    419414.00 ( -2.37%)    428349.00 ( -0.29%)
TPut 33    417133.00 (  0.00%)    144792.00 (-65.29%)    416503.00 ( -0.15%)    417916.00 (  0.19%)
TPut 34    420403.00 (  0.00%)    145986.00 (-65.27%)    405824.00 ( -3.47%)    433001.00 (  3.00%)
TPut 35    416891.00 (  0.00%)    147549.00 (-64.61%)    403946.00 ( -3.11%)    442290.00 (  6.09%)
TPut 36    408666.00 (  0.00%)    148456.00 (-63.67%)    407079.00 ( -0.39%)    394163.00 ( -3.55%)
TPut 37    404101.00 (  0.00%)    155440.00 (-61.53%)    388615.00 ( -3.83%)    402274.00 ( -0.45%)
TPut 38    388909.00 (  0.00%)    160695.00 (-58.68%)    394499.00 (  1.44%)    427483.00 (  9.92%)
TPut 39    383162.00 (  0.00%)    152452.00 (-60.21%)    375101.00 ( -2.10%)    390608.00 (  1.94%)
TPut 40    370984.00 (  0.00%)    165686.00 (-55.34%)    374385.00 (  0.92%)    377252.00 (  1.69%)
TPut 41    370755.00 (  0.00%)    164312.00 (-55.68%)    370951.00 (  0.05%)    375261.00 (  1.22%)
TPut 42    356921.00 (  0.00%)    168220.00 (-52.87%)    365286.00 (  2.34%)    361267.00 (  1.22%)
TPut 43    346752.00 (  0.00%)    164975.00 (-52.42%)    348567.00 (  0.52%)    402065.00 ( 15.95%)
TPut 44    333574.00 (  0.00%)    155288.00 (-53.45%)    346565.00 (  3.89%)    359868.00 (  7.88%)
TPut 45    330858.00 (  0.00%)    158725.00 (-52.03%)    359029.00 (  8.51%)    355606.00 (  7.48%)
TPut 46    324668.00 (  0.00%)    163932.00 (-49.51%)    351591.00 (  8.29%)    375223.00 ( 15.57%)
TPut 47    317691.00 (  0.00%)    154329.00 (-51.42%)    353301.00 ( 11.21%)    355017.00 ( 11.75%)
TPut 48    323505.00 (  0.00%)    159024.00 (-50.84%)    344156.00 (  6.38%)    372821.00 ( 15.24%)
TPut 49    323870.00 (  0.00%)    142198.00 (-56.09%)    349592.00 (  7.94%)    370188.00 ( 14.30%)
TPut 50    332865.00 (  0.00%)    133112.00 (-60.01%)    355565.00 (  6.82%)    366131.00 (  9.99%)
TPut 51    325322.00 (  0.00%)    139628.00 (-57.08%)    355764.00 (  9.36%)    354747.00 (  9.04%)
TPut 52    326365.00 (  0.00%)    144885.00 (-55.61%)    364997.00 ( 11.84%)    358001.00 (  9.69%)
TPut 53    312548.00 (  0.00%)    167534.00 (-46.40%)    370090.00 ( 18.41%)    360848.00 ( 15.45%)
TPut 54    324755.00 (  0.00%)    170174.00 (-47.60%)    373291.00 ( 14.95%)    362261.00 ( 11.55%)
TPut 55    317938.00 (  0.00%)    177956.00 (-44.03%)    375091.00 ( 17.98%)    344495.00 (  8.35%)
TPut 56    326050.00 (  0.00%)    178906.00 (-45.13%)    375465.00 ( 15.16%)    369663.00 ( 13.38%)
TPut 57    302538.00 (  0.00%)    176488.00 (-41.66%)    372899.00 ( 23.26%)    366090.00 ( 21.01%)
TPut 58    314612.00 (  0.00%)    175755.00 (-44.14%)    385492.00 ( 22.53%)    354818.00 ( 12.78%)
TPut 59    312258.00 (  0.00%)    170366.00 (-45.44%)    383785.00 ( 22.91%)    373003.00 ( 19.45%)
TPut 60    317391.00 (  0.00%)    171247.00 (-46.05%)    379551.00 ( 19.58%)    365024.00 ( 15.01%)
TPut 61    289702.00 (  0.00%)    171227.00 (-40.90%)    373473.00 ( 28.92%)    368090.00 ( 27.06%)
TPut 62    314272.00 (  0.00%)    170611.00 (-45.71%)    369686.00 ( 17.63%)    367854.00 ( 17.05%)
TPut 63    318831.00 (  0.00%)    170379.00 (-46.56%)    367372.00 ( 15.22%)    372475.00 ( 16.83%)
TPut 64    304071.00 (  0.00%)    167930.00 (-44.77%)    368247.00 ( 21.11%)    370133.00 ( 21.73%)
TPut 65    294689.00 (  0.00%)    170535.00 (-42.13%)    361717.00 ( 22.75%)    363054.00 ( 23.20%)
TPut 66    309932.00 (  0.00%)    168917.00 (-45.50%)    356749.00 ( 15.11%)    351800.00 ( 13.51%)
TPut 67    309109.00 (  0.00%)    168709.00 (-45.42%)    366841.00 ( 18.68%)    366473.00 ( 18.56%)
TPut 68    307969.00 (  0.00%)    167717.00 (-45.54%)    345216.00 ( 12.09%)    372904.00 ( 21.08%)
TPut 69    315208.00 (  0.00%)    165794.00 (-47.40%)    367136.00 ( 16.47%)    354816.00 ( 12.57%)
TPut 70    310438.00 (  0.00%)    166529.00 (-46.36%)    364421.00 ( 17.39%)    362567.00 ( 16.79%)
TPut 71    304885.00 (  0.00%)    165862.00 (-45.60%)    357377.00 ( 17.22%)    355774.00 ( 16.69%)
TPut 72    304734.00 (  0.00%)    165487.00 (-45.69%)    331900.00 (  8.91%)    348366.00 ( 14.32%)

Without THP, numacore suffers really badly. In an earlier run against
3.7-rc6, autonuma and balancenuma also did not do great, but autonuma did
quite well this time with the same patch, so something significant may have
changed between 3.7-rc6 and 3.7-rc7. balancenuma also did reasonably well
this time, whereas it showed flat performance the last time. balancenuma
has changed, but mostly in how it treats THP, which should not have affected
this result. Tip was based on 3.7-rc6 this time but maybe it'll benefit from
the same mystery change in 3.7-rc7 when it's tested.

So, while balancenuma did well here it's worth noting that if it continually
migrates then its scan rate does not drop and it incurs a higher system
CPU cost. It did not happen here but is worth bearing in mind.
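
To make that caveat concrete, the scan rate adaption behaves roughly like
the sketch below. This is an illustration only -- the structure, names and
constants are mine rather than the kernel's -- but it shows why a workload
that keeps triggering migrations never lets the scan period back off:

	/*
	 * Illustrative sketch of the scan rate adaption caveat: as long as
	 * hinting faults keep resulting in migrations the scan period stays
	 * at its minimum, so PTE scanning (and the system CPU it costs)
	 * never backs off. Names and constants are made up for illustration.
	 */
	struct numa_scan_state {
		unsigned int scan_period_ms;	/* current delay between PTE scans */
		unsigned int period_min_ms;	/* fastest permitted scan rate */
		unsigned int period_max_ms;	/* slowest permitted scan rate */
	};

	static void adapt_scan_period(struct numa_scan_state *s, int pages_migrated)
	{
		if (pages_migrated) {
			/* Pages still being migrated: keep scanning quickly */
			s->scan_period_ms = s->period_min_ms;
		} else {
			/* Nothing needed to move: back off to cut overhead */
			s->scan_period_ms *= 2;
			if (s->scan_period_ms > s->period_max_ms)
				s->scan_period_ms = s->period_max_ms;
		}
	}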

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7
                                 stats-v6r15          numacore-20121126         autonuma-v28fastr4            balancenuma-v8r6
 Expctd Warehouse            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)
 Expctd Peak Bops        323505.00 (  0.00%)        159024.00 (-50.84%)        344156.00 (  6.38%)        372821.00 ( 15.24%)
 Actual Warehouse            27.00 (  0.00%)            56.00 (107.41%)            24.00 (-11.11%)            26.00 ( -3.70%)
 Actual Peak Bops        450883.00 (  0.00%)        178906.00 (-60.32%)        438014.00 ( -2.85%)        470865.00 (  4.43%)
 SpecJBB Bops            160079.00 (  0.00%)         84224.00 (-47.39%)        186038.00 ( 16.22%)        185151.00 ( 15.66%)
 SpecJBB Bops/JVM        160079.00 (  0.00%)         84224.00 (-47.39%)        186038.00 ( 16.22%)        185151.00 ( 15.66%)

numacore regressed 60.32% at the peak and has a 47.39% loss on its specjbb
score.

autonuma regressed 2.85% at its peak but gained 16.22% on its overall
specjbb score.

balancenuma gained 4.43% at its peak and 15.66% on its overall score.


MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
         stats-v6r15numacore-20121126autonuma-v28fastr4balancenuma-v8r6
User       317176.63   168175.82   308607.83   308503.96
System         60.85   119763.49     3974.78     1879.45
Elapsed      7434.09     7451.39     7437.49     7437.41

numacore's system CPU usage is excessive.

autonuma's is high here as well, and that is even though some of its cost is hidden in kernel threads.

balancenuma's is also higher than I'd like but it's the best of the three
trees.

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
                           stats-v6r15numacore-20121126autonuma-v28fastr4balancenuma-v8r6
Page Ins                         62572       36844       37132       37100
Page Outs                        60448       62928       58464       59028
Swap Ins                             0           0           0           0
Swap Outs                            0           0           0           0
Direct pages scanned                 0           0           0           0
Kswapd pages scanned                 0           0           0           0
Kswapd pages reclaimed               0           0           0           0
Direct pages reclaimed               0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%
Page writes by reclaim               0           0           0           0
Page writes file                     0           0           0           0
Page writes anon                     0           0           0           0
Page reclaim immediate               0           0           0           0
Page rescued immediate               0           0           0           0
Slabs scanned                        0           0           0           0
Direct inode steals                  0           0           0           0
Kswapd inode steals                  0           0           0           0
Kswapd skipped wait                  0           0           0           0
THP fault alloc                      3           3           3           3
THP collapse alloc                   0           0          12           0
THP splits                           0           0           0           0
THP fault fallback                   0           0           0           0
THP collapse fail                    0           0           0           0
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success                 0           0           0    25255063
Page migrate failure                 0           0           0           0
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                      0           0           0       26214
NUMA PTE updates                     0           0           0   206844060
NUMA hint faults                     0           0           0   201377412
NUMA hint local faults               0           0           0    51864509
NUMA pages migrated                  0           0           0    25255063
AutoNUMA cost                        0           0           0     1008814

THP is not in use. Migrations for balancenuma were at 13MB/sec which is better
than has been seen before but should still be lower.


Next I ran NPB (http://www.nas.nasa.gov/publications/npb.htm) as an
example of a workload of interest to HPC. I made little or no attempt to
be clever here. Defaults were used instead of trying to tune to achieve
peak performance. I used the Class C problem set size as Class D was being
pushed to swap on my machine. This means that the benchmark is not using that
much memory but it will be using a lot of the CPUs so it is still useful.

For MPI, it is mostly process based and, when running in local mode, it was
using large files in /tmp/ to communicate. So it's using shared memory but
not System V shmem.

OpenMP is thread based.

I analysed neither set of workloads closely. It was just a blind punt.

NAS MPI
                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7
                    stats-v6r15     numacore-20121126    autonuma-v28fastr4       balancenuma-v8r6
Time cg.C       59.92 (  0.00%)       56.59 (  5.56%)       58.66 (  2.10%)       53.58 ( 10.58%)
Time ep.C       18.07 (  0.00%)       18.96 ( -4.93%)       18.12 ( -0.28%)       18.86 ( -4.37%)
Time ft.C       51.57 (  0.00%)       53.67 ( -4.07%)       53.60 ( -3.94%)       51.81 ( -0.47%)
Time is.C        2.85 (  0.00%)        4.19 (-47.02%)        3.26 (-14.39%)        3.34 (-17.19%)
Time lu.C      160.07 (  0.00%)      142.26 ( 11.13%)      138.43 ( 13.52%)      139.71 ( 12.72%)
Time mg.C       24.46 (  0.00%)       23.57 (  3.64%)       24.71 ( -1.02%)       22.73 (  7.07%)

Everyone regressed on is.C and ep.C which are very short-lived. mg.C showed
gains and losses but again is very short-lived. Of what's left

cg.C	balancenuma best but not by that great a margin
ft.C	balancenuma "best" by a small margin and is close to mainline
lu.C	autonuma    best by a small margin
mg.C    balancenuma best by a small margin

The differences between the trees are not massive and may be within the noise.
The fact is that the tests are too short-lived to be really useful. It's a
pity that class D is not usable on this machine because it starts using swap.
I'll investigate if something can be done about that.

MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
         stats-v6r15numacore-20121126autonuma-v28fastr4balancenuma-v8r6
User         8279.08     7415.87     7564.98     7427.82
System       2309.04     2608.66     2432.62     2306.59
Elapsed       366.62      350.35      349.25      341.20

numacore is a bit high on the system CPU usage side but not as excessive
as it can be.

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
                           stats-v6r15numacore-20121126autonuma-v28fastr4balancenuma-v8r6
Page Ins                         33256       36576       36448       36508
Page Outs                       732304      832596      745144      590296
Swap Ins                             0           0           0           0
Swap Outs                            0           0           0           0
Direct pages scanned                 0           0           0           0
Kswapd pages scanned                 0           0           0           0
Kswapd pages reclaimed               0           0           0           0
Direct pages reclaimed               0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%
Page writes by reclaim               0           0           0           0
Page writes file                     0           0           0           0
Page writes anon                     0           0           0           0
Page reclaim immediate               0           0           0           0
Page rescued immediate               0           0           0           0
Slabs scanned                        0           0           0           0
Direct inode steals                  0           0           0           0
Kswapd inode steals                  0           0           0           0
Kswapd skipped wait                  0           0           0           0
THP fault alloc                   7532        7524        7526        7530
THP collapse alloc                  19           0         100          21
THP splits                           0           0           8           1
THP fault fallback                   0           0           0           0
THP collapse fail                    0           0           0           0
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success                 0           0           0     1954996
Page migrate failure                 0           0           0           0
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                      0           0           0        2029
NUMA PTE updates                     0           0           0   106542884
NUMA hint faults                     0           0           0     2634360
NUMA hint local faults               0           0           0     2385326
NUMA pages migrated                  0           0           0     1954996
AutoNUMA cost                        0           0           0       13954

THP was in use but otherwise it's hard to conclude anything useful. Each
workload is very different so we cannot draw reasonable conclusions from
the amount of data migrated.

NAS OMP

                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7
                    stats-v6r15     numacore-20121126    autonuma-v28fastr4       balancenuma-v8r6
Time bt.C      167.76 (  0.00%)      189.34 (-12.86%)      166.28 (  0.88%)      169.68 ( -1.14%)
Time cg.C       44.52 (  0.00%)       61.84 (-38.90%)       52.11 (-17.05%)       46.71 ( -4.92%)
Time ep.C       12.66 (  0.00%)       15.41 (-21.72%)       12.35 (  2.45%)       12.21 (  3.55%)
Time ft.C       32.55 (  0.00%)       37.77 (-16.04%)       35.21 ( -8.17%)       32.85 ( -0.92%)
Time is.C        1.69 (  0.00%)        2.28 (-34.91%)        1.95 (-15.38%)        1.68 (  0.59%)
Time lu.C       88.12 (  0.00%)      135.42 (-53.68%)      120.73 (-37.01%)       91.07 ( -3.35%)
Time mg.C       26.62 (  0.00%)       33.15 (-24.53%)       29.07 ( -9.20%)       28.08 ( -5.48%)
Time sp.C      783.74 (  0.00%)      450.35 ( 42.54%)      384.51 ( 50.94%)      413.22 ( 47.28%)
Time ua.C      201.91 (  0.00%)      173.32 ( 14.16%)      187.70 (  7.04%)      172.80 ( 14.42%)

Note that OpenMP runs more tests. At some point in the past, the equivalent
tests were not compiling for OpenMPI and the MMTests script does not even try
to run them. I'll recheck whether this is still the case or if it can be fixed.

numacore and autonuma did really badly on lu.C; it is worth looking at what
that benchmark is doing. balancenuma looks like it did ok but I am cautious
about it and would prefer if it had been run more than once.

Otherwise, numacore regressed on a number of the remaining tests but
saw large gains for sp and ua.

autonuma fares much better but there are large regressions there too.

balancenuma did ok. Generally though, this series of benchmarks has raised
a few challenges that will need to be answered.

MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
         stats-v6r15numacore-20121126autonuma-v28fastr4balancenuma-v8r6
User        60286.11    46017.38    41803.90    42021.18
System         68.02     1430.31      118.75      166.79
Elapsed      1495.34     1236.03     1131.33     1103.99

numacore's system CPU usage is comparatively very high again.

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7
                           stats-v6r15numacore-20121126autonuma-v28fastr4balancenuma-v8r6
Page Ins                         37544       37288       37428       37404
Page Outs                        19240       17908       17244       17600
Swap Ins                             0           0           0           0
Swap Outs                            0           0           0           0
Direct pages scanned                 0           0           0           0
Kswapd pages scanned                 0           0           0           0
Kswapd pages reclaimed               0           0           0           0
Direct pages reclaimed               0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%
Page writes by reclaim               0           0           0           0
Page writes file                     0           0           0           0
Page writes anon                     0           0           0           0
Page reclaim immediate               0           0           0           0
Page rescued immediate               0           0           0           0
Slabs scanned                        0           0           0           0
Direct inode steals                  0           0           0           0
Kswapd inode steals                  0           0           0           0
Kswapd skipped wait                  0           0           0           0
THP fault alloc                  15700       15798       15495       15696
THP collapse alloc                  13           2          98           8
THP splits                           0           0           2           1
THP fault fallback                   0           0           0           0
THP collapse fail                    0           0           0           0
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success                 0           0           0     2814591
Page migrate failure                 0           0           0           0
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                      0           0           0        2921
NUMA PTE updates                     0           0           0    49389870
NUMA hint faults                     0           0           0     1575920
NUMA hint local faults               0           0           0      961230
NUMA pages migrated                  0           0           0     2814591
AutoNUMA cost                        0           0           0        8278

THP is in use but as each workload is very different we cannot really draw
sensible conclusions from the other stats.

Finally, the following are just rudimentary tests to check some basics. I'm
not going into heavy detail this time because the figures look very similar
to the previous report:

kernbench	- numacore    -2.50%
		  autonuma    -0.49%
		  balancenuma -0.60%

aim9		- everyone ok
hackbench-pipes	- same as before. numacore, balancenuma ok. autonuma regressed heavily
hackbench-socket - same as hackbench-pipes
pft		- same as before. numacore and balancenuma are ok. autonuma shows high
		  system CPU usage and the fault rates tell a similar story; numacore
		  and balancenuma are ok while autonuma regresses heavily

There you have it. Some good results, some great, some bad results, some
disastrous. Of course this is for only one machine and other machines
might report differently.

numacore does very well with THP enabled on a single JVM for specjbb
and does very well for an adverse workload in autonumabench. However,
in other benchmarks it can regress heavily and its system CPU usage can
be excessive. I'm still of the opinion that it should be rebased on top
of balancenuma and evaluated against it.

autonuma does very well in a number of configurations but there are too
many people unhappy with how it integrates with the core kernel. It would
also be nice if the placement policies part could be rebased on top of
balancenuma where it could get a fair like-with-like comparison with numacore.

balancenuma did pretty well overall. It generally was an improvement on
the baseline kernel but there are cases where it could really benefit
from a placement policy on top that could place the memory and quickly
reduce the PTE scan rates and number of migrations. I think it's the best
starting point we have available right now.

Comments?

-- 
Mel Gorman
SUSE Labs.

* Re: Results for balancenuma v8, autonuma-v28fast and numacore-20121126
  2012-11-30 11:41       ` Results for balancenuma v8, autonuma-v28fast and numacore-20121126 Mel Gorman
@ 2012-11-30 16:09         ` Rik van Riel
  0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-11-30 16:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	Lee Schermerhorn, Alex Shi, Srikar Dronamraju, Aneesh Kumar,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML

On 11/30/2012 06:41 AM, Mel Gorman wrote:
> This is an another insanely long mail. Short summary, based on the results
> of what is in tip/master right now, I think if we're going to merge
> anything for v3.8 it should be the "Automatic NUMA Balancing V8". It does
> reasonably well for many of the workloads and AFAIK there is no reason why
> numacore or autonuma could not be rebased on top with the view to merging
> proper scheduling and placement policies in 3.9.

Given how minimalistic balancenuma is, and how there does not seem
to be anything significant in the way of performance regressions
with balancenuma, I have no objections to Linus merging all of
balancenuma for 3.8.

That could significantly reduce the amount of NUMA code we need
to "fight over" for the 3.9 kernel :)

-- 
All rights reversed

* Re: [PATCH 00/45] Automatic NUMA Balancing V7
  2012-11-28 13:49   ` [PATCH 00/45] Automatic NUMA Balancing V7 Mel Gorman
  2012-11-30 11:33     ` [PATCH 00/46] Automatic NUMA Balancing V8 Mel Gorman
@ 2012-12-07 10:45     ` Srikar Dronamraju
  2012-12-10  9:07       ` Mel Gorman
  1 sibling, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2012-12-07 10:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML


Got a chance to run autonuma-benchmark on an 8 node, 64 core machine.
The results are as below (for each kernel I ran 5 iterations of
autonuma-benchmark).

KernelVersion: 3.7.0-rc3-mainline_v37rc7()
                        Testcase:      Min      Max      Avg
                          numa01:  1562.65  1621.02  1595.92
                numa01_HARD_BIND:   916.84  1114.15  1031.68
             numa01_INVERSE_BIND:  2841.51  7864.06  4145.55
             numa01_THREAD_ALLOC:  1014.28  1722.63  1233.83
   numa01_THREAD_ALLOC_HARD_BIND:   595.15   683.74   645.45
numa01_THREAD_ALLOC_INVERSE_BIND:  1987.64  3324.27  2431.64
                          numa02:   126.07   147.53   132.66
                numa02_HARD_BIND:    25.82    26.54    26.16
             numa02_INVERSE_BIND:   339.12   352.30   344.61
                      numa02_SMT:   137.85   369.20   202.97
            numa02_SMT_HARD_BIND:    27.11   151.72    84.33
         numa02_SMT_INVERSE_BIND:   287.53  1510.83   535.73

KernelVersion: 3.6.0-autonuma+()
                        Testcase:      Min      Max      Avg  %Change
                          numa01:  2102.39  2283.92  2211.19  -27.83%
                numa01_HARD_BIND:   929.84  1155.89  1054.46   -2.16%
             numa01_INVERSE_BIND:  2959.99  4309.97  3366.57   23.14%
             numa01_THREAD_ALLOC:   354.59   453.28   381.67  223.27%
   numa01_THREAD_ALLOC_HARD_BIND:   580.08  1041.88   749.49  -13.88%
numa01_THREAD_ALLOC_INVERSE_BIND:  1805.52  2186.07  1990.85   22.14%
                          numa02:    50.06    62.44    58.25  127.74%
                numa02_HARD_BIND:    25.85    26.26    26.03    0.50%
             numa02_INVERSE_BIND:   335.19   378.02   345.20   -0.17%
                      numa02_SMT:    56.73    71.73    63.67  218.78%
            numa02_SMT_HARD_BIND:    35.70    75.05    50.52   66.92%
         numa02_SMT_INVERSE_BIND:   292.38   302.87   297.85   79.87%

KernelVersion: 3.7.0-rc6-mel_auto_balance+ (mm-balancenuma-v7r6)
                        Testcase:      Min      Max      Avg  %Change
                          numa01:  1606.26  1815.21  1703.47   -6.31%
                numa01_HARD_BIND:   952.50  1186.68  1072.18   -3.78%
             numa01_INVERSE_BIND:  2851.68  5238.50  3417.63   21.30%
             numa01_THREAD_ALLOC:  1013.36  2675.91  1681.84  -26.64%
   numa01_THREAD_ALLOC_HARD_BIND:   660.48  1310.79  1007.33  -35.92%
numa01_THREAD_ALLOC_INVERSE_BIND:  1858.45  2567.01  2053.79   18.40%
                          numa02:   127.00   387.29   181.77  -27.02%
                numa02_HARD_BIND:    25.58    26.30    26.07    0.35%
             numa02_INVERSE_BIND:   342.17   448.23   367.59   -6.25%
                      numa02_SMT:   150.28   739.28   313.60  -35.28%
            numa02_SMT_HARD_BIND:    27.46   234.01   109.82  -23.21%
         numa02_SMT_INVERSE_BIND:   289.47   500.87   339.96   57.59%

KernelVersion: 3.7.0-rc5-tip_master+ (Nov 23rd tip) 
                        Testcase:      Min      Max      Avg  %Change
                          numa01:  1294.35  1760.17  1555.51    2.60%
                numa01_HARD_BIND:   769.32  2588.15  1429.87  -27.85%
             numa01_INVERSE_BIND:  3003.87  4041.55  3335.73   24.28%
             numa01_THREAD_ALLOC:   308.77   341.92   321.26  284.06%
   numa01_THREAD_ALLOC_HARD_BIND:   484.54   547.84   516.80   24.89%
numa01_THREAD_ALLOC_INVERSE_BIND:  1873.33  2026.21  1978.36   22.91%
                          numa02:    34.73    38.61    36.62  262.26%
                numa02_HARD_BIND:    29.08    31.07    29.66  -11.80%
             numa02_INVERSE_BIND:    30.72    34.16    31.60  990.54%
                      numa02_SMT:    36.05    43.49    40.35  403.02%
            numa02_SMT_HARD_BIND:    43.26   100.50    67.12   25.64%
         numa02_SMT_INVERSE_BIND:    44.33   114.72    75.12  613.17%


* Re: [PATCH 00/45] Automatic NUMA Balancing V7
  2012-12-07 10:45     ` [PATCH 00/45] Automatic NUMA Balancing V7 Srikar Dronamraju
@ 2012-12-10  9:07       ` Mel Gorman
  2012-12-10  9:42         ` Srikar Dronamraju
  0 siblings, 1 reply; 53+ messages in thread
From: Mel Gorman @ 2012-12-10  9:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML

On Fri, Dec 07, 2012 at 04:15:39PM +0530, Srikar Dronamraju wrote:
> 
> Got a chance to run autonuma-benchmark on a 8 node, 64 core machine. 
> the results are as below. (for each kernel I ran 5 iterations of
> autonuma-benchmark)
> 

Thanks, a test of v10 would also be appreciated. The differences between
V7 and V10 are small but do include a change in how migrate rate-limiting
is handled. It is unlikely it'll make a difference to this test but I'd
like to rule it out.

> KernelVersion: 3.7.0-rc3-mainline_v37rc7()

What kernel is this? The name begins with 3.7-rc3 but then says
v37rc7. v37rc7 of what? I thought it might be v3.7-rc7 but it already said
it's 3.7-rc3 so I'm confused. Would it be possible to base the tests on
a similar baseline kernel such as 3.7.0-rc7 or 3.7.0-rc8? The
balancenuma patches should apply and the autonuma patches can be taken
from the mm-autonuma-v28fastr4-mels-rebase branch in
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git

Either way, the figures look bad. I'm trying to find a similar machine
but initially at least I have not had much luck. Can you post the .config
you used for balancenuma in case I can reproduce the problem on a 4-node
machine please? Are all the nodes the same size?

Thanks!

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 00/45] Automatic NUMA Balancing V7
  2012-12-10  9:07       ` Mel Gorman
@ 2012-12-10  9:42         ` Srikar Dronamraju
  0 siblings, 0 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2012-12-10  9:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML

> > 
> > Got a chance to run autonuma-benchmark on a 8 node, 64 core machine. 
> > the results are as below. (for each kernel I ran 5 iterations of
> > autonuma-benchmark)
> > 
> 
> Thanks, a test of v10 would also be appreciated. The differences between
> V7 and V10 are small but do include a change in how migrate rate-limiting
> is handled. It is unlikely it'll make a difference to this test but I'd
> like to rule it out.
> 


Yes, have queued it for testing. Will report on completion.


> > KernelVersion: 3.7.0-rc3-mainline_v37rc7()

Please read it as 3.7-rc3 

> 
> What kernel is this? The name begins with 3.7-rc3 but then says
> v37rc7. v37rc7 of what? I thought it might be v3.7-rc7 but it already said
> it's 3.7-rc3 so I'm confused. Would it be possible to base the tests on
> a similar baseline kernel such as 3.7.0-rc7 or 3.7.0-rc8? The



> balancenuma patches should apply and the autonuma patches can be taken
> from the mm-autonuma-v28fastr4-mels-rebase branch in
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git
> 

Yes, for the next set of reports I have based the autonuma branch on this
branch.

> Either way, the figures look bad. I'm trying to find a similar machine
> but initially at least I have not had much luck. Can you post the .config
> you used for balancenuma in case I can reproduce the problem on a 4-node
> machine please? Are all the nodes the same size?
> 

No, the nodes are not all the same size.
There are six 32GB nodes and two 64GB nodes.

Will post the balancenuma config along with results.

-- 
Thanks and Regards
Srikar

