* [PATCH 00/49] Automatic NUMA Balancing v10
@ 2012-12-07 10:23 Mel Gorman
From: Mel Gorman @ 2012-12-07 10:23 UTC
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

This is a full release of all the patches so apologies for the flood.  V9 was
just a MIPS build fix and did not justify a full release. V10 includes Ingo's
scalability patches because even though they increase system CPU usage,
they also helped in a number of test cases. It would be worthwhile trying
to reduce the system CPU usage by looking closer at how rwsem works and
dealing with the contended case a bit better. Otherwise the rate of change
in the last few weeks has been tiny as the preliminary objectives had been
met and I did not want to invalidate any testing other people had conducted.

git tree: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v10r3
git tag:  git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v10

Based on the performance results I have, I still think this tree is what
should be merged for 3.8 with numacore or autonuma rebased on top of it for
3.9. The numacore results are based on an oldish tree. I'm travelling this
week, tip/master crashes and I haven't had the chance to debug it. Worse,
v18 of numacore had a last-minute patch that effectively disabled it. I
reported it (https://lkml.org/lkml/2012/12/4/393) but got no feedback.

Changelog since V9
  o Migration scalability						(mingo)

Changelog since V8
  o Fix build error on MIPS						(rientjes)

Changelog since V7
  o Account for transhuge migrations properly when migrate rate-limiting

Changelog since V6
  o Transfer last_nid information during transhuge migration		(dhillf)
  o Transfer last_nid information during splits				(dhillf)
  o Drop page reference if target node is full				(dhillf)
  o Account for transhuge allocation failure as migration failure	(mel)

Changelog since V5
  o Fix build errors related to config options, make bisect-safe
  o Account for transhuge migrations
  o Count HPAGE_PMD_NR pages when isolating transhuge
  o Account for local transhuge faults
  o Fix a memory leak on isolation failure

Changelog since V4
  o Allow enabling/disabling from command line
  o Delay PTE scanning until tasks are running on a new node
  o THP migration bits needed for memcg
  o Adapt the scanning rate depending on whether pages need to migrate
  o Drop all the scheduler policy stuff on top, it was broken

Changelog since V3
  o Use change_protection
  o Architecture-hook twiddling
  o Port of the THP migration patch.
  o Additional TLB optimisations
  o Fixes from Hillf Danton

Changelog since V2
  o Do not allocate from home node
  o Mostly remove pmd_numa handling for regular pmds
  o HOME policy will allocate from and migrate towards local node
  o Load balancer is more aggressive about moving tasks towards home node
  o Renames to sync up more with -tip version
  o Move pte handlers to generic code
  o Scanning rate starts at 100ms, system CPU usage expected to increase
  o Handle migration of PMD hinting faults
  o Rate limit migration on a per-node basis
  o Alter how the rate of PTE scanning is adapted
  o Rate limit setting of pte_numa if node is congested
  o Only flush local TLB if unmapping a pte_numa page
  o Only consider one CPU in cpu follow algorithm

Changelog since V1
  o Account for faults on the correct node after migration
  o Do not account for THP splits as faults.
  o Account THP faults on the node they occurred
  o Ensure preferred_node_policy is initialised before use
  o Mitigate double faults
  o Add home-node logic
  o Add some tlb-flush mitigation patches
  o Add variation of CPU follows memory algorithm
  o Add last_nid and use it as a two-stage filter before migrating pages
  o Restart the PTE scanner when it reaches the end of the address space
  o Lots of stuff I did not note properly

There are currently two (three depending on how you look at it) competing
approaches to implement support for automatically migrating pages to
optimise NUMA locality. Performance results are available but review
highlighted different problems in both.  They are not compatible with each
other even though some fundamental mechanics should have been the same.
This series addresses part of the integration and sharing problem by
implementing a foundation that either the policy for schednuma or autonuma
can be rebased on.

The initial policy it implements is a very basic greedy policy called
"Migrate On Reference Of pte_numa Node (MORON)".  I expect people to
build upon this revised policy and rename it to something more sensible
that reflects what it means. The ideal *worst-case* behaviour is that
it is comparable to current mainline but for some workloads this is an
improvement over mainline.

This series can be treated as 5 major stages.

1. TLB optimisations that we're likely to want unconditionally.
2. Basic foundation and core mechanics, initial policy that does very little
3. Full PMD fault handling, rate limiting of migration, two-stage migration
   filter to mitigate poor migration decisions.  This will migrate pages
   on a PTE or PMD level using just the current referencing CPU as a
   placement hint
4. Scan rate adaption
5. Native THP migration
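
As a rough illustration of the two-stage migration filter in stage 3, here
is a minimal Python sketch. It is a toy model and not the kernel code: the
Page class, the NO_NID sentinel and should_migrate() are hypothetical names
made up for this example. The assumption is that a page is only migrated
when two consecutive hinting faults on it come from the same remote node,
with last_nid remembering the node of the previous fault.

```python
from dataclasses import dataclass

NO_NID = -1  # hypothetical sentinel: no hinting fault recorded yet


@dataclass
class Page:
    nid: int                # node the page currently resides on
    last_nid: int = NO_NID  # node that took the previous hinting fault


def should_migrate(page: Page, faulting_nid: int) -> bool:
    """Two-stage filter: migrate only on the second consecutive fault
    from the same remote node."""
    previous = page.last_nid
    page.last_nid = faulting_nid      # always remember the latest faulting node
    if faulting_nid == page.nid:
        return False                  # already local, nothing to do
    return previous == faulting_nid   # stage two: same node faulted twice in a row


page = Page(nid=0)
print(should_migrate(page, 1))  # first fault from node 1: filtered
print(should_migrate(page, 1))  # second fault from node 1: migrate
```

A page that bounces between two remote nodes never passes the filter in
this model, which is the kind of poor migration decision the filter is
meant to mitigate.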

Very broadly speaking the TODOs that spring to mind are

1. Revisit MPOL_NOOP and MPOL_MF_LAZY
2. Other architecture support or at least validation that it could be made to
   work. I'm half-hoping that the PPC64 people are watching because they tend
   to be interested in this type of thing.

Some advantages of the series are:

1. It rate limits migrations to avoid saturating the bus and backs off
   PTE scanning (in a fairly heavy manner) if the node is rate-limited
2. It keeps major optimisations like THP towards the end to be sure I am
   not accidentally depending on them
3. It implements a basic policy that acts as a second performance baseline.
   The three baselines become vanilla kernel, basic placement policy,
   complex placement policy. This allows like-with-like comparisons with
   implementations.
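
To make the per-node migration rate limiting in point 1 more concrete, here
is a small Python sketch of a fixed-window limiter. It only models the idea
and is not taken from the patches: the class name, the 100ms window and the
1024-page budget are assumptions picked for illustration. A transhuge page
is charged as 512 base pages, which assumes 4K base pages and 2M huge pages
as on x86-64.

```python
class NodeMigrateLimiter:
    """Toy per-node limiter: at most max_pages migrated per window_ms."""

    def __init__(self, window_ms: int, max_pages: int):
        self.window_ms = window_ms      # hypothetical window length
        self.max_pages = max_pages      # hypothetical page budget per window
        self.window_start_ms = 0
        self.pages_in_window = 0

    def allow(self, now_ms: int, nr_pages: int) -> bool:
        # Start a fresh window once the current one has expired.
        if now_ms - self.window_start_ms >= self.window_ms:
            self.window_start_ms = now_ms
            self.pages_in_window = 0
        # Refuse the migration if it would exceed the budget for this window.
        if self.pages_in_window + nr_pages > self.max_pages:
            return False
        self.pages_in_window += nr_pages
        return True


HPAGE_NR = 512  # base pages per transhuge page, assuming 4K pages and 2M THP

node1 = NodeMigrateLimiter(window_ms=100, max_pages=1024)
print(node1.allow(0, HPAGE_NR))    # True
print(node1.allow(10, HPAGE_NR))   # True, budget now used up
print(node1.allow(20, 1))          # False, node is rate-limited
print(node1.allow(100, 1))         # True, a new window has started
```

In this model a caller that gets False back would skip the migration and
could use the same signal to slow down PTE scanning for that node.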

The comparisons are a bit shorter this time.

Kernels are

stats-v8r6		TLB flush optimisations and stats from this series
numacore-20121130	Tip/master on that date (roughly v17)
numacore-20121202	Tip/master on that date (roughly v18)
autonuma-v28fastr4	Autonuma v28fast rebased and with THP patch on top
balancenuma-v9r2	balancenuma-v9
balancenuma-v10r3	balancenuma-v10

v9 and v10 only differ by the migration scalability patches. Current
tip/master is crashing during boot and has been crashing for the last few
days which is why it's not included. As I'm remote I have not had the
chance to debug it but it has been reported already. It does mean that
the numacore comparison is old and not based on the unified tree but
right now there is not much I can do about that.

This is less detailed than earlier reports because many of the conclusions
are the same as before.

AUTONUMA BENCH
                                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                                     stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r3
User    NUMA01               65230.85 (  0.00%)    24835.22 ( 61.93%)    69344.37 ( -6.31%)    30410.22 ( 53.38%)    52436.65 ( 19.61%)    59949.95 (  8.10%)
User    NUMA01_THEADLOCAL    60794.67 (  0.00%)    17856.17 ( 70.63%)    53416.06 ( 12.14%)    17185.34 ( 71.73%)    17829.96 ( 70.67%)    17501.83 ( 71.21%)
User    NUMA02                7031.50 (  0.00%)     2084.38 ( 70.36%)     6726.17 (  4.34%)     2238.73 ( 68.16%)     2079.48 ( 70.43%)     2094.68 ( 70.21%)
User    NUMA02_SMT            2916.19 (  0.00%)     1009.28 ( 65.39%)     3207.30 ( -9.98%)     1037.07 ( 64.44%)      997.57 ( 65.79%)     1010.15 ( 65.36%)
System  NUMA01                  39.66 (  0.00%)      926.55 (-2236.23%)      333.49 (-740.87%)      236.83 (-497.15%)      275.09 (-593.62%)      265.02 (-568.23%)
System  NUMA01_THEADLOCAL       42.33 (  0.00%)      513.99 (-1114.25%)       40.59 (  4.11%)       70.90 (-67.49%)      110.82 (-161.80%)      130.30 (-207.82%)
System  NUMA02                   1.25 (  0.00%)       18.57 (-1385.60%)        1.04 ( 16.80%)        6.39 (-411.20%)        6.42 (-413.60%)        9.17 (-633.60%)
System  NUMA02_SMT              16.66 (  0.00%)       12.32 ( 26.05%)        0.95 ( 94.30%)        3.17 ( 80.97%)        3.58 ( 78.51%)        6.21 ( 62.73%)
Elapsed NUMA01                1511.76 (  0.00%)      575.93 ( 61.90%)     1644.63 ( -8.79%)      701.62 ( 53.59%)     1185.53 ( 21.58%)     1352.74 ( 10.52%)
Elapsed NUMA01_THEADLOCAL     1387.17 (  0.00%)      398.55 ( 71.27%)     1260.92 (  9.10%)      378.47 ( 72.72%)      397.37 ( 71.35%)      387.93 ( 72.03%)
Elapsed NUMA02                 176.81 (  0.00%)       51.14 ( 71.08%)      180.80 ( -2.26%)       53.45 ( 69.77%)       49.51 ( 72.00%)       49.77 ( 71.85%)
Elapsed NUMA02_SMT             163.96 (  0.00%)       48.92 ( 70.16%)      166.96 ( -1.83%)       48.17 ( 70.62%)       47.71 ( 70.90%)       48.63 ( 70.34%)
CPU     NUMA01                4317.00 (  0.00%)     4473.00 ( -3.61%)     4236.00 (  1.88%)     4368.00 ( -1.18%)     4446.00 ( -2.99%)     4451.00 ( -3.10%)
CPU     NUMA01_THEADLOCAL     4385.00 (  0.00%)     4609.00 ( -5.11%)     4239.00 (  3.33%)     4559.00 ( -3.97%)     4514.00 ( -2.94%)     4545.00 ( -3.65%)
CPU     NUMA02                3977.00 (  0.00%)     4111.00 ( -3.37%)     3720.00 (  6.46%)     4200.00 ( -5.61%)     4212.00 ( -5.91%)     4226.00 ( -6.26%)
CPU     NUMA02_SMT            1788.00 (  0.00%)     2087.00 (-16.72%)     1921.00 ( -7.44%)     2159.00 (-20.75%)     2098.00 (-17.34%)     2089.00 (-16.83%)

numacore-20121130 did reasonably well although its system CPU usage is extremely high.

numacore-20121202 is very poor and roughly comparable to mainline. This
is likely because numacore is effectively disabled in this release. The
reasons it is likely disabled have already been reported and current
tip/master looks like it would suffer the same problem if it booted.

balancenuma does reasonably well. It's not great at numa01 which is an adverse
workload as it does not know how to interleave which is what's needed in this
case. It does very well for the other test cases.
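
For anyone checking the tables by hand, the percentages in brackets appear
to be the gain relative to the first column (stats-v8r6), with the sign
flipped for metrics where lower is better (User, System, Elapsed, Stddev).
The Python sketch below reproduces a few of the figures from this report
under that assumption. gain_percent() is a name made up for this example
and is not part of MMTests.

```python
def gain_percent(baseline: float, value: float, lower_is_better: bool) -> float:
    """Percentage gain over baseline; positive always means 'better'."""
    delta = (baseline - value) if lower_is_better else (value - baseline)
    return round(100.0 * delta / baseline, 2)


# User NUMA01, numacore-20121130 vs stats-v8r6 (lower is better)
print(gain_percent(65230.85, 24835.22, lower_is_better=True))

# System NUMA01, numacore-20121130 vs stats-v8r6 (lower is better)
print(gain_percent(39.66, 926.55, lower_is_better=True))

# SpecJBB TPut for 1 warehouse, numacore-20121130 vs stats-v8r6 (higher is better)
print(gain_percent(125247.00, 111752.00, lower_is_better=False))
```

Some cells may differ in the last digit when recomputed this way because
the tables only show values rounded to two decimal places.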

MMTests Statistics: duration
                     3.7.0-rc7           3.7.0-rc6           3.7.0-rc7           3.7.0-rc7           3.7.0-rc7           3.7.0-rc7
                    stats-v8r6   numacore-20121130   numacore-20121202  autonuma-v28fastr4    balancenuma-v9r2   balancenuma-v10r3
User                 135980.38            45792.55           132701.13            50878.50            73350.91            80563.56
System                  100.53             1472.19              376.74              317.89              396.58              411.40
Elapsed                3248.36             1084.63             3262.62             1191.85             1689.70             1847.35

numacore-20121130 has very high system CPU usage.

balancenuma's system CPU usage is higher than I'd like but it's acceptable.

SpecJBB Multiple JVMs, 4 Nodes
                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                     stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r3
Mean   1      31311.75 (  0.00%)     27938.00 (-10.77%)     29681.25 ( -5.21%)     31474.25 (  0.52%)     31112.00 ( -0.64%)     31281.50 ( -0.10%)
Mean   2      62972.75 (  0.00%)     51899.00 (-17.58%)     60403.00 ( -4.08%)     66654.00 (  5.85%)     62937.50 ( -0.06%)     62483.50 ( -0.78%)
Mean   3      91292.00 (  0.00%)     80908.00 (-11.37%)     86570.25 ( -5.17%)     97177.50 (  6.45%)     90665.50 ( -0.69%)     90667.00 ( -0.68%)
Mean   4     115768.75 (  0.00%)     99497.25 (-14.06%)    105982.25 ( -8.45%)    125596.00 (  8.49%)    116812.50 (  0.90%)    116193.50 (  0.37%)
Mean   5     137248.50 (  0.00%)     92837.75 (-32.36%)    115640.50 (-15.74%)    152795.25 ( 11.33%)    139037.75 (  1.30%)    139055.50 (  1.32%)
Mean   6     155528.50 (  0.00%)    105554.50 (-32.13%)    124614.75 (-19.88%)    177455.25 ( 14.10%)    155769.25 (  0.15%)    159129.50 (  2.32%)
Mean   7     156747.50 (  0.00%)    122582.25 (-21.80%)    133205.00 (-15.02%)    184578.75 ( 17.76%)    157103.25 (  0.23%)    163234.00 (  4.14%)
Mean   8     152069.50 (  0.00%)    122439.00 (-19.48%)    132939.25 (-12.58%)    186619.25 ( 22.72%)    157631.00 (  3.66%)    163077.75 (  7.24%)
Mean   9     146609.75 (  0.00%)    112410.00 (-23.33%)    123667.25 (-15.65%)    186165.00 ( 26.98%)    152561.00 (  4.06%)    159656.00 (  8.90%)
Mean   10    142819.00 (  0.00%)    111456.00 (-21.96%)    117609.00 (-17.65%)    182569.75 ( 27.83%)    145320.00 (  1.75%)    153414.25 (  7.42%)
Mean   11    128292.25 (  0.00%)     98027.00 (-23.59%)    112410.25 (-12.38%)    176104.75 ( 37.27%)    138599.50 (  8.03%)    147194.25 ( 14.73%)
Mean   12    128769.75 (  0.00%)    129469.50 (  0.54%)    106629.50 (-17.19%)    169003.00 ( 31.24%)    131994.75 (  2.50%)    140049.75 (  8.76%)
Mean   13    126488.50 (  0.00%)    110133.75 (-12.93%)    106878.25 (-15.50%)    162725.75 ( 28.65%)    130005.25 (  2.78%)    139109.75 (  9.98%)
Mean   14    123400.00 (  0.00%)    117929.75 ( -4.43%)    105558.25 (-14.46%)    163781.25 ( 32.72%)    126340.75 (  2.38%)    137883.00 ( 11.74%)
Mean   15    122139.50 (  0.00%)    122404.25 (  0.22%)    102829.25 (-15.81%)    160800.25 ( 31.65%)    128612.75 (  5.30%)    136624.00 ( 11.86%)
Mean   16    116413.50 (  0.00%)    124573.50 (  7.01%)    100475.75 (-13.69%)    160882.75 ( 38.20%)    117793.75 (  1.19%)    134005.75 ( 15.11%)
Mean   17    117263.25 (  0.00%)    121937.25 (  3.99%)     97237.75 (-17.08%)    159069.75 ( 35.65%)    121991.75 (  4.03%)    133444.50 ( 13.80%)
Mean   18    117277.00 (  0.00%)    116633.75 ( -0.55%)     96547.00 (-17.68%)    158694.75 ( 35.32%)    119089.75 (  1.55%)    129650.75 ( 10.55%)
Mean   19    113231.00 (  0.00%)    111035.75 ( -1.94%)     97683.00 (-13.73%)    155563.25 ( 37.39%)    119699.75 (  5.71%)    123403.25 (  8.98%)
Mean   20    113628.75 (  0.00%)    113451.25 ( -0.16%)     96311.75 (-15.24%)    154779.75 ( 36.22%)    118400.75 (  4.20%)    126041.25 ( 10.92%)
Mean   21    110982.50 (  0.00%)    107660.50 ( -2.99%)     93732.50 (-15.54%)    151147.25 ( 36.19%)    115663.25 (  4.22%)    121906.50 (  9.84%)
Mean   22    107660.25 (  0.00%)    104771.50 ( -2.68%)     91888.75 (-14.65%)    151180.50 ( 40.42%)    111038.00 (  3.14%)    125519.00 ( 16.59%)
Mean   23    105320.50 (  0.00%)     88275.25 (-16.18%)     91594.75 (-13.03%)    147032.00 ( 39.60%)    112817.50 (  7.12%)    124148.25 ( 17.88%)
Mean   24    110900.50 (  0.00%)     85169.00 (-23.20%)     87782.75 (-20.85%)    147407.00 ( 32.92%)    109556.50 ( -1.21%)    122544.00 ( 10.50%)
Stddev 1        720.83 (  0.00%)       982.31 (-36.28%)      1738.11 (-141.13%)       942.80 (-30.79%)      1170.23 (-62.35%)       539.84 ( 25.11%)
Stddev 2        466.00 (  0.00%)      1770.75 (-279.99%)       437.94 (  6.02%)      1327.32 (-184.83%)      1368.51 (-193.67%)      2103.32 (-351.35%)
Stddev 3        509.61 (  0.00%)      4849.62 (-851.63%)      1892.19 (-271.30%)      1803.72 (-253.94%)      1088.04 (-113.50%)       410.73 ( 19.40%)
Stddev 4       1750.10 (  0.00%)     10708.16 (-511.86%)      5762.55 (-229.27%)      2010.11 (-14.86%)      1456.90 ( 16.75%)      1370.22 ( 21.71%)
Stddev 5        700.05 (  0.00%)     16497.79 (-2256.66%)      4658.04 (-565.39%)      2354.70 (-236.36%)       759.38 ( -8.48%)      1869.54 (-167.06%)
Stddev 6       2259.33 (  0.00%)     24221.98 (-972.09%)      6618.94 (-192.96%)      1516.32 ( 32.89%)      1032.39 ( 54.31%)      1720.87 ( 23.83%)
Stddev 7       3390.99 (  0.00%)      4721.80 (-39.25%)      7337.14 (-116.37%)      2398.34 ( 29.27%)      2487.08 ( 26.66%)      4327.85 (-27.63%)
Stddev 8       7533.18 (  0.00%)      8609.90 (-14.29%)      9431.33 (-25.20%)      2895.55 ( 61.56%)      3902.53 ( 48.20%)      2536.68 ( 66.33%)
Stddev 9       9223.98 (  0.00%)     10731.70 (-16.35%)     10681.30 (-15.80%)      4726.23 ( 48.76%)      5673.20 ( 38.50%)      3377.59 ( 63.38%)
Stddev 10      4578.09 (  0.00%)     11136.27 (-143.25%)     12513.13 (-173.33%)      6705.48 (-46.47%)      5516.47 (-20.50%)      7227.58 (-57.87%)
Stddev 11      8201.30 (  0.00%)      3580.27 ( 56.35%)     18390.50 (-124.24%)     10915.90 (-33.10%)      4757.42 ( 41.99%)      4056.02 ( 50.54%)
Stddev 12      5713.70 (  0.00%)     13923.12 (-143.68%)     15228.05 (-166.52%)     16555.64 (-189.75%)      4573.05 ( 19.96%)      3678.89 ( 35.61%)
Stddev 13      5878.95 (  0.00%)     10471.09 (-78.11%)     14014.88 (-138.39%)     18628.01 (-216.86%)      1680.65 ( 71.41%)      3947.39 ( 32.86%)
Stddev 14      4783.95 (  0.00%)      4051.35 ( 15.31%)     13764.72 (-187.73%)     18324.63 (-283.04%)      2637.82 ( 44.86%)      4806.09 ( -0.46%)
Stddev 15      6281.48 (  0.00%)      3357.07 ( 46.56%)     11925.69 (-89.85%)     17654.58 (-181.06%)      2003.38 ( 68.11%)      3005.22 ( 52.16%)
Stddev 16      6948.12 (  0.00%)      3763.32 ( 45.84%)     13658.66 (-96.58%)     18280.52 (-163.10%)      3526.10 ( 49.25%)      3309.24 ( 52.37%)
Stddev 17      5603.77 (  0.00%)      1452.04 ( 74.09%)     12618.33 (-125.18%)     18230.53 (-225.33%)      1712.95 ( 69.43%)      3516.09 ( 37.25%)
Stddev 18      6200.90 (  0.00%)      1870.12 ( 69.84%)     11261.01 (-81.60%)     18486.73 (-198.13%)       751.36 ( 87.88%)      2412.60 ( 61.09%)
Stddev 19      6726.31 (  0.00%)      1045.21 ( 84.46%)     10748.09 (-59.79%)     18465.25 (-174.52%)      1750.49 ( 73.98%)      4482.82 ( 33.35%)
Stddev 20      5713.58 (  0.00%)      2066.90 ( 63.82%)     12195.08 (-113.44%)     19947.77 (-249.13%)      1892.91 ( 66.87%)      2612.62 ( 54.27%)
Stddev 21      4566.92 (  0.00%)      2460.40 ( 46.13%)     14089.14 (-208.50%)     21189.08 (-363.97%)      3639.75 ( 20.30%)      1963.17 ( 57.01%)
Stddev 22      6168.05 (  0.00%)      2770.81 ( 55.08%)     10037.19 (-62.73%)     20033.82 (-224.80%)      3682.20 ( 40.30%)      1159.17 ( 81.21%)
Stddev 23      6295.45 (  0.00%)      1337.32 ( 78.76%)     13290.13 (-111.11%)     22610.91 (-259.16%)      2013.53 ( 68.02%)      3842.61 ( 38.96%)
Stddev 24      3108.17 (  0.00%)      1381.20 ( 55.56%)     12637.15 (-306.58%)     21243.56 (-583.47%)      4044.16 (-30.11%)      2673.39 ( 13.99%)
TPut   1     125247.00 (  0.00%)    111752.00 (-10.77%)    118725.00 ( -5.21%)    125897.00 (  0.52%)    124448.00 ( -0.64%)    125126.00 ( -0.10%)
TPut   2     251891.00 (  0.00%)    207596.00 (-17.58%)    241612.00 ( -4.08%)    266616.00 (  5.85%)    251750.00 ( -0.06%)    249934.00 ( -0.78%)
TPut   3     365168.00 (  0.00%)    323632.00 (-11.37%)    346281.00 ( -5.17%)    388710.00 (  6.45%)    362662.00 ( -0.69%)    362668.00 ( -0.68%)
TPut   4     463075.00 (  0.00%)    397989.00 (-14.06%)    423929.00 ( -8.45%)    502384.00 (  8.49%)    467250.00 (  0.90%)    464774.00 (  0.37%)
TPut   5     548994.00 (  0.00%)    371351.00 (-32.36%)    462562.00 (-15.74%)    611181.00 ( 11.33%)    556151.00 (  1.30%)    556222.00 (  1.32%)
TPut   6     622114.00 (  0.00%)    422218.00 (-32.13%)    498459.00 (-19.88%)    709821.00 ( 14.10%)    623077.00 (  0.15%)    636518.00 (  2.32%)
TPut   7     626990.00 (  0.00%)    490329.00 (-21.80%)    532820.00 (-15.02%)    738315.00 ( 17.76%)    628413.00 (  0.23%)    652936.00 (  4.14%)
TPut   8     608278.00 (  0.00%)    489756.00 (-19.48%)    531757.00 (-12.58%)    746477.00 ( 22.72%)    630524.00 (  3.66%)    652311.00 (  7.24%)
TPut   9     586439.00 (  0.00%)    449640.00 (-23.33%)    494669.00 (-15.65%)    744660.00 ( 26.98%)    610244.00 (  4.06%)    638624.00 (  8.90%)
TPut   10    571276.00 (  0.00%)    445824.00 (-21.96%)    470436.00 (-17.65%)    730279.00 ( 27.83%)    581280.00 (  1.75%)    613657.00 (  7.42%)
TPut   11    513169.00 (  0.00%)    392108.00 (-23.59%)    449641.00 (-12.38%)    704419.00 ( 37.27%)    554398.00 (  8.03%)    588777.00 ( 14.73%)
TPut   12    515079.00 (  0.00%)    517878.00 (  0.54%)    426518.00 (-17.19%)    676012.00 ( 31.24%)    527979.00 (  2.50%)    560199.00 (  8.76%)
TPut   13    505954.00 (  0.00%)    440535.00 (-12.93%)    427513.00 (-15.50%)    650903.00 ( 28.65%)    520021.00 (  2.78%)    556439.00 (  9.98%)
TPut   14    493600.00 (  0.00%)    471719.00 ( -4.43%)    422233.00 (-14.46%)    655125.00 ( 32.72%)    505363.00 (  2.38%)    551532.00 ( 11.74%)
TPut   15    488558.00 (  0.00%)    489617.00 (  0.22%)    411317.00 (-15.81%)    643201.00 ( 31.65%)    514451.00 (  5.30%)    546496.00 ( 11.86%)
TPut   16    465654.00 (  0.00%)    498294.00 (  7.01%)    401903.00 (-13.69%)    643531.00 ( 38.20%)    471175.00 (  1.19%)    536023.00 ( 15.11%)
TPut   17    469053.00 (  0.00%)    487749.00 (  3.99%)    388951.00 (-17.08%)    636279.00 ( 35.65%)    487967.00 (  4.03%)    533778.00 ( 13.80%)
TPut   18    469108.00 (  0.00%)    466535.00 ( -0.55%)    386188.00 (-17.68%)    634779.00 ( 35.32%)    476359.00 (  1.55%)    518603.00 ( 10.55%)
TPut   19    452924.00 (  0.00%)    444143.00 ( -1.94%)    390732.00 (-13.73%)    622253.00 ( 37.39%)    478799.00 (  5.71%)    493613.00 (  8.98%)
TPut   20    454515.00 (  0.00%)    453805.00 ( -0.16%)    385247.00 (-15.24%)    619119.00 ( 36.22%)    473603.00 (  4.20%)    504165.00 ( 10.92%)
TPut   21    443930.00 (  0.00%)    430642.00 ( -2.99%)    374930.00 (-15.54%)    604589.00 ( 36.19%)    462653.00 (  4.22%)    487626.00 (  9.84%)
TPut   22    430641.00 (  0.00%)    419086.00 ( -2.68%)    367555.00 (-14.65%)    604722.00 ( 40.42%)    444152.00 (  3.14%)    502076.00 ( 16.59%)
TPut   23    421282.00 (  0.00%)    353101.00 (-16.18%)    366379.00 (-13.03%)    588128.00 ( 39.60%)    451270.00 (  7.12%)    496593.00 ( 17.88%)
TPut   24    443602.00 (  0.00%)    340676.00 (-23.20%)    351131.00 (-20.85%)    589628.00 ( 32.92%)    438226.00 ( -1.21%)    490176.00 ( 10.50%)

numacore is regressing heavily in this case. It's particularly weird for
numacore-20121202 as numacore should be effectively disabled. It's adding
overhead somewhere but not doing anything useful with it.

balancenuma gets about 1/3 of the performance gain of autonuma and the
migration scalability patches help quite a lot.

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7
                                  stats-v8r6          numacore-20121130          numacore-20121202         autonuma-v28fastr4           balancenuma-v9r2          balancenuma-v10r3
 Expctd Warehouse                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)
 Expctd Peak Bops               515079.00 (  0.00%)               517878.00 (  0.54%)               426518.00 (-17.19%)               676012.00 ( 31.24%)               527979.00 (  2.50%)               560199.00 (  8.76%)
 Actual Warehouse                    7.00 (  0.00%)                   12.00 ( 71.43%)                    7.00 (  0.00%)                    8.00 ( 14.29%)                    8.00 ( 14.29%)                    7.00 (  0.00%)
 Actual Peak Bops               626990.00 (  0.00%)               517878.00 (-17.40%)               532820.00 (-15.02%)               746477.00 ( 19.06%)               630524.00 (  0.56%)               652936.00 (  4.14%)
 SpecJBB Bops                   465685.00 (  0.00%)               447214.00 ( -3.97%)               392353.00 (-15.75%)               628328.00 ( 34.93%)               480925.00 (  3.27%)               521332.00 ( 11.95%)
 SpecJBB Bops/JVM               116421.00 (  0.00%)               111804.00 ( -3.97%)                98088.00 (-15.75%)               157082.00 ( 34.93%)               120231.00 (  3.27%)               130333.00 ( 11.95%)

numacore is regressing at the peak and in its overall specjbb score.

balancenuma again is getting some solid performance gains -- not as much as
autonuma but the objective was to be better than mainline, not necessarily
be the best overall. numacore or autonuma can be rebased on top of balancenuma.
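
The SPECJBB PEAKS rows can be cross-checked against the TPut rows of the
previous table. The Python sketch below does this for the stats-v8r6
column. The relationships it uses are inferred from the numbers in this
report and are not taken from the MMTests source: actual peak is the
largest TPut, expected peak is TPut at the expected warehouse (12), the
overall score is the mean TPut from the expected warehouse up to twice
that, and Bops/JVM divides the score by the 4 JVMs. summarise() is a
hypothetical helper name.

```python
# TPut for warehouses 1..24, stats-v8r6 column of the 4-node SpecJBB table
TPUT_BASELINE = [
    125247, 251891, 365168, 463075, 548994, 622114, 626990, 608278,
    586439, 571276, 513169, 515079, 505954, 493600, 488558, 465654,
    469053, 469108, 452924, 454515, 443930, 430641, 421282, 443602,
]


def summarise(tput, expected_warehouse, nr_jvms=4):
    """Derive the peak figures from a per-warehouse throughput list."""
    actual_bops = max(tput)
    actual_warehouse = tput.index(actual_bops) + 1   # warehouses are 1-based
    expected_bops = tput[expected_warehouse - 1]
    # Assumed scoring window: expected warehouse .. 2 * expected warehouse
    window = tput[expected_warehouse - 1:2 * expected_warehouse]
    score = round(sum(window) / len(window))
    return {
        "expected_bops": expected_bops,
        "actual_warehouse": actual_warehouse,
        "actual_bops": actual_bops,
        "score": score,
        "score_per_jvm": round(score / nr_jvms),
    }


print(summarise(TPUT_BASELINE, expected_warehouse=12))
```

For this column the sketch gives the same values as the table: 515079,
7, 626990, 465685 and 116421.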

MMTests Statistics: duration
                     3.7.0-rc7           3.7.0-rc6           3.7.0-rc7           3.7.0-rc7           3.7.0-rc7           3.7.0-rc7
                    stats-v8r6   numacore-20121130   numacore-20121202  autonuma-v28fastr4    balancenuma-v9r2   balancenuma-v10r3
User                 177835.94           171938.81           177810.87           177457.20           177445.71           177513.08
System                  166.79             5814.00              168.00              207.74              527.49              503.25
Elapsed                4037.12             4038.74             4030.32             4037.22             4035.76             4037.74

numacore's system CPU usage is very high. It's not high in 20121202 because it's mostly disabled.

As before, balancenuma's system CPU usage is higher than I'd like and the migration patches do not hurt.

SpecJBB Multiple JVMs, THP disabled
                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                     stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r3
Mean   1      26036.50 (  0.00%)     19595.00 (-24.74%)     24601.50 ( -5.51%)     24738.25 ( -4.99%)     25595.00 ( -1.70%)     25610.50 ( -1.64%)
Mean   2      53629.75 (  0.00%)     38481.50 (-28.25%)     52351.25 ( -2.38%)     55646.75 (  3.76%)     53045.25 ( -1.09%)     53383.00 ( -0.46%)
Mean   3      77385.00 (  0.00%)     53685.50 (-30.63%)     75993.00 ( -1.80%)     82714.75 (  6.89%)     76596.00 ( -1.02%)     76502.75 ( -1.14%)
Mean   4     100097.75 (  0.00%)     68253.50 (-31.81%)     92149.50 ( -7.94%)    107883.25 (  7.78%)     98618.00 ( -1.48%)     99786.50 ( -0.31%)
Mean   5     119012.75 (  0.00%)     74164.50 (-37.68%)    112056.00 ( -5.85%)    130260.25 (  9.45%)    119354.50 (  0.29%)    121741.75 (  2.29%)
Mean   6     137419.25 (  0.00%)     86158.50 (-37.30%)    133604.50 ( -2.78%)    154244.50 ( 12.24%)    136901.75 ( -0.38%)    136990.50 ( -0.31%)
Mean   7     138018.25 (  0.00%)     96059.25 (-30.40%)    136477.50 ( -1.12%)    159501.00 ( 15.57%)    138265.50 (  0.18%)    139398.75 (  1.00%)
Mean   8     136774.00 (  0.00%)     97003.50 (-29.08%)    137033.75 (  0.19%)    162868.00 ( 19.08%)    138554.50 (  1.30%)    137340.75 (  0.41%)
Mean   9     127966.50 (  0.00%)     95261.00 (-25.56%)    135496.00 (  5.88%)    163008.00 ( 27.38%)    137954.00 (  7.80%)    134200.50 (  4.87%)
Mean   10    124628.75 (  0.00%)     96202.25 (-22.81%)    128704.25 (  3.27%)    159696.50 ( 28.14%)    131322.25 (  5.37%)    126927.50 (  1.84%)
Mean   11    117269.00 (  0.00%)     95924.25 (-18.20%)    119718.50 (  2.09%)    154701.50 ( 31.92%)    125032.75 (  6.62%)    122925.00 (  4.82%)
Mean   12    111962.25 (  0.00%)     94247.25 (-15.82%)    115400.75 (  3.07%)    150936.50 ( 34.81%)    118119.50 (  5.50%)    119931.75 (  7.12%)
Mean   13    111595.50 (  0.00%)    106538.50 ( -4.53%)    110988.50 ( -0.54%)    147193.25 ( 31.90%)    116398.75 (  4.30%)    117349.75 (  5.16%)
Mean   14    110881.00 (  0.00%)    103549.00 ( -6.61%)    111549.00 (  0.60%)    144584.00 ( 30.40%)    114934.50 (  3.66%)    115838.25 (  4.47%)
Mean   15    109337.50 (  0.00%)    101729.00 ( -6.96%)    108927.25 ( -0.38%)    143333.00 ( 31.09%)    115523.75 (  5.66%)    115151.25 (  5.32%)
Mean   16    107031.75 (  0.00%)    101983.75 ( -4.72%)    106160.75 ( -0.81%)    141907.75 ( 32.58%)    113666.00 (  6.20%)    113673.50 (  6.21%)
Mean   17    105491.25 (  0.00%)    100205.75 ( -5.01%)    104268.75 ( -1.16%)    140691.00 ( 33.37%)    112751.50 (  6.88%)    113221.25 (  7.33%)
Mean   18    101102.75 (  0.00%)     96635.50 ( -4.42%)    104045.75 (  2.91%)    137784.25 ( 36.28%)    112582.50 ( 11.35%)    111533.50 ( 10.32%)
Mean   19    103907.25 (  0.00%)     94578.25 ( -8.98%)    102897.50 ( -0.97%)    135719.25 ( 30.62%)    110152.25 (  6.01%)    113959.25 (  9.67%)
Mean   20    100496.00 (  0.00%)     92683.75 ( -7.77%)     98143.50 ( -2.34%)    135264.25 ( 34.60%)    108861.50 (  8.32%)    113746.00 ( 13.18%)
Mean   21     99570.00 (  0.00%)     92955.75 ( -6.64%)     97375.00 ( -2.20%)    133891.00 ( 34.47%)    110094.00 ( 10.57%)    109462.50 (  9.94%)
Mean   22     98611.75 (  0.00%)     89781.75 ( -8.95%)     98287.00 ( -0.33%)    132399.75 ( 34.26%)    109322.75 ( 10.86%)    110502.75 ( 12.06%)
Mean   23     98173.00 (  0.00%)     88846.00 ( -9.50%)     98131.00 ( -0.04%)    130726.00 ( 33.16%)    106046.25 (  8.02%)    107304.25 (  9.30%)
Mean   24     92074.75 (  0.00%)     88581.00 ( -3.79%)     96459.75 (  4.76%)    127552.25 ( 38.53%)    102362.00 ( 11.17%)    107119.25 ( 16.34%)
Stddev 1        735.13 (  0.00%)       538.24 ( 26.78%)       973.28 (-32.40%)       121.08 ( 83.53%)       906.62 (-23.33%)       788.06 ( -7.20%)
Stddev 2        406.26 (  0.00%)      3458.87 (-751.39%)      1082.66 (-166.49%)       477.32 (-17.49%)      1322.57 (-225.55%)       468.57 (-15.34%)
Stddev 3        644.20 (  0.00%)      1360.89 (-111.25%)      1334.10 (-107.09%)       922.47 (-43.20%)       609.27 (  5.42%)       599.26 (  6.98%)
Stddev 4        743.93 (  0.00%)      2149.34 (-188.92%)      2267.12 (-204.75%)      1385.42 (-86.23%)      1119.02 (-50.42%)       801.13 ( -7.69%)
Stddev 5        898.53 (  0.00%)      2521.01 (-180.57%)      1948.30 (-116.83%)       763.24 ( 15.06%)       942.52 ( -4.90%)      1718.19 (-91.22%)
Stddev 6       1126.61 (  0.00%)      3818.22 (-238.91%)       917.32 ( 18.58%)      1527.03 (-35.54%)      2445.69 (-117.08%)      1754.32 (-55.72%)
Stddev 7       2907.61 (  0.00%)      4419.29 (-51.99%)      2486.28 ( 14.49%)      1536.66 ( 47.15%)      4881.65 (-67.89%)      4863.83 (-67.28%)
Stddev 8       3200.64 (  0.00%)       382.01 ( 88.06%)      5978.31 (-86.78%)      1228.09 ( 61.63%)      5459.06 (-70.56%)      5583.95 (-74.46%)
Stddev 9       2907.92 (  0.00%)      1813.39 ( 37.64%)      4583.53 (-57.62%)      1502.61 ( 48.33%)      2501.16 ( 13.99%)      2525.02 ( 13.17%)
Stddev 10      5093.23 (  0.00%)      1313.58 ( 74.21%)      8194.93 (-60.90%)      2763.19 ( 45.75%)      2973.78 ( 41.61%)      2005.95 ( 60.62%)
Stddev 11      4982.41 (  0.00%)      1163.02 ( 76.66%)      1899.45 ( 61.88%)      4776.28 (  4.14%)      6068.34 (-21.80%)      4256.77 ( 14.56%)
Stddev 12      3051.38 (  0.00%)      2117.59 ( 30.60%)      2404.89 ( 21.19%)      9252.59 (-203.23%)      3885.96 (-27.35%)      2580.44 ( 15.43%)
Stddev 13      2918.03 (  0.00%)      2252.11 ( 22.82%)      3889.75 (-33.30%)      9384.83 (-221.62%)      1833.07 ( 37.18%)      2523.28 ( 13.53%)
Stddev 14      3178.97 (  0.00%)      2337.49 ( 26.47%)      3612.00 (-13.62%)      9353.03 (-194.22%)      1072.60 ( 66.26%)      1140.55 ( 64.12%)
Stddev 15      2438.31 (  0.00%)      1707.72 ( 29.96%)      2925.87 (-20.00%)     10494.03 (-330.38%)      2295.76 (  5.85%)      1213.75 ( 50.22%)
Stddev 16      2682.25 (  0.00%)       840.47 ( 68.67%)      3118.36 (-16.26%)     10343.25 (-285.62%)      2416.09 (  9.92%)      1697.27 ( 36.72%)
Stddev 17      2807.66 (  0.00%)      1546.16 ( 44.93%)      3750.42 (-33.58%)     11446.15 (-307.68%)      2484.08 ( 11.52%)       563.50 ( 79.93%)
Stddev 18      3049.27 (  0.00%)       934.11 ( 69.37%)      3382.16 (-10.92%)     11779.80 (-286.31%)      1472.27 ( 51.72%)      1533.68 ( 49.70%)
Stddev 19      2782.65 (  0.00%)       735.28 ( 73.58%)      2853.22 ( -2.54%)     11416.35 (-310.27%)       514.78 ( 81.50%)      1283.38 ( 53.88%)
Stddev 20      2379.12 (  0.00%)       956.25 ( 59.81%)      2876.85 (-20.92%)     10511.63 (-341.83%)      1641.25 ( 31.01%)      1758.22 ( 26.10%)
Stddev 21      2975.22 (  0.00%)       438.31 ( 85.27%)      2627.61 ( 11.68%)     11292.91 (-279.57%)      1087.60 ( 63.44%)       434.51 ( 85.40%)
Stddev 22      2260.61 (  0.00%)       718.23 ( 68.23%)      2706.69 (-19.73%)     11993.84 (-430.56%)       909.16 ( 59.78%)       322.32 ( 85.74%)
Stddev 23      2900.85 (  0.00%)       275.47 ( 90.50%)      2348.16 ( 19.05%)     12234.80 (-321.77%)       701.39 ( 75.82%)      1444.19 ( 50.21%)
Stddev 24      2578.98 (  0.00%)       481.68 ( 81.32%)      3346.30 (-29.75%)     12769.61 (-395.14%)       732.56 ( 71.60%)      1777.60 ( 31.07%)
TPut   1     104146.00 (  0.00%)     78380.00 (-24.74%)     98406.00 ( -5.51%)     98953.00 ( -4.99%)    102380.00 ( -1.70%)    102442.00 ( -1.64%)
TPut   2     214519.00 (  0.00%)    153926.00 (-28.25%)    209405.00 ( -2.38%)    222587.00 (  3.76%)    212181.00 ( -1.09%)    213532.00 ( -0.46%)
TPut   3     309540.00 (  0.00%)    214742.00 (-30.63%)    303972.00 ( -1.80%)    330859.00 (  6.89%)    306384.00 ( -1.02%)    306011.00 ( -1.14%)
TPut   4     400391.00 (  0.00%)    273014.00 (-31.81%)    368598.00 ( -7.94%)    431533.00 (  7.78%)    394472.00 ( -1.48%)    399146.00 ( -0.31%)
TPut   5     476051.00 (  0.00%)    296658.00 (-37.68%)    448224.00 ( -5.85%)    521041.00 (  9.45%)    477418.00 (  0.29%)    486967.00 (  2.29%)
TPut   6     549677.00 (  0.00%)    344634.00 (-37.30%)    534418.00 ( -2.78%)    616978.00 ( 12.24%)    547607.00 ( -0.38%)    547962.00 ( -0.31%)
TPut   7     552073.00 (  0.00%)    384237.00 (-30.40%)    545910.00 ( -1.12%)    638004.00 ( 15.57%)    553062.00 (  0.18%)    557595.00 (  1.00%)
TPut   8     547096.00 (  0.00%)    388014.00 (-29.08%)    548135.00 (  0.19%)    651472.00 ( 19.08%)    554218.00 (  1.30%)    549363.00 (  0.41%)
TPut   9     511866.00 (  0.00%)    381044.00 (-25.56%)    541984.00 (  5.88%)    652032.00 ( 27.38%)    551816.00 (  7.80%)    536802.00 (  4.87%)
TPut   10    498515.00 (  0.00%)    384809.00 (-22.81%)    514817.00 (  3.27%)    638786.00 ( 28.14%)    525289.00 (  5.37%)    507710.00 (  1.84%)
TPut   11    469076.00 (  0.00%)    383697.00 (-18.20%)    478874.00 (  2.09%)    618806.00 ( 31.92%)    500131.00 (  6.62%)    491700.00 (  4.82%)
TPut   12    447849.00 (  0.00%)    376989.00 (-15.82%)    461603.00 (  3.07%)    603746.00 ( 34.81%)    472478.00 (  5.50%)    479727.00 (  7.12%)
TPut   13    446382.00 (  0.00%)    426154.00 ( -4.53%)    443954.00 ( -0.54%)    588773.00 ( 31.90%)    465595.00 (  4.30%)    469399.00 (  5.16%)
TPut   14    443524.00 (  0.00%)    414196.00 ( -6.61%)    446196.00 (  0.60%)    578336.00 ( 30.40%)    459738.00 (  3.66%)    463353.00 (  4.47%)
TPut   15    437350.00 (  0.00%)    406916.00 ( -6.96%)    435709.00 ( -0.38%)    573332.00 ( 31.09%)    462095.00 (  5.66%)    460605.00 (  5.32%)
TPut   16    428127.00 (  0.00%)    407935.00 ( -4.72%)    424643.00 ( -0.81%)    567631.00 ( 32.58%)    454664.00 (  6.20%)    454694.00 (  6.21%)
TPut   17    421965.00 (  0.00%)    400823.00 ( -5.01%)    417075.00 ( -1.16%)    562764.00 ( 33.37%)    451006.00 (  6.88%)    452885.00 (  7.33%)
TPut   18    404411.00 (  0.00%)    386542.00 ( -4.42%)    416183.00 (  2.91%)    551137.00 ( 36.28%)    450330.00 ( 11.35%)    446134.00 ( 10.32%)
TPut   19    415629.00 (  0.00%)    378313.00 ( -8.98%)    411590.00 ( -0.97%)    542877.00 ( 30.62%)    440609.00 (  6.01%)    455837.00 (  9.67%)
TPut   20    401984.00 (  0.00%)    370735.00 ( -7.77%)    392574.00 ( -2.34%)    541057.00 ( 34.60%)    435446.00 (  8.32%)    454984.00 ( 13.18%)
TPut   21    398280.00 (  0.00%)    371823.00 ( -6.64%)    389500.00 ( -2.20%)    535564.00 ( 34.47%)    440376.00 ( 10.57%)    437850.00 (  9.94%)
TPut   22    394447.00 (  0.00%)    359127.00 ( -8.95%)    393148.00 ( -0.33%)    529599.00 ( 34.26%)    437291.00 ( 10.86%)    442011.00 ( 12.06%)
TPut   23    392692.00 (  0.00%)    355384.00 ( -9.50%)    392524.00 ( -0.04%)    522904.00 ( 33.16%)    424185.00 (  8.02%)    429217.00 (  9.30%)
TPut   24    368299.00 (  0.00%)    354324.00 ( -3.79%)    385839.00 (  4.76%)    510209.00 ( 38.53%)    409448.00 ( 11.17%)    428477.00 ( 16.34%)

As before, numacore is regressing, autonuma does best and balancenuma does
all right, with the migration patches helping a little.
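
For anyone checking the tables by hand: the bracketed percentages appear to
be the difference relative to the stats-v8r6 baseline, with the sign flipped
for Stddev rows so that a positive figure always means "better". That is
inferred from the numbers in the table, not taken from the MMTests source,
and the helper name pct_change is made up for illustration. A minimal sketch:

```python
def pct_change(baseline: float, value: float, higher_is_better: bool = True) -> float:
    """Relative difference to the baseline in percent; positive means 'better'.

    Assumption: inferred from the table rows, not from the MMTests scripts.
    """
    delta = (value - baseline) if higher_is_better else (baseline - value)
    return round(delta / baseline * 100.0, 2)


# TPut 1, stats-v8r6 vs numacore-20121130: higher throughput is better.
tput1 = pct_change(104146.00, 78380.00)

# Stddev 5, stats-v8r6 vs numacore-20121130: lower deviation is better.
stddev5 = pct_change(898.53, 2521.01, higher_is_better=False)

print(tput1, stddev5)
```

Both values should agree with the corresponding table cells for those rows.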

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7
                                  stats-v8r6          numacore-20121130          numacore-20121202         autonuma-v28fastr4           balancenuma-v9r2          balancenuma-v10r3
 Expctd Warehouse                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)
 Expctd Peak Bops               447849.00 (  0.00%)               376989.00 (-15.82%)               461603.00 (  3.07%)               603746.00 ( 34.81%)               472478.00 (  5.50%)               479727.00 (  7.12%)
 Actual Warehouse                    7.00 (  0.00%)                   13.00 ( 85.71%)                    8.00 ( 14.29%)                    9.00 ( 28.57%)                    8.00 ( 14.29%)                    7.00 (  0.00%)
 Actual Peak Bops               552073.00 (  0.00%)               426154.00 (-22.81%)               548135.00 ( -0.71%)               652032.00 ( 18.11%)               554218.00 (  0.39%)               557595.00 (  1.00%)
 SpecJBB Bops                   415458.00 (  0.00%)               385328.00 ( -7.25%)               416195.00 (  0.18%)               554456.00 ( 33.46%)               446405.00 (  7.45%)               451937.00 (  8.78%)
 SpecJBB Bops/JVM               103865.00 (  0.00%)                96332.00 ( -7.25%)               104049.00 (  0.18%)               138614.00 ( 33.46%)               111601.00 (  7.45%)               112984.00 (  8.78%)

Same conclusions.

numacore regresses, autonuma is best, balancenuma does all right with the migration scalability patches helping a little.
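
The peak rows can be cross-checked against the TPut table. The sketch below
assumes "Expctd Peak Bops" is the throughput at the expected warehouse count
and "Actual Peak Bops" is simply the highest throughput seen at any warehouse
count. That reading is inferred from the figures, and the function name peaks
is hypothetical. The data is the balancenuma-v10r3 column for warehouses 1-24:

```python
# balancenuma-v10r3 throughput for warehouses 1..24 (multiple JVMs, TPut rows).
TPUT_V10R3 = [
    102442, 213532, 306011, 399146, 486967, 547962, 557595, 549363,
    536802, 507710, 491700, 479727, 469399, 463353, 460605, 454694,
    452885, 446134, 455837, 454984, 437850, 442011, 429217, 428477,
]


def peaks(tput, expected_warehouse):
    """Return (expected peak bops, actual peak warehouse, actual peak bops).

    Assumption: warehouses are numbered from 1 and tput[i] is warehouse i + 1.
    """
    expected_peak = tput[expected_warehouse - 1]
    actual_peak = max(tput)
    actual_warehouse = tput.index(actual_peak) + 1
    return expected_peak, actual_warehouse, actual_peak


# Expected warehouse of 12 is taken from the "Expctd Warehouse" row.
print(peaks(TPUT_V10R3, 12))
```

For this column the result lines up with the Expctd Peak Bops, Actual
Warehouse and Actual Peak Bops entries in the peaks table.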

SpecJBB, Single JVM, THP is enabled
                    3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                   stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r3
TPut 1      25550.00 (  0.00%)     25491.00 ( -0.23%)     26438.00 (  3.48%)     24233.00 ( -5.15%)     24913.00 ( -2.49%)     26480.00 (  3.64%)
TPut 2      55943.00 (  0.00%)     51630.00 ( -7.71%)     57004.00 (  1.90%)     55312.00 ( -1.13%)     55042.00 ( -1.61%)     56920.00 (  1.75%)
TPut 3      87707.00 (  0.00%)     74497.00 (-15.06%)     88852.00 (  1.31%)     88569.00 (  0.98%)     86135.00 ( -1.79%)     88608.00 (  1.03%)
TPut 4     117911.00 (  0.00%)     98435.00 (-16.52%)    104955.00 (-10.99%)    118561.00 (  0.55%)    117486.00 ( -0.36%)    117953.00 (  0.04%)
TPut 5     143285.00 (  0.00%)    133964.00 ( -6.51%)    126238.00 (-11.90%)    145703.00 (  1.69%)    142821.00 ( -0.32%)    144926.00 (  1.15%)
TPut 6     171208.00 (  0.00%)    152795.00 (-10.75%)    160028.00 ( -6.53%)    171006.00 ( -0.12%)    170635.00 ( -0.33%)    169394.00 ( -1.06%)
TPut 7     195635.00 (  0.00%)    162517.00 (-16.93%)    172973.00 (-11.58%)    198699.00 (  1.57%)    196108.00 (  0.24%)    196491.00 (  0.44%)
TPut 8     222655.00 (  0.00%)    168679.00 (-24.24%)    179260.00 (-19.49%)    224903.00 (  1.01%)    223494.00 (  0.38%)    225978.00 (  1.49%)
TPut 9     244787.00 (  0.00%)    193394.00 (-20.99%)    238823.00 ( -2.44%)    248313.00 (  1.44%)    251858.00 (  2.89%)    251569.00 (  2.77%)
TPut 10    271565.00 (  0.00%)    237987.00 (-12.36%)    247724.00 ( -8.78%)    272148.00 (  0.21%)    275869.00 (  1.58%)    279049.00 (  2.76%)
TPut 11    298270.00 (  0.00%)    207908.00 (-30.30%)    277513.00 ( -6.96%)    303749.00 (  1.84%)    301763.00 (  1.17%)    301399.00 (  1.05%)
TPut 12    320867.00 (  0.00%)    257937.00 (-19.61%)    281723.00 (-12.20%)    327808.00 (  2.16%)    329681.00 (  2.75%)    330506.00 (  3.00%)
TPut 13    343514.00 (  0.00%)    248474.00 (-27.67%)    301710.00 (-12.17%)    349080.00 (  1.62%)    340606.00 ( -0.85%)    350817.00 (  2.13%)
TPut 14    365321.00 (  0.00%)    298876.00 (-18.19%)    314066.00 (-14.03%)    370026.00 (  1.29%)    379939.00 (  4.00%)    361752.00 ( -0.98%)
TPut 15    377071.00 (  0.00%)    296562.00 (-21.35%)    334810.00 (-11.21%)    329847.00 (-12.52%)    395421.00 (  4.87%)    396091.00 (  5.04%)
TPut 16    404979.00 (  0.00%)    287964.00 (-28.89%)    347142.00 (-14.28%)    411066.00 (  1.50%)    420551.00 (  3.85%)    411673.00 (  1.65%)
TPut 17    420593.00 (  0.00%)    342590.00 (-18.55%)    352738.00 (-16.13%)    428242.00 (  1.82%)    437461.00 (  4.01%)    428270.00 (  1.83%)
TPut 18    440178.00 (  0.00%)    377508.00 (-14.24%)    344421.00 (-21.75%)    440392.00 (  0.05%)    455014.00 (  3.37%)    447671.00 (  1.70%)
TPut 19    448876.00 (  0.00%)    397727.00 (-11.39%)    367002.00 (-18.24%)    462036.00 (  2.93%)    479223.00 (  6.76%)    461881.00 (  2.90%)
TPut 20    460513.00 (  0.00%)    411831.00 (-10.57%)    370870.00 (-19.47%)    476437.00 (  3.46%)    493176.00 (  7.09%)    474824.00 (  3.11%)
TPut 21    474161.00 (  0.00%)    442153.00 ( -6.75%)    374835.00 (-20.95%)    487513.00 (  2.82%)    505246.00 (  6.56%)    468938.00 ( -1.10%)
TPut 22    474493.00 (  0.00%)    429921.00 ( -9.39%)    371022.00 (-21.81%)    487920.00 (  2.83%)    527360.00 ( 11.14%)    475208.00 (  0.15%)
TPut 23    489559.00 (  0.00%)    460354.00 ( -5.97%)    377444.00 (-22.90%)    508298.00 (  3.83%)    534820.00 (  9.25%)    490743.00 (  0.24%)
TPut 24    495378.00 (  0.00%)    486826.00 ( -1.73%)    376551.00 (-23.99%)    514403.00 (  3.84%)    545294.00 ( 10.08%)    493974.00 ( -0.28%)
TPut 25    491795.00 (  0.00%)    520474.00 (  5.83%)    370872.00 (-24.59%)    507373.00 (  3.17%)    543526.00 ( 10.52%)    489850.00 ( -0.40%)
TPut 26    490038.00 (  0.00%)    465587.00 ( -4.99%)    370093.00 (-24.48%)    376322.00 (-23.21%)    545175.00 ( 11.25%)    491352.00 (  0.27%)
TPut 27    491233.00 (  0.00%)    469764.00 ( -4.37%)    371915.00 (-24.29%)    366225.00 (-25.45%)    536927.00 (  9.30%)    489611.00 ( -0.33%)
TPut 28    489058.00 (  0.00%)    489561.00 (  0.10%)    364465.00 (-25.48%)    414027.00 (-15.34%)    543127.00 ( 11.06%)    473835.00 ( -3.11%)
TPut 29    471539.00 (  0.00%)    492496.00 (  4.44%)    353470.00 (-25.04%)    400529.00 (-15.06%)    541615.00 ( 14.86%)    486009.00 (  3.07%)
TPut 30    480343.00 (  0.00%)    488349.00 (  1.67%)    355023.00 (-26.09%)    405612.00 (-15.56%)    542904.00 ( 13.02%)    478384.00 ( -0.41%)
TPut 31    478109.00 (  0.00%)    460043.00 ( -3.78%)    352440.00 (-26.28%)    401471.00 (-16.03%)    529079.00 ( 10.66%)    466457.00 ( -2.44%)
TPut 32    475736.00 (  0.00%)    472007.00 ( -0.78%)    341509.00 (-28.21%)    401075.00 (-15.69%)    532423.00 ( 11.92%)    467866.00 ( -1.65%)
TPut 33    470758.00 (  0.00%)    474348.00 (  0.76%)    337127.00 (-28.39%)    399592.00 (-15.12%)    518811.00 ( 10.21%)    464764.00 ( -1.27%)
TPut 34    467304.00 (  0.00%)    475878.00 (  1.83%)    332477.00 (-28.85%)    394589.00 (-15.56%)    518334.00 ( 10.92%)    446719.00 ( -4.41%)
TPut 35    466391.00 (  0.00%)    487411.00 (  4.51%)    335639.00 (-28.03%)    382799.00 (-17.92%)    513591.00 ( 10.12%)    447071.00 ( -4.14%)
TPut 36    452722.00 (  0.00%)    478050.00 (  5.59%)    316889.00 (-30.00%)    381120.00 (-15.82%)    503801.00 ( 11.28%)    452243.00 ( -0.11%)
TPut 37    447878.00 (  0.00%)    478467.00 (  6.83%)    326939.00 (-27.00%)    382803.00 (-14.53%)    494555.00 ( 10.42%)    442751.00 ( -1.14%)
TPut 38    447907.00 (  0.00%)    455542.00 (  1.70%)    315719.00 (-29.51%)    341693.00 (-23.71%)    482758.00 (  7.78%)    444023.00 ( -0.87%)
TPut 39    428322.00 (  0.00%)    367921.00 (-14.10%)    310519.00 (-27.50%)    404210.00 ( -5.63%)    464550.00 (  8.46%)    440482.00 (  2.84%)
TPut 40    429157.00 (  0.00%)    394277.00 ( -8.13%)    302742.00 (-29.46%)    378554.00 (-11.79%)    467767.00 (  9.00%)    411807.00 ( -4.04%)
TPut 41    424339.00 (  0.00%)    415413.00 ( -2.10%)    304680.00 (-28.20%)    399220.00 ( -5.92%)    457669.00 (  7.85%)    428273.00 (  0.93%)
TPut 42    397440.00 (  0.00%)    421027.00 (  5.93%)    298298.00 (-24.95%)    372161.00 ( -6.36%)    458156.00 ( 15.28%)    422535.00 (  6.31%)
TPut 43    405391.00 (  0.00%)    433900.00 (  7.03%)    286294.00 (-29.38%)    383936.00 ( -5.29%)    438929.00 (  8.27%)    410196.00 (  1.19%)
TPut 44    400692.00 (  0.00%)    427504.00 (  6.69%)    282819.00 (-29.42%)    374757.00 ( -6.47%)    423538.00 (  5.70%)    399471.00 ( -0.30%)
TPut 45    399623.00 (  0.00%)    372622.00 ( -6.76%)    273593.00 (-31.54%)    379797.00 ( -4.96%)    407255.00 (  1.91%)    374068.00 ( -6.39%)
TPut 46    391920.00 (  0.00%)    351205.00 (-10.39%)    277380.00 (-29.23%)    368042.00 ( -6.09%)    411353.00 (  4.96%)    384363.00 ( -1.93%)
TPut 47    378199.00 (  0.00%)    358150.00 ( -5.30%)    273560.00 (-27.67%)    368744.00 ( -2.50%)    408739.00 (  8.08%)    385670.00 (  1.98%)
TPut 48    379346.00 (  0.00%)    387287.00 (  2.09%)    274168.00 (-27.73%)    373581.00 ( -1.52%)    423791.00 ( 11.72%)    380665.00 (  0.35%)
TPut 49    373614.00 (  0.00%)    395793.00 (  5.94%)    270794.00 (-27.52%)    372621.00 ( -0.27%)    423024.00 ( 13.22%)    377985.00 (  1.17%)
TPut 50    372494.00 (  0.00%)    366488.00 ( -1.61%)    271465.00 (-27.12%)    388778.00 (  4.37%)    410647.00 ( 10.24%)    378831.00 (  1.70%)
TPut 51    382195.00 (  0.00%)    381771.00 ( -0.11%)    272796.00 (-28.62%)    387687.00 (  1.44%)    423249.00 ( 10.74%)    402233.00 (  5.24%)
TPut 52    369118.00 (  0.00%)    429441.00 ( 16.34%)    272019.00 (-26.31%)    390226.00 (  5.72%)    410023.00 ( 11.08%)    396558.00 (  7.43%)
TPut 53    366453.00 (  0.00%)    445744.00 ( 21.64%)    267952.00 (-26.88%)    399257.00 (  8.95%)    405937.00 ( 10.77%)    383916.00 (  4.77%)
TPut 54    366571.00 (  0.00%)    375762.00 (  2.51%)    268229.00 (-26.83%)    395098.00 (  7.78%)    402220.00 (  9.72%)    395417.00 (  7.87%)
TPut 55    367580.00 (  0.00%)    336113.00 ( -8.56%)    267474.00 (-27.23%)    400550.00 (  8.97%)    420978.00 ( 14.53%)    398098.00 (  8.30%)
TPut 56    367056.00 (  0.00%)    375635.00 (  2.34%)    263577.00 (-28.19%)    385743.00 (  5.09%)    412685.00 ( 12.43%)    384029.00 (  4.62%)
TPut 57    359163.00 (  0.00%)    354001.00 ( -1.44%)    261130.00 (-27.29%)    389827.00 (  8.54%)    394688.00 (  9.89%)    381032.00 (  6.09%)
TPut 58    360552.00 (  0.00%)    353312.00 ( -2.01%)    261140.00 (-27.57%)    394099.00 (  9.30%)    388655.00 (  7.79%)    378132.00 (  4.88%)
TPut 59    354967.00 (  0.00%)    368534.00 (  3.82%)    262418.00 (-26.07%)    390746.00 ( 10.08%)    399086.00 ( 12.43%)    387101.00 (  9.05%)
TPut 60    362976.00 (  0.00%)    388472.00 (  7.02%)    267468.00 (-26.31%)    383073.00 (  5.54%)    399713.00 ( 10.12%)    390635.00 (  7.62%)
TPut 61    368072.00 (  0.00%)    399476.00 (  8.53%)    265659.00 (-27.82%)    380807.00 (  3.46%)    372060.00 (  1.08%)    383187.00 (  4.11%)
TPut 62    356938.00 (  0.00%)    385648.00 (  8.04%)    253107.00 (-29.09%)    387736.00 (  8.63%)    377183.00 (  5.67%)    378484.00 (  6.04%)
TPut 63    357491.00 (  0.00%)    404325.00 ( 13.10%)    259404.00 (-27.44%)    396672.00 ( 10.96%)    384221.00 (  7.48%)    378907.00 (  5.99%)
TPut 64    357322.00 (  0.00%)    389552.00 (  9.02%)    260333.00 (-27.14%)    386826.00 (  8.26%)    378601.00 (  5.96%)    369852.00 (  3.51%)
TPut 65    341262.00 (  0.00%)    394964.00 ( 15.74%)    258149.00 (-24.35%)    380271.00 ( 11.43%)    382896.00 ( 12.20%)    382897.00 ( 12.20%)
TPut 66    357807.00 (  0.00%)    384846.00 (  7.56%)    259279.00 (-27.54%)    362723.00 (  1.37%)    361530.00 (  1.04%)    380023.00 (  6.21%)
TPut 67    345092.00 (  0.00%)    376842.00 (  9.20%)    259350.00 (-24.85%)    364193.00 (  5.54%)    374449.00 (  8.51%)    373877.00 (  8.34%)
TPut 68    350334.00 (  0.00%)    358330.00 (  2.28%)    259332.00 (-25.98%)    359368.00 (  2.58%)    384920.00 (  9.87%)    381888.00 (  9.01%)
TPut 69    348372.00 (  0.00%)    356188.00 (  2.24%)    263076.00 (-24.48%)    364449.00 (  4.61%)    395611.00 ( 13.56%)    375892.00 (  7.90%)
TPut 70    335077.00 (  0.00%)    359313.00 (  7.23%)    259983.00 (-22.41%)    356418.00 (  6.37%)    375448.00 ( 12.05%)    372358.00 ( 11.13%)
TPut 71    341197.00 (  0.00%)    364168.00 (  6.73%)    254622.00 (-25.37%)    343847.00 (  0.78%)    376113.00 ( 10.23%)    384292.00 ( 12.63%)
TPut 72    345032.00 (  0.00%)    356934.00 (  3.45%)    261060.00 (-24.34%)    345007.00 ( -0.01%)    375313.00 (  8.78%)    381504.00 ( 10.57%)

numacore-20121130 (v17) did well but numacore-20121202 (v18) does not as it's effectively disabled.

autonuma does ok here and balancenuma does quite well. As before, the migration scalability patches help in some cases.

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7
                                  stats-v8r6          numacore-20121130          numacore-20121202         autonuma-v28fastr4           balancenuma-v9r2          balancenuma-v10r3
 Expctd Warehouse                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)
 Expctd Peak Bops               379346.00 (  0.00%)               387287.00 (  2.09%)               274168.00 (-27.73%)               373581.00 ( -1.52%)               423791.00 ( 11.72%)               380665.00 (  0.35%)
 Actual Warehouse                   24.00 (  0.00%)                   25.00 (  4.17%)                   23.00 ( -4.17%)                   24.00 (  0.00%)                   24.00 (  0.00%)                   24.00 (  0.00%)
 Actual Peak Bops               495378.00 (  0.00%)               520474.00 (  5.07%)               377444.00 (-23.81%)               514403.00 (  3.84%)               545294.00 ( 10.08%)               493974.00 ( -0.28%)
 SpecJBB Bops                   183389.00 (  0.00%)               193652.00 (  5.60%)               134571.00 (-26.62%)               193461.00 (  5.49%)               201083.00 (  9.65%)               195465.00 (  6.58%)
 SpecJBB Bops/JVM               183389.00 (  0.00%)               193652.00 (  5.60%)               134571.00 (-26.62%)               193461.00 (  5.49%)               201083.00 (  9.65%)               195465.00 (  6.58%)

While the migration patches appear to help in some cases, note that overall
they are not a win here. The peak scores are hurt, as is the specjbb score.

MMTests Statistics: duration
                    3.7.0-rc7           3.7.0-rc6           3.7.0-rc7           3.7.0-rc7           3.7.0-rc7           3.7.0-rc7
                   stats-v8r6   numacore-20121130   numacore-20121202  autonuma-v28fastr4    balancenuma-v9r2   balancenuma-v10r3
User                316340.52           311420.23           317791.75           314589.64           316061.23           315584.37
System                 102.08             3067.27              102.89              352.70              428.76              450.71
Elapsed               7433.22             7436.63             7434.49             7434.74             7432.60             7435.03

Same comments about System CPU time. numacore v17 is very high, v18 is
low because it's disabled, balancenuma is higher than I'd like.

SpecJBB, Single JVM, THP disabled
                    3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                   stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r3
TPut 1      19861.00 (  0.00%)     18255.00 ( -8.09%)     21307.00 (  7.28%)     19636.00 ( -1.13%)     19838.00 ( -0.12%)     20650.00 (  3.97%)
TPut 2      47613.00 (  0.00%)     37136.00 (-22.00%)     47861.00 (  0.52%)     47153.00 ( -0.97%)     47481.00 ( -0.28%)     48199.00 (  1.23%)
TPut 3      72438.00 (  0.00%)     55692.00 (-23.12%)     72271.00 ( -0.23%)     69394.00 ( -4.20%)     72029.00 ( -0.56%)     72932.00 (  0.68%)
TPut 4      98455.00 (  0.00%)     81301.00 (-17.42%)     91079.00 ( -7.49%)     98577.00 (  0.12%)     98437.00 ( -0.02%)     99748.00 (  1.31%)
TPut 5     120831.00 (  0.00%)     89067.00 (-26.29%)    118381.00 ( -2.03%)    120805.00 ( -0.02%)    117218.00 ( -2.99%)    121254.00 (  0.35%)
TPut 6     140013.00 (  0.00%)    108349.00 (-22.62%)    141994.00 (  1.41%)    125079.00 (-10.67%)    139878.00 ( -0.10%)    145360.00 (  3.82%)
TPut 7     163553.00 (  0.00%)    116192.00 (-28.96%)    133084.00 (-18.63%)    164368.00 (  0.50%)    167133.00 (  2.19%)    169539.00 (  3.66%)
TPut 8     190148.00 (  0.00%)    125955.00 (-33.76%)    177239.00 ( -6.79%)    188906.00 ( -0.65%)    183058.00 ( -3.73%)    188936.00 ( -0.64%)
TPut 9     211343.00 (  0.00%)    144068.00 (-31.83%)    180903.00 (-14.40%)    206645.00 ( -2.22%)    205699.00 ( -2.67%)    217322.00 (  2.83%)
TPut 10    233190.00 (  0.00%)    148098.00 (-36.49%)    215595.00 ( -7.55%)    234533.00 (  0.58%)    233632.00 (  0.19%)    227292.00 ( -2.53%)
TPut 11    253333.00 (  0.00%)    146043.00 (-42.35%)    224514.00 (-11.38%)    254167.00 (  0.33%)    251938.00 ( -0.55%)    259924.00 (  2.60%)
TPut 12    270661.00 (  0.00%)    131739.00 (-51.33%)    245812.00 ( -9.18%)    271490.00 (  0.31%)    271393.00 (  0.27%)    272536.00 (  0.69%)
TPut 13    299807.00 (  0.00%)    169396.00 (-43.50%)    253075.00 (-15.59%)    299758.00 ( -0.02%)    270594.00 ( -9.74%)    299110.00 ( -0.23%)
TPut 14    319243.00 (  0.00%)    150705.00 (-52.79%)    256078.00 (-19.79%)    318481.00 ( -0.24%)    318566.00 ( -0.21%)    325133.00 (  1.84%)
TPut 15    339054.00 (  0.00%)    116872.00 (-65.53%)    268646.00 (-20.77%)    331534.00 ( -2.22%)    344672.00 (  1.66%)    318119.00 ( -6.17%)
TPut 16    354315.00 (  0.00%)    124346.00 (-64.91%)    291148.00 (-17.83%)    352600.00 ( -0.48%)    316761.00 (-10.60%)    364648.00 (  2.92%)
TPut 17    371306.00 (  0.00%)    118493.00 (-68.09%)    299399.00 (-19.37%)    368260.00 ( -0.82%)    328888.00 (-11.42%)    371088.00 ( -0.06%)
TPut 18    386361.00 (  0.00%)    138571.00 (-64.13%)    303185.00 (-21.53%)    374358.00 ( -3.11%)    356148.00 ( -7.82%)    399913.00 (  3.51%)
TPut 19    401827.00 (  0.00%)    118855.00 (-70.42%)    320630.00 (-20.21%)    399476.00 ( -0.59%)    393918.00 ( -1.97%)    405771.00 (  0.98%)
TPut 20    411130.00 (  0.00%)    144024.00 (-64.97%)    315391.00 (-23.29%)    407799.00 ( -0.81%)    377706.00 ( -8.13%)    406038.00 ( -1.24%)
TPut 21    425352.00 (  0.00%)    154264.00 (-63.73%)    326734.00 (-23.19%)    429226.00 (  0.91%)    431677.00 (  1.49%)    431583.00 (  1.46%)
TPut 22    438150.00 (  0.00%)    153892.00 (-64.88%)    329531.00 (-24.79%)    385827.00 (-11.94%)    440379.00 (  0.51%)    438861.00 (  0.16%)
TPut 23    438425.00 (  0.00%)    146506.00 (-66.58%)    336454.00 (-23.26%)    433963.00 ( -1.02%)    361427.00 (-17.56%)    445293.00 (  1.57%)
TPut 24    461598.00 (  0.00%)    138869.00 (-69.92%)    330113.00 (-28.48%)    439691.00 ( -4.75%)    471567.00 (  2.16%)    488259.00 (  5.78%)
TPut 25    459475.00 (  0.00%)    141698.00 (-69.16%)    333545.00 (-27.41%)    431373.00 ( -6.12%)    487921.00 (  6.19%)    447353.00 ( -2.64%)
TPut 26    452651.00 (  0.00%)    142844.00 (-68.44%)    325634.00 (-28.06%)    447517.00 ( -1.13%)    425336.00 ( -6.03%)    469793.00 (  3.79%)
TPut 27    450436.00 (  0.00%)    140870.00 (-68.73%)    324881.00 (-27.87%)    430805.00 ( -4.36%)    456114.00 (  1.26%)    461172.00 (  2.38%)
TPut 28    459770.00 (  0.00%)    143078.00 (-68.88%)    312547.00 (-32.02%)    432260.00 ( -5.98%)    478317.00 (  4.03%)    452144.00 ( -1.66%)
TPut 29    450347.00 (  0.00%)    142076.00 (-68.45%)    318785.00 (-29.21%)    440423.00 ( -2.20%)    388175.00 (-13.81%)    473273.00 (  5.09%)
TPut 30    449252.00 (  0.00%)    146900.00 (-67.30%)    310301.00 (-30.93%)    435082.00 ( -3.15%)    440795.00 ( -1.88%)    435189.00 ( -3.13%)
TPut 31    446802.00 (  0.00%)    148008.00 (-66.87%)    304119.00 (-31.93%)    418684.00 ( -6.29%)    417343.00 ( -6.59%)    437562.00 ( -2.07%)
TPut 32    439701.00 (  0.00%)    149591.00 (-65.98%)    297625.00 (-32.31%)    421866.00 ( -4.06%)    438719.00 ( -0.22%)    469763.00 (  6.84%)
TPut 33    434477.00 (  0.00%)    142801.00 (-67.13%)    293405.00 (-32.47%)    420631.00 ( -3.19%)    454673.00 (  4.65%)    451224.00 (  3.85%)
TPut 34    423014.00 (  0.00%)    152308.00 (-63.99%)    288639.00 (-31.77%)    415202.00 ( -1.85%)    415194.00 ( -1.85%)    446735.00 (  5.61%)
TPut 35    429012.00 (  0.00%)    154116.00 (-64.08%)    283797.00 (-33.85%)    402395.00 ( -6.20%)    425151.00 ( -0.90%)    434230.00 (  1.22%)
TPut 36    421097.00 (  0.00%)    157571.00 (-62.58%)    276038.00 (-34.45%)    404770.00 ( -3.88%)    430480.00 (  2.23%)    425324.00 (  1.00%)
TPut 37    414815.00 (  0.00%)    150771.00 (-63.65%)    272498.00 (-34.31%)    388842.00 ( -6.26%)    393351.00 ( -5.17%)    405824.00 ( -2.17%)
TPut 38    412361.00 (  0.00%)    157070.00 (-61.91%)    270972.00 (-34.29%)    398947.00 ( -3.25%)    401555.00 ( -2.62%)    432074.00 (  4.78%)
TPut 39    402234.00 (  0.00%)    161487.00 (-59.85%)    258636.00 (-35.70%)    382645.00 ( -4.87%)    423106.00 (  5.19%)    401091.00 ( -0.28%)
TPut 40    380278.00 (  0.00%)    165947.00 (-56.36%)    256492.00 (-32.55%)    394039.00 (  3.62%)    405371.00 (  6.60%)    410739.00 (  8.01%)
TPut 41    393204.00 (  0.00%)    160540.00 (-59.17%)    254896.00 (-35.17%)    385605.00 ( -1.93%)    403383.00 (  2.59%)    372466.00 ( -5.27%)
TPut 42    380622.00 (  0.00%)    151946.00 (-60.08%)    248167.00 (-34.80%)    374843.00 ( -1.52%)    380797.00 (  0.05%)    396227.00 (  4.10%)
TPut 43    371566.00 (  0.00%)    162369.00 (-56.30%)    238268.00 (-35.87%)    347951.00 ( -6.36%)    386765.00 (  4.09%)    345633.00 ( -6.98%)
TPut 44    365538.00 (  0.00%)    161127.00 (-55.92%)    239926.00 (-34.36%)    355070.00 ( -2.86%)    344701.00 ( -5.70%)    391276.00 (  7.04%)
TPut 45    359305.00 (  0.00%)    159062.00 (-55.73%)    237676.00 (-33.85%)    350973.00 ( -2.32%)    370666.00 (  3.16%)    331191.00 ( -7.82%)
TPut 46    343160.00 (  0.00%)    163889.00 (-52.24%)    231272.00 (-32.61%)    347960.00 (  1.40%)    380147.00 ( 10.78%)    323176.00 ( -5.82%)
TPut 47    346983.00 (  0.00%)    168666.00 (-51.39%)    228060.00 (-34.27%)    313612.00 ( -9.62%)    362189.00 (  4.38%)    343154.00 ( -1.10%)
TPut 48    338143.00 (  0.00%)    153448.00 (-54.62%)    224598.00 (-33.58%)    341809.00 (  1.08%)    365342.00 (  8.04%)    354348.00 (  4.79%)
TPut 49    333941.00 (  0.00%)    142784.00 (-57.24%)    224568.00 (-32.75%)    336174.00 (  0.67%)    371700.00 ( 11.31%)    353148.00 (  5.75%)
TPut 50    334001.00 (  0.00%)    135713.00 (-59.37%)    221381.00 (-33.72%)    322489.00 ( -3.45%)    367963.00 ( 10.17%)    355823.00 (  6.53%)
TPut 51    338310.00 (  0.00%)    133402.00 (-60.57%)    219870.00 (-35.01%)    354805.00 (  4.88%)    372592.00 ( 10.13%)    351194.00 (  3.81%)
TPut 52    322897.00 (  0.00%)    150293.00 (-53.45%)    217427.00 (-32.66%)    353169.00 (  9.38%)    363024.00 ( 12.43%)    344846.00 (  6.80%)
TPut 53    329801.00 (  0.00%)    160792.00 (-51.25%)    224019.00 (-32.07%)    353588.00 (  7.21%)    365359.00 ( 10.78%)    355499.00 (  7.79%)
TPut 54    336610.00 (  0.00%)    164696.00 (-51.07%)    214752.00 (-36.20%)    361189.00 (  7.30%)    377851.00 ( 12.25%)    363987.00 (  8.13%)
TPut 55    325920.00 (  0.00%)    172380.00 (-47.11%)    219529.00 (-32.64%)    365678.00 ( 12.20%)    375735.00 ( 15.28%)    363697.00 ( 11.59%)
TPut 56    318997.00 (  0.00%)    176071.00 (-44.80%)    218120.00 (-31.62%)    367048.00 ( 15.06%)    380588.00 ( 19.31%)    362614.00 ( 13.67%)
TPut 57    321776.00 (  0.00%)    174531.00 (-45.76%)    214685.00 (-33.28%)    341874.00 (  6.25%)    378996.00 ( 17.78%)    360366.00 ( 11.99%)
TPut 58    308532.00 (  0.00%)    174202.00 (-43.54%)    208226.00 (-32.51%)    348156.00 ( 12.84%)    361623.00 ( 17.21%)    369693.00 ( 19.82%)
TPut 59    318974.00 (  0.00%)    175343.00 (-45.03%)    214260.00 (-32.83%)    358252.00 ( 12.31%)    360457.00 ( 13.01%)    364556.00 ( 14.29%)
TPut 60    325465.00 (  0.00%)    173694.00 (-46.63%)    213290.00 (-34.47%)    360808.00 ( 10.86%)    362745.00 ( 11.45%)    354232.00 (  8.84%)
TPut 61    319151.00 (  0.00%)    172320.00 (-46.01%)    206197.00 (-35.39%)    350597.00 (  9.85%)    371277.00 ( 16.33%)    352478.00 ( 10.44%)
TPut 62    320837.00 (  0.00%)    172312.00 (-46.29%)    211186.00 (-34.18%)    359062.00 ( 11.91%)    361009.00 ( 12.52%)    352930.00 ( 10.00%)
TPut 63    318198.00 (  0.00%)    172297.00 (-45.85%)    215174.00 (-32.38%)    356137.00 ( 11.92%)    347637.00 (  9.25%)    335322.00 (  5.38%)
TPut 64    321438.00 (  0.00%)    171894.00 (-46.52%)    212493.00 (-33.89%)    347376.00 (  8.07%)    346756.00 (  7.88%)    351410.00 (  9.32%)
TPut 65    314482.00 (  0.00%)    169147.00 (-46.21%)    204809.00 (-34.87%)    351726.00 ( 11.84%)    357429.00 ( 13.66%)    351236.00 ( 11.69%)
TPut 66    316802.00 (  0.00%)    170234.00 (-46.26%)    199708.00 (-36.96%)    344548.00 (  8.76%)    362143.00 ( 14.31%)    347058.00 (  9.55%)
TPut 67    312139.00 (  0.00%)    168180.00 (-46.12%)    208644.00 (-33.16%)    329030.00 (  5.41%)    353305.00 ( 13.19%)    345903.00 ( 10.82%)
TPut 68    323918.00 (  0.00%)    168392.00 (-48.01%)    206120.00 (-36.37%)    319985.00 ( -1.21%)    344250.00 (  6.28%)    345703.00 (  6.73%)
TPut 69    307506.00 (  0.00%)    167082.00 (-45.67%)    204703.00 (-33.43%)    340673.00 ( 10.79%)    339346.00 ( 10.35%)    336071.00 (  9.29%)
TPut 70    306799.00 (  0.00%)    165764.00 (-45.97%)    201529.00 (-34.31%)    331678.00 (  8.11%)    349583.00 ( 13.95%)    341944.00 ( 11.46%)
TPut 71    304232.00 (  0.00%)    165289.00 (-45.67%)    203291.00 (-33.18%)    319824.00 (  5.13%)    335238.00 ( 10.19%)    343396.00 ( 12.87%)
TPut 72    301619.00 (  0.00%)    163909.00 (-45.66%)    203306.00 (-32.60%)    326875.00 (  8.37%)    345999.00 ( 14.71%)    343949.00 ( 14.03%)

numacore does really badly.

autonuma is ok and balancenuma is ok. Scalability patches do not help as much.

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7
                                  stats-v8r6          numacore-20121130          numacore-20121202         autonuma-v28fastr4           balancenuma-v9r2          balancenuma-v10r3
 Expctd Warehouse                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)
 Expctd Peak Bops               338143.00 (  0.00%)               153448.00 (-54.62%)               224598.00 (-33.58%)               341809.00 (  1.08%)               365342.00 (  8.04%)               354348.00 (  4.79%)
 Actual Warehouse                   24.00 (  0.00%)                   56.00 (133.33%)                   23.00 ( -4.17%)                   26.00 (  8.33%)                   25.00 (  4.17%)                   24.00 (  0.00%)
 Actual Peak Bops               461598.00 (  0.00%)               176071.00 (-61.86%)               336454.00 (-27.11%)               447517.00 ( -3.05%)               487921.00 (  5.70%)               488259.00 (  5.78%)
 SpecJBB Bops                   163683.00 (  0.00%)                83963.00 (-48.70%)               108406.00 (-33.77%)               176379.00 (  7.76%)               184040.00 ( 12.44%)               179621.00 (  9.74%)
 SpecJBB Bops/JVM               163683.00 (  0.00%)                83963.00 (-48.70%)               108406.00 (-33.77%)               176379.00 (  7.76%)               184040.00 ( 12.44%)               179621.00 (  9.74%)

balancenuma is doing reasonably well. It's interesting to note that
the scalability patches make little difference to the peak but there
is a difference in the specjbb score. It's very likely this is just
variance. balancenuma depends on "luck" in what decisions the scheduler
makes.

MMTests Statistics: duration
                     3.7.0-rc7           3.7.0-rc6           3.7.0-rc7           3.7.0-rc7           3.7.0-rc7           3.7.0-rc7
                    stats-v8r6   numacore-20121130   numacore-20121202  autonuma-v28fastr4    balancenuma-v9r2   balancenuma-v10r3
User                 316751.91           167098.56           318360.63           307598.67           309109.47           313644.48
System                   60.28           122511.08               59.60             4411.81             1820.70             2654.77
Elapsed                7434.08             7451.36             7436.99             7437.52             7438.28             7438.19

numacore's CPU usage is off the charts except in v18 where it's more or less disabled.

balancenuma has very high CPU usage too unfortunately.

Overall, balancenuma is not the known best kernel but it wasn't meant
to be. It was meant to be better than mainline, establish a performance
baseline and be something that either numacore or autonuma can be rebased
upon. I think it achieves that and is the best choice for 3.8.

 Documentation/kernel-parameters.txt  |    3 +
 arch/sh/mm/Kconfig                   |    1 +
 arch/x86/Kconfig                     |    2 +
 arch/x86/include/asm/pgtable.h       |   17 +-
 arch/x86/include/asm/pgtable_types.h |   20 ++
 arch/x86/mm/pgtable.c                |    8 +-
 include/asm-generic/pgtable.h        |  110 +++++++++++
 include/linux/huge_mm.h              |   16 +-
 include/linux/hugetlb.h              |    8 +-
 include/linux/mempolicy.h            |    8 +
 include/linux/migrate.h              |   47 ++++-
 include/linux/mm.h                   |   39 ++++
 include/linux/mm_types.h             |   31 ++++
 include/linux/mmzone.h               |   13 ++
 include/linux/rmap.h                 |   33 ++--
 include/linux/sched.h                |   27 +++
 include/linux/vm_event_item.h        |   12 +-
 include/linux/vmstat.h               |    8 +
 include/trace/events/migrate.h       |   51 +++++
 include/uapi/linux/mempolicy.h       |   15 +-
 init/Kconfig                         |   41 +++++
 kernel/fork.c                        |    3 +
 kernel/sched/core.c                  |   71 +++++--
 kernel/sched/fair.c                  |  227 +++++++++++++++++++++++
 kernel/sched/features.h              |   11 ++
 kernel/sched/sched.h                 |   12 ++
 kernel/sysctl.c                      |   45 ++++-
 mm/compaction.c                      |   15 +-
 mm/huge_memory.c                     |  105 ++++++++++-
 mm/hugetlb.c                         |   10 +-
 mm/internal.h                        |    7 +-
 mm/ksm.c                             |    6 +-
 mm/memcontrol.c                      |    7 +-
 mm/memory-failure.c                  |    7 +-
 mm/memory.c                          |  196 +++++++++++++++++++-
 mm/memory_hotplug.c                  |    3 +-
 mm/mempolicy.c                       |  283 +++++++++++++++++++++++++---
 mm/migrate.c                         |  337 +++++++++++++++++++++++++++++++++-
 mm/mmap.c                            |   10 +-
 mm/mprotect.c                        |  135 +++++++++++---
 mm/mremap.c                          |    2 +-
 mm/page_alloc.c                      |   10 +-
 mm/pgtable-generic.c                 |    9 +-
 mm/rmap.c                            |   66 +++----
 mm/vmstat.c                          |   16 +-
 45 files changed, 1932 insertions(+), 171 deletions(-)
 create mode 100644 include/trace/events/migrate.h

-- 
1.7.9.2



* [PATCH 01/49] x86: mm: only do a local tlb flush in ptep_set_access_flags()
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 02/49] x86: mm: drop TLB flush from ptep_set_access_flags Mel Gorman
                   ` (50 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

The function ptep_set_access_flags() is only ever invoked to set access
flags or add write permission on a PTE.  The write bit is only ever set
together with the dirty bit.

Because we only ever upgrade a PTE, it is safe to skip flushing entries on
remote TLBs. The worst that can happen is a spurious page fault on other
CPUs, which would flush that TLB entry.

Lazily letting another CPU incur a spurious page fault occasionally is
(much!) cheaper than aggressively flushing everybody else's TLB.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..be3bb46 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long)pgd);
 }
 
+/*
+ * Used to set accessed or dirty bits in the page table entries
+ * on other architectures. On x86, the accessed and dirty bits
+ * are tracked by hardware. However, do_wp_page calls this function
+ * to also make the pte writeable at the same time the dirty bit is
+ * set. In that case we do actually need to write the PTE.
+ */
 int ptep_set_access_flags(struct vm_area_struct *vma,
 			  unsigned long address, pte_t *ptep,
 			  pte_t entry, int dirty)
@@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		flush_tlb_page(vma, address);
+		__flush_tlb_one(address);
 	}
 
 	return changed;
-- 
1.7.9.2



* [PATCH 02/49] x86: mm: drop TLB flush from ptep_set_access_flags
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
  2012-12-07 10:23 ` [PATCH 01/49] x86: mm: only do a local tlb flush in ptep_set_access_flags() Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 03/49] mm,generic: only flush the local TLB in ptep_set_access_flags Mel Gorman
                   ` (49 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

Intel has an architectural guarantee that the TLB entry causing
a page fault gets invalidated automatically. This means
we should be able to drop the local TLB invalidation.

Because of the way other areas of the page fault code work,
chances are good that all x86 CPUs do this.  However, if
someone somewhere has an x86 CPU that does not invalidate
the TLB entry causing a page fault, this one-liner should
be easy to revert.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 arch/x86/mm/pgtable.c |    1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index be3bb46..7353de3 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -317,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		__flush_tlb_one(address);
 	}
 
 	return changed;
-- 
1.7.9.2



* [PATCH 03/49] mm,generic: only flush the local TLB in ptep_set_access_flags
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
  2012-12-07 10:23 ` [PATCH 01/49] x86: mm: only do a local tlb flush in ptep_set_access_flags() Mel Gorman
  2012-12-07 10:23 ` [PATCH 02/49] x86: mm: drop TLB flush from ptep_set_access_flags Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 04/49] x86/mm: Introduce pte_accessible() Mel Gorman
                   ` (48 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

The function ptep_set_access_flags is only ever used to upgrade
access permissions to a page. That means the only negative side
effect of not flushing remote TLBs is that other CPUs may incur
spurious page faults, if they happen to access the same address,
and still have a PTE with the old permissions cached in their
TLB.

Having another CPU maybe incur a spurious page fault is faster
than always incurring the cost of a remote TLB flush, so replace
the remote TLB flush with a purely local one.

This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.

In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault to actually flush the TLB entry.
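
A minimal user-space model of the argument above, assuming one cached
TLB entry per CPU. Every demo_* name is hypothetical and nothing here is
kernel API; invalidating the faulting CPU's entry stands in for what
flush_tlb_fix_spurious_fault() or the hardware is expected to do.

```c
#include <stdbool.h>

#define DEMO_NR_CPUS 2

struct demo_tlb_entry {
	bool valid;
	bool writable;
};

static bool demo_pte_writable;                        /* the in-memory PTE */
static struct demo_tlb_entry demo_tlb[DEMO_NR_CPUS];  /* one cached entry per CPU */

/* Load the current PTE permissions into this CPU's TLB. */
static void demo_tlb_fill(int cpu)
{
	demo_tlb[cpu].valid = true;
	demo_tlb[cpu].writable = demo_pte_writable;
}

/* Upgrade the PTE and invalidate only the local TLB entry. */
static void demo_upgrade_local(int cpu)
{
	demo_pte_writable = true;
	demo_tlb[cpu].valid = false;
}

/*
 * Attempt a write on @cpu. Returns the number of spurious faults taken
 * before the write succeeds, or -1 if the PTE itself forbids the write.
 */
static int demo_write(int cpu)
{
	int faults = 0;

	for (;;) {
		if (!demo_tlb[cpu].valid)
			demo_tlb_fill(cpu);
		if (demo_tlb[cpu].writable)
			return faults;
		if (!demo_pte_writable)
			return -1;              /* genuine fault, not spurious */
		faults++;
		demo_tlb[cpu].valid = false;    /* drop the stale local entry */
	}
}
```

In this model the CPU that performed the upgrade writes without
faulting, while a CPU holding the stale entry takes one spurious fault
and then proceeds.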

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 mm/pgtable-generic.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..d8397da 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -12,8 +12,8 @@
 
 #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 /*
- * Only sets the access flags (dirty, accessed, and
- * writable). Furthermore, we know it always gets set to a "more
+ * Only sets the access flags (dirty, accessed), as well as write 
+ * permission. Furthermore, we know it always gets set to a "more
  * permissive" setting, which allows most architectures to optimize
  * this. We return whether the PTE actually changed, which in turn
  * instructs the caller to do things like update__mmu_cache.  This
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	int changed = !pte_same(*ptep, entry);
 	if (changed) {
 		set_pte_at(vma->vm_mm, address, ptep, entry);
-		flush_tlb_page(vma, address);
+		flush_tlb_fix_spurious_fault(vma, address);
 	}
 	return changed;
 }
-- 
1.7.9.2



* [PATCH 04/49] x86/mm: Introduce pte_accessible()
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (2 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 03/49] mm,generic: only flush the local TLB in ptep_set_access_flags Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 05/49] mm: Only flush the TLB when clearing an accessible pte Mel Gorman
                   ` (47 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
the pte is associated with a page.

However, for TLB flushing purposes, we would like to know whether the pte
points to an actually accessible page.  This allows us to skip remote TLB
flushes for pages that are not actually accessible.

Fill in this method for x86 and provide a safe (but slower) method
on other architectures.
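
The distinction between the two predicates can be sketched in user
space as below. The bit positions and demo_* names are illustrative
assumptions only; the real flag values are defined by the x86 headers.

```c
#include <stdint.h>

/* Illustrative flag bits, not the kernel's definitions. */
#define DEMO_PAGE_PRESENT   (1u << 0)
#define DEMO_PAGE_PROTNONE  (1u << 8)

typedef uint32_t demo_pte_t;

/* True if the entry is associated with a page, including PROTNONE entries. */
static int demo_pte_present(demo_pte_t pte)
{
	return (pte & (DEMO_PAGE_PRESENT | DEMO_PAGE_PROTNONE)) != 0;
}

/* True only if the hardware could actually translate through the entry. */
static int demo_pte_accessible(demo_pte_t pte)
{
	return (pte & DEMO_PAGE_PRESENT) != 0;
}
```

A PROTNONE-only entry is "present" but not "accessible" here, which is
the case where a remote TLB flush can be skipped.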

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Fixed-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-66p11te4uj23gevgh4j987ip@git.kernel.org
[ Added Linus's review fixes. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h |    6 ++++++
 include/asm-generic/pgtable.h  |    4 ++++
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1f780d..5fe03aa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -407,6 +407,12 @@ static inline int pte_present(pte_t a)
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
+#define pte_accessible pte_accessible
+static inline int pte_accessible(pte_t a)
+{
+	return pte_flags(a) & _PAGE_PRESENT;
+}
+
 static inline int pte_hidden(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_HIDDEN;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b36ce40..48fc1dc 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #define move_pte(pte, prot, old_addr, new_addr)	(pte)
 #endif
 
+#ifndef pte_accessible
+# define pte_accessible(pte)		((void)(pte),1)
+#endif
+
 #ifndef flush_tlb_fix_spurious_fault
 #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
 #endif
-- 
1.7.9.2



* [PATCH 05/49] mm: Only flush the TLB when clearing an accessible pte
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (3 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 04/49] x86/mm: Introduce pte_accessible() Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 06/49] mm: Count the number of pages affected in change_protection() Mel Gorman
                   ` (46 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

If ptep_clear_flush() is called to clear a page table entry that is
not accessible by the CPU anyway, e.g. a _PAGE_PROTNONE page table
entry, there is no need to flush the TLB on remote CPUs.
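
A rough user-space sketch of the idea, with a counter standing in for
the remote TLB flush. The demo_* names and flag bits are hypothetical
and only illustrate the control flow, not the kernel implementation.

```c
#include <stdint.h>

/* Illustrative flag bits, not the kernel's definitions. */
#define DEMO_PAGE_PRESENT   (1u << 0)
#define DEMO_PAGE_PROTNONE  (1u << 8)

typedef uint32_t demo_pte_t;

/* Counts how many (simulated) remote TLB flushes were issued. */
static unsigned long demo_remote_flushes;

static void demo_flush_tlb_page(void)
{
	demo_remote_flushes++;
}

/* Clear the entry and flush only if the old entry could have been cached. */
static demo_pte_t demo_ptep_clear_flush(demo_pte_t *ptep)
{
	demo_pte_t old = *ptep;

	*ptep = 0;
	if (old & DEMO_PAGE_PRESENT)
		demo_flush_tlb_page();
	return old;
}
```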

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-vm3rkzevahelwhejx5uwm8ex@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/pgtable-generic.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d8397da..0c8323f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -88,7 +88,8 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
 {
 	pte_t pte;
 	pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
-	flush_tlb_page(vma, address);
+	if (pte_accessible(pte))
+		flush_tlb_page(vma, address);
 	return pte;
 }
 #endif
-- 
1.7.9.2



* [PATCH 06/49] mm: Count the number of pages affected in change_protection()
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (4 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 05/49] mm: Only flush the TLB when clearing an accessible pte Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 07/49] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users Mel Gorman
                   ` (45 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

This will be used for three kinds of purposes:

 - to optimize mprotect()

 - to speed up working set scanning for working set areas that
   have not been touched

 - to more accurately scan per real working set

No change in functionality from this patch.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/hugetlb.h |    8 +++++--
 include/linux/mm.h      |    3 +++
 mm/hugetlb.c            |   10 ++++++--
 mm/mprotect.c           |   58 +++++++++++++++++++++++++++++++++++------------
 4 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2251648..06e691b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -87,7 +87,7 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
 				pud_t *pud, int write);
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot);
 
 #else /* !CONFIG_HUGETLB_PAGE */
@@ -132,7 +132,11 @@ static inline void copy_huge_page(struct page *dst, struct page *src)
 {
 }
 
-#define hugetlb_change_protection(vma, address, end, newprot)
+static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+		unsigned long address, unsigned long end, pgprot_t newprot)
+{
+	return 0;
+}
 
 static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			struct vm_area_struct *vma, unsigned long start,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bcaab4e..1856c62 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1078,6 +1078,9 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 extern unsigned long do_mremap(unsigned long addr,
 			       unsigned long old_len, unsigned long new_len,
 			       unsigned long flags, unsigned long new_addr);
+extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end, pgprot_t newprot,
+			      int dirty_accountable);
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 59a0059..712895e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3014,7 +3014,7 @@ same_page:
 	return i ? i : -EFAULT;
 }
 
-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot)
 {
 	struct mm_struct *mm = vma->vm_mm;
@@ -3022,6 +3022,7 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
+	unsigned long pages = 0;
 
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
@@ -3032,12 +3033,15 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
 		ptep = huge_pte_offset(mm, address);
 		if (!ptep)
 			continue;
-		if (huge_pmd_unshare(mm, &address, ptep))
+		if (huge_pmd_unshare(mm, &address, ptep)) {
+			pages++;
 			continue;
+		}
 		if (!huge_pte_none(huge_ptep_get(ptep))) {
 			pte = huge_ptep_get_and_clear(mm, address, ptep);
 			pte = pte_mkhuge(pte_modify(pte, newprot));
 			set_huge_pte_at(mm, address, ptep, pte);
+			pages++;
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
@@ -3049,6 +3053,8 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+
+	return pages << h->order;
 }
 
 int hugetlb_reserve_pages(struct inode *inode,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..1e265be 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -35,12 +35,13 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 }
 #endif
 
-static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
+	unsigned long pages = 0;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -60,6 +61,7 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				ptent = pte_mkwrite(ptent);
 
 			ptep_modify_prot_commit(mm, addr, pte, ptent);
+			pages++;
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 
@@ -72,18 +74,22 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				set_pte_at(mm, addr, pte,
 					swp_entry_to_pte(entry));
 			}
+			pages++;
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
+
+	return pages;
 }
 
-static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
 	pmd_t *pmd;
 	unsigned long next;
+	unsigned long pages = 0;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -91,35 +97,42 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				split_huge_page_pmd(vma->vm_mm, pmd);
-			else if (change_huge_pmd(vma, pmd, addr, newprot))
+			else if (change_huge_pmd(vma, pmd, addr, newprot)) {
+				pages += HPAGE_PMD_NR;
 				continue;
+			}
 			/* fall through */
 		}
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
+		pages += change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
 				 dirty_accountable);
 	} while (pmd++, addr = next, addr != end);
+
+	return pages;
 }
 
-static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
 	pud_t *pud;
 	unsigned long next;
+	unsigned long pages = 0;
 
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		change_pmd_range(vma, pud, addr, next, newprot,
+		pages += change_pmd_range(vma, pud, addr, next, newprot,
 				 dirty_accountable);
 	} while (pud++, addr = next, addr != end);
+
+	return pages;
 }
 
-static void change_protection(struct vm_area_struct *vma,
+static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
@@ -127,6 +140,7 @@ static void change_protection(struct vm_area_struct *vma,
 	pgd_t *pgd;
 	unsigned long next;
 	unsigned long start = addr;
+	unsigned long pages = 0;
 
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
@@ -135,10 +149,30 @@ static void change_protection(struct vm_area_struct *vma,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		change_pud_range(vma, pgd, addr, next, newprot,
+		pages += change_pud_range(vma, pgd, addr, next, newprot,
 				 dirty_accountable);
 	} while (pgd++, addr = next, addr != end);
+
 	flush_tlb_range(vma, start, end);
+
+	return pages;
+}
+
+unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+		       unsigned long end, pgprot_t newprot,
+		       int dirty_accountable)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long pages;
+
+	mmu_notifier_invalidate_range_start(mm, start, end);
+	if (is_vm_hugetlb_page(vma))
+		pages = hugetlb_change_protection(vma, start, end, newprot);
+	else
+		pages = change_protection_range(vma, start, end, newprot, dirty_accountable);
+	mmu_notifier_invalidate_range_end(mm, start, end);
+
+	return pages;
 }
 
 int
@@ -213,12 +247,8 @@ success:
 		dirty_accountable = 1;
 	}
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
-	if (is_vm_hugetlb_page(vma))
-		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
-	else
-		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	perf_event_mmap(vma);
-- 
1.7.9.2



* [PATCH 07/49] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (5 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 06/49] mm: Count the number of pages affected in change_protection() Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 08/49] mm: compaction: Move migration fail/success stats to migrate.c Mel Gorman
                   ` (44 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Ingo Molnar <mingo@kernel.org>

Reuse the NUMA code's 'modified page protections' count that
change_protection() computes and skip the TLB flush if there are
no changes to a range that sys_mprotect() modifies.

Given that mprotect() already optimizes the same-flags case
I expected this optimization to trigger predominantly on
CONFIG_NUMA_BALANCING=y kernels - but even with that feature
disabled it triggers rather often.

There are two reasons for that:

1)

sys_mprotect() already optimizes the same-flag case:

        if (newflags == oldflags) {
                *pprev = vma;
                return 0;
        }

This test works in many cases, but it is too sharp in some
others, where it differentiates between protection values that the
underlying PTE format makes no distinction about, such as
PROT_EXEC == PROT_READ on x86.

2)

Even where the vma flag change does necessitate a modification of
the pagetables, there might be no pagetables yet to modify: they
might not be instantiated yet.

During a regular desktop bootup this optimization hits a couple
of hundred times. During a Java test I measured thousands of
hits.

So this optimization improves sys_mprotect() in general, not just
CONFIG_NUMA_BALANCING=y kernels.

[ We could further increase the efficiency of this optimization if
  change_pte_range() and change_huge_pmd() were a bit smarter about
  recognizing exact-same-value protection masks - when the hardware
  can do that safely. This would probably further speed up mprotect(). ]
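
The effect can be sketched in user space as follows. This is a
simplified model, not the kernel implementation: the demo_* names are
hypothetical, an entry value of 0 stands for a pagetable entry that is
not instantiated, and newprot is assumed to be non-zero.

```c
#include <stddef.h>

/* Counts how many (simulated) range flushes were issued. */
static unsigned long demo_flush_count;

/*
 * Apply @newprot to every instantiated entry and return how many
 * entries were touched. The flush is skipped when nothing was touched.
 */
static unsigned long demo_change_protection(unsigned int *entries, size_t n,
					    unsigned int newprot)
{
	unsigned long pages = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (entries[i] == 0)
			continue;       /* nothing instantiated here */
		entries[i] = newprot;
		pages++;
	}

	/* Only flush if we actually modified any entries. */
	if (pages)
		demo_flush_count++;

	return pages;
}
```

A range with no instantiated entries returns 0 and issues no flush,
which corresponds to reason 2) above.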

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/mprotect.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1e265be..7c3628a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -153,7 +153,9 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 				 dirty_accountable);
 	} while (pgd++, addr = next, addr != end);
 
-	flush_tlb_range(vma, start, end);
+	/* Only flush the TLB if we actually modified any entries: */
+	if (pages)
+		flush_tlb_range(vma, start, end);
 
 	return pages;
 }
-- 
1.7.9.2



* [PATCH 08/49] mm: compaction: Move migration fail/success stats to migrate.c
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (6 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 07/49] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 09/49] mm: migrate: Add a tracepoint for migrate_pages Mel Gorman
                   ` (43 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

The compact_pages_moved and compact_pagemigrate_failed events are
convenient for determining if compaction is active and to what
degree migration is succeeding but they are at the wrong level. Other
users of migration may also want to know if migration is working
properly and this will be particularly true for any automated
NUMA migration. This patch moves the counters down to migration
with the new events called pgmigrate_success and pgmigrate_fail.
The compact_blocks_moved counter is removed because while it was
useful for debugging initially, it's worthless now as no meaningful
conclusions can be drawn from its value.
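
As a sketch of what counting at the migration level means, the model
below tallies per-page outcomes inside a stand-in for migrate_pages()
so that every caller shares the same counters. The demo_* names are
hypothetical, and retries and the real page lists are left out.

```c
#include <stddef.h>

enum demo_vm_event {
	DEMO_PGMIGRATE_SUCCESS,
	DEMO_PGMIGRATE_FAIL,
	DEMO_NR_EVENTS
};

/* Global event counters, shared by all users of migration. */
static unsigned long demo_vm_events[DEMO_NR_EVENTS];

/*
 * results[i] is the per-page outcome: 0 means migrated, anything else
 * means failed. Returns the number of pages that failed to migrate.
 */
static int demo_migrate_pages(const int *results, size_t n)
{
	int nr_failed = 0;
	int nr_succeeded = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (results[i] == 0)
			nr_succeeded++;
		else
			nr_failed++;
	}

	/* Account once per call, whoever the caller is. */
	if (nr_succeeded)
		demo_vm_events[DEMO_PGMIGRATE_SUCCESS] += nr_succeeded;
	if (nr_failed)
		demo_vm_events[DEMO_PGMIGRATE_FAIL] += nr_failed;

	return nr_failed;
}
```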

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vm_event_item.h |    4 +++-
 mm/compaction.c               |    4 ----
 mm/migrate.c                  |    6 ++++++
 mm/vmstat.c                   |    7 ++++---
 4 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..8aa7cb9 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,8 +38,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_MIGRATION
+		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
+#endif
 #ifdef CONFIG_COMPACTION
-		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #endif
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/compaction.c b/mm/compaction.c
index 9eef558..00ad883 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -994,10 +994,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
-		count_vm_event(COMPACTBLOCKS);
-		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
-		if (nr_remaining)
-			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
 		trace_mm_compaction_migratepages(nr_migrate - nr_remaining,
 						nr_remaining);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..04687f6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -962,6 +962,7 @@ int migrate_pages(struct list_head *from,
 {
 	int retry = 1;
 	int nr_failed = 0;
+	int nr_succeeded = 0;
 	int pass = 0;
 	struct page *page;
 	struct page *page2;
@@ -988,6 +989,7 @@ int migrate_pages(struct list_head *from,
 				retry++;
 				break;
 			case 0:
+				nr_succeeded++;
 				break;
 			default:
 				/* Permanent failure */
@@ -998,6 +1000,10 @@ int migrate_pages(struct list_head *from,
 	}
 	rc = 0;
 out:
+	if (nr_succeeded)
+		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
+	if (nr_failed)
+		count_vm_events(PGMIGRATE_FAIL, nr_failed);
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..89a7fd6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -774,10 +774,11 @@ const char * const vmstat_text[] = {
 
 	"pgrotated",
 
+#ifdef CONFIG_MIGRATION
+	"pgmigrate_success",
+	"pgmigrate_fail",
+#endif
 #ifdef CONFIG_COMPACTION
-	"compact_blocks_moved",
-	"compact_pages_moved",
-	"compact_pagemigrate_failed",
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
-- 
1.7.9.2



* [PATCH 09/49] mm: migrate: Add a tracepoint for migrate_pages
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (7 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 08/49] mm: compaction: Move migration fail/success stats to migrate.c Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 10/49] mm: compaction: Add scanned and isolated counters for compaction Mel Gorman
                   ` (42 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
about migration activity but not the type or the reason. This patch adds
a tracepoint to identify the type of page migration and why the page is
being migrated.
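
For anyone post-processing the trace output, the following is a rough
sketch of parsing the message part of an mm_migrate_pages event. It
assumes the input is only the text produced by the TP_printk() format in
this patch, with the ftrace line prefix (task, CPU, timestamp, event
name) already stripped. struct migrate_event and parse_migrate_event()
are hypothetical names used for illustration.

```c
#include <stdio.h>

/* Hypothetical userspace view of one mm_migrate_pages event */
struct migrate_event {
	unsigned long succeeded;
	unsigned long failed;
	char mode[32];		/* e.g. "MIGRATE_ASYNC" */
	char reason[32];	/* e.g. "compaction" */
};

/*
 * Parse "nr_succeeded=%lu nr_failed=%lu mode=%s reason=%s".
 * Returns 0 if all four fields were found, -1 otherwise.
 */
static int parse_migrate_event(const char *msg, struct migrate_event *ev)
{
	int n = sscanf(msg,
		       "nr_succeeded=%lu nr_failed=%lu mode=%31s reason=%31s",
		       &ev->succeeded, &ev->failed, ev->mode, ev->reason);

	return n == 4 ? 0 : -1;
}
```

The mode and reason strings are the symbolic names from the MIGRATE_MODE
and MIGRATE_REASON tables, so a script can group events by reason
without knowing the enum values.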

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/migrate.h        |   13 ++++++++--
 include/trace/events/migrate.h |   51 ++++++++++++++++++++++++++++++++++++++++
 mm/compaction.c                |    3 ++-
 mm/memory-failure.c            |    3 ++-
 mm/memory_hotplug.c            |    3 ++-
 mm/mempolicy.c                 |    6 +++--
 mm/migrate.c                   |   10 ++++++--
 mm/page_alloc.c                |    3 ++-
 8 files changed, 82 insertions(+), 10 deletions(-)
 create mode 100644 include/trace/events/migrate.h

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..9d1c159 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -7,6 +7,15 @@
 
 typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
+enum migrate_reason {
+	MR_COMPACTION,
+	MR_MEMORY_FAILURE,
+	MR_MEMORY_HOTPLUG,
+	MR_SYSCALL,		/* also applies to cpusets */
+	MR_MEMPOLICY_MBIND,
+	MR_CMA
+};
+
 #ifdef CONFIG_MIGRATION
 
 extern void putback_lru_pages(struct list_head *l);
@@ -14,7 +23,7 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
-			enum migrate_mode mode);
+			enum migrate_mode mode, int reason);
 extern int migrate_huge_page(struct page *, new_page_t x,
 			unsigned long private, bool offlining,
 			enum migrate_mode mode);
@@ -35,7 +44,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
-		enum migrate_mode mode) { return -ENOSYS; }
+		enum migrate_mode mode, int reason) { return -ENOSYS; }
 static inline int migrate_huge_page(struct page *page, new_page_t x,
 		unsigned long private, bool offlining,
 		enum migrate_mode mode) { return -ENOSYS; }
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
new file mode 100644
index 0000000..ec2a6cc
--- /dev/null
+++ b/include/trace/events/migrate.h
@@ -0,0 +1,51 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM migrate
+
+#if !defined(_TRACE_MIGRATE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MIGRATE_H
+
+#define MIGRATE_MODE						\
+	{MIGRATE_ASYNC,		"MIGRATE_ASYNC"},		\
+	{MIGRATE_SYNC_LIGHT,	"MIGRATE_SYNC_LIGHT"},		\
+	{MIGRATE_SYNC,		"MIGRATE_SYNC"}		
+
+#define MIGRATE_REASON						\
+	{MR_COMPACTION,		"compaction"},			\
+	{MR_MEMORY_FAILURE,	"memory_failure"},		\
+	{MR_MEMORY_HOTPLUG,	"memory_hotplug"},		\
+	{MR_SYSCALL,		"syscall_or_cpuset"},		\
+	{MR_MEMPOLICY_MBIND,	"mempolicy_mbind"},		\
+	{MR_CMA,		"cma"}
+
+TRACE_EVENT(mm_migrate_pages,
+
+	TP_PROTO(unsigned long succeeded, unsigned long failed,
+		 enum migrate_mode mode, int reason),
+
+	TP_ARGS(succeeded, failed, mode, reason),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,		succeeded)
+		__field(	unsigned long,		failed)
+		__field(	enum migrate_mode,	mode)
+		__field(	int,			reason)
+	),
+
+	TP_fast_assign(
+		__entry->succeeded	= succeeded;
+		__entry->failed		= failed;
+		__entry->mode		= mode;
+		__entry->reason		= reason;
+	),
+
+	TP_printk("nr_succeeded=%lu nr_failed=%lu mode=%s reason=%s",
+		__entry->succeeded,
+		__entry->failed,
+		__print_symbolic(__entry->mode, MIGRATE_MODE),
+		__print_symbolic(__entry->reason, MIGRATE_REASON))
+);
+
+#endif /* _TRACE_MIGRATE_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/compaction.c b/mm/compaction.c
index 00ad883..2c077a7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -990,7 +990,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		nr_migrate = cc->nr_migratepages;
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				(unsigned long)cc, false,
-				cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);
+				cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC,
+				MR_COMPACTION);
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6c5899b..ddb68a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1558,7 +1558,8 @@ int soft_offline_page(struct page *page, int flags)
 					    page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
-							false, MIGRATE_SYNC);
+							false, MIGRATE_SYNC,
+							MR_MEMORY_FAILURE);
 		if (ret) {
 			putback_lru_pages(&pagelist);
 			pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4eeaca..e598bd1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -812,7 +812,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		 * migrate_pages returns # of failed pages.
 		 */
 		ret = migrate_pages(&source, alloc_migrate_target, 0,
-							true, MIGRATE_SYNC);
+							true, MIGRATE_SYNC,
+							MR_MEMORY_HOTPLUG);
 		if (ret)
 			putback_lru_pages(&source);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..66e90ec 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -961,7 +961,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_node_page, dest,
-							false, MIGRATE_SYNC);
+							false, MIGRATE_SYNC,
+							MR_SYSCALL);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
@@ -1202,7 +1203,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 		if (!list_empty(&pagelist)) {
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma,
-						false, MIGRATE_SYNC);
+						false, MIGRATE_SYNC,
+						MR_MEMPOLICY_MBIND);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index 04687f6..27be9c9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -38,6 +38,9 @@
 
 #include <asm/tlbflush.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/migrate.h>
+
 #include "internal.h"
 
 /*
@@ -958,7 +961,7 @@ out:
  */
 int migrate_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
-		enum migrate_mode mode)
+		enum migrate_mode mode, int reason)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -1004,6 +1007,8 @@ out:
 		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
 	if (nr_failed)
 		count_vm_events(PGMIGRATE_FAIL, nr_failed);
+	trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
+
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
@@ -1145,7 +1150,8 @@ set_status:
 	err = 0;
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_page_node,
-				(unsigned long)pm, 0, MIGRATE_SYNC);
+				(unsigned long)pm, 0, MIGRATE_SYNC,
+				MR_SYSCALL);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7bb35ac..5953dc2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5707,7 +5707,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 
 		ret = migrate_pages(&cc->migratepages,
 				    alloc_migrate_target,
-				    0, false, MIGRATE_SYNC);
+				    0, false, MIGRATE_SYNC,
+				    MR_CMA);
 	}
 
 	putback_lru_pages(&cc->migratepages);
-- 
1.7.9.2



* [PATCH 10/49] mm: compaction: Add scanned and isolated counters for compaction
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (8 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 09/49] mm: migrate: Add a tracepoint for migrate_pages Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 11/49] mm: numa: define _PAGE_NUMA Mel Gorman
                   ` (41 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

Compaction already has tracepoints to count scanned and isolated pages
but it requires that ftrace be enabled and if that information has to be
written to disk then it can be disruptive. This patch adds vmstat counters
for compaction called compact_migrate_scanned, compact_free_scanned and
compact_isolated.

With these counters, it is possible to define a basic cost model for
compaction. This approximates how much work compaction is doing and can
be compared with an oprofile showing TLB misses to see if the cost of
compaction is being offset by THP, for example. Minimally a compaction patch
can be evaluated in terms of whether it increases or decreases cost. The
basic cost model looks like this

Fundamental unit u:	a word	sizeof(void *)

Ca  = cost of struct page access = sizeof(struct page) / u

Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
Cmf = Cost migrate failure   = Ca * 2
Ci  = Cost page isolation    = (Ca + Wi)
	where Wi is a constant that should reflect the approximate
	cost of the locking operation.

Csm = Cost migrate scanning = Ca
Csf = Cost free    scanning = Ca

Overall cost =	(Csm * compact_migrate_scanned) +
	      	(Csf * compact_free_scanned)    +
	      	(Ci  * compact_isolated)	+
		(Cmc * pgmigrate_success)	+
		(Cmf * pgmigrate_fail)

Where the values are read from /proc/vmstat.

This is very basic and ignores certain costs such as the allocation cost
to do a migrate page copy but any improvement to the model would still
use the same vmstat counters.
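
The model above can be sketched in userspace C as follows. This is only
an illustration under stated assumptions: the struct and function names
are hypothetical, and the caller has to supply sizeof(struct page),
PAGE_SIZE, the word size and the constant Wi, none of which are fixed by
this patch.

```c
/* Counter values as read from /proc/vmstat */
struct compaction_counters {
	unsigned long migrate_scanned;	/* compact_migrate_scanned */
	unsigned long free_scanned;	/* compact_free_scanned */
	unsigned long isolated;		/* compact_isolated */
	unsigned long migrate_success;	/* pgmigrate_success */
	unsigned long migrate_fail;	/* pgmigrate_fail */
};

/* Model parameters; every value here is chosen by the caller */
struct cost_params {
	unsigned long word_size;	/* u  = sizeof(void *) */
	unsigned long struct_page_size;	/* sizeof(struct page) */
	unsigned long page_size;	/* PAGE_SIZE */
	unsigned long wi;		/* Wi = assumed locking cost */
};

/* Overall cost in units of word accesses, following the model above */
static unsigned long long compaction_cost(const struct compaction_counters *c,
					  const struct cost_params *p)
{
	unsigned long long ca  = p->struct_page_size / p->word_size;
	unsigned long long cmc = (ca + p->page_size / p->word_size) * 2;
	unsigned long long cmf = ca * 2;
	unsigned long long ci  = ca + p->wi;
	unsigned long long csm = ca;
	unsigned long long csf = ca;

	return csm * c->migrate_scanned +
	       csf * c->free_scanned +
	       ci  * c->isolated +
	       cmc * c->migrate_success +
	       cmf * c->migrate_fail;
}
```

Comparing the result before and after a compaction change, with the same
parameters and workload, gives the relative increase or decrease in cost
that the model is meant to expose; the absolute number means little on
its own.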

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vm_event_item.h |    2 ++
 mm/compaction.c               |    8 ++++++++
 mm/vmstat.c                   |    3 +++
 3 files changed, 13 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8aa7cb9..a1f750b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -42,6 +42,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
 #endif
 #ifdef CONFIG_COMPACTION
+		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
+		COMPACTISOLATED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #endif
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/compaction.c b/mm/compaction.c
index 2c077a7..aee7443 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -356,6 +356,10 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 	if (blockpfn == end_pfn)
 		update_pageblock_skip(cc, valid_page, total_isolated, false);
 
+	count_vm_events(COMPACTFREE_SCANNED, nr_scanned);
+	if (total_isolated)
+		count_vm_events(COMPACTISOLATED, total_isolated);
+
 	return total_isolated;
 }
 
@@ -646,6 +650,10 @@ next_pageblock:
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
+	count_vm_events(COMPACTMIGRATE_SCANNED, nr_scanned);
+	if (nr_isolated)
+		count_vm_events(COMPACTISOLATED, nr_isolated);
+
 	return low_pfn;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 89a7fd6..3a067fa 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -779,6 +779,9 @@ const char * const vmstat_text[] = {
 	"pgmigrate_fail",
 #endif
 #ifdef CONFIG_COMPACTION
+	"compact_migrate_scanned",
+	"compact_free_scanned",
+	"compact_isolated",
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
-- 
1.7.9.2



* [PATCH 11/49] mm: numa: define _PAGE_NUMA
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (9 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 10/49] mm: compaction: Add scanned and isolated counters for compaction Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 12/49] mm: numa: pte_numa() and pmd_numa() Mel Gorman
                   ` (40 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
faults to identify the per NUMA node working set of the thread at
runtime.

Arming the NUMA hinting page fault mechanism works similarly to
setting up a mprotect(PROT_NONE) virtual range: the present bit is
cleared at the same time that _PAGE_NUMA is set, so when the fault
triggers we can identify it as a NUMA hinting page fault.

_PAGE_NUMA on x86 shares the same bit number as _PAGE_PROTNONE (but it
could also use a different bitflag; it's up to the architecture to
decide).

It would be confusing to refer to "NUMA hinting page faults" as
"do_prot_none faults". They're different events and _PAGE_NUMA doesn't
alter the semantics of mprotect(PROT_NONE) in any way.

Sharing the same bitflag with _PAGE_PROTNONE in fact complicates
things: it requires us to ensure the code paths executed by
_PAGE_PROTNONE remain mutually exclusive to the code paths executed
by _PAGE_NUMA at all times, to avoid _PAGE_NUMA and _PAGE_PROTNONE
stepping on each other's toes.

Because we want to be able to set this bitflag in any established pte
or pmd (while clearing the present bit at the same time) without
losing information, this bitflag must never be set when the pte and
pmd are present, so the bitflag picked for _PAGE_NUMA usage must not
be used by the swap entry format.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ec8a1fc..3c32db8 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -64,6 +64,26 @@
 #define _PAGE_FILE	(_AT(pteval_t, 1) << _PAGE_BIT_FILE)
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
+/*
+ * _PAGE_NUMA indicates that this page will trigger a numa hinting
+ * minor page fault to gather numa placement statistics (see
+ * pte_numa()). The bit picked (8) is within the range between
+ * _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't
+ * require changes to the swp entry format because that bit is always
+ * zero when the pte is not present.
+ *
+ * The bit picked must be always zero when the pmd is present and not
+ * present, so that we don't lose information when we set it while
+ * atomically clearing the present bit.
+ *
+ * Because we shared the same bit (8) with _PAGE_PROTNONE this can be
+ * interpreted as _PAGE_NUMA only in places that _PAGE_PROTNONE
+ * couldn't reach, like handle_mm_fault() (see access_error in
+ * arch/x86/mm/fault.c, the vma protection must not be PROT_NONE for
+ * handle_mm_fault() to be invoked).
+ */
+#define _PAGE_NUMA	_PAGE_PROTNONE
+
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |	\
 			 _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
-- 
1.7.9.2



* [PATCH 12/49] mm: numa: pte_numa() and pmd_numa()
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (10 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 11/49] mm: numa: define _PAGE_NUMA Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 13/49] mm: numa: Support NUMA hinting page faults from gup/gup_fast Mel Gorman
                   ` (39 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

Implement pte_numa and pmd_numa.

We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.

Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
a thread touches a virtual address in the corresponding virtual range,
a NUMA hinting page fault will trigger. The NUMA hinting page fault
will clear the NUMA bit and set the present bit again to resolve the
page fault.

The expectation is that a NUMA hinting page fault is used as part
of a placement policy that decides if a page should remain on the
current node or be migrated to a different node.
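
The arm/resolve cycle described above can be modelled in plain userspace
C. This is a toy sketch, not the kernel helpers: it works on a bare
64-bit value instead of pte_t, uses the x86 bit positions (present = 0,
accessed = 5, protnone/numa = 8), and the model_ prefixed names are
hypothetical.

```c
#include <stdint.h>

typedef uint64_t pteval_t;

#define PAGE_PRESENT	((pteval_t)1 << 0)
#define PAGE_ACCESSED	((pteval_t)1 << 5)
#define PAGE_PROTNONE	((pteval_t)1 << 8)
#define PAGE_NUMA	PAGE_PROTNONE	/* shared bit, as on x86 */

/* True only when the NUMA bit is set and the present bit is clear */
static int model_pte_numa(pteval_t pte)
{
	return (pte & (PAGE_NUMA | PAGE_PRESENT)) == PAGE_NUMA;
}

/* A NUMA-armed entry still counts as present for the page table walkers */
static int model_pte_present(pteval_t pte)
{
	return (pte & (PAGE_PRESENT | PAGE_PROTNONE | PAGE_NUMA)) != 0;
}

/* Arm the hinting fault: set NUMA, clear present, keep all other bits */
static pteval_t model_pte_mknuma(pteval_t pte)
{
	return (pte | PAGE_NUMA) & ~PAGE_PRESENT;
}

/* Resolve the hinting fault: clear NUMA, set present and accessed */
static pteval_t model_pte_mknonnuma(pteval_t pte)
{
	return (pte & ~PAGE_NUMA) | PAGE_PRESENT | PAGE_ACCESSED;
}
```

In the real kernel the set and clear must happen as one atomic update of
the page table entry; the model computes a new value only and says
nothing about how it is written back.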

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/include/asm/pgtable.h |   11 ++++-
 include/asm-generic/pgtable.h  |  106 ++++++++++++++++++++++++++++++++++++++++
 init/Kconfig                   |   33 +++++++++++++
 3 files changed, 148 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5fe03aa..9cd7b72 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -404,7 +404,8 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+			       _PAGE_NUMA);
 }
 
 #define pte_accessible pte_accessible
@@ -426,7 +427,8 @@ static inline int pmd_present(pmd_t pmd)
 	 * the _PAGE_PSE flag will remain set at all times while the
 	 * _PAGE_PRESENT bit is clear).
 	 */
-	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
+				 _PAGE_NUMA);
 }
 
 static inline int pmd_none(pmd_t pmd)
@@ -485,6 +487,11 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
+#ifdef CONFIG_BALANCE_NUMA
+	/* pmd_numa check */
+	if ((pmd_flags(pmd) & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)
+		return 0;
+#endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 48fc1dc..7ab6e63 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -558,6 +558,112 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifdef CONFIG_BALANCE_NUMA
+#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+/*
+ * _PAGE_NUMA works identically to _PAGE_PROTNONE (it's actually the
+ * same bit too). It's set only when _PAGE_PRESENT is not set and it's
+ * never set if _PAGE_PRESENT is set.
+ *
+ * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
+ * fault triggers on those regions if pte/pmd_numa returns true
+ * (because _PAGE_PRESENT is not set).
+ */
+#ifndef pte_numa
+static inline int pte_numa(pte_t pte)
+{
+	return (pte_flags(pte) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+#endif
+
+#ifndef pmd_numa
+static inline int pmd_numa(pmd_t pmd)
+{
+	return (pmd_flags(pmd) &
+		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+#endif
+
+/*
+ * pte/pmd_mknonnuma sets the _PAGE_ACCESSED bitflag automatically
+ * because they're called by the NUMA hinting minor page fault. If we
+ * didn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
+ * would be forced to set it later while filling the TLB after we
+ * return to userland. That would trigger a second write to memory
+ * that we optimize away by setting _PAGE_ACCESSED here.
+ */
+#ifndef pte_mknonnuma
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_NUMA);
+	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+#endif
+
+#ifndef pmd_mknonnuma
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
+	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+#endif
+
+#ifndef pte_mknuma
+static inline pte_t pte_mknuma(pte_t pte)
+{
+	pte = pte_set_flags(pte, _PAGE_NUMA);
+	return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+#endif
+
+#ifndef pmd_mknuma
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+	pmd = pmd_set_flags(pmd, _PAGE_NUMA);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
+#endif
+#else
+extern int pte_numa(pte_t pte);
+extern int pmd_numa(pmd_t pmd);
+extern pte_t pte_mknonnuma(pte_t pte);
+extern pmd_t pmd_mknonnuma(pmd_t pmd);
+extern pte_t pte_mknuma(pte_t pte);
+extern pmd_t pmd_mknuma(pmd_t pmd);
+#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
+#else
+static inline int pmd_numa(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline int pte_numa(pte_t pte)
+{
+	return 0;
+}
+
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+	return pte;
+}
+
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+	return pte;
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+	return pmd;
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 #endif /* CONFIG_MMU */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..6897a05 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -696,6 +696,39 @@ config LOG_BUF_SHIFT
 config HAVE_UNSTABLE_SCHED_CLOCK
 	bool
 
+#
+# For architectures that want to enable the support for NUMA-affine scheduler
+# balancing logic:
+#
+config ARCH_SUPPORTS_NUMA_BALANCING
+	bool
+
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config ARCH_WANT_NUMA_VARIABLE_LOCALITY
+	bool
+
+#
+# For architectures that are willing to define _PAGE_NUMA as _PAGE_PROTNONE
+config ARCH_WANTS_PROT_NUMA_PROT_NONE
+	bool
+
+config ARCH_USES_NUMA_PROT_NONE
+	bool
+	default y
+	depends on ARCH_WANTS_PROT_NUMA_PROT_NONE
+	depends on BALANCE_NUMA
+
+config BALANCE_NUMA
+	bool "Memory placement aware NUMA scheduler"
+	default n
+	depends on ARCH_SUPPORTS_NUMA_BALANCING
+	depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
+	depends on SMP && NUMA && MIGRATION
+	help
+	  This option adds support for automatic NUMA aware memory/task placement.
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
-- 
1.7.9.2



* [PATCH 13/49] mm: numa: Support NUMA hinting page faults from gup/gup_fast
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (11 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 12/49] mm: numa: pte_numa() and pmd_numa() Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 14/49] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte Mel Gorman
                   ` (38 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

Introduce FOLL_NUMA to tell follow_page to check
pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
so because it always invokes handle_mm_fault and retries the
follow_page later.

KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.

Other follow_page callers like KSM should not use FOLL_NUMA, or they
would fail to get the pages if they use follow_page instead of
get_user_pages.
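
The flag handling can be summarised with a small userspace sketch. The
FOLL_NUMA value is the one added by this patch; the FOLL_FORCE value is
an assumption about the surrounding mm.h, and both function names are
hypothetical. The FOLL_FORCE exclusion mirrors the comment in the
__get_user_pages() hunk of this patch.

```c
#define FOLL_FORCE	0x10	/* assumed value, not shown in this patch */
#define FOLL_NUMA	0x200	/* value introduced by this patch */

/*
 * What get_user_pages does to its flags: request NUMA hinting faults
 * unless FOLL_FORCE is set, since forced access may reach PROT_NONE
 * ranges that share the pte/pmd bit with _PAGE_NUMA.
 */
static unsigned int gup_add_numa_flag(unsigned int gup_flags)
{
	if (!(gup_flags & FOLL_FORCE))
		gup_flags |= FOLL_NUMA;
	return gup_flags;
}

/*
 * What follow_page decides for one entry: non-zero means "give up so
 * the caller can run handle_mm_fault and retry". entry_is_numa stands
 * in for pte_numa()/pmd_numa() on the entry being looked at.
 */
static int follow_needs_numa_fault(unsigned int flags, int entry_is_numa)
{
	return (flags & FOLL_NUMA) && entry_is_numa;
}
```

A direct follow_page caller such as KSM never passes FOLL_NUMA, so
follow_needs_numa_fault() is always zero for it and the page is returned
even when the entry is NUMA-armed.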

[ This patch was picked up from the AutoNUMA tree. ]

Originally-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ ported to this tree. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mm.h |    1 +
 mm/memory.c        |   17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1856c62..fa16152 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1572,6 +1572,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_MLOCK	0x40	/* mark page as mlocked */
 #define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
+#define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/memory.c b/mm/memory.c
index 221fc9f..73834e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1517,6 +1517,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
 		goto out;
 	}
+	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+		goto no_page_table;
 	if (pmd_trans_huge(*pmd)) {
 		if (flags & FOLL_SPLIT) {
 			split_huge_page_pmd(mm, pmd);
@@ -1546,6 +1548,8 @@ split_fallthrough:
 	pte = *ptep;
 	if (!pte_present(pte))
 		goto no_page;
+	if ((flags & FOLL_NUMA) && pte_numa(pte))
+		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
 
@@ -1697,6 +1701,19 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
 	vm_flags &= (gup_flags & FOLL_FORCE) ?
 			(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+
+	/*
+	 * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
+	 * would be called on PROT_NONE ranges. We must never invoke
+	 * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
+	 * page faults would unprotect the PROT_NONE ranges if
+	 * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
+	 * bitflag. So to avoid that, don't set FOLL_NUMA if
+	 * FOLL_FORCE is set.
+	 */
+	if (!(gup_flags & FOLL_FORCE))
+		gup_flags |= FOLL_NUMA;
+
 	i = 0;
 
 	do {
-- 
1.7.9.2



* [PATCH 14/49] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (12 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 13/49] mm: numa: Support NUMA hinting page faults from gup/gup_fast Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 15/49] mm: numa: Create basic numa page hinting infrastructure Mel Gorman
                   ` (37 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

When we split a transparent hugepage, transfer the NUMA type from the
pmd to the pte if needed.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/huge_memory.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..3aaf242 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1363,6 +1363,8 @@ static int __split_huge_page_map(struct page *page,
 				BUG_ON(page_mapcount(page) != 1);
 			if (!pmd_young(*pmd))
 				entry = pte_mkold(entry);
+			if (pmd_numa(*pmd))
+				entry = pte_mknuma(entry);
 			pte = pte_offset_map(&_pmd, haddr);
 			BUG_ON(!pte_none(*pte));
 			set_pte_at(mm, haddr, pte, entry);
-- 
1.7.9.2



* [PATCH 15/49] mm: numa: Create basic numa page hinting infrastructure
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (13 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 14/49] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 16/49] mm: mempolicy: Make MPOL_LOCAL a real policy Mel Gorman
                   ` (36 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

Note: This patch started as "mm/mpol: Create special PROT_NONE
	infrastructure" and preserves the basic idea but steals *very*
	heavily from "autonuma: numa hinting page faults entry points" for
	the actual fault handlers without the migration parts.	The end
	result is barely recognisable as either patch so all Signed-off
	and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
	this version, I will re-add the signed-offs-by to reflect the history.

In order to facilitate a lazy -- fault driven -- migration of pages, create
a special transient PAGE_NUMA variant. We can then use the 'spurious'
protection faults to drive our migrations.

The meaning of PAGE_NUMA depends on the architecture but on x86 it is
effectively PROT_NONE. Actual PROT_NONE mappings will not generate these
NUMA faults for the reason that the page fault code checks the permission on
the VMA (and will throw a segmentation fault on actual PROT_NONE mappings),
before it ever calls handle_mm_fault.
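
As a rough illustration of the ordering described above, here is a minimal
user-space model in C. It is not kernel code: model_vma, model_pte and
model_fault() are hypothetical names invented for this sketch, and the VMA
permission check is reduced to two booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
enum fault_result { FAULT_SIGSEGV, FAULT_NUMA_HINT, FAULT_NORMAL };

struct model_vma { bool readable; bool writable; };	/* VMA permissions */
struct model_pte { bool present; bool numa; };		/* PTE state */

static enum fault_result model_fault(const struct model_vma *vma,
				     struct model_pte *pte, bool write)
{
	assert(vma && pte);

	/*
	 * Step 1: the arch fault code checks the VMA permissions first.
	 * A real PROT_NONE mapping fails here and never reaches step 2.
	 */
	if (write ? !vma->writable : !vma->readable)
		return FAULT_SIGSEGV;

	/*
	 * Step 2: only now would handle_mm_fault() run. A NUMA hinting
	 * pte is made accessible again, analogous to pte_mknonnuma().
	 */
	if (pte->numa) {
		pte->numa = false;
		pte->present = true;
		return FAULT_NUMA_HINT;
	}

	return FAULT_NORMAL;
}
```

With a PROT_NONE VMA the model returns FAULT_SIGSEGV without touching the
pte. With a readable VMA the first access clears the hint and the next
access is an ordinary fault.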

[dhillf@gmail.com: Fix typo]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/huge_mm.h |   10 +++++
 mm/huge_memory.c        |   22 ++++++++++
 mm/memory.c             |  112 +++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 141 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..a1d26a9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -159,6 +159,10 @@ static inline struct page *compound_trans_head(struct page *page)
 	}
 	return page;
 }
+
+extern int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
+				  pmd_t pmd, pmd_t *pmdp);
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -195,6 +199,12 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
 {
 	return 0;
 }
+
+static inline int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
+					pmd_t pmd, pmd_t *pmdp)
+{
+}
+
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3aaf242..f1b2d63 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1017,6 +1017,28 @@ out:
 	return page;
 }
 
+/* NUMA hinting page fault entry point for trans huge pmds */
+int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
+				pmd_t pmd, pmd_t *pmdp)
+{
+	struct page *page;
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(pmd, *pmdp)))
+		goto out_unlock;
+
+	page = pmd_page(pmd);
+	pmd = pmd_mknonnuma(pmd);
+	set_pmd_at(mm, haddr, pmdp, pmd);
+	VM_BUG_ON(pmd_numa(*pmdp));
+	update_mmu_cache_pmd(vma, addr, pmdp);
+
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+	return 0;
+}
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 73834e7..4d005a3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3448,6 +3448,103 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+{
+	struct page *page;
+	spinlock_t *ptl;
+
+	/*
+	* The "pte" at this point cannot be used safely without
+	* validation through pte_unmap_same(). It's of NUMA type but
+	* the pfn may be screwed if the read is non atomic.
+	*
+	* ptep_modify_prot_start is not called as this is clearing
+	* the _PAGE_NUMA bit and it is not really expected that there
+	* would be concurrent hardware modifications to the PTE.
+	*/
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*ptep, pte)))
+		goto out_unlock;
+	pte = pte_mknonnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
+	update_mmu_cache(vma, addr, ptep);
+
+	page = vm_normal_page(vma, addr, pte);
+	if (!page) {
+		pte_unmap_unlock(ptep, ptl);
+		return 0;
+	}
+
+out_unlock:
+	pte_unmap_unlock(ptep, ptl);
+	return 0;
+}
+
+/* NUMA hinting page fault entry point for regular pmds */
+#ifdef CONFIG_BALANCE_NUMA
+static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t pmd;
+	pte_t *pte, *orig_pte;
+	unsigned long _addr = addr & PMD_MASK;
+	unsigned long offset;
+	spinlock_t *ptl;
+	bool numa = false;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = *pmdp;
+	if (pmd_numa(pmd)) {
+		set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
+		numa = true;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (!numa)
+		return 0;
+
+	/* we're in a page fault so some vma must be in the range */
+	BUG_ON(!vma);
+	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
+	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
+	VM_BUG_ON(offset >= PMD_SIZE);
+	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+	pte += offset >> PAGE_SHIFT;
+	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+		pte_t pteval = *pte;
+		struct page *page;
+		if (!pte_present(pteval))
+			continue;
+		if (!pte_numa(pteval))
+			continue;
+		if (addr >= vma->vm_end) {
+			vma = find_vma(mm, addr);
+			/* there's a pte present so there must be a vma */
+			BUG_ON(!vma);
+			BUG_ON(addr < vma->vm_start);
+		}
+		if (pte_numa(pteval)) {
+			pteval = pte_mknonnuma(pteval);
+			set_pte_at(mm, addr, pte, pteval);
+		}
+		page = vm_normal_page(vma, addr, pteval);
+		if (unlikely(!page))
+			continue;
+	}
+	pte_unmap_unlock(orig_pte, ptl);
+
+	return 0;
+}
+#else
+static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long addr, pmd_t *pmdp)
+{
+	BUG();
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3486,6 +3583,9 @@ int handle_pte_fault(struct mm_struct *mm,
 					pte, pmd, flags, entry);
 	}
 
+	if (pte_numa(entry))
+		return do_numa_page(mm, vma, address, entry, pte, pmd);
+
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
@@ -3554,9 +3654,11 @@ retry:
 
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
-			if (flags & FAULT_FLAG_WRITE &&
-			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd)) {
+			if (pmd_numa(*pmd))
+				return do_huge_pmd_numa_page(mm, address,
+							     orig_pmd, pmd);
+
+			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
 				/*
@@ -3568,10 +3670,14 @@ retry:
 					goto retry;
 				return ret;
 			}
+
 			return 0;
 		}
 	}
 
+	if (pmd_numa(*pmd))
+		return do_pmd_numa_page(mm, vma, address, pmd);
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could
-- 
1.7.9.2



* [PATCH 16/49] mm: mempolicy: Make MPOL_LOCAL a real policy
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (14 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 15/49] mm: numa: Create basic numa page hinting infrastructure Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 17/49] mm: mempolicy: Add MPOL_NOOP Mel Gorman
                   ` (35 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Make MPOL_LOCAL a real and exposed policy such that applications that
relied on the previous default behaviour can explicitly request it.
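
The mpol_new() hunk in this patch accepts MPOL_LOCAL only with an empty
nodemask and then stores it as MPOL_PREFERRED. A minimal stand-alone
sketch of that validation follows. The model_* names and the plain
unsigned long nodemask are assumptions made for illustration, and the
other modes are simplified to "bind and interleave need at least one node".

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the MPOL_* enum; not the uapi definitions. */
enum model_mode {
	MODEL_DEFAULT,
	MODEL_PREFERRED,
	MODEL_BIND,
	MODEL_INTERLEAVE,
	MODEL_LOCAL,
	MODEL_MAX
};

/* Returns the mode that would be stored in the policy, or -EINVAL. */
static int model_resolve_mode(int mode, unsigned long nodes)
{
	assert(mode >= 0 && mode < MODEL_MAX);

	if (mode == MODEL_LOCAL) {
		/* "local" names no nodes, so a non-empty mask is an error */
		if (nodes != 0)
			return -EINVAL;
		/* stored internally as a preferred policy */
		return MODEL_PREFERRED;
	}

	/* simplified: these modes are meaningless without any node */
	if ((mode == MODEL_BIND || mode == MODEL_INTERLEAVE) && nodes == 0)
		return -EINVAL;

	return mode;
}
```

So model_resolve_mode(MODEL_LOCAL, 0) yields MODEL_PREFERRED, while passing
any node bit together with MODEL_LOCAL is rejected.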

Requested-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |    9 ++++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 23e62e0..3e835c9 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -20,6 +20,7 @@ enum {
 	MPOL_PREFERRED,
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
+	MPOL_LOCAL,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 66e90ec..54bd3e5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -269,6 +269,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 			     (flags & MPOL_F_RELATIVE_NODES)))
 				return ERR_PTR(-EINVAL);
 		}
+	} else if (mode == MPOL_LOCAL) {
+		if (!nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		mode = MPOL_PREFERRED;
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -2399,7 +2403,6 @@ void numa_default_policy(void)
  * "local" is pseudo-policy:  MPOL_PREFERRED with MPOL_F_LOCAL flag
  * Used only for mpol_parse_str() and mpol_to_str()
  */
-#define MPOL_LOCAL MPOL_MAX
 static const char * const policy_modes[] =
 {
 	[MPOL_DEFAULT]    = "default",
@@ -2452,12 +2455,12 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 	if (flags)
 		*flags++ = '\0';	/* terminate mode string */
 
-	for (mode = 0; mode <= MPOL_LOCAL; mode++) {
+	for (mode = 0; mode < MPOL_MAX; mode++) {
 		if (!strcmp(str, policy_modes[mode])) {
 			break;
 		}
 	}
-	if (mode > MPOL_LOCAL)
+	if (mode >= MPOL_MAX)
 		goto out;
 
 	switch (mode) {
-- 
1.7.9.2



* [PATCH 17/49] mm: mempolicy: Add MPOL_NOOP
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (15 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 16/49] mm: mempolicy: Make MPOL_LOCAL a real policy Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 18/49] mm: mempolicy: Check for misplaced page Mel Gorman
                   ` (34 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
to mbind().  When the NOOP policy is used with the 'MOVE' and 'LAZY'
flags, mbind() will map the pages PROT_NONE so that they will be
migrated on the next touch.

This allows an application to prepare for a new phase of operation
where different regions of shared storage will be assigned to
worker threads, w/o changing policy.  Note that we could just use
"default" policy in this case.  However, this also allows an
application to request that pages be migrated, only if necessary,
to follow any arbitrary policy that might currently apply to a
range of pages, without knowing the policy, or without specifying
multiple mbind()s for ranges with different policies.

[ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]
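
Judging from the diff, "noop" gets an entry in the policy_modes[] string
table but mpol_parse_str() refuses to return it. The following is a small
stand-alone sketch of that lookup. model_policy_modes and
model_parse_mode() are hypothetical names, and the real function also
parses flags and node lists, which this model omits.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the MPOL_* enum; not the uapi definitions. */
enum {
	MODEL_DEFAULT,
	MODEL_PREFERRED,
	MODEL_BIND,
	MODEL_INTERLEAVE,
	MODEL_LOCAL,
	MODEL_NOOP,
	MODEL_MAX
};

static const char *const model_policy_modes[MODEL_MAX] = {
	[MODEL_DEFAULT]    = "default",
	[MODEL_PREFERRED]  = "prefer",
	[MODEL_BIND]       = "bind",
	[MODEL_INTERLEAVE] = "interleave",
	[MODEL_LOCAL]      = "local",
	[MODEL_NOOP]       = "noop",
};

/* Returns the mode index for str, or -1 if it is unknown or "noop". */
static int model_parse_mode(const char *str)
{
	int mode;

	assert(str);

	for (mode = 0; mode < MODEL_MAX; mode++) {
		if (!strcmp(str, model_policy_modes[mode]))
			break;
	}

	/* "noop" is in the table but may not be selected by string */
	if (mode >= MODEL_MAX || mode == MODEL_NOOP)
		return -1;

	return mode;
}
```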

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   11 ++++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3e835c9..d23dca8 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -21,6 +21,7 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
+	MPOL_NOOP,		/* retain existing policy for range */
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 54bd3e5..c21e914 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 
-	if (mode == MPOL_DEFAULT) {
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
-		return NULL;	/* simply delete any existing policy */
+		return NULL;
 	}
 	VM_BUG_ON(!nodes);
 
@@ -1147,7 +1147,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -2409,7 +2409,8 @@ static const char * const policy_modes[] =
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
-	[MPOL_LOCAL]      = "local"
+	[MPOL_LOCAL]      = "local",
+	[MPOL_NOOP]	  = "noop",	/* should not actually be used */
 };
 
 
@@ -2460,7 +2461,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 			break;
 		}
 	}
-	if (mode >= MPOL_MAX)
+	if (mode >= MPOL_MAX || mode == MPOL_NOOP)
 		goto out;
 
 	switch (mode) {
-- 
1.7.9.2



* [PATCH 18/49] mm: mempolicy: Check for misplaced page
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (16 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 17/49] mm: mempolicy: Add MPOL_NOOP Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 19/49] mm: migrate: Introduce migrate_misplaced_page() Mel Gorman
                   ` (33 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped.  This involves
looking up the node where the page belongs.  So, the function
returns that node so that it may be used to allocate the page
without consulting the policy again.

A subsequent patch will call this function from the fault path.
Because of this, I don't want to go ahead and allocate the page, e.g.,
via alloc_page_vma() only to have to free it if it has the correct
policy.  So, I just mimic the alloc_page_vma() node computation
logic--sort of.

Note:  we could use this function to implement a MPOL_MF_STRICT
behavior when migrating pages to match mbind() mempolicy--e.g.,
to ensure that pages in an interleaved range are reinterleaved
rather than left where they are when they reside on any node in
the interleave nodemask.
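
To make the return convention concrete, here is a minimal user-space
sketch of the decision for the preferred and bind cases. It is only a
model: struct model_policy, model_misplaced() and the 32-node bitmask are
assumptions for illustration, interleave is left out, and where the kernel
picks the nearest allowed node from the zonelist this model simply picks
the lowest-numbered allowed node.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 32	/* hypothetical limit for the bitmask */

enum model_mode { MODEL_PREFERRED, MODEL_BIND };

/* Hypothetical, heavily reduced stand-in for struct mempolicy. */
struct model_policy {
	enum model_mode mode;
	bool migrate_on_fault;	/* plays the role of MPOL_F_MOF */
	bool local;		/* plays the role of MPOL_F_LOCAL */
	int preferred_node;
	unsigned long nodes;	/* bit n set: node n is allowed */
};

/* Returns -1 if the page is fine where it is, else the target node. */
static int model_misplaced(const struct model_policy *pol,
			   int page_nid, int local_nid)
{
	int polnid, nid;

	assert(pol);
	assert(page_nid >= 0 && page_nid < MODEL_MAX_NODES);

	/* policy did not ask for migrate-on-fault */
	if (!pol->migrate_on_fault)
		return -1;

	switch (pol->mode) {
	case MODEL_PREFERRED:
		polnid = pol->local ? local_nid : pol->preferred_node;
		break;
	case MODEL_BIND:
		/* any node in the mask is acceptable */
		if (pol->nodes & (1UL << page_nid))
			return -1;
		/* model only: lowest allowed node instead of nearest */
		for (nid = 0; nid < MODEL_MAX_NODES; nid++)
			if (pol->nodes & (1UL << nid))
				break;
		if (nid == MODEL_MAX_NODES)
			return -1;
		polnid = nid;
		break;
	default:
		return -1;
	}

	return page_nid != polnid ? polnid : -1;
}
```

A caller would treat -1 as "leave the page alone" and any other value as
the node to migrate to, which matches the "Returns:" block in the
kernel-doc comment of the function added by this patch.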

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ Added MPOL_F_LAZY to trigger migrate-on-fault;
  simplified code now that we don't have to bother
  with special crap for interleaved ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mempolicy.h      |    8 +++++
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   76 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..c511e25 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -198,6 +198,8 @@ static inline int vma_migratable(struct vm_area_struct *vma)
 	return 1;
 }
 
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
 #else
 
 struct mempolicy {};
@@ -323,5 +325,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
 	return 0;
 }
 
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	return -1; /* no node preference */
+}
+
 #endif /* CONFIG_NUMA */
 #endif
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index d23dca8..472de8a 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -61,6 +61,7 @@ enum mpol_rebind_step {
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
+#define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c21e914..df1466d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2181,6 +2181,82 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page   - page to be checked
+ * @vma    - vm area where page mapped
+ * @addr   - virtual address where page mapped
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ *	-1	- not misplaced, page is in the right node
+ *	node	- node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol;
+	struct zone *zone;
+	int curnid = page_to_nid(page);
+	unsigned long pgoff;
+	int polnid = -1;
+	int ret = -1;
+
+	BUG_ON(!vma);
+
+	pol = get_vma_policy(current, vma, addr);
+	if (!(pol->flags & MPOL_F_MOF))
+		goto out;
+
+	switch (pol->mode) {
+	case MPOL_INTERLEAVE:
+		BUG_ON(addr >= vma->vm_end);
+		BUG_ON(addr < vma->vm_start);
+
+		pgoff = vma->vm_pgoff;
+		pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+		polnid = offset_il_node(pol, vma, pgoff);
+		break;
+
+	case MPOL_PREFERRED:
+		if (pol->flags & MPOL_F_LOCAL)
+			polnid = numa_node_id();
+		else
+			polnid = pol->v.preferred_node;
+		break;
+
+	case MPOL_BIND:
+		/*
+		 * allows binding to multiple nodes.
+		 * use current page if in policy nodemask,
+		 * else select nearest allowed node, if any.
+		 * If no allowed nodes, use current [!misplaced].
+		 */
+		if (node_isset(curnid, pol->v.nodes))
+			goto out;
+		(void)first_zones_zonelist(
+				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+				gfp_zone(GFP_HIGHUSER),
+				&pol->v.nodes, &zone);
+		polnid = zone->node;
+		break;
+
+	default:
+		BUG();
+	}
+	if (curnid != polnid)
+		ret = polnid;
+out:
+	mpol_cond_put(pol);
+
+	return ret;
+}
+
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);
-- 
1.7.9.2



* [PATCH 19/49] mm: migrate: Introduce migrate_misplaced_page()
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (17 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 18/49] mm: mempolicy: Check for misplaced page Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 20/49] mm: migrate: Drop the misplaced pages reference count if the target node is full Mel Gorman
                   ` (32 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Note: This was originally based on Peter's patch "mm/migrate: Introduce
	migrate_misplaced_page()" but borrows extremely heavily from Andrea's
	"autonuma: memory follows CPU algorithm and task/mm_autonuma stats
	collection". The end result is barely recognisable so signed-offs
	had to be dropped. If original authors are ok with it, I'll
	re-add the signed-off-bys.

Add migrate_misplaced_page() which deals with migrating pages from
faults.
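
Before isolating the page, the new code asks whether the target node has
room. A simplified model of that check is sketched below. struct
model_zone and model_balanced_node() are hypothetical, and
zone_watermark_ok() is reduced to a plain comparison of free pages
against the high watermark plus the number of pages to migrate, which
ignores lowmem reserves and allocation order.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for struct zone. */
struct model_zone {
	bool populated;
	bool all_unreclaimable;
	unsigned long free_pages;
	unsigned long high_wmark;
};

/*
 * Returns true if at least one usable zone on the node would stay
 * above its high watermark after receiving nr_migrate_pages pages.
 */
static bool model_balanced_node(const struct model_zone *zones, int nr_zones,
				unsigned long nr_migrate_pages)
{
	int z;

	assert(zones && nr_zones > 0);

	/* walk from the highest zone down, as the patch does */
	for (z = nr_zones - 1; z >= 0; z--) {
		const struct model_zone *zone = &zones[z];

		if (!zone->populated)
			continue;
		if (zone->all_unreclaimable)
			continue;
		/* simplified watermark test, see the note above */
		if (zone->free_pages <= zone->high_wmark + nr_migrate_pages)
			continue;
		return true;
	}

	return false;
}
```

The intent of adding nr_migrate_pages to the watermark is that the
migration itself should not push the zone low enough to wake kswapd.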

Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Based-on-work-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/migrate.h |   11 +++++
 mm/migrate.c            |  108 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 117 insertions(+), 2 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9d1c159..2923135 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -13,6 +13,7 @@ enum migrate_reason {
 	MR_MEMORY_HOTPLUG,
 	MR_SYSCALL,		/* also applies to cpusets */
 	MR_MEMPOLICY_MBIND,
+	MR_NUMA_MISPLACED,
 	MR_CMA
 };
 
@@ -73,4 +74,14 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #define fail_migrate_page NULL
 
 #endif /* CONFIG_MIGRATION */
+
+#ifdef CONFIG_BALANCE_NUMA
+extern int migrate_misplaced_page(struct page *page, int node);
+#else
+static inline int migrate_misplaced_page(struct page *page, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 27be9c9..a2c4567 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -282,7 +282,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
 		struct buffer_head *head, enum migrate_mode mode)
 {
-	int expected_count;
+	int expected_count = 0;
 	void **pslot;
 
 	if (!mapping) {
@@ -1415,4 +1415,108 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
  	}
  	return err;
 }
-#endif
+
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * Returns true if this is a safe migration target node for misplaced NUMA
+ * pages. Currently it only checks the watermarks, which is crude.
+ */
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+				   int nr_migrate_pages)
+{
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static struct page *alloc_misplaced_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+
+	newpage = alloc_pages_exact_node(nid,
+					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
+					  __GFP_NOMEMALLOC | __GFP_NORETRY |
+					  __GFP_NOWARN) &
+					 ~GFP_IOFS, 0);
+	return newpage;
+}
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node. Caller is expected to have an elevated reference count on
+ * the page that will be dropped by this function before returning.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+	int isolated = 0;
+	LIST_HEAD(migratepages);
+
+	/*
+	 * Don't migrate pages that are mapped in multiple processes.
+	 * TODO: Handle false sharing detection instead of this hammer
+	 */
+	if (page_mapcount(page) != 1) {
+		put_page(page);
+		goto out;
+	}
+
+	/* Avoid migrating to a node that is nearly full */
+	if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+		int page_lru;
+
+		if (isolate_lru_page(page)) {
+			put_page(page);
+			goto out;
+		}
+		isolated = 1;
+
+		/*
+		 * Page is isolated which takes a reference count so now the
+		 * callers reference can be safely dropped without the page
+		 * disappearing underneath us during migration
+		 */
+		put_page(page);
+
+		page_lru = page_is_file_cache(page);
+		inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+		list_add(&page->lru, &migratepages);
+	}
+
+	if (isolated) {
+		int nr_remaining;
+
+		nr_remaining = migrate_pages(&migratepages,
+				alloc_misplaced_dst_page,
+				node, false, MIGRATE_ASYNC,
+				MR_NUMA_MISPLACED);
+		if (nr_remaining) {
+			putback_lru_pages(&migratepages);
+			isolated = 0;
+		}
+	}
+	BUG_ON(!list_empty(&migratepages));
+out:
+	return isolated;
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
+#endif /* CONFIG_NUMA */
-- 
1.7.9.2



* [PATCH 20/49] mm: migrate: Drop the misplaced pages reference count if the target node is full
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (18 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 19/49] mm: migrate: Introduce migrate_misplaced_page() Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 21/49] mm: mempolicy: Use _PAGE_NUMA to migrate pages Mel Gorman
                   ` (31 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

If we have to avoid migrating to a node that is nearly full, put page
and return zero.
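
The reference counting rule after this change can be sketched with a toy
model: the caller's reference is dropped on every path, whether or not
the page was isolated. Everything here is hypothetical (struct
model_page, model_migrate_misplaced(), a failed isolate_lru_page()
modelled as "page not on the LRU"), and the model stops at isolation
rather than performing a migration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for struct page. */
struct model_page {
	int refcount;
	int mapcount;
	bool on_lru;
};

/* Returns 1 if the page was isolated for migration, 0 otherwise. */
static int model_migrate_misplaced(struct model_page *page, bool node_has_room)
{
	int isolated = 0;

	assert(page && page->refcount > 0);

	/* pages mapped more than once are not migrated */
	if (page->mapcount != 1) {
		page->refcount--;	/* drop the caller's reference */
		return 0;
	}

	if (node_has_room) {
		if (!page->on_lru) {	/* models isolation failing */
			page->refcount--;
			return 0;
		}
		page->on_lru = false;
		page->refcount++;	/* isolation takes its own reference */
		isolated = 1;
	}

	/*
	 * Reached whether or not the target node had room, so the
	 * caller's reference cannot leak when the node is full.
	 */
	page->refcount--;

	return isolated;
}
```

Before the fix, the final drop sat inside the node_has_room branch, so in
terms of this model a full target node would have returned 0 with
refcount still elevated.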

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c |   17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index a2c4567..49878d7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1489,18 +1489,21 @@ int migrate_misplaced_page(struct page *page, int node)
 		}
 		isolated = 1;
 
-		/*
-		 * Page is isolated which takes a reference count so now the
-		 * callers reference can be safely dropped without the page
-		 * disappearing underneath us during migration
-		 */
-		put_page(page);
-
 		page_lru = page_is_file_cache(page);
 		inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
 		list_add(&page->lru, &migratepages);
 	}
 
+	/*
+	 * Page is either isolated or there is not enough space on the target
+	 * node. If isolated, then it has taken a reference count and the
+	 * callers reference can be safely dropped without the page
+	 * disappearing underneath us during migration. Otherwise the page is
+	 * not to be migrated but the callers reference should still be
+	 * dropped so it does not leak.
+	 */
+	put_page(page);
+
 	if (isolated) {
 		int nr_remaining;
 
-- 
1.7.9.2



* [PATCH 21/49] mm: mempolicy: Use _PAGE_NUMA to migrate pages
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (19 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 20/49] mm: migrate: Drop the misplaced pages reference count if the target node is full Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 22/49] mm: mempolicy: Add MPOL_MF_LAZY Mel Gorman
                   ` (30 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
	sufficiently different that the signed-off-bys were dropped

Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
pieces into an effective migrate on fault scheme.

Note that (on x86) we rely on PROT_NONE pages being !present and avoid
the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
page-migration performance.

Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/huge_mm.h |    9 +++++----
 mm/huge_memory.c        |   31 ++++++++++++++++++++++++++++---
 mm/memory.c             |   32 +++++++++++++++++++++++++++-----
 3 files changed, 60 insertions(+), 12 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a1d26a9..dabb510 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -160,8 +160,8 @@ static inline struct page *compound_trans_head(struct page *page)
 	return page;
 }
 
-extern int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
-				  pmd_t pmd, pmd_t *pmdp);
+extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				unsigned long addr, pmd_t pmd, pmd_t *pmdp);
 
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
@@ -200,9 +200,10 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
 	return 0;
 }
 
-static inline int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
-					pmd_t pmd, pmd_t *pmdp)
+static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+					unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
+	return 0;
 }
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f1b2d63..df1af09 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
 #include <linux/freezer.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
+#include <linux/migrate.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -1018,17 +1019,39 @@ out:
 }
 
 /* NUMA hinting page fault entry point for trans huge pmds */
-int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
-				pmd_t pmd, pmd_t *pmdp)
+int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
-	struct page *page;
+	struct page *page = NULL;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	int target_nid;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	target_nid = mpol_misplaced(page, vma, haddr);
+	if (target_nid == -1)
+		goto clear_pmdnuma;
+
+	/*
+	 * Due to lacking code to migrate thp pages, we'll split
+	 * (which preserves the special PROT_NONE) and re-take the
+	 * fault on the normal pages.
+	 */
+	split_huge_page(page);
+	put_page(page);
+	return 0;
+
+clear_pmdnuma:
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(pmd, *pmdp)))
+		goto out_unlock;
+
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
 	VM_BUG_ON(pmd_numa(*pmdp));
@@ -1036,6 +1059,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+	if (page)
+		put_page(page);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 4d005a3..1757ad8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/migrate.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3451,8 +3452,9 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
 {
-	struct page *page;
+	struct page *page = NULL;
 	spinlock_t *ptl;
+	int current_nid, target_nid;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3465,8 +3467,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	*/
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
-	if (unlikely(!pte_same(*ptep, pte)))
-		goto out_unlock;
+	if (unlikely(!pte_same(*ptep, pte))) {
+		pte_unmap_unlock(ptep, ptl);
+		goto out;
+	}
+
 	pte = pte_mknonnuma(pte);
 	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
@@ -3477,8 +3482,25 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
-out_unlock:
+	get_page(page);
+	current_nid = page_to_nid(page);
+	target_nid = mpol_misplaced(page, vma, addr);
 	pte_unmap_unlock(ptep, ptl);
+	if (target_nid == -1) {
+		/*
+		 * Account for the fault against the current node if it is not
+		 * being replaced, regardless of where the page is located.
+		 */
+		current_nid = numa_node_id();
+		put_page(page);
+		goto out;
+	}
+
+	/* Migrate to the requested node */
+	if (migrate_misplaced_page(page, target_nid))
+		current_nid = target_nid;
+
+out:
 	return 0;
 }
 
@@ -3655,7 +3677,7 @@ retry:
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
 			if (pmd_numa(*pmd))
-				return do_huge_pmd_numa_page(mm, address,
+				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);
 
 			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
-- 
1.7.9.2
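
The decision flow that do_numa_page() gains in this patch can be modelled
in userspace roughly as follows. This is a minimal sketch and not kernel
code: struct model_page, model_misplaced(), model_migrate() and
model_numa_fault() are hypothetical names, the policy (prefer the node of
the faulting CPU) is an assumption, and migration is assumed to always
succeed. Only the -1 "no target" convention and the accounting of
current_nid are taken from the diff above.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_TARGET (-1)	/* same convention as mpol_misplaced(): -1 means "leave it" */

/* Hypothetical stand-in for a page and its NUMA hinting state. */
struct model_page {
	int  nid;	/* node the page currently lives on */
	bool numa_hint;	/* stand-in for the _PAGE_NUMA marking */
};

/* Assumed policy: the page belongs on the node of the faulting CPU. */
static int model_misplaced(const struct model_page *page, int cpu_nid)
{
	return page->nid == cpu_nid ? NO_TARGET : cpu_nid;
}

/* Assumed migration: always succeeds in this model. */
static bool model_migrate(struct model_page *page, int target_nid)
{
	page->nid = target_nid;
	return true;
}

/*
 * Model of the hinting fault: clear the marking, ask the policy for a
 * target node and migrate only when the policy names one. Returns the
 * node the fault would be accounted against.
 */
static int model_numa_fault(struct model_page *page, int cpu_nid)
{
	int target_nid;

	assert(page->numa_hint);	/* only marked pages take this path */
	page->numa_hint = false;

	target_nid = model_misplaced(page, cpu_nid);
	if (target_nid == NO_TARGET)
		return cpu_nid;		/* mirrors current_nid = numa_node_id() */

	if (model_migrate(page, target_nid))
		return target_nid;	/* mirrors current_nid = target_nid */

	return page->nid;		/* migration failed, page stayed put */
}
```

A page on node 0 that is touched from node 1 ends up on node 1 with its
marking cleared; a second marked fault from node 1 leaves it where it is.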



* [PATCH 22/49] mm: mempolicy: Add MPOL_MF_LAZY
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (20 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 21/49] mm: mempolicy: Use _PAGE_NUMA to migrate pages Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2013-01-05  5:18   ` Simon Jeons
  2012-12-07 10:23 ` [PATCH 23/49] mm: mempolicy: Implement change_prot_numa() in terms of change_protection() Mel Gorman
                   ` (29 subsequent siblings)
  51 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

NOTE: Once again there is a lot of patch stealing and the end result
	is sufficiently different that I had to drop the signed-offs.
	Will re-add if the original authors are ok with that.

This patch adds another mbind() flag to request "lazy migration".  The
flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are marked PROT_NONE. The pages will be migrated in the fault
path on "first touch", if the policy dictates at that time.

"Lazy Migration" will allow testing of migrate-on-fault via mbind().
It also allows applications to specify that only subsequently touched
pages be migrated to obey the new policy, instead of all pages in the
range. This can be useful for multi-threaded applications working on a
large shared data area that is initialized by an initial thread,
resulting in all pages being on one [or a few, if overflowed] nodes.
Once the pages are marked PROT_NONE, those in regions assigned to the
worker threads will be automatically migrated local to the threads on
first touch.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mm.h             |    5 ++
 include/uapi/linux/mempolicy.h |   13 ++-
 mm/mempolicy.c                 |  185 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 185 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa16152..471185e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1551,6 +1551,11 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
+#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+void change_prot_numa(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
+#endif
+
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 472de8a..6a1baae 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -49,9 +49,16 @@ enum mpol_rebind_step {
 
 /* Flags for mbind */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3)	/* Internal flags start here */
+#define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
+				   to policy */
+#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
+#define MPOL_MF_LAZY	 (1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+
+#define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
+			 MPOL_MF_MOVE     | 	\
+			 MPOL_MF_MOVE_ALL |	\
+			 MPOL_MF_LAZY)
 
 /*
  * Internal flags that share the struct mempolicy flags word with
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index df1466d..51d3ebd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -90,6 +90,7 @@
 #include <linux/syscalls.h>
 #include <linux/ctype.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
@@ -565,6 +566,145 @@ static inline int check_pgd_range(struct vm_area_struct *vma,
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+/*
+ * Here we search for not shared page mappings (mapcount == 1) and we
+ * set up the pmd/pte_numa on those mappings so the very next access
+ * will fire a NUMA hinting page fault.
+ */
+static int
+change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	struct page *page;
+	unsigned long _address, end;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	VM_BUG_ON(address & ~PAGE_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		int page_nid;
+		ret = HPAGE_PMD_NR;
+
+		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+		if (pmd_numa(*pmd)) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		page = pmd_page(*pmd);
+
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		page_nid = page_to_nid(page);
+
+		if (pmd_numa(*pmd)) {
+			spin_unlock(&mm->page_table_lock);
+			goto out;
+		}
+
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		ret += HPAGE_PMD_NR;
+		/* defer TLB flush to lower the overhead */
+		spin_unlock(&mm->page_table_lock);
+		goto out;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		goto out;
+	VM_BUG_ON(!pmd_present(*pmd));
+
+	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _address < end;
+	     _pte++, _address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+		if (!pte_present(pteval))
+			continue;
+		if (pte_numa(pteval))
+			continue;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+
+		set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+		/* defer TLB flush to lower the overhead */
+		ret++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret && !pmd_numa(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		spin_unlock(&mm->page_table_lock);
+		/* defer TLB flush to lower the overhead */
+	}
+
+out:
+	return ret;
+}
+
+/* Assumes mmap_sem is held */
+void
+change_prot_numa(struct vm_area_struct *vma,
+			unsigned long address, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int progress = 0;
+
+	while (address < end) {
+		VM_BUG_ON(address < vma->vm_start ||
+			  address + PAGE_SIZE > vma->vm_end);
+
+		progress += change_prot_numa_range(mm, vma, address);
+		address = (address + PMD_SIZE) & PMD_MASK;
+	}
+
+	/*
+	 * Flush the TLB for the mm to start the NUMA hinting
+	 * page faults after we finish scanning this vma part
+	 * if there were any PTE updates
+	 */
+	if (progress) {
+		mmu_notifier_invalidate_range_start(vma->vm_mm, address, end);
+		flush_tlb_range(vma, address, end);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, address, end);
+	}
+}
+#else
+static unsigned long change_prot_numa(struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
+{
+	return 0;
+}
+#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
+
 /*
  * Check if all pages in a range are on a set of nodes.
  * If pagelist != NULL then isolate pages from the LRU and
@@ -583,22 +723,32 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 		return ERR_PTR(-EFAULT);
 	prev = NULL;
 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+		unsigned long endvma = vma->vm_end;
+
+		if (endvma > end)
+			endvma = end;
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+
 		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
 			if (!vma->vm_next && vma->vm_end < end)
 				return ERR_PTR(-EFAULT);
 			if (prev && prev->vm_end < vma->vm_start)
 				return ERR_PTR(-EFAULT);
 		}
-		if (!is_vm_hugetlb_page(vma) &&
-		    ((flags & MPOL_MF_STRICT) ||
+
+		if (is_vm_hugetlb_page(vma))
+			goto next;
+
+		if (flags & MPOL_MF_LAZY) {
+			change_prot_numa(vma, start, endvma);
+			goto next;
+		}
+
+		if ((flags & MPOL_MF_STRICT) ||
 		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-				vma_migratable(vma)))) {
-			unsigned long endvma = vma->vm_end;
+		      vma_migratable(vma))) {
 
-			if (endvma > end)
-				endvma = end;
-			if (vma->vm_start > start)
-				start = vma->vm_start;
 			err = check_pgd_range(vma, start, endvma, nodes,
 						flags, private);
 			if (err) {
@@ -606,6 +756,7 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 				break;
 			}
 		}
+next:
 		prev = vma;
 	}
 	return first;
@@ -1138,8 +1289,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	int err;
 	LIST_HEAD(pagelist);
 
-	if (flags & ~(unsigned long)(MPOL_MF_STRICT |
-				     MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+	if (flags & ~(unsigned long)MPOL_MF_VALID)
 		return -EINVAL;
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
 		return -EPERM;
@@ -1162,6 +1312,9 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	if (flags & MPOL_MF_LAZY)
+		new->flags |= MPOL_F_MOF;
+
 	/*
 	 * If we are using the default policy then operation
 	 * on discontinuous address spaces is okay after all
@@ -1198,13 +1351,15 @@ static long do_mbind(unsigned long start, unsigned long len,
 	vma = check_range(mm, start, end, nmask,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
-	err = PTR_ERR(vma);
-	if (!IS_ERR(vma)) {
-		int nr_failed = 0;
-
+	err = PTR_ERR(vma);	/* maybe ... */
+	if (!IS_ERR(vma) && mode != MPOL_NOOP)
 		err = mbind_range(mm, start, end, new);
 
+	if (!err) {
+		int nr_failed = 0;
+
 		if (!list_empty(&pagelist)) {
+			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma,
 						false, MIGRATE_SYNC,
@@ -1213,7 +1368,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 				putback_lru_pages(&pagelist);
 		}
 
-		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+		if (nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
 	} else
 		putback_lru_pages(&pagelist);
-- 
1.7.9.2
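
The "mark now, migrate on first touch" behaviour described in the
changelog can be illustrated with a small userspace model. This is a
sketch under stated assumptions, not the kernel implementation: struct
lazy_page, lazy_mark_range() and lazy_touch() are hypothetical names, a
boolean stands in for the PROT_NONE marking, and the policy is assumed to
be "move the page to the node of the thread that touches it".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page in the shared data area. */
struct lazy_page {
	int  nid;	/* node currently holding the page */
	bool marked;	/* stand-in for the PROT_NONE marking */
};

/*
 * Mark pages [first, first + count) for migrate-on-fault.
 * Returns how many pages were newly marked.
 */
static size_t lazy_mark_range(struct lazy_page *pages, size_t first,
			      size_t count)
{
	size_t i, newly_marked = 0;

	assert(pages != NULL);
	for (i = first; i < first + count; i++) {
		if (!pages[i].marked) {
			pages[i].marked = true;
			newly_marked++;
		}
	}
	return newly_marked;
}

/*
 * First touch from a thread running on toucher_nid. The marking is
 * cleared and the page is moved only if it is on another node.
 * Returns true if the page moved.
 */
static bool lazy_touch(struct lazy_page *page, int toucher_nid)
{
	if (!page->marked)
		return false;		/* ordinary access, nothing to do */

	page->marked = false;
	if (page->nid == toucher_nid)
		return false;		/* already local */

	page->nid = toucher_nid;
	return true;
}
```

Pages that are never touched stay marked and stay where the initial
thread placed them, which is the difference from an eager MPOL_MF_MOVE.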



* [PATCH 23/49] mm: mempolicy: Implement change_prot_numa() in terms of change_protection()
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (21 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 22/49] mm: mempolicy: Add MPOL_MF_LAZY Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 24/49] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now Mel Gorman
                   ` (28 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

This patch converts change_prot_numa() to use change_protection(). As
pte_numa and friends check the PTE bits directly it is necessary for
change_protection() to use pmd_mknuma(). Hence the required
modifications to change_protection() are a little clumsy but the
end result is that most of the numa page table helpers are just one or
two instructions.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/huge_mm.h |    3 +-
 include/linux/mm.h      |    4 +-
 mm/huge_memory.c        |   14 ++++-
 mm/mempolicy.c          |  137 +++++------------------------------------------
 mm/mprotect.c           |   72 +++++++++++++++++++------
 5 files changed, 85 insertions(+), 145 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index dabb510..027ad04 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -27,7 +27,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 			 unsigned long new_addr, unsigned long old_end,
 			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-			unsigned long addr, pgprot_t newprot);
+			unsigned long addr, pgprot_t newprot,
+			int prot_numa);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 471185e..d04c2f0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1080,7 +1080,7 @@ extern unsigned long do_mremap(unsigned long addr,
 			       unsigned long flags, unsigned long new_addr);
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
-			      int dirty_accountable);
+			      int dirty_accountable, int prot_numa);
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
@@ -1552,7 +1552,7 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 #endif
 
 #ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
-void change_prot_numa(struct vm_area_struct *vma,
+unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
 #endif
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index df1af09..68e0412 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1146,7 +1146,7 @@ out:
 }
 
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, pgprot_t newprot)
+		unsigned long addr, pgprot_t newprot, int prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int ret = 0;
@@ -1154,7 +1154,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
 		pmd_t entry;
 		entry = pmdp_get_and_clear(mm, addr, pmd);
-		entry = pmd_modify(entry, newprot);
+		if (!prot_numa)
+			entry = pmd_modify(entry, newprot);
+		else {
+			struct page *page = pmd_page(*pmd);
+
+			/* only check non-shared pages */
+			if (page_mapcount(page) == 1 &&
+			    !pmd_numa(*pmd)) {
+				entry = pmd_mknuma(entry);
+			}
+		}
 		set_pmd_at(mm, addr, pmd, entry);
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		ret = 1;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 51d3ebd..75d4600 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -568,134 +568,23 @@ static inline int check_pgd_range(struct vm_area_struct *vma,
 
 #ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
 /*
- * Here we search for not shared page mappings (mapcount == 1) and we
- * set up the pmd/pte_numa on those mappings so the very next access
- * will fire a NUMA hinting page fault.
+ * This is used to mark a range of virtual addresses to be inaccessible.
+ * These are later cleared by a NUMA hinting fault. Depending on these
+ * faults, pages may be migrated for better NUMA placement.
+ *
+ * This is assuming that NUMA faults are handled using PROT_NONE. If
+ * an architecture makes a different choice, it will need further
+ * changes to the core.
  */
-static int
-change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address)
-{
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte, *_pte;
-	struct page *page;
-	unsigned long _address, end;
-	spinlock_t *ptl;
-	int ret = 0;
-
-	VM_BUG_ON(address & ~PAGE_MASK);
-
-	pgd = pgd_offset(mm, address);
-	if (!pgd_present(*pgd))
-		goto out;
-
-	pud = pud_offset(pgd, address);
-	if (!pud_present(*pud))
-		goto out;
-
-	pmd = pmd_offset(pud, address);
-	if (pmd_none(*pmd))
-		goto out;
-
-	if (pmd_trans_huge_lock(pmd, vma) == 1) {
-		int page_nid;
-		ret = HPAGE_PMD_NR;
-
-		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
-		if (pmd_numa(*pmd)) {
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-
-		page = pmd_page(*pmd);
-
-		/* only check non-shared pages */
-		if (page_mapcount(page) != 1) {
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-
-		page_nid = page_to_nid(page);
-
-		if (pmd_numa(*pmd)) {
-			spin_unlock(&mm->page_table_lock);
-			goto out;
-		}
-
-		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
-		ret += HPAGE_PMD_NR;
-		/* defer TLB flush to lower the overhead */
-		spin_unlock(&mm->page_table_lock);
-		goto out;
-	}
-
-	if (pmd_trans_unstable(pmd))
-		goto out;
-	VM_BUG_ON(!pmd_present(*pmd));
-
-	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
-	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	for (_address = address, _pte = pte; _address < end;
-	     _pte++, _address += PAGE_SIZE) {
-		pte_t pteval = *_pte;
-		if (!pte_present(pteval))
-			continue;
-		if (pte_numa(pteval))
-			continue;
-		page = vm_normal_page(vma, _address, pteval);
-		if (unlikely(!page))
-			continue;
-		/* only check non-shared pages */
-		if (page_mapcount(page) != 1)
-			continue;
-
-		set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
-
-		/* defer TLB flush to lower the overhead */
-		ret++;
-	}
-	pte_unmap_unlock(pte, ptl);
-
-	if (ret && !pmd_numa(*pmd)) {
-		spin_lock(&mm->page_table_lock);
-		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
-		spin_unlock(&mm->page_table_lock);
-		/* defer TLB flush to lower the overhead */
-	}
-
-out:
-	return ret;
-}
-
-/* Assumes mmap_sem is held */
-void
-change_prot_numa(struct vm_area_struct *vma,
-			unsigned long address, unsigned long end)
+unsigned long change_prot_numa(struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
 {
-	struct mm_struct *mm = vma->vm_mm;
-	int progress = 0;
-
-	while (address < end) {
-		VM_BUG_ON(address < vma->vm_start ||
-			  address + PAGE_SIZE > vma->vm_end);
+	int nr_updated;
+	BUILD_BUG_ON(_PAGE_NUMA != _PAGE_PROTNONE);
 
-		progress += change_prot_numa_range(mm, vma, address);
-		address = (address + PMD_SIZE) & PMD_MASK;
-	}
+	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
 
-	/*
-	 * Flush the TLB for the mm to start the NUMA hinting
-	 * page faults after we finish scanning this vma part
-	 * if there were any PTE updates
-	 */
-	if (progress) {
-		mmu_notifier_invalidate_range_start(vma->vm_mm, address, end);
-		flush_tlb_range(vma, address, end);
-		mmu_notifier_invalidate_range_end(vma->vm_mm, address, end);
-	}
+	return nr_updated;
 }
 #else
 static unsigned long change_prot_numa(struct vm_area_struct *vma,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7c3628a..8abf7c6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -35,10 +35,11 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 }
 #endif
 
-static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable)
+		int dirty_accountable, int prot_numa)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
@@ -49,19 +50,39 @@ static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
+			bool updated = false;
 
 			ptent = ptep_modify_prot_start(mm, addr, pte);
-			ptent = pte_modify(ptent, newprot);
+			if (!prot_numa) {
+				ptent = pte_modify(ptent, newprot);
+				updated = true;
+			} else {
+				struct page *page;
+
+				page = vm_normal_page(vma, addr, oldpte);
+				if (page) {
+					/* only check non-shared pages */
+					if (!pte_numa(oldpte) &&
+					    page_mapcount(page) == 1) {
+						ptent = pte_mknuma(ptent);
+						updated = true;
+					}
+				}
+			}
 
 			/*
 			 * Avoid taking write faults for pages we know to be
 			 * dirty.
 			 */
-			if (dirty_accountable && pte_dirty(ptent))
+			if (dirty_accountable && pte_dirty(ptent)) {
 				ptent = pte_mkwrite(ptent);
+				updated = true;
+			}
+
+			if (updated)
+				pages++;
 
 			ptep_modify_prot_commit(mm, addr, pte, ptent);
-			pages++;
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 
@@ -83,9 +104,25 @@ static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	return pages;
 }
 
+#ifdef CONFIG_BALANCE_NUMA
+static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
+		pmd_t *pmd)
+{
+	spin_lock(&mm->page_table_lock);
+	set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
+	spin_unlock(&mm->page_table_lock);
+}
+#else
+static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
+		pmd_t *pmd)
+{
+	BUG();
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable)
+		int dirty_accountable, int prot_numa)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -97,7 +134,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				split_huge_page_pmd(vma->vm_mm, pmd);
-			else if (change_huge_pmd(vma, pmd, addr, newprot)) {
+			else if (change_huge_pmd(vma, pmd, addr, newprot, prot_numa)) {
 				pages += HPAGE_PMD_NR;
 				continue;
 			}
@@ -105,8 +142,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 		}
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		pages += change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
-				 dirty_accountable);
+		pages += change_pte_range(vma, pmd, addr, next, newprot,
+				 dirty_accountable, prot_numa);
+
+		if (prot_numa)
+			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
 	return pages;
@@ -114,7 +154,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 
 static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable)
+		int dirty_accountable, int prot_numa)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -126,7 +166,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
-				 dirty_accountable);
+				 dirty_accountable, prot_numa);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
@@ -134,7 +174,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *
 
 static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable)
+		int dirty_accountable, int prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -150,7 +190,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_pud_range(vma, pgd, addr, next, newprot,
-				 dirty_accountable);
+				 dirty_accountable, prot_numa);
 	} while (pgd++, addr = next, addr != end);
 
 	/* Only flush the TLB if we actually modified any entries: */
@@ -162,7 +202,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 
 unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end, pgprot_t newprot,
-		       int dirty_accountable)
+		       int dirty_accountable, int prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long pages;
@@ -171,7 +211,7 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot);
 	else
-		pages = change_protection_range(vma, start, end, newprot, dirty_accountable);
+		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
 	mmu_notifier_invalidate_range_end(mm, start, end);
 
 	return pages;
@@ -249,7 +289,7 @@ success:
 		dirty_accountable = 1;
 	}
 
-	change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+	change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable, 0);
 
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
-- 
1.7.9.2
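
The filtering and counting that change_pte_range() performs when
prot_numa is set can be sketched in userspace as below. This is an
illustrative model only: struct model_pte and model_mark_numa() are
hypothetical names, and the fields stand in for pte_present(),
pte_numa() and page_mapcount(). It assumes dirty_accountable is 0, as in
the change_prot_numa() call above, so only NUMA markings count as updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one PTE and its backing page. */
struct model_pte {
	bool present;	/* stand-in for pte_present() */
	bool numa;	/* stand-in for pte_numa() */
	int  mapcount;	/* stand-in for page_mapcount() of the page */
};

/*
 * Mark every present, not yet marked, non-shared entry and return the
 * number of entries actually changed. Entries that are skipped do not
 * count, which is what lets the caller avoid a TLB flush when nothing
 * was modified.
 */
static unsigned long model_mark_numa(struct model_pte *ptes, size_t n)
{
	unsigned long pages = 0;
	size_t i;

	assert(ptes != NULL || n == 0);
	for (i = 0; i < n; i++) {
		struct model_pte *pte = &ptes[i];

		if (!pte->present)
			continue;
		/* only mark non-shared pages that are not marked already */
		if (pte->numa || pte->mapcount != 1)
			continue;

		pte->numa = true;
		pages++;
	}
	return pages;
}
```

Running the scan twice over the same range reports zero updates the
second time, since every eligible entry is already marked.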



* [PATCH 24/49] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (22 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 23/49] mm: mempolicy: Implement change_prot_numa() in terms of change_protection() Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 25/49] mm: numa: Add fault driven placement and migration Mel Gorman
                   ` (27 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to
explicitly request lazy migration is a good idea but the actual
API has not been well reviewed and once released we have to support it.
For now this patch prevents applications from using these interfaces.
This will need to be revisited.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/uapi/linux/mempolicy.h |    4 +---
 mm/mempolicy.c                 |    9 ++++-----
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 6a1baae..16fb4e6 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -21,7 +21,6 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
-	MPOL_NOOP,		/* retain existing policy for range */
 	MPOL_MAX,	/* always last member of enum */
 };
 
@@ -57,8 +56,7 @@ enum mpol_rebind_step {
 
 #define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
 			 MPOL_MF_MOVE     | 	\
-			 MPOL_MF_MOVE_ALL |	\
-			 MPOL_MF_LAZY)
+			 MPOL_MF_MOVE_ALL)
 
 /*
  * Internal flags that share the struct mempolicy flags word with
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 75d4600..a7a62fe 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -252,7 +252,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 
-	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
+	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
 		return NULL;
@@ -1186,7 +1186,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
+	if (mode == MPOL_DEFAULT)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -1241,7 +1241,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
 	err = PTR_ERR(vma);	/* maybe ... */
-	if (!IS_ERR(vma) && mode != MPOL_NOOP)
+	if (!IS_ERR(vma))
 		err = mbind_range(mm, start, end, new);
 
 	if (!err) {
@@ -2530,7 +2530,6 @@ static const char * const policy_modes[] =
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
 	[MPOL_LOCAL]      = "local",
-	[MPOL_NOOP]	  = "noop",	/* should not actually be used */
 };
 
 
@@ -2581,7 +2580,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 			break;
 		}
 	}
-	if (mode >= MPOL_MAX || mode == MPOL_NOOP)
+	if (mode >= MPOL_MAX)
 		goto out;
 
 	switch (mode) {
-- 
1.7.9.2
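
The effect of dropping MPOL_MF_LAZY from MPOL_MF_VALID can be shown with
a small userspace model of the flag check in do_mbind(). This is a
sketch, not the kernel function: model_check_flags() and the two
MODEL_VALID_* masks are hypothetical names, while the MPOL_MF_* values
are copied from the uapi header as it appears in this series.

```c
#include <assert.h>
#include <errno.h>

/* Flag values as defined in include/uapi/linux/mempolicy.h in this series. */
#define MPOL_MF_STRICT		(1 << 0)
#define MPOL_MF_MOVE		(1 << 1)
#define MPOL_MF_MOVE_ALL	(1 << 2)
#define MPOL_MF_LAZY		(1 << 3)

/* Hypothetical names for the mask before and after this patch. */
#define MODEL_VALID_WITH_LAZY	(MPOL_MF_STRICT | MPOL_MF_MOVE | \
				 MPOL_MF_MOVE_ALL | MPOL_MF_LAZY)
#define MODEL_VALID_HIDDEN	(MPOL_MF_STRICT | MPOL_MF_MOVE | \
				 MPOL_MF_MOVE_ALL)

/*
 * Same test as "flags & ~(unsigned long)MPOL_MF_VALID" in do_mbind():
 * any bit outside the valid mask is rejected with -EINVAL.
 */
static int model_check_flags(unsigned long flags, unsigned long valid)
{
	assert(valid != 0);	/* an empty mask is not a case modelled here */

	if (flags & ~valid)
		return -EINVAL;
	return 0;
}
```

With the reduced mask, a caller passing MPOL_MF_MOVE | MPOL_MF_LAZY gets
-EINVAL, while the flag definition itself stays in the header for
in-kernel users.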



* [PATCH 25/49] mm: numa: Add fault driven placement and migration
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (23 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 24/49] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2013-01-04 11:56   ` Simon Jeons
  2012-12-07 10:23 ` [PATCH 26/49] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Mel Gorman
                   ` (26 subsequent siblings)
  51 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

NOTE: This patch is based on "sched, numa, mm: Add fault driven
	placement and migration policy" but as it throws away all the policy
	to just leave a basic foundation I had to drop the signed-offs-by.

This patch creates a bare-bones method for marking PTEs pte_numa from
scheduler context. When such a PTE is faulted later, the page will be
faulted onto the node the CPU is running on.  In itself this does nothing
useful, but any placement policy will fundamentally depend on receiving
placement hints from fault context and doing something intelligent with them.
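
As a rough illustration of how the periodic marking is driven, here is a
minimal userspace sketch of the runtime check that task_tick_numa() makes in
this patch. The names struct numa_tick_model and numa_tick_should_scan() are
hypothetical and exist only for this example. The real code additionally
compares jiffies against mm->numa_next_scan and queues task_numa_work()
instead of returning a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical stand-in for the two task_struct fields used by the check. */
struct numa_tick_model {
	uint64_t node_stamp;		/* runtime (ns) at the last trigger */
	unsigned int scan_period_ms;	/* scan period in milliseconds */
};

/*
 * Return true when the task has accumulated more than one scan period of
 * CPU runtime since the last trigger. Runtime is used rather than wall
 * time, so a task that rarely runs is rarely scanned.
 */
static bool numa_tick_should_scan(struct numa_tick_model *t, uint64_t runtime_ns)
{
	uint64_t period;

	assert(t != NULL);
	period = (uint64_t)t->scan_period_ms * NSEC_PER_MSEC;

	if (runtime_ns - t->node_stamp > period) {
		t->node_stamp = runtime_ns;
		return true;
	}
	return false;
}
```

With the 5000 ms default from this patch, the check first fires once the task
has run for slightly more than 5 seconds of CPU time, no matter how much wall
time has passed.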

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 arch/sh/mm/Kconfig       |    1 +
 arch/x86/Kconfig         |    2 +
 include/linux/mm_types.h |   11 ++++
 include/linux/sched.h    |   20 ++++++++
 kernel/sched/core.c      |   13 +++++
 kernel/sched/fair.c      |  125 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h  |    7 +++
 kernel/sched/sched.h     |    6 +++
 kernel/sysctl.c          |   24 ++++++++-
 mm/huge_memory.c         |    5 +-
 mm/memory.c              |   14 +++++-
 11 files changed, 224 insertions(+), 4 deletions(-)

diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig
index cb8f992..0f7c852 100644
--- a/arch/sh/mm/Kconfig
+++ b/arch/sh/mm/Kconfig
@@ -111,6 +111,7 @@ config VSYSCALL
 config NUMA
 	bool "Non Uniform Memory Access (NUMA) Support"
 	depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
+	select ARCH_WANT_NUMA_VARIABLE_LOCALITY
 	default n
 	help
 	  Some SH systems have many various memories scattered around
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 46c3bff..1137028 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -22,6 +22,8 @@ config X86
 	def_bool y
 	select HAVE_AOUT if X86_32
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select ARCH_SUPPORTS_NUMA_BALANCING
+	select ARCH_WANTS_PROT_NUMA_PROT_NONE
 	select HAVE_IDE
 	select HAVE_OPROFILE
 	select HAVE_PCSPKR_PLATFORM
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3a..d82accb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -398,6 +398,17 @@ struct mm_struct {
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
 #endif
+#ifdef CONFIG_BALANCE_NUMA
+	/*
+	 * numa_next_scan is the next time when the PTEs will be marked
+	 * pte_numa to gather statistics and migrate pages to new nodes
+	 * if necessary
+	 */
+	unsigned long numa_next_scan;
+
+	/* numa_scan_seq prevents two threads setting pte_numa */
+	int numa_scan_seq;
+#endif
 	struct uprobes_state uprobes_state;
 };
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0dd42a0..ac71181 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1479,6 +1479,14 @@ struct task_struct {
 	short il_next;
 	short pref_node_fork;
 #endif
+#ifdef CONFIG_BALANCE_NUMA
+	int numa_scan_seq;
+	int numa_migrate_seq;
+	unsigned int numa_scan_period;
+	u64 node_stamp;			/* migration stamp  */
+	struct callback_head numa_work;
+#endif /* CONFIG_BALANCE_NUMA */
+
 	struct rcu_head rcu;
 
 	/*
@@ -1553,6 +1561,14 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+#ifdef CONFIG_BALANCE_NUMA
+extern void task_numa_fault(int node, int pages);
+#else
+static inline void task_numa_fault(int node, int pages)
+{
+}
+#endif
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
@@ -1990,6 +2006,10 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_balance_numa_scan_period_min;
+extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_settle_count;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..81fa185 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1533,6 +1533,19 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
+
+#ifdef CONFIG_BALANCE_NUMA
+	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+		p->mm->numa_next_scan = jiffies;
+		p->mm->numa_scan_seq = 0;
+	}
+
+	p->node_stamp = 0ULL;
+	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+	p->numa_work.next = &p->numa_work;
+#endif /* CONFIG_BALANCE_NUMA */
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..b6d3ed7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,8 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/mempolicy.h>
+#include <linux/task_work.h>
 
 #include <trace/events/sched.h>
 
@@ -776,6 +778,126 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * numa task sample period in ms: 5s
+ */
+unsigned int sysctl_balance_numa_scan_period_min = 5000;
+unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
+
+static void task_numa_placement(struct task_struct *p)
+{
+	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+
+	if (p->numa_scan_seq == seq)
+		return;
+	p->numa_scan_seq = seq;
+
+	/* FIXME: Scheduling placement policy hints go here */
+}
+
+/*
+ * Got a PROT_NONE fault for a page on @node.
+ */
+void task_numa_fault(int node, int pages)
+{
+	struct task_struct *p = current;
+
+	/* FIXME: Allocate task-specific structure for placement policy here */
+
+	task_numa_placement(p);
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_work(struct callback_head *work)
+{
+	unsigned long migrate, next_scan, now = jiffies;
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+
+	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+
+	work->next = work; /* protect against double add */
+	/*
+	 * Who cares about NUMA placement when they're dying.
+	 *
+	 * NOTE: make sure not to dereference p->mm before this check,
+	 * exit_task_work() happens _after_ exit_mm() so we could be called
+	 * without p->mm even though we still had it when we enqueued this
+	 * work.
+	 */
+	if (p->flags & PF_EXITING)
+		return;
+
+	/*
+	 * Enforce maximal scan/migration frequency..
+	 */
+	migrate = mm->numa_next_scan;
+	if (time_before(now, migrate))
+		return;
+
+	if (p->numa_scan_period == 0)
+		p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+
+	next_scan = now + 2*msecs_to_jiffies(p->numa_scan_period);
+	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
+		return;
+
+	ACCESS_ONCE(mm->numa_scan_seq)++;
+	{
+		struct vm_area_struct *vma;
+
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (!vma_migratable(vma))
+				continue;
+			change_prot_numa(vma, vma->vm_start, vma->vm_end);
+		}
+		up_read(&mm->mmap_sem);
+	}
+}
+
+/*
+ * Drive the periodic memory faults..
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+	struct callback_head *work = &curr->numa_work;
+	u64 period, now;
+
+	/*
+	 * We don't care about NUMA placement if we don't have memory.
+	 */
+	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+		return;
+
+	/*
+	 * Using runtime rather than walltime has the dual advantage that
+	 * we (mostly) drive the selection from busy threads and that the
+	 * task needs to have done some actual work before we bother with
+	 * NUMA placement.
+	 */
+	now = curr->se.sum_exec_runtime;
+	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
+
+	if (now - curr->node_stamp > period) {
+		curr->node_stamp = now;
+
+		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
+			task_work_add(curr, work, true);
+		}
+	}
+}
+#else
+static void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -4954,6 +5076,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		cfs_rq = cfs_rq_of(se);
 		entity_tick(cfs_rq, se, queued);
 	}
+
+	if (sched_feat_numa(NUMA))
+		task_tick_numa(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index eebefca..7cfd289 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -61,3 +61,10 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
+
+/*
+ * Apply the automatic NUMA scheduling policy
+ */
+#ifdef CONFIG_BALANCE_NUMA
+SCHED_FEAT(NUMA,	true)
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7a7db09..9a43241 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -648,6 +648,12 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
 #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
+#ifdef CONFIG_BALANCE_NUMA
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 26f65ea..1359f51 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 100000;		/* 100 usecs */
 static int max_sched_granularity_ns = NSEC_PER_SEC;	/* 1 second */
 static int min_wakeup_granularity_ns;			/* 0 usecs */
 static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
+#ifdef CONFIG_SMP
 static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
 static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
-#endif
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHED_DEBUG */
 
 #ifdef CONFIG_COMPACTION
 static int min_extfrag_threshold;
@@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &min_wakeup_granularity_ns,
 		.extra2		= &max_wakeup_granularity_ns,
 	},
+#ifdef CONFIG_SMP
 	{
 		.procname	= "sched_tunable_scaling",
 		.data		= &sysctl_sched_tunable_scaling,
@@ -347,7 +350,24 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
-#endif
+#endif /* CONFIG_SMP */
+#ifdef CONFIG_BALANCE_NUMA
+	{
+		.procname	= "balance_numa_scan_period_min_ms",
+		.data		= &sysctl_balance_numa_scan_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "balance_numa_scan_period_max_ms",
+		.data		= &sysctl_balance_numa_scan_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif /* CONFIG_BALANCE_NUMA */
+#endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
 		.data		= &sysctl_sched_rt_period,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 68e0412..b3d4c4b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1045,6 +1045,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	split_huge_page(page);
 	put_page(page);
+
 	return 0;
 
 clear_pmdnuma:
@@ -1059,8 +1060,10 @@ clear_pmdnuma:
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
-	if (page)
+	if (page) {
 		put_page(page);
+		task_numa_fault(numa_node_id(), HPAGE_PMD_NR);
+	}
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 1757ad8..1d6f85a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3454,7 +3454,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid, target_nid;
+	int current_nid = -1;
+	int target_nid;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3501,6 +3502,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		current_nid = target_nid;
 
 out:
+	task_numa_fault(current_nid, 1);
 	return 0;
 }
 
@@ -3537,6 +3539,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page *page;
+		int curr_nid;
 		if (!pte_present(pteval))
 			continue;
 		if (!pte_numa(pteval))
@@ -3554,6 +3557,15 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pteval);
 		if (unlikely(!page))
 			continue;
+		/* only check non-shared pages */
+		if (unlikely(page_mapcount(page) != 1))
+			continue;
+		pte_unmap_unlock(pte, ptl);
+
+		curr_nid = page_to_nid(page);
+		task_numa_fault(curr_nid, 1);
+
+		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
-- 
1.7.9.2



* [PATCH 26/49] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (24 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 25/49] mm: numa: Add fault driven placement and migration Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 27/49] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Mel Gorman
                   ` (25 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

 - it samples the working set at dissimilar rates,
   giving some tasks a sampling quality advantage
   over others.

 - creates performance problems for tasks with very
   large working sets

 - over-samples processes with large address spaces but
   which only very rarely execute

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it at a constant rate (in proportion to the CPU
time the task has executed). If the offset reaches the last
mapped address of the mm then it starts over at the first
address.
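
The rotating offset can be sketched in userspace roughly as follows. This is
a simplified model, not the kernel code: struct scan_range and rotate_scan()
are hypothetical names, the ranges are assumed to be sorted and
non-overlapping, and the huge-page alignment plus the migratable and
small-VMA checks done in task_numa_work() are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: the half-open range [start, end). */
struct scan_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Scan up to @budget bytes starting at @offset and return the offset at
 * which the next scan should resume. A return value of 0 means the scan
 * reached the end of the last range and starts over.
 */
static unsigned long rotate_scan(const struct scan_range *ranges, size_t nr,
				 unsigned long offset, unsigned long budget,
				 unsigned long *scanned)
{
	size_t i = 0;

	assert(ranges != NULL && scanned != NULL);
	*scanned = 0;
	if (nr == 0)
		return 0;

	/* Like find_vma(): pick the first range that ends above @offset. */
	while (i < nr && ranges[i].end <= offset)
		i++;
	if (i == nr) {
		/* Offset is past the last range: restart from the beginning. */
		offset = 0;
		i = 0;
	}

	for (; i < nr; i++) {
		unsigned long start = offset > ranges[i].start ?
				      offset : ranges[i].start;
		unsigned long len = ranges[i].end - start;

		if (len > budget)
			len = budget;
		*scanned += len;
		budget -= len;
		offset = start + len;
		if (budget == 0)
			break;
	}

	/* Wrap when the walk ran off the end or finished the last range. */
	if (i >= nr || (i == nr - 1 && offset == ranges[i].end))
		return 0;
	return offset;
}
```

Because every call consumes at most the same budget, a task with a large
address space is covered over several calls instead of all at once.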

The per-task nature of the working set sampling functionality in this tree
allows such constant rate, per task, execution-weight proportional sampling
of the working set, with an adaptive sampling interval/frequency that
goes from once per 100ms up to just once per 8 seconds.  The current
sampling volume is 256 MB per interval.

As tasks mature and converge their working set, so does the
sampling rate slow down to just a trickle, 256 MB per 8
seconds of CPU time executed.

This, beyond being adaptive, also rate-limits rarely
executing systems and does not over-sample on overloaded
systems.

[ In AutoNUMA speak, this patch deals with the effective sampling
  rate of the 'hinting page fault'. AutoNUMA's scanning is
  currently rate-limited, but it is also fundamentally
  single-threaded, executing in the knuma_scand kernel thread,
  so the limit in AutoNUMA is global and does not scale up with
  the number of CPUs, nor does it scan tasks in an execution
  proportional manner.

  So the idea of rate-limiting the scanning was first implemented
  in the AutoNUMA tree via a global rate limit. This patch goes
  beyond that by implementing an execution rate proportional
  working set sampling rate that is not implemented via a single
  global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the
  first version of this patch. ]

Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mm_types.h |    3 +++
 include/linux/sched.h    |    1 +
 kernel/sched/fair.c      |   65 ++++++++++++++++++++++++++++++++++++----------
 kernel/sysctl.c          |    7 +++++
 4 files changed, 63 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d82accb..b40f4ef 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -406,6 +406,9 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
+	/* Restart point for scanning and setting pte_numa */
+	unsigned long numa_scan_offset;
+
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ac71181..abb1c70 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2008,6 +2008,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_scan_size;
 extern unsigned int sysctl_balance_numa_settle_count;
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b6d3ed7..66d8bd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -780,10 +780,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_BALANCE_NUMA
 /*
- * numa task sample period in ms: 5s
+ * numa task sample period in ms
  */
-unsigned int sysctl_balance_numa_scan_period_min = 5000;
-unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
+unsigned int sysctl_balance_numa_scan_period_min = 100;
+unsigned int sysctl_balance_numa_scan_period_max = 100*16;
+
+/* Portion of address space to scan in MB */
+unsigned int sysctl_balance_numa_scan_size = 256;
 
 static void task_numa_placement(struct task_struct *p)
 {
@@ -808,6 +811,12 @@ void task_numa_fault(int node, int pages)
 	task_numa_placement(p);
 }
 
+static void reset_ptenuma_scan(struct task_struct *p)
+{
+	ACCESS_ONCE(p->mm->numa_scan_seq)++;
+	p->mm->numa_scan_offset = 0;
+}
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -817,6 +826,9 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	struct vm_area_struct *vma;
+	unsigned long offset, end;
+	long length;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -846,18 +858,45 @@ void task_numa_work(struct callback_head *work)
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-	ACCESS_ONCE(mm->numa_scan_seq)++;
-	{
-		struct vm_area_struct *vma;
+	offset = mm->numa_scan_offset;
+	length = sysctl_balance_numa_scan_size;
+	length <<= 20;
 
-		down_read(&mm->mmap_sem);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			if (!vma_migratable(vma))
-				continue;
-			change_prot_numa(vma, vma->vm_start, vma->vm_end);
-		}
-		up_read(&mm->mmap_sem);
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, offset);
+	if (!vma) {
+		reset_ptenuma_scan(p);
+		offset = 0;
+		vma = mm->mmap;
+	}
+	for (; vma && length > 0; vma = vma->vm_next) {
+		if (!vma_migratable(vma))
+			continue;
+
+		/* Skip small VMAs. They are not likely to be of relevance */
+		if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
+			continue;
+
+		offset = max(offset, vma->vm_start);
+		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+		length -= end - offset;
+
+		change_prot_numa(vma, offset, end);
+
+		offset = end;
 	}
+
+	/*
+	 * It is possible to reach the end of the VMA list but the last few VMAs are
+	 * not guaranteed to the vma_migratable. If they are not, we would find the
+	 * !migratable VMA on the next scan but not reset the scanner to the start
+	 * so check it now.
+	 */
+	if (vma)
+		mm->numa_scan_offset = offset;
+	else
+		reset_ptenuma_scan(p);
+	up_read(&mm->mmap_sem);
 }
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1359f51..d191203 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -366,6 +366,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "balance_numa_scan_size_mb",
+		.data		= &sysctl_balance_numa_scan_size,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #endif /* CONFIG_BALANCE_NUMA */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.7.9.2



* [PATCH 27/49] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (25 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 26/49] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 28/49] mm: sched: numa: Implement slow start for working set sampling Mel Gorman
                   ` (24 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

By accounting against the present PTEs, scanning speed reflects the
actual present (mapped) memory.
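
A minimal userspace sketch of the accounting change, under stated
assumptions: the scan budget is expressed in pages and only present pages are
charged against it, which is what subtracting the return value of
change_prot_numa() achieves in the diff below. The helpers scan_mb_to_pages()
and scan_present_pages() and the 4 KiB MODEL_PAGE_SHIFT are hypothetical, and
the kernel works on huge-page-aligned chunks rather than page by page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Convert a scan size in MB to a page budget, as "pages <<= 20 - PAGE_SHIFT". */
static long scan_mb_to_pages(unsigned int mb)
{
	return (long)mb << (20 - MODEL_PAGE_SHIFT);
}

/*
 * Walk the pages from @start, charging only present pages against @budget.
 * Returns the index of the next page to scan and stores the number of
 * present pages visited in @updated.
 */
static size_t scan_present_pages(const bool *present, size_t nr_pages,
				 size_t start, long budget, long *updated)
{
	size_t i = start;

	assert(present != NULL && updated != NULL);
	*updated = 0;

	while (i < nr_pages && budget > 0) {
		if (present[i]) {
			(*updated)++;
			budget--;
		}
		i++;	/* holes are skipped without consuming any budget */
	}
	return i;
}
```

A sparsely populated mapping therefore advances the scan position further per
pass than a dense one, since unmapped holes cost nothing.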

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c |   36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 66d8bd2..773ef97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -827,8 +827,8 @@ void task_numa_work(struct callback_head *work)
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
-	unsigned long offset, end;
-	long length;
+	unsigned long start, end;
+	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -858,18 +858,20 @@ void task_numa_work(struct callback_head *work)
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-	offset = mm->numa_scan_offset;
-	length = sysctl_balance_numa_scan_size;
-	length <<= 20;
+	start = mm->numa_scan_offset;
+	pages = sysctl_balance_numa_scan_size;
+	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+	if (!pages)
+		return;
 
 	down_read(&mm->mmap_sem);
-	vma = find_vma(mm, offset);
+	vma = find_vma(mm, start);
 	if (!vma) {
 		reset_ptenuma_scan(p);
-		offset = 0;
+		start = 0;
 		vma = mm->mmap;
 	}
-	for (; vma && length > 0; vma = vma->vm_next) {
+	for (; vma; vma = vma->vm_next) {
 		if (!vma_migratable(vma))
 			continue;
 
@@ -877,15 +879,19 @@ void task_numa_work(struct callback_head *work)
 		if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
 			continue;
 
-		offset = max(offset, vma->vm_start);
-		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
-		length -= end - offset;
-
-		change_prot_numa(vma, offset, end);
+		do {
+			start = max(start, vma->vm_start);
+			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
+			end = min(end, vma->vm_end);
+			pages -= change_prot_numa(vma, start, end);
 
-		offset = end;
+			start = end;
+			if (pages <= 0)
+				goto out;
+		} while (end != vma->vm_end);
 	}
 
+out:
 	/*
 	 * It is possible to reach the end of the VMA list but the last few VMAs are
 	 * not guaranteed to the vma_migratable. If they are not, we would find the
@@ -893,7 +899,7 @@ void task_numa_work(struct callback_head *work)
 	 * so check it now.
 	 */
 	if (vma)
-		mm->numa_scan_offset = offset;
+		mm->numa_scan_offset = start;
 	else
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
-- 
1.7.9.2



* [PATCH 28/49] mm: sched: numa: Implement slow start for working set sampling
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (26 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 27/49] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 29/49] mm: numa: Add pte updates, hinting and migration stats Mel Gorman
                   ` (23 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Add a 1 second delay before starting to scan the working set of
a task and starting to balance it amongst nodes.

[ Note that before the constant per-task WSS sampling rate patch
  the initial scan would happen much later still; in effect, that
  patch caused this regression. ]

The theory is that short-run tasks benefit very little from NUMA
placement: they come and go, and they are better off sticking to
the node they were started on. As tasks mature and rebalance to
other CPUs and nodes, their NUMA placement has to change as well,
and it starts to matter more and more.

In practice this change fixes an observable kbuild regression:

   # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

   !NUMA:
   45.291088843 seconds time elapsed                                          ( +-  0.40% )
   45.154231752 seconds time elapsed                                          ( +-  0.36% )

   +NUMA, no slow start:
   46.172308123 seconds time elapsed                                          ( +-  0.30% )
   46.343168745 seconds time elapsed                                          ( +-  0.25% )

   +NUMA, 1 sec slow start:
   45.224189155 seconds time elapsed                                          ( +-  0.25% )
   45.160866532 seconds time elapsed                                          ( +-  0.17% )

and it also fixes an observable perf bench (hackbench) regression:

   # perf stat --null --repeat 10 perf bench sched messaging

   -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
   +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
   +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )

The implementation is simple and straightforward; most of the patch
deals with adding the /proc/sys/kernel/balance_numa_scan_delay_ms tunable
knob.
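
The slow start itself can be sketched in userspace as below. This is an
illustrative model only: struct slow_start_model and slow_start_tick() are
hypothetical names, and the two constants mirror the defaults visible in this
series (a 1000 ms initial delay and a 100 ms minimum scan period).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_MSEC		1000000ULL
#define SCAN_DELAY_MS		1000U	/* default of balance_numa_scan_delay_ms */
#define SCAN_PERIOD_MIN_MS	100U	/* default of balance_numa_scan_period_min_ms */

/* Hypothetical stand-in for the task_struct fields involved. */
struct slow_start_model {
	uint64_t node_stamp;		/* 0 until the first trigger */
	unsigned int scan_period_ms;	/* starts out as SCAN_DELAY_MS */
};

/*
 * Return true when a scan should be triggered. The first trigger needs
 * SCAN_DELAY_MS of CPU runtime; after that the period drops to
 * SCAN_PERIOD_MIN_MS.
 */
static bool slow_start_tick(struct slow_start_model *t, uint64_t runtime_ns)
{
	uint64_t period;

	assert(t != NULL);
	period = (uint64_t)t->scan_period_ms * NSEC_PER_MSEC;

	if (runtime_ns - t->node_stamp > period) {
		if (!t->node_stamp)
			t->scan_period_ms = SCAN_PERIOD_MIN_MS;
		t->node_stamp = runtime_ns;
		return true;
	}
	return false;
}
```

A task that exits before accumulating one second of CPU time is never scanned
at all, which is the behaviour the kbuild and hackbench numbers above rely on.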

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote the changelog, ran measurements, tuned the default. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    2 +-
 kernel/sched/fair.c   |    5 +++++
 kernel/sysctl.c       |    7 +++++++
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index abb1c70..a2b06ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2006,6 +2006,7 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_balance_numa_scan_delay;
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
 extern unsigned int sysctl_balance_numa_scan_size;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81fa185..047e3c7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1543,7 +1543,7 @@ static void __sched_fork(struct task_struct *p)
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
-	p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+	p->numa_scan_period = sysctl_balance_numa_scan_delay;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_BALANCE_NUMA */
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 773ef97..2e65f44 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -788,6 +788,9 @@ unsigned int sysctl_balance_numa_scan_period_max = 100*16;
 /* Portion of address space to scan in MB */
 unsigned int sysctl_balance_numa_scan_size = 256;
 
+/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
+unsigned int sysctl_balance_numa_scan_delay = 1000;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -929,6 +932,8 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
 
 	if (now - curr->node_stamp > period) {
+		if (!curr->node_stamp)
+			curr->numa_scan_period = sysctl_balance_numa_scan_period_min;
 		curr->node_stamp = now;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d191203..5ee587d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -353,6 +353,13 @@ static struct ctl_table kern_table[] = {
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_BALANCE_NUMA
 	{
+		.procname	= "balance_numa_scan_delay_ms",
+		.data		= &sysctl_balance_numa_scan_delay,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "balance_numa_scan_period_min_ms",
 		.data		= &sysctl_balance_numa_scan_period_min,
 		.maxlen		= sizeof(unsigned int),
-- 
1.7.9.2



* [PATCH 29/49] mm: numa: Add pte updates, hinting and migration stats
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (27 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 28/49] mm: sched: numa: Implement slow start for working set sampling Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2013-01-04 11:42   ` Simon Jeons
  2012-12-07 10:23 ` [PATCH 30/49] mm: numa: Migrate on reference policy Mel Gorman
                   ` (22 subsequent siblings)
  51 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

It is tricky to quantify the basic cost of automatic NUMA placement in a
meaningful manner. This patch adds some vmstats that can be used as part
of a basic costing model.

u    = basic unit = sizeof(void *)
Ca   = cost of struct page access = sizeof(struct page) / u
Cpte = Cost PTE access = Ca
Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
	where Cpte is incurred twice for a read and a write and Wlock
	is a constant representing the cost of taking or releasing a
	lock
Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
Ci = Cost of page isolation = Ca + Wi
	where Wi is a constant that should reflect the approximate cost
	of the locking operation
Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
	would imply that remote accesses are 20% more expensive

Balancing cost = Cpte * numa_pte_updates +
		Cnumahint * numa_hint_faults +
		Ci * numa_pages_migrated +
		Cpagecopy * numa_pages_migrated

Note that numa_pages_migrated is used as a measure of how many pages
were isolated even though it would miss pages that failed to migrate. A
vmstat counter could have been added for it but the isolation cost is
pretty marginal in comparison to the overall cost so it seemed overkill.
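
As a rough illustration, the model above can be evaluated in userspace
from the three vmstat counters. This is only a sketch: the struct and
function names are hypothetical, the weights (Wi, Wnuma, Cnumahint) are
placeholders that would need tuning, and the basic unit is passed in
rather than taken from sizeof(void *) so the result does not depend on
the build. Cupdate is left out because the total as written charges
Cpte per update.

```c
#include <assert.h>

/* Hypothetical container for the constants in the costing model. */
struct numa_cost_params {
	double unit;             /* u: basic unit in bytes, e.g. 8 */
	double page_struct_size; /* sizeof(struct page) in bytes (assumed) */
	double page_size;        /* PAGE_SIZE in bytes */
	double w_isolate;        /* Wi: approximate cost of isolation locking */
	double w_numa;           /* Wnuma: NUMA factor, 1.0 means local */
	double c_numahint;       /* Cnumahint: cost of a minor page fault */
};

/* Counter values as read from /proc/vmstat. */
struct numa_counters {
	unsigned long pte_updates;    /* numa_pte_updates */
	unsigned long hint_faults;    /* numa_hint_faults */
	unsigned long pages_migrated; /* numa_pages_migrated */
};

/* Evaluate "Balancing cost" term by term, as the formula is written. */
double numa_balancing_cost(const struct numa_cost_params *p,
			   const struct numa_counters *c)
{
	assert(p->unit > 0.0 && p->w_numa >= 1.0);

	double ca = p->page_struct_size / p->unit;    /* Ca */
	double cpte = ca;                             /* Cpte */
	double cpagerw = ca + p->page_size / p->unit; /* Cpagerw */
	double ci = ca + p->w_isolate;                /* Ci */
	double cpagecopy = cpagerw + cpagerw * p->w_numa +
			   ci + ci * p->w_numa;       /* Cpagecopy */

	return cpte * (double)c->pte_updates +
	       p->c_numahint * (double)c->hint_faults +
	       ci * (double)c->pages_migrated +
	       cpagecopy * (double)c->pages_migrated;
}
```

With u = 8, a 64-byte struct page, 4096-byte pages, Wi = 2, Wnuma = 1.5
and Cnumahint = 1000, counters of 100 PTE updates, 10 hinting faults and
2 migrated pages give a cost of 13470 units.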

The ideal way to measure automatic placement benefit would be to count
the number of remote accesses versus local accesses and do something like

	benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

but the information is not readily available. As a workload converges, the
expectation would be that the number of remote numa hints would reduce to 0.

	convergence = numa_hint_faults_local / numa_hint_faults
		where this is measured for the last N number of
		numa hints recorded. When the workload is fully
		converged the value is 1.

This can measure if the placement policy is converging and how fast it is
doing it.
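
One possible way to compute this from userspace is to take two samples
of numa_hint_faults and numa_hint_faults_local from /proc/vmstat and
divide the deltas. The sketch below assumes the samples have already
been parsed; the struct and function names are hypothetical, and the
-1.0 return for "no hinting faults in the interval" is a choice made
here, not something the patch defines.

```c
#include <assert.h>

/* One sample of the two hinting-fault counters. */
struct hint_snapshot {
	unsigned long hint_faults;       /* numa_hint_faults */
	unsigned long hint_faults_local; /* numa_hint_faults_local */
};

/*
 * Fraction of hinting faults in the interval that were node-local.
 * Returns a value in [0, 1], or -1.0 if no faults were recorded.
 */
double numa_convergence(const struct hint_snapshot *before,
			const struct hint_snapshot *after)
{
	/* The counters only ever increase between samples. */
	assert(after->hint_faults >= before->hint_faults);
	assert(after->hint_faults_local >= before->hint_faults_local);

	unsigned long total = after->hint_faults - before->hint_faults;
	unsigned long local = after->hint_faults_local -
			      before->hint_faults_local;

	if (total == 0)
		return -1.0;
	return (double)local / (double)total;
}
```

Sampling at a fixed interval and watching the value approach 1 gives a
crude view of how quickly the placement policy converges.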

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/vm_event_item.h |    6 ++++++
 include/linux/vmstat.h        |    8 ++++++++
 mm/huge_memory.c              |    5 +++++
 mm/memory.c                   |   12 ++++++++++++
 mm/mempolicy.c                |    2 ++
 mm/migrate.c                  |    3 ++-
 mm/vmstat.c                   |    6 ++++++
 7 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index a1f750b..dded0af 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,6 +38,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_BALANCE_NUMA
+		NUMA_PTE_UPDATES,
+		NUMA_HINT_FAULTS,
+		NUMA_HINT_FAULTS_LOCAL,
+		NUMA_PAGE_MIGRATE,
+#endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
 #endif
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 92a86b2..dffccfa 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -80,6 +80,14 @@ static inline void vm_events_fold_cpu(int cpu)
 
 #endif /* CONFIG_VM_EVENT_COUNTERS */
 
+#ifdef CONFIG_BALANCE_NUMA
+#define count_vm_numa_event(x)     count_vm_event(x)
+#define count_vm_numa_events(x, y) count_vm_events(x, y)
+#else
+#define count_vm_numa_event(x) do {} while (0)
+#define count_vm_numa_events(x, y) do {} while (0)
+#endif /* CONFIG_BALANCE_NUMA */
+
 #define __count_zone_vm_events(item, zone, delta) \
 		__count_vm_events(item##_NORMAL - ZONE_NORMAL + \
 		zone_idx(zone), delta)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b3d4c4b..66e73cc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1025,6 +1025,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int target_nid;
+	int current_nid = -1;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1033,6 +1034,10 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
+	current_nid = page_to_nid(page);
+	count_vm_numa_event(NUMA_HINT_FAULTS);
+	if (current_nid == numa_node_id())
+		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1)
diff --git a/mm/memory.c b/mm/memory.c
index 1d6f85a..47f5dd1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3477,6 +3477,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
 
+	count_vm_numa_event(NUMA_HINT_FAULTS);
 	page = vm_normal_page(vma, addr, pte);
 	if (!page) {
 		pte_unmap_unlock(ptep, ptl);
@@ -3485,6 +3486,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	get_page(page);
 	current_nid = page_to_nid(page);
+	if (current_nid == numa_node_id())
+		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 	target_nid = mpol_misplaced(page, vma, addr);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
@@ -3517,6 +3520,9 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
+	int local_nid = numa_node_id();
+	unsigned long nr_faults = 0;
+	unsigned long nr_faults_local = 0;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3565,10 +3571,16 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		curr_nid = page_to_nid(page);
 		task_numa_fault(curr_nid, 1);
 
+		nr_faults++;
+		if (curr_nid == local_nid)
+			nr_faults_local++;
+
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
+	count_vm_numa_events(NUMA_HINT_FAULTS, nr_faults);
+	count_vm_numa_events(NUMA_HINT_FAULTS_LOCAL, nr_faults_local);
 	return 0;
 }
 #else
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a7a62fe..516491f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -583,6 +583,8 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 	BUILD_BUG_ON(_PAGE_NUMA != _PAGE_PROTNONE);
 
 	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
+	if (nr_updated)
+		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
 
 	return nr_updated;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 49878d7..4f55694 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1514,7 +1514,8 @@ int migrate_misplaced_page(struct page *page, int node)
 		if (nr_remaining) {
 			putback_lru_pages(&migratepages);
 			isolated = 0;
-		}
+		} else
+			count_vm_numa_event(NUMA_PAGE_MIGRATE);
 	}
 	BUG_ON(!list_empty(&migratepages));
 out:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3a067fa..cfa386da 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -774,6 +774,12 @@ const char * const vmstat_text[] = {
 
 	"pgrotated",
 
+#ifdef CONFIG_BALANCE_NUMA
+	"numa_pte_updates",
+	"numa_hint_faults",
+	"numa_hint_faults_local",
+	"numa_pages_migrated",
+#endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
 	"pgmigrate_fail",
-- 
1.7.9.2



* [PATCH 30/49] mm: numa: Migrate on reference policy
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (28 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 29/49] mm: numa: Add pte updates, hinting and migration stats Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 31/49] mm: numa: Migrate pages handled during a pmd_numa hinting fault Mel Gorman
                   ` (21 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

This is the simplest possible policy that still does something of note.
When a pte_numa is faulted, it is moved immediately. Any replacement
policy must at least do better than this and in all likelihood this
policy regresses normal workloads.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   38 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 16fb4e6..0d11c3d 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -67,6 +67,7 @@ enum mpol_rebind_step {
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
+#define MPOL_F_MORON	(1 << 4) /* Migrate On pte_numa Reference On Node */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 516491f..4c1c8d8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -118,6 +118,26 @@ static struct mempolicy default_policy = {
 	.flags = MPOL_F_LOCAL,
 };
 
+static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+
+static struct mempolicy *get_task_policy(struct task_struct *p)
+{
+	struct mempolicy *pol = p->mempolicy;
+	int node;
+
+	if (!pol) {
+		node = numa_node_id();
+		if (node != -1)
+			pol = &preferred_node_policy[node];
+
+		/* preferred_node_policy is not initialised early in boot */
+		if (!pol->mode)
+			pol = NULL;
+	}
+
+	return pol;
+}
+
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
 	/*
@@ -1598,7 +1618,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = task->mempolicy;
+	struct mempolicy *pol = get_task_policy(task);
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
@@ -2021,7 +2041,7 @@ retry_cpuset:
  */
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = get_task_policy(current);
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
 
@@ -2295,6 +2315,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	default:
 		BUG();
 	}
+
+	/* Migrate the page towards the node whose CPU is referencing it */
+	if (pol->flags & MPOL_F_MORON)
+		polnid = numa_node_id();
+
 	if (curnid != polnid)
 		ret = polnid;
 out:
@@ -2483,6 +2508,15 @@ void __init numa_policy_init(void)
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL);
 
+	for_each_node(nid) {
+		preferred_node_policy[nid] = (struct mempolicy) {
+			.refcnt = ATOMIC_INIT(1),
+			.mode = MPOL_PREFERRED,
+			.flags = MPOL_F_MOF | MPOL_F_MORON,
+			.v = { .preferred_node = nid, },
+		};
+	}
+
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
 	 * enabled across suitably sized nodes (default is >= 16MB), or
-- 
1.7.9.2



* [PATCH 31/49] mm: numa: Migrate pages handled during a pmd_numa hinting fault
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (29 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 30/49] mm: numa: Migrate on reference policy Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 32/49] mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting Mel Gorman
                   ` (20 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

To say that the PMD handling code was incorrectly transferred from autonuma
is an understatement. The intention was to handle a PMD's worth of pages
in the same fault and effectively batch the taking of the PTL and page
migration. The copied version instead has the impact of clearing a number
of pte_numa PTE entries and whether any page migration takes place depends
on racing. This just happens to work in some cases.

This patch handles pte_numa faults in batch when a pmd_numa fault is
handled. The pages are migrated if they are currently misplaced.
Essentially this assumes that NUMA locality is on a PMD boundary. If
necessary, that could be addressed by only setting pmd_numa when all
the pages within that PMD are on the same node.
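
The "all pages on the same node" test can be sketched in isolation as
below. This is a userspace illustration only: the function name and the
array-of-node-ids input are hypothetical, with -1 standing in for a PTE
that has no normal page behind it. An empty range counts as "same node".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if every entry that maps a page (nid != -1) refers to
 * the same node as the first such entry.
 */
bool pmd_pages_all_same_node(const int *nids, size_t n)
{
	int last_nid = -1;

	assert(nids != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (nids[i] == -1)
			continue;           /* no page to examine */
		if (last_nid == -1)
			last_nid = nids[i]; /* remember the first node seen */
		if (last_nid != nids[i])
			return false;
	}
	return true;
}
```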

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/memory.c   |   51 ++++++++++++++++++++++++++++++++++-----------------
 mm/mprotect.c |   25 ++++++++++++++++++++-----
 2 files changed, 54 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 47f5dd1..6a1e534 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3449,6 +3449,18 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+				unsigned long addr, int current_nid)
+{
+	get_page(page);
+
+	count_vm_numa_event(NUMA_HINT_FAULTS);
+	if (current_nid == numa_node_id())
+		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+
+	return mpol_misplaced(page, vma, addr);
+}
+
 int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
 {
@@ -3477,18 +3489,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
 
-	count_vm_numa_event(NUMA_HINT_FAULTS);
 	page = vm_normal_page(vma, addr, pte);
 	if (!page) {
 		pte_unmap_unlock(ptep, ptl);
 		return 0;
 	}
 
-	get_page(page);
 	current_nid = page_to_nid(page);
-	if (current_nid == numa_node_id())
-		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
-	target_nid = mpol_misplaced(page, vma, addr);
+	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
 		/*
@@ -3505,7 +3513,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		current_nid = target_nid;
 
 out:
-	task_numa_fault(current_nid, 1);
+	if (current_nid != -1)
+		task_numa_fault(current_nid, 1);
 	return 0;
 }
 
@@ -3521,8 +3530,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	bool numa = false;
 	int local_nid = numa_node_id();
-	unsigned long nr_faults = 0;
-	unsigned long nr_faults_local = 0;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3545,7 +3552,8 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page *page;
-		int curr_nid;
+		int curr_nid = local_nid;
+		int target_nid;
 		if (!pte_present(pteval))
 			continue;
 		if (!pte_numa(pteval))
@@ -3566,21 +3574,30 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* only check non-shared pages */
 		if (unlikely(page_mapcount(page) != 1))
 			continue;
-		pte_unmap_unlock(pte, ptl);
 
-		curr_nid = page_to_nid(page);
-		task_numa_fault(curr_nid, 1);
+		/*
+		 * Note that the NUMA fault is later accounted to either
+		 * the node that is currently running or where the page is
+		 * migrated to.
+		 */
+		curr_nid = local_nid;
+		target_nid = numa_migrate_prep(page, vma, addr,
+					       page_to_nid(page));
+		if (target_nid == -1) {
+			put_page(page);
+			continue;
+		}
 
-		nr_faults++;
-		if (curr_nid == local_nid)
-			nr_faults_local++;
+		/* Migrate to the requested node */
+		pte_unmap_unlock(pte, ptl);
+		if (migrate_misplaced_page(page, target_nid))
+			curr_nid = target_nid;
+		task_numa_fault(curr_nid, 1);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
 	pte_unmap_unlock(orig_pte, ptl);
 
-	count_vm_numa_events(NUMA_HINT_FAULTS, nr_faults);
-	count_vm_numa_events(NUMA_HINT_FAULTS_LOCAL, nr_faults_local);
 	return 0;
 }
 #else
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8abf7c6..629dba1 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,12 +37,14 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		int dirty_accountable, int prot_numa, bool *ret_all_same_node)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
+	bool all_same_node = true;
+	int last_nid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -61,6 +63,12 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
+					int this_nid = page_to_nid(page);
+					if (last_nid == -1)
+						last_nid = this_nid;
+					if (last_nid != this_nid)
+						all_same_node = false;
+
 					/* only check non-shared pages */
 					if (!pte_numa(oldpte) &&
 					    page_mapcount(page) == 1) {
@@ -81,7 +89,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 			if (updated)
 				pages++;
-
 			ptep_modify_prot_commit(mm, addr, pte, ptent);
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
@@ -101,6 +108,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
+	*ret_all_same_node = all_same_node;
 	return pages;
 }
 
@@ -127,6 +135,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
+	bool all_same_node;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -143,9 +152,15 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		pages += change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
-
-		if (prot_numa)
+				 dirty_accountable, prot_numa, &all_same_node);
+
+		/*
+		 * If we are changing protections for NUMA hinting faults then
+		 * set pmd_numa if the examined pages were all on the same
+		 * node. This allows a regular PMD to be handled as one fault
+		 * and effectively batches the taking of the PTL
+		 */
+		if (prot_numa && all_same_node)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
-- 
1.7.9.2



* [PATCH 32/49] mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (30 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 31/49] mm: numa: Migrate pages handled during a pmd_numa hinting fault Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 33/49] mm: numa: Rate limit the amount of memory that is migrated between nodes Mel Gorman
                   ` (19 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

This defines the per-node data used by Migrate On Fault in order to
rate limit the migration. The rate limiting is applied independently
to each destination node.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |   13 +++++++++++++
 mm/page_alloc.c        |    5 +++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a23923b..1ed16e5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -717,6 +717,19 @@ typedef struct pglist_data {
 	struct task_struct *kswapd;	/* Protected by lock_memory_hotplug() */
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
+#ifdef CONFIG_BALANCE_NUMA
+	/*
+	 * Lock serializing the per destination node AutoNUMA memory
+	 * migration rate limiting data.
+	 */
+	spinlock_t balancenuma_migrate_lock;
+
+	/* Rate limiting time interval */
+	unsigned long balancenuma_migrate_next_window;
+
+	/* Number of pages migrated during the rate limiting time interval */
+	unsigned long balancenuma_migrate_nr_pages;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5953dc2..df58654 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4449,6 +4449,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int ret;
 
 	pgdat_resize_init(pgdat);
+#ifdef CONFIG_BALANCE_NUMA
+	spin_lock_init(&pgdat->balancenuma_migrate_lock);
+	pgdat->balancenuma_migrate_nr_pages = 0;
+	pgdat->balancenuma_migrate_next_window = jiffies;
+#endif
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 	pgdat_page_cgroup_init(pgdat);
-- 
1.7.9.2



* [PATCH 33/49] mm: numa: Rate limit the amount of memory that is migrated between nodes
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (31 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 32/49] mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 34/49] mm: numa: Rate limit setting of pte_numa if node is saturated Mel Gorman
                   ` (18 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

NOTE: This is very heavily based on similar logic in autonuma. It should
	be signed off by Andrea, but because there was no standalone
	patch and it is sufficiently different from what he did, the
	sign-off is omitted. It will be added back if requested.

If a large number of pages are misplaced then the memory bus can be
saturated just migrating pages between nodes. This patch rate-limits
the amount of memory that can be migrated between nodes.
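
The window-based limit can be sketched in userspace as follows. The
names are hypothetical, time is a plain tick counter passed in by the
caller, and the sketch ignores both locking and jiffies wraparound,
which the kernel code handles with a spinlock and time_after(). Assuming
4K pages (PAGE_SHIFT of 12), the default limit of 128 << (20 - 12) is
32768 pages per 100ms window, which is where 1280M per second comes
from.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-destination-node rate limiting state (illustrative only). */
struct migrate_window {
	unsigned long next_window; /* tick after which the window resets */
	unsigned long nr_pages;    /* pages charged in the current window */
};

/*
 * Charge one page against the window. Returns true if the migration
 * may proceed, false if the node is over its limit. The comparison is
 * '>' as in the patch, so limit + 1 pages pass in each window.
 */
bool migrate_charge_page(struct migrate_window *w, unsigned long now,
			 unsigned long interval, unsigned long limit)
{
	assert(interval > 0);

	if (now > w->next_window) {
		/* Window expired: start a new one. */
		w->nr_pages = 0;
		w->next_window = now + interval;
	}
	if (w->nr_pages > limit)
		return false;

	w->nr_pages++;
	return true;
}
```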

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c |   30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 4f55694..b2e6d4c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1461,12 +1461,21 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 }
 
 /*
+ * page migration rate limiting control.
+ * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs
+ * window of time. Default here says do not migrate more than 1280M per second.
+ */
+static unsigned int migrate_interval_millisecs __read_mostly = 100;
+static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
+
+/*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
 int migrate_misplaced_page(struct page *page, int node)
 {
+	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	LIST_HEAD(migratepages);
 
@@ -1479,8 +1488,27 @@ int migrate_misplaced_page(struct page *page, int node)
 		goto out;
 	}
 
+	/*
+	 * Rate-limit the amount of data that is being migrated to a node.
+	 * Optimal placement is no good if the memory bus is saturated and
+	 * all the time is being spent migrating!
+	 */
+	spin_lock(&pgdat->balancenuma_migrate_lock);
+	if (time_after(jiffies, pgdat->balancenuma_migrate_next_window)) {
+		pgdat->balancenuma_migrate_nr_pages = 0;
+		pgdat->balancenuma_migrate_next_window = jiffies +
+			msecs_to_jiffies(migrate_interval_millisecs);
+	}
+	if (pgdat->balancenuma_migrate_nr_pages > ratelimit_pages) {
+		spin_unlock(&pgdat->balancenuma_migrate_lock);
+		put_page(page);
+		goto out;
+	}
+	pgdat->balancenuma_migrate_nr_pages++;
+	spin_unlock(&pgdat->balancenuma_migrate_lock);
+
 	/* Avoid migrating to a node that is nearly full */
-	if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+	if (migrate_balanced_pgdat(pgdat, 1)) {
 		int page_lru;
 
 		if (isolate_lru_page(page)) {
-- 
1.7.9.2



* [PATCH 34/49] mm: numa: Rate limit setting of pte_numa if node is saturated
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (32 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 33/49] mm: numa: Rate limit the amount of memory that is migrated between nodes Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 35/49] sched: numa: Slowly increase the scanning period as NUMA faults are handled Mel Gorman
                   ` (17 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

If there are a large number of NUMA hinting faults and all of them
are resulting in migrations it may indicate that memory is just
bouncing uselessly around. NUMA balancing cost is likely exceeding
any benefit from locality. Rate limit the PTE updates if the node
is migration rate-limited. As noted in the comments, this distorts
the NUMA faulting statistics.
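
The check that gates the PTE updates can be sketched as below. It is a
userspace illustration with hypothetical names and a plain tick counter
in place of jiffies. Since only faults reset the window, the extra
grace period lets scanning resume once no fault has arrived for that
long after the window closed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-node rate limiting state (illustrative only). */
struct migrate_window {
	unsigned long next_window; /* tick at which the window closes */
	unsigned long nr_pages;    /* pages migrated in the window */
};

/* Returns true if PTE updates should be skipped for this node. */
bool node_ratelimited(const struct migrate_window *w, unsigned long now,
		      unsigned long pteupdate_interval, unsigned long limit)
{
	assert(w != NULL);

	/* Stale window: no fault has reset it for a long time. */
	if (now > w->next_window + pteupdate_interval)
		return false;

	/* Still under the migration limit for this window. */
	if (w->nr_pages < limit)
		return false;

	return true;
}
```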

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |    6 ++++++
 kernel/sched/fair.c     |    9 +++++++++
 mm/migrate.c            |   20 ++++++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 2923135..6229177 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -77,11 +77,17 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 
 #ifdef CONFIG_BALANCE_NUMA
 extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page, int node);
+extern bool migrate_ratelimited(int node);
 #else
 static inline int migrate_misplaced_page(struct page *page, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline bool migrate_ratelimited(int node)
+{
+	return false;
+}
 #endif /* CONFIG_BALANCE_NUMA */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2e65f44..357057c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -27,6 +27,7 @@
 #include <linux/profile.h>
 #include <linux/interrupt.h>
 #include <linux/mempolicy.h>
+#include <linux/migrate.h>
 #include <linux/task_work.h>
 
 #include <trace/events/sched.h>
@@ -861,6 +862,14 @@ void task_numa_work(struct callback_head *work)
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
+	/*
+	 * Do not set pte_numa if the current running node is rate-limited.
+	 * This loses statistics on the fault but if we are unwilling to
+	 * migrate to this node, it is less likely we can do useful work
+	 */
+	if (migrate_ratelimited(numa_node_id()))
+		return;
+
 	start = mm->numa_scan_offset;
 	pages = sysctl_balance_numa_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
diff --git a/mm/migrate.c b/mm/migrate.c
index b2e6d4c..2c8310c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1464,10 +1464,30 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
  * page migration rate limiting control.
  * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs
  * window of time. Default here says do not migrate more than 1280M per second.
+ * If a node is rate-limited then PTE NUMA updates are also rate-limited. However
+ * as it is faults that reset the window, pte updates will happen unconditionally
+ * if there has not been a fault since @pteupdate_interval_millisecs after the
+ * throttle window closed.
  */
 static unsigned int migrate_interval_millisecs __read_mostly = 100;
+static unsigned int pteupdate_interval_millisecs __read_mostly = 1000;
 static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
 
+/* Returns true if NUMA migration is currently rate limited */
+bool migrate_ratelimited(int node)
+{
+	pg_data_t *pgdat = NODE_DATA(node);
+
+	if (time_after(jiffies, pgdat->balancenuma_migrate_next_window +
+				msecs_to_jiffies(pteupdate_interval_millisecs)))
+		return false;
+
+	if (pgdat->balancenuma_migrate_nr_pages < ratelimit_pages)
+		return false;
+
+	return true;
+}
+
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
-- 
1.7.9.2



* [PATCH 35/49] sched: numa: Slowly increase the scanning period as NUMA faults are handled
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (33 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 34/49] mm: numa: Rate limit setting of pte_numa if node is saturated Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 36/49] mm: numa: Introduce last_nid to the page frame Mel Gorman
                   ` (16 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

Currently the rate of scanning for an address space is controlled
by the individual tasks. The next scan is simply determined by
2*p->numa_scan_period.

The 2*p->numa_scan_period is arbitrary and never changes. At this point
there is still no proper policy that decides if a task or process is
properly placed. It just scans and assumes the next NUMA fault will
place it properly. As it is assumed that pages will get properly placed
over time, increase the scan window each time a fault is incurred. This
is a big assumption as noted in the comments.

It should be noted that changing to p->numa_scan_period will increase
system CPU usage because the scanning rate has effectively doubled.
If that is a problem then the minimum scan period
(sysctl_balance_numa_scan_period_min) should be raised to 200ms instead
of restoring the 2* logic.
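
As a rough userspace illustration of the growth rule described above
(not the kernel code), the per-fault update can be sketched as follows.
The function name and the macros are hypothetical; the 2ms increment
assumes HZ=1000, where jiffies_to_msecs(2) evaluates to 2.

```c
#include <assert.h>

/* Illustrative stand-ins, not kernel symbols. */
#define SCAN_PERIOD_MIN_MS 100u   /* default minimum scan period */
#define SCAN_PERIOD_MAX_MS 1600u  /* 100*16, the default maximum at this point */
#define FAULT_INCREMENT_MS 2u     /* assumed: jiffies_to_msecs(2) with HZ=1000 */

/* Return the scan period to use after one more NUMA hinting fault. */
static unsigned int scan_period_after_fault(unsigned int period_ms)
{
	unsigned int next;

	/* The period is initialised to the minimum before faults arrive. */
	assert(period_ms >= SCAN_PERIOD_MIN_MS);

	/* Grow by a fixed step, clamped to the maximum period. */
	next = period_ms + FAULT_INCREMENT_MS;
	return next < SCAN_PERIOD_MAX_MS ? next : SCAN_PERIOD_MAX_MS;
}
```

Under these assumed values it takes 750 faults to move from the minimum
period to the maximum, and nothing in this patch ever shortens the
period again.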

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 357057c..3c632448 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -812,6 +812,15 @@ void task_numa_fault(int node, int pages)
 
 	/* FIXME: Allocate task-specific structure for placement policy here */
 
+	/*
+	 * Assume that as faults occur that pages are getting properly placed
+	 * and fewer NUMA hints are required. Note that this is a big
+	 * assumption, it assumes processes reach a steady steady with no
+	 * further phase changes.
+	 */
+	p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
+				p->numa_scan_period + jiffies_to_msecs(2));
+
 	task_numa_placement(p);
 }
 
@@ -858,7 +867,7 @@ void task_numa_work(struct callback_head *work)
 	if (p->numa_scan_period == 0)
 		p->numa_scan_period = sysctl_balance_numa_scan_period_min;
 
-	next_scan = now + 2*msecs_to_jiffies(p->numa_scan_period);
+	next_scan = now + msecs_to_jiffies(p->numa_scan_period);
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-- 
1.7.9.2



* [PATCH 36/49] mm: numa: Introduce last_nid to the page frame
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (34 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 35/49] sched: numa: Slowly increase the scanning period as NUMA faults are handled Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 37/49] mm: numa: split_huge_page: Transfer last_nid on tail page Mel Gorman
                   ` (15 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

This patch introduces a last_nid field to the page struct. This is used
to build a two-stage filter in the next patch that is aimed at
mitigating a problem whereby pages migrate to the wrong node when
referenced by a process that was running off its home node.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h       |   30 ++++++++++++++++++++++++++++++
 include/linux/mm_types.h |    4 ++++
 mm/page_alloc.c          |    2 ++
 3 files changed, 36 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d04c2f0..a0834e1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -693,6 +693,36 @@ static inline int page_to_nid(const struct page *page)
 }
 #endif
 
+#ifdef CONFIG_BALANCE_NUMA
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	return xchg(&page->_last_nid, nid);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return page->_last_nid;
+}
+static inline void reset_page_last_nid(struct page *page)
+{
+	page->_last_nid = -1;
+}
+#else
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	return page_to_nid(page);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return page_to_nid(page);
+}
+
+static inline void reset_page_last_nid(struct page *page)
+{
+}
+#endif
+
 static inline struct zone *page_zone(const struct page *page)
 {
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b40f4ef..6b478ff 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -175,6 +175,10 @@ struct page {
 	 */
 	void *shadow;
 #endif
+
+#ifdef CONFIG_BALANCE_NUMA
+	int _last_nid;
+#endif
 }
 /*
  * The struct page can be forced to be double word aligned so that atomic ops
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df58654..fd6a073 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -608,6 +608,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
+	reset_page_last_nid(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3826,6 +3827,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		reset_page_mapcount(page);
+		reset_page_last_nid(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.7.9.2



* [PATCH 37/49] mm: numa: split_huge_page: Transfer last_nid on tail page
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (35 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 36/49] mm: numa: Introduce last_nid to the page frame Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 38/49] mm: numa: migrate: Set last_nid on newly allocated page Mel Gorman
                   ` (14 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Hillf Danton <dhillf@gmail.com>

Pass last_nid from head page to tail page.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 66e73cc..4c6efa8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1361,6 +1361,7 @@ static void __split_huge_page_refcount(struct page *page)
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
+		page_xchg_last_nid(page_tail, page_last_nid(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
-- 
1.7.9.2



* [PATCH 38/49] mm: numa: migrate: Set last_nid on newly allocated page
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (36 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 37/49] mm: numa: split_huge_page: Transfer last_nid on tail page Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 39/49] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships Mel Gorman
                   ` (13 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Hillf Danton <dhillf@gmail.com>

Pass last_nid from misplaced page to newly allocated migration target page.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index 2c8310c..6bc9745 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1457,6 +1457,9 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					  __GFP_NOMEMALLOC | __GFP_NORETRY |
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
+	if (newpage)
+		page_xchg_last_nid(newpage, page_last_nid(page));
+
 	return newpage;
 }
 
-- 
1.7.9.2



* [PATCH 39/49] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (37 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 38/49] mm: numa: migrate: Set last_nid on newly allocated page Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 40/49] mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate Mel Gorman
                   ` (12 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

Note: This two-stage filter was taken directly from the sched/numa patch
	"sched, numa, mm: Add the scanning page fault machinery" but is
	only a partial extraction. As the end result is not necessarily
	recognisable, the signed-offs-by had to be removed. Will be added
	back if requested.

While it is desirable that all threads in a process run on its home
node, this is not always possible or necessary. There may be more
threads than CPUs within the node or the node might be over-subscribed
with unrelated processes.

This can cause a situation whereby a page gets migrated off its home
node because the threads clearing pte_numa were running off-node. This
patch uses page->last_nid to build a two-stage filter before pages get
migrated to avoid problems with short or unlikely task<->node
relationships.
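
The filter can be pictured with a small userspace sketch. It is an
illustration under simplifying assumptions, not the mempolicy code:
struct page_sketch, misplaced_target() and NO_MIGRATE are hypothetical
names, and a plain store stands in for the atomic xchg() the kernel
uses on page->_last_nid.

```c
#include <assert.h>

#define NO_MIGRATE (-1)

/* Hypothetical stand-in for the two fields of struct page that matter here. */
struct page_sketch {
	int nid;       /* node the page currently resides on */
	int last_nid;  /* node that took the previous hinting fault, -1 if none */
};

/*
 * Return the node the page should migrate to, or NO_MIGRATE.
 * this_nid is the node of the CPU taking the hinting fault.
 */
static int misplaced_target(struct page_sketch *page, int this_nid)
{
	int last_nid = page->last_nid;

	assert(this_nid >= 0);

	/* Always record the faulting node (the kernel does this with xchg()). */
	page->last_nid = this_nid;

	/* Stage one: a different node faulted last time, so only take note. */
	if (last_nid != this_nid)
		return NO_MIGRATE;

	/* Stage two: same node twice in a row; migrate unless already local. */
	if (page->nid == this_nid)
		return NO_MIGRATE;

	return this_nid;
}
```

If a task accounts for roughly 10% of the references to a page, the
chance that two consecutive samples both land on it is about
0.1^2 = 0.01, assuming the samples are independent.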

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mempolicy.c |   30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4c1c8d8..fd20e28 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2317,9 +2317,37 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	}
 
 	/* Migrate the page towards the node whose CPU is referencing it */
-	if (pol->flags & MPOL_F_MORON)
+	if (pol->flags & MPOL_F_MORON) {
+		int last_nid;
+
 		polnid = numa_node_id();
 
+		/*
+		 * Multi-stage node selection is used in conjunction
+		 * with a periodic migration fault to build a temporal
+		 * task<->page relation. By using a two-stage filter we
+		 * remove short/unlikely relations.
+		 *
+		 * Using P(p) ~ n_p / n_t as per frequentist
+		 * probability, we can equate a task's usage of a
+		 * particular page (n_p) per total usage of this
+		 * page (n_t) (in a given time-span) to a probability.
+		 *
+		 * Our periodic faults will sample this probability and
+		 * getting the same result twice in a row, given these
+		 * samples are fully independent, is then given by
+		 * P(n)^2, provided our sample period is sufficiently
+		 * short compared to the usage pattern.
+		 *
+		 * This quadric squishes small probabilities, making
+		 * it less likely we act on an unlikely task<->page
+		 * relation.
+		 */
+		last_nid = page_xchg_last_nid(page, polnid);
+		if (last_nid != polnid)
+			goto out;
+	}
+
 	if (curnid != polnid)
 		ret = polnid;
 out:
-- 
1.7.9.2



* [PATCH 40/49] mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (38 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 39/49] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 41/49] mm: sched: numa: Control enabling and disabling of NUMA balancing Mel Gorman
                   ` (11 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

The PTE scanning rate and fault rates are two of the biggest sources of
system CPU overhead with automatic NUMA placement.  Ideally a proper policy
would detect if a workload was properly placed, schedule and adjust the
PTE scanning rate accordingly. We do not track the necessary information
to do that but we at least know if we migrated or not.

This patch scans slower if a page was not migrated as the result of a
NUMA hinting fault up to sysctl_balance_numa_scan_period_max which is
now higher than the previous default. Once every minute it will reset
the scanner in case of phase changes.

This is hilariously crude and the numbers are arbitrary. Workloads will
converge quite slowly in comparison to what a proper policy should be able
to do. On the plus side, we will chew up less CPU for workloads that have
no need for automatic balancing.
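
A userspace sketch of the back-off and periodic reset, for illustration
only. struct scan_state and both function names are hypothetical, the
10ms step assumes HZ=1000 for jiffies_to_msecs(10), and a plain
comparison stands in for the wrap-safe time_after() used in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the sysctl defaults in this patch. */
#define PERIOD_MIN_MS   100u
#define PERIOD_MAX_MS   5000u   /* 100*50 */
#define PERIOD_RESET_MS 60000u  /* 100*600 */
#define BACKOFF_MS      10u     /* assumed: jiffies_to_msecs(10) with HZ=1000 */

struct scan_state {
	unsigned int period_ms;      /* current scan period */
	unsigned long next_reset_ms; /* when the period snaps back to the minimum */
};

/* Called per hinting fault: scan slower only if nothing was migrated. */
static void on_hinting_fault(struct scan_state *s, bool migrated)
{
	assert(s->period_ms >= PERIOD_MIN_MS);

	if (!migrated) {
		unsigned int next = s->period_ms + BACKOFF_MS;

		s->period_ms = next < PERIOD_MAX_MS ? next : PERIOD_MAX_MS;
	}
}

/* Called from the periodic scan work: reset once the window has passed. */
static void on_scan_work(struct scan_state *s, unsigned long now_ms)
{
	if (now_ms > s->next_reset_ms) {
		s->period_ms = PERIOD_MIN_MS;
		s->next_reset_ms = now_ms + PERIOD_RESET_MS;
	}
}
```

With these assumed values a workload that never migrates reaches the
maximum period after 490 faults, and is pulled back to the minimum at
most once per minute.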

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h |    3 +++
 include/linux/sched.h    |    5 +++--
 kernel/sched/core.c      |    1 +
 kernel/sched/fair.c      |   29 +++++++++++++++++++++--------
 kernel/sysctl.c          |    7 +++++++
 mm/huge_memory.c         |    2 +-
 mm/memory.c              |   12 ++++++++----
 7 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6b478ff..62d18a9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -410,6 +410,9 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
+	/* numa_next_reset is when the PTE scanner period will be reset */
+	unsigned long numa_next_reset;
+
 	/* Restart point for scanning and setting pte_numa */
 	unsigned long numa_scan_offset;
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a2b06ea..1068afd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1562,9 +1562,9 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_BALANCE_NUMA
-extern void task_numa_fault(int node, int pages);
+extern void task_numa_fault(int node, int pages, bool migrated);
 #else
-static inline void task_numa_fault(int node, int pages)
+static inline void task_numa_fault(int node, int pages, bool migrated)
 {
 }
 #endif
@@ -2009,6 +2009,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_balance_numa_scan_delay;
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_scan_period_reset;
 extern unsigned int sysctl_balance_numa_scan_size;
 extern unsigned int sysctl_balance_numa_settle_count;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 047e3c7..a59d869 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1537,6 +1537,7 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_BALANCE_NUMA
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_next_scan = jiffies;
+		p->mm->numa_next_reset = jiffies;
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c632448..c1be907 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -784,7 +784,8 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * numa task sample period in ms
  */
 unsigned int sysctl_balance_numa_scan_period_min = 100;
-unsigned int sysctl_balance_numa_scan_period_max = 100*16;
+unsigned int sysctl_balance_numa_scan_period_max = 100*50;
+unsigned int sysctl_balance_numa_scan_period_reset = 100*600;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_balance_numa_scan_size = 256;
@@ -806,20 +807,19 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages)
+void task_numa_fault(int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
 
 	/* FIXME: Allocate task-specific structure for placement policy here */
 
 	/*
-	 * Assume that as faults occur that pages are getting properly placed
-	 * and fewer NUMA hints are required. Note that this is a big
-	 * assumption, it assumes processes reach a steady steady with no
-	 * further phase changes.
+	 * If pages are properly placed (did not migrate) then scan slower.
+	 * This is reset periodically in case of phase changes
 	 */
-	p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
-				p->numa_scan_period + jiffies_to_msecs(2));
+        if (!migrated)
+		p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
+			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
 }
@@ -858,6 +858,19 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	/*
+	 * Reset the scan period if enough time has gone by. Objective is that
+	 * scanning will be reduced if pages are properly placed. As tasks
+	 * can enter different phases this needs to be re-examined. Lacking
+	 * proper tracking of reference behaviour, this blunt hammer is used.
+	 */
+	migrate = mm->numa_next_reset;
+	if (time_after(now, migrate)) {
+		p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+		next_scan = now + msecs_to_jiffies(sysctl_balance_numa_scan_period_reset);
+		xchg(&mm->numa_next_reset, next_scan);
+	}
+
+	/*
 	 * Enforce maximal scan/migration frequency..
 	 */
 	migrate = mm->numa_next_scan;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5ee587d..c335f426 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,6 +367,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "balance_numa_scan_period_reset",
+		.data		= &sysctl_balance_numa_scan_period_reset,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "balance_numa_scan_period_max_ms",
 		.data		= &sysctl_balance_numa_scan_period_max,
 		.maxlen		= sizeof(unsigned int),
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4c6efa8..1327a03 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1067,7 +1067,7 @@ out_unlock:
 	spin_unlock(&mm->page_table_lock);
 	if (page) {
 		put_page(page);
-		task_numa_fault(numa_node_id(), HPAGE_PMD_NR);
+		task_numa_fault(numa_node_id(), HPAGE_PMD_NR, false);
 	}
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 6a1e534..30e1335 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3468,6 +3468,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int current_nid = -1;
 	int target_nid;
+	bool migrated = false;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3509,12 +3510,13 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Migrate to the requested node */
-	if (migrate_misplaced_page(page, target_nid))
+	migrated = migrate_misplaced_page(page, target_nid);
+	if (migrated)
 		current_nid = target_nid;
 
 out:
 	if (current_nid != -1)
-		task_numa_fault(current_nid, 1);
+		task_numa_fault(current_nid, 1, migrated);
 	return 0;
 }
 
@@ -3554,6 +3556,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct page *page;
 		int curr_nid = local_nid;
 		int target_nid;
+		bool migrated;
 		if (!pte_present(pteval))
 			continue;
 		if (!pte_numa(pteval))
@@ -3590,9 +3593,10 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		/* Migrate to the requested node */
 		pte_unmap_unlock(pte, ptl);
-		if (migrate_misplaced_page(page, target_nid))
+		migrated = migrate_misplaced_page(page, target_nid);
+		if (migrated)
 			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1);
+		task_numa_fault(curr_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.7.9.2



* [PATCH 41/49] mm: sched: numa: Control enabling and disabling of NUMA balancing
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (39 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 40/49] mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 42/49] mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG Mel Gorman
                   ` (10 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

This patch adds Kconfig options and kernel parameters to allow the
enabling and disabling of automatic NUMA balancing. The existence
of such a switch was and is very important when debugging problems
related to transparent hugepages and we should have the same for
automatic NUMA placement.
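
The accepted values of balancenuma= can be illustrated with a small
userspace parser. parse_balancenuma() is a hypothetical helper written
for this sketch; it mirrors the enable/disable string matching and the
1-on-success return convention, but does not touch any scheduler state.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Parse the value given to balancenuma=.
 * Returns 1 and stores the requested state in *enabled when the value is
 * recognised, or 0 (leaving *enabled untouched) when it is not.
 */
static int parse_balancenuma(const char *str, bool *enabled)
{
	assert(enabled != NULL);

	if (!str)
		return 0;

	if (strcmp(str, "enable") == 0) {
		*enabled = true;
		return 1;
	}
	if (strcmp(str, "disable") == 0) {
		*enabled = false;
		return 1;
	}

	/* Anything else is rejected; the patch logs a warning in this case. */
	return 0;
}
```

On the kernel command line this corresponds to booting with
balancenuma=enable or balancenuma=disable.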

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/kernel-parameters.txt |    3 +++
 include/linux/sched.h               |    4 +++
 init/Kconfig                        |    8 ++++++
 kernel/sched/core.c                 |   48 ++++++++++++++++++++++++-----------
 kernel/sched/fair.c                 |    3 +++
 kernel/sched/features.h             |    6 +++--
 mm/mempolicy.c                      |   46 +++++++++++++++++++++++++++++++++
 7 files changed, 101 insertions(+), 17 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9776f06..d984acb 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -403,6 +403,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	atkbd.softrepeat= [HW]
 			Use software keyboard repeat
 
+	balancenuma=	[KNL,X86] Enable or disable automatic NUMA balancing.
+			Allowed values are enable and disable
+
 	baycom_epp=	[HW,AX25]
 			Format: <io>,<mode>
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1068afd..2669bdd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1563,10 +1563,14 @@ struct task_struct {
 
 #ifdef CONFIG_BALANCE_NUMA
 extern void task_numa_fault(int node, int pages, bool migrated);
+extern void set_balancenuma_state(bool enabled);
 #else
 static inline void task_numa_fault(int node, int pages, bool migrated)
 {
 }
+static inline void set_balancenuma_state(bool enabled)
+{
+}
 #endif
 
 /*
diff --git a/init/Kconfig b/init/Kconfig
index 6897a05..4cccc00f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -720,6 +720,14 @@ config ARCH_USES_NUMA_PROT_NONE
 	depends on ARCH_WANTS_PROT_NUMA_PROT_NONE
 	depends on BALANCE_NUMA
 
+config BALANCE_NUMA_DEFAULT_ENABLED
+	bool "Automatically enable NUMA aware memory/task placement"
+	default y
+	depends on BALANCE_NUMA
+	help
+	  If set, autonumic NUMA balancing will be enabled if running on a NUMA
+	  machine.
+
 config BALANCE_NUMA
 	bool "Memory placement aware NUMA scheduler"
 	default n
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a59d869..4841f4f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -192,23 +192,10 @@ static void sched_feat_disable(int i) { };
 static void sched_feat_enable(int i) { };
 #endif /* HAVE_JUMP_LABEL */
 
-static ssize_t
-sched_feat_write(struct file *filp, const char __user *ubuf,
-		size_t cnt, loff_t *ppos)
+static int sched_feat_set(char *cmp)
 {
-	char buf[64];
-	char *cmp;
-	int neg = 0;
 	int i;
-
-	if (cnt > 63)
-		cnt = 63;
-
-	if (copy_from_user(&buf, ubuf, cnt))
-		return -EFAULT;
-
-	buf[cnt] = 0;
-	cmp = strstrip(buf);
+	int neg = 0;
 
 	if (strncmp(cmp, "NO_", 3) == 0) {
 		neg = 1;
@@ -228,6 +215,27 @@ sched_feat_write(struct file *filp, const char __user *ubuf,
 		}
 	}
 
+	return i;
+}
+
+static ssize_t
+sched_feat_write(struct file *filp, const char __user *ubuf,
+		size_t cnt, loff_t *ppos)
+{
+	char buf[64];
+	char *cmp;
+	int i;
+
+	if (cnt > 63)
+		cnt = 63;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+
+	buf[cnt] = 0;
+	cmp = strstrip(buf);
+
+	i = sched_feat_set(cmp);
 	if (i == __SCHED_FEAT_NR)
 		return -EINVAL;
 
@@ -1549,6 +1557,16 @@ static void __sched_fork(struct task_struct *p)
 #endif /* CONFIG_BALANCE_NUMA */
 }
 
+#ifdef CONFIG_BALANCE_NUMA
+void set_balancenuma_state(bool enabled)
+{
+	if (enabled)
+		sched_feat_set("NUMA");
+	else
+		sched_feat_set("NO_NUMA");
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 /*
  * fork()/clone()-time setup:
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1be907..b4bc459 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -811,6 +811,9 @@ void task_numa_fault(int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
 
+	if (!sched_feat_numa(NUMA))
+		return;
+
 	/* FIXME: Allocate task-specific structure for placement policy here */
 
 	/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7cfd289..d402368 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,8 +63,10 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 
 /*
- * Apply the automatic NUMA scheduling policy
+ * Apply the automatic NUMA scheduling policy. Enabled automatically
+ * at runtime if running on a NUMA machine. Can be controlled via
+ * balancenuma=
  */
 #ifdef CONFIG_BALANCE_NUMA
-SCHED_FEAT(NUMA,	true)
+SCHED_FEAT(NUMA,	false)
 #endif
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fd20e28..56ad9bf 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2521,6 +2521,50 @@ void mpol_free_shared_policy(struct shared_policy *p)
 	mutex_unlock(&p->mutex);
 }
 
+#ifdef CONFIG_BALANCE_NUMA
+static bool __initdata balancenuma_override;
+
+static void __init check_balancenuma_enable(void)
+{
+	bool balancenuma_default = false;
+
+	if (IS_ENABLED(CONFIG_BALANCE_NUMA_DEFAULT_ENABLED))
+		balancenuma_default = true;
+
+	if (nr_node_ids > 1 && !balancenuma_override) {
+		printk(KERN_INFO "Enabling automatic NUMA balancing. "
+			"Configure with balancenuma= or sysctl");
+		set_balancenuma_state(balancenuma_default);
+	}
+}
+
+static int __init setup_balancenuma(char *str)
+{
+	int ret = 0;
+	if (!str)
+		goto out;
+	balancenuma_override = true;
+
+	if (!strcmp(str, "enable")) {
+		set_balancenuma_state(true);
+		ret = 1;
+	} else if (!strcmp(str, "disable")) {
+		set_balancenuma_state(false);
+		ret = 1;
+	}
+out:
+	if (!ret)
+		printk(KERN_WARNING "Unable to parse balancenuma=\n");
+
+	return ret;
+}
+__setup("balancenuma=", setup_balancenuma);
+#else
+static inline void __init check_balancenuma_enable(void)
+{
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
 /* assumes fs == KERNEL_DS */
 void __init numa_policy_init(void)
 {
@@ -2571,6 +2615,8 @@ void __init numa_policy_init(void)
 
 	if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes))
 		printk("numa_policy_init: interleaving failed\n");
+
+	check_balancenuma_enable();
 }
 
 /* Reset policy of current process to default */
-- 
1.7.9.2



* [PATCH 42/49] mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (40 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 41/49] mm: sched: numa: Control enabling and disabling of NUMA balancing Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 43/49] mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node Mel Gorman
                   ` (9 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

The patch "mm: sched: numa: Control enabling and disabling of NUMA
balancing" depends on scheduling debug being enabled but it's perfectly
legitimate to disable automatic NUMA balancing even without this option.
This should take care of it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  |    9 +++++++++
 kernel/sched/sched.h |    8 +++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4841f4f..161079c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1558,6 +1558,7 @@ static void __sched_fork(struct task_struct *p)
 }
 
 #ifdef CONFIG_BALANCE_NUMA
+#ifdef CONFIG_SCHED_DEBUG
 void set_balancenuma_state(bool enabled)
 {
 	if (enabled)
@@ -1565,6 +1566,14 @@ void set_balancenuma_state(bool enabled)
 	else
 		sched_feat_set("NO_NUMA");
 }
+#else
+__read_mostly bool balancenuma_enabled;
+
+void set_balancenuma_state(bool enabled)
+{
+	balancenuma_enabled = enabled;
+}
+#endif /* CONFIG_SCHED_DEBUG */
 #endif /* CONFIG_BALANCE_NUMA */
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a43241..03dce73 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -650,9 +650,15 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 
 #ifdef CONFIG_BALANCE_NUMA
 #define sched_feat_numa(x) sched_feat(x)
+#ifdef CONFIG_SCHED_DEBUG
+#define balancenuma_enabled sched_feat_numa(NUMA)
+#else
+extern bool balancenuma_enabled;
+#endif /* CONFIG_SCHED_DEBUG */
 #else
 #define sched_feat_numa(x) (0)
-#endif
+#define balancenuma_enabled (0)
+#endif /* CONFIG_BALANCE_NUMA */
 
 static inline u64 global_rt_period(void)
 {
-- 
1.7.9.2



* [PATCH 43/49] mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (41 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 42/49] mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 44/49] mm: numa: Add THP migration for the NUMA working set scanning fault case Mel Gorman
                   ` (8 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

Because migrations are driven by the CPU a task is running on, there is
no point tracking NUMA faults until a task runs on a new node. This
patch tracks the first node used by an address space. Until the task
runs on a different node, PTE scanning is disabled and no NUMA hinting
faults are trapped. This should help workloads that are short-lived, do
not care about NUMA placement or have bound themselves to a single node.

This takes advantage of the logic in "mm: sched: numa: Implement slow
start for working set sampling" to delay when the checks are made. This
benefits processes that set their CPU and node bindings early in their
lifetime. It also potentially allows any initial load balancing to take
place.
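
As a rough illustration of the check described above, the following is a
minimal userspace sketch, not the kernel code itself. struct mm_model,
scan_allowed() and the this_nid/force parameters are made up for the
example; they stand in for mm_struct, the early exit in task_numa_work(),
numa_node_id() and sched_feat_numa(NUMA_FORCE) respectively.

```c
#include <stdbool.h>

/* Same sentinel values as the patch; any value >= 0 is a real node id. */
#define NUMA_PTE_SCAN_INIT	(-1)
#define NUMA_PTE_SCAN_ACTIVE	(-2)

/* Hypothetical stand-in for the one mm_struct field that matters here. */
struct mm_model {
	int first_nid;
};

/*
 * Returns true if PTE scanning should proceed for this address space.
 * this_nid models numa_node_id(), force models the NUMA_FORCE feature.
 */
static bool scan_allowed(struct mm_model *mm, int this_nid, bool force)
{
	/* First call: remember the node the address space started on. */
	if (mm->first_nid == NUMA_PTE_SCAN_INIT)
		mm->first_nid = this_nid;

	if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
		/* Still on the first node and not forced: keep waiting. */
		if (this_nid == mm->first_nid && !force)
			return false;

		/* Ran on a new node (or forced): scanning stays on from now. */
		mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
	}

	return true;
}
```

Note that the transition is one-way in this sketch: once first_nid becomes
NUMA_PTE_SCAN_ACTIVE, moving back to the original node does not disable
scanning again.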

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h |   10 ++++++++++
 kernel/fork.c            |    3 +++
 kernel/sched/fair.c      |   18 ++++++++++++++++++
 kernel/sched/features.h  |    4 +++-
 4 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 62d18a9..e4551c1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -418,10 +418,20 @@ struct mm_struct {
 
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
+
+	/*
+	 * The first node a task was scheduled on. If a task runs on
+	 * a different node then Make PTE Scan Go Now.
+	 */
+	int first_nid;
 #endif
 	struct uprobes_state uprobes_state;
 };
 
+/* first nid will either be a valid NID or one of these values */
+#define NUMA_PTE_SCAN_INIT	-1
+#define NUMA_PTE_SCAN_ACTIVE	-2
+
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
 #ifdef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/fork.c b/kernel/fork.c
index 8b20ab7..e39111a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -821,6 +821,9 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
+#ifdef CONFIG_BALANCE_NUMA
+	mm->first_nid = NUMA_PTE_SCAN_INIT;
+#endif
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4bc459..fd9c78c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -861,6 +861,24 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	/*
+	 * We do not care about task placement until a task runs on a node
+	 * other than the first one used by the address space. This is
+	 * largely because migrations are driven by what CPU the task
+	 * is running on. If it's never scheduled on another node, it'll
+	 * not migrate so why bother trapping the fault.
+	 */
+	if (mm->first_nid == NUMA_PTE_SCAN_INIT)
+		mm->first_nid = numa_node_id();
+	if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
+		/* Are we running on a new node yet? */
+		if (numa_node_id() == mm->first_nid &&
+		    !sched_feat_numa(NUMA_FORCE))
+			return;
+
+		mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
+	}
+
+	/*
 	 * Reset the scan period if enough time has gone by. Objective is that
 	 * scanning will be reduced if pages are properly placed. As tasks
 	 * can enter different phases this needs to be re-examined. Lacking
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d402368..c3c86fd 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -65,8 +65,10 @@ SCHED_FEAT(LB_MIN, false)
 /*
  * Apply the automatic NUMA scheduling policy. Enabled automatically
  * at runtime if running on a NUMA machine. Can be controlled via
- * balancenuma=
+ * balancenuma=. Allow PTE scanning to be forced on UMA machines
+ * for debugging the core machinery.
  */
 #ifdef CONFIG_BALANCE_NUMA
 SCHED_FEAT(NUMA,	false)
+SCHED_FEAT(NUMA_FORCE,	false)
 #endif
-- 
1.7.9.2



* [PATCH 44/49] mm: numa: Add THP migration for the NUMA working set scanning fault case.
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (42 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 43/49] mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
       [not found]   ` <20130105084229.GA3208@hacker.(null)>
  2012-12-07 10:23 ` [PATCH 45/49] mm: numa: Add THP migration for the NUMA working set scanning fault case build fix Mel Gorman
                   ` (7 subsequent siblings)
  51 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

Note: This is very heavily based on a patch from Peter Zijlstra with
	fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner.  That patch
	put a lot of migration logic into mm/huge_memory.c where it does
	not belong. This version tries to share some of the migration
	logic with migrate_misplaced_page.  However, it should be noted
	that now migrate.c is doing more with the pagetable manipulation
	than is preferred. The end result is barely recognisable so as
	before, the signed-offs had to be removed but will be re-added if
	the original authors are ok with it.

Add THP migration for the NUMA working set scanning fault case.

It uses the page lock to serialize. No migration pte dance is
necessary because the pte is already unmapped when we decide
to migrate.

[dhillf@gmail.com: Fix memory leak on isolation failure]
[dhillf@gmail.com: Fix transfer of last_nid information]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |   15 +++
 mm/huge_memory.c        |   59 ++++++++----
 mm/internal.h           |    7 +-
 mm/memcontrol.c         |    7 +-
 mm/migrate.c            |  231 ++++++++++++++++++++++++++++++++++++++---------
 5 files changed, 255 insertions(+), 64 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 6229177..ed5a6c5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -79,6 +79,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 extern int migrate_misplaced_page(struct page *page, int node);
 extern int migrate_misplaced_page(struct page *page, int node);
 extern bool migrate_ratelimited(int node);
+extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pmd_t *pmd, pmd_t entry,
+			unsigned long address,
+			struct page *page, int node);
+
 #else
 static inline int migrate_misplaced_page(struct page *page, int node)
 {
@@ -88,6 +94,15 @@ static inline bool migrate_ratelimited(int node)
 {
 	return false;
 }
+
+static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pmd_t *pmd, pmd_t entry,
+			unsigned long address,
+			struct page *page, int node)
+{
+	return -EAGAIN;
+}
 #endif /* CONFIG_BALANCE_NUMA */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1327a03..61b66f8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -600,7 +600,7 @@ out:
 }
 __setup("transparent_hugepage=", setup_transparent_hugepage);
 
-static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
 		pmd = pmd_mkwrite(pmd);
@@ -1022,10 +1022,12 @@ out:
 int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
-	struct page *page = NULL;
+	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int target_nid;
 	int current_nid = -1;
+	bool migrated = false;
+	bool page_locked = false;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1033,42 +1035,61 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(pmd);
 	get_page(page);
-	spin_unlock(&mm->page_table_lock);
 	current_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	target_nid = mpol_misplaced(page, vma, haddr);
-	if (target_nid == -1)
+	if (target_nid == -1) {
+		put_page(page);
 		goto clear_pmdnuma;
+	}
 
-	/*
-	 * Due to lacking code to migrate thp pages, we'll split
-	 * (which preserves the special PROT_NONE) and re-take the
-	 * fault on the normal pages.
-	 */
-	split_huge_page(page);
-	put_page(page);
-
-	return 0;
+	/* Acquire the page lock to serialise THP migrations */
+	spin_unlock(&mm->page_table_lock);
+	lock_page(page);
+	page_locked = true;
 
-clear_pmdnuma:
+	/* Confirm the PMD did not change while acquiring the page lock */
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(pmd, *pmdp)))
+	if (unlikely(!pmd_same(pmd, *pmdp))) {
+		unlock_page(page);
+		put_page(page);
 		goto out_unlock;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	/* Migrate the THP to the requested node */
+	migrated = migrate_misplaced_transhuge_page(mm, vma,
+				pmdp, pmd, addr,
+				page, target_nid);
+	if (migrated)
+		current_nid = target_nid;
+	else {
+		spin_lock(&mm->page_table_lock);
+		if (unlikely(!pmd_same(pmd, *pmdp))) {
+			unlock_page(page);
+			goto out_unlock;
+		}
+		goto clear_pmdnuma;
+	}
+
+	task_numa_fault(current_nid, HPAGE_PMD_NR, migrated);
+	return 0;
 
+clear_pmdnuma:
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
 	VM_BUG_ON(pmd_numa(*pmdp));
 	update_mmu_cache_pmd(vma, addr, pmdp);
+	if (page_locked)
+		unlock_page(page);
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
-	if (page) {
-		put_page(page);
-		task_numa_fault(numa_node_id(), HPAGE_PMD_NR, false);
-	}
+	if (current_nid != -1)
+		task_numa_fault(current_nid, HPAGE_PMD_NR, migrated);
 	return 0;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index a4fa284..7e60ac8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -212,15 +212,18 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 {
 	if (TestClearPageMlocked(page)) {
 		unsigned long flags;
+		int nr_pages = hpage_nr_pages(page);
 
 		local_irq_save(flags);
-		__dec_zone_page_state(page, NR_MLOCK);
+		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 		SetPageMlocked(newpage);
-		__inc_zone_page_state(newpage, NR_MLOCK);
+		__mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
 		local_irq_restore(flags);
 	}
 }
 
+extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern unsigned long vma_address(struct page *page,
 				 struct vm_area_struct *vma);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dd39ba0..d97af96 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3288,15 +3288,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 				  struct mem_cgroup **memcgp)
 {
 	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
 	enum charge_type ctype;
 
 	*memcgp = NULL;
 
-	VM_BUG_ON(PageTransHuge(page));
 	if (mem_cgroup_disabled())
 		return;
 
+	if (PageTransHuge(page))
+		nr_pages <<= compound_order(page);
+
 	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
@@ -3358,7 +3361,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 	 * charged to the res_counter since we plan on replacing the
 	 * old one and only one page is going to be left afterwards.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
 }
 
 /* remove redundant charge if migration failed*/
diff --git a/mm/migrate.c b/mm/migrate.c
index 6bc9745..4b1b239 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -410,7 +410,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
  */
 void migrate_page_copy(struct page *newpage, struct page *page)
 {
-	if (PageHuge(page))
+	if (PageHuge(page) || PageTransHuge(page))
 		copy_huge_page(newpage, page);
 	else
 		copy_highpage(newpage, page);
@@ -1491,25 +1491,10 @@ bool migrate_ratelimited(int node)
 	return true;
 }
 
-/*
- * Attempt to migrate a misplaced page to the specified destination
- * node. Caller is expected to have an elevated reference count on
- * the page that will be dropped by this function before returning.
- */
-int migrate_misplaced_page(struct page *page, int node)
+/* Returns true if the node is migrate rate-limited after the update */
+bool numamigrate_update_ratelimit(pg_data_t *pgdat)
 {
-	pg_data_t *pgdat = NODE_DATA(node);
-	int isolated = 0;
-	LIST_HEAD(migratepages);
-
-	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
-	 */
-	if (page_mapcount(page) != 1) {
-		put_page(page);
-		goto out;
-	}
+	bool rate_limited = false;
 
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
@@ -1522,13 +1507,18 @@ int migrate_misplaced_page(struct page *page, int node)
 		pgdat->balancenuma_migrate_next_window = jiffies +
 			msecs_to_jiffies(migrate_interval_millisecs);
 	}
-	if (pgdat->balancenuma_migrate_nr_pages > ratelimit_pages) {
-		spin_unlock(&pgdat->balancenuma_migrate_lock);
-		put_page(page);
-		goto out;
-	}
-	pgdat->balancenuma_migrate_nr_pages++;
+	if (pgdat->balancenuma_migrate_nr_pages > ratelimit_pages)
+		rate_limited = true;
+	else
+		pgdat->balancenuma_migrate_nr_pages++;
 	spin_unlock(&pgdat->balancenuma_migrate_lock);
+	
+	return rate_limited;
+}
+
+int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
+{
+	int ret = 0;
 
 	/* Avoid migrating to a node that is nearly full */
 	if (migrate_balanced_pgdat(pgdat, 1)) {
@@ -1536,13 +1526,18 @@ int migrate_misplaced_page(struct page *page, int node)
 
 		if (isolate_lru_page(page)) {
 			put_page(page);
-			goto out;
+			return 0;
 		}
-		isolated = 1;
 
+		/* Page is isolated */
+		ret = 1;
 		page_lru = page_is_file_cache(page);
-		inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
-		list_add(&page->lru, &migratepages);
+		if (!PageTransHuge(page))
+			inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+		else
+			mod_zone_page_state(page_zone(page),
+					NR_ISOLATED_ANON + page_lru,
+					HPAGE_PMD_NR);
 	}
 
 	/*
@@ -1555,23 +1550,177 @@ int migrate_misplaced_page(struct page *page, int node)
 	 */
 	put_page(page);
 
-	if (isolated) {
-		int nr_remaining;
-
-		nr_remaining = migrate_pages(&migratepages,
-				alloc_misplaced_dst_page,
-				node, false, MIGRATE_ASYNC,
-				MR_NUMA_MISPLACED);
-		if (nr_remaining) {
-			putback_lru_pages(&migratepages);
-			isolated = 0;
-		} else
-			count_vm_numa_event(NUMA_PAGE_MIGRATE);
+	return ret;
+}
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node. Caller is expected to have an elevated reference count on
+ * the page that will be dropped by this function before returning.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+	pg_data_t *pgdat = NODE_DATA(node);
+	int isolated = 0;
+	int nr_remaining;
+	LIST_HEAD(migratepages);
+
+	/*
+	 * Don't migrate pages that are mapped in multiple processes.
+	 * TODO: Handle false sharing detection instead of this hammer
+	 */
+	if (page_mapcount(page) != 1) {
+		put_page(page);
+		goto out;
 	}
+
+	/*
+	 * Rate-limit the amount of data that is being migrated to a node.
+	 * Optimal placement is no good if the memory bus is saturated and
+	 * all the time is being spent migrating!
+	 */
+	if (numamigrate_update_ratelimit(pgdat)) {
+		put_page(page);
+		goto out;
+	}
+
+	isolated = numamigrate_isolate_page(pgdat, page);
+	if (!isolated)
+		goto out;
+
+	list_add(&page->lru, &migratepages);
+	nr_remaining = migrate_pages(&migratepages,
+			alloc_misplaced_dst_page,
+			node, false, MIGRATE_ASYNC,
+			MR_NUMA_MISPLACED);
+	if (nr_remaining) {
+		putback_lru_pages(&migratepages);
+		isolated = 0;
+	} else
+		count_vm_numa_event(NUMA_PAGE_MIGRATE);
 	BUG_ON(!list_empty(&migratepages));
 out:
 	return isolated;
 }
+
+int migrate_misplaced_transhuge_page(struct mm_struct *mm,
+				struct vm_area_struct *vma,
+				pmd_t *pmd, pmd_t entry,
+				unsigned long address,
+				struct page *page, int node)
+{
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	pg_data_t *pgdat = NODE_DATA(node);
+	int isolated = 0;
+	struct page *new_page = NULL;
+	struct mem_cgroup *memcg = NULL;
+	int page_lru = page_is_file_cache(page);
+
+	/*
+	 * Don't migrate pages that are mapped in multiple processes.
+	 * TODO: Handle false sharing detection instead of this hammer
+	 */
+	if (page_mapcount(page) != 1)
+		goto out_dropref;
+
+	/*
+	 * Rate-limit the amount of data that is being migrated to a node.
+	 * Optimal placement is no good if the memory bus is saturated and
+	 * all the time is being spent migrating!
+	 */
+	if (numamigrate_update_ratelimit(pgdat))
+		goto out_dropref;
+
+	new_page = alloc_pages_node(node,
+		(GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
+	if (!new_page)
+		goto out_dropref;
+	page_xchg_last_nid(new_page, page_last_nid(page));
+
+	isolated = numamigrate_isolate_page(pgdat, page);
+	if (!isolated) {
+		put_page(new_page);
+		goto out_keep_locked;
+	}
+
+	/* Prepare a page as a migration target */
+	__set_page_locked(new_page);
+	SetPageSwapBacked(new_page);
+
+	/* anon mapping, we can simply copy page->mapping to the new page: */
+	new_page->mapping = page->mapping;
+	new_page->index = page->index;
+	migrate_page_copy(new_page, page);
+	WARN_ON(PageLRU(new_page));
+
+	/* Recheck the target PMD */
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry))) {
+		spin_unlock(&mm->page_table_lock);
+
+		/* Reverse changes made by migrate_page_copy() */
+		if (TestClearPageActive(new_page))
+			SetPageActive(page);
+		if (TestClearPageUnevictable(new_page))
+			SetPageUnevictable(page);
+		mlock_migrate_page(page, new_page);
+
+		unlock_page(new_page);
+		put_page(new_page);		/* Free it */
+
+		unlock_page(page);
+		putback_lru_page(page);
+
+		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
+		goto out;
+	}
+
+	/*
+	 * Traditional migration needs to prepare the memcg charge
+	 * transaction early to prevent the old page from being
+	 * uncharged when installing migration entries.  Here we can
+	 * save the potential rollback and start the charge transfer
+	 * only when migration is already known to end successfully.
+	 */
+	mem_cgroup_prepare_migration(page, new_page, &memcg);
+
+	entry = mk_pmd(new_page, vma->vm_page_prot);
+	entry = pmd_mknonnuma(entry);
+	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+	entry = pmd_mkhuge(entry);
+
+	page_add_new_anon_rmap(new_page, vma, haddr);
+
+	set_pmd_at(mm, haddr, pmd, entry);
+	update_mmu_cache_pmd(vma, address, entry);
+	page_remove_rmap(page);
+	/*
+	 * Finish the charge transaction under the page table lock to
+	 * prevent split_huge_page() from dividing up the charge
+	 * before it's fully transferred to the new page.
+	 */
+	mem_cgroup_end_migration(memcg, page, new_page, true);
+	spin_unlock(&mm->page_table_lock);
+
+	unlock_page(new_page);
+	unlock_page(page);
+	put_page(page);			/* Drop the rmap reference */
+	put_page(page);			/* Drop the LRU isolation reference */
+
+	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
+	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
+
+out:
+	mod_zone_page_state(page_zone(page),
+			NR_ISOLATED_ANON + page_lru,
+			-HPAGE_PMD_NR);
+	return isolated;
+
+out_dropref:
+	put_page(page);
+out_keep_locked:
+	return 0;
+}
 #endif /* CONFIG_BALANCE_NUMA */
 
 #endif /* CONFIG_NUMA */
-- 
1.7.9.2



* [PATCH 45/49] mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (43 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 44/49] mm: numa: Add THP migration for the NUMA working set scanning fault case Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 46/49] mm: numa: Account for failed allocations and isolations as migration failures Mel Gorman
                   ` (6 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

Commit "Add THP migration for the NUMA working set scanning fault case"
breaks the build because HPAGE_PMD_SHIFT and HPAGE_PMD_MASK are defined
to trigger BUILD_BUG() without CONFIG_TRANSPARENT_HUGEPAGE:

mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put':
mm/migrate.c:1549: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1564: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1566: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1573: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1606: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1648: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed

CONFIG_BALANCE_NUMA allows compilation without enabling transparent
hugepages, so define the dummy function for such a configuration and only
define migrate_misplaced_transhuge_page() when transparent hugepages
are enabled.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |   16 +++++++++-------
 mm/migrate.c            |    2 ++
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ed5a6c5..6c15d4a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -79,12 +79,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 extern int migrate_misplaced_page(struct page *page, int node);
 extern int migrate_misplaced_page(struct page *page, int node);
 extern bool migrate_ratelimited(int node);
-extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-			struct vm_area_struct *vma,
-			pmd_t *pmd, pmd_t entry,
-			unsigned long address,
-			struct page *page, int node);
-
 #else
 static inline int migrate_misplaced_page(struct page *page, int node)
 {
@@ -94,7 +88,15 @@ static inline bool migrate_ratelimited(int node)
 {
 	return false;
 }
+#endif /* CONFIG_BALANCE_NUMA */
 
+#if defined(CONFIG_BALANCE_NUMA) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pmd_t *pmd, pmd_t entry,
+			unsigned long address,
+			struct page *page, int node);
+#else
 static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			pmd_t *pmd, pmd_t entry,
@@ -103,6 +105,6 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 {
 	return -EAGAIN;
 }
-#endif /* CONFIG_BALANCE_NUMA */
+#endif /* CONFIG_BALANCE_NUMA && CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 4b1b239..b6fe2d2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1602,7 +1602,9 @@ int migrate_misplaced_page(struct page *page, int node)
 out:
 	return isolated;
 }
+#endif /* CONFIG_BALANCE_NUMA */
 
+#if defined(CONFIG_BALANCE_NUMA) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 				struct vm_area_struct *vma,
 				pmd_t *pmd, pmd_t entry,
-- 
1.7.9.2



* [PATCH 46/49] mm: numa: Account for failed allocations and isolations as migration failures
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (44 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 45/49] mm: numa: Add THP migration for the NUMA working set scanning fault case build fix Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 47/49] mm: migrate: Account a transhuge page properly when rate limiting Mel Gorman
                   ` (5 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

Subject says it all. Allocation failures and failures to isolate a page
should be accounted as migration failures. This is partially another
difference between base page and transhuge page migration: a base page
migration makes multiple attempts under these conditions before it is
accounted as a failure, while a transhuge page migration gives up after
the first attempt.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index b6fe2d2..eb155c9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1635,12 +1635,15 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 	new_page = alloc_pages_node(node,
 		(GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
-	if (!new_page)
+	if (!new_page) {
+		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 		goto out_dropref;
+	}
 	page_xchg_last_nid(new_page, page_last_nid(page));
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
+		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 		put_page(new_page);
 		goto out_keep_locked;
 	}
-- 
1.7.9.2



* [PATCH 47/49] mm: migrate: Account a transhuge page properly when rate limiting
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (45 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 46/49] mm: numa: Account for failed allocations and isolations as migration failures Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 48/49] mm/rmap: Convert the struct anon_vma::mutex to an rwsem Mel Gorman
                   ` (4 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

If there is excessive migration due to NUMA balancing, it gets rate
limited. The rate limiter counts the number of pages migrated recently
but counts a transhuge page as 1 page. Account for it properly by
charging HPAGE_PMD_NR base pages for each transhuge page.
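
To make the accounting concrete, here is a minimal userspace sketch of a
windowed rate limiter that charges nr_pages per call. It is an
illustration only: struct node_ratelimit, update_ratelimit() and the
now/window_len/limit parameters are hypothetical stand-ins for the pgdat
fields, jiffies, migrate_interval_millisecs and ratelimit_pages, and the
window-reset condition is an assumption because it lies outside the hunks
shown in this patch.

```c
#include <stdbool.h>

/* Hypothetical per-node state, modelled on the pgdat fields. */
struct node_ratelimit {
	unsigned long next_window;	/* time at which the window resets */
	unsigned long nr_pages;		/* base pages charged in this window */
};

/*
 * Returns true if the node is rate-limited. Otherwise charges nr_pages
 * base pages to the current window and returns false.
 */
static bool update_ratelimit(struct node_ratelimit *rl, unsigned long now,
			     unsigned long window_len, unsigned long limit,
			     unsigned long nr_pages)
{
	/* Assumed reset rule: start a new window once the old one expired. */
	if (now > rl->next_window) {
		rl->nr_pages = 0;
		rl->next_window = now + window_len;
	}

	/* Over the limit: refuse without charging anything. */
	if (rl->nr_pages > limit)
		return true;

	/* A transhuge page passes HPAGE_PMD_NR here, a base page passes 1. */
	rl->nr_pages += nr_pages;
	return false;
}
```

With a limit of 1024 base pages and 512 base pages per transhuge page (as
on x86-64 with 4K pages and 2M THP), three transhuge migrations pass and
the fourth is refused within one window. Charging each transhuge page as
a single page would instead let more than a thousand of them through.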

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index eb155c9..6b6567f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1492,7 +1492,7 @@ bool migrate_ratelimited(int node)
 }
 
 /* Returns true if the node is migrate rate-limited after the update */
-bool numamigrate_update_ratelimit(pg_data_t *pgdat)
+bool numamigrate_update_ratelimit(pg_data_t *pgdat, unsigned long nr_pages)
 {
 	bool rate_limited = false;
 
@@ -1510,7 +1510,7 @@ bool numamigrate_update_ratelimit(pg_data_t *pgdat)
 	if (pgdat->balancenuma_migrate_nr_pages > ratelimit_pages)
 		rate_limited = true;
 	else
-		pgdat->balancenuma_migrate_nr_pages++;
+		pgdat->balancenuma_migrate_nr_pages += nr_pages;
 	spin_unlock(&pgdat->balancenuma_migrate_lock);
 	
 	return rate_limited;
@@ -1579,7 +1579,7 @@ int migrate_misplaced_page(struct page *page, int node)
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
 	 */
-	if (numamigrate_update_ratelimit(pgdat)) {
+	if (numamigrate_update_ratelimit(pgdat, 1)) {
 		put_page(page);
 		goto out;
 	}
@@ -1630,7 +1630,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
 	 */
-	if (numamigrate_update_ratelimit(pgdat))
+	if (numamigrate_update_ratelimit(pgdat, HPAGE_PMD_NR))
 		goto out_dropref;
 
 	new_page = alloc_pages_node(node,
-- 
1.7.9.2



* [PATCH 48/49] mm/rmap: Convert the struct anon_vma::mutex to an rwsem
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (46 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 47/49] mm: migrate: Account a transhuge page properly when rate limiting Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 10:23 ` [PATCH 49/49] mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable Mel Gorman
                   ` (3 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Ingo Molnar <mingo@kernel.org>

Convert the struct anon_vma::mutex to an rwsem, which will help
in solving a page-migration scalability problem. (Addressed in
a separate patch.)

The conversion is simple and straightforward: in every case
where we mutex_lock()ed we'll now down_write().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/rmap.h |   16 ++++++++--------
 mm/huge_memory.c     |    4 ++--
 mm/mmap.c            |    8 ++++----
 mm/rmap.c            |   22 +++++++++++-----------
 4 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bfe1f47..f3f41d2 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -7,7 +7,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
-#include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/memcontrol.h>
 
 /*
@@ -25,8 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	struct anon_vma *root;	/* Root of this anon_vma tree */
-	struct mutex mutex;	/* Serialize access to vma list */
+	struct anon_vma *root;		/* Root of this anon_vma tree */
+	struct rw_semaphore rwsem;	/* W: modification, R: walking the list */
 	/*
 	 * The refcount is taken on an anon_vma when there is no
 	 * guarantee that the vma of page tables will exist for
@@ -64,7 +64,7 @@ struct anon_vma_chain {
 	struct vm_area_struct *vma;
 	struct anon_vma *anon_vma;
 	struct list_head same_vma;   /* locked by mmap_sem & page_table_lock */
-	struct rb_node rb;			/* locked by anon_vma->mutex */
+	struct rb_node rb;			/* locked by anon_vma->rwsem */
 	unsigned long rb_subtree_last;
 #ifdef CONFIG_DEBUG_VM_RB
 	unsigned long cached_vma_start, cached_vma_last;
@@ -108,24 +108,24 @@ static inline void vma_lock_anon_vma(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		mutex_lock(&anon_vma->root->mutex);
+		down_write(&anon_vma->root->rwsem);
 }
 
 static inline void vma_unlock_anon_vma(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		mutex_unlock(&anon_vma->root->mutex);
+		up_write(&anon_vma->root->rwsem);
 }
 
 static inline void anon_vma_lock(struct anon_vma *anon_vma)
 {
-	mutex_lock(&anon_vma->root->mutex);
+	down_write(&anon_vma->root->rwsem);
 }
 
 static inline void anon_vma_unlock(struct anon_vma *anon_vma)
 {
-	mutex_unlock(&anon_vma->root->mutex);
+	up_write(&anon_vma->root->rwsem);
 }
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 61b66f8..f0c4928 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,7 +1291,7 @@ static int __split_huge_page_splitting(struct page *page,
 		 * We can't temporarily set the pmd to null in order
 		 * to split it, the pmd must remain marked huge at all
 		 * times or the VM won't take the pmd_trans_huge paths
-		 * and it won't wait on the anon_vma->root->mutex to
+		 * and it won't wait on the anon_vma->root->rwsem to
 		 * serialize against split_huge_page*.
 		 */
 		pmdp_splitting_flush(vma, address, pmd);
@@ -1494,7 +1494,7 @@ static int __split_huge_page_map(struct page *page,
 	return ret;
 }
 
-/* must be called with anon_vma->root->mutex hold */
+/* must be called with anon_vma->root->rwsem held */
 static void __split_huge_page(struct page *page,
 			      struct anon_vma *anon_vma)
 {
diff --git a/mm/mmap.c b/mm/mmap.c
index 9a796c4..8840863 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2561,15 +2561,15 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
 		 * The LSB of head.next can't change from under us
 		 * because we hold the mm_all_locks_mutex.
 		 */
-		mutex_lock_nest_lock(&anon_vma->root->mutex, &mm->mmap_sem);
+		down_write(&anon_vma->root->rwsem);
 		/*
 		 * We can safely modify head.next after taking the
-		 * anon_vma->root->mutex. If some other vma in this mm shares
+		 * anon_vma->root->rwsem. If some other vma in this mm shares
 		 * the same anon_vma we won't take it again.
 		 *
 		 * No need of atomic instructions here, head.next
 		 * can't change from under us thanks to the
-		 * anon_vma->root->mutex.
+		 * anon_vma->root->rwsem.
 		 */
 		if (__test_and_set_bit(0, (unsigned long *)
 				       &anon_vma->root->rb_root.rb_node))
@@ -2671,7 +2671,7 @@ static void vm_unlock_anon_vma(struct anon_vma *anon_vma)
 		 *
 		 * No need of atomic instructions here, head.next
 		 * can't change from under us until we release the
-		 * anon_vma->root->mutex.
+		 * anon_vma->root->rwsem.
 		 */
 		if (!__test_and_clear_bit(0, (unsigned long *)
 					  &anon_vma->root->rb_root.rb_node))
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ee1ef0..6e3ee3b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -24,7 +24,7 @@
  *   mm->mmap_sem
  *     page->flags PG_locked (lock_page)
  *       mapping->i_mmap_mutex
- *         anon_vma->mutex
+ *         anon_vma->rwsem
  *           mm->page_table_lock or pte_lock
  *             zone->lru_lock (in mark_page_accessed, isolate_lru_page)
  *             swap_lock (in swap_duplicate, swap_info_get)
@@ -37,7 +37,7 @@
  *                           in arch-dependent flush_dcache_mmap_lock,
  *                           within bdi.wb->list_lock in __sync_single_inode)
  *
- * anon_vma->mutex,mapping->i_mutex      (memory_failure, collect_procs_anon)
+ * anon_vma->rwsem,mapping->i_mutex      (memory_failure, collect_procs_anon)
  *   ->tasklist_lock
  *     pte map lock
  */
@@ -103,7 +103,7 @@ static inline void anon_vma_free(struct anon_vma *anon_vma)
 	 * LOCK should suffice since the actual taking of the lock must
 	 * happen _before_ what follows.
 	 */
-	if (mutex_is_locked(&anon_vma->root->mutex)) {
+	if (rwsem_is_locked(&anon_vma->root->rwsem)) {
 		anon_vma_lock(anon_vma);
 		anon_vma_unlock(anon_vma);
 	}
@@ -219,9 +219,9 @@ static inline struct anon_vma *lock_anon_vma_root(struct anon_vma *root, struct
 	struct anon_vma *new_root = anon_vma->root;
 	if (new_root != root) {
 		if (WARN_ON_ONCE(root))
-			mutex_unlock(&root->mutex);
+			up_write(&root->rwsem);
 		root = new_root;
-		mutex_lock(&root->mutex);
+		down_write(&root->rwsem);
 	}
 	return root;
 }
@@ -229,7 +229,7 @@ static inline struct anon_vma *lock_anon_vma_root(struct anon_vma *root, struct
 static inline void unlock_anon_vma_root(struct anon_vma *root)
 {
 	if (root)
-		mutex_unlock(&root->mutex);
+		up_write(&root->rwsem);
 }
 
 /*
@@ -349,7 +349,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 	/*
 	 * Iterate the list once more, it now only contains empty and unlinked
 	 * anon_vmas, destroy them. Could not do before due to __put_anon_vma()
-	 * needing to acquire the anon_vma->root->mutex.
+	 * needing to write-acquire the anon_vma->root->rwsem.
 	 */
 	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma = avc->anon_vma;
@@ -365,7 +365,7 @@ static void anon_vma_ctor(void *data)
 {
 	struct anon_vma *anon_vma = data;
 
-	mutex_init(&anon_vma->mutex);
+	init_rwsem(&anon_vma->rwsem);
 	atomic_set(&anon_vma->refcount, 0);
 	anon_vma->rb_root = RB_ROOT;
 }
@@ -457,14 +457,14 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	root_anon_vma = ACCESS_ONCE(anon_vma->root);
-	if (mutex_trylock(&root_anon_vma->mutex)) {
+	if (down_write_trylock(&root_anon_vma->rwsem)) {
 		/*
 		 * If the page is still mapped, then this anon_vma is still
 		 * its anon_vma, and holding the mutex ensures that it will
 		 * not go away, see anon_vma_free().
 		 */
 		if (!page_mapped(page)) {
-			mutex_unlock(&root_anon_vma->mutex);
+			up_write(&root_anon_vma->rwsem);
 			anon_vma = NULL;
 		}
 		goto out;
@@ -1299,7 +1299,7 @@ out_mlock:
 	/*
 	 * We need mmap_sem locking, Otherwise VM_LOCKED check makes
 	 * unstable result and race. Plus, We can't wait here because
-	 * we now hold anon_vma->mutex or mapping->i_mmap_mutex.
+	 * we now hold anon_vma->rwsem or mapping->i_mmap_mutex.
 	 * if trylock failed, the page remain in evictable lru and later
 	 * vmscan could retry to move the page to unevictable lru if the
 	 * page is actually mlocked.
-- 
1.7.9.2



* [PATCH 49/49] mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (47 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 48/49] mm/rmap: Convert the struct anon_vma::mutex to an rwsem Mel Gorman
@ 2012-12-07 10:23 ` Mel Gorman
  2012-12-07 11:01 ` [PATCH 00/49] Automatic NUMA Balancing v10 Ingo Molnar
                   ` (2 subsequent siblings)
  51 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-07 10:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Paul Turner, Hillf Danton, David Rientjes, Lee Schermerhorn,
	Alex Shi, Srikar Dronamraju, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Ingo Molnar <mingo@kernel.org>

rmap_walk_anon() and try_to_unmap_anon() appear to be too
careful about locking the anon vma: while they need protection
against anon vma list modifications, they do not need exclusive
access to the list itself.

Transforming this exclusive lock to a read-locked rwsem removes
a global lock from the hot path of page-migration intense
threaded workloads which can cause pathological performance like
this:

    96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
                  |
                  --- perf_trace_sched_switch
                      __schedule
                      schedule
                      schedule_preempt_disabled
                      __mutex_lock_common.isra.6
                      __mutex_lock_slowpath
                      mutex_lock
                     |
                     |--50.61%-- rmap_walk
                     |          move_to_new_page
                     |          migrate_pages
                     |          migrate_misplaced_page
                     |          __do_numa_page.isra.69
                     |          handle_pte_fault
                     |          handle_mm_fault
                     |          __do_page_fault
                     |          do_page_fault
                     |          page_fault
                     |          __memset_sse2
                     |          |
                     |           --100.00%-- worker_thread
                     |                     |
                     |                      --100.00%-- start_thread
                     |
                      --49.39%-- page_lock_anon_vma
                                try_to_unmap_anon
                                try_to_unmap
                                migrate_pages
                                migrate_misplaced_page
                                __do_numa_page.isra.69
                                handle_pte_fault
                                handle_mm_fault
                                __do_page_fault
                                do_page_fault
                                page_fault
                                __memset_sse2
                                |
                                 --100.00%-- worker_thread
                                           start_thread

With this change applied the profile is now nicely flat
and there's no anon-vma related scheduling/blocking.

Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
to make it clearer that it's an exclusive write-lock in
that case - suggested by Rik van Riel.
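
The difference between the two lock modes can be sketched in userspace.
The toy lock below is an illustration only and is not the kernel's
struct rw_semaphore: it offers trylock operations only, never sleeps,
and every toy_* name is hypothetical. It shows that any number of
readers (list walkers) may hold the lock at once, while a writer (list
modifier) needs it exclusively.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Toy reader/writer trylock, for illustration only. It is NOT the
 * kernel's struct rw_semaphore: no sleeping, no fairness, no wait queue.
 * count > 0: number of readers, count == 0: unlocked, count == -1: writer.
 */
struct toy_rwsem {
	atomic_int count;
};

static void toy_init(struct toy_rwsem *sem)
{
	atomic_init(&sem->count, 0);
}

/* Shared acquisition: succeeds unless a writer holds the lock. */
static bool toy_read_trylock(struct toy_rwsem *sem)
{
	int old = atomic_load(&sem->count);

	while (old >= 0) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&sem->count, &old, old + 1))
			return true;
	}
	return false;
}

static void toy_read_unlock(struct toy_rwsem *sem)
{
	atomic_fetch_sub(&sem->count, 1);
}

/* Exclusive acquisition: succeeds only if nobody holds the lock. */
static bool toy_write_trylock(struct toy_rwsem *sem)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&sem->count, &expected, -1);
}

static void toy_write_unlock(struct toy_rwsem *sem)
{
	atomic_store(&sem->count, 0);
}
```

In this model two concurrent walkers both succeed with
toy_read_trylock(), which is the property the read-locked paths rely
on; a toy_write_trylock() issued while they hold the lock fails.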

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/huge_mm.h |    2 +-
 include/linux/rmap.h    |   17 ++++++++++++++---
 mm/huge_memory.c        |    6 +++---
 mm/ksm.c                |    6 +++---
 mm/memory-failure.c     |    4 ++--
 mm/migrate.c            |    2 +-
 mm/mmap.c               |    2 +-
 mm/mremap.c             |    2 +-
 mm/rmap.c               |   48 +++++++++++++++++++++++------------------------
 9 files changed, 50 insertions(+), 39 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 027ad04..0d1208c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -102,7 +102,7 @@ extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
 #define wait_split_huge_page(__anon_vma, __pmd)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		anon_vma_lock(__anon_vma);				\
+		anon_vma_lock_write(__anon_vma);			\
 		anon_vma_unlock(__anon_vma);				\
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index f3f41d2..c20635c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,7 +118,7 @@ static inline void vma_unlock_anon_vma(struct vm_area_struct *vma)
 		up_write(&anon_vma->root->rwsem);
 }
 
-static inline void anon_vma_lock(struct anon_vma *anon_vma)
+static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
 {
 	down_write(&anon_vma->root->rwsem);
 }
@@ -128,6 +128,17 @@ static inline void anon_vma_unlock(struct anon_vma *anon_vma)
 	up_write(&anon_vma->root->rwsem);
 }
 
+static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
+{
+	down_read(&anon_vma->root->rwsem);
+}
+
+static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
+{
+	up_read(&anon_vma->root->rwsem);
+}
+
+
 /*
  * anon_vma helper functions.
  */
@@ -220,8 +231,8 @@ int try_to_munlock(struct page *);
 /*
  * Called by memory-failure.c to kill processes.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page);
-void page_unlock_anon_vma(struct anon_vma *anon_vma);
+struct anon_vma *page_lock_anon_vma_read(struct page *page);
+void page_unlock_anon_vma_read(struct anon_vma *anon_vma);
 int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f0c4928..409b2f3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1548,7 +1548,7 @@ int split_huge_page(struct page *page)
 	int ret = 1;
 
 	BUG_ON(!PageAnon(page));
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma_read(page);
 	if (!anon_vma)
 		goto out;
 	ret = 0;
@@ -1561,7 +1561,7 @@ int split_huge_page(struct page *page)
 
 	BUG_ON(PageCompound(page));
 out_unlock:
-	page_unlock_anon_vma(anon_vma);
+	page_unlock_anon_vma_read(anon_vma);
 out:
 	return ret;
 }
@@ -2073,7 +2073,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
 		goto out;
 
-	anon_vma_lock(vma->anon_vma);
+	anon_vma_lock_write(vma->anon_vma);
 
 	pte = pte_offset_map(pmd, address);
 	ptl = pte_lockptr(mm, pmd);
diff --git a/mm/ksm.c b/mm/ksm.c
index ae539f0..7fa37de 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1634,7 +1634,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		anon_vma_lock(anon_vma);
+		anon_vma_lock_write(anon_vma);
 		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
 					       0, ULONG_MAX) {
 			vma = vmac->vma;
@@ -1688,7 +1688,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		anon_vma_lock(anon_vma);
+		anon_vma_lock_write(anon_vma);
 		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
 					       0, ULONG_MAX) {
 			vma = vmac->vma;
@@ -1741,7 +1741,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		anon_vma_lock(anon_vma);
+		anon_vma_lock_write(anon_vma);
 		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
 					       0, ULONG_MAX) {
 			vma = vmac->vma;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ddb68a1..f2cd830 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -402,7 +402,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 	struct anon_vma *av;
 	pgoff_t pgoff;
 
-	av = page_lock_anon_vma(page);
+	av = page_lock_anon_vma_read(page);
 	if (av == NULL)	/* Not actually mapped anymore */
 		return;
 
@@ -423,7 +423,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 		}
 	}
 	read_unlock(&tasklist_lock);
-	page_unlock_anon_vma(av);
+	page_unlock_anon_vma_read(av);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
index 6b6567f..da2001b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -754,7 +754,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	 */
 	if (PageAnon(page)) {
 		/*
-		 * Only page_lock_anon_vma() understands the subtleties of
+		 * Only page_lock_anon_vma_read() understands the subtleties of
 		 * getting a hold on an anon_vma from outside one of its mms.
 		 */
 		anon_vma = page_get_anon_vma(page);
diff --git a/mm/mmap.c b/mm/mmap.c
index 8840863..68a16b4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -602,7 +602,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (anon_vma) {
 		VM_BUG_ON(adjust_next && next->anon_vma &&
 			  anon_vma != next->anon_vma);
-		anon_vma_lock(anon_vma);
+		anon_vma_lock_write(anon_vma);
 		anon_vma_interval_tree_pre_update_vma(vma);
 		if (adjust_next)
 			anon_vma_interval_tree_pre_update_vma(next);
diff --git a/mm/mremap.c b/mm/mremap.c
index 1b61c2d..3dabd17 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -104,7 +104,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		}
 		if (vma->anon_vma) {
 			anon_vma = vma->anon_vma;
-			anon_vma_lock(anon_vma);
+			anon_vma_lock_write(anon_vma);
 		}
 	}
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 6e3ee3b..b0f612d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -87,24 +87,24 @@ static inline void anon_vma_free(struct anon_vma *anon_vma)
 	VM_BUG_ON(atomic_read(&anon_vma->refcount));
 
 	/*
-	 * Synchronize against page_lock_anon_vma() such that
+	 * Synchronize against page_lock_anon_vma_read() such that
 	 * we can safely hold the lock without the anon_vma getting
 	 * freed.
 	 *
 	 * Relies on the full mb implied by the atomic_dec_and_test() from
 	 * put_anon_vma() against the acquire barrier implied by
-	 * mutex_trylock() from page_lock_anon_vma(). This orders:
+	 * down_read_trylock() from page_lock_anon_vma_read(). This orders:
 	 *
-	 * page_lock_anon_vma()		VS	put_anon_vma()
-	 *   mutex_trylock()			  atomic_dec_and_test()
+	 * page_lock_anon_vma_read()	VS	put_anon_vma()
+	 *   down_read_trylock()		  atomic_dec_and_test()
 	 *   LOCK				  MB
-	 *   atomic_read()			  mutex_is_locked()
+	 *   atomic_read()			  rwsem_is_locked()
 	 *
 	 * LOCK should suffice since the actual taking of the lock must
 	 * happen _before_ what follows.
 	 */
 	if (rwsem_is_locked(&anon_vma->root->rwsem)) {
-		anon_vma_lock(anon_vma);
+		anon_vma_lock_write(anon_vma);
 		anon_vma_unlock(anon_vma);
 	}
 
@@ -146,7 +146,7 @@ static void anon_vma_chain_link(struct vm_area_struct *vma,
  * allocate a new one.
  *
  * Anon-vma allocations are very subtle, because we may have
- * optimistically looked up an anon_vma in page_lock_anon_vma()
+ * optimistically looked up an anon_vma in page_lock_anon_vma_read()
  * and that may actually touch the spinlock even in the newly
  * allocated vma (it depends on RCU to make sure that the
  * anon_vma isn't actually destroyed).
@@ -181,7 +181,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 			allocated = anon_vma;
 		}
 
-		anon_vma_lock(anon_vma);
+		anon_vma_lock_write(anon_vma);
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
@@ -306,7 +306,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	get_anon_vma(anon_vma->root);
 	/* Mark this anon_vma as the one where our new (COWed) pages go. */
 	vma->anon_vma = anon_vma;
-	anon_vma_lock(anon_vma);
+	anon_vma_lock_write(anon_vma);
 	anon_vma_chain_link(vma, avc, anon_vma);
 	anon_vma_unlock(anon_vma);
 
@@ -442,7 +442,7 @@ out:
  * atomic op -- the trylock. If we fail the trylock, we fall back to getting a
  * reference like with page_get_anon_vma() and then block on the mutex.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma_read(struct page *page)
 {
 	struct anon_vma *anon_vma = NULL;
 	struct anon_vma *root_anon_vma;
@@ -457,14 +457,14 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	root_anon_vma = ACCESS_ONCE(anon_vma->root);
-	if (down_write_trylock(&root_anon_vma->rwsem)) {
+	if (down_read_trylock(&root_anon_vma->rwsem)) {
 		/*
 		 * If the page is still mapped, then this anon_vma is still
 		 * its anon_vma, and holding the mutex ensures that it will
 		 * not go away, see anon_vma_free().
 		 */
 		if (!page_mapped(page)) {
-			up_write(&root_anon_vma->rwsem);
+			up_read(&root_anon_vma->rwsem);
 			anon_vma = NULL;
 		}
 		goto out;
@@ -484,15 +484,15 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 
 	/* we pinned the anon_vma, its safe to sleep */
 	rcu_read_unlock();
-	anon_vma_lock(anon_vma);
+	anon_vma_lock_read(anon_vma);
 
 	if (atomic_dec_and_test(&anon_vma->refcount)) {
 		/*
 		 * Oops, we held the last refcount, release the lock
 		 * and bail -- can't simply use put_anon_vma() because
-		 * we'll deadlock on the anon_vma_lock() recursion.
+		 * we'll deadlock on the anon_vma_lock_write() recursion.
 		 */
-		anon_vma_unlock(anon_vma);
+		anon_vma_unlock_read(anon_vma);
 		__put_anon_vma(anon_vma);
 		anon_vma = NULL;
 	}
@@ -504,9 +504,9 @@ out:
 	return anon_vma;
 }
 
-void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 {
-	anon_vma_unlock(anon_vma);
+	anon_vma_unlock_read(anon_vma);
 }
 
 /*
@@ -732,7 +732,7 @@ static int page_referenced_anon(struct page *page,
 	struct anon_vma_chain *avc;
 	int referenced = 0;
 
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma_read(page);
 	if (!anon_vma)
 		return referenced;
 
@@ -754,7 +754,7 @@ static int page_referenced_anon(struct page *page,
 			break;
 	}
 
-	page_unlock_anon_vma(anon_vma);
+	page_unlock_anon_vma_read(anon_vma);
 	return referenced;
 }
 
@@ -1474,7 +1474,7 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma_read(page);
 	if (!anon_vma)
 		return ret;
 
@@ -1501,7 +1501,7 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 			break;
 	}
 
-	page_unlock_anon_vma(anon_vma);
+	page_unlock_anon_vma_read(anon_vma);
 	return ret;
 }
 
@@ -1696,7 +1696,7 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	int ret = SWAP_AGAIN;
 
 	/*
-	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma()
+	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma_read()
 	 * because that depends on page_mapped(); but not all its usages
 	 * are holding mmap_sem. Users without mmap_sem are required to
 	 * take a reference count to prevent the anon_vma disappearing
@@ -1704,7 +1704,7 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
 		return ret;
-	anon_vma_lock(anon_vma);
+	anon_vma_lock_read(anon_vma);
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
@@ -1712,7 +1712,7 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 		if (ret != SWAP_AGAIN)
 			break;
 	}
-	anon_vma_unlock(anon_vma);
+	anon_vma_unlock_read(anon_vma);
 	return ret;
 }
 
-- 
1.7.9.2



* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (48 preceding siblings ...)
  2012-12-07 10:23 ` [PATCH 49/49] mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable Mel Gorman
@ 2012-12-07 11:01 ` Ingo Molnar
  2012-12-09 20:36   ` Mel Gorman
  2012-12-10 16:42 ` Srikar Dronamraju
  2012-12-13 13:21 ` Srikar Dronamraju
  51 siblings, 1 reply; 80+ messages in thread
From: Ingo Molnar @ 2012-12-07 11:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> This is a full release of all the patches so apologies for the 
> flood. [...]

I have yet to process all your mails, but assuming I address all 
your review feedback and the latest unified tree in tip:master 
shows no regression in your testing, would you be willing to 
start using it for ongoing work?

It would make it much easier for me to pick up your 
enhancements, fixes, etc.

> Changelog since V9
>   o Migration scalability                                             (mingo)

To *really* see migration scalability bottlenecks you need to 
remove the migration-bandwidth throttling kludge from your tree 
(or configure it up very high if you want to do it simple).

Some (certainly not all) of the performance regressions you 
reported were certainly due to numa/core code hitting the 
migration codepaths as aggressively as the workload demanded - 
and hitting scalability bottlenecks.

The right approach is to hit scalability bottlenecks and fix 
them.

Thanks,

	Ingo


* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-07 11:01 ` [PATCH 00/49] Automatic NUMA Balancing v10 Ingo Molnar
@ 2012-12-09 20:36   ` Mel Gorman
  2012-12-09 21:17     ` Kirill A. Shutemov
                       ` (2 more replies)
  0 siblings, 3 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-09 20:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

On Fri, Dec 07, 2012 at 12:01:13PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > This is a full release of all the patches so apologies for the 
> > flood. [...]
> 
> I have yet to process all your mails, but assuming I address all 
> your review feedback and the latest unified tree in tip:master 
> shows no regression in your testing, would you be willing to 
> start using it for ongoing work?
> 

Ingo,

If you had read the second paragraph of the mail you just responded to or
the results at the end then you would have seen that I had problems with
the performance. You would also know that tip/master testing for the last
week was failing due to a boot problem (issue was in mainline not tip and
has been already fixed) and would have known that since the -v18 release
that numacore was effectively disabled on my test machine.

Clearly you are not reading the bug reports you are receiving and you're not
seeing the small bit of review feedback or answering the review questions
you have received either. Why would I be more forthcoming when I feel that
it'll simply be ignored?  You simply assume that each batch of patches
you place on top must be fixing all known regressions and ignoring any
evidence to the contrary.

If you had read my mail from last Tuesday you would even know which patch
was causing the problem that effectively disabled numacore although not
why. The comment about p->numa_faults was completely off the mark (long
journey, was tired, assumed numa_faults was a counter and not a pointer
which was careless).  If you had called me on it then I would have spotted
the actual problem sooner. The problem was indeed with the nr_cpus_allowed
== num_online_cpus() check which I had pointed out was a suspicious check
although for different reasons. As it turns out, a printk() bodge showed
that nr_cpus_allowed == 80 was set in sched_init_smp() while num_online_cpus()
== 48. This effectively disabled numacore. If you had responded to the
bug report, this would likely have been found last Wednesday.
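
As a rough illustration of why those two values matter, the equality
test can be reduced to the sketch below. The function name is
hypothetical and this is not the numacore source; it assumes only what
is described above, namely that the feature is gated on the two counts
being equal.

```c
#include <stdbool.h>

/*
 * Hypothetical reduction of the check described above. The name and the
 * idea that it gates the whole feature are assumptions for illustration,
 * not the numacore implementation.
 */
static bool numa_gate_open(int nr_cpus_allowed, int nr_online_cpus)
{
	/* An exact-equality gate: any mismatch closes it. */
	return nr_cpus_allowed == nr_online_cpus;
}
```

With the values from the printk() output, numa_gate_open(80, 48) is
false, so a gate of this form never opens on that machine.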

As for my ongoing work, I have not actually changed much in the last
two weeks or so -- build fixes and your scalability patches. As I've
said multiple times, my primary objective was to build something minimal
that did something better than mainline although not necessarily as good
as the kernel potentially could be if either numacore or autonuma were rebased
on top. I left the tree so that other testing might validate it was
correct and avoid changing the tree too much prior to the merge window.
I deliberately avoided working on anything that would directly collide
with what numacore was trying to achieve.

> It would make it much easier for me to pick up your 
> enhancements, fixes, etc.
> 
> > Changelog since V9
> >   o Migration scalability                                             (mingo)
> 
> To *really* see migration scalability bottlenecks you need to 
> remove the migration-bandwidth throttling kludge from your tree 
> (or configure it up very high if you want to do it simple).
> 

Why is it a kludge? I already explained what the rationale behind the rate
limiting was. It's not about scalability, it's about mitigating worst-case
behaviour and the amount of time the kernel spends moving data around which
a deliberately adverse workload can trigger. It is unacceptable if, during a
phase change, a process would stall potentially for milliseconds (seconds
if the node is large enough I guess) while the data is being migrated. Here
it is again -- http://www.spinics.net/lists/linux-mm/msg47440.html . You
either ignored the mail or simply could not be bothered explaining why
you thought this was the incorrect decision or why the concerns about an
adverse workload were unimportant.
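
To make the rate-limiting idea concrete, here is a minimal sketch of a
fixed-window page budget. It is not the balancenuma implementation: the
structure, field names, units and the window policy are all assumptions
chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical limiter state; names and units are illustrative. */
struct migrate_limiter {
	unsigned long window_start_ms;	/* start of the current window */
	unsigned long window_len_ms;	/* length of one window */
	unsigned long pages_in_window;	/* pages migrated in this window */
	unsigned long max_pages;	/* page budget per window */
};

/*
 * Returns true if nr_pages may be migrated at time now_ms and charges
 * them to the current window; returns false once the budget is spent.
 */
static bool migrate_allowed(struct migrate_limiter *lim,
			    unsigned long now_ms, unsigned long nr_pages)
{
	/* Start a fresh window once the current one has expired. */
	if (now_ms - lim->window_start_ms >= lim->window_len_ms) {
		lim->window_start_ms = now_ms;
		lim->pages_in_window = 0;
	}

	if (lim->pages_in_window + nr_pages > lim->max_pages)
		return false;

	lim->pages_in_window += nr_pages;
	return true;
}
```

A limiter of this shape bounds how many pages can be copied per window
regardless of how aggressively the workload faults, which is the
worst-case mitigation argued for above.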

I have a vague suspicion actually that when you are modelling the task->data
relationship that you make an implicit assumption that moving data has
zero or near-zero cost. In such a model it would always make sense to move
quickly and immediately but in practice the cost of moving can exceed the
performance benefit of accessing local data and lead to regressions. It
becomes more pronounced if the nodes are not fully connected.
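
The trade-off can be written down as a simple break-even test. This is
a sketch under stated assumptions, not a model used by any of the trees
discussed here: the function name, the parameters and the idea of a
single average latency per access are all hypothetical.

```c
#include <stdbool.h>

/*
 * Hypothetical break-even test: migrating a page pays off only if the
 * time saved by future local accesses exceeds the one-off copy cost.
 * All inputs are illustrative estimates in nanoseconds.
 */
static bool migration_pays_off(double migrate_cost_ns,
			       double remote_access_ns,
			       double local_access_ns,
			       double expected_accesses)
{
	double saved_ns = expected_accesses *
			  (remote_access_ns - local_access_ns);

	return saved_ns > migrate_cost_ns;
}
```

If migrate_cost_ns is assumed to be zero, any positive saving justifies
a move, which is the implicit assumption questioned above; with a
non-zero cost, a page that is touched only a few times before the next
phase change is better left where it is.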

> Some (certainly not all) of the performance regressions you 
> reported were certainly due to numa/core code hitting the 
> migration codepaths as aggressively as the workload demanded - 
> and hitting scalability bottlenecks.
> 

How are you so certain? How do you not know it's because your code is
migrating excessively for no good reason because the algorithm has a flaw
in it? Or that the cost of excessive migration is not being offset by
local data accesses? The critical point to note is that if it really was
only scalability problems then autonuma would suffer the same problems
and it would be impossible for autonuma's performance to exceed numacore's.
This isn't the case, making it unlikely that scalability is your only problem.

Either way, last night I applied a patch on top of latest tip/master to
remove the nr_cpus_allowed check so that numacore would be enabled again
and tested that. In some places it has indeed much improved. In others
it is still regressing badly and in two cases, it's corrupting memory --
specjbb when THP is enabled crashes when running for single or multiple
JVMs. It is likely that a zero page is being inserted due to a race with
migration and causes the JVM to throw a null pointer exception. Here is
the comparison on the rough off-chance you actually read it this time.

stats-v8r6		Same collection of TLB flush fixes and stats
numacore-20121130	Roughly numacore v17
numafix-20121209	numacore as of December 9th with the nr_cpus_allowed check removed.
			Note that this is a 3.7-rc8 based test because that's what tip/master
			is.
autonuma-v28fastr4	Autonuma v28fast with THP patch on top
balancenuma-v9r2	Balance numa v9
balancenuma-v10r3	V9 + the migration scalability patches

AutoNUMA Benchmark
==================

                                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc8             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                                     stats-v8r6     numacore-20121130      numafix-20121209    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r3
User    NUMA01               65230.85 (  0.00%)    24835.22 ( 61.93%)    21882.80 ( 66.45%)    30410.22 ( 53.38%)    52436.65 ( 19.61%)    59949.95 (  8.10%)
User    NUMA01_THEADLOCAL    60794.67 (  0.00%)    17856.17 ( 70.63%)    18367.20 ( 69.79%)    17185.34 ( 71.73%)    17829.96 ( 70.67%)    17501.83 ( 71.21%)
User    NUMA02                7031.50 (  0.00%)     2084.38 ( 70.36%)     2391.47 ( 65.99%)     2238.73 ( 68.16%)     2079.48 ( 70.43%)     2094.68 ( 70.21%)
User    NUMA02_SMT            2916.19 (  0.00%)     1009.28 ( 65.39%)     1046.49 ( 64.11%)     1037.07 ( 64.44%)      997.57 ( 65.79%)     1010.15 ( 65.36%)
System  NUMA01                  39.66 (  0.00%)      926.55 (-2236.23%)      134.00 (-237.87%)      236.83 (-497.15%)      275.09 (-593.62%)      265.02 (-568.23%)
System  NUMA01_THEADLOCAL       42.33 (  0.00%)      513.99 (-1114.25%)      201.65 (-376.38%)       70.90 (-67.49%)      110.82 (-161.80%)      130.30 (-207.82%)
System  NUMA02                   1.25 (  0.00%)       18.57 (-1385.60%)       13.00 (-940.00%)        6.39 (-411.20%)        6.42 (-413.60%)        9.17 (-633.60%)
System  NUMA02_SMT              16.66 (  0.00%)       12.32 ( 26.05%)        7.26 ( 56.42%)        3.17 ( 80.97%)        3.58 ( 78.51%)        6.21 ( 62.73%)
Elapsed NUMA01                1511.76 (  0.00%)      575.93 ( 61.90%)      475.26 ( 68.56%)      701.62 ( 53.59%)     1185.53 ( 21.58%)     1352.74 ( 10.52%)
Elapsed NUMA01_THEADLOCAL     1387.17 (  0.00%)      398.55 ( 71.27%)      405.25 ( 70.79%)      378.47 ( 72.72%)      397.37 ( 71.35%)      387.93 ( 72.03%)
Elapsed NUMA02                 176.81 (  0.00%)       51.14 ( 71.08%)       62.08 ( 64.89%)       53.45 ( 69.77%)       49.51 ( 72.00%)       49.77 ( 71.85%)
Elapsed NUMA02_SMT             163.96 (  0.00%)       48.92 ( 70.16%)       54.45 ( 66.79%)       48.17 ( 70.62%)       47.71 ( 70.90%)       48.63 ( 70.34%)
CPU     NUMA01                4317.00 (  0.00%)     4473.00 ( -3.61%)     4632.00 ( -7.30%)     4368.00 ( -1.18%)     4446.00 ( -2.99%)     4451.00 ( -3.10%)
CPU     NUMA01_THEADLOCAL     4385.00 (  0.00%)     4609.00 ( -5.11%)     4582.00 ( -4.49%)     4559.00 ( -3.97%)     4514.00 ( -2.94%)     4545.00 ( -3.65%)
CPU     NUMA02                3977.00 (  0.00%)     4111.00 ( -3.37%)     3873.00 (  2.62%)     4200.00 ( -5.61%)     4212.00 ( -5.91%)     4226.00 ( -6.26%)
CPU     NUMA02_SMT            1788.00 (  0.00%)     2087.00 (-16.72%)     1935.00 ( -8.22%)     2159.00 (-20.75%)     2098.00 (-17.34%)     2089.00 (-16.83%)

Latest numacore has improved on the numa01 case quite a bit. However, this is
the adverse workload.  For the workloads that actually do something sensible,
autonuma and balancenuma are both beating numacore by a good margin.
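
For anyone reading the table for the first time, the bracketed percentages
appear to be gains relative to the stats-v8r6 column. A minimal sketch that
reproduces them for the Elapsed NUMA01 row, assuming lower is better for the
time-based metrics (the function name is only for illustration):

```python
# Sketch: reproduce the bracketed percentages in the table above.
# Assumption: for time-like metrics (User/System/Elapsed) lower is better,
# so the gain is the reduction relative to the baseline kernel.

def gain_lower_is_better(baseline, value):
    """Percentage improvement over baseline when smaller values are better."""
    return (baseline - value) / baseline * 100.0

# Elapsed NUMA01 row from the table above (seconds).
baseline = 1511.76        # stats-v8r6
numacore_v17 = 575.93     # numacore-20121130
numafix = 475.26          # numafix-20121209

print(f"{gain_lower_is_better(baseline, numacore_v17):.2f}%")  # 61.90%
print(f"{gain_lower_is_better(baseline, numafix):.2f}%")       # 68.56%
```

A negative result means the kernel was slower than the baseline, which is
why the System rows show figures such as -2236.23%.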

numacore's system CPU usage continues to be excessive -- over triple
balancenuma's in the numa01 case, over quadruple in the numa01_threadlocal
case, double in numa02 and over double in the numa02_smt case.

Duration and vmstats showed nothing interesting so I excluded them this time.

SpecJBB, Multiple JVMs, THP is enabled
======================================

There are no figures available for the latest numacore because the JVM
crashed in two separate tests with this report

Input Properties:
  per_jvm_warehouse_rampup = 3.0
  per_jvm_warehouse_rampdown = 20.0
  jvm_instances = 4
  deterministic_random_seed = false
  ramp_up_seconds = 30
  measurement_seconds = 240
  starting_number_warehouses = 1
  increment_number_warehouses = 1
  ending_number_warehouses = 24
  expected_peak_warehouse = 12
Waiting on instance 1 pid 4028 to finish.
Accepted client /127.0.0.1:59130
Accepted client /127.0.0.1:58393
Accepted client /127.0.0.1:53374
Accepted client /127.0.0.1:40128
java.lang.NullPointerException: error
/root/git-private/autonuma-test/shellpacks/shellpack-bench-specjbb: line 203:  4028 Aborted                 java $USE_HUGEPAGE $SPECJBB_MAXHEAP spec.jbb.JBBmain -propfile SPECjbb.props -id $INSTANCE > $SHELLPACK_TEMP/jvm-instance-$INSTANCE.log
Waiting on instance 1 pid 4029 to finish.
Exception in thread "main" java.lang.NullPointerException
	at spec.jbb.Company.displayResultTotals(Unknown Source)
	at spec.jbb.JBBmain.DoARun(Unknown Source)
	at spec.jbb.JBBmain.runWarehouse(Unknown Source)
	at spec.jbb.JBBmain.doIt(Unknown Source)
	at spec.jbb.JBBmain.main(Unknown Source)
Exception in thread "main" java.lang.NullPointerException
	at spec.jbb.Company.displayResultTotals(Unknown Source)
	at spec.jbb.JBBmain.DoARun(Unknown Source)
	at spec.jbb.JBBmain.runWarehouse(Unknown Source)
	at spec.jbb.JBBmain.doIt(Unknown Source)
	at spec.jbb.JBBmain.main(Unknown Source)
Exception in thread "main" java.lang.NullPointerException
	at spec.jbb.Company.displayResultTotals(Unknown Source)
	at spec.jbb.JBBmain.DoARun(Unknown Source)
	at spec.jbb.JBBmain.runWarehouse(Unknown Source)
	at spec.jbb.JBBmain.doIt(Unknown Source)
	at spec.jbb.JBBmain.main(Unknown Source)
Read from remote host compass: Connection reset by peer

Here are the results for the kernels that succeeded

                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                     stats-v8r6     numacore-20121130    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r3
Mean   1      31311.75 (  0.00%)     27938.00 (-10.77%)     31474.25 (  0.52%)     31112.00 ( -0.64%)     31281.50 ( -0.10%)
Mean   2      62972.75 (  0.00%)     51899.00 (-17.58%)     66654.00 (  5.85%)     62937.50 ( -0.06%)     62483.50 ( -0.78%)
Mean   3      91292.00 (  0.00%)     80908.00 (-11.37%)     97177.50 (  6.45%)     90665.50 ( -0.69%)     90667.00 ( -0.68%)
Mean   4     115768.75 (  0.00%)     99497.25 (-14.06%)    125596.00 (  8.49%)    116812.50 (  0.90%)    116193.50 (  0.37%)
Mean   5     137248.50 (  0.00%)     92837.75 (-32.36%)    152795.25 ( 11.33%)    139037.75 (  1.30%)    139055.50 (  1.32%)
Mean   6     155528.50 (  0.00%)    105554.50 (-32.13%)    177455.25 ( 14.10%)    155769.25 (  0.15%)    159129.50 (  2.32%)
Mean   7     156747.50 (  0.00%)    122582.25 (-21.80%)    184578.75 ( 17.76%)    157103.25 (  0.23%)    163234.00 (  4.14%)
Mean   8     152069.50 (  0.00%)    122439.00 (-19.48%)    186619.25 ( 22.72%)    157631.00 (  3.66%)    163077.75 (  7.24%)
Mean   9     146609.75 (  0.00%)    112410.00 (-23.33%)    186165.00 ( 26.98%)    152561.00 (  4.06%)    159656.00 (  8.90%)
Mean   10    142819.00 (  0.00%)    111456.00 (-21.96%)    182569.75 ( 27.83%)    145320.00 (  1.75%)    153414.25 (  7.42%)
Mean   11    128292.25 (  0.00%)     98027.00 (-23.59%)    176104.75 ( 37.27%)    138599.50 (  8.03%)    147194.25 ( 14.73%)
Mean   12    128769.75 (  0.00%)    129469.50 (  0.54%)    169003.00 ( 31.24%)    131994.75 (  2.50%)    140049.75 (  8.76%)
Mean   13    126488.50 (  0.00%)    110133.75 (-12.93%)    162725.75 ( 28.65%)    130005.25 (  2.78%)    139109.75 (  9.98%)
Mean   14    123400.00 (  0.00%)    117929.75 ( -4.43%)    163781.25 ( 32.72%)    126340.75 (  2.38%)    137883.00 ( 11.74%)
Mean   15    122139.50 (  0.00%)    122404.25 (  0.22%)    160800.25 ( 31.65%)    128612.75 (  5.30%)    136624.00 ( 11.86%)
Mean   16    116413.50 (  0.00%)    124573.50 (  7.01%)    160882.75 ( 38.20%)    117793.75 (  1.19%)    134005.75 ( 15.11%)
Mean   17    117263.25 (  0.00%)    121937.25 (  3.99%)    159069.75 ( 35.65%)    121991.75 (  4.03%)    133444.50 ( 13.80%)
Mean   18    117277.00 (  0.00%)    116633.75 ( -0.55%)    158694.75 ( 35.32%)    119089.75 (  1.55%)    129650.75 ( 10.55%)
Mean   19    113231.00 (  0.00%)    111035.75 ( -1.94%)    155563.25 ( 37.39%)    119699.75 (  5.71%)    123403.25 (  8.98%)
Mean   20    113628.75 (  0.00%)    113451.25 ( -0.16%)    154779.75 ( 36.22%)    118400.75 (  4.20%)    126041.25 ( 10.92%)
Mean   21    110982.50 (  0.00%)    107660.50 ( -2.99%)    151147.25 ( 36.19%)    115663.25 (  4.22%)    121906.50 (  9.84%)
Mean   22    107660.25 (  0.00%)    104771.50 ( -2.68%)    151180.50 ( 40.42%)    111038.00 (  3.14%)    125519.00 ( 16.59%)
Mean   23    105320.50 (  0.00%)     88275.25 (-16.18%)    147032.00 ( 39.60%)    112817.50 (  7.12%)    124148.25 ( 17.88%)
Mean   24    110900.50 (  0.00%)     85169.00 (-23.20%)    147407.00 ( 32.92%)    109556.50 ( -1.21%)    122544.00 ( 10.50%)
Stddev 1        720.83 (  0.00%)       982.31 (-36.28%)       942.80 (-30.79%)      1170.23 (-62.35%)       539.84 ( 25.11%)
Stddev 2        466.00 (  0.00%)      1770.75 (-279.99%)      1327.32 (-184.83%)      1368.51 (-193.67%)      2103.32 (-351.35%)
Stddev 3        509.61 (  0.00%)      4849.62 (-851.63%)      1803.72 (-253.94%)      1088.04 (-113.50%)       410.73 ( 19.40%)
Stddev 4       1750.10 (  0.00%)     10708.16 (-511.86%)      2010.11 (-14.86%)      1456.90 ( 16.75%)      1370.22 ( 21.71%)
Stddev 5        700.05 (  0.00%)     16497.79 (-2256.66%)      2354.70 (-236.36%)       759.38 ( -8.48%)      1869.54 (-167.06%)
Stddev 6       2259.33 (  0.00%)     24221.98 (-972.09%)      1516.32 ( 32.89%)      1032.39 ( 54.31%)      1720.87 ( 23.83%)
Stddev 7       3390.99 (  0.00%)      4721.80 (-39.25%)      2398.34 ( 29.27%)      2487.08 ( 26.66%)      4327.85 (-27.63%)
Stddev 8       7533.18 (  0.00%)      8609.90 (-14.29%)      2895.55 ( 61.56%)      3902.53 ( 48.20%)      2536.68 ( 66.33%)
Stddev 9       9223.98 (  0.00%)     10731.70 (-16.35%)      4726.23 ( 48.76%)      5673.20 ( 38.50%)      3377.59 ( 63.38%)
Stddev 10      4578.09 (  0.00%)     11136.27 (-143.25%)      6705.48 (-46.47%)      5516.47 (-20.50%)      7227.58 (-57.87%)
Stddev 11      8201.30 (  0.00%)      3580.27 ( 56.35%)     10915.90 (-33.10%)      4757.42 ( 41.99%)      4056.02 ( 50.54%)
Stddev 12      5713.70 (  0.00%)     13923.12 (-143.68%)     16555.64 (-189.75%)      4573.05 ( 19.96%)      3678.89 ( 35.61%)
Stddev 13      5878.95 (  0.00%)     10471.09 (-78.11%)     18628.01 (-216.86%)      1680.65 ( 71.41%)      3947.39 ( 32.86%)
Stddev 14      4783.95 (  0.00%)      4051.35 ( 15.31%)     18324.63 (-283.04%)      2637.82 ( 44.86%)      4806.09 ( -0.46%)
Stddev 15      6281.48 (  0.00%)      3357.07 ( 46.56%)     17654.58 (-181.06%)      2003.38 ( 68.11%)      3005.22 ( 52.16%)
Stddev 16      6948.12 (  0.00%)      3763.32 ( 45.84%)     18280.52 (-163.10%)      3526.10 ( 49.25%)      3309.24 ( 52.37%)
Stddev 17      5603.77 (  0.00%)      1452.04 ( 74.09%)     18230.53 (-225.33%)      1712.95 ( 69.43%)      3516.09 ( 37.25%)
Stddev 18      6200.90 (  0.00%)      1870.12 ( 69.84%)     18486.73 (-198.13%)       751.36 ( 87.88%)      2412.60 ( 61.09%)
Stddev 19      6726.31 (  0.00%)      1045.21 ( 84.46%)     18465.25 (-174.52%)      1750.49 ( 73.98%)      4482.82 ( 33.35%)
Stddev 20      5713.58 (  0.00%)      2066.90 ( 63.82%)     19947.77 (-249.13%)      1892.91 ( 66.87%)      2612.62 ( 54.27%)
Stddev 21      4566.92 (  0.00%)      2460.40 ( 46.13%)     21189.08 (-363.97%)      3639.75 ( 20.30%)      1963.17 ( 57.01%)
Stddev 22      6168.05 (  0.00%)      2770.81 ( 55.08%)     20033.82 (-224.80%)      3682.20 ( 40.30%)      1159.17 ( 81.21%)
Stddev 23      6295.45 (  0.00%)      1337.32 ( 78.76%)     22610.91 (-259.16%)      2013.53 ( 68.02%)      3842.61 ( 38.96%)
Stddev 24      3108.17 (  0.00%)      1381.20 ( 55.56%)     21243.56 (-583.47%)      4044.16 (-30.11%)      2673.39 ( 13.99%)
TPut   1     125247.00 (  0.00%)    111752.00 (-10.77%)    125897.00 (  0.52%)    124448.00 ( -0.64%)    125126.00 ( -0.10%)
TPut   2     251891.00 (  0.00%)    207596.00 (-17.58%)    266616.00 (  5.85%)    251750.00 ( -0.06%)    249934.00 ( -0.78%)
TPut   3     365168.00 (  0.00%)    323632.00 (-11.37%)    388710.00 (  6.45%)    362662.00 ( -0.69%)    362668.00 ( -0.68%)
TPut   4     463075.00 (  0.00%)    397989.00 (-14.06%)    502384.00 (  8.49%)    467250.00 (  0.90%)    464774.00 (  0.37%)
TPut   5     548994.00 (  0.00%)    371351.00 (-32.36%)    611181.00 ( 11.33%)    556151.00 (  1.30%)    556222.00 (  1.32%)
TPut   6     622114.00 (  0.00%)    422218.00 (-32.13%)    709821.00 ( 14.10%)    623077.00 (  0.15%)    636518.00 (  2.32%)
TPut   7     626990.00 (  0.00%)    490329.00 (-21.80%)    738315.00 ( 17.76%)    628413.00 (  0.23%)    652936.00 (  4.14%)
TPut   8     608278.00 (  0.00%)    489756.00 (-19.48%)    746477.00 ( 22.72%)    630524.00 (  3.66%)    652311.00 (  7.24%)
TPut   9     586439.00 (  0.00%)    449640.00 (-23.33%)    744660.00 ( 26.98%)    610244.00 (  4.06%)    638624.00 (  8.90%)
TPut   10    571276.00 (  0.00%)    445824.00 (-21.96%)    730279.00 ( 27.83%)    581280.00 (  1.75%)    613657.00 (  7.42%)
TPut   11    513169.00 (  0.00%)    392108.00 (-23.59%)    704419.00 ( 37.27%)    554398.00 (  8.03%)    588777.00 ( 14.73%)
TPut   12    515079.00 (  0.00%)    517878.00 (  0.54%)    676012.00 ( 31.24%)    527979.00 (  2.50%)    560199.00 (  8.76%)
TPut   13    505954.00 (  0.00%)    440535.00 (-12.93%)    650903.00 ( 28.65%)    520021.00 (  2.78%)    556439.00 (  9.98%)
TPut   14    493600.00 (  0.00%)    471719.00 ( -4.43%)    655125.00 ( 32.72%)    505363.00 (  2.38%)    551532.00 ( 11.74%)
TPut   15    488558.00 (  0.00%)    489617.00 (  0.22%)    643201.00 ( 31.65%)    514451.00 (  5.30%)    546496.00 ( 11.86%)
TPut   16    465654.00 (  0.00%)    498294.00 (  7.01%)    643531.00 ( 38.20%)    471175.00 (  1.19%)    536023.00 ( 15.11%)
TPut   17    469053.00 (  0.00%)    487749.00 (  3.99%)    636279.00 ( 35.65%)    487967.00 (  4.03%)    533778.00 ( 13.80%)
TPut   18    469108.00 (  0.00%)    466535.00 ( -0.55%)    634779.00 ( 35.32%)    476359.00 (  1.55%)    518603.00 ( 10.55%)
TPut   19    452924.00 (  0.00%)    444143.00 ( -1.94%)    622253.00 ( 37.39%)    478799.00 (  5.71%)    493613.00 (  8.98%)
TPut   20    454515.00 (  0.00%)    453805.00 ( -0.16%)    619119.00 ( 36.22%)    473603.00 (  4.20%)    504165.00 ( 10.92%)
TPut   21    443930.00 (  0.00%)    430642.00 ( -2.99%)    604589.00 ( 36.19%)    462653.00 (  4.22%)    487626.00 (  9.84%)
TPut   22    430641.00 (  0.00%)    419086.00 ( -2.68%)    604722.00 ( 40.42%)    444152.00 (  3.14%)    502076.00 ( 16.59%)
TPut   23    421282.00 (  0.00%)    353101.00 (-16.18%)    588128.00 ( 39.60%)    451270.00 (  7.12%)    496593.00 ( 17.88%)
TPut   24    443602.00 (  0.00%)    340676.00 (-23.20%)    589628.00 ( 32.92%)    438226.00 ( -1.21%)    490176.00 ( 10.50%)

numacore v17 regressed but we knew that already.

autonuma does the best overall

balancenuma does all right and the scalability patches help quite a bit.

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7
                                  stats-v8r6          numacore-20121130         autonuma-v28fastr4           balancenuma-v9r2          balancenuma-v10r3
 Expctd Warehouse            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)
 Expctd Peak Bops        515079.00 (  0.00%)        517878.00 (  0.54%)        676012.00 ( 31.24%)        527979.00 (  2.50%)        560199.00 (  8.76%)
 Actual Warehouse             7.00 (  0.00%)            12.00 ( 71.43%)             8.00 ( 14.29%)             8.00 ( 14.29%)             7.00 (  0.00%)
 Actual Peak Bops        626990.00 (  0.00%)        517878.00 (-17.40%)        746477.00 ( 19.06%)        630524.00 (  0.56%)        652936.00 (  4.14%)
 SpecJBB Bops            465685.00 (  0.00%)        447214.00 ( -3.97%)        628328.00 ( 34.93%)        480925.00 (  3.27%)        521332.00 ( 11.95%)
 SpecJBB Bops/JVM        116421.00 (  0.00%)        111804.00 ( -3.97%)        157082.00 ( 34.93%)        120231.00 (  3.27%)        130333.00 ( 11.95%)

numacore is pretty old here so ignore the regression.

autonuma is the best but balancenuma sees some of the performance gain.

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
                            stats-v8r6 numacore-20121130 autonuma-v28fastr4 balancenuma-v9r2 balancenuma-v10r3
Page Ins                         37116       36404       36740       35664       34832
Page Outs                        30340       33624       29428       29656       30320
Swap Ins                             0           0           0           0           0
Swap Outs                            0           0           0           0           0
Direct pages scanned                 0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0
Page writes file                     0           0           0           0           0
Page writes anon                     0           0           0           0           0
Page reclaim immediate               0           0           0           0           0
Page rescued immediate               0           0           0           0           0
Slabs scanned                        0           0           0           0           0
Direct inode steals                  0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                  63322       49889       52514       65794       66963
THP collapse alloc                 130          53         463         128         121
THP splits                         355         192         376         371         362
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success                 0           0           0    51424061    50195011
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                      0           0           0       53378       52102
NUMA PTE updates                     0           0           0   411047238   404964644
NUMA hint faults                     0           0           0     3077302     3075026
NUMA hint local faults               0           0           0      958617      870171
NUMA pages migrated                  0           0           0    51424061    50195011
AutoNUMA cost                        0           0           0       19240       19163

All it shows really is that THP is enabled and that balancenuma is migrating
more than I'd like -- 48MB/sec on average throughout the test.
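
The migration rate can be sanity-checked from the "NUMA pages migrated"
counter. The sketch below is only a rough check and rests on two
assumptions that are not stated in this report: that the counter is in 4K
base pages, and that the run lasted roughly 4040 seconds (the exact
duration of this test is not listed here).

```python
# Sketch: convert "NUMA pages migrated" into an average migration rate.
# Assumptions (not from the report): 4K base pages, ~4040 second run.

PAGE_SIZE_BYTES = 4096      # assumed x86-64 base page size
ELAPSED_SECONDS = 4040.0    # assumed; this run's duration is not listed

def migration_rate_mb_per_sec(pages_migrated, elapsed=ELAPSED_SECONDS):
    """Average data migrated per second, in MB (2**20 bytes)."""
    return pages_migrated * PAGE_SIZE_BYTES / (1024 * 1024) / elapsed

# "NUMA pages migrated" values from the vmstat table above.
for kernel, pages in [("balancenuma-v9r2", 51424061),
                      ("balancenuma-v10r3", 50195011)]:
    print(f"{kernel}: {migration_rate_mb_per_sec(pages):.1f} MB/sec")
```

Under those assumptions both kernels land in the high 40s of MB/sec, which
is in line with the figure quoted above.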

SpecJBB, Multiple JVMs, THP is disabled
=======================================
                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc8             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                     stats-v8r6     numacore-20121130      numafix-20121209    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r3
Mean   1      26036.50 (  0.00%)     19595.00 (-24.74%)     23791.25 ( -8.62%)     24738.25 ( -4.99%)     25595.00 ( -1.70%)     25610.50 ( -1.64%)
Mean   2      53629.75 (  0.00%)     38481.50 (-28.25%)     46966.75 (-12.42%)     55646.75 (  3.76%)     53045.25 ( -1.09%)     53383.00 ( -0.46%)
Mean   3      77385.00 (  0.00%)     53685.50 (-30.63%)     66913.25 (-13.53%)     82714.75 (  6.89%)     76596.00 ( -1.02%)     76502.75 ( -1.14%)
Mean   4     100097.75 (  0.00%)     68253.50 (-31.81%)     72186.50 (-27.88%)    107883.25 (  7.78%)     98618.00 ( -1.48%)     99786.50 ( -0.31%)
Mean   5     119012.75 (  0.00%)     74164.50 (-37.68%)     72126.50 (-39.40%)    130260.25 (  9.45%)    119354.50 (  0.29%)    121741.75 (  2.29%)
Mean   6     137419.25 (  0.00%)     86158.50 (-37.30%)     52123.00 (-62.07%)    154244.50 ( 12.24%)    136901.75 ( -0.38%)    136990.50 ( -0.31%)
Mean   7     138018.25 (  0.00%)     96059.25 (-30.40%)     55582.50 (-59.73%)    159501.00 ( 15.57%)    138265.50 (  0.18%)    139398.75 (  1.00%)
Mean   8     136774.00 (  0.00%)     97003.50 (-29.08%)     30208.25 (-77.91%)    162868.00 ( 19.08%)    138554.50 (  1.30%)    137340.75 (  0.41%)
Mean   9     127966.50 (  0.00%)     95261.00 (-25.56%)    125900.50 ( -1.61%)    163008.00 ( 27.38%)    137954.00 (  7.80%)    134200.50 (  4.87%)
Mean   10    124628.75 (  0.00%)     96202.25 (-22.81%)     73809.00 (-40.78%)    159696.50 ( 28.14%)    131322.25 (  5.37%)    126927.50 (  1.84%)
Mean   11    117269.00 (  0.00%)     95924.25 (-18.20%)    127804.25 (  8.98%)    154701.50 ( 31.92%)    125032.75 (  6.62%)    122925.00 (  4.82%)
Mean   12    111962.25 (  0.00%)     94247.25 (-15.82%)    146580.25 ( 30.92%)    150936.50 ( 34.81%)    118119.50 (  5.50%)    119931.75 (  7.12%)
Mean   13    111595.50 (  0.00%)    106538.50 ( -4.53%)    134462.75 ( 20.49%)    147193.25 ( 31.90%)    116398.75 (  4.30%)    117349.75 (  5.16%)
Mean   14    110881.00 (  0.00%)    103549.00 ( -6.61%)    137573.25 ( 24.07%)    144584.00 ( 30.40%)    114934.50 (  3.66%)    115838.25 (  4.47%)
Mean   15    109337.50 (  0.00%)    101729.00 ( -6.96%)    139722.50 ( 27.79%)    143333.00 ( 31.09%)    115523.75 (  5.66%)    115151.25 (  5.32%)
Mean   16    107031.75 (  0.00%)    101983.75 ( -4.72%)    121221.75 ( 13.26%)    141907.75 ( 32.58%)    113666.00 (  6.20%)    113673.50 (  6.21%)
Mean   17    105491.25 (  0.00%)    100205.75 ( -5.01%)    129429.75 ( 22.69%)    140691.00 ( 33.37%)    112751.50 (  6.88%)    113221.25 (  7.33%)
Mean   18    101102.75 (  0.00%)     96635.50 ( -4.42%)    115086.50 ( 13.83%)    137784.25 ( 36.28%)    112582.50 ( 11.35%)    111533.50 ( 10.32%)
Mean   19    103907.25 (  0.00%)     94578.25 ( -8.98%)    126392.75 ( 21.64%)    135719.25 ( 30.62%)    110152.25 (  6.01%)    113959.25 (  9.67%)
Mean   20    100496.00 (  0.00%)     92683.75 ( -7.77%)    123318.75 ( 22.71%)    135264.25 ( 34.60%)    108861.50 (  8.32%)    113746.00 ( 13.18%)
Mean   21     99570.00 (  0.00%)     92955.75 ( -6.64%)    111293.00 ( 11.77%)    133891.00 ( 34.47%)    110094.00 ( 10.57%)    109462.50 (  9.94%)
Mean   22     98611.75 (  0.00%)     89781.75 ( -8.95%)    118218.50 ( 19.88%)    132399.75 ( 34.26%)    109322.75 ( 10.86%)    110502.75 ( 12.06%)
Mean   23     98173.00 (  0.00%)     88846.00 ( -9.50%)    118210.00 ( 20.41%)    130726.00 ( 33.16%)    106046.25 (  8.02%)    107304.25 (  9.30%)
Mean   24     92074.75 (  0.00%)     88581.00 ( -3.79%)    111965.00 ( 21.60%)    127552.25 ( 38.53%)    102362.00 ( 11.17%)    107119.25 ( 16.34%)
Stddev 1        735.13 (  0.00%)       538.24 ( 26.78%)       854.37 (-16.22%)       121.08 ( 83.53%)       906.62 (-23.33%)       788.06 ( -7.20%)
Stddev 2        406.26 (  0.00%)      3458.87 (-751.39%)      4220.03 (-938.75%)       477.32 (-17.49%)      1322.57 (-225.55%)       468.57 (-15.34%)
Stddev 3        644.20 (  0.00%)      1360.89 (-111.25%)      2573.27 (-299.45%)       922.47 (-43.20%)       609.27 (  5.42%)       599.26 (  6.98%)
Stddev 4        743.93 (  0.00%)      2149.34 (-188.92%)     14533.01 (-1853.53%)      1385.42 (-86.23%)      1119.02 (-50.42%)       801.13 ( -7.69%)
Stddev 5        898.53 (  0.00%)      2521.01 (-180.57%)     15303.97 (-1603.23%)       763.24 ( 15.06%)       942.52 ( -4.90%)      1718.19 (-91.22%)
Stddev 6       1126.61 (  0.00%)      3818.22 (-238.91%)     23616.59 (-1996.26%)      1527.03 (-35.54%)      2445.69 (-117.08%)      1754.32 (-55.72%)
Stddev 7       2907.61 (  0.00%)      4419.29 (-51.99%)     29664.97 (-920.25%)      1536.66 ( 47.15%)      4881.65 (-67.89%)      4863.83 (-67.28%)
Stddev 8       3200.64 (  0.00%)       382.01 ( 88.06%)     10743.99 (-235.68%)      1228.09 ( 61.63%)      5459.06 (-70.56%)      5583.95 (-74.46%)
Stddev 9       2907.92 (  0.00%)      1813.39 ( 37.64%)     11763.90 (-304.55%)      1502.61 ( 48.33%)      2501.16 ( 13.99%)      2525.02 ( 13.17%)
Stddev 10      5093.23 (  0.00%)      1313.58 ( 74.21%)     34926.95 (-585.75%)      2763.19 ( 45.75%)      2973.78 ( 41.61%)      2005.95 ( 60.62%)
Stddev 11      4982.41 (  0.00%)      1163.02 ( 76.66%)     13792.07 (-176.81%)      4776.28 (  4.14%)      6068.34 (-21.80%)      4256.77 ( 14.56%)
Stddev 12      3051.38 (  0.00%)      2117.59 ( 30.60%)      5819.48 (-90.72%)      9252.59 (-203.23%)      3885.96 (-27.35%)      2580.44 ( 15.43%)
Stddev 13      2918.03 (  0.00%)      2252.11 ( 22.82%)      8340.05 (-185.81%)      9384.83 (-221.62%)      1833.07 ( 37.18%)      2523.28 ( 13.53%)
Stddev 14      3178.97 (  0.00%)      2337.49 ( 26.47%)      6166.98 (-93.99%)      9353.03 (-194.22%)      1072.60 ( 66.26%)      1140.55 ( 64.12%)
Stddev 15      2438.31 (  0.00%)      1707.72 ( 29.96%)     10687.74 (-338.33%)     10494.03 (-330.38%)      2295.76 (  5.85%)      1213.75 ( 50.22%)
Stddev 16      2682.25 (  0.00%)       840.47 ( 68.67%)     10963.32 (-308.74%)     10343.25 (-285.62%)      2416.09 (  9.92%)      1697.27 ( 36.72%)
Stddev 17      2807.66 (  0.00%)      1546.16 ( 44.93%)     10755.81 (-283.09%)     11446.15 (-307.68%)      2484.08 ( 11.52%)       563.50 ( 79.93%)
Stddev 18      3049.27 (  0.00%)       934.11 ( 69.37%)      8523.80 (-179.54%)     11779.80 (-286.31%)      1472.27 ( 51.72%)      1533.68 ( 49.70%)
Stddev 19      2782.65 (  0.00%)       735.28 ( 73.58%)      9045.84 (-225.08%)     11416.35 (-310.27%)       514.78 ( 81.50%)      1283.38 ( 53.88%)
Stddev 20      2379.12 (  0.00%)       956.25 ( 59.81%)      3789.62 (-59.29%)     10511.63 (-341.83%)      1641.25 ( 31.01%)      1758.22 ( 26.10%)
Stddev 21      2975.22 (  0.00%)       438.31 ( 85.27%)      8160.39 (-174.28%)     11292.91 (-279.57%)      1087.60 ( 63.44%)       434.51 ( 85.40%)
Stddev 22      2260.61 (  0.00%)       718.23 ( 68.23%)     10418.90 (-360.89%)     11993.84 (-430.56%)       909.16 ( 59.78%)       322.32 ( 85.74%)
Stddev 23      2900.85 (  0.00%)       275.47 ( 90.50%)      9829.57 (-238.85%)     12234.80 (-321.77%)       701.39 ( 75.82%)      1444.19 ( 50.21%)
Stddev 24      2578.98 (  0.00%)       481.68 ( 81.32%)      7696.37 (-198.43%)     12769.61 (-395.14%)       732.56 ( 71.60%)      1777.60 ( 31.07%)
TPut   1     104146.00 (  0.00%)     78380.00 (-24.74%)     95165.00 ( -8.62%)     98953.00 ( -4.99%)    102380.00 ( -1.70%)    102442.00 ( -1.64%)
TPut   2     214519.00 (  0.00%)    153926.00 (-28.25%)    187867.00 (-12.42%)    222587.00 (  3.76%)    212181.00 ( -1.09%)    213532.00 ( -0.46%)
TPut   3     309540.00 (  0.00%)    214742.00 (-30.63%)    267653.00 (-13.53%)    330859.00 (  6.89%)    306384.00 ( -1.02%)    306011.00 ( -1.14%)
TPut   4     400391.00 (  0.00%)    273014.00 (-31.81%)    288746.00 (-27.88%)    431533.00 (  7.78%)    394472.00 ( -1.48%)    399146.00 ( -0.31%)
TPut   5     476051.00 (  0.00%)    296658.00 (-37.68%)    288506.00 (-39.40%)    521041.00 (  9.45%)    477418.00 (  0.29%)    486967.00 (  2.29%)
TPut   6     549677.00 (  0.00%)    344634.00 (-37.30%)    208492.00 (-62.07%)    616978.00 ( 12.24%)    547607.00 ( -0.38%)    547962.00 ( -0.31%)
TPut   7     552073.00 (  0.00%)    384237.00 (-30.40%)    222330.00 (-59.73%)    638004.00 ( 15.57%)    553062.00 (  0.18%)    557595.00 (  1.00%)
TPut   8     547096.00 (  0.00%)    388014.00 (-29.08%)    120833.00 (-77.91%)    651472.00 ( 19.08%)    554218.00 (  1.30%)    549363.00 (  0.41%)
TPut   9     511866.00 (  0.00%)    381044.00 (-25.56%)    503602.00 ( -1.61%)    652032.00 ( 27.38%)    551816.00 (  7.80%)    536802.00 (  4.87%)
TPut   10    498515.00 (  0.00%)    384809.00 (-22.81%)    295236.00 (-40.78%)    638786.00 ( 28.14%)    525289.00 (  5.37%)    507710.00 (  1.84%)
TPut   11    469076.00 (  0.00%)    383697.00 (-18.20%)    511217.00 (  8.98%)    618806.00 ( 31.92%)    500131.00 (  6.62%)    491700.00 (  4.82%)
TPut   12    447849.00 (  0.00%)    376989.00 (-15.82%)    586321.00 ( 30.92%)    603746.00 ( 34.81%)    472478.00 (  5.50%)    479727.00 (  7.12%)
TPut   13    446382.00 (  0.00%)    426154.00 ( -4.53%)    537851.00 ( 20.49%)    588773.00 ( 31.90%)    465595.00 (  4.30%)    469399.00 (  5.16%)
TPut   14    443524.00 (  0.00%)    414196.00 ( -6.61%)    550293.00 ( 24.07%)    578336.00 ( 30.40%)    459738.00 (  3.66%)    463353.00 (  4.47%)
TPut   15    437350.00 (  0.00%)    406916.00 ( -6.96%)    558890.00 ( 27.79%)    573332.00 ( 31.09%)    462095.00 (  5.66%)    460605.00 (  5.32%)
TPut   16    428127.00 (  0.00%)    407935.00 ( -4.72%)    484887.00 ( 13.26%)    567631.00 ( 32.58%)    454664.00 (  6.20%)    454694.00 (  6.21%)
TPut   17    421965.00 (  0.00%)    400823.00 ( -5.01%)    517719.00 ( 22.69%)    562764.00 ( 33.37%)    451006.00 (  6.88%)    452885.00 (  7.33%)
TPut   18    404411.00 (  0.00%)    386542.00 ( -4.42%)    460346.00 ( 13.83%)    551137.00 ( 36.28%)    450330.00 ( 11.35%)    446134.00 ( 10.32%)
TPut   19    415629.00 (  0.00%)    378313.00 ( -8.98%)    505571.00 ( 21.64%)    542877.00 ( 30.62%)    440609.00 (  6.01%)    455837.00 (  9.67%)
TPut   20    401984.00 (  0.00%)    370735.00 ( -7.77%)    493275.00 ( 22.71%)    541057.00 ( 34.60%)    435446.00 (  8.32%)    454984.00 ( 13.18%)
TPut   21    398280.00 (  0.00%)    371823.00 ( -6.64%)    445172.00 ( 11.77%)    535564.00 ( 34.47%)    440376.00 ( 10.57%)    437850.00 (  9.94%)
TPut   22    394447.00 (  0.00%)    359127.00 ( -8.95%)    472874.00 ( 19.88%)    529599.00 ( 34.26%)    437291.00 ( 10.86%)    442011.00 ( 12.06%)
TPut   23    392692.00 (  0.00%)    355384.00 ( -9.50%)    472840.00 ( 20.41%)    522904.00 ( 33.16%)    424185.00 (  8.02%)    429217.00 (  9.30%)
TPut   24    368299.00 (  0.00%)    354324.00 ( -3.79%)    447860.00 ( 21.60%)    510209.00 ( 38.53%)    409448.00 ( 11.17%)    428477.00 ( 16.34%)

Latest numacore has improved dramatically here. In v17, it was regressing
heavily across the board. The latest figures show that it regresses heavily
for small numbers of warehouses and shows very large performance gains
for larger numbers of warehouses. This problem with regressions for smaller
numbers of warehouses has been reported repeatedly and it has been pointed out
multiple times that specjbb by default ignores these results which can be
very misleading.

autonuma shows large gains even for small numbers of warehouses and larger
performance gains than numacore does. This is without the TLB optimisations.

balancenuma is not great, but it's better than mainline.


SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc8                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7
                                  stats-v8r6          numacore-20121130           numafix-20121209         autonuma-v28fastr4           balancenuma-v9r2          balancenuma-v10r3
 Expctd Warehouse            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)
 Expctd Peak Bops        447849.00 (  0.00%)        376989.00 (-15.82%)        586321.00 ( 30.92%)        603746.00 ( 34.81%)        472478.00 (  5.50%)        479727.00 (  7.12%)
 Actual Warehouse             7.00 (  0.00%)            13.00 ( 85.71%)            12.00 ( 71.43%)             9.00 ( 28.57%)             8.00 ( 14.29%)             7.00 (  0.00%)
 Actual Peak Bops        552073.00 (  0.00%)        426154.00 (-22.81%)        586321.00 (  6.20%)        652032.00 ( 18.11%)        554218.00 (  0.39%)        557595.00 (  1.00%)
 SpecJBB Bops            415458.00 (  0.00%)        385328.00 ( -7.25%)        502608.00 ( 20.98%)        554456.00 ( 33.46%)        446405.00 (  7.45%)        451937.00 (  8.78%)
 SpecJBB Bops/JVM        103865.00 (  0.00%)         96332.00 ( -7.25%)        125652.00 ( 20.98%)        138614.00 ( 33.46%)        111601.00 (  7.45%)        112984.00 (  8.78%)

numacore is showing good performance gains both at the peak and in the
specjbb score. Note that the specjbb score ignored the regressions for
smaller numbers of warehouses.

autonuma was still better.

balancenuma was all right, better than mainline.

MMTests Statistics: duration
                    3.7.0-rc7          3.7.0-rc6          3.7.0-rc8          3.7.0-rc7          3.7.0-rc7          3.7.0-rc7
                   stats-v8r6  numacore-20121130   numafix-20121209 autonuma-v28fastr4   balancenuma-v9r2  balancenuma-v10r3
User                177832.71          148340.09          165197.46          177337.90          176411.93          176466.36
System                  89.07           28052.02           12438.18             287.31            1464.93            1467.74
Elapsed               4035.81            4041.26            4038.34            4028.05            4041.53            4031.74

Latest numacore's system CPU usage is incredibly high -- over 8 times higher
than balancenuma's (12438 seconds versus 1467).

balancenuma's system CPU usage also sucks to be honest.

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc8   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
                            stats-v8r6  numacore-20121130  numafix-20121209  autonuma-v28fastr4  balancenuma-v9r2  balancenuma-v10r3
Page Ins                         37380       66040       34576       36416       35452       34948
Page Outs                        29224       46900       31972       29584       29612       30892
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                      2           3           1           2           2           2
THP collapse alloc                   0           0           0           0           0           0
THP splits                           0           0           0           0           0           0
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0   193988041           0    37611432    39796961
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0      201359           0       39040       41309
NUMA PTE updates                     0           0   904384590           0   288455303   286931926
NUMA hint faults                     0           0           0           0   270103189   269176121
NUMA hint local faults               0           0           0           0    70822016    70400386
NUMA pages migrated                  0           0   193988041           0    37611432    39796961
AutoNUMA cost                        0           0       10016           0     1353249     1348645

According to this, numacore never had a NUMA hinting fault. This is obviously
broken: numacore does not account for PTE NUMA hinting faults because that
path does not call numa_migration_target(). The consequences are not severe,
it just means that the notional "AutoNUMA cost" is meaningless for numacore.

What is interesting is numacore's migration rate -- 187MB/sec on average. This
is over quadruple balancenuma's migration rate of 38MB/sec on average.
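
The averages quoted above appear to follow from the "Page migrate success"
counter and the elapsed time of the run. A minimal sketch of that arithmetic,
assuming 4K base pages and that "MB" means 2^20 bytes (neither is stated in
this mail, and the function name is purely illustrative):

```python
PAGE_SIZE = 4096   # assumption: 4K base pages
MB = 1 << 20       # assumption: "MB" means 2^20 bytes


def migration_rate_mb_per_sec(pages_migrated, elapsed_seconds):
    """Average migration bandwidth over the whole run, in MB/sec."""
    return pages_migrated * PAGE_SIZE / MB / elapsed_seconds


# "Page migrate success" and "Elapsed" taken from the tables above.
numafix = migration_rate_mb_per_sec(193988041, 4038.34)
balancenuma_v10r3 = migration_rate_mb_per_sec(39796961, 4031.74)

print(f"numafix-20121209:  {numafix:.1f} MB/sec")
print(f"balancenuma-v10r3: {balancenuma_v10r3:.1f} MB/sec")
print(f"ratio:             {numafix / balancenuma_v10r3:.1f}x")
```

Under those assumptions the two rates come out at roughly 187 and 38 MB/sec.
This is an average over the full run, so it says nothing about bursts.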

SpecJBB, Single JVM, THP is enabled
===================================

As with the Multiple JVM test with THP enabled, numacore crashes. This
time the message is

Timing Measurement began Sun Dec 09 17:12:53 GMT 2012 for 0.5 minutes
Exception in thread "Thread-1040" java.lang.NullPointerException
        at java.util.TreeMap.access$100(Unknown Source)
        at java.util.TreeMap$PrivateEntryIterator.nextEntry(Unknown Source)
        at java.util.TreeMap$ValueIterator.next(Unknown Source)
        at spec.jbb.DeliveryTransaction.preprocess(Unknown Source)
        at spec.jbb.DeliveryHandler.handleDelivery(Unknown Source)
        at spec.jbb.DeliveryTransaction.process(Unknown Source)
        at spec.jbb.TransactionManager.runTxn(Unknown Source)
        at spec.jbb.TransactionManager.goManual(Unknown Source)
        at spec.jbb.TransactionManager.go(Unknown Source)
        at spec.jbb.JBBmain.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Timing Measurement ended Sun Dec 09 17:13:23 GMT 2012

Here are the rest of the results

                    3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                   stats-v8r6     numacore-20121130    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r3
TPut 1      25550.00 (  0.00%)     25491.00 ( -0.23%)     24233.00 ( -5.15%)     24913.00 ( -2.49%)     26480.00 (  3.64%)
TPut 2      55943.00 (  0.00%)     51630.00 ( -7.71%)     55312.00 ( -1.13%)     55042.00 ( -1.61%)     56920.00 (  1.75%)
TPut 3      87707.00 (  0.00%)     74497.00 (-15.06%)     88569.00 (  0.98%)     86135.00 ( -1.79%)     88608.00 (  1.03%)
TPut 4     117911.00 (  0.00%)     98435.00 (-16.52%)    118561.00 (  0.55%)    117486.00 ( -0.36%)    117953.00 (  0.04%)
TPut 5     143285.00 (  0.00%)    133964.00 ( -6.51%)    145703.00 (  1.69%)    142821.00 ( -0.32%)    144926.00 (  1.15%)
TPut 6     171208.00 (  0.00%)    152795.00 (-10.75%)    171006.00 ( -0.12%)    170635.00 ( -0.33%)    169394.00 ( -1.06%)
TPut 7     195635.00 (  0.00%)    162517.00 (-16.93%)    198699.00 (  1.57%)    196108.00 (  0.24%)    196491.00 (  0.44%)
TPut 8     222655.00 (  0.00%)    168679.00 (-24.24%)    224903.00 (  1.01%)    223494.00 (  0.38%)    225978.00 (  1.49%)
TPut 9     244787.00 (  0.00%)    193394.00 (-20.99%)    248313.00 (  1.44%)    251858.00 (  2.89%)    251569.00 (  2.77%)
TPut 10    271565.00 (  0.00%)    237987.00 (-12.36%)    272148.00 (  0.21%)    275869.00 (  1.58%)    279049.00 (  2.76%)
TPut 11    298270.00 (  0.00%)    207908.00 (-30.30%)    303749.00 (  1.84%)    301763.00 (  1.17%)    301399.00 (  1.05%)
TPut 12    320867.00 (  0.00%)    257937.00 (-19.61%)    327808.00 (  2.16%)    329681.00 (  2.75%)    330506.00 (  3.00%)
TPut 13    343514.00 (  0.00%)    248474.00 (-27.67%)    349080.00 (  1.62%)    340606.00 ( -0.85%)    350817.00 (  2.13%)
TPut 14    365321.00 (  0.00%)    298876.00 (-18.19%)    370026.00 (  1.29%)    379939.00 (  4.00%)    361752.00 ( -0.98%)
TPut 15    377071.00 (  0.00%)    296562.00 (-21.35%)    329847.00 (-12.52%)    395421.00 (  4.87%)    396091.00 (  5.04%)
TPut 16    404979.00 (  0.00%)    287964.00 (-28.89%)    411066.00 (  1.50%)    420551.00 (  3.85%)    411673.00 (  1.65%)
TPut 17    420593.00 (  0.00%)    342590.00 (-18.55%)    428242.00 (  1.82%)    437461.00 (  4.01%)    428270.00 (  1.83%)
TPut 18    440178.00 (  0.00%)    377508.00 (-14.24%)    440392.00 (  0.05%)    455014.00 (  3.37%)    447671.00 (  1.70%)
TPut 19    448876.00 (  0.00%)    397727.00 (-11.39%)    462036.00 (  2.93%)    479223.00 (  6.76%)    461881.00 (  2.90%)
TPut 20    460513.00 (  0.00%)    411831.00 (-10.57%)    476437.00 (  3.46%)    493176.00 (  7.09%)    474824.00 (  3.11%)
TPut 21    474161.00 (  0.00%)    442153.00 ( -6.75%)    487513.00 (  2.82%)    505246.00 (  6.56%)    468938.00 ( -1.10%)
TPut 22    474493.00 (  0.00%)    429921.00 ( -9.39%)    487920.00 (  2.83%)    527360.00 ( 11.14%)    475208.00 (  0.15%)
TPut 23    489559.00 (  0.00%)    460354.00 ( -5.97%)    508298.00 (  3.83%)    534820.00 (  9.25%)    490743.00 (  0.24%)
TPut 24    495378.00 (  0.00%)    486826.00 ( -1.73%)    514403.00 (  3.84%)    545294.00 ( 10.08%)    493974.00 ( -0.28%)
TPut 25    491795.00 (  0.00%)    520474.00 (  5.83%)    507373.00 (  3.17%)    543526.00 ( 10.52%)    489850.00 ( -0.40%)
TPut 26    490038.00 (  0.00%)    465587.00 ( -4.99%)    376322.00 (-23.21%)    545175.00 ( 11.25%)    491352.00 (  0.27%)
TPut 27    491233.00 (  0.00%)    469764.00 ( -4.37%)    366225.00 (-25.45%)    536927.00 (  9.30%)    489611.00 ( -0.33%)
TPut 28    489058.00 (  0.00%)    489561.00 (  0.10%)    414027.00 (-15.34%)    543127.00 ( 11.06%)    473835.00 ( -3.11%)
TPut 29    471539.00 (  0.00%)    492496.00 (  4.44%)    400529.00 (-15.06%)    541615.00 ( 14.86%)    486009.00 (  3.07%)
TPut 30    480343.00 (  0.00%)    488349.00 (  1.67%)    405612.00 (-15.56%)    542904.00 ( 13.02%)    478384.00 ( -0.41%)
TPut 31    478109.00 (  0.00%)    460043.00 ( -3.78%)    401471.00 (-16.03%)    529079.00 ( 10.66%)    466457.00 ( -2.44%)
TPut 32    475736.00 (  0.00%)    472007.00 ( -0.78%)    401075.00 (-15.69%)    532423.00 ( 11.92%)    467866.00 ( -1.65%)
TPut 33    470758.00 (  0.00%)    474348.00 (  0.76%)    399592.00 (-15.12%)    518811.00 ( 10.21%)    464764.00 ( -1.27%)
TPut 34    467304.00 (  0.00%)    475878.00 (  1.83%)    394589.00 (-15.56%)    518334.00 ( 10.92%)    446719.00 ( -4.41%)
TPut 35    466391.00 (  0.00%)    487411.00 (  4.51%)    382799.00 (-17.92%)    513591.00 ( 10.12%)    447071.00 ( -4.14%)
TPut 36    452722.00 (  0.00%)    478050.00 (  5.59%)    381120.00 (-15.82%)    503801.00 ( 11.28%)    452243.00 ( -0.11%)
TPut 37    447878.00 (  0.00%)    478467.00 (  6.83%)    382803.00 (-14.53%)    494555.00 ( 10.42%)    442751.00 ( -1.14%)
TPut 38    447907.00 (  0.00%)    455542.00 (  1.70%)    341693.00 (-23.71%)    482758.00 (  7.78%)    444023.00 ( -0.87%)
TPut 39    428322.00 (  0.00%)    367921.00 (-14.10%)    404210.00 ( -5.63%)    464550.00 (  8.46%)    440482.00 (  2.84%)
TPut 40    429157.00 (  0.00%)    394277.00 ( -8.13%)    378554.00 (-11.79%)    467767.00 (  9.00%)    411807.00 ( -4.04%)
TPut 41    424339.00 (  0.00%)    415413.00 ( -2.10%)    399220.00 ( -5.92%)    457669.00 (  7.85%)    428273.00 (  0.93%)
TPut 42    397440.00 (  0.00%)    421027.00 (  5.93%)    372161.00 ( -6.36%)    458156.00 ( 15.28%)    422535.00 (  6.31%)
TPut 43    405391.00 (  0.00%)    433900.00 (  7.03%)    383936.00 ( -5.29%)    438929.00 (  8.27%)    410196.00 (  1.19%)
TPut 44    400692.00 (  0.00%)    427504.00 (  6.69%)    374757.00 ( -6.47%)    423538.00 (  5.70%)    399471.00 ( -0.30%)
TPut 45    399623.00 (  0.00%)    372622.00 ( -6.76%)    379797.00 ( -4.96%)    407255.00 (  1.91%)    374068.00 ( -6.39%)
TPut 46    391920.00 (  0.00%)    351205.00 (-10.39%)    368042.00 ( -6.09%)    411353.00 (  4.96%)    384363.00 ( -1.93%)
TPut 47    378199.00 (  0.00%)    358150.00 ( -5.30%)    368744.00 ( -2.50%)    408739.00 (  8.08%)    385670.00 (  1.98%)
TPut 48    379346.00 (  0.00%)    387287.00 (  2.09%)    373581.00 ( -1.52%)    423791.00 ( 11.72%)    380665.00 (  0.35%)
TPut 49    373614.00 (  0.00%)    395793.00 (  5.94%)    372621.00 ( -0.27%)    423024.00 ( 13.22%)    377985.00 (  1.17%)
TPut 50    372494.00 (  0.00%)    366488.00 ( -1.61%)    388778.00 (  4.37%)    410647.00 ( 10.24%)    378831.00 (  1.70%)
TPut 51    382195.00 (  0.00%)    381771.00 ( -0.11%)    387687.00 (  1.44%)    423249.00 ( 10.74%)    402233.00 (  5.24%)
TPut 52    369118.00 (  0.00%)    429441.00 ( 16.34%)    390226.00 (  5.72%)    410023.00 ( 11.08%)    396558.00 (  7.43%)
TPut 53    366453.00 (  0.00%)    445744.00 ( 21.64%)    399257.00 (  8.95%)    405937.00 ( 10.77%)    383916.00 (  4.77%)
TPut 54    366571.00 (  0.00%)    375762.00 (  2.51%)    395098.00 (  7.78%)    402220.00 (  9.72%)    395417.00 (  7.87%)
TPut 55    367580.00 (  0.00%)    336113.00 ( -8.56%)    400550.00 (  8.97%)    420978.00 ( 14.53%)    398098.00 (  8.30%)
TPut 56    367056.00 (  0.00%)    375635.00 (  2.34%)    385743.00 (  5.09%)    412685.00 ( 12.43%)    384029.00 (  4.62%)
TPut 57    359163.00 (  0.00%)    354001.00 ( -1.44%)    389827.00 (  8.54%)    394688.00 (  9.89%)    381032.00 (  6.09%)
TPut 58    360552.00 (  0.00%)    353312.00 ( -2.01%)    394099.00 (  9.30%)    388655.00 (  7.79%)    378132.00 (  4.88%)
TPut 59    354967.00 (  0.00%)    368534.00 (  3.82%)    390746.00 ( 10.08%)    399086.00 ( 12.43%)    387101.00 (  9.05%)
TPut 60    362976.00 (  0.00%)    388472.00 (  7.02%)    383073.00 (  5.54%)    399713.00 ( 10.12%)    390635.00 (  7.62%)
TPut 61    368072.00 (  0.00%)    399476.00 (  8.53%)    380807.00 (  3.46%)    372060.00 (  1.08%)    383187.00 (  4.11%)
TPut 62    356938.00 (  0.00%)    385648.00 (  8.04%)    387736.00 (  8.63%)    377183.00 (  5.67%)    378484.00 (  6.04%)
TPut 63    357491.00 (  0.00%)    404325.00 ( 13.10%)    396672.00 ( 10.96%)    384221.00 (  7.48%)    378907.00 (  5.99%)
TPut 64    357322.00 (  0.00%)    389552.00 (  9.02%)    386826.00 (  8.26%)    378601.00 (  5.96%)    369852.00 (  3.51%)
TPut 65    341262.00 (  0.00%)    394964.00 ( 15.74%)    380271.00 ( 11.43%)    382896.00 ( 12.20%)    382897.00 ( 12.20%)
TPut 66    357807.00 (  0.00%)    384846.00 (  7.56%)    362723.00 (  1.37%)    361530.00 (  1.04%)    380023.00 (  6.21%)
TPut 67    345092.00 (  0.00%)    376842.00 (  9.20%)    364193.00 (  5.54%)    374449.00 (  8.51%)    373877.00 (  8.34%)
TPut 68    350334.00 (  0.00%)    358330.00 (  2.28%)    359368.00 (  2.58%)    384920.00 (  9.87%)    381888.00 (  9.01%)
TPut 69    348372.00 (  0.00%)    356188.00 (  2.24%)    364449.00 (  4.61%)    395611.00 ( 13.56%)    375892.00 (  7.90%)
TPut 70    335077.00 (  0.00%)    359313.00 (  7.23%)    356418.00 (  6.37%)    375448.00 ( 12.05%)    372358.00 ( 11.13%)
TPut 71    341197.00 (  0.00%)    364168.00 (  6.73%)    343847.00 (  0.78%)    376113.00 ( 10.23%)    384292.00 ( 12.63%)
TPut 72    345032.00 (  0.00%)    356934.00 (  3.45%)    345007.00 ( -0.01%)    375313.00 (  8.78%)    381504.00 ( 10.57%)

numacore v17 was doing reasonably well but we knew that already.

autonuma does not do great on this test.

balancenuma does all right. The scalability patches actually hurt in this case
but it's likely down to variability in the decisions made by the scheduler as
much as anything else.

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7
                                  stats-v8r6          numacore-20121130         autonuma-v28fastr4           balancenuma-v9r2          balancenuma-v10r3
 Expctd Warehouse            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)
 Expctd Peak Bops        379346.00 (  0.00%)        387287.00 (  2.09%)        373581.00 ( -1.52%)        423791.00 ( 11.72%)        380665.00 (  0.35%)
 Actual Warehouse            24.00 (  0.00%)            25.00 (  4.17%)            24.00 (  0.00%)            24.00 (  0.00%)            24.00 (  0.00%)
 Actual Peak Bops        495378.00 (  0.00%)        520474.00 (  5.07%)        514403.00 (  3.84%)        545294.00 ( 10.08%)        493974.00 ( -0.28%)
 SpecJBB Bops            183389.00 (  0.00%)        193652.00 (  5.60%)        193461.00 (  5.49%)        201083.00 (  9.65%)        195465.00 (  6.58%)
 SpecJBB Bops/JVM        183389.00 (  0.00%)        193652.00 (  5.60%)        193461.00 (  5.49%)        201083.00 (  9.65%)        195465.00 (  6.58%)

Balancenuma does all right on its specjbb score but the peak score with
the migration scalability patches applied is hurt. At least it's still
comparable to mainline.
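
For reference, the "SpecJBB Bops" figure is consistent with averaging
throughput over warehouses N to 2N, where N is the expected peak warehouse,
and counting warehouse counts that were never run as zero. That is my
understanding of the specjbb scoring convention rather than something stated
in this mail, so treat the sketch below as an assumption that has only been
checked against the stats-v8r6 column:

```python
def specjbb_score(tput, expected_peak):
    """Mean throughput over warehouses N..2N; points not run count as 0.

    Assumed scoring rule, not taken from this mail.
    """
    points = range(expected_peak, 2 * expected_peak + 1)
    return sum(tput.get(w, 0) for w in points) / len(points)


# stats-v8r6 throughput for warehouses 48..72 from the table above.
# The run stopped at 72 warehouses, so 73..96 contribute nothing.
baseline = dict(zip(range(48, 73), [
    379346, 373614, 372494, 382195, 369118,
    366453, 366571, 367580, 367056, 359163,
    360552, 354967, 362976, 368072, 356938,
    357491, 357322, 341262, 357807, 345092,
    350334, 348372, 335077, 341197, 345032,
]))

score = specjbb_score(baseline, expected_peak=48)
print(f"SpecJBB Bops (stats-v8r6): {score:.0f}")
```

Warehouses 1 to 47 never enter the sum, which is why regressions at low
warehouse counts do not show up in the score.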

MMTests Statistics: duration
                    3.7.0-rc7          3.7.0-rc6          3.7.0-rc8          3.7.0-rc7          3.7.0-rc7          3.7.0-rc7
                   stats-v8r6  numacore-20121130   numafix-20121209 autonuma-v28fastr4   balancenuma-v9r2  balancenuma-v10r3
User                316340.52          311420.23           31308.52          314589.64          316061.23          315584.37
System                 102.08            3067.27             803.23             352.70             428.76             450.71
Elapsed               7433.22            7436.63            1398.05            7434.74            7432.60            7435.03

Usual comments about system CPU usage. You actually see the latest numacore
figures here because they are based on what happened up until the crash.

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc8   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
                            stats-v8r6  numacore-20121130  numafix-20121209  autonuma-v28fastr4  balancenuma-v9r2  balancenuma-v10r3
Page Ins                         66212       36180       31560       36152       36188       63852
Page Outs                        31248       35544       12016       28388       28024       42360
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                  48874       45657       34986       48296       48697       47056
THP collapse alloc                  51           2           9         157          53          69
THP splits                          70          37          28          83          78          56
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0   110442307           0    45908125    46995604
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0      114639           0       47652       48781
NUMA PTE updates                     0           0   391813174           0   351907231   361308027
NUMA hint faults                     0           0      796717           0     2010327     1867697
NUMA hint local faults               0           0      261885           0      677602      572742
NUMA pages migrated                  0           0   110442307           0    45908125    46995604
AutoNUMA cost                        0           0        8824           0       13387       12760

THP was certainly enabled.

numacore's migration rate was extremely high until it crashed -- 308MB/sec
as opposed to balancenuma's 24MB/sec on average.

SpecJBB, Single JVM, THP is disabled
====================================

                    3.7.0-rc7             3.7.0-rc6             3.7.0-rc8             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                   stats-v8r6     numacore-20121130      numafix-20121209    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r3
TPut 1      19861.00 (  0.00%)     18255.00 ( -8.09%)     20169.00 (  1.55%)     19636.00 ( -1.13%)     19838.00 ( -0.12%)     20650.00 (  3.97%)
TPut 2      47613.00 (  0.00%)     37136.00 (-22.00%)     45050.00 ( -5.38%)     47153.00 ( -0.97%)     47481.00 ( -0.28%)     48199.00 (  1.23%)
TPut 3      72438.00 (  0.00%)     55692.00 (-23.12%)     64075.00 (-11.55%)     69394.00 ( -4.20%)     72029.00 ( -0.56%)     72932.00 (  0.68%)
TPut 4      98455.00 (  0.00%)     81301.00 (-17.42%)     93595.00 ( -4.94%)     98577.00 (  0.12%)     98437.00 ( -0.02%)     99748.00 (  1.31%)
TPut 5     120831.00 (  0.00%)     89067.00 (-26.29%)    115796.00 ( -4.17%)    120805.00 ( -0.02%)    117218.00 ( -2.99%)    121254.00 (  0.35%)
TPut 6     140013.00 (  0.00%)    108349.00 (-22.62%)    116704.00 (-16.65%)    125079.00 (-10.67%)    139878.00 ( -0.10%)    145360.00 (  3.82%)
TPut 7     163553.00 (  0.00%)    116192.00 (-28.96%)    118711.00 (-27.42%)    164368.00 (  0.50%)    167133.00 (  2.19%)    169539.00 (  3.66%)
TPut 8     190148.00 (  0.00%)    125955.00 (-33.76%)    118079.00 (-37.90%)    188906.00 ( -0.65%)    183058.00 ( -3.73%)    188936.00 ( -0.64%)
TPut 9     211343.00 (  0.00%)    144068.00 (-31.83%)    170067.00 (-19.53%)    206645.00 ( -2.22%)    205699.00 ( -2.67%)    217322.00 (  2.83%)
TPut 10    233190.00 (  0.00%)    148098.00 (-36.49%)    133365.00 (-42.81%)    234533.00 (  0.58%)    233632.00 (  0.19%)    227292.00 ( -2.53%)
TPut 11    253333.00 (  0.00%)    146043.00 (-42.35%)    108866.00 (-57.03%)    254167.00 (  0.33%)    251938.00 ( -0.55%)    259924.00 (  2.60%)
TPut 12    270661.00 (  0.00%)    131739.00 (-51.33%)    146170.00 (-46.00%)    271490.00 (  0.31%)    271393.00 (  0.27%)    272536.00 (  0.69%)
TPut 13    299807.00 (  0.00%)    169396.00 (-43.50%)    134946.00 (-54.99%)    299758.00 ( -0.02%)    270594.00 ( -9.74%)    299110.00 ( -0.23%)
TPut 14    319243.00 (  0.00%)    150705.00 (-52.79%)    145135.00 (-54.54%)    318481.00 ( -0.24%)    318566.00 ( -0.21%)    325133.00 (  1.84%)
TPut 15    339054.00 (  0.00%)    116872.00 (-65.53%)    127277.00 (-62.46%)    331534.00 ( -2.22%)    344672.00 (  1.66%)    318119.00 ( -6.17%)
TPut 16    354315.00 (  0.00%)    124346.00 (-64.91%)     86657.00 (-75.54%)    352600.00 ( -0.48%)    316761.00 (-10.60%)    364648.00 (  2.92%)
TPut 17    371306.00 (  0.00%)    118493.00 (-68.09%)     93297.00 (-74.87%)    368260.00 ( -0.82%)    328888.00 (-11.42%)    371088.00 ( -0.06%)
TPut 18    386361.00 (  0.00%)    138571.00 (-64.13%)    208447.00 (-46.05%)    374358.00 ( -3.11%)    356148.00 ( -7.82%)    399913.00 (  3.51%)
TPut 19    401827.00 (  0.00%)    118855.00 (-70.42%)    155803.00 (-61.23%)    399476.00 ( -0.59%)    393918.00 ( -1.97%)    405771.00 (  0.98%)
TPut 20    411130.00 (  0.00%)    144024.00 (-64.97%)    116524.00 (-71.66%)    407799.00 ( -0.81%)    377706.00 ( -8.13%)    406038.00 ( -1.24%)
TPut 21    425352.00 (  0.00%)    154264.00 (-63.73%)    144766.00 (-65.97%)    429226.00 (  0.91%)    431677.00 (  1.49%)    431583.00 (  1.46%)
TPut 22    438150.00 (  0.00%)    153892.00 (-64.88%)    222211.00 (-49.28%)    385827.00 (-11.94%)    440379.00 (  0.51%)    438861.00 (  0.16%)
TPut 23    438425.00 (  0.00%)    146506.00 (-66.58%)    213367.00 (-51.33%)    433963.00 ( -1.02%)    361427.00 (-17.56%)    445293.00 (  1.57%)
TPut 24    461598.00 (  0.00%)    138869.00 (-69.92%)    189745.00 (-58.89%)    439691.00 ( -4.75%)    471567.00 (  2.16%)    488259.00 (  5.78%)
TPut 25    459475.00 (  0.00%)    141698.00 (-69.16%)    105196.00 (-77.11%)    431373.00 ( -6.12%)    487921.00 (  6.19%)    447353.00 ( -2.64%)
TPut 26    452651.00 (  0.00%)    142844.00 (-68.44%)    125573.00 (-72.26%)    447517.00 ( -1.13%)    425336.00 ( -6.03%)    469793.00 (  3.79%)
TPut 27    450436.00 (  0.00%)    140870.00 (-68.73%)     68802.00 (-84.73%)    430805.00 ( -4.36%)    456114.00 (  1.26%)    461172.00 (  2.38%)
TPut 28    459770.00 (  0.00%)    143078.00 (-68.88%)    144373.00 (-68.60%)    432260.00 ( -5.98%)    478317.00 (  4.03%)    452144.00 ( -1.66%)
TPut 29    450347.00 (  0.00%)    142076.00 (-68.45%)    221760.00 (-50.76%)    440423.00 ( -2.20%)    388175.00 (-13.81%)    473273.00 (  5.09%)
TPut 30    449252.00 (  0.00%)    146900.00 (-67.30%)    139971.00 (-68.84%)    435082.00 ( -3.15%)    440795.00 ( -1.88%)    435189.00 ( -3.13%)
TPut 31    446802.00 (  0.00%)    148008.00 (-66.87%)    195143.00 (-56.32%)    418684.00 ( -6.29%)    417343.00 ( -6.59%)    437562.00 ( -2.07%)
TPut 32    439701.00 (  0.00%)    149591.00 (-65.98%)    159107.00 (-63.81%)    421866.00 ( -4.06%)    438719.00 ( -0.22%)    469763.00 (  6.84%)
TPut 33    434477.00 (  0.00%)    142801.00 (-67.13%)    110758.00 (-74.51%)    420631.00 ( -3.19%)    454673.00 (  4.65%)    451224.00 (  3.85%)
TPut 34    423014.00 (  0.00%)    152308.00 (-63.99%)    111701.00 (-73.59%)    415202.00 ( -1.85%)    415194.00 ( -1.85%)    446735.00 (  5.61%)
TPut 35    429012.00 (  0.00%)    154116.00 (-64.08%)    118968.00 (-72.27%)    402395.00 ( -6.20%)    425151.00 ( -0.90%)    434230.00 (  1.22%)
TPut 36    421097.00 (  0.00%)    157571.00 (-62.58%)    174626.00 (-58.53%)    404770.00 ( -3.88%)    430480.00 (  2.23%)    425324.00 (  1.00%)
TPut 37    414815.00 (  0.00%)    150771.00 (-63.65%)    238764.00 (-42.44%)    388842.00 ( -6.26%)    393351.00 ( -5.17%)    405824.00 ( -2.17%)
TPut 38    412361.00 (  0.00%)    157070.00 (-61.91%)    173206.00 (-58.00%)    398947.00 ( -3.25%)    401555.00 ( -2.62%)    432074.00 (  4.78%)
TPut 39    402234.00 (  0.00%)    161487.00 (-59.85%)    119790.00 (-70.22%)    382645.00 ( -4.87%)    423106.00 (  5.19%)    401091.00 ( -0.28%)
TPut 40    380278.00 (  0.00%)    165947.00 (-56.36%)    309375.00 (-18.65%)    394039.00 (  3.62%)    405371.00 (  6.60%)    410739.00 (  8.01%)
TPut 41    393204.00 (  0.00%)    160540.00 (-59.17%)    146153.00 (-62.83%)    385605.00 ( -1.93%)    403383.00 (  2.59%)    372466.00 ( -5.27%)
TPut 42    380622.00 (  0.00%)    151946.00 (-60.08%)    269523.00 (-29.19%)    374843.00 ( -1.52%)    380797.00 (  0.05%)    396227.00 (  4.10%)
TPut 43    371566.00 (  0.00%)    162369.00 (-56.30%)    344584.00 ( -7.26%)    347951.00 ( -6.36%)    386765.00 (  4.09%)    345633.00 ( -6.98%)
TPut 44    365538.00 (  0.00%)    161127.00 (-55.92%)    147195.00 (-59.73%)    355070.00 ( -2.86%)    344701.00 ( -5.70%)    391276.00 (  7.04%)
TPut 45    359305.00 (  0.00%)    159062.00 (-55.73%)    102716.00 (-71.41%)    350973.00 ( -2.32%)    370666.00 (  3.16%)    331191.00 ( -7.82%)
TPut 46    343160.00 (  0.00%)    163889.00 (-52.24%)    309203.00 ( -9.90%)    347960.00 (  1.40%)    380147.00 ( 10.78%)    323176.00 ( -5.82%)
TPut 47    346983.00 (  0.00%)    168666.00 (-51.39%)    330345.00 ( -4.80%)    313612.00 ( -9.62%)    362189.00 (  4.38%)    343154.00 ( -1.10%)
TPut 48    338143.00 (  0.00%)    153448.00 (-54.62%)    291944.00 (-13.66%)    341809.00 (  1.08%)    365342.00 (  8.04%)    354348.00 (  4.79%)
TPut 49    333941.00 (  0.00%)    142784.00 (-57.24%)    252850.00 (-24.28%)    336174.00 (  0.67%)    371700.00 ( 11.31%)    353148.00 (  5.75%)
TPut 50    334001.00 (  0.00%)    135713.00 (-59.37%)    252350.00 (-24.45%)    322489.00 ( -3.45%)    367963.00 ( 10.17%)    355823.00 (  6.53%)
TPut 51    338310.00 (  0.00%)    133402.00 (-60.57%)    232361.00 (-31.32%)    354805.00 (  4.88%)    372592.00 ( 10.13%)    351194.00 (  3.81%)
TPut 52    322897.00 (  0.00%)    150293.00 (-53.45%)    193895.00 (-39.95%)    353169.00 (  9.38%)    363024.00 ( 12.43%)    344846.00 (  6.80%)
TPut 53    329801.00 (  0.00%)    160792.00 (-51.25%)    180672.00 (-45.22%)    353588.00 (  7.21%)    365359.00 ( 10.78%)    355499.00 (  7.79%)
TPut 54    336610.00 (  0.00%)    164696.00 (-51.07%)    248332.00 (-26.23%)    361189.00 (  7.30%)    377851.00 ( 12.25%)    363987.00 (  8.13%)
TPut 55    325920.00 (  0.00%)    172380.00 (-47.11%)    271331.00 (-16.75%)    365678.00 ( 12.20%)    375735.00 ( 15.28%)    363697.00 ( 11.59%)
TPut 56    318997.00 (  0.00%)    176071.00 (-44.80%)    155354.00 (-51.30%)    367048.00 ( 15.06%)    380588.00 ( 19.31%)    362614.00 ( 13.67%)
TPut 57    321776.00 (  0.00%)    174531.00 (-45.76%)    279294.00 (-13.20%)    341874.00 (  6.25%)    378996.00 ( 17.78%)    360366.00 ( 11.99%)
TPut 58    308532.00 (  0.00%)    174202.00 (-43.54%)    170351.00 (-44.79%)    348156.00 ( 12.84%)    361623.00 ( 17.21%)    369693.00 ( 19.82%)
TPut 59    318974.00 (  0.00%)    175343.00 (-45.03%)    243463.00 (-23.67%)    358252.00 ( 12.31%)    360457.00 ( 13.01%)    364556.00 ( 14.29%)
TPut 60    325465.00 (  0.00%)    173694.00 (-46.63%)    222867.00 (-31.52%)    360808.00 ( 10.86%)    362745.00 ( 11.45%)    354232.00 (  8.84%)
TPut 61    319151.00 (  0.00%)    172320.00 (-46.01%)    218542.00 (-31.52%)    350597.00 (  9.85%)    371277.00 ( 16.33%)    352478.00 ( 10.44%)
TPut 62    320837.00 (  0.00%)    172312.00 (-46.29%)    251630.00 (-21.57%)    359062.00 ( 11.91%)    361009.00 ( 12.52%)    352930.00 ( 10.00%)
TPut 63    318198.00 (  0.00%)    172297.00 (-45.85%)    172040.00 (-45.93%)    356137.00 ( 11.92%)    347637.00 (  9.25%)    335322.00 (  5.38%)
TPut 64    321438.00 (  0.00%)    171894.00 (-46.52%)    151337.00 (-52.92%)    347376.00 (  8.07%)    346756.00 (  7.88%)    351410.00 (  9.32%)
TPut 65    314482.00 (  0.00%)    169147.00 (-46.21%)    143487.00 (-54.37%)    351726.00 ( 11.84%)    357429.00 ( 13.66%)    351236.00 ( 11.69%)
TPut 66    316802.00 (  0.00%)    170234.00 (-46.26%)    230207.00 (-27.33%)    344548.00 (  8.76%)    362143.00 ( 14.31%)    347058.00 (  9.55%)
TPut 67    312139.00 (  0.00%)    168180.00 (-46.12%)    148468.00 (-52.44%)    329030.00 (  5.41%)    353305.00 ( 13.19%)    345903.00 ( 10.82%)
TPut 68    323918.00 (  0.00%)    168392.00 (-48.01%)    184696.00 (-42.98%)    319985.00 ( -1.21%)    344250.00 (  6.28%)    345703.00 (  6.73%)
TPut 69    307506.00 (  0.00%)    167082.00 (-45.67%)    221855.00 (-27.85%)    340673.00 ( 10.79%)    339346.00 ( 10.35%)    336071.00 (  9.29%)
TPut 70    306799.00 (  0.00%)    165764.00 (-45.97%)    246518.00 (-19.65%)    331678.00 (  8.11%)    349583.00 ( 13.95%)    341944.00 ( 11.46%)
TPut 71    304232.00 (  0.00%)    165289.00 (-45.67%)    225582.00 (-25.85%)    319824.00 (  5.13%)    335238.00 ( 10.19%)    343396.00 ( 12.87%)
TPut 72    301619.00 (  0.00%)    163909.00 (-45.66%)    154552.00 (-48.76%)    326875.00 (  8.37%)    345999.00 ( 14.71%)    343949.00 ( 14.03%)

Latest numacore is regressing really badly here.

autonuma is all right.

balancenuma is all right. Migration scalability patches actually seem to
hurt a little.

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc8                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7
                                  stats-v8r6          numacore-20121130           numafix-20121209         autonuma-v28fastr4           balancenuma-v9r2          balancenuma-v10r3
 Expctd Warehouse            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)
 Expctd Peak Bops        338143.00 (  0.00%)        153448.00 (-54.62%)        291944.00 (-13.66%)        341809.00 (  1.08%)        365342.00 (  8.04%)        354348.00 (  4.79%)
 Actual Warehouse            24.00 (  0.00%)            56.00 (133.33%)            43.00 ( 79.17%)            26.00 (  8.33%)            25.00 (  4.17%)            24.00 (  0.00%)
 Actual Peak Bops        461598.00 (  0.00%)        176071.00 (-61.86%)        344584.00 (-25.35%)        447517.00 ( -3.05%)        487921.00 (  5.70%)        488259.00 (  5.78%)
 SpecJBB Bops            163683.00 (  0.00%)         83963.00 (-48.70%)        109061.00 (-33.37%)        176379.00 (  7.76%)        184040.00 ( 12.44%)        179621.00 (  9.74%)
 SpecJBB Bops/JVM        163683.00 (  0.00%)         83963.00 (-48.70%)        109061.00 (-33.37%)        176379.00 (  7.76%)        184040.00 ( 12.44%)        179621.00 (  9.74%)

numacore regresses 25.35% at the peak and 33.37% on its specjbb score.

balancenuma does all right -- 5.78% gain at the peak, 9.74% on its overall
specjbb score.
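
For reference, the bracketed percentages are the relative change against the
stats-v8r6 baseline column. A minimal sketch of that arithmetic, using figures
copied from the SPECJBB PEAKS table above. The helper name pct_change is only
illustrative and is not something MMTests exposes; the 25.35% and 33.37%
figures quoted for numacore match the numafix-20121209 column.

```python
def pct_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0

# Figures copied from the SPECJBB PEAKS table; baseline is stats-v8r6.
BASE_PEAK, BASE_BOPS = 461598.0, 163683.0
rows = {
    "numafix-20121209":  {"peak": 344584.0, "bops": 109061.0},
    "balancenuma-v10r3": {"peak": 488259.0, "bops": 179621.0},
}

results = {}
for name, row in rows.items():
    results[name] = (pct_change(BASE_PEAK, row["peak"]),
                     pct_change(BASE_BOPS, row["bops"]))
    print(f"{name}: peak {results[name][0]:+.2f}%, bops {results[name][1]:+.2f}%")
```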

MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc8   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
          stats-v8r6    numacore     numafix    autonuma balancenuma balancenuma
                       -20121130   -20121209  -v28fastr4       -v9r2      -v10r3
User       316751.91   167098.56   227496.59   307598.67   309109.47   313644.48
System         60.28   122511.08    72477.33     4411.81     1820.70     2654.77
Elapsed      7434.08     7451.36     7476.09     7437.52     7438.28     7438.19

numacore's system CPU usage has improved but it's still insane -- 27 times
higher than balancenuma's, which is itself high. Put another way, numacore
is using over 1000 times more system CPU than the mainline kernel is.
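
The two ratios can be checked against the System row of the duration table
above. This is a sketch that assumes the comparison is between the
numafix-20121209, balancenuma-v10r3 and stats-v8r6 columns, since those are
the figures that reproduce the 27x number.

```python
# System CPU seconds copied from the "MMTests Statistics: duration" table.
system_cpu = {
    "stats-v8r6": 60.28,            # mainline baseline
    "numafix-20121209": 72477.33,   # latest numacore (assumed column)
    "balancenuma-v10r3": 2654.77,
}

vs_balancenuma = system_cpu["numafix-20121209"] / system_cpu["balancenuma-v10r3"]
vs_mainline = system_cpu["numafix-20121209"] / system_cpu["stats-v8r6"]

print(f"{vs_balancenuma:.1f}x balancenuma, {vs_mainline:.0f}x mainline")
```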

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc8   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
                            stats-v8r6    numacore     numafix    autonuma balancenuma balancenuma
                                         -20121130   -20121209  -v28fastr4       -v9r2      -v10r3
Page Ins                         37112       36416       34572       37436       35400       34708
Page Outs                        29252       35664       29788       28120       28504       28292
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                      3           2           3           2           2           2
THP collapse alloc                   0           0           0           4           0           0
THP splits                           0           0           0           1           0           0
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0   472734998           0    24675369    36216149
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0      490698           0       25613       37592
NUMA PTE updates                     0           0  2978374076           0   200854895   256255594
NUMA hint faults                     0           0           0           0   195451244   250219588
NUMA hint local faults               0           0           0           0    50377035    63739483
NUMA pages migrated                  0           0   472734998           0    24675369    36216149
AutoNUMA cost                        0           0       29830           0      979131     1253579

numacore is migrating on average 247MB/sec. balancenuma is migrating
19MB/sec on average.
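
The migration rates follow from "NUMA pages migrated" divided by the elapsed
time in the duration table. A rough sketch, assuming 4K base pages, MB meaning
2^20 bytes, and that the numacore figure is the numafix-20121209 column and the
balancenuma figure is the v10r3 column:

```python
PAGE_SIZE = 4096          # assumption: 4K base pages on x86-64
MIB = 1024 * 1024

def migrate_rate_mib(pages_migrated, elapsed_seconds):
    """Average migration bandwidth in MiB/sec over the whole run."""
    return pages_migrated * PAGE_SIZE / elapsed_seconds / MIB

# (NUMA pages migrated, Elapsed seconds) copied from the tables above.
numacore_rate = migrate_rate_mib(472734998, 7476.09)
balancenuma_rate = migrate_rate_mib(36216149, 7438.19)

print(f"numacore:    {numacore_rate:.0f} MB/sec")
print(f"balancenuma: {balancenuma_rate:.0f} MB/sec")
```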

I ran the other normal benchmarks too. kernbench and aim9 are more or less
ok. The impact is on hackbench.

HACKBENCH PIPES
                     3.7.0-rc7             3.7.0-rc6             3.7.0-rc8             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                    stats-v8r6     numacore-20121130      numafix-20121209    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r3
Procs 1       0.0250 (  0.00%)      0.0260 ( -4.00%)      0.0246 (  1.48%)      0.0261 ( -4.27%)      0.0325 (-30.07%)      0.0353 (-41.14%)
Procs 4       0.0696 (  0.00%)      0.0702 ( -0.84%)      0.0602 ( 13.57%)      0.0707 ( -1.65%)      0.0760 ( -9.20%)      0.0738 ( -5.98%)
Procs 8       0.0836 (  0.00%)      0.0973 (-16.43%)      0.0949 (-13.53%)      0.1030 (-23.21%)      0.0887 ( -6.15%)      0.1031 (-23.36%)
Procs 12      0.0971 (  0.00%)      0.0969 (  0.21%)      0.1447 (-49.00%)      0.1235 (-27.19%)      0.0953 (  1.88%)      0.1394 (-43.56%)
Procs 16      0.1218 (  0.00%)      0.1286 ( -5.52%)      0.2214 (-81.70%)      0.1775 (-45.69%)      0.1105 (  9.33%)      0.2188 (-79.57%)
Procs 20      0.1472 (  0.00%)      0.1508 ( -2.48%)      0.2744 (-86.43%)      0.1584 ( -7.64%)      0.1378 (  6.38%)      0.2567 (-74.37%)
Procs 24      0.1684 (  0.00%)      0.1823 ( -8.20%)      0.3602 (-113.82%)      0.4648 (-175.96%)      0.1623 (  3.68%)      0.3118 (-85.12%)
Procs 28      0.1919 (  0.00%)      0.1969 ( -2.61%)      0.4632 (-141.39%)      0.5287 (-175.57%)      0.1900 (  0.96%)      0.4326 (-125.48%)
Procs 32      0.2256 (  0.00%)      0.2163 (  4.12%)      0.5040 (-123.40%)      0.4607 (-104.23%)      0.2163 (  4.13%)      0.4583 (-103.16%)
Procs 36      0.2228 (  0.00%)      0.2658 (-19.29%)      0.5481 (-145.98%)      0.6190 (-177.83%)      0.2570 (-15.33%)      0.5267 (-136.38%)
Procs 40      0.2811 (  0.00%)      0.2906 ( -3.37%)      0.6223 (-121.36%)      0.2595 (  7.69%)      0.2638 (  6.15%)      0.5941 (-111.35%)

HACKBENCH SOCKETS
                     3.7.0-rc7             3.7.0-rc6             3.7.0-rc8             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                    stats-v8r6     numacore-20121130      numafix-20121209    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r3
Procs 1       0.0220 (  0.00%)      0.0220 (  0.00%)      0.0229 ( -4.20%)      0.0283 (-28.66%)      0.0216 (  1.89%)      0.0256 (-16.36%)
Procs 4       0.0456 (  0.00%)      0.0513 (-12.51%)      0.0559 (-22.50%)      0.0820 (-79.73%)      0.0407 ( 10.76%)      0.0627 (-37.46%)
Procs 8       0.0679 (  0.00%)      0.0714 ( -5.20%)      0.1472 (-116.82%)      0.2772 (-308.32%)      0.0697 ( -2.60%)      0.1715 (-152.63%)
Procs 12      0.0940 (  0.00%)      0.0973 ( -3.56%)      0.2259 (-140.32%)      0.1155 (-22.87%)      0.0973 ( -3.55%)      0.2459 (-161.55%)
Procs 16      0.1181 (  0.00%)      0.1263 ( -6.96%)      0.3248 (-174.92%)      0.4467 (-278.19%)      0.1234 ( -4.46%)      0.3231 (-173.55%)
Procs 20      0.1504 (  0.00%)      0.1531 ( -1.83%)      0.4039 (-168.54%)      0.4917 (-226.94%)      0.1534 ( -1.97%)      0.4172 (-177.36%)
Procs 24      0.1757 (  0.00%)      0.1826 ( -3.92%)      0.3965 (-125.60%)      0.5142 (-192.57%)      0.1826 ( -3.89%)      0.4759 (-170.78%)
Procs 28      0.2044 (  0.00%)      0.2166 ( -5.93%)      0.5438 (-165.99%)      0.6600 (-222.85%)      0.2164 ( -5.88%)      0.5455 (-166.83%)
Procs 32      0.2456 (  0.00%)      0.2501 ( -1.86%)      0.6261 (-154.93%)      0.6391 (-160.22%)      0.2449 (  0.27%)      0.6093 (-148.11%)
Procs 36      0.2649 (  0.00%)      0.2747 ( -3.70%)      0.7066 (-166.71%)      0.5775 (-117.97%)      0.2815 ( -6.27%)      0.6840 (-158.19%)
Procs 40      0.3067 (  0.00%)      0.3114 ( -1.56%)      0.7588 (-147.42%)      0.7517 (-145.12%)      0.3081 ( -0.48%)      0.8871 (-189.27%)

Latest numacore, autonuma and balancenuma are all butchering hackbench
performance. The fact that balancenuma started hurting performance with
the migration scalability patches leads me to conclude that they might be
directly or indirectly responsible.

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc8   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
                            stats-v8r6    numacore     numafix    autonuma balancenuma balancenuma
                                         -20121130   -20121209  -v28fastr4       -v9r2      -v10r3
Page Ins                             4           4           4           4           4           4
Page Outs                         1540        1636        2568        2264        1548        2484
Swap Ins                             0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0
THP fault alloc                      5           0           0           0           6           5
THP collapse alloc                   0           0           0           0           0           0
THP splits                           0           0           0           0           0           0
THP fault fallback                   0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0
Page migrate success                 0           0           0           0        1649          49
Page migrate failure                 0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0
Compaction cost                      0           0           0           0           1           0
NUMA PTE updates                     0           0           0           0       21646       22884
NUMA hint faults                     0           0           0           0        1045        2131
NUMA hint local faults               0           0           0           0          40        1218
NUMA pages migrated                  0           0           0           0        1649          49
AutoNUMA cost                        0           0           0           0           5          10

Based on this, I believe the migration patches are only indirectly
responsible. No way should hackbench be migrating or receiving a PTE update
at all. Rather than withdrawing the scalability patches it might make more
sense to either increase the length of time before a PTE update takes place or
to delay NUMA PTE updates until the RSS reaches a particular size instead
of just relying on where the task gets scheduled.
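
As a toy model of that idea, and not kernel code, the gate could look
something like the following. Every name and threshold here (Task,
should_update_ptes, min_rss_pages, scan_delay_ms) is hypothetical and only
illustrates the shape of the check. It also combines both conditions, whereas
either one alone might be enough.

```python
from dataclasses import dataclass

@dataclass
class Task:
    rss_pages: int    # resident set size, in pages
    runtime_ms: int   # how long the task has been running

def should_update_ptes(task, min_rss_pages=4096, scan_delay_ms=1000):
    """Decide whether NUMA PTE updates should start for this task.

    Both thresholds are made-up illustrative values, not tuned numbers.
    """
    if task.runtime_ms < scan_delay_ms:
        return False   # too young: short-lived tasks never get scanned
    if task.rss_pages < min_rss_pages:
        return False   # too small: little to gain from migrating it
    return True

hackbench_like = Task(rss_pages=300, runtime_ms=50)
specjbb_like = Task(rss_pages=2_000_000, runtime_ms=600_000)
print(should_update_ptes(hackbench_like), should_update_ptes(specjbb_like))
```

With a gate like this, a short-lived, small-footprint task of the hackbench
kind would never see a PTE update, while a long-running JVM still would.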

So overall, I still believe that balancenuma should be merged at this point
based on these results. Nothing stops you rebasing numacore on top
afterwards and introducing it in parts, validating at each point that it's
actually improving performance and not just assuming it does.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-09 20:36   ` Mel Gorman
@ 2012-12-09 21:17     ` Kirill A. Shutemov
  2012-12-10  8:44       ` Mel Gorman
  2012-12-10  5:07     ` Srikar Dronamraju
  2012-12-10 11:39     ` Ingo Molnar
  2 siblings, 1 reply; 80+ messages in thread
From: Kirill A. Shutemov @ 2012-12-09 21:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Sun, Dec 09, 2012 at 08:36:31PM +0000, Mel Gorman wrote:
> Either way, last night I applied a patch on top of latest tip/master to
> remove the nr_cpus_allowed check so that numacore would be enabled again
> and tested that. In some places it has indeed much improved. In others
> it is still regressing badly and in two case, it's corrupting memory --
> specjbb when THP is enabled crashes when running for single or multiple
> JVMs. It is likely that a zero page is being inserted due to a race with
> migration and causes the JVM to throw a null pointer exception. Here is
> the comparison on the rough off-chance you actually read it this time.

You are talking about the huge zero page, right?

I've fixed a race in huge zero page implementation recently[1]. Symptoms
were similar -- SIGSEGV in JVM. The patch is in mmotm-2012-12-05-16-56 and
later.

[1] http://lkml.org/lkml/2012/11/30/279
-- 
 Kirill A. Shutemov


* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-09 20:36   ` Mel Gorman
  2012-12-09 21:17     ` Kirill A. Shutemov
@ 2012-12-10  5:07     ` Srikar Dronamraju
  2012-12-10  6:28       ` Srikar Dronamraju
                         ` (2 more replies)
  2012-12-10 11:39     ` Ingo Molnar
  2 siblings, 3 replies; 80+ messages in thread
From: Srikar Dronamraju @ 2012-12-10  5:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

> 
> Either way, last night I applied a patch on top of latest tip/master to
> remove the nr_cpus_allowed check so that numacore would be enabled again
> and tested that. In some places it has indeed much improved. In others
> it is still regressing badly and in two case, it's corrupting memory --
> specjbb when THP is enabled crashes when running for single or multiple
> JVMs. It is likely that a zero page is being inserted due to a race with
> migration and causes the JVM to throw a null pointer exception. Here is
> the comparison on the rough off-chance you actually read it this time.

I see this failure when running with THP and KSM enabled on 
Friday's Tip master. Not sure if Mel was talking about the same issue.

------------[ cut here ]------------
kernel BUG at ../kernel/sched/fair.c:2371!
invalid opcode: 0000 [#1] SMP
Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support kvm_intel kvm microcode cdc_ether usbnet mii serio_raw i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma i7core_edac edac_core bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
CPU 4
Pid: 116, comm: ksmd Not tainted 3.7.0-rc8-tip_master+ #5 IBM BladeCenter HS22V -[7871AC1]-/81Y5995
RIP: 0010:[<ffffffff8108c139>]  [<ffffffff8108c139>] task_numa_fault+0x1a9/0x1e0
RSP: 0018:ffff880372237ba8  EFLAGS: 00010246
RAX: 0000000000000074 RBX: 0000000000000001 RCX: 0000000000000001
RDX: 00000000000012ae RSI: 0000000000000004 RDI: 00007faf4fc01000
RBP: ffff880372237be8 R08: 0000000000000000 R09: ffff8803657463f0
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000012
R13: ffff880372210d00 R14: 0000000000010088 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88037fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000001d26fec CR3: 000000000169f000 CR4: 00000000000027e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ksmd (pid: 116, threadinfo ffff880372236000, task ffff880372210d00)
Stack:
 ffffea0016026c58 00007faf4fc00000 ffff880372237c48 0000000000000001
 00007faf4fc01000 ffffea000d6df928 0000000000000001 ffffea00166e9268
 ffff880372237c48 ffffffff8113cd0e ffff880300000001 0000000000000002
Call Trace:
 [<ffffffff8113cd0e>] __do_numa_page+0xde/0x160
 [<ffffffff8113de9e>] handle_pte_fault+0x32e/0xcd0
 [<ffffffffa01c22c0>] ? drop_large_spte+0x30/0x30 [kvm]
 [<ffffffffa01bf215>] ? kvm_set_spte_hva+0x25/0x30 [kvm]
 [<ffffffff8113eab9>] handle_mm_fault+0x279/0x760
 [<ffffffff8115c024>] break_ksm+0x74/0xa0
 [<ffffffff8115c222>] break_cow+0xa2/0xb0
 [<ffffffff8115e38c>] ksm_scan_thread+0xb5c/0xd50
 [<ffffffff810771c0>] ? wake_up_bit+0x40/0x40
 [<ffffffff8115d830>] ? run_store+0x340/0x340
 [<ffffffff8107692e>] kthread+0xce/0xe0
 [<ffffffff81076860>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff814fa7ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff81076860>] ? kthread_freezable_should_stop+0x70/0x70
Code: 89 f0 41 bf 01 00 00 00 8b 1c 10 e9 d7 fe ff ff 8d 14 09 48 63 d2 eb bd 66 2e 0f 1f 84 00 00 00 00 00 49 8b 85 98 07 00 00 eb 91 <0f> 0b eb fe 80 3d 9c 3b 6b 00 01 0f 84 be fe ff ff be 42 09 00
RIP  [<ffffffff8108c139>] task_numa_fault+0x1a9/0x1e0
 RSP <ffff880372237ba8>
---[ end trace 9584c9b03fc0dbc0 ]---



* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-10  5:07     ` Srikar Dronamraju
@ 2012-12-10  6:28       ` Srikar Dronamraju
  2012-12-10 12:44         ` [PATCH] sched: Fix task_numa_fault() + KSM crash Ingo Molnar
  2012-12-10  8:46       ` [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
  2012-12-10 12:35       ` Ingo Molnar
  2 siblings, 1 reply; 80+ messages in thread
From: Srikar Dronamraju @ 2012-12-10  6:28 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Mel Gorman, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Aneesh Kumar,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2012-12-10 10:37:10]:

> > 
> > Either way, last night I applied a patch on top of latest tip/master to
> > remove the nr_cpus_allowed check so that numacore would be enabled again
> > and tested that. In some places it has indeed much improved. In others
> > it is still regressing badly and in two case, it's corrupting memory --
> > specjbb when THP is enabled crashes when running for single or multiple
> > JVMs. It is likely that a zero page is being inserted due to a race with
> > migration and causes the JVM to throw a null pointer exception. Here is
> > the comparison on the rough off-chance you actually read it this time.
> 
> I see this failure when running with THP and KSM enabled on 
> Friday's Tip master. Not sure if Mel was talking about the same issue.
> 
 
It occurs even with !THP, as long as KSM is enabled.

> ------------[ cut here ]------------
> kernel BUG at ../kernel/sched/fair.c:2371!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt iTCO_vendor_support kvm_intel kvm microcode cdc_ether usbnet mii serio_raw i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma i7core_edac edac_core bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> CPU 4
> Pid: 116, comm: ksmd Not tainted 3.7.0-rc8-tip_master+ #5 IBM BladeCenter HS22V -[7871AC1]-/81Y5995
> RIP: 0010:[<ffffffff8108c139>]  [<ffffffff8108c139>] task_numa_fault+0x1a9/0x1e0
> RSP: 0018:ffff880372237ba8  EFLAGS: 00010246
> RAX: 0000000000000074 RBX: 0000000000000001 RCX: 0000000000000001
> RDX: 00000000000012ae RSI: 0000000000000004 RDI: 00007faf4fc01000
> RBP: ffff880372237be8 R08: 0000000000000000 R09: ffff8803657463f0
> R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000012
> R13: ffff880372210d00 R14: 0000000000010088 R15: 0000000000000000
> FS:  0000000000000000(0000) GS:ffff88037fc80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000001d26fec CR3: 000000000169f000 CR4: 00000000000027e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process ksmd (pid: 116, threadinfo ffff880372236000, task ffff880372210d00)
> Stack:
>  ffffea0016026c58 00007faf4fc00000 ffff880372237c48 0000000000000001
>  00007faf4fc01000 ffffea000d6df928 0000000000000001 ffffea00166e9268
>  ffff880372237c48 ffffffff8113cd0e ffff880300000001 0000000000000002
> Call Trace:
>  [<ffffffff8113cd0e>] __do_numa_page+0xde/0x160
>  [<ffffffff8113de9e>] handle_pte_fault+0x32e/0xcd0
>  [<ffffffffa01c22c0>] ? drop_large_spte+0x30/0x30 [kvm]
>  [<ffffffffa01bf215>] ? kvm_set_spte_hva+0x25/0x30 [kvm]
>  [<ffffffff8113eab9>] handle_mm_fault+0x279/0x760
>  [<ffffffff8115c024>] break_ksm+0x74/0xa0
>  [<ffffffff8115c222>] break_cow+0xa2/0xb0
>  [<ffffffff8115e38c>] ksm_scan_thread+0xb5c/0xd50
>  [<ffffffff810771c0>] ? wake_up_bit+0x40/0x40
>  [<ffffffff8115d830>] ? run_store+0x340/0x340
>  [<ffffffff8107692e>] kthread+0xce/0xe0
>  [<ffffffff81076860>] ? kthread_freezable_should_stop+0x70/0x70
>  [<ffffffff814fa7ac>] ret_from_fork+0x7c/0xb0
>  [<ffffffff81076860>] ? kthread_freezable_should_stop+0x70/0x70
> Code: 89 f0 41 bf 01 00 00 00 8b 1c 10 e9 d7 fe ff ff 8d 14 09 48 63 d2 eb bd 66 2e 0f 1f 84 00 00 00 00 00 49 8b 85 98 07 00 00 eb 91 <0f> 0b eb fe 80 3d 9c 3b 6b 00 01 0f 84 be fe ff ff be 42 09 00
> RIP  [<ffffffff8108c139>] task_numa_fault+0x1a9/0x1e0
>  RSP <ffff880372237ba8>
> ---[ end trace 9584c9b03fc0dbc0 ]---
> 



* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-09 21:17     ` Kirill A. Shutemov
@ 2012-12-10  8:44       ` Mel Gorman
  0 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-10  8:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ingo Molnar, Peter Zijlstra, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Sun, Dec 09, 2012 at 11:17:09PM +0200, Kirill A. Shutemov wrote:
> On Sun, Dec 09, 2012 at 08:36:31PM +0000, Mel Gorman wrote:
> > Either way, last night I applied a patch on top of latest tip/master to
> > remove the nr_cpus_allowed check so that numacore would be enabled again
> > and tested that. In some places it has indeed much improved. In others
> > it is still regressing badly and in two case, it's corrupting memory --
> > specjbb when THP is enabled crashes when running for single or multiple
> > JVMs. It is likely that a zero page is being inserted due to a race with
> > migration and causes the JVM to throw a null pointer exception. Here is
> > the comparison on the rough off-chance you actually read it this time.
> 
> You are talking about the huge zero page, right?
> 

No, this is happening in tip/master which does not include the huge zero
page work yet. AFAIK, that's still queued in Andrew's tree for the next
merge window. It is possible that there will be collisions between numa
balancing and the huge zero page work but it hasn't happened yet.

> I've fixed a race in huge zero page implementation recently[1]. Symptoms
> were similar -- SIGSEGV in JVM. The patch is in mmotm-2012-12-05-16-56 and
> later.
> 

It might be a similar class of bug.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-10  5:07     ` Srikar Dronamraju
  2012-12-10  6:28       ` Srikar Dronamraju
@ 2012-12-10  8:46       ` Mel Gorman
  2012-12-10 12:35       ` Ingo Molnar
  2 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-10  8:46 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

On Mon, Dec 10, 2012 at 10:37:10AM +0530, Srikar Dronamraju wrote:
> > 
> > Either way, last night I applied a patch on top of latest tip/master to
> > remove the nr_cpus_allowed check so that numacore would be enabled again
> > and tested that. In some places it has indeed much improved. In others
> > it is still regressing badly and in two case, it's corrupting memory --
> > specjbb when THP is enabled crashes when running for single or multiple
> > JVMs. It is likely that a zero page is being inserted due to a race with
> > migration and causes the JVM to throw a null pointer exception. Here is
> > the comparison on the rough off-chance you actually read it this time.
> 
> I see this failure when running with THP and KSM enabled on 
> Friday's Tip master. Not sure if Mel was talking about the same issue.
> 
> ------------[ cut here ]------------
> kernel BUG at ../kernel/sched/fair.c:2371!

I'm not, this is new to me. I grepped the console logs I have and the closest
I see is a WARN_ON triggered in numacore v17 which is no longer relevant.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-09 20:36   ` Mel Gorman
  2012-12-09 21:17     ` Kirill A. Shutemov
  2012-12-10  5:07     ` Srikar Dronamraju
@ 2012-12-10 11:39     ` Ingo Molnar
  2012-12-10 11:53       ` Ingo Molnar
  2012-12-10 15:24       ` Mel Gorman
  2 siblings, 2 replies; 80+ messages in thread
From: Ingo Molnar @ 2012-12-10 11:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> On Fri, Dec 07, 2012 at 12:01:13PM +0100, Ingo Molnar wrote:
> > 
> > * Mel Gorman <mgorman@suse.de> wrote:
> > 
> > > This is a full release of all the patches so apologies for the 
> > > flood. [...]
> > 
> > I have yet to process all your mails, but assuming I address all 
> > your review feedback and the latest unified tree in tip:master 
> > shows no regression in your testing, would you be willing to 
> > start using it for ongoing work?
> > 
> 
> Ingo,
> 
> If you had read the second paragraph of the mail you just responded to or
> the results at the end then you would have seen that I had problems with
> the performance. [...]

I've posted a (NUMA-placement sensitive workload centric) 
performance comparison between "balancenuma", AutoNUMA and 
numa/core unified-v3 to:

   https://lkml.org/lkml/2012/12/7/331

I tried to address all performance regressions you and others 
have reported.

Here's the direct [bandwidth] comparison of 'balancenuma v10' to 
my -v3 tree:

                            balancenuma  | NUMA-tip
 [test unit]            :          -v10  |    -v3
------------------------------------------------------------
 2x1-bw-process         :         6.136  |  9.647:  57.2%
 3x1-bw-process         :         7.250  | 14.528: 100.4%
 4x1-bw-process         :         6.867  | 18.903: 175.3%
 8x1-bw-process         :         7.974  | 26.829: 236.5%
 8x1-bw-process-NOTHP   :         5.937  | 22.237: 274.5%
 16x1-bw-process        :         5.592  | 29.294: 423.9%
 4x1-bw-thread          :        13.598  | 19.290:  41.9%
 8x1-bw-thread          :        16.356  | 26.391:  61.4%
 16x1-bw-thread         :        24.608  | 29.557:  20.1%
 32x1-bw-thread         :        25.477  | 30.232:  18.7%
 2x3-bw-thread          :         8.785  | 15.327:  74.5%
 4x4-bw-thread          :         6.366  | 27.957: 339.2%
 4x6-bw-thread          :         6.287  | 27.877: 343.4%
 4x8-bw-thread          :         5.860  | 28.439: 385.3%
 4x8-bw-thread-NOTHP    :         6.167  | 25.067: 306.5%
 3x3-bw-thread          :         8.235  | 21.560: 161.8%
 5x5-bw-thread          :         5.762  | 26.081: 352.6%
 2x16-bw-thread         :         5.920  | 23.269: 293.1%
 1x32-bw-thread         :         5.828  | 18.985: 225.8%
 numa02-bw              :        29.054  | 31.431:   8.2%
 numa02-bw-NOTHP        :        27.064  | 29.104:   7.5%
 numa01-bw-thread       :        20.338  | 28.607:  40.7%
 numa01-bw-thread-NOTHP :        18.528  | 21.119:  14.0%
------------------------------------------------------------

I also tried to reproduce and fix as many bugs you reported as 
possible - but my point is that it would be _much_ better if we 
actually joined forces.

> [...] You would also know that tip/master testing for the last 
> week was failing due to a boot problem (issue was in mainline 
> not tip and has been already fixed) and would have known that 
> since the -v18 release that numacore was effectively disabled 
> on my test machine.

I'm glad it's fixed.

> Clearly you are not reading the bug reports you are receiving 
> and you're not seeing the small bit of review feedback or 
> answering the review questions you have received either. Why 
> would I be more forthcoming when I feel that it'll simply be 
> ignored? [...]

I am reading the bug reports and addressing bugs as I can.

> [...]  You simply assume that each batch of patches you place 
> on top must be fixing all known regressions and ignoring any 
> evidence to the contrary.
>
> If you had read my mail from last Tuesday you would even know 
> which patch was causing the problem that effectively disabled 
> numacore although not why. The comment about p->numa_faults 
> was completely off the mark (long journey, was tired, assumed 
> numa_faults was a counter and not a pointer which was 
> careless).  If you had called me on it then I would have 
> spotted the actual problem sooner. The problem was indeed with 
> the nr_cpus_allowed == num_online_cpus()s check which I had 
> pointed out was a suspicious check although for different 
> reasons. As it turns out, a printk() bodge showed that 
> nr_cpus_allowed == 80 set in sched_init_smp() while 
> num_online_cpus() == 48. This effectively disabling numacore. 
> If you had responded to the bug report, this would likely have 
> been found last Wednesday.

Does changing it from num_online_cpus() to num_possible_cpus() 
help? (Can send a patch if you want.)

> > It would make it much easier for me to pick up your 
> > enhancements, fixes, etc.
> > 
> > > Changelog since V9
> > >   o Migration scalability                                             (mingo)
> > 
> > To *really* see migration scalability bottlenecks you need to 
> > remove the migration-bandwidth throttling kludge from your tree 
> > (or configure it up very high if you want to do it simple).
> > 
> 
> Why is it a kludge? I already explained what the rational 
> behind the rate limiting was. It's not about scalability, it's 
> about mitigating worse-case behaviour and the amount of time 
> the kernel spends moving data around which a deliberately 
> adverse workload can trigger.  It is unacceptable if during a 
> phase change that a process would stall potentially for 
> milliseconds (seconds if the node is large enough I guess) 
> while the data is being migrated. Here is it again -- 
> http://www.spinics.net/lists/linux-mm/msg47440.html . You 
> either ignored the mail or simply could not be bothered 
> explaining why you thought this was the incorrect decision or 
> why the concerns about an adverse workload were unimportant.

I think the stalls could have been at least in part due to the 
scalability bottlenecks that the rate-limiting code has hidden.

If you think of the NUMA migration as a natural part of the 
workload, as a sort of extended cache-miss, and if you assume 
that the scheduler is intelligent about not flip-flopping tasks 
between nodes (which the latest code certainly is), then I don't 
see why the rate of migration should be rate-limited in the VM.

Note that I tried to quantify this effect: the perf bench numa 
testcases start from a practical 'worst-case adverse' workload 
in essence: all pages concentrated on the wrong node, and the 
workload having to migrate all of them over.

We could add a new 'absolutely worst case' testcase, to make sure 
it behaves sanely?

> I have a vague suspicion actually that when you are modelling 
> the task->data relationship that you make an implicit 
> assumption that moving data has zero or near-zero cost. In 
> such a model it would always make sense to move quickly and 
> immediately but in practice the cost of moving can exceed the 
> performance benefit of accessing local data and lead to 
> regressions. It becomes more pronounced if the nodes are not 
> fully connected.

I make no such assumption - convergence costs were part of my 
measurements.

> > Some (certainly not all) of the performance regressions you 
> > reported were certainly due to numa/core code hitting the 
> > migration codepaths as aggressively as the workload demanded 
> > - and hitting scalability bottlenecks.
> 
> How are you so certain? [...]

Hm, I don't think my "some (certainly not all)" statement 
reflected any sort of certainty. So we violently agree about:

> [...] How do you not know it's because your code is migrating 
> excessively for no good reason because the algorithm has a 
> flaw in it? [...]

That's another source - but again not something we should fix by 
hiding it under the carpet via migration bandwidth rate limits, 
right?

> [...] Or that the cost of excessive migration is not being 
> offset by local data accesses? [...]

That's another possibility.

The _real_ fix is to avoid excessive migration on the CPU and 
memory placement side, not to throttle the basic mechanism 
itself!

I don't exclude the possibility that bandwidth limits might be 
needed - but only if everything else fails. Meanwhile, the 
bandwidth limits were actively hiding scalability bottlenecks, 
which bottlenecks only trigger at higher migration rates.

> [...] The critical point to note is that if it really was only 
> scalability problems then autonuma would suffer the same 
> problems and would be impossible to autonumas performance to 
> exceed numacores. This isn't the case making it unlikely the 
> scalability is your only problem.

The scheduling patterns are different - so they can hit 
different bottlenecks.

> Either way, last night I applied a patch on top of latest 
> tip/master to remove the nr_cpus_allowed check so that 
> numacore would be enabled again and tested that. In some 
> places it has indeed much improved. In others it is still 
> regressing badly and in two case, it's corrupting memory -- 
> specjbb when THP is enabled crashes when running for single or 
> multiple JVMs. It is likely that a zero page is being inserted 
> due to a race with migration and causes the JVM to throw a 
> null pointer exception. Here is the comparison on the rough 
> off-chance you actually read it this time.

Can you still see the JVM crash with the unified -v3 tree?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-10 11:39     ` Ingo Molnar
@ 2012-12-10 11:53       ` Ingo Molnar
  2012-12-10 15:24       ` Mel Gorman
  1 sibling, 0 replies; 80+ messages in thread
From: Ingo Molnar @ 2012-12-10 11:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML


* Ingo Molnar <mingo@kernel.org> wrote:

> > reasons. As it turns out, a printk() bodge showed that 
> > nr_cpus_allowed == 80 set in sched_init_smp() while 
> > num_online_cpus() == 48. This effectively disabling 
> > numacore. If you had responded to the bug report, this would 
> > likely have been found last Wednesday.
> 
> Does changing it from num_online_cpus() to num_possible_cpus() 
> help? (Can send a patch if you want.)

I.e. something like the patch below.

Thanks,

	Ingo

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 503ec29..9d11a8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2646,7 +2646,7 @@ static bool task_numa_candidate(struct task_struct *p)
 
 	/* Don't disturb hard-bound tasks: */
 	if (sched_feat(NUMA_EXCLUDE_AFFINE)) {
-		if (p->nr_cpus_allowed != num_online_cpus())
+		if (p->nr_cpus_allowed != num_possible_cpus())
 			return false;
 	}
 

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-10  5:07     ` Srikar Dronamraju
  2012-12-10  6:28       ` Srikar Dronamraju
  2012-12-10  8:46       ` [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
@ 2012-12-10 12:35       ` Ingo Molnar
  2 siblings, 0 replies; 80+ messages in thread
From: Ingo Molnar @ 2012-12-10 12:35 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Peter Zijlstra, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML


hi Srikar,

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> > 
> > Either way, last night I applied a patch on top of latest tip/master to
> > remove the nr_cpus_allowed check so that numacore would be enabled again
> > and tested that. In some places it has indeed much improved. In others
> > it is still regressing badly and in two case, it's corrupting memory --
> > specjbb when THP is enabled crashes when running for single or multiple
> > JVMs. It is likely that a zero page is being inserted due to a race with
> > migration and causes the JVM to throw a null pointer exception. Here is
> > the comparison on the rough off-chance you actually read it this time.
> 
> I see this failure when running with THP and KSM enabled on 
> Friday's Tip master. Not sure if Mel was talking about the same issue.
> 
> ------------[ cut here ]------------
> kernel BUG at ../kernel/sched/fair.c:2371!

Could you check whether today's -tip (7ea8701a1a51 or later), 
plus the patch below, addresses the crash - while still giving 
good NUMA performance?

Thanks,

	Ingo

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9d11a8a..6a89787 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2351,6 +2351,9 @@ void task_numa_fault(unsigned long addr, int node, int last_cpupid, int pages, b
 	int priv;
 	int idx;
 
+	if (!p->numa_faults)
+		return;
+
 	if (last_cpupid != cpu_pid_to_cpupid(-1, -1)) {
 		/* Did we access it last time around? */
 		if (last_pid == this_pid) {

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH] sched: Fix task_numa_fault() + KSM crash
  2012-12-10  6:28       ` Srikar Dronamraju
@ 2012-12-10 12:44         ` Ingo Molnar
  2012-12-13 13:57           ` Srikar Dronamraju
  0 siblings, 1 reply; 80+ messages in thread
From: Ingo Molnar @ 2012-12-10 12:44 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Mel Gorman, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

Srikar Dronamraju reported that the following assert triggers on 
his box:

   kernel BUG at ../kernel/sched/fair.c:2371!

   Call Trace:
     [<ffffffff8113cd0e>] __do_numa_page+0xde/0x160
     [<ffffffff8113de9e>] handle_pte_fault+0x32e/0xcd0
     [<ffffffffa01c22c0>] ? drop_large_spte+0x30/0x30 [kvm]
     [<ffffffffa01bf215>] ? kvm_set_spte_hva+0x25/0x30 [kvm]
     [<ffffffff8113eab9>] handle_mm_fault+0x279/0x760
     [<ffffffff8115c024>] break_ksm+0x74/0xa0
     [<ffffffff8115c222>] break_cow+0xa2/0xb0
     [<ffffffff8115e38c>] ksm_scan_thread+0xb5c/0xd50
     [<ffffffff810771c0>] ? wake_up_bit+0x40/0x40
     [<ffffffff8115d830>] ? run_store+0x340/0x340
     [<ffffffff8107692e>] kthread+0xce/0xe0

This means that task_numa_fault() was called for a kernel thread
which has no fault tracking.

This scenario is actually possible if a kernel thread does
fault processing on behalf of a user-space task - ignore
the page fault in that case.

Also remove the (now never triggering) assert and robustify
a nearby assert.

Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9d11a8a..61c7a10 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2351,6 +2351,13 @@ void task_numa_fault(unsigned long addr, int node, int last_cpupid, int pages, b
 	int priv;
 	int idx;
 
+	/*
+	 * Kernel threads might not have an mm but might still
+	 * do fault processing (such as KSM):
+	 */
+	if (!p->numa_faults)
+		return;
+
 	if (last_cpupid != cpu_pid_to_cpupid(-1, -1)) {
 		/* Did we access it last time around? */
 		if (last_pid == this_pid) {
@@ -2367,8 +2374,8 @@ void task_numa_fault(unsigned long addr, int node, int last_cpupid, int pages, b
 
 	idx = 2*node + priv;
 
-	WARN_ON_ONCE(last_cpu == -1 || node == -1);
-	BUG_ON(!p->numa_faults);
+	if (WARN_ON_ONCE(last_cpu == -1 || node == -1))
+		return;
 
 	p->numa_faults_curr[idx] += pages;
 	shared_fault_tick(p, node, last_cpu, pages);

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-10 11:39     ` Ingo Molnar
  2012-12-10 11:53       ` Ingo Molnar
@ 2012-12-10 15:24       ` Mel Gorman
  2012-12-11  1:02         ` Mel Gorman
  1 sibling, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2012-12-10 15:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

On Mon, Dec 10, 2012 at 12:39:45PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Fri, Dec 07, 2012 at 12:01:13PM +0100, Ingo Molnar wrote:
> > > 
> > > * Mel Gorman <mgorman@suse.de> wrote:
> > > 
> > > > This is a full release of all the patches so apologies for the 
> > > > flood. [...]
> > > 
> > > I have yet to process all your mails, but assuming I address all 
> > > your review feedback and the latest unified tree in tip:master 
> > > shows no regression in your testing, would you be willing to 
> > > start using it for ongoing work?
> > > 
> > 
> > Ingo,
> > 
> > If you had read the second paragraph of the mail you just responded to or
> > the results at the end then you would have seen that I had problems with
> > the performance. [...]
> 
> I've posted a (NUMA-placement sensitive workload centric) 
> performance comparisons between "balancenuma", AutoNUMA and 
> numa/core unified-v3 to:
> 
>    https://lkml.org/lkml/2012/12/7/331
> 
> I tried to address all performance regressions you and others 
> have reported.
> 

I've responded to this now. I acknowledge that balancenuma does not do
great on them. I've also explained that it's very likely because I did
not hook into the scheduler and I'm reluctant to do so. Once I do that,
we're directly colliding when my intention was to handle all the necessary
MM changes, the bare minimum of the scheduler hook and maintain that side
while numacore and all the additional scheduler changes were built on top.

> <SNIP>
> I also tried to reproduce and fix as many bugs you reported as 
> possible - but my point is that it would be _much_ better if we 
> actually joined forces.
> 

Which is what balancenuma was meant to do and what I wanted weeks ago
-- I wanted to keep a handle on the mm side of things and establish a
performance baseline for just the mm side that numacore could be compared
against.  I'd then help maintain the result, review patches particularly
affecting mm etc.  I was hoping that numacore would be rebased to carry
the necessary scheduler changes but that didn't happen. The unified tree
is not equivalent. Just off-hand

1. there is no performance comparison possible with just the mm changes
2. the vmstat fault accounting is broken in the unified tree
3. the code to allow balancenuma to be disabled from command line
   was removed which the THP experience has told us is very useful
4. The THP patch was wedged in as hard as possible making it effectively
   impossible to treat in isolation
5. ptes are treated as effective hugepage faults which potentially
   results in remote->remote copies if tasks share data on a
   PMD-boundary even if they do not share data on the page boundary.
   For this reason I dislike it quite a bit
6. the migrate rate-limiting code was removed

To be fair, the last one is a difference in opinion. I think migrate
rate-limiting is important because I think it's more important for the
workload to run than for the kernel to get too much in the way thinking
it can do better.

Some of the other changes just made no sense to me and I still fail to
see why you didn't rebase numacore a few weeks ago and instead smacked the
trees together. If it had been a plain rebase then I would have switched
to looking at just numacore on top without having to worry if something
unexpected was broken on the MM side. If something had broken on the MM
side, I'd be on it without wondering if it was due to how the trees were
merged.

For example, I think that point 5 above is the potential source of the
corruption because you're not flushing the TLBs for the PTEs you are
updating in batch. Granted, you're relaxing rather than restricting access
so it should be ok and at worst cause a spurious fault but I also find
it suspicious that you do not recheck pte_same under the PTL when doing
the final PTE update. I also find it strange that you hold the PTL while
calling task_numa_fault(). No way should the PTL have to protect structures
in kernel/sched and I wonder if that was actually part of the reason why
you saw heavy PTL contention.

Basically if I felt that handling ptes in batch like this was of
critical importance I would have implemented it very differently on top of
balancenuma. I would have only taken the PTL if updating the PTE to
keep contention down and redone racy checks under the PTL, I'd have only
used trylock for every non-faulted PTE and I would only have migrated if it
was a remote->local copy. I certainly would not hold the PTL while calling
task_numa_fault(). I would have kept the handling on a per-pmd basis when
it was expected that most PTEs underneath should be on the same node.
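
The trylock-then-recheck pattern described above can be sketched in
userspace as follows. This is only an illustration of the locking
discipline, not mm code: a pthread mutex stands in for the PTL, a plain
integer stands in for the PTE, and struct slot and try_update() are
hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for one page table entry and its lock. */
struct slot {
	pthread_mutex_t lock;	/* plays the role of the PTL */
	unsigned long val;	/* plays the role of the PTE value */
};

/*
 * Change s->val from 'seen' to 'new_val' only if the lock can be taken
 * without waiting and the value is still what was sampled locklessly.
 * Returns true if the update happened.
 */
static bool try_update(struct slot *s, unsigned long seen,
		       unsigned long new_val)
{
	bool done = false;

	assert(s);

	/* Contended: skip this entry rather than wait on the lock. */
	if (pthread_mutex_trylock(&s->lock) != 0)
		return false;

	/* Redo the racy check under the lock, analogous to pte_same(). */
	if (s->val == seen) {
		s->val = new_val;
		done = true;
	}

	pthread_mutex_unlock(&s->lock);
	return done;
}
```

Nothing outside the slot is touched while the lock is held, which is the
point being made about not calling into kernel/sched under the PTL.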

> > [...] You would also know that tip/master testing for the last 
> > week was failing due to a boot problem (issue was in mainline 
> > not tip and has been already fixed) and would have known that 
> > since the -v18 release that numacore was effectively disabled 
> > on my test machine.
> 
> I'm glad it's fixed.
> 

Agreed.

> > Clearly you are not reading the bug reports you are receiving 
> > and you're not seeing the small bit of review feedback or 
> > answering the review questions you have received either. Why 
> > would I be more forthcoming when I feel that it'll simply be 
> > ignored? [...]
> 
> I am reading the bug reports and addressing bugs as I can.
> 
> > [...]  You simply assume that each batch of patches you place 
> > on top must be fixing all known regressions and ignoring any 
> > evidence to the contrary.
> >
> > If you had read my mail from last Tuesday you would even know 
> > which patch was causing the problem that effectively disabled 
> > numacore although not why. The comment about p->numa_faults 
> > was completely off the mark (long journey, was tired, assumed 
> > numa_faults was a counter and not a pointer which was 
> > careless).  If you had called me on it then I would have 
> > spotted the actual problem sooner. The problem was indeed with 
> > the nr_cpus_allowed == num_online_cpus()s check which I had 
> > pointed out was a suspicious check although for different 
> > reasons. As it turns out, a printk() bodge showed that 
> > nr_cpus_allowed == 80 set in sched_init_smp() while 
> > num_online_cpus() == 48. This effectively disabling numacore. 
> > If you had responded to the bug report, this would likely have 
> > been found last Wednesday.
> 
> Does changing it from num_online_cpus() to num_possible_cpus() 
> help? (Can send a patch if you want.)
> 

I'll check. The patch would be trivial.
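
For reference, a rough userspace model of why the check misfires. This
is not the kernel code: numa_candidate() and its parameters are made up
for illustration, and treating num_possible_cpus() as 80 on this machine
is an assumption based on the value set in sched_init_smp().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the NUMA_EXCLUDE_AFFINE test: a task is only considered for
 * NUMA placement if its allowed-CPU count matches the reference count.
 */
static bool numa_candidate(int nr_cpus_allowed, int reference_cpus)
{
	assert(reference_cpus > 0);

	/* Any mismatch is read as "hard-bound", so the task is skipped. */
	return nr_cpus_allowed == reference_cpus;
}
```

With the printk() values, numa_candidate(80, 48) is false for every
task, which is how numacore ended up disabled. numa_candidate(80, 80)
is true, which is what the num_possible_cpus() variant would give if
the assumption above holds.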

> > > It would make it much easier for me to pick up your 
> > > enhancements, fixes, etc.
> > > 
> > > > Changelog since V9
> > > >   o Migration scalability                                             (mingo)
> > > 
> > > To *really* see migration scalability bottlenecks you need to 
> > > remove the migration-bandwidth throttling kludge from your tree 
> > > (or configure it up very high if you want to do it simple).
> > > 
> > 
> > Why is it a kludge? I already explained what the rational 
> > behind the rate limiting was. It's not about scalability, it's 
> > about mitigating worse-case behaviour and the amount of time 
> > the kernel spends moving data around which a deliberately 
> > adverse workload can trigger.  It is unacceptable if during a 
> > phase change that a process would stall potentially for 
> > milliseconds (seconds if the node is large enough I guess) 
> > while the data is being migrated. Here is it again -- 
> > http://www.spinics.net/lists/linux-mm/msg47440.html . You 
> > either ignored the mail or simply could not be bothered 
> > explaining why you thought this was the incorrect decision or 
> > why the concerns about an adverse workload were unimportant.
> 
> I think the stalls could have been at least in part due to the 
> scalability bottlenecks that the rate-limiting code has hidden.
> 

In part yes, but the actual data copying will stall as well. If a node
is 16G and all the data has to migrate from one node to another, it could
take up to 2 seconds even if there is no other contention. This is assuming
roughly 8G/sec transfer speeds, which I know is a bit on the low end, and it
can vary a lot.
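
The estimate is just node size divided by transfer speed. As a
back-of-the-envelope sketch (migration_stall_seconds() is a made-up
helper for illustration, not kernel code, and it ignores every cost
other than the raw copy):

```c
#include <assert.h>

/*
 * Lower bound on the stall, in seconds, when node_gb gigabytes must be
 * copied between nodes at gb_per_sec gigabytes per second.
 */
static double migration_stall_seconds(double node_gb, double gb_per_sec)
{
	assert(gb_per_sec > 0.0);

	return node_gb / gb_per_sec;
}
```

migration_stall_seconds(16.0, 8.0) gives the 2 seconds quoted above; a
faster interconnect shrinks the figure proportionally but does not make
it go away.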

> If you think of the NUMA migration as a natural part of the 
> workload, as a sort of extended cache-miss, and if you assume 
> that the scheduler is intelligent about not flip-flopping tasks 
> between nodes (which the latest code certainly is), then I don't 
> see why the rate of migration should be rate-limited in the VM.
> 

That's just it. I don't view the NUMA migration as a natural part of
the workload. I treat it as a cost that is optionally paid to get local
memory access and that the cost of the move must be offset. I think care
should be taken to minimise the amount of data that is transferred and
the system CPU cost of working out when to migrate should be as low as
possible and my reports have emphasised this.

To some extent I consider THP to have similar restrictions. THP is useless
if the cost of THP allocation is not offset by performance gains due to
reduced TLB misses. I think it's preferable to fail a THP allocation than
spend a lot of time reclaiming pages and compacting memory to satisfy THP.
Reclaim/compaction is meant to give up very quickly.

> Note that I tried to quantify this effect: the perf bench numa 
> testcases start from a practical 'worst-case adverse' workload 
> in essence: all pages concentrated on the wrong node, and the 
> workload having to migrate all of them over.
> 
> We could add a new 'absolutely worst case' testcase, to make it 
> behaves sanely?
> 

I don't think it'll tell us anything new. Without rate limiting the process
will stall while the transfer takes place. The duration of the stall will
be related to inter-node bandwidth.

> > I have a vague suspicion actually that when you are modelling 
> > the task->data relationship that you make an implicit 
> > assumption that moving data has zero or near-zero cost. In 
> > such a model it would always make sense to move quickly and 
> > immediately but in practice the cost of moving can exceed the 
> > performance benefit of accessing local data and lead to 
> > regressions. It becomes more pronounced if the nodes are not 
> > fully connected.
> 
> I make no such assumption - convergence costs were part of my 
> measurements.
> 

Then you must expect that squashing all that cost into the smallest period
of time will result in stalls. It's a much higher cost than cache-line
misses when a process changes to running on a new CPU for example.

> > > Some (certainly not all) of the performance regressions you 
> > > reported were certainly due to numa/core code hitting the 
> > > migration codepaths as aggressively as the workload demanded 
> > > - and hitting scalability bottlenecks.
> > 
> > How are you so certain? [...]
> 
> Hm, I don't think my "some (certainly not all)" statement 
> reflected any sort of certainty. So we violently agree about:
> 

"regressions you reported were certainly due to numa/core code hitting
the migration codepaths" is what led me to believe that you were very sure
about where the source of the regression was.

> > [...] How do you not know it's because your code is migrating 
> > excessively for no good reason because the algorithm has a 
> > flaw in it? [...]
> 
> That's another source - but again not something we should fix by 
> hiding it under the carpet via migration bandwidth rate limits, 
> right?
> 

I would agree if the point of the migration rate-limiting was
to avoid contention. It's not. It's to prevent the kernel getting in the
way of a workload doing work for long periods of time. As balancenuma is
also dumb as rocks with respect to the scheduler it was also aimed at
mitigating problems related to tasks bouncing around if a particular node
was over-subscribed.

> > [...] Or that the cost of excessive migration is not being 
> > offset by local data accesses? [...]
> 
> That's another possibility.
> 
> The _real_ fix is to avoid excessive migration on the CPU and 
> memory placement side, not to throttle the basic mechanism 
> itself!
> 
> I don't exclude the possibility that bandwidth limits might be 
> needed - but only if everything else fails. Meanwhile, the 
> bandwidth limits were actively hiding scalability bottlenecks, 
> which bottlenecks only trigger at higher migration rates.
> 

The bottleneck is visible with or without the migration rate limiting.
If it wasn't then the patches would have made no difference between
balancenuma v9 and v10 but they did make a difference.

> > [...] The critical point to note is that if it really was only 
> > scalability problems then autonuma would suffer the same 
> > problems and would be impossible to autonumas performance to 
> > exceed numacores. This isn't the case making it unlikely the 
> > scalability is your only problem.
> 
> The scheduling patterns are different - so they can hit 
> different bottlenecks.
> 

Ok, that is fair enough.

> > Either way, last night I applied a patch on top of latest 
> > tip/master to remove the nr_cpus_allowed check so that 
> > numacore would be enabled again and tested that. In some 
> > places it has indeed much improved. In others it is still 
> > regressing badly and in two case, it's corrupting memory -- 
> > specjbb when THP is enabled crashes when running for single or 
> > multiple JVMs. It is likely that a zero page is being inserted 
> > due to a race with migration and causes the JVM to throw a 
> > null pointer exception. Here is the comparison on the rough 
> > off-chance you actually read it this time.
> 
> Can you still see the JVM crash with the unified -v3 tree?
> 

The crash was based on tip/master from yesterday. Does that not include
the unified -v3 tree?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (49 preceding siblings ...)
  2012-12-07 11:01 ` [PATCH 00/49] Automatic NUMA Balancing v10 Ingo Molnar
@ 2012-12-10 16:42 ` Srikar Dronamraju
  2012-12-10 19:23   ` Ingo Molnar
  2012-12-10 23:40   ` Srikar Dronamraju
  2012-12-13 13:21 ` Srikar Dronamraju
  51 siblings, 2 replies; 80+ messages in thread
From: Srikar Dronamraju @ 2012-12-10 16:42 UTC (permalink / raw)
  To: Mel Gorman, Ingo Molnar
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Aneesh Kumar,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML

Hi Mel, Ingo, 

Here are the results of running autonuma-benchmark on a 64-core, 8-node
machine. It has six 32 GB nodes and two 64 GB nodes.


KernelVersion: 3.7.0-rc8
                        Testcase:      Min      Max      Avg
                          numa01:  1475.37  1615.39  1555.24
                numa01_HARD_BIND:   900.42  1244.00   993.30
             numa01_INVERSE_BIND:  2835.44  5067.22  3634.86
             numa01_THREAD_ALLOC:   918.51  1384.21  1121.17
   numa01_THREAD_ALLOC_HARD_BIND:   599.58  1178.26   792.73
numa01_THREAD_ALLOC_INVERSE_BIND:  1841.33  2237.34  1988.95
                          numa02:   126.95   188.31   147.04
                numa02_HARD_BIND:    26.05    29.17    26.94
             numa02_INVERSE_BIND:   341.10   369.37   349.10
                      numa02_SMT:   144.32   922.65   386.43
            numa02_SMT_HARD_BIND:    26.61   170.71   101.98
         numa02_SMT_INVERSE_BIND:   288.12   456.45   325.26

KernelVersion: 3.7.0-rc8-tip_master+(December 7th Snapshot)
                        Testcase:      Min      Max      Avg  %Change
                          numa01:  2927.89  3217.56  3103.21  -49.88%
                numa01_HARD_BIND:  2653.09  5964.23  3431.35  -71.05%
             numa01_INVERSE_BIND:  3567.03  3933.18  3811.91   -4.64%
             numa01_THREAD_ALLOC:  1801.80  2339.16  1980.96  -43.40%
   numa01_THREAD_ALLOC_HARD_BIND:  1705.84  2110.06  1913.64  -58.57%
numa01_THREAD_ALLOC_INVERSE_BIND:  2266.12  2540.61  2376.67  -16.31%
                          numa02:   179.26   358.03   264.19  -44.34%
                numa02_HARD_BIND:    26.07    29.38    27.70   -2.74%
             numa02_INVERSE_BIND:   337.99   347.95   343.51    1.63%
                      numa02_SMT:    93.65   402.58   213.15   81.29%
            numa02_SMT_HARD_BIND:    91.19   140.47   116.26  -12.28%
         numa02_SMT_INVERSE_BIND:   289.03   299.57   297.01    9.51%

KernelVersion: 3.7.0-rc6-mel_auto_balance(mm-balancenuma-v10r3)
                        Testcase:      Min      Max      Avg  %Change
                          numa01:  1536.93  1819.85  1694.54   -8.22%
                numa01_HARD_BIND:   909.67  1145.32  1055.57   -5.90%
             numa01_INVERSE_BIND:  2882.07  3287.24  2976.89   22.10%
             numa01_THREAD_ALLOC:   995.79  4845.27  1905.85  -41.17%
   numa01_THREAD_ALLOC_HARD_BIND:   582.36   818.11   655.18   20.99%
numa01_THREAD_ALLOC_INVERSE_BIND:  1790.91  1927.90  1868.49    6.45%
                          numa02:   131.53   287.93   209.15  -29.70%
                numa02_HARD_BIND:    25.68    31.90    27.66   -2.60%
             numa02_INVERSE_BIND:   341.09   401.37   353.84   -1.34%
                      numa02_SMT:   156.61  2036.63   731.97  -47.21%
            numa02_SMT_HARD_BIND:    25.10   196.60    79.72   27.92%
         numa02_SMT_INVERSE_BIND:   294.22  1801.59   824.41  -60.55%

KernelVersion: 3.7.0-rc6-autonuma+(mm-autonuma-v28fastr4-mels-rebase)
                        Testcase:      Min      Max      Avg  %Change
                          numa01:  1596.13  1715.34  1649.44   -5.71%
                numa01_HARD_BIND:   920.75  1127.86  1012.50   -1.90%
             numa01_INVERSE_BIND:  2858.79  3146.74  2977.16   22.09%
             numa01_THREAD_ALLOC:   250.55   374.27   290.12  286.45%
   numa01_THREAD_ALLOC_HARD_BIND:   572.29   712.74   630.62   25.71%
numa01_THREAD_ALLOC_INVERSE_BIND:  1835.94  2401.04  2011.20   -1.11%
                          numa02:    33.93   104.80    50.99  188.37%
                numa02_HARD_BIND:    25.94    27.51    26.42    1.97%
             numa02_INVERSE_BIND:   334.57   349.51   341.23    2.31%
                      numa02_SMT:    43.72   114.82    62.41  519.18%
            numa02_SMT_HARD_BIND:    34.98    45.61    42.07  142.41%
         numa02_SMT_INVERSE_BIND:   284.57   310.62   298.51    8.96%

Avg refers to the mean of 5 iterations of autonuma-benchmark.
%Change refers to the percentage change from 3.7.0-rc8.
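
As a cross-check on the %Change column, here is a minimal Python sketch.
It assumes the column is computed as (baseline Avg / kernel Avg - 1) * 100.
That formula is inferred from the posted numbers and is not stated in this
mail; under that reading, a positive value means the kernel finished faster
than 3.7.0-rc8. The function name pct_change is hypothetical.

```python
def pct_change(baseline_avg: float, kernel_avg: float) -> float:
    """Percentage change relative to the baseline run time.

    Assumed formula (inferred from the tables, not stated in the mail):
    (baseline / kernel - 1) * 100, so a lower run time gives a positive value.
    """
    return (baseline_avg / kernel_avg - 1.0) * 100.0


# Avg values copied from the tables above: (label, 3.7.0-rc8 Avg, other kernel Avg)
samples = [
    ("numa01 on tip_master", 1555.24, 3103.21),
    ("numa02 on autonuma", 147.04, 50.99),
    ("numa02_SMT on autonuma", 386.43, 62.41),
]

for name, base, other in samples:
    print(f"{name}: {pct_change(base, other):.2f}%")
```

With these inputs the sketch reproduces the posted -49.88%, 188.37% and
519.18% entries to two decimal places, which is the basis for the assumed
formula.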

Please do let me know if you have questions/suggestions.

-- 
Thanks and Regards
Srikar



* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-10 16:42 ` Srikar Dronamraju
@ 2012-12-10 19:23   ` Ingo Molnar
  2012-12-10 23:35     ` Srikar Dronamraju
  2012-12-10 23:40   ` Srikar Dronamraju
  1 sibling, 1 reply; 80+ messages in thread
From: Ingo Molnar @ 2012-12-10 19:23 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Peter Zijlstra, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> KernelVersion: 3.7.0-rc8-tip_master+(December 7th Snapshot)

> Please do let me know if you have questions/suggestions.

Do you still have the exact sha1 by any chance?

By the date of the snapshot I'd say that this fix:

  f0c77b62ba9d sched: Fix NUMA_EXCLUDE_AFFINE check

could improve performance on your box.

Thanks,

	Ingo


* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-10 19:23   ` Ingo Molnar
@ 2012-12-10 23:35     ` Srikar Dronamraju
  0 siblings, 0 replies; 80+ messages in thread
From: Srikar Dronamraju @ 2012-12-10 23:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Peter Zijlstra, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

* Ingo Molnar <mingo@kernel.org> [2012-12-10 20:23:32]:

> 
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> 
> > KernelVersion: 3.7.0-rc8-tip_master+(December 7th Snapshot)
> 
> > Please do let me know if you have questions/suggestions.
> 
> Do you still have the exact sha1 by any chance?
> 

commit ea8432f29a702cf5a4bf9d91bf4542f9fb190529
Merge: bca2293 18a2f37
Author: Ingo Molnar <mingo@kernel.org>
Date:   Fri Dec 7 10:46:05 2012 +0100

    Merge branch 'linus'



git log --oneline shows something like this.

ea8432f Merge branch 'linus'
bca2293 Merge branch 'x86/nuke386'
11a4441 Merge branch 'x86/cleanups'
b8ae5b0 Merge branch 'x86/bsp-hotplug'
232e4c0 Merge branch 'timers/core'
24a0668 Merge branch 'core/urgent'
9ee046a Merge branch 'core/rcu'
f1ab78f Merge branch 'core/locking'
2e44b38 Merge branch 'numa/base'
b12fe81 numa, sched: Streamline and fix numa_allow_migration() use
ef88e22 numa, sched: Improve directed convergence
2948b6d numa, sched: Improve staggered convergence
540431e numa, mm: Fix !THP, 4K-pte "2M-emu" NUMA fault handling
6de1a2e numa, mm, sched: Fix NUMA affinity tracking logic
ff2a9f9 numa, mm, sched: Implement last-CPU+PID hash tracking
490a116 numa, sched: Implement wake-cpu migration support
41ea712 numa, sched: Add tracking of runnable NUMA tasks
78fb84e numa, sched: Fix NUMA tick ->numa_shared setting
18a2f37 tmpfs: fix shared mempolicy leak
c702418 mm: vmscan: do not keep kswapd looping forever due to individual uncompactable zones
60177d3 mm: compaction: validate pfn range passed to isolate_freepages_block



> By the date of the snapshot I'd say that this fix:
> 
>   f0c77b62ba9d sched: Fix NUMA_EXCLUDE_AFFINE check
> 
> could improve performance on your box.
> 

Yeah, I don't see that commit; I will add it and test.
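
A minimal Python sketch of the check described above: scanning a pasted
`git log --oneline` listing for a given commit. The helper name
log_contains is hypothetical, and the listing in this mail is only an
excerpt, so a miss here is a hint rather than proof. In an actual checkout,
`git merge-base --is-ancestor <commit> HEAD` is the more reliable test.

```python
def log_contains(oneline_log: str, commit: str) -> bool:
    """Return True if some hash in the log matches `commit` by prefix."""
    for line in oneline_log.splitlines():
        parts = line.split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        short = parts[0]
        # Abbreviated and longer hashes match when one is a prefix of the other.
        if commit.startswith(short) or short.startswith(commit):
            return True
    return False


# A few entries copied from the listing above.
pasted_log = """\
ea8432f Merge branch 'linus'
2e44b38 Merge branch 'numa/base'
b12fe81 numa, sched: Streamline and fix numa_allow_migration() use
18a2f37 tmpfs: fix shared mempolicy leak
"""

print(log_contains(pasted_log, "f0c77b62ba9d"))  # the fix suggested by Ingo
print(log_contains(pasted_log, "b12fe81"))       # a commit that is listed
```

The prefix comparison is used because the log shows 7-character
abbreviations while the fix was quoted with a 12-character one.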


Also here is the .config for tip_master
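
For anyone comparing this configuration against another test kernel, a
small Python sketch that extracts a few NUMA-related options from .config
text may help. The parser name parse_kconfig and the choice of options are
assumptions for illustration; the sample lines are copied from the .config
that follows.

```python
import re


def parse_kconfig(text: str) -> dict:
    """Map option names to values; '# CONFIG_X is not set' becomes 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        unset = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if unset:
            options[unset.group(1)] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value.strip('"')  # drop quotes around strings
    return options


# Lines copied from the .config in this mail.
sample = """\
CONFIG_NUMA_BALANCING=y
CONFIG_TRANSPARENT_HUGEPAGE=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_NR_CPUS=512
"""

opts = parse_kconfig(sample)
for name in ("CONFIG_NUMA_BALANCING", "CONFIG_TRANSPARENT_HUGEPAGE",
             "CONFIG_NUMA_EMU", "CONFIG_NODES_SHIFT", "CONFIG_NR_CPUS"):
    print(name, opts.get(name, "absent"))
```

Reading the full file into parse_kconfig instead of the sample string would
give the same mapping for every option.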

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 3.7.0-rc8 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_GPIO=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CPU_AUTOPROBE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_CPU_PROBE_RELEASE=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_HAVE_IRQ_WORK=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
CONFIG_LOCALVERSION="-tip_master"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_FHANDLE is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y
# CONFIG_AUDIT_LOGINUID_IMMUTABLE is not set
CONFIG_HAVE_GENERIC_HARDIRQS=y

#
# IRQ subsystem
#
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_RCU_USER_QS is not set
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_RCU_NOCB_CPU is not set
CONFIG_IKCONFIG=m
# CONFIG_IKCONFIG_PROC is not set
CONFIG_LOG_BUF_SHIFT=19
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_PROT_NUMA_PROT_NONE=y
CONFIG_ARCH_USES_NUMA_PROT_NONE=y
CONFIG_NUMA_BALANCING=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
# CONFIG_MEMCG is not set
# CONFIG_CGROUP_HUGETLB is not set
# CONFIG_CGROUP_PERF is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
CONFIG_RT_GROUP_SCHED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EXPERT is not set
CONFIG_HAVE_UID16=y
CONFIG_UID16=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_KALLSYMS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
# CONFIG_COMPAT_BRK is not set
CONFIG_SLAB=y
# CONFIG_SLUB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
# CONFIG_OPROFILE is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
# CONFIG_JUMP_LABEL is not set
CONFIG_OPTPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_GENERIC_KERNEL_THREAD=y
CONFIG_GENERIC_KERNEL_EXECVE=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_HAVE_CONTEXT_TRACKING=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_MODULES_USE_ELF_RELA=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
# CONFIG_MODULE_SIG is not set
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_INTEGRITY=y
CONFIG_BLK_DEV_THROTTLING=y

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
CONFIG_INLINE_READ_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK_IRQ=y
CONFIG_INLINE_WRITE_UNLOCK=y
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_ZONE_DMA=y
CONFIG_SMP=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_VSMP is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_KVMTOOL_TEST_ENABLE is not set
CONFIG_PARAVIRT_GUEST=y
# CONFIG_PARAVIRT_TIME_ACCOUNTING is not set
CONFIG_XEN=y
CONFIG_XEN_DOM0=y
CONFIG_XEN_PRIVILEGED_GUEST=y
CONFIG_XEN_PVHVM=y
CONFIG_XEN_MAX_DOMAIN_MEMORY=500
CONFIG_XEN_SAVE_RESTORE=y
CONFIG_XEN_DEBUG_FS=y
CONFIG_KVM_GUEST=y
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_PARAVIRT_CLOCK=y
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
# CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_NR_CPUS=512
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_X86_THERMAL_VECTOR=y
# CONFIG_I8K is not set
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_MEMORY_PROBE=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_ARCH_DISCARD_MEMBLOCK=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_SPARSE=y
# CONFIG_MEMORY_HOTREMOVE is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_CLEANCACHE is not set
# CONFIG_FRONTSWAP is not set
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=1
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
CONFIG_EFI=y
# CONFIG_EFI_STUB is not set
# CONFIG_SECCOMP is not set
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_BOOTPARAM_HOTPLUG_CPU0 is not set
# CONFIG_DEBUG_HOTPLUG_CPU0 is not set
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM_RUNTIME=y
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
# CONFIG_ACPI_INITRD_TABLE_OVERRIDE is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_PCI_SLOT=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_HOTPLUG_MEMORY=y
# CONFIG_ACPI_SBS is not set
# CONFIG_ACPI_HED is not set
# CONFIG_ACPI_CUSTOM_METHOD is not set
# CONFIG_ACPI_BGRT is not set
CONFIG_ACPI_APEI=y
# CONFIG_ACPI_APEI_GHES is not set
# CONFIG_ACPI_APEI_PCIEAER is not set
# CONFIG_ACPI_APEI_MEMORY_FAILURE is not set
# CONFIG_ACPI_APEI_EINJ is not set
# CONFIG_ACPI_APEI_ERST_DEBUG is not set
CONFIG_SFI=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set

#
# x86 CPU frequency scaling drivers
#
# CONFIG_X86_PCC_CPUFREQ is not set
# CONFIG_X86_ACPI_CPUFREQ is not set
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_SPEEDSTEP_LIB is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
CONFIG_INTEL_IDLE=y

#
# Memory power savings
#
# CONFIG_I7300_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_XEN=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEAER=y
CONFIG_PCIE_ECRC=y
# CONFIG_PCIEAER_INJECT is not set
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
CONFIG_PCI_STUB=y
CONFIG_XEN_PCIDEV_FRONTEND=y
CONFIG_HT_IRQ=y
CONFIG_PCI_ATS=y
CONFIG_PCI_IOV=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
# CONFIG_PCI_IOAPIC is not set
CONFIG_PCI_LABEL=y
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
CONFIG_PCCARD=y
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
# CONFIG_YENTA is not set
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_ACPI=y
# CONFIG_HOTPLUG_PCI_ACPI_IBM is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set
# CONFIG_RAPIDIO is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_COREDUMP=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
# CONFIG_X86_X32 is not set
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_KEYS_COMPAT=y
CONFIG_HAVE_TEXT_POKE_SMP=y
CONFIG_X86_DEV_DMA_OPS=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
# CONFIG_UNIX_DIAG is not set
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
CONFIG_XFRM_STATISTICS=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
# CONFIG_IP_FIB_TRIE_STATS is not set
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
CONFIG_IP_MROUTE=y
# CONFIG_IP_MROUTE_MULTIPLE_TABLES is not set
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
CONFIG_INET_LRO=y
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_IPV6_OPTIMISTIC_DAD=y
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET6_XFRM_MODE_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_BEET is not set
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
# CONFIG_IPV6_SIT is not set
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_GRE is not set
CONFIG_IPV6_MULTIPLE_TABLES=y
# CONFIG_IPV6_SUBTREES is not set
CONFIG_IPV6_MROUTE=y
# CONFIG_IPV6_MROUTE_MULTIPLE_TABLES is not set
CONFIG_IPV6_PIMSM_V2=y
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_NETFILTER_ADVANCED=y

#
# Core Netfilter Configuration
#
# CONFIG_NETFILTER_NETLINK_ACCT is not set
# CONFIG_NETFILTER_NETLINK_QUEUE is not set
# CONFIG_NETFILTER_NETLINK_LOG is not set
# CONFIG_NF_CONNTRACK is not set
CONFIG_NETFILTER_XTABLES=y

#
# Xtables combined modules
#
# CONFIG_NETFILTER_XT_MARK is not set

#
# Xtables targets
#
# CONFIG_NETFILTER_XT_TARGET_AUDIT is not set
# CONFIG_NETFILTER_XT_TARGET_CLASSIFY is not set
# CONFIG_NETFILTER_XT_TARGET_HMARK is not set
# CONFIG_NETFILTER_XT_TARGET_IDLETIMER is not set
# CONFIG_NETFILTER_XT_TARGET_LED is not set
# CONFIG_NETFILTER_XT_TARGET_LOG is not set
# CONFIG_NETFILTER_XT_TARGET_MARK is not set
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set
# CONFIG_NETFILTER_XT_TARGET_RATEEST is not set
# CONFIG_NETFILTER_XT_TARGET_TEE is not set
# CONFIG_NETFILTER_XT_TARGET_SECMARK is not set
# CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set

#
# Xtables matches
#
# CONFIG_NETFILTER_XT_MATCH_ADDRTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
# CONFIG_NETFILTER_XT_MATCH_CPU is not set
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DEVGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ECN is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_HL is not set
# CONFIG_NETFILTER_XT_MATCH_IPRANGE is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
# CONFIG_NETFILTER_XT_MATCH_LIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_MAC is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
# CONFIG_NETFILTER_XT_MATCH_MULTIPORT is not set
# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set
# CONFIG_NETFILTER_XT_MATCH_OWNER is not set
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
# CONFIG_NETFILTER_XT_MATCH_PKTTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
# CONFIG_NETFILTER_XT_MATCH_RATEEST is not set
# CONFIG_NETFILTER_XT_MATCH_REALM is not set
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
# CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_TIME is not set
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV4 is not set
# CONFIG_IP_NF_QUEUE is not set
# CONFIG_IP_NF_IPTABLES is not set
# CONFIG_IP_NF_ARPTABLES is not set

#
# IPv6: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV6 is not set
# CONFIG_IP6_NF_IPTABLES is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
# CONFIG_BRIDGE is not set
CONFIG_NET_DSA=y
CONFIG_NET_DSA_TAG_DSA=y
CONFIG_NET_DSA_TAG_EDSA=y
CONFIG_NET_DSA_TAG_TRAILER=y
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_PHONET is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFB is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
# CONFIG_NET_SCH_MQPRIO is not set
# CONFIG_NET_SCH_CHOKE is not set
# CONFIG_NET_SCH_QFQ is not set
# CONFIG_NET_SCH_CODEL is not set
# CONFIG_NET_SCH_FQ_CODEL is not set
# CONFIG_NET_SCH_INGRESS is not set
# CONFIG_NET_SCH_PLUG is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_CLS_CGROUP=y
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
# CONFIG_NET_EMATCH_NBYTE is not set
# CONFIG_NET_EMATCH_U32 is not set
# CONFIG_NET_EMATCH_META is not set
# CONFIG_NET_EMATCH_TEXT is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
# CONFIG_NET_ACT_CSUM is not set
CONFIG_NET_SCH_FIFO=y
CONFIG_DCB=y
# CONFIG_DNS_RESOLVER is not set
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
# CONFIG_NETPRIO_CGROUP is not set
CONFIG_BQL=y
# CONFIG_BPF_JIT is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
CONFIG_NET_DROP_MONITOR=y
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set
# CONFIG_LIB80211 is not set

#
# CFG80211 needs to be enabled for MAC80211
#
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set
# CONFIG_RFKILL_REGULATOR is not set
# CONFIG_NET_9P is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
CONFIG_HAVE_BPF_JIT=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_FIRMWARE_IN_KERNEL is not set
CONFIG_EXTRA_FIRMWARE=""
CONFIG_SYS_HYPERVISOR=y
# CONFIG_GENERIC_CPU_DEVICES is not set
CONFIG_DMA_SHARED_BUFFER=y

#
# Bus devices
#
# CONFIG_OMAP_OCP2SCP is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
CONFIG_MTD=y
# CONFIG_MTD_TESTS is not set
# CONFIG_MTD_REDBOOT_PARTS is not set
CONFIG_MTD_CMDLINE_PARTS=y
# CONFIG_MTD_AR7_PARTS is not set

#
# User Modules And Translation Layers
#
# CONFIG_MTD_CHAR is not set
# CONFIG_MTD_BLKDEVS is not set
# CONFIG_MTD_BLOCK is not set
# CONFIG_MTD_BLOCK_RO is not set
# CONFIG_FTL is not set
# CONFIG_NFTL is not set
# CONFIG_INFTL is not set
# CONFIG_RFD_FTL is not set
# CONFIG_SSFDC is not set
# CONFIG_SM_FTL is not set
# CONFIG_MTD_OOPS is not set
# CONFIG_MTD_SWAP is not set

#
# RAM/ROM/Flash chip drivers
#
# CONFIG_MTD_CFI is not set
# CONFIG_MTD_JEDECPROBE is not set
CONFIG_MTD_MAP_BANK_WIDTH_1=y
CONFIG_MTD_MAP_BANK_WIDTH_2=y
CONFIG_MTD_MAP_BANK_WIDTH_4=y
# CONFIG_MTD_MAP_BANK_WIDTH_8 is not set
# CONFIG_MTD_MAP_BANK_WIDTH_16 is not set
# CONFIG_MTD_MAP_BANK_WIDTH_32 is not set
CONFIG_MTD_CFI_I1=y
CONFIG_MTD_CFI_I2=y
# CONFIG_MTD_CFI_I4 is not set
# CONFIG_MTD_CFI_I8 is not set
# CONFIG_MTD_RAM is not set
# CONFIG_MTD_ROM is not set
# CONFIG_MTD_ABSENT is not set

#
# Mapping drivers for chip access
#
CONFIG_MTD_COMPLEX_MAPPINGS=y
# CONFIG_MTD_TS5500 is not set
# CONFIG_MTD_PCI is not set
# CONFIG_MTD_PCMCIA is not set
# CONFIG_MTD_GPIO_ADDR is not set
# CONFIG_MTD_INTEL_VR_NOR is not set
# CONFIG_MTD_PLATRAM is not set
# CONFIG_MTD_LATCH_ADDR is not set

#
# Self-contained MTD device drivers
#
# CONFIG_MTD_PMC551 is not set
# CONFIG_MTD_SLRAM is not set
# CONFIG_MTD_PHRAM is not set
# CONFIG_MTD_MTDRAM is not set
# CONFIG_MTD_BLOCK2MTD is not set

#
# Disk-On-Chip Device Drivers
#
# CONFIG_MTD_DOCG3 is not set
# CONFIG_MTD_NAND is not set
# CONFIG_MTD_ONENAND is not set

#
# LPDDR flash memory drivers
#
# CONFIG_MTD_LPDDR is not set
# CONFIG_MTD_UBI is not set
# CONFIG_PARPORT is not set
CONFIG_PNP=y
# CONFIG_PNP_DEBUG_MESSAGES is not set

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_NVME is not set
# CONFIG_BLK_DEV_SX8 is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_XEN_BLKDEV_FRONTEND is not set
# CONFIG_XEN_BLKDEV_BACKEND is not set
# CONFIG_BLK_DEV_HD is not set
# CONFIG_BLK_DEV_RBD is not set

#
# Misc devices
#
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_AD525X_DPOT is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_INTEL_MID_PTI is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
CONFIG_ENCLOSURE_SERVICES=m
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1780 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_VMWARE_BALLOON is not set
# CONFIG_BMP085_I2C is not set
# CONFIG_PCH_PHUB is not set
# CONFIG_USB_SWITCH_FSA9480 is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# CONFIG_TI_ST is not set
# CONFIG_SENSORS_LIS3_I2C is not set

#
# Altera FPGA firmware download module
#
# CONFIG_ALTERA_STAPL is not set
# CONFIG_INTEL_MEI is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=m
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=m
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_ENCLOSURE=m
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_FC_TGT_ATTRS=y
# CONFIG_SCSI_ISCSI_ATTRS is not set
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_ATA=y
CONFIG_SCSI_SAS_HOST_SMP=y
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_ISCSI_BOOT_SYSFS is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_SCSI_CXGB4_ISCSI is not set
# CONFIG_SCSI_BNX2_ISCSI is not set
# CONFIG_SCSI_BNX2X_FCOE is not set
# CONFIG_BE2ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_HPSA is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_3W_SAS is not set
# CONFIG_SCSI_ACARD is not set
CONFIG_SCSI_AACRAID=m
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
CONFIG_SCSI_AIC94XX=m
# CONFIG_AIC94XX_DEBUG is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_MVUMI is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
CONFIG_MEGARAID_NEWGEN=y
# CONFIG_MEGARAID_MM is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_MPT2SAS is not set
# CONFIG_SCSI_UFSHCD is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_VMWARE_PVSCSI is not set
# CONFIG_LIBFC is not set
# CONFIG_LIBFCOE is not set
# CONFIG_FCOE is not set
# CONFIG_FCOE_FNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_ISCI is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
CONFIG_SCSI_QLA_FC=m
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_PMCRAID is not set
# CONFIG_SCSI_PM8001 is not set
# CONFIG_SCSI_SRP is not set
# CONFIG_SCSI_BFA_FC is not set
CONFIG_SCSI_LOWLEVEL_PCMCIA=y
# CONFIG_PCMCIA_AHA152X is not set
# CONFIG_PCMCIA_FDOMAIN is not set
# CONFIG_PCMCIA_QLOGIC is not set
# CONFIG_PCMCIA_SYM53C500 is not set
CONFIG_SCSI_DH=y
# CONFIG_SCSI_DH_RDAC is not set
# CONFIG_SCSI_DH_HP_SW is not set
# CONFIG_SCSI_DH_EMC is not set
# CONFIG_SCSI_DH_ALUA is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
# CONFIG_SATA_AHCI is not set
# CONFIG_SATA_AHCI_PLATFORM is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_SATA_ACARD_AHCI is not set
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
# CONFIG_ATA_PIIX is not set
# CONFIG_SATA_HIGHBANK is not set
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARASAN_CF is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5536 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SCH is not set
CONFIG_PATA_SERVERWORKS=m
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_PCMCIA is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
CONFIG_PATA_ACPI=m
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
# CONFIG_MD_LINEAR is not set
# CONFIG_MD_RAID0 is not set
# CONFIG_MD_RAID1 is not set
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
CONFIG_BLK_DEV_DM=m
CONFIG_DM_DEBUG=y
# CONFIG_DM_CRYPT is not set
# CONFIG_DM_SNAPSHOT is not set
# CONFIG_DM_THIN_PROVISIONING is not set
CONFIG_DM_MIRROR=m
# CONFIG_DM_RAID is not set
# CONFIG_DM_LOG_USERSPACE is not set
# CONFIG_DM_ZERO is not set
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
CONFIG_DM_UEVENT=y
# CONFIG_DM_FLAKEY is not set
# CONFIG_DM_VERITY is not set
# CONFIG_TARGET_CORE is not set
CONFIG_FUSION=y
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set
CONFIG_FUSION_MAX_SGE=128
CONFIG_FUSION_LOGGING=y

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
# CONFIG_I2O is not set
CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_NETDEVICES=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
# CONFIG_DUMMY is not set
# CONFIG_EQUALIZER is not set
CONFIG_NET_FC=y
# CONFIG_MII is not set
# CONFIG_IFB is not set
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_VXLAN is not set
CONFIG_NETCONSOLE=m
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_TUN is not set
# CONFIG_VETH is not set
# CONFIG_ARCNET is not set

#
# CAIF transport drivers
#

#
# Distributed Switch Architecture drivers
#
CONFIG_NET_DSA_MV88E6XXX=y
CONFIG_NET_DSA_MV88E6060=y
CONFIG_NET_DSA_MV88E6XXX_NEED_PPU=y
CONFIG_NET_DSA_MV88E6131=y
CONFIG_NET_DSA_MV88E6123_61_65=y
CONFIG_ETHERNET=y
CONFIG_NET_VENDOR_3COM=y
# CONFIG_PCMCIA_3C574 is not set
# CONFIG_PCMCIA_3C589 is not set
# CONFIG_VORTEX is not set
# CONFIG_TYPHOON is not set
CONFIG_NET_VENDOR_ADAPTEC=y
# CONFIG_ADAPTEC_STARFIRE is not set
CONFIG_NET_VENDOR_ALTEON=y
# CONFIG_ACENIC is not set
CONFIG_NET_VENDOR_AMD=y
# CONFIG_AMD8111_ETH is not set
# CONFIG_PCNET32 is not set
# CONFIG_PCMCIA_NMCLAN is not set
CONFIG_NET_VENDOR_ATHEROS=y
# CONFIG_ATL2 is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
CONFIG_NET_VENDOR_BROADCOM=y
# CONFIG_B44 is not set
# CONFIG_BNX2 is not set
# CONFIG_CNIC is not set
CONFIG_TIGON3=m
# CONFIG_BNX2X is not set
CONFIG_NET_VENDOR_BROCADE=y
# CONFIG_BNA is not set
# CONFIG_NET_CALXEDA_XGMAC is not set
CONFIG_NET_VENDOR_CHELSIO=y
# CONFIG_CHELSIO_T1 is not set
# CONFIG_CHELSIO_T3 is not set
# CONFIG_CHELSIO_T4 is not set
# CONFIG_CHELSIO_T4VF is not set
CONFIG_NET_VENDOR_CISCO=y
# CONFIG_ENIC is not set
# CONFIG_DNET is not set
CONFIG_NET_VENDOR_DEC=y
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
# CONFIG_TULIP is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
# CONFIG_PCMCIA_XIRCOM is not set
CONFIG_NET_VENDOR_DLINK=y
# CONFIG_DL2K is not set
# CONFIG_SUNDANCE is not set
CONFIG_NET_VENDOR_EMULEX=y
# CONFIG_BE2NET is not set
CONFIG_NET_VENDOR_EXAR=y
# CONFIG_S2IO is not set
# CONFIG_VXGE is not set
CONFIG_NET_VENDOR_FUJITSU=y
# CONFIG_PCMCIA_FMVJ18X is not set
CONFIG_NET_VENDOR_HP=y
# CONFIG_HP100 is not set
CONFIG_NET_VENDOR_INTEL=y
# CONFIG_E100 is not set
# CONFIG_E1000 is not set
# CONFIG_E1000E is not set
# CONFIG_IGB is not set
# CONFIG_IGBVF is not set
# CONFIG_IXGB is not set
# CONFIG_IXGBE is not set
# CONFIG_IXGBEVF is not set
CONFIG_NET_VENDOR_I825XX=y
# CONFIG_ZNET is not set
# CONFIG_IP1000 is not set
# CONFIG_JME is not set
CONFIG_NET_VENDOR_MARVELL=y
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
CONFIG_NET_VENDOR_MELLANOX=y
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
CONFIG_NET_VENDOR_MICREL=y
# CONFIG_KS8851_MLL is not set
# CONFIG_KSZ884X_PCI is not set
CONFIG_NET_VENDOR_MYRI=y
# CONFIG_MYRI10GE is not set
# CONFIG_FEALNX is not set
CONFIG_NET_VENDOR_NATSEMI=y
# CONFIG_NATSEMI is not set
# CONFIG_NS83820 is not set
CONFIG_NET_VENDOR_8390=y
# CONFIG_PCMCIA_AXNET is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_PCMCIA_PCNET is not set
CONFIG_NET_VENDOR_NVIDIA=y
# CONFIG_FORCEDETH is not set
CONFIG_NET_VENDOR_OKI=y
# CONFIG_PCH_GBE is not set
# CONFIG_ETHOC is not set
CONFIG_NET_PACKET_ENGINE=y
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_NET_VENDOR_QLOGIC=y
# CONFIG_QLA3XXX is not set
# CONFIG_QLCNIC is not set
# CONFIG_QLGE is not set
# CONFIG_NETXEN_NIC is not set
CONFIG_NET_VENDOR_REALTEK=y
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_R8169 is not set
CONFIG_NET_VENDOR_RDC=y
# CONFIG_R6040 is not set
CONFIG_NET_VENDOR_SEEQ=y
# CONFIG_SEEQ8005 is not set
CONFIG_NET_VENDOR_SILAN=y
# CONFIG_SC92031 is not set
CONFIG_NET_VENDOR_SIS=y
# CONFIG_SIS900 is not set
# CONFIG_SIS190 is not set
# CONFIG_SFC is not set
CONFIG_NET_VENDOR_SMSC=y
# CONFIG_PCMCIA_SMC91C92 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC9420 is not set
CONFIG_NET_VENDOR_STMICRO=y
# CONFIG_STMMAC_ETH is not set
CONFIG_NET_VENDOR_SUN=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NIU is not set
CONFIG_NET_VENDOR_TEHUTI=y
# CONFIG_TEHUTI is not set
CONFIG_NET_VENDOR_TI=y
# CONFIG_TLAN is not set
CONFIG_NET_VENDOR_VIA=y
# CONFIG_VIA_RHINE is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
CONFIG_NET_VENDOR_XIRCOM=y
# CONFIG_PCMCIA_XIRC2PS is not set
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
# CONFIG_SKFP is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
# CONFIG_AT803X_PHY is not set
# CONFIG_AMD_PHY is not set
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_BCM87XX_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_MICREL_PHY is not set
CONFIG_FIXED_PHY=y
# CONFIG_MDIO_BITBANG is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_IPHETH is not set
CONFIG_WLAN=y
# CONFIG_PCMCIA_RAYCS is not set
# CONFIG_AIRO is not set
# CONFIG_ATMEL is not set
# CONFIG_AIRO_CS is not set
# CONFIG_PCMCIA_WL3501 is not set
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
# CONFIG_HOSTAP is not set
# CONFIG_WL_TI is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
CONFIG_WAN=y
# CONFIG_HDLC is not set
# CONFIG_DLCI is not set
# CONFIG_SBNI is not set
# CONFIG_XEN_NETDEV_FRONTEND is not set
# CONFIG_XEN_NETDEV_BACKEND is not set
# CONFIG_VMXNET3 is not set
CONFIG_ISDN=y
# CONFIG_ISDN_I4L is not set
# CONFIG_ISDN_CAPI is not set
# CONFIG_ISDN_DRV_GIGASET is not set
# CONFIG_HYSDN is not set
# CONFIG_MISDN is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
# CONFIG_INPUT_POLLDEV is not set
# CONFIG_INPUT_SPARSEKMAP is not set
# CONFIG_INPUT_MATRIXKMAP is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_GPIO is not set
# CONFIG_KEYBOARD_GPIO_POLLED is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_MATRIX is not set
# CONFIG_KEYBOARD_LM8323 is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_OMAP4 is not set
# CONFIG_KEYBOARD_XTKBD is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_PS2_ELANTECH=y
CONFIG_MOUSE_PS2_SENTELIC=y
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_GPIO is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_MOUSE_SYNAPTICS_USB is not set
# CONFIG_INPUT_JOYSTICK is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
# CONFIG_TABLET_USB_AIPTEK is not set
# CONFIG_TABLET_USB_GTCO is not set
# CONFIG_TABLET_USB_HANWANG is not set
# CONFIG_TABLET_USB_KBTAB is not set
# CONFIG_TABLET_USB_WACOM is not set
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_AD7879 is not set
# CONFIG_TOUCHSCREEN_ATMEL_MXT is not set
# CONFIG_TOUCHSCREEN_AUO_PIXCIR is not set
# CONFIG_TOUCHSCREEN_BU21013 is not set
# CONFIG_TOUCHSCREEN_CY8CTMG110 is not set
# CONFIG_TOUCHSCREEN_CYTTSP_CORE is not set
# CONFIG_TOUCHSCREEN_DYNAPRO is not set
# CONFIG_TOUCHSCREEN_HAMPSHIRE is not set
# CONFIG_TOUCHSCREEN_EETI is not set
# CONFIG_TOUCHSCREEN_FUJITSU is not set
# CONFIG_TOUCHSCREEN_ILI210X is not set
# CONFIG_TOUCHSCREEN_GUNZE is not set
# CONFIG_TOUCHSCREEN_ELO is not set
# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set
# CONFIG_TOUCHSCREEN_WACOM_I2C is not set
# CONFIG_TOUCHSCREEN_MAX11801 is not set
# CONFIG_TOUCHSCREEN_MCS5000 is not set
# CONFIG_TOUCHSCREEN_MMS114 is not set
# CONFIG_TOUCHSCREEN_MTOUCH is not set
# CONFIG_TOUCHSCREEN_INEXIO is not set
# CONFIG_TOUCHSCREEN_MK712 is not set
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_EDT_FT5X06 is not set
# CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set
# CONFIG_TOUCHSCREEN_TOUCHWIN is not set
# CONFIG_TOUCHSCREEN_PIXCIR is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
# CONFIG_TOUCHSCREEN_TSC_SERIO is not set
# CONFIG_TOUCHSCREEN_TSC2007 is not set
# CONFIG_TOUCHSCREEN_ST1232 is not set
# CONFIG_TOUCHSCREEN_TPS6507X is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
CONFIG_INPUT_PCSPKR=m
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_MPU3050 is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_GP2A is not set
# CONFIG_INPUT_GPIO_TILT_POLLED is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
# CONFIG_INPUT_UINPUT is not set
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_GPIO_ROTARY_ENCODER is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_CMA3000 is not set
CONFIG_INPUT_XEN_KBDDEV_FRONTEND=y

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
CONFIG_DEVPTS_MULTIPLE_INSTANCES=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_NOZOMI is not set
# CONFIG_ISI is not set
# CONFIG_N_HDLC is not set
# CONFIG_N_GSM is not set
# CONFIG_TRACE_SINK is not set
# CONFIG_DEVKMEM is not set
# CONFIG_STALDRV is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
# CONFIG_SERIAL_8250_CS is not set
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_MFD_HSU is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_TIMBERDALE is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_PCH_UART is not set
# CONFIG_SERIAL_XILINX_PS_UART is not set
CONFIG_HVC_DRIVER=y
CONFIG_HVC_IRQ=y
CONFIG_HVC_XEN=y
CONFIG_HVC_XEN_FRONTEND=y
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_VIA is not set
CONFIG_HW_RANDOM_TPM=y
CONFIG_NVRAM=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
# CONFIG_CARDMAN_4000 is not set
# CONFIG_CARDMAN_4040 is not set
# CONFIG_IPWIRELESS is not set
# CONFIG_MWAVE is not set
CONFIG_RAW_DRIVER=y
CONFIG_MAX_RAW_DEVS=8192
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
CONFIG_TCG_TPM=y
CONFIG_TCG_TIS=y
# CONFIG_TCG_TIS_I2C_INFINEON is not set
# CONFIG_TCG_NSC is not set
# CONFIG_TCG_ATMEL is not set
# CONFIG_TCG_INFINEON is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=m

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
CONFIG_I2C_PIIX4=m
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EG20T is not set
# CONFIG_I2C_GPIO is not set
# CONFIG_I2C_INTEL_MID is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_PXA_PCI is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_SPI is not set
# CONFIG_HSI is not set

#
# PPS support
#
# CONFIG_PPS is not set

#
# PPS generators support
#

#
# PTP clock support
#

#
# Enable Device Drivers -> PPS to see the PTP clock options.
#
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIOLIB=y
# CONFIG_GPIO_SYSFS is not set

#
# Memory mapped GPIO drivers:
#
# CONFIG_GPIO_GENERIC_PLATFORM is not set
# CONFIG_GPIO_IT8761E is not set
# CONFIG_GPIO_SCH is not set
# CONFIG_GPIO_ICH is not set
# CONFIG_GPIO_VX855 is not set

#
# I2C GPIO expanders:
#
# CONFIG_GPIO_MAX7300 is not set
# CONFIG_GPIO_MAX732X is not set
# CONFIG_GPIO_PCA953X is not set
# CONFIG_GPIO_PCF857X is not set
# CONFIG_GPIO_ADP5588 is not set

#
# PCI GPIO expanders:
#
# CONFIG_GPIO_BT8XX is not set
# CONFIG_GPIO_AMD8111 is not set
# CONFIG_GPIO_LANGWELL is not set
# CONFIG_GPIO_PCH is not set
# CONFIG_GPIO_ML_IOH is not set
# CONFIG_GPIO_RDC321X is not set

#
# SPI GPIO expanders:
#
# CONFIG_GPIO_MCP23S08 is not set

#
# AC97 GPIO expanders:
#

#
# MODULbus GPIO expanders:
#
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_BATTERY_BQ27x00 is not set
# CONFIG_BATTERY_MAX17040 is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_GPIO is not set
# CONFIG_CHARGER_MANAGER is not set
# CONFIG_CHARGER_SMB347 is not set
# CONFIG_POWER_AVS is not set
CONFIG_HWMON=m
# CONFIG_HWMON_VID is not set
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7410 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_K10TEMP is not set
# CONFIG_SENSORS_FAM15H_POWER is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_GPIO_FAN is not set
# CONFIG_SENSORS_HIH6130 is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_NTC_THERMISTOR is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SHT15 is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_SMM665 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_SCH56XX_COMMON is not set
# CONFIG_SENSORS_SCH5627 is not set
# CONFIG_SENSORS_SCH5636 is not set
# CONFIG_SENSORS_ADS1015 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_APPLESMC is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
CONFIG_THERMAL=y
# CONFIG_CPU_THERMAL is not set
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
# CONFIG_WATCHDOG_NOWAYOUT is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_F71808E_WDT is not set
# CONFIG_SP5100_TCO is not set
# CONFIG_SC520_WDT is not set
# CONFIG_SBC_FITPC2_WATCHDOG is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_IE6XX_WDT is not set
# CONFIG_ITCO_WDT is not set
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_NV_TCO is not set
# CONFIG_60XX_WDT is not set
# CONFIG_SBC8360_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_VIA_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83697HF_WDT is not set
# CONFIG_W83697UG_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
# CONFIG_XEN_WDT is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y

#
# Broadcom specific AMBA
#
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS65010 is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65217 is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_ABX500_CORE is not set
# CONFIG_MFD_CS5535 is not set
# CONFIG_MFD_TIMBERDALE is not set
# CONFIG_LPC_SCH is not set
# CONFIG_LPC_ICH is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_WL1273_CORE is not set
CONFIG_REGULATOR=y
# CONFIG_REGULATOR_DEBUG is not set
# CONFIG_REGULATOR_DUMMY is not set
# CONFIG_REGULATOR_FIXED_VOLTAGE is not set
# CONFIG_REGULATOR_VIRTUAL_CONSUMER is not set
# CONFIG_REGULATOR_USERSPACE_CONSUMER is not set
# CONFIG_REGULATOR_GPIO is not set
# CONFIG_REGULATOR_AD5398 is not set
# CONFIG_REGULATOR_FAN53555 is not set
# CONFIG_REGULATOR_ISL6271A is not set
# CONFIG_REGULATOR_MAX1586 is not set
# CONFIG_REGULATOR_MAX8649 is not set
# CONFIG_REGULATOR_MAX8660 is not set
# CONFIG_REGULATOR_MAX8952 is not set
# CONFIG_REGULATOR_LP3971 is not set
# CONFIG_REGULATOR_LP3972 is not set
# CONFIG_REGULATOR_TPS62360 is not set
# CONFIG_REGULATOR_TPS65023 is not set
# CONFIG_REGULATOR_TPS6507X is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=64
# CONFIG_VGA_SWITCHEROO is not set
CONFIG_DRM=m
CONFIG_DRM_KMS_HELPER=m
# CONFIG_DRM_LOAD_EDID_FIRMWARE is not set
CONFIG_DRM_TTM=m
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
CONFIG_DRM_RADEON=m
CONFIG_DRM_RADEON_KMS=y
# CONFIG_DRM_NOUVEAU is not set

#
# I2C encoder or helper chips
#
# CONFIG_DRM_I2C_CH7006 is not set
# CONFIG_DRM_I2C_SIL164 is not set
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I915 is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
# CONFIG_DRM_VMWGFX is not set
# CONFIG_DRM_GMA500 is not set
# CONFIG_DRM_UDL is not set
# CONFIG_DRM_AST is not set
# CONFIG_DRM_MGAG200 is not set
# CONFIG_DRM_CIRRUS_QEMU is not set
# CONFIG_STUB_POULSBO is not set
# CONFIG_VGASTATE is not set
# CONFIG_VIDEO_OUTPUT_CONTROL is not set
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
# CONFIG_FB_DDC is not set
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=y
CONFIG_FB_SYS_COPYAREA=y
CONFIG_FB_SYS_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_SYS_FOPS=y
# CONFIG_FB_WMT_GE_ROPS is not set
CONFIG_FB_DEFERRED_IO=y
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
CONFIG_FB_VESA=y
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_VIRTUAL is not set
CONFIG_XEN_FBDEV_FRONTEND=y
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
# CONFIG_FB_AUO_K190X is not set
# CONFIG_EXYNOS_VIDEO is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_GENERIC is not set
# CONFIG_BACKLIGHT_APPLE is not set
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_LM3630 is not set
# CONFIG_BACKLIGHT_LM3639 is not set
# CONFIG_BACKLIGHT_LP855X is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_SOUND is not set

#
# HID support
#
CONFIG_HID=y
# CONFIG_HID_BATTERY_STRENGTH is not set
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
CONFIG_HID_A4TECH=y
# CONFIG_HID_ACRUX is not set
CONFIG_HID_APPLE=y
# CONFIG_HID_AUREAL is not set
CONFIG_HID_BELKIN=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_CYPRESS=y
CONFIG_HID_DRAGONRISE=y
# CONFIG_DRAGONRISE_FF is not set
# CONFIG_HID_EMS_FF is not set
CONFIG_HID_EZKEY=y
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_KEYTOUCH is not set
CONFIG_HID_KYE=y
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
CONFIG_HID_GYRATION=y
CONFIG_HID_TWINHAN=y
CONFIG_HID_KENSINGTON=y
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LENOVO_TPKBD is not set
CONFIG_HID_LOGITECH=y
# CONFIG_HID_LOGITECH_DJ is not set
# CONFIG_LOGITECH_FF is not set
# CONFIG_LOGIRUMBLEPAD2_FF is not set
# CONFIG_LOGIG940_FF is not set
# CONFIG_LOGIWHEELS_FF is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
# CONFIG_HID_MULTITOUCH is not set
CONFIG_HID_NTRIG=y
# CONFIG_HID_ORTEK is not set
CONFIG_HID_PANTHERLORD=y
# CONFIG_PANTHERLORD_FF is not set
CONFIG_HID_PETALYNX=y
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
CONFIG_HID_SAMSUNG=y
CONFIG_HID_SONY=y
# CONFIG_HID_SPEEDLINK is not set
CONFIG_HID_SUNPLUS=y
CONFIG_HID_GREENASIA=y
# CONFIG_GREENASIA_FF is not set
CONFIG_HID_SMARTJOYPLUS=y
CONFIG_SMARTJOYPLUS_FF=y
# CONFIG_HID_TIVO is not set
CONFIG_HID_TOPSEED=y
CONFIG_HID_THRUSTMASTER=y
# CONFIG_THRUSTMASTER_FF is not set
CONFIG_HID_ZEROPLUS=y
# CONFIG_ZEROPLUS_FF is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set

#
# USB HID support
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB_ARCH_HAS_XHCI=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
# CONFIG_USB_DYNAMIC_MINORS is not set
CONFIG_USB_SUSPEND=y
# CONFIG_USB_OTG is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
# CONFIG_USB_XHCI_HCD is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
# CONFIG_USB_ISP1362_HCD is not set
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_CHIPIDEA is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
# CONFIG_USB_STORAGE is not set
# CONFIG_USB_UAS is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
# CONFIG_USB_EZUSB_FX2 is not set

#
# USB Physical Layer drivers
#
# CONFIG_OMAP_USB2 is not set
# CONFIG_USB_ISP1301 is not set
# CONFIG_USB_GADGET is not set

#
# OTG and related infrastructure
#
# CONFIG_USB_GPIO_VBUS is not set
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
# CONFIG_LEDS_LM3530 is not set
# CONFIG_LEDS_LM3642 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_GPIO is not set
# CONFIG_LEDS_LP3944 is not set
# CONFIG_LEDS_LP5521 is not set
# CONFIG_LEDS_LP5523 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_PCA9633 is not set
# CONFIG_LEDS_REGULATOR is not set
# CONFIG_LEDS_BD2802 is not set
# CONFIG_LEDS_INTEL_SS4200 is not set
# CONFIG_LEDS_LT3593 is not set
# CONFIG_LEDS_TCA6507 is not set
# CONFIG_LEDS_LM355x is not set
# CONFIG_LEDS_OT200 is not set
# CONFIG_LEDS_BLINKM is not set
CONFIG_LEDS_TRIGGERS=y

#
# LED Triggers
#
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_ONESHOT is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_CPU is not set
# CONFIG_LEDS_TRIGGER_GPIO is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_LEDS_TRIGGER_TRANSIENT is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC=y

#
# Reporting subsystems
#
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
# CONFIG_EDAC_DECODE_MCE is not set
# CONFIG_EDAC_MM_EDAC is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set
# CONFIG_RTC_DRV_DS2404 is not set

#
# on-CPU RTC drivers
#
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
# CONFIG_INTEL_MID_DMAC is not set
# CONFIG_INTEL_IOATDMA is not set
# CONFIG_TIMB_DMA is not set
# CONFIG_PCH_DMA is not set
CONFIG_AUXDISPLAY=y
# CONFIG_UIO is not set
# CONFIG_VFIO is not set

#
# Virtio drivers
#
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_MMIO is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_HYPERV is not set

#
# Xen driver support
#
CONFIG_XEN_BALLOON=y
# CONFIG_XEN_BALLOON_MEMORY_HOTPLUG is not set
CONFIG_XEN_SCRUB_PAGES=y
# CONFIG_XEN_DEV_EVTCHN is not set
CONFIG_XEN_BACKEND=y
# CONFIG_XENFS is not set
CONFIG_XEN_SYS_HYPERVISOR=y
CONFIG_XEN_XENBUS_FRONTEND=y
# CONFIG_XEN_GNTDEV is not set
# CONFIG_XEN_GRANT_DEV_ALLOC is not set
CONFIG_SWIOTLB_XEN=y
# CONFIG_XEN_PCIDEV_BACKEND is not set
CONFIG_XEN_PRIVCMD=m
# CONFIG_XEN_ACPI_PROCESSOR is not set
# CONFIG_XEN_MCE_LOG is not set
CONFIG_STAGING=y
# CONFIG_ET131X is not set
# CONFIG_SLICOSS is not set
# CONFIG_USBIP_CORE is not set
# CONFIG_ECHO is not set
# CONFIG_COMEDI is not set
# CONFIG_ASUS_OLED is not set
# CONFIG_R8187SE is not set
# CONFIG_RTL8192U is not set
# CONFIG_RTLLIB is not set
# CONFIG_R8712U is not set
# CONFIG_RTS_PSTOR is not set
# CONFIG_RTS5139 is not set
# CONFIG_TRANZPORT is not set
# CONFIG_IDE_PHISON is not set
# CONFIG_VT6655 is not set
# CONFIG_VT6656 is not set
# CONFIG_DX_SEP is not set
# CONFIG_ZSMALLOC is not set
# CONFIG_WLAGS49_H2 is not set
# CONFIG_WLAGS49_H25 is not set
# CONFIG_FB_SM7XX is not set
# CONFIG_CRYSTALHD is not set
# CONFIG_FB_XGI is not set
# CONFIG_ACPI_QUICKSTART is not set
# CONFIG_USB_ENESTORAGE is not set
# CONFIG_BCM_WIMAX is not set
# CONFIG_FT1000 is not set

#
# Speakup console speech
#
# CONFIG_SPEAKUP is not set
# CONFIG_TOUCHSCREEN_CLEARPAD_TM1217 is not set
# CONFIG_TOUCHSCREEN_SYNAPTICS_I2C_RMI4 is not set
# CONFIG_STAGING_MEDIA is not set

#
# Android
#
# CONFIG_ANDROID is not set
# CONFIG_PHONE is not set
# CONFIG_USB_WPAN_HCD is not set
# CONFIG_IPACK_BUS is not set
# CONFIG_WIMAX_GDM72XX is not set
CONFIG_NET_VENDOR_SILICOM=y
# CONFIG_SBYPASS is not set
# CONFIG_BPCTL is not set
# CONFIG_CED1401 is not set
# CONFIG_DGRP is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACERHDF is not set
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_FUJITSU_TABLET is not set
# CONFIG_HP_ACCEL is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ACPI_WMI is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_IBM_RTL is not set
# CONFIG_XO15_EBOOK is not set
# CONFIG_SAMSUNG_LAPTOP is not set
# CONFIG_SAMSUNG_Q10 is not set
# CONFIG_APPLE_GMUX is not set

#
# Hardware Spinlock drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_STATS=y
# CONFIG_AMD_IOMMU_V2 is not set
# CONFIG_INTEL_IOMMU is not set
# CONFIG_IRQ_REMAP is not set

#
# Remoteproc drivers (EXPERIMENTAL)
#
# CONFIG_STE_MODEM_RPROC is not set

#
# Rpmsg drivers (EXPERIMENTAL)
#
# CONFIG_VIRT_DRIVERS is not set
# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_VME_BUS is not set
# CONFIG_PWM is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_EFI_VARS=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
# CONFIG_DMI_SYSFS is not set
CONFIG_ISCSI_IBFT_FIND=y
# CONFIG_ISCSI_IBFT is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_EXT2_FS is not set
CONFIG_EXT3_FS=m
CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4_FS is not set
CONFIG_JBD=m
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=m
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_FANOTIFY is not set
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_PRINT_QUOTA_WARNING=y
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
CONFIG_AUTOFS4_FS=m
# CONFIG_FUSE_FS is not set
CONFIG_GENERIC_ACL=y

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
# CONFIG_UDF_FS is not set

#
# DOS/FAT/NT Filesystems
#
# CONFIG_MSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
# CONFIG_CONFIGFS_FS is not set
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_JFFS2_FS is not set
# CONFIG_LOGFS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
# CONFIG_PSTORE_CONSOLE is not set
# CONFIG_PSTORE_FTRACE is not set
# CONFIG_PSTORE_RAM is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_CEPH_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
# CONFIG_NLS_UTF8 is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
CONFIG_DEFAULT_MESSAGE_LOGLEVEL=7
# CONFIG_ENABLE_WARN_DEPRECATED is not set
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
CONFIG_STRIP_ASM_SYMS=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
# CONFIG_DEBUG_SECTION_MISMATCH is not set
# CONFIG_DEBUG_KERNEL is not set
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_SPARSE_RCU_POINTER is not set
CONFIG_STACKTRACE=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_ARCH_WANT_FRAME_POINTERS=y
# CONFIG_FRAME_POINTER is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_LKDTM is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_EVENT_POWER_TRACING_DEPRECATED=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
# CONFIG_IRQSOFF_TRACER is not set
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
CONFIG_STACK_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENT=y
# CONFIG_UPROBE_EVENT is not set
CONFIG_PROBE_EVENTS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_ATOMIC64_SELFTEST is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_HAVE_ARCH_KMEMCHECK=y
# CONFIG_TEST_KSTRTOX is not set
CONFIG_STRICT_DEVMEM=y
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
# CONFIG_DEBUG_SET_MODULE_RONX is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
CONFIG_OPTIMIZE_INLINING=y

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_TRUSTED_KEYS is not set
# CONFIG_ENCRYPTED_KEYS is not set
CONFIG_KEYS_DEBUG_PROC_KEYS=y
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_NETWORK_XFRM=y
# CONFIG_SECURITY_PATH is not set
CONFIG_LSM_MMAP_MIN_ADDR=65535
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_SECURITY_YAMA is not set
CONFIG_INTEGRITY=y
# CONFIG_INTEGRITY_SIGNATURE is not set
CONFIG_IMA=y
CONFIG_IMA_MEASURE_PCR_IDX=10
CONFIG_IMA_AUDIT=y
CONFIG_IMA_LSM_RULES=y
# CONFIG_IMA_APPRAISE is not set
# CONFIG_EVM is not set
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_DEFAULT_SECURITY="selinux"
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_WORKQUEUE=y
# CONFIG_CRYPTO_CRYPTD is not set
# CONFIG_CRYPTO_AUTHENC is not set
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
# CONFIG_CRYPTO_CBC is not set
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_GHASH is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_X86_64 is not set
# CONFIG_CRYPTO_AES_NI_INTEL is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_ZLIB is not set
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_USER_API_HASH is not set
# CONFIG_CRYPTO_USER_API_SKCIPHER is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_ASYMMETRIC_KEY_TYPE is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_APIC_ARCHITECTURE=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
CONFIG_KVM_INTEL=m
# CONFIG_KVM_AMD is not set
# CONFIG_KVM_MMU_AUDIT is not set
# CONFIG_VHOST_NET is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_IO=y
# CONFIG_CRC_CCITT is not set
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=m
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC7 is not set
# CONFIG_LIBCRC32C is not set
# CONFIG_CRC8 is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_NLATTR=y
CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE=y
CONFIG_AVERAGE=y
# CONFIG_CORDIC is not set
# CONFIG_DDR is not set
> Thanks,
> 
> 	Ingo
> 


* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-10 16:42 ` Srikar Dronamraju
  2012-12-10 19:23   ` Ingo Molnar
@ 2012-12-10 23:40   ` Srikar Dronamraju
  1 sibling, 0 replies; 80+ messages in thread
From: Srikar Dronamraju @ 2012-12-10 23:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

> Here are the results of running autonumabenchmark on a 64-core, 8-node
> machine. It has six 32GB nodes and two 64GB nodes.

> KernelVersion: 3.7.0-rc6-mel_auto_balance(mm-balancenuma-v10r3)
>                         Testcase:      Min      Max      Avg  %Change
>                           numa01:  1536.93  1819.85  1694.54   -8.22%
>                 numa01_HARD_BIND:   909.67  1145.32  1055.57   -5.90%
>              numa01_INVERSE_BIND:  2882.07  3287.24  2976.89   22.10%
>              numa01_THREAD_ALLOC:   995.79  4845.27  1905.85  -41.17%
>    numa01_THREAD_ALLOC_HARD_BIND:   582.36   818.11   655.18   20.99%
> numa01_THREAD_ALLOC_INVERSE_BIND:  1790.91  1927.90  1868.49    6.45%
>                           numa02:   131.53   287.93   209.15  -29.70%
>                 numa02_HARD_BIND:    25.68    31.90    27.66   -2.60%
>              numa02_INVERSE_BIND:   341.09   401.37   353.84   -1.34%
>                       numa02_SMT:   156.61  2036.63   731.97  -47.21%
>             numa02_SMT_HARD_BIND:    25.10   196.60    79.72   27.92%
>          numa02_SMT_INVERSE_BIND:   294.22  1801.59   824.41  -60.55%
> 
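
As a rough aid for reading the table above, here is a minimal sketch that
parses the quoted rows and ranks the test cases by run-to-run spread
(Max divided by Min). It is not part of the original report: the names
`Row`, `parse_rows` and `rank_by_spread` are hypothetical, and the numbers
are copied verbatim from the table, not measured again.

```python
from typing import List, NamedTuple

# Rows copied from the quoted autonumabenchmark table (mm-balancenuma-v10r3).
RESULTS = """\
> numa01:  1536.93  1819.85  1694.54   -8.22%
> numa01_HARD_BIND:   909.67  1145.32  1055.57   -5.90%
> numa01_INVERSE_BIND:  2882.07  3287.24  2976.89   22.10%
> numa01_THREAD_ALLOC:   995.79  4845.27  1905.85  -41.17%
> numa01_THREAD_ALLOC_HARD_BIND:   582.36   818.11   655.18   20.99%
> numa01_THREAD_ALLOC_INVERSE_BIND:  1790.91  1927.90  1868.49    6.45%
> numa02:   131.53   287.93   209.15  -29.70%
> numa02_HARD_BIND:    25.68    31.90    27.66   -2.60%
> numa02_INVERSE_BIND:   341.09   401.37   353.84   -1.34%
> numa02_SMT:   156.61  2036.63   731.97  -47.21%
> numa02_SMT_HARD_BIND:    25.10   196.60    79.72   27.92%
> numa02_SMT_INVERSE_BIND:   294.22  1801.59   824.41  -60.55%
"""


class Row(NamedTuple):
    name: str
    min: float
    max: float
    avg: float
    change_pct: float  # the table's %Change column, sign preserved


def parse_rows(text: str) -> List[Row]:
    """Parse 'name: min max avg change%' lines, skipping anything else."""
    rows = []
    for line in text.splitlines():
        line = line.lstrip("> ").strip()  # drop the mail quote prefix
        if ":" not in line:
            continue
        name, values = line.split(":", 1)
        fields = values.split()
        if len(fields) != 4:
            continue
        try:
            lo, hi, avg = (float(v) for v in fields[:3])
            change = float(fields[3].rstrip("%"))
        except ValueError:
            continue  # e.g. a header row such as "Testcase: Min Max Avg %Change"
        rows.append(Row(name.strip(), lo, hi, avg, change))
    return rows


def rank_by_spread(rows: List[Row]) -> List[Row]:
    """Sort test cases by Max/Min, largest spread first."""
    return sorted(rows, key=lambda r: r.max / r.min, reverse=True)


if __name__ == "__main__":
    for r in rank_by_spread(parse_rows(RESULTS)):
        print(f"{r.name:34s} max/min = {r.max / r.min:5.2f}")
```

A large Max/Min ratio only says the runs varied widely; on its own it says
nothing about why.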

Here is the config I used for balancenuma.

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 3.7.0-rc6 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_GPIO=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CPU_AUTOPROBE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_CPU_PROBE_RELEASE=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_HAVE_IRQ_WORK=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
CONFIG_LOCALVERSION="-mel_auto_balance"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_FHANDLE is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y
# CONFIG_AUDIT_LOGINUID_IMMUTABLE is not set
CONFIG_HAVE_GENERIC_HARDIRQS=y

#
# IRQ subsystem
#
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_RCU_USER_QS is not set
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_IKCONFIG=m
# CONFIG_IKCONFIG_PROC is not set
CONFIG_LOG_BUF_SHIFT=19
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE=y
CONFIG_ARCH_USES_NUMA_PROT_NONE=y
CONFIG_BALANCE_NUMA_DEFAULT_ENABLED=y
CONFIG_BALANCE_NUMA=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
# CONFIG_MEMCG is not set
# CONFIG_CGROUP_HUGETLB is not set
# CONFIG_CGROUP_PERF is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
CONFIG_RT_GROUP_SCHED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_SCHED_AUTOGROUP=y
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EXPERT is not set
CONFIG_HAVE_UID16=y
CONFIG_UID16=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_KALLSYMS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
# CONFIG_COMPAT_BRK is not set
CONFIG_SLAB=y
# CONFIG_SLUB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
# CONFIG_OPROFILE is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
# CONFIG_JUMP_LABEL is not set
CONFIG_OPTPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_GENERIC_KERNEL_THREAD=y
CONFIG_GENERIC_KERNEL_EXECVE=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_HAVE_RCU_USER_QS=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_MODULES_USE_ELF_RELA=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
# CONFIG_MODULE_SIG is not set
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_INTEGRITY=y
CONFIG_BLK_DEV_THROTTLING=y

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
CONFIG_INLINE_READ_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK_IRQ=y
CONFIG_INLINE_WRITE_UNLOCK=y
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_ZONE_DMA=y
CONFIG_SMP=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_VSMP is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT_GUEST=y
# CONFIG_PARAVIRT_TIME_ACCOUNTING is not set
CONFIG_XEN=y
CONFIG_XEN_DOM0=y
CONFIG_XEN_PRIVILEGED_GUEST=y
CONFIG_XEN_PVHVM=y
CONFIG_XEN_MAX_DOMAIN_MEMORY=500
CONFIG_XEN_SAVE_RESTORE=y
CONFIG_XEN_DEBUG_FS=y
CONFIG_KVM_GUEST=y
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_PARAVIRT_CLOCK=y
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
# CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_NR_CPUS=512
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_X86_THERMAL_VECTOR=y
# CONFIG_I8K is not set
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_MEMORY_PROBE=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_ARCH_DISCARD_MEMBLOCK=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_SPARSE=y
# CONFIG_MEMORY_HOTREMOVE is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_CLEANCACHE is not set
# CONFIG_FRONTSWAP is not set
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=1
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
CONFIG_EFI=y
# CONFIG_EFI_STUB is not set
# CONFIG_SECCOMP is not set
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM_RUNTIME=y
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_PCI_SLOT=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_HOTPLUG_MEMORY=y
# CONFIG_ACPI_SBS is not set
# CONFIG_ACPI_HED is not set
# CONFIG_ACPI_CUSTOM_METHOD is not set
# CONFIG_ACPI_BGRT is not set
CONFIG_ACPI_APEI=y
# CONFIG_ACPI_APEI_GHES is not set
# CONFIG_ACPI_APEI_PCIEAER is not set
# CONFIG_ACPI_APEI_MEMORY_FAILURE is not set
# CONFIG_ACPI_APEI_EINJ is not set
# CONFIG_ACPI_APEI_ERST_DEBUG is not set
CONFIG_SFI=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set

#
# x86 CPU frequency scaling drivers
#
# CONFIG_X86_PCC_CPUFREQ is not set
# CONFIG_X86_ACPI_CPUFREQ is not set
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_SPEEDSTEP_LIB is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
CONFIG_INTEL_IDLE=y

#
# Memory power savings
#
# CONFIG_I7300_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_XEN=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEAER=y
CONFIG_PCIE_ECRC=y
# CONFIG_PCIEAER_INJECT is not set
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
CONFIG_PCI_STUB=y
CONFIG_XEN_PCIDEV_FRONTEND=y
CONFIG_HT_IRQ=y
CONFIG_PCI_ATS=y
CONFIG_PCI_IOV=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
# CONFIG_PCI_IOAPIC is not set
CONFIG_PCI_LABEL=y
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
CONFIG_PCCARD=y
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
# CONFIG_YENTA is not set
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_ACPI=y
# CONFIG_HOTPLUG_PCI_ACPI_IBM is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set
# CONFIG_RAPIDIO is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_COREDUMP=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
# CONFIG_X86_X32 is not set
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_KEYS_COMPAT=y
CONFIG_HAVE_TEXT_POKE_SMP=y
CONFIG_X86_DEV_DMA_OPS=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
# CONFIG_UNIX_DIAG is not set
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
CONFIG_XFRM_STATISTICS=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
# CONFIG_IP_FIB_TRIE_STATS is not set
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
CONFIG_IP_MROUTE=y
# CONFIG_IP_MROUTE_MULTIPLE_TABLES is not set
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
CONFIG_INET_LRO=y
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_IPV6_OPTIMISTIC_DAD=y
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET6_XFRM_MODE_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_BEET is not set
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
# CONFIG_IPV6_SIT is not set
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_GRE is not set
CONFIG_IPV6_MULTIPLE_TABLES=y
# CONFIG_IPV6_SUBTREES is not set
CONFIG_IPV6_MROUTE=y
# CONFIG_IPV6_MROUTE_MULTIPLE_TABLES is not set
CONFIG_IPV6_PIMSM_V2=y
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_NETFILTER_ADVANCED=y

#
# Core Netfilter Configuration
#
# CONFIG_NETFILTER_NETLINK_ACCT is not set
# CONFIG_NETFILTER_NETLINK_QUEUE is not set
# CONFIG_NETFILTER_NETLINK_LOG is not set
# CONFIG_NF_CONNTRACK is not set
CONFIG_NETFILTER_XTABLES=y

#
# Xtables combined modules
#
# CONFIG_NETFILTER_XT_MARK is not set

#
# Xtables targets
#
# CONFIG_NETFILTER_XT_TARGET_AUDIT is not set
# CONFIG_NETFILTER_XT_TARGET_CLASSIFY is not set
# CONFIG_NETFILTER_XT_TARGET_HMARK is not set
# CONFIG_NETFILTER_XT_TARGET_IDLETIMER is not set
# CONFIG_NETFILTER_XT_TARGET_LED is not set
# CONFIG_NETFILTER_XT_TARGET_LOG is not set
# CONFIG_NETFILTER_XT_TARGET_MARK is not set
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set
# CONFIG_NETFILTER_XT_TARGET_RATEEST is not set
# CONFIG_NETFILTER_XT_TARGET_TEE is not set
# CONFIG_NETFILTER_XT_TARGET_SECMARK is not set
# CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set

#
# Xtables matches
#
# CONFIG_NETFILTER_XT_MATCH_ADDRTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
# CONFIG_NETFILTER_XT_MATCH_CPU is not set
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DEVGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ECN is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_HL is not set
# CONFIG_NETFILTER_XT_MATCH_IPRANGE is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
# CONFIG_NETFILTER_XT_MATCH_LIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_MAC is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
# CONFIG_NETFILTER_XT_MATCH_MULTIPORT is not set
# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set
# CONFIG_NETFILTER_XT_MATCH_OWNER is not set
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
# CONFIG_NETFILTER_XT_MATCH_PKTTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
# CONFIG_NETFILTER_XT_MATCH_RATEEST is not set
# CONFIG_NETFILTER_XT_MATCH_REALM is not set
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
# CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_TIME is not set
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV4 is not set
# CONFIG_IP_NF_QUEUE is not set
# CONFIG_IP_NF_IPTABLES is not set
# CONFIG_IP_NF_ARPTABLES is not set

#
# IPv6: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV6 is not set
# CONFIG_IP6_NF_IPTABLES is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
# CONFIG_BRIDGE is not set
CONFIG_NET_DSA=y
CONFIG_NET_DSA_TAG_DSA=y
CONFIG_NET_DSA_TAG_EDSA=y
CONFIG_NET_DSA_TAG_TRAILER=y
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_PHONET is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFB is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
# CONFIG_NET_SCH_MQPRIO is not set
# CONFIG_NET_SCH_CHOKE is not set
# CONFIG_NET_SCH_QFQ is not set
# CONFIG_NET_SCH_CODEL is not set
# CONFIG_NET_SCH_FQ_CODEL is not set
# CONFIG_NET_SCH_INGRESS is not set
# CONFIG_NET_SCH_PLUG is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_CLS_CGROUP=y
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
# CONFIG_NET_EMATCH_NBYTE is not set
# CONFIG_NET_EMATCH_U32 is not set
# CONFIG_NET_EMATCH_META is not set
# CONFIG_NET_EMATCH_TEXT is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
# CONFIG_NET_ACT_CSUM is not set
CONFIG_NET_SCH_FIFO=y
CONFIG_DCB=y
# CONFIG_DNS_RESOLVER is not set
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
# CONFIG_NETPRIO_CGROUP is not set
CONFIG_BQL=y
# CONFIG_BPF_JIT is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
CONFIG_NET_DROP_MONITOR=y
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set
# CONFIG_LIB80211 is not set

#
# CFG80211 needs to be enabled for MAC80211
#
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set
# CONFIG_RFKILL_REGULATOR is not set
# CONFIG_NET_9P is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
CONFIG_HAVE_BPF_JIT=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_FIRMWARE_IN_KERNEL is not set
CONFIG_EXTRA_FIRMWARE=""
CONFIG_SYS_HYPERVISOR=y
# CONFIG_GENERIC_CPU_DEVICES is not set
CONFIG_DMA_SHARED_BUFFER=y

#
# Bus devices
#
# CONFIG_OMAP_OCP2SCP is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
CONFIG_MTD=y
# CONFIG_MTD_TESTS is not set
# CONFIG_MTD_REDBOOT_PARTS is not set
CONFIG_MTD_CMDLINE_PARTS=y
# CONFIG_MTD_AR7_PARTS is not set

#
# User Modules And Translation Layers
#
# CONFIG_MTD_CHAR is not set
# CONFIG_MTD_BLKDEVS is not set
# CONFIG_MTD_BLOCK is not set
# CONFIG_MTD_BLOCK_RO is not set
# CONFIG_FTL is not set
# CONFIG_NFTL is not set
# CONFIG_INFTL is not set
# CONFIG_RFD_FTL is not set
# CONFIG_SSFDC is not set
# CONFIG_SM_FTL is not set
# CONFIG_MTD_OOPS is not set
# CONFIG_MTD_SWAP is not set

#
# RAM/ROM/Flash chip drivers
#
# CONFIG_MTD_CFI is not set
# CONFIG_MTD_JEDECPROBE is not set
CONFIG_MTD_MAP_BANK_WIDTH_1=y
CONFIG_MTD_MAP_BANK_WIDTH_2=y
CONFIG_MTD_MAP_BANK_WIDTH_4=y
# CONFIG_MTD_MAP_BANK_WIDTH_8 is not set
# CONFIG_MTD_MAP_BANK_WIDTH_16 is not set
# CONFIG_MTD_MAP_BANK_WIDTH_32 is not set
CONFIG_MTD_CFI_I1=y
CONFIG_MTD_CFI_I2=y
# CONFIG_MTD_CFI_I4 is not set
# CONFIG_MTD_CFI_I8 is not set
# CONFIG_MTD_RAM is not set
# CONFIG_MTD_ROM is not set
# CONFIG_MTD_ABSENT is not set

#
# Mapping drivers for chip access
#
CONFIG_MTD_COMPLEX_MAPPINGS=y
# CONFIG_MTD_TS5500 is not set
# CONFIG_MTD_PCI is not set
# CONFIG_MTD_PCMCIA is not set
# CONFIG_MTD_GPIO_ADDR is not set
# CONFIG_MTD_INTEL_VR_NOR is not set
# CONFIG_MTD_PLATRAM is not set
# CONFIG_MTD_LATCH_ADDR is not set

#
# Self-contained MTD device drivers
#
# CONFIG_MTD_PMC551 is not set
# CONFIG_MTD_SLRAM is not set
# CONFIG_MTD_PHRAM is not set
# CONFIG_MTD_MTDRAM is not set
# CONFIG_MTD_BLOCK2MTD is not set

#
# Disk-On-Chip Device Drivers
#
# CONFIG_MTD_DOCG3 is not set
# CONFIG_MTD_NAND is not set
# CONFIG_MTD_ONENAND is not set

#
# LPDDR flash memory drivers
#
# CONFIG_MTD_LPDDR is not set
# CONFIG_MTD_UBI is not set
# CONFIG_PARPORT is not set
CONFIG_PNP=y
# CONFIG_PNP_DEBUG_MESSAGES is not set

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_NVME is not set
# CONFIG_BLK_DEV_SX8 is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_XEN_BLKDEV_FRONTEND is not set
# CONFIG_XEN_BLKDEV_BACKEND is not set
# CONFIG_BLK_DEV_HD is not set
# CONFIG_BLK_DEV_RBD is not set

#
# Misc devices
#
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_AD525X_DPOT is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_INTEL_MID_PTI is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
CONFIG_ENCLOSURE_SERVICES=m
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1780 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_VMWARE_BALLOON is not set
# CONFIG_BMP085_I2C is not set
# CONFIG_PCH_PHUB is not set
# CONFIG_USB_SWITCH_FSA9480 is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# CONFIG_TI_ST is not set
# CONFIG_SENSORS_LIS3_I2C is not set

#
# Altera FPGA firmware download module
#
# CONFIG_ALTERA_STAPL is not set
# CONFIG_INTEL_MEI is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=m
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=m
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_ENCLOSURE=m
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_FC_TGT_ATTRS=y
# CONFIG_SCSI_ISCSI_ATTRS is not set
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_ATA=y
CONFIG_SCSI_SAS_HOST_SMP=y
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_ISCSI_BOOT_SYSFS is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_SCSI_CXGB4_ISCSI is not set
# CONFIG_SCSI_BNX2_ISCSI is not set
# CONFIG_SCSI_BNX2X_FCOE is not set
# CONFIG_BE2ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_HPSA is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_3W_SAS is not set
# CONFIG_SCSI_ACARD is not set
CONFIG_SCSI_AACRAID=m
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
CONFIG_SCSI_AIC94XX=m
# CONFIG_AIC94XX_DEBUG is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_MVUMI is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
CONFIG_MEGARAID_NEWGEN=y
# CONFIG_MEGARAID_MM is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_MPT2SAS is not set
# CONFIG_SCSI_UFSHCD is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_VMWARE_PVSCSI is not set
# CONFIG_LIBFC is not set
# CONFIG_LIBFCOE is not set
# CONFIG_FCOE is not set
# CONFIG_FCOE_FNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_ISCI is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
CONFIG_SCSI_QLA_FC=m
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_PMCRAID is not set
# CONFIG_SCSI_PM8001 is not set
# CONFIG_SCSI_SRP is not set
# CONFIG_SCSI_BFA_FC is not set
CONFIG_SCSI_LOWLEVEL_PCMCIA=y
# CONFIG_PCMCIA_AHA152X is not set
# CONFIG_PCMCIA_FDOMAIN is not set
# CONFIG_PCMCIA_QLOGIC is not set
# CONFIG_PCMCIA_SYM53C500 is not set
CONFIG_SCSI_DH=y
# CONFIG_SCSI_DH_RDAC is not set
# CONFIG_SCSI_DH_HP_SW is not set
# CONFIG_SCSI_DH_EMC is not set
# CONFIG_SCSI_DH_ALUA is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
# CONFIG_SATA_AHCI is not set
# CONFIG_SATA_AHCI_PLATFORM is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_SATA_ACARD_AHCI is not set
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
# CONFIG_ATA_PIIX is not set
# CONFIG_SATA_HIGHBANK is not set
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARASAN_CF is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5536 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SCH is not set
CONFIG_PATA_SERVERWORKS=m
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_PCMCIA is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
CONFIG_PATA_ACPI=m
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
# CONFIG_MD_LINEAR is not set
# CONFIG_MD_RAID0 is not set
# CONFIG_MD_RAID1 is not set
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
CONFIG_BLK_DEV_DM=m
CONFIG_DM_DEBUG=y
# CONFIG_DM_CRYPT is not set
# CONFIG_DM_SNAPSHOT is not set
# CONFIG_DM_THIN_PROVISIONING is not set
CONFIG_DM_MIRROR=m
# CONFIG_DM_RAID is not set
# CONFIG_DM_LOG_USERSPACE is not set
# CONFIG_DM_ZERO is not set
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
CONFIG_DM_UEVENT=y
# CONFIG_DM_FLAKEY is not set
# CONFIG_DM_VERITY is not set
# CONFIG_TARGET_CORE is not set
CONFIG_FUSION=y
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set
CONFIG_FUSION_MAX_SGE=128
CONFIG_FUSION_LOGGING=y

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
# CONFIG_I2O is not set
CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_NETDEVICES=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
# CONFIG_DUMMY is not set
# CONFIG_EQUALIZER is not set
CONFIG_NET_FC=y
# CONFIG_MII is not set
# CONFIG_IFB is not set
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_VXLAN is not set
CONFIG_NETCONSOLE=m
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_TUN is not set
# CONFIG_VETH is not set
# CONFIG_ARCNET is not set

#
# CAIF transport drivers
#

#
# Distributed Switch Architecture drivers
#
CONFIG_NET_DSA_MV88E6XXX=y
CONFIG_NET_DSA_MV88E6060=y
CONFIG_NET_DSA_MV88E6XXX_NEED_PPU=y
CONFIG_NET_DSA_MV88E6131=y
CONFIG_NET_DSA_MV88E6123_61_65=y
CONFIG_ETHERNET=y
CONFIG_NET_VENDOR_3COM=y
# CONFIG_PCMCIA_3C574 is not set
# CONFIG_PCMCIA_3C589 is not set
# CONFIG_VORTEX is not set
# CONFIG_TYPHOON is not set
CONFIG_NET_VENDOR_ADAPTEC=y
# CONFIG_ADAPTEC_STARFIRE is not set
CONFIG_NET_VENDOR_ALTEON=y
# CONFIG_ACENIC is not set
CONFIG_NET_VENDOR_AMD=y
# CONFIG_AMD8111_ETH is not set
# CONFIG_PCNET32 is not set
# CONFIG_PCMCIA_NMCLAN is not set
CONFIG_NET_VENDOR_ATHEROS=y
# CONFIG_ATL2 is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
CONFIG_NET_VENDOR_BROADCOM=y
# CONFIG_B44 is not set
# CONFIG_BNX2 is not set
# CONFIG_CNIC is not set
CONFIG_TIGON3=m
# CONFIG_BNX2X is not set
CONFIG_NET_VENDOR_BROCADE=y
# CONFIG_BNA is not set
# CONFIG_NET_CALXEDA_XGMAC is not set
CONFIG_NET_VENDOR_CHELSIO=y
# CONFIG_CHELSIO_T1 is not set
# CONFIG_CHELSIO_T3 is not set
# CONFIG_CHELSIO_T4 is not set
# CONFIG_CHELSIO_T4VF is not set
CONFIG_NET_VENDOR_CISCO=y
# CONFIG_ENIC is not set
# CONFIG_DNET is not set
CONFIG_NET_VENDOR_DEC=y
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
# CONFIG_TULIP is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
# CONFIG_PCMCIA_XIRCOM is not set
CONFIG_NET_VENDOR_DLINK=y
# CONFIG_DL2K is not set
# CONFIG_SUNDANCE is not set
CONFIG_NET_VENDOR_EMULEX=y
# CONFIG_BE2NET is not set
CONFIG_NET_VENDOR_EXAR=y
# CONFIG_S2IO is not set
# CONFIG_VXGE is not set
CONFIG_NET_VENDOR_FUJITSU=y
# CONFIG_PCMCIA_FMVJ18X is not set
CONFIG_NET_VENDOR_HP=y
# CONFIG_HP100 is not set
CONFIG_NET_VENDOR_INTEL=y
# CONFIG_E100 is not set
# CONFIG_E1000 is not set
# CONFIG_E1000E is not set
# CONFIG_IGB is not set
# CONFIG_IGBVF is not set
# CONFIG_IXGB is not set
# CONFIG_IXGBE is not set
# CONFIG_IXGBEVF is not set
CONFIG_NET_VENDOR_I825XX=y
# CONFIG_ZNET is not set
# CONFIG_IP1000 is not set
# CONFIG_JME is not set
CONFIG_NET_VENDOR_MARVELL=y
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
CONFIG_NET_VENDOR_MELLANOX=y
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
CONFIG_NET_VENDOR_MICREL=y
# CONFIG_KS8851_MLL is not set
# CONFIG_KSZ884X_PCI is not set
CONFIG_NET_VENDOR_MYRI=y
# CONFIG_MYRI10GE is not set
# CONFIG_FEALNX is not set
CONFIG_NET_VENDOR_NATSEMI=y
# CONFIG_NATSEMI is not set
# CONFIG_NS83820 is not set
CONFIG_NET_VENDOR_8390=y
# CONFIG_PCMCIA_AXNET is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_PCMCIA_PCNET is not set
CONFIG_NET_VENDOR_NVIDIA=y
# CONFIG_FORCEDETH is not set
CONFIG_NET_VENDOR_OKI=y
# CONFIG_PCH_GBE is not set
# CONFIG_ETHOC is not set
CONFIG_NET_PACKET_ENGINE=y
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_NET_VENDOR_QLOGIC=y
# CONFIG_QLA3XXX is not set
# CONFIG_QLCNIC is not set
# CONFIG_QLGE is not set
# CONFIG_NETXEN_NIC is not set
CONFIG_NET_VENDOR_REALTEK=y
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_R8169 is not set
CONFIG_NET_VENDOR_RDC=y
# CONFIG_R6040 is not set
CONFIG_NET_VENDOR_SEEQ=y
# CONFIG_SEEQ8005 is not set
CONFIG_NET_VENDOR_SILAN=y
# CONFIG_SC92031 is not set
CONFIG_NET_VENDOR_SIS=y
# CONFIG_SIS900 is not set
# CONFIG_SIS190 is not set
# CONFIG_SFC is not set
CONFIG_NET_VENDOR_SMSC=y
# CONFIG_PCMCIA_SMC91C92 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC9420 is not set
CONFIG_NET_VENDOR_STMICRO=y
# CONFIG_STMMAC_ETH is not set
CONFIG_NET_VENDOR_SUN=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NIU is not set
CONFIG_NET_VENDOR_TEHUTI=y
# CONFIG_TEHUTI is not set
CONFIG_NET_VENDOR_TI=y
# CONFIG_TLAN is not set
CONFIG_NET_VENDOR_VIA=y
# CONFIG_VIA_RHINE is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
CONFIG_NET_VENDOR_XIRCOM=y
# CONFIG_PCMCIA_XIRC2PS is not set
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
# CONFIG_SKFP is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
# CONFIG_AT803X_PHY is not set
# CONFIG_AMD_PHY is not set
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_BCM87XX_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_MICREL_PHY is not set
CONFIG_FIXED_PHY=y
# CONFIG_MDIO_BITBANG is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_IPHETH is not set
CONFIG_WLAN=y
# CONFIG_PCMCIA_RAYCS is not set
# CONFIG_AIRO is not set
# CONFIG_ATMEL is not set
# CONFIG_AIRO_CS is not set
# CONFIG_PCMCIA_WL3501 is not set
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
# CONFIG_HOSTAP is not set
# CONFIG_WL_TI is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
CONFIG_WAN=y
# CONFIG_HDLC is not set
# CONFIG_DLCI is not set
# CONFIG_SBNI is not set
# CONFIG_XEN_NETDEV_FRONTEND is not set
# CONFIG_XEN_NETDEV_BACKEND is not set
# CONFIG_VMXNET3 is not set
CONFIG_ISDN=y
# CONFIG_ISDN_I4L is not set
# CONFIG_ISDN_CAPI is not set
# CONFIG_ISDN_DRV_GIGASET is not set
# CONFIG_HYSDN is not set
# CONFIG_MISDN is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
# CONFIG_INPUT_POLLDEV is not set
# CONFIG_INPUT_SPARSEKMAP is not set
# CONFIG_INPUT_MATRIXKMAP is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_GPIO is not set
# CONFIG_KEYBOARD_GPIO_POLLED is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_MATRIX is not set
# CONFIG_KEYBOARD_LM8323 is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_OMAP4 is not set
# CONFIG_KEYBOARD_XTKBD is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_PS2_ELANTECH=y
CONFIG_MOUSE_PS2_SENTELIC=y
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_GPIO is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_MOUSE_SYNAPTICS_USB is not set
# CONFIG_INPUT_JOYSTICK is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
# CONFIG_TABLET_USB_AIPTEK is not set
# CONFIG_TABLET_USB_GTCO is not set
# CONFIG_TABLET_USB_HANWANG is not set
# CONFIG_TABLET_USB_KBTAB is not set
# CONFIG_TABLET_USB_WACOM is not set
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_AD7879 is not set
# CONFIG_TOUCHSCREEN_ATMEL_MXT is not set
# CONFIG_TOUCHSCREEN_AUO_PIXCIR is not set
# CONFIG_TOUCHSCREEN_BU21013 is not set
# CONFIG_TOUCHSCREEN_CY8CTMG110 is not set
# CONFIG_TOUCHSCREEN_CYTTSP_CORE is not set
# CONFIG_TOUCHSCREEN_DYNAPRO is not set
# CONFIG_TOUCHSCREEN_HAMPSHIRE is not set
# CONFIG_TOUCHSCREEN_EETI is not set
# CONFIG_TOUCHSCREEN_FUJITSU is not set
# CONFIG_TOUCHSCREEN_ILI210X is not set
# CONFIG_TOUCHSCREEN_GUNZE is not set
# CONFIG_TOUCHSCREEN_ELO is not set
# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set
# CONFIG_TOUCHSCREEN_WACOM_I2C is not set
# CONFIG_TOUCHSCREEN_MAX11801 is not set
# CONFIG_TOUCHSCREEN_MCS5000 is not set
# CONFIG_TOUCHSCREEN_MMS114 is not set
# CONFIG_TOUCHSCREEN_MTOUCH is not set
# CONFIG_TOUCHSCREEN_INEXIO is not set
# CONFIG_TOUCHSCREEN_MK712 is not set
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_EDT_FT5X06 is not set
# CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set
# CONFIG_TOUCHSCREEN_TOUCHWIN is not set
# CONFIG_TOUCHSCREEN_PIXCIR is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
# CONFIG_TOUCHSCREEN_TSC_SERIO is not set
# CONFIG_TOUCHSCREEN_TSC2007 is not set
# CONFIG_TOUCHSCREEN_ST1232 is not set
# CONFIG_TOUCHSCREEN_TPS6507X is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
CONFIG_INPUT_PCSPKR=m
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_MPU3050 is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_GP2A is not set
# CONFIG_INPUT_GPIO_TILT_POLLED is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
# CONFIG_INPUT_UINPUT is not set
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_GPIO_ROTARY_ENCODER is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_CMA3000 is not set
CONFIG_INPUT_XEN_KBDDEV_FRONTEND=y

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
CONFIG_DEVPTS_MULTIPLE_INSTANCES=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_NOZOMI is not set
# CONFIG_ISI is not set
# CONFIG_N_HDLC is not set
# CONFIG_N_GSM is not set
# CONFIG_TRACE_SINK is not set
# CONFIG_DEVKMEM is not set
# CONFIG_STALDRV is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
# CONFIG_SERIAL_8250_CS is not set
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_MFD_HSU is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_TIMBERDALE is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_PCH_UART is not set
# CONFIG_SERIAL_XILINX_PS_UART is not set
CONFIG_HVC_DRIVER=y
CONFIG_HVC_IRQ=y
CONFIG_HVC_XEN=y
CONFIG_HVC_XEN_FRONTEND=y
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_VIA is not set
CONFIG_HW_RANDOM_TPM=y
CONFIG_NVRAM=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
# CONFIG_CARDMAN_4000 is not set
# CONFIG_CARDMAN_4040 is not set
# CONFIG_IPWIRELESS is not set
# CONFIG_MWAVE is not set
CONFIG_RAW_DRIVER=y
CONFIG_MAX_RAW_DEVS=8192
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
CONFIG_TCG_TPM=y
CONFIG_TCG_TIS=y
# CONFIG_TCG_TIS_I2C_INFINEON is not set
# CONFIG_TCG_NSC is not set
# CONFIG_TCG_ATMEL is not set
# CONFIG_TCG_INFINEON is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=m

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
CONFIG_I2C_PIIX4=m
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EG20T is not set
# CONFIG_I2C_GPIO is not set
# CONFIG_I2C_INTEL_MID is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_PXA_PCI is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_SPI is not set
# CONFIG_HSI is not set

#
# PPS support
#
# CONFIG_PPS is not set

#
# PPS generators support
#

#
# PTP clock support
#

#
# Enable Device Drivers -> PPS to see the PTP clock options.
#
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIOLIB=y
# CONFIG_GPIO_SYSFS is not set

#
# Memory mapped GPIO drivers:
#
# CONFIG_GPIO_GENERIC_PLATFORM is not set
# CONFIG_GPIO_IT8761E is not set
# CONFIG_GPIO_SCH is not set
# CONFIG_GPIO_ICH is not set
# CONFIG_GPIO_VX855 is not set

#
# I2C GPIO expanders:
#
# CONFIG_GPIO_MAX7300 is not set
# CONFIG_GPIO_MAX732X is not set
# CONFIG_GPIO_PCA953X is not set
# CONFIG_GPIO_PCF857X is not set
# CONFIG_GPIO_ADP5588 is not set

#
# PCI GPIO expanders:
#
# CONFIG_GPIO_BT8XX is not set
# CONFIG_GPIO_AMD8111 is not set
# CONFIG_GPIO_LANGWELL is not set
# CONFIG_GPIO_PCH is not set
# CONFIG_GPIO_ML_IOH is not set
# CONFIG_GPIO_RDC321X is not set

#
# SPI GPIO expanders:
#
# CONFIG_GPIO_MCP23S08 is not set

#
# AC97 GPIO expanders:
#

#
# MODULbus GPIO expanders:
#
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_BATTERY_BQ27x00 is not set
# CONFIG_BATTERY_MAX17040 is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_GPIO is not set
# CONFIG_CHARGER_MANAGER is not set
# CONFIG_CHARGER_SMB347 is not set
# CONFIG_POWER_AVS is not set
CONFIG_HWMON=m
# CONFIG_HWMON_VID is not set
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7410 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_K10TEMP is not set
# CONFIG_SENSORS_FAM15H_POWER is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_GPIO_FAN is not set
# CONFIG_SENSORS_HIH6130 is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_NTC_THERMISTOR is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SHT15 is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_SMM665 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_SCH56XX_COMMON is not set
# CONFIG_SENSORS_SCH5627 is not set
# CONFIG_SENSORS_SCH5636 is not set
# CONFIG_SENSORS_ADS1015 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_APPLESMC is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
CONFIG_THERMAL=y
# CONFIG_CPU_THERMAL is not set
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
# CONFIG_WATCHDOG_NOWAYOUT is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_F71808E_WDT is not set
# CONFIG_SP5100_TCO is not set
# CONFIG_SC520_WDT is not set
# CONFIG_SBC_FITPC2_WATCHDOG is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_IE6XX_WDT is not set
# CONFIG_ITCO_WDT is not set
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_NV_TCO is not set
# CONFIG_60XX_WDT is not set
# CONFIG_SBC8360_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_VIA_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83697HF_WDT is not set
# CONFIG_W83697UG_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
# CONFIG_XEN_WDT is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y

#
# Broadcom specific AMBA
#
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS65010 is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65217 is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_ABX500_CORE is not set
# CONFIG_MFD_CS5535 is not set
# CONFIG_MFD_TIMBERDALE is not set
# CONFIG_LPC_SCH is not set
# CONFIG_LPC_ICH is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_WL1273_CORE is not set
CONFIG_REGULATOR=y
# CONFIG_REGULATOR_DEBUG is not set
# CONFIG_REGULATOR_DUMMY is not set
# CONFIG_REGULATOR_FIXED_VOLTAGE is not set
# CONFIG_REGULATOR_VIRTUAL_CONSUMER is not set
# CONFIG_REGULATOR_USERSPACE_CONSUMER is not set
# CONFIG_REGULATOR_GPIO is not set
# CONFIG_REGULATOR_AD5398 is not set
# CONFIG_REGULATOR_FAN53555 is not set
# CONFIG_REGULATOR_ISL6271A is not set
# CONFIG_REGULATOR_MAX1586 is not set
# CONFIG_REGULATOR_MAX8649 is not set
# CONFIG_REGULATOR_MAX8660 is not set
# CONFIG_REGULATOR_MAX8952 is not set
# CONFIG_REGULATOR_LP3971 is not set
# CONFIG_REGULATOR_LP3972 is not set
# CONFIG_REGULATOR_TPS62360 is not set
# CONFIG_REGULATOR_TPS65023 is not set
# CONFIG_REGULATOR_TPS6507X is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=64
# CONFIG_VGA_SWITCHEROO is not set
CONFIG_DRM=m
CONFIG_DRM_KMS_HELPER=m
# CONFIG_DRM_LOAD_EDID_FIRMWARE is not set
CONFIG_DRM_TTM=m
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
CONFIG_DRM_RADEON=m
CONFIG_DRM_RADEON_KMS=y
# CONFIG_DRM_NOUVEAU is not set

#
# I2C encoder or helper chips
#
# CONFIG_DRM_I2C_CH7006 is not set
# CONFIG_DRM_I2C_SIL164 is not set
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I915 is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
# CONFIG_DRM_VMWGFX is not set
# CONFIG_DRM_GMA500 is not set
# CONFIG_DRM_UDL is not set
# CONFIG_DRM_AST is not set
# CONFIG_DRM_MGAG200 is not set
# CONFIG_DRM_CIRRUS_QEMU is not set
# CONFIG_STUB_POULSBO is not set
# CONFIG_VGASTATE is not set
# CONFIG_VIDEO_OUTPUT_CONTROL is not set
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
# CONFIG_FB_DDC is not set
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=y
CONFIG_FB_SYS_COPYAREA=y
CONFIG_FB_SYS_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_SYS_FOPS=y
# CONFIG_FB_WMT_GE_ROPS is not set
CONFIG_FB_DEFERRED_IO=y
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
CONFIG_FB_VESA=y
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_VIRTUAL is not set
CONFIG_XEN_FBDEV_FRONTEND=y
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
# CONFIG_FB_AUO_K190X is not set
# CONFIG_EXYNOS_VIDEO is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_GENERIC is not set
# CONFIG_BACKLIGHT_APPLE is not set
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_LM3630 is not set
# CONFIG_BACKLIGHT_LM3639 is not set
# CONFIG_BACKLIGHT_LP855X is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_SOUND is not set

#
# HID support
#
CONFIG_HID=y
# CONFIG_HID_BATTERY_STRENGTH is not set
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
CONFIG_HID_A4TECH=y
# CONFIG_HID_ACRUX is not set
CONFIG_HID_APPLE=y
# CONFIG_HID_AUREAL is not set
CONFIG_HID_BELKIN=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_CYPRESS=y
CONFIG_HID_DRAGONRISE=y
# CONFIG_DRAGONRISE_FF is not set
# CONFIG_HID_EMS_FF is not set
CONFIG_HID_EZKEY=y
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_KEYTOUCH is not set
CONFIG_HID_KYE=y
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
CONFIG_HID_GYRATION=y
CONFIG_HID_TWINHAN=y
CONFIG_HID_KENSINGTON=y
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LENOVO_TPKBD is not set
CONFIG_HID_LOGITECH=y
# CONFIG_HID_LOGITECH_DJ is not set
# CONFIG_LOGITECH_FF is not set
# CONFIG_LOGIRUMBLEPAD2_FF is not set
# CONFIG_LOGIG940_FF is not set
# CONFIG_LOGIWHEELS_FF is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
# CONFIG_HID_MULTITOUCH is not set
CONFIG_HID_NTRIG=y
# CONFIG_HID_ORTEK is not set
CONFIG_HID_PANTHERLORD=y
# CONFIG_PANTHERLORD_FF is not set
CONFIG_HID_PETALYNX=y
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
CONFIG_HID_SAMSUNG=y
CONFIG_HID_SONY=y
# CONFIG_HID_SPEEDLINK is not set
CONFIG_HID_SUNPLUS=y
CONFIG_HID_GREENASIA=y
# CONFIG_GREENASIA_FF is not set
CONFIG_HID_SMARTJOYPLUS=y
CONFIG_SMARTJOYPLUS_FF=y
# CONFIG_HID_TIVO is not set
CONFIG_HID_TOPSEED=y
CONFIG_HID_THRUSTMASTER=y
# CONFIG_THRUSTMASTER_FF is not set
CONFIG_HID_ZEROPLUS=y
# CONFIG_ZEROPLUS_FF is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set

#
# USB HID support
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB_ARCH_HAS_XHCI=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
# CONFIG_USB_DYNAMIC_MINORS is not set
CONFIG_USB_SUSPEND=y
# CONFIG_USB_OTG is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
# CONFIG_USB_XHCI_HCD is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
# CONFIG_USB_ISP1362_HCD is not set
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_CHIPIDEA is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
# CONFIG_USB_STORAGE is not set
# CONFIG_USB_UAS is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
# CONFIG_USB_EZUSB_FX2 is not set

#
# USB Physical Layer drivers
#
# CONFIG_OMAP_USB2 is not set
# CONFIG_USB_ISP1301 is not set
# CONFIG_USB_GADGET is not set

#
# OTG and related infrastructure
#
# CONFIG_USB_GPIO_VBUS is not set
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
# CONFIG_LEDS_LM3530 is not set
# CONFIG_LEDS_LM3642 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_GPIO is not set
# CONFIG_LEDS_LP3944 is not set
# CONFIG_LEDS_LP5521 is not set
# CONFIG_LEDS_LP5523 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_PCA9633 is not set
# CONFIG_LEDS_REGULATOR is not set
# CONFIG_LEDS_BD2802 is not set
# CONFIG_LEDS_INTEL_SS4200 is not set
# CONFIG_LEDS_LT3593 is not set
# CONFIG_LEDS_TCA6507 is not set
# CONFIG_LEDS_LM355x is not set
# CONFIG_LEDS_OT200 is not set
# CONFIG_LEDS_BLINKM is not set
CONFIG_LEDS_TRIGGERS=y

#
# LED Triggers
#
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_ONESHOT is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_CPU is not set
# CONFIG_LEDS_TRIGGER_GPIO is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_LEDS_TRIGGER_TRANSIENT is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC=y

#
# Reporting subsystems
#
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
# CONFIG_EDAC_DECODE_MCE is not set
# CONFIG_EDAC_MM_EDAC is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set
# CONFIG_RTC_DRV_DS2404 is not set

#
# on-CPU RTC drivers
#
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
# CONFIG_INTEL_MID_DMAC is not set
# CONFIG_INTEL_IOATDMA is not set
# CONFIG_TIMB_DMA is not set
# CONFIG_PCH_DMA is not set
CONFIG_AUXDISPLAY=y
# CONFIG_UIO is not set
# CONFIG_VFIO is not set

#
# Virtio drivers
#
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_MMIO is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_HYPERV is not set

#
# Xen driver support
#
CONFIG_XEN_BALLOON=y
# CONFIG_XEN_BALLOON_MEMORY_HOTPLUG is not set
CONFIG_XEN_SCRUB_PAGES=y
# CONFIG_XEN_DEV_EVTCHN is not set
CONFIG_XEN_BACKEND=y
# CONFIG_XENFS is not set
CONFIG_XEN_SYS_HYPERVISOR=y
CONFIG_XEN_XENBUS_FRONTEND=y
# CONFIG_XEN_GNTDEV is not set
# CONFIG_XEN_GRANT_DEV_ALLOC is not set
CONFIG_SWIOTLB_XEN=y
# CONFIG_XEN_PCIDEV_BACKEND is not set
CONFIG_XEN_PRIVCMD=m
# CONFIG_XEN_ACPI_PROCESSOR is not set
# CONFIG_XEN_MCE_LOG is not set
CONFIG_STAGING=y
# CONFIG_ET131X is not set
# CONFIG_SLICOSS is not set
# CONFIG_USBIP_CORE is not set
# CONFIG_ECHO is not set
# CONFIG_COMEDI is not set
# CONFIG_ASUS_OLED is not set
# CONFIG_R8187SE is not set
# CONFIG_RTL8192U is not set
# CONFIG_RTLLIB is not set
# CONFIG_R8712U is not set
# CONFIG_RTS_PSTOR is not set
# CONFIG_RTS5139 is not set
# CONFIG_TRANZPORT is not set
# CONFIG_IDE_PHISON is not set
# CONFIG_VT6655 is not set
# CONFIG_VT6656 is not set
# CONFIG_DX_SEP is not set
# CONFIG_ZSMALLOC is not set
# CONFIG_WLAGS49_H2 is not set
# CONFIG_WLAGS49_H25 is not set
# CONFIG_FB_SM7XX is not set
# CONFIG_CRYSTALHD is not set
# CONFIG_FB_XGI is not set
# CONFIG_ACPI_QUICKSTART is not set
# CONFIG_USB_ENESTORAGE is not set
# CONFIG_BCM_WIMAX is not set
# CONFIG_FT1000 is not set

#
# Speakup console speech
#
# CONFIG_SPEAKUP is not set
# CONFIG_TOUCHSCREEN_CLEARPAD_TM1217 is not set
# CONFIG_TOUCHSCREEN_SYNAPTICS_I2C_RMI4 is not set
# CONFIG_STAGING_MEDIA is not set

#
# Android
#
# CONFIG_ANDROID is not set
# CONFIG_PHONE is not set
# CONFIG_USB_WPAN_HCD is not set
# CONFIG_IPACK_BUS is not set
# CONFIG_WIMAX_GDM72XX is not set
CONFIG_NET_VENDOR_SILICOM=y
# CONFIG_SBYPASS is not set
# CONFIG_BPCTL is not set
# CONFIG_CED1401 is not set
# CONFIG_DGRP is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACERHDF is not set
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_FUJITSU_TABLET is not set
# CONFIG_HP_ACCEL is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ACPI_WMI is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_IBM_RTL is not set
# CONFIG_XO15_EBOOK is not set
# CONFIG_SAMSUNG_LAPTOP is not set
# CONFIG_SAMSUNG_Q10 is not set
# CONFIG_APPLE_GMUX is not set

#
# Hardware Spinlock drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_STATS=y
# CONFIG_AMD_IOMMU_V2 is not set
# CONFIG_INTEL_IOMMU is not set
# CONFIG_IRQ_REMAP is not set

#
# Remoteproc drivers (EXPERIMENTAL)
#
# CONFIG_STE_MODEM_RPROC is not set

#
# Rpmsg drivers (EXPERIMENTAL)
#
# CONFIG_VIRT_DRIVERS is not set
# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_VME_BUS is not set
# CONFIG_PWM is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_EFI_VARS=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
# CONFIG_DMI_SYSFS is not set
CONFIG_ISCSI_IBFT_FIND=y
# CONFIG_ISCSI_IBFT is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_EXT2_FS is not set
CONFIG_EXT3_FS=m
CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4_FS is not set
CONFIG_JBD=m
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=m
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_FANOTIFY is not set
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_PRINT_QUOTA_WARNING=y
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
CONFIG_AUTOFS4_FS=m
# CONFIG_FUSE_FS is not set
CONFIG_GENERIC_ACL=y

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
# CONFIG_UDF_FS is not set

#
# DOS/FAT/NT Filesystems
#
# CONFIG_MSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
# CONFIG_CONFIGFS_FS is not set
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_JFFS2_FS is not set
# CONFIG_LOGFS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
# CONFIG_PSTORE_CONSOLE is not set
# CONFIG_PSTORE_FTRACE is not set
# CONFIG_PSTORE_RAM is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_CEPH_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
# CONFIG_NLS_UTF8 is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
CONFIG_DEFAULT_MESSAGE_LOGLEVEL=7
# CONFIG_ENABLE_WARN_DEPRECATED is not set
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
CONFIG_STRIP_ASM_SYMS=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
# CONFIG_DEBUG_SECTION_MISMATCH is not set
# CONFIG_DEBUG_KERNEL is not set
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_SPARSE_RCU_POINTER is not set
CONFIG_STACKTRACE=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_ARCH_WANT_FRAME_POINTERS=y
# CONFIG_FRAME_POINTER is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_LKDTM is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_EVENT_POWER_TRACING_DEPRECATED=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
# CONFIG_IRQSOFF_TRACER is not set
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
CONFIG_STACK_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENT=y
# CONFIG_UPROBE_EVENT is not set
CONFIG_PROBE_EVENTS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_ATOMIC64_SELFTEST is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_HAVE_ARCH_KMEMCHECK=y
# CONFIG_TEST_KSTRTOX is not set
CONFIG_STRICT_DEVMEM=y
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
# CONFIG_DEBUG_SET_MODULE_RONX is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
CONFIG_OPTIMIZE_INLINING=y

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_TRUSTED_KEYS is not set
# CONFIG_ENCRYPTED_KEYS is not set
CONFIG_KEYS_DEBUG_PROC_KEYS=y
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_NETWORK_XFRM=y
# CONFIG_SECURITY_PATH is not set
CONFIG_LSM_MMAP_MIN_ADDR=65535
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_SECURITY_YAMA is not set
CONFIG_INTEGRITY=y
# CONFIG_INTEGRITY_SIGNATURE is not set
CONFIG_IMA=y
CONFIG_IMA_MEASURE_PCR_IDX=10
CONFIG_IMA_AUDIT=y
CONFIG_IMA_LSM_RULES=y
# CONFIG_IMA_APPRAISE is not set
# CONFIG_EVM is not set
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_DEFAULT_SECURITY="selinux"
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_WORKQUEUE=y
# CONFIG_CRYPTO_CRYPTD is not set
# CONFIG_CRYPTO_AUTHENC is not set
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
# CONFIG_CRYPTO_CBC is not set
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_GHASH is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_X86_64 is not set
# CONFIG_CRYPTO_AES_NI_INTEL is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_ZLIB is not set
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_USER_API_HASH is not set
# CONFIG_CRYPTO_USER_API_SKCIPHER is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_ASYMMETRIC_KEY_TYPE is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_APIC_ARCHITECTURE=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
CONFIG_KVM_INTEL=m
# CONFIG_KVM_AMD is not set
# CONFIG_KVM_MMU_AUDIT is not set
# CONFIG_VHOST_NET is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_IO=y
# CONFIG_CRC_CCITT is not set
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=m
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC7 is not set
# CONFIG_LIBCRC32C is not set
# CONFIG_CRC8 is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_NLATTR=y
CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE=y
CONFIG_AVERAGE=y
# CONFIG_CORDIC is not set
# CONFIG_DDR is not set


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-10 15:24       ` Mel Gorman
@ 2012-12-11  1:02         ` Mel Gorman
  2012-12-11  8:52           ` Ingo Molnar
  0 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2012-12-11  1:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

On Mon, Dec 10, 2012 at 03:24:05PM +0000, Mel Gorman wrote:
> For example, I think that point 5 above is the potential source of the
> corruption because you're not flushing the TLBs for the PTEs you are
> updating in batch. Granted, you're relaxing rather than restricting access
> so it should be ok and at worst cause a spurious fault but I also find
> it suspicious that you do not recheck pte_same under the PTL when doing
> the final PTE update.

Looking again, the lack of a pte_same check should be ok. The handling of
addr, addr_start, ptep and ptep_start is a little messy but also looks fine.
You're not accidentally crossing a PMD boundary. You should be protected
against huge pages being collapsed underneath you as you hold mmap_sem for
read. If the first page in the pmd (or VMA) is not present then
target_nid == -1 which gets passed into __do_numa_page. This check

        if (target_nid == -1 || target_nid == page_nid)
                goto out;

then means you never actually migrate for that whole PMD and will just
clear the PTEs. Possibly wrong, but not what we're looking for. Holding
PTL across task_numa_fault is bad, but not the bad we're looking for.
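
As a rough illustration of the consequence described above (a userspace
sketch, not code from either tree; NID_NONE, should_migrate() and
count_migrations() are hypothetical names), a single target_nid of -1
applied to a whole batch means no page in it is ever migrated:

```c
#include <stdbool.h>
#include <stddef.h>

#define NID_NONE (-1)	/* hypothetical name for the "no target node" value */

/* Mirrors the quoted check: both conditions correspond to "goto out". */
static bool should_migrate(int target_nid, int page_nid)
{
	if (target_nid == NID_NONE || target_nid == page_nid)
		return false;
	return true;
}

/* Count the pages in a batch that would migrate for one shared target_nid. */
static size_t count_migrations(int target_nid, const int *page_nids, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (should_migrate(target_nid, page_nids[i]))
			count++;
	return count;
}
```

With page nodes {0, 1, 1, 0}, a target of NID_NONE yields zero migrations,
while a target of node 1 would migrate the two pages sitting on node 0.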

/me scratches his head

Machine is still unavailable so in an attempt to rattle this out I prototyped
the equivalent patch for balancenuma and then went back to numacore to see
if I could spot a major difference.  Comparing them, there is no guarantee
you clear pte_numa for the address that was originally faulted if there was
a racing fault that cleared it underneath you but in itself that should not
be an issue. Your use of ptep++ instead of pte_offset_map() might break
on 32-bit with NUMA support if PTE pages are stored in highmem. Still the
wrong kind of wrong.

If the bug is indeed here, it's not obvious. I don't know why I'm
triggering it or why it only triggers for specjbb as I cannot imagine
what the JVM would be doing that is that weird or that would not have
triggered before. Maybe we both suffer from this type of problem but only
numacore's rate of migration is able to trigger it.

> Basically if I felt that handling ptes in batch like this was of
> critical importance I would have implemented it very differently on top of
> balancenuma. I would have only taken the PTL lock if updating the PTE to
> keep contention down and redone racy checks under PTL, I'd have only used
> trylock for every non-faulted PTE and I would only have migrated if it
> was a remote->local copy. I certainly would not hold PTL while calling
> task_numa_fault(). I would have kept the handling on a per-pmd basis when
> it was expected that most PTEs underneath should be on the same node.
> 

This is a prototype only, but it is what I was using as a reference to see
if I could spot a problem in yours. It has not even been boot tested but
avoids remote->remote copies, contending on PTL or holding it longer than
necessary (it should anyway).
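
The "trylock, then recheck under the lock" idea can be sketched in
userspace roughly as follows. This is only an illustration under stated
assumptions: struct slot stands in for a PTE plus its lock, a C11
atomic_flag stands in for the PTL, and none of the names below are kernel
APIs.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one entry and the lock that guards it. */
struct slot {
	atomic_flag locked;	/* set while somebody holds the slot */
	atomic_ulong value;	/* the entry itself */
};

static bool slot_trylock(struct slot *s)
{
	/* test_and_set returns the previous state: false means we got it */
	return !atomic_flag_test_and_set_explicit(&s->locked,
						  memory_order_acquire);
}

static void slot_unlock(struct slot *s)
{
	atomic_flag_clear_explicit(&s->locked, memory_order_release);
}

/*
 * Replace `seen` with `next`. Succeeds only if the lock could be taken
 * without waiting and the entry still holds the value that was read
 * before locking; otherwise the entry is left alone.
 */
static bool update_if_unchanged(struct slot *s, unsigned long seen,
				unsigned long next)
{
	bool updated = false;

	if (!slot_trylock(s))
		return false;		/* contended: skip instead of waiting */

	if (atomic_load(&s->value) == seen) {	/* recheck under the lock */
		atomic_store(&s->value, next);
		updated = true;
	}
	slot_unlock(s);
	return updated;
}
```

A caller reads the entry without the lock, decides it is interesting, and
then calls update_if_unchanged() with the value it saw. A false return
covers both "someone else holds the lock" and "the entry changed in the
meantime", and in either case the caller simply moves on.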

---8<---
mm: numa: Batch pte handling

diff --git a/mm/memory.c b/mm/memory.c
index 33e20b3..f871d5d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3461,30 +3461,14 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 	return mpol_misplaced(page, vma, addr);
 }
 
-int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+static
+int __do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd,
+		   spinlock_t *ptl, bool only_local, bool *migrated)
 {
 	struct page *page = NULL;
-	spinlock_t *ptl;
 	int current_nid = -1;
 	int target_nid;
-	bool migrated = false;
-
-	/*
-	* The "pte" at this point cannot be used safely without
-	* validation through pte_unmap_same(). It's of NUMA type but
-	* the pfn may be screwed if the read is non atomic.
-	*
-	* ptep_modify_prot_start is not called as this is clearing
-	* the _PAGE_NUMA bit and it is not really expected that there
-	* would be concurrent hardware modifications to the PTE.
-	*/
-	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
-	if (unlikely(!pte_same(*ptep, pte))) {
-		pte_unmap_unlock(ptep, ptl);
-		goto out;
-	}
 
 	pte = pte_mknonnuma(pte);
 	set_pte_at(mm, addr, ptep, pte);
@@ -3493,7 +3477,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = vm_normal_page(vma, addr, pte);
 	if (!page) {
 		pte_unmap_unlock(ptep, ptl);
-		return 0;
+		goto out;
 	}
 
 	current_nid = page_to_nid(page);
@@ -3509,15 +3493,88 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out;
 	}
 
+	/*
+	 * Only do remote-local copies when handling PTEs in batch. This does
+	 * mean we effectively lost the NUMA hinting fault if the workload
+	 * was not converged on a PMD boundary. This is bad but is it worse
+	 * than doing a remote->remote copy?
+	 */
+	if (only_local && target_nid != numa_node_id()) {
+		current_nid = -1;
+		put_page(page);
+		goto out;
+	}
+
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, target_nid);
-	if (migrated)
+	*migrated = migrate_misplaced_page(page, target_nid);
+	if (*migrated)
 		current_nid = target_nid;
 
 out:
+	return current_nid;
+}
+
+int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+{
+	spinlock_t *ptl;
+	int current_nid = -1;
+	bool migrated = false;
+	unsigned long end_addr;
+
+	/*
+	* The "pte" at this point cannot be used safely without
+	* validation through pte_unmap_same(). It's of NUMA type but
+	* the pfn may be screwed if the read is non atomic.
+	*
+	* ptep_modify_prot_start is not called as this is clearing
+	* the _PAGE_NUMA bit and it is not really expected that there
+	* would be concurrent hardware modifications to the PTE.
+	*/
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*ptep, pte))) {
+		pte_unmap_unlock(ptep, ptl);
+		goto out;
+	}
+
+	current_nid = __do_numa_page(mm, vma, addr, pte, ptep, pmd, ptl, false, &migrated);
+
+	/* Batch handle all PTEs in this area. PTL is not held initially */
+	addr = max(addr & PMD_MASK, vma->vm_start);
+	end_addr = min((addr + PMD_SIZE) & PMD_MASK, vma->vm_end);
+	for (; addr < end_addr; addr += PAGE_SIZE) {
+		bool batch_migrated = false;
+		int batch_nid = -1;
+
+		ptep = pte_offset_map(pmd, addr);
+		pte = *ptep;
+		if (!pte_present(pte))
+			continue;
+		if (!pte_numa(pte))
+			continue;
+
+		if (!spin_trylock(ptl)) {
+			pte_unmap(ptep);
+			break;
+		}
+
+		/* Recheck PTE under lock */
+		if (!pte_same(*ptep, pte)) {
+			pte_unmap_unlock(ptep, ptl);
+			continue;
+		}
+
+		batch_nid = __do_numa_page(mm, vma, addr, pte, ptep, pmd, ptl, true, &batch_migrated);
+		if (batch_nid != -1)
+			task_numa_fault(batch_nid, 1, batch_migrated);
+	}
+
+out:
 	if (current_nid != -1)
 		task_numa_fault(current_nid, 1, migrated);
 	return 0;
+
 }
 
 /* NUMA hinting page fault entry point for regular pmds */
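
The address clamping used by the batch loop above can be checked in
isolation with a small userspace sketch. It assumes a 2 MiB PMD span (as
on x86-64 with 4 KiB pages); SKETCH_PMD_SIZE, struct window and
pmd_window() are illustrative names, not kernel definitions.

```c
#include <stdint.h>

/* Assumption: one PMD maps 2 MiB, as on x86-64 with 4 KiB pages. */
#define SKETCH_PMD_SIZE	(UINT64_C(1) << 21)
#define SKETCH_PMD_MASK	(~(SKETCH_PMD_SIZE - 1))

struct window {
	uint64_t start;		/* first address to scan (inclusive) */
	uint64_t end;		/* end of the scan (exclusive) */
};

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }
static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/*
 * Compute the range covered by the PMD containing addr, clipped to the
 * VMA [vm_start, vm_end). Expects vm_start <= addr < vm_end.
 */
static struct window pmd_window(uint64_t addr, uint64_t vm_start,
				uint64_t vm_end)
{
	struct window w;

	w.start = max_u64(addr & SKETCH_PMD_MASK, vm_start);
	/* round the clipped start up to the next PMD boundary, then clip */
	w.end = min_u64((w.start + SKETCH_PMD_SIZE) & SKETCH_PMD_MASK, vm_end);
	return w;
}
```

For a fault at 0x601000 in a VMA spanning 0x500000-0xa00000 this gives
the full PMD, 0x600000-0x800000. For a VMA of 0x680000-0x700000 that sits
inside a single PMD, the window shrinks to the VMA itself, so the loop
never walks outside either the PMD or the VMA.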

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-11  1:02         ` Mel Gorman
@ 2012-12-11  8:52           ` Ingo Molnar
  2012-12-11  9:18             ` Ingo Molnar
  2012-12-11 16:30             ` Mel Gorman
  0 siblings, 2 replies; 80+ messages in thread
From: Ingo Molnar @ 2012-12-11  8:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> On Mon, Dec 10, 2012 at 03:24:05PM +0000, Mel Gorman wrote:
> > For example, I think that point 5 above is the potential source of the
> > corruption because you're not flushing the TLBs for the PTEs you are
> > updating in batch. Granted, you're relaxing rather than restricting access
> > so it should be ok and at worst cause a spurious fault but I also find
> > it suspicious that you do not recheck pte_same under the PTL when doing
> > the final PTE update.
> 
> Looking again, the lack of a pte_same check should be ok. The
> handling of addr, addr_start, ptep and ptep_start is a little
> messy but also looks fine. You're not accidentally crossing a
> PMD boundary. You should be protected against huge pages being
> collapsed underneath you as you hold mmap_sem for read. If the
> first page in the pmd (or VMA) is not present then target_nid
> == -1 which gets passed into __do_numa_page. This check
> 
>         if (target_nid == -1 || target_nid == page_nid)
>                 goto out;
> 
> then means you never actually migrate for that whole PMD and 
> will just clear the PTEs. [...]

Yes.

> [...] Possibly wrong, but not what we're looking for. [...]

It's a detail - I thought not touching partial 2MB pages is just 
as valid as picking some other page to represent it, and went 
for the simpler option.

But yes, I agree that using the first present page would be 
better, as it would better handle partial vmas not 
starting/ending at a 2MB boundary - which happens frequently in 
practice.

> [...] Holding PTL across task_numa_fault is bad, but not the 
> bad we're looking for.

No, holding the PTL across task_numa_fault() is fine, because 
this bit got reworked in my tree rather significantly, see:

 6030a23a1c66 sched: Move the NUMA placement logic to a worklet

and followup patches.

> /me scratches his head
> 
> Machine is still unavailable so in an attempt to rattle this
> out I prototyped the equivalent patch for balancenuma and then
> went back to numacore to see if I could spot a major
> difference.  Comparing them, there is no guarantee you clear
> pte_numa for the address that was originally faulted if there
> was a racing fault that cleared it underneath you but in itself
> that should not be an issue. Your use of ptep++ instead of
> pte_offset_map() might break on 32-bit with NUMA support if
> PTE pages are stored in highmem. Still the wrong kind of wrong.

Yes.

> If the bug is indeed here, it's not obvious. I don't know why 
> I'm triggering it or why it only triggers for specjbb as I 
> cannot imagine what the JVM would be doing that is that weird 
> or that would not have triggered before. Maybe we both suffer
> from this type of problem but only numacore's rate of migration
> is able to trigger it.

Agreed.

> > Basically if I felt that handling ptes in batch like this
> > was of critical importance I would have implemented it very
> > differently on top of balancenuma. I would have only taken
> > the PTL lock if updating the PTE to keep contention down and
> > redone racy checks under PTL, I'd have only used trylock for
> > every non-faulted PTE and I would only have migrated if it
> > was a remote->local copy. I certainly would not hold PTL
> > while calling task_numa_fault(). I would have kept the
> > handling on a per-pmd basis when it was expected that most
> > PTEs underneath should be on the same node.
> 
> This is a prototype only, but it is what I was using as a
> reference to see if I could spot a problem in yours. It has not
> even been boot tested but avoids remote->remote copies,
> contending on PTL or holding it longer than necessary (it
> should anyway).

So ... because time is running out and it would be nice to 
progress with this for v3.8, I'd suggest the following approach:

 - Please send your current tree to Linus as-is. You already 
   have my Acked-by/Reviewed-by for its scheduler bits, and my
   testing found your tree to have no regression to mainline,
   plus it's a nice win in a number of NUMA-intense workloads.
   So it's a good, monotonic step forward in terms of NUMA
   balancing, very close to what the bits I'm working on need as
   infrastructure.

 - I'll rebase all my devel bits on top of it. Instead of
   removing the migration bandwidth I'll simply increase it for
   testing - this should trigger similarly aggressive behavior.
   I'll try to touch as little of the mm/ code as possible, to
   keep things debuggable.

If the JVM segfault is a bug introduced by some non-obvious 
difference only present in numa/core and fixed in your tree then 
the bug will be fixed magically and we can forget about it.

If it's something latent in your tree as well, then at least we 
will be able to stare at the exact same tree, instead of 
endlessly wondering about small, unnecessary differences.

( My gut feeling is that it's 50%/50%, I really cannot exclude
  either of the two possibilities. )

Agreed?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-11  8:52           ` Ingo Molnar
@ 2012-12-11  9:18             ` Ingo Molnar
  2012-12-11 15:22               ` Mel Gorman
  2012-12-11 16:30             ` Mel Gorman
  1 sibling, 1 reply; 80+ messages in thread
From: Ingo Molnar @ 2012-12-11  9:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML


* Ingo Molnar <mingo@kernel.org> wrote:

> > This is prototype only but what I was using as a reference 
> > to see could I spot a problem in yours. It has not been even 
> > boot tested but avoids remote->remote copies, contending on 
> > PTL or holding it longer than necessary (should anyway)
> 
> So ... because time is running out and it would be nice to 
> progress with this for v3.8, I'd suggest the following 
> approach:
> 
>  - Please send your current tree to Linus as-is. You already 
>    have my Acked-by/Reviewed-by for its scheduler bits, and my
>    testing found your tree to have no regression to mainline,
>    plus it's a nice win in a number of NUMA-intense workloads.
>    So it's a good, monotonic step forward in terms of NUMA
>    balancing, very close to what the bits I'm working on need as
>    infrastructure.
> 
>  - I'll rebase all my devel bits on top of it. Instead of
>    removing the migration bandwidth I'll simply increase it for
>    testing - this should trigger similarly aggressive behavior.
>    I'll try to touch as little of the mm/ code as possible, to
>    keep things debuggable.

One minor last-minute request/nit before you send it to Linus, 
would you mind doing a:

   CONFIG_BALANCE_NUMA => CONFIG_NUMA_BALANCING

rename please? (I can do it for you if you don't have the time.)

CONFIG_NUMA_BALANCING is really what fits into our existing NUMA 
namespace, CONFIG_NUMA, CONFIG_NUMA_EMU - and, more importantly, 
the ordering of words follows the common generic -> less generic 
ordering we do in the kernel for config names and methods.

So it would fit nicely into existing Kconfig naming schemes:

   CONFIG_TRACING
   CONFIG_FILE_LOCKING
   CONFIG_EVENT_TRACING

etc.

Thanks,

	Ingo


* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-11  9:18             ` Ingo Molnar
@ 2012-12-11 15:22               ` Mel Gorman
  0 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2012-12-11 15:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

On Tue, Dec 11, 2012 at 10:18:07AM +0100, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
> > > This is prototype only but what I was using as a reference 
> > > to see could I spot a problem in yours. It has not been even 
> > > boot tested but avoids remote->remote copies, contending on 
> > > PTL or holding it longer than necessary (should anyway)
> > 
> > So ... because time is running out and it would be nice to 
> > progress with this for v3.8, I'd suggest the following 
> > approach:
> > 
> >  - Please send your current tree to Linus as-is. You already 
> >    have my Acked-by/Reviewed-by for its scheduler bits, and my
> >    testing found your tree to have no regression to mainline,
> >    plus it's a nice win in a number of NUMA-intense workloads.
> >    So it's a good, monotonic step forward in terms of NUMA
> >    balancing, very close to what the bits I'm working on need as
> >    infrastructure.
> > 
> >  - I'll rebase all my devel bits on top of it. Instead of
> >    removing the migration bandwidth I'll simply increase it for
> >    testing - this should trigger similarly aggressive behavior.
> >    I'll try to touch as little of the mm/ code as possible, to
> >    keep things debuggable.
> 
> One minor last-minute request/nit before you send it to Linus, 
> would you mind doing a:
> 
>    CONFIG_BALANCE_NUMA => CONFIG_NUMA_BALANCING
> 
> rename please? (I can do it for you if you don't have the time.)
> 
> CONFIG_NUMA_BALANCING is really what fits into our existing NUMA 
> namespace, CONFIG_NUMA, CONFIG_NUMA_EMU - and, more importantly, 
> the ordering of words follows the common generic -> less generic 
> ordering we do in the kernel for config names and methods.
> 
> So it would fit nicely into existing Kconfig naming schemes:
> 
>    CONFIG_TRACING
>    CONFIG_FILE_LOCKING
>    CONFIG_EVENT_TRACING
> 
> etc.
> 

Yes, that makes sense. I should have spotted the rationale. I also took
the liberty of renaming the command-line parameter and the variables to
be consistent with this.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-11  8:52           ` Ingo Molnar
  2012-12-11  9:18             ` Ingo Molnar
@ 2012-12-11 16:30             ` Mel Gorman
  2012-12-17 10:33               ` Ingo Molnar
  1 sibling, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2012-12-11 16:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

On Tue, Dec 11, 2012 at 09:52:38AM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Mon, Dec 10, 2012 at 03:24:05PM +0000, Mel Gorman wrote:
> > > For example, I think that point 5 above is the potential source of the
> > > corruption because you're not flushing the TLBs for the PTEs you are
> > > updating in batch. Granted, you're relaxing rather than restricting access
> > > so it should be ok and at worst cause a spurious fault but I also find
> > > it suspicious that you do not recheck pte_same under the PTL when doing
> > > the final PTE update.
> > 
> > Looking again, the lack of a pte_same check should be ok. The 
> > addr, addr_start, ptep and ptep_start handling is a little messy
> > but also looks fine. You're not accidentally crossing a PMD 
> > boundary. You should be protected against huge pages being 
> > collapsed underneath you as you hold mmap_sem for read. If the 
> > first page in the pmd (or VMA) is not present then target_nid 
> > == -1 which gets passed into __do_numa_page. This check
> > 
> >         if (target_nid == -1 || target_nid == page_nid)
> >                 goto out;
> > 
> > then means you never actually migrate for that whole PMD and 
> > will just clear the PTEs. [...]
> 
> Yes.
> 
> > [...] Possibly wrong, but not what we're looking for. [...]
> 
> It's a detail - I thought not touching partial 2MB pages is just 
> as valid as picking some other page to represent it, and went 
> for the simpler option.
> 

I very strongly suspect that in the majority of cases it behaves just
as well. I considered whether it makes a difference if the first page
or faulting page was used as the hint but concluded it doesn't.  If the
workload is converged on the PMD, it makes no difference. If it's not,
then tasks are equally affected at least.
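
A minimal userspace sketch of the two choices, for illustration only. It
is not kernel code: struct fake_pte, its fields and the helper names are
hypothetical, and only the -1 convention and the skip check are taken from
the snippet quoted earlier in this mail.

```c
#include <assert.h>
#include <stddef.h>

#define NO_NID (-1)

/* Hypothetical stand-in for a PTE and the page behind it. */
struct fake_pte {
	int present;	/* is the entry populated? */
	int page_nid;	/* node the page currently lives on */
	int wanted_nid;	/* node a placement policy would pick for it */
};

/*
 * Option 1: the first entry represents the whole range. If it is not
 * present the range gets NO_NID and nothing under it is migrated.
 */
int hint_from_first_entry(const struct fake_pte *ptes, size_t nr)
{
	assert(nr > 0);
	return ptes[0].present ? ptes[0].wanted_nid : NO_NID;
}

/*
 * Option 2: the first *present* entry represents the range, which copes
 * with ranges whose leading pages are not mapped.
 */
int hint_from_first_present(const struct fake_pte *ptes, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (ptes[i].present)
			return ptes[i].wanted_nid;
	return NO_NID;
}

/* Same shape as the quoted check: no target, or already there, means skip. */
int should_migrate(int target_nid, int page_nid)
{
	if (target_nid == NO_NID || target_nid == page_nid)
		return 0;
	return 1;
}
```

With a range whose first entry is a hole, option 1 yields NO_NID and the
entries are only cleared, while option 2 still produces a usable hint.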

> But yes, I agree that using the first present page would be 
> better, as it would better handle partial vmas not 
> starting/ending at a 2MB boundary - which happens frequently in 
> practice.
> 
> > [...] Holding PTL across task_numa_fault is bad, but not the 
> > bad we're looking for.
> 
> No, holding the PTL across task_numa_fault() is fine, because 
> this bit got reworked in my tree rather significantly, see:
> 
>  6030a23a1c66 sched: Move the NUMA placement logic to a worklet
> 
> and followup patches.
> 

I believe I see your point. After that patch is applied task_numa_fault()
is a relatively small function and is no longer calling task_numa_placement.
Sure, PTL is held longer than necessary but not enough to cause real
scalability issues.
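
Roughly, the split looks like the sketch below. This is a userspace
illustration under my own assumptions, not the code from that commit:
struct fake_task, its fields and both function names are hypothetical,
and "pick the node with the most faults" is only a placeholder policy.

```c
#include <assert.h>

#define MAX_NODES 4

/* Hypothetical per-task state; field names are illustrative only. */
struct fake_task {
	unsigned long faults[MAX_NODES];
	int placement_pending;	/* set in the fault path, cleared later */
	int preferred_nid;
};

/* Cheap part: only bumps a counter, so holding a lock here costs little. */
void fake_numa_fault(struct fake_task *p, int node, unsigned long pages)
{
	assert(node >= 0 && node < MAX_NODES);
	p->faults[node] += pages;
	p->placement_pending = 1;
}

/* Expensive part: runs later, outside the fault path and without the lock. */
void fake_numa_placement(struct fake_task *p)
{
	int nid, best = 0;

	if (!p->placement_pending)
		return;

	/* Placeholder policy: prefer the node with the most recorded faults. */
	for (nid = 1; nid < MAX_NODES; nid++)
		if (p->faults[nid] > p->faults[best])
			best = nid;

	p->preferred_nid = best;
	p->placement_pending = 0;
}
```

The point is only that the fault path does constant work, so the time
the lock is held no longer depends on how costly placement is.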

> > If the bug is indeed here, it's not obvious. I don't know why 
> > I'm triggering it or why it only triggers for specjbb as I 
> > cannot imagine what the JVM would be doing that is that weird 
> > or that would not have triggered before. Maybe we both suffer 
> > this type of problem but that numacore's rate of migration is 
> > able to trigger it.
> 
> Agreed.
> 

I spent some more time on this today and the bug is *really* hard to trigger
or at least I have been unable to trigger it today. This begs the question
why it triggered three times in relatively quick succession separated by
a few hours when testing numacore on Dec 9th. Other tests ran between the
failures. The first failure results were discarded. I deleted them to see
if the same test reproduced it a second time (it did).

Of the three times this bug triggered in the last week, for two it was
unclear where they crashed but one showed that the bug was triggered by
the JVM's garbage collector. That at least is a corner case and might
explain why it's hard to trigger.

I feel extremely bad about how I reported this because even though we
differ in how we handle faults, I really cannot see any difference that
would explain this and I've looked long enough. Triggering this by the
kernel would *have* to be something like missing a cache or TLB flush
after page tables have been modified or during migration but in most ways
that matter we share that logic. Where we differ, it shouldn't matter.

I'm even contemplating that this is a JVM timing bug that can be triggered if
page migration happens at the wrong time. numacore would only be indirectly
at fault by migrating more often. If this was the case, balancenuma would
hit the problem given enough time.

I'll keep kicking it in the background.

FWIW, numacore pulled yesterday completed the same tests without any error
this time but none of the commits since Dec 9th would account for fixing it.

> > > Basically if I felt that handling ptes in batch like this 
> > > was of critical important I would have implemented it very 
> > > differently on top of balancenuma. I would have only taken 
> > > the PTL lock if updating the PTE to keep contention down and 
> > > redid racy checks under PTL, I'd have only used trylock for 
> > > every non-faulted PTE and I would only have migrated if it 
> > > was a remote->local copy. I certainly would not hold PTL 
> > > while calling task_numa_fault(). I would have kept the 
> > > handling on a per-pmd basis when it was expected that most 
> > > PTEs underneath should be on the same node.
> > 
> > This is prototype only but what I was using as a reference to 
> > see could I spot a problem in yours. It has not been even boot 
> > tested but avoids remote->remote copies, contending on PTL or 
> > holding it longer than necessary (should anyway)
> 
> So ... because time is running out and it would be nice to 
> progress with this for v3.8, I'd suggest the following approach:
> 
>  - Please send your current tree to Linus as-is. You already 
>    have my Acked-by/Reviewed-by for its scheduler bits, and my
>    testing found your tree to have no regression to mainline,
>    plus it's a nice win in a number of NUMA-intense workloads.
>    So it's a good, monotonic step forward in terms of NUMA
>    balancing, very close to what the bits I'm working on need as
>    infrastructure.
> 

Thanks.

>  - I'll rebase all my devel bits on top of it. Instead of
>    removing the migration bandwidth I'll simply increase it for
>    testing - this should trigger similarly aggressive behavior.
>    I'll try to touch as little of the mm/ code as possible, to
>    keep things debuggable.
> 

Agreed. I'll do my best to review the patches on top and any of the MM
changes you want to make. I know that at the very least you'll want to
change what information is sent to task_numa_fault(), last_nid needs to
be renamed and I should review the flag-packing-patch properly with the
view to seeing can that hurt any of the other flags.

> If the JVM segfault is a bug introduced by some non-obvious 
> difference only present in numa/core and fixed in your tree then 
> the bug will be fixed magically and we can forget about it.
> 

Magic fix is the worst of all fixes :(. I'd really like to know why this
triggered but now my big mouth has landed me with the problem. If this
magically goes away then it's either a really-hard-to-hit-JVM error or
far worse from my perspective -- this is a transient hardware error that
was triggered by the machine running at maximum capacity for 6 weeks that
went away when the machine was turned off for a day.

If it turns out to be hardware, it has planked me straight into the asshat
end of the spectrum, particularly after the first THP debacle.

> If it's something latent in your tree as well, then at least we 
> will be able to stare at the exact same tree, instead of 
> endlessly wondering about small, unnecessary differences.
> 

True.

> ( My gut feeling is that it's 50%/50%, I really cannot exclude
>   any of the two possibilities. )
> 

Neither can I but I've managed to convince myself that it *has* to be on
my side somewhere (or VM code, the JVM I'm using or the hardware). I just
have to find where.

> Agreed?
> 

Yes.

I've queued the following for tests before I send the pull request just in
case. The only difference is adding "mm: Check if PTE is already allocated
during page fault" in case it got lost. I'll send the following request
tomorrow unless you have any objections. If any of the signed-offs are in
error, please shout and I'll get them fixed up.

---8<---
This is a pull request for "Automatic NUMA Balancing V11". The list
of changes since commit f4a75d2eb7b1e2206094b901be09adb31ba63681:

  Linux 3.7-rc6 (2012-11-16 17:42:40 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git balancenuma-v11

for you to fetch changes up to 4fc3f1d66b1ef0d7b8dc11f4ff1cc510f78b37d6:

  mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable (2012-12-11 14:43:00 +0000)

There are three implementations for NUMA balancing, this tree (balancenuma),
numacore which has been developed in tip/master and autonuma which is in
aa.git. In almost all respects balancenuma is the dumbest of the three
because its main impact is on the VM side with no attempt to be smart
about scheduling.  In the interest of getting the ball rolling, it would
be desirable to see this much merged for 3.8 with the view to building
scheduler smarts on top and adapting the VM where required for 3.9.

The most recent set of comparisons available from different people are

mel:    https://lkml.org/lkml/2012/12/9/108
mingo:  https://lkml.org/lkml/2012/12/7/331
tglx:   https://lkml.org/lkml/2012/12/10/437
srikar: https://lkml.org/lkml/2012/12/10/397

The results are a mixed bag. In my own tests, balancenuma does reasonably
well. It's dumb as rocks and does not regress against mainline. On the
other hand, Ingo's tests show that balancenuma is incapable of converging
for the workloads driven by perf which is bad but is potentially explained
by the lack of scheduler smarts. Thomas' results show balancenuma improves
on mainline but falls far short of numacore or autonuma. Srikar's results
indicate we all suck on a large machine with imbalanced node sizes.

My own testing showed that recent numacore results have improved
dramatically, particularly in the last week but not universally.  We've
butted heads heavily on system CPU usage and high levels of migration even
when it shows that overall performance is better. There are also cases
where it regresses (in my case, single JVM, THP enabled) but at times the
regressions are for lower numbers of warehouses and not higher numbers so
reports are inconsistent. Recently I reported for numacore that the JVM
was crashing with NullPointerExceptions but currently it's unclear what
the source of this problem is. Initially I thought it was in how numacore
batch handles PTEs but I no longer think this is the case. It's possible
numacore is just able to trigger it due to higher rates of migration.

These reports were quite late in the cycle so I/we would like to start
with this tree as it contains much of the code we can agree on and has
not changed significantly over the last 2-3 weeks.

Andrea Arcangeli (5):
      mm: numa: define _PAGE_NUMA
      mm: numa: pte_numa() and pmd_numa()
      mm: numa: Support NUMA hinting page faults from gup/gup_fast
      mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte
      mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting

Hillf Danton (2):
      mm: numa: split_huge_page: Transfer last_nid on tail page
      mm: numa: migrate: Set last_nid on newly allocated page

Ingo Molnar (3):
      mm: Optimize the TLB flush of sys_mprotect() and change_protection() users
      mm/rmap: Convert the struct anon_vma::mutex to an rwsem
      mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable

Lee Schermerhorn (3):
      mm: mempolicy: Add MPOL_NOOP
      mm: mempolicy: Check for misplaced page
      mm: mempolicy: Add MPOL_MF_LAZY

Mel Gorman (26):
      mm: Check if PTE is already allocated during page fault
      mm: compaction: Move migration fail/success stats to migrate.c
      mm: migrate: Add a tracepoint for migrate_pages
      mm: compaction: Add scanned and isolated counters for compaction
      mm: numa: Create basic numa page hinting infrastructure
      mm: migrate: Drop the misplaced pages reference count if the target node is full
      mm: mempolicy: Use _PAGE_NUMA to migrate pages
      mm: mempolicy: Implement change_prot_numa() in terms of change_protection()
      mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now
      sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges
      mm: numa: Add pte updates, hinting and migration stats
      mm: numa: Migrate on reference policy
      mm: numa: Migrate pages handled during a pmd_numa hinting fault
      mm: numa: Rate limit the amount of memory that is migrated between nodes
      mm: numa: Rate limit setting of pte_numa if node is saturated
      sched: numa: Slowly increase the scanning period as NUMA faults are handled
      mm: numa: Introduce last_nid to the page frame
      mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
      mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
      mm: sched: numa: Control enabling and disabling of NUMA balancing
      mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
      mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
      mm: numa: Add THP migration for the NUMA working set scanning fault case.
      mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
      mm: numa: Account for failed allocations and isolations as migration failures
      mm: migrate: Account a transhuge page properly when rate limiting

Peter Zijlstra (6):
      mm: Count the number of pages affected in change_protection()
      mm: mempolicy: Make MPOL_LOCAL a real policy
      mm: migrate: Introduce migrate_misplaced_page()
      mm: numa: Add fault driven placement and migration
      mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
      mm: sched: numa: Implement slow start for working set sampling

Rik van Riel (5):
      x86: mm: only do a local tlb flush in ptep_set_access_flags()
      x86: mm: drop TLB flush from ptep_set_access_flags
      mm,generic: only flush the local TLB in ptep_set_access_flags
      x86/mm: Introduce pte_accessible()
      mm: Only flush the TLB when clearing an accessible pte

 Documentation/kernel-parameters.txt  |    3 +
 arch/sh/mm/Kconfig                   |    1 +
 arch/x86/Kconfig                     |    2 +
 arch/x86/include/asm/pgtable.h       |   17 +-
 arch/x86/include/asm/pgtable_types.h |   20 ++
 arch/x86/mm/pgtable.c                |    8 +-
 include/asm-generic/pgtable.h        |  110 +++++++++++
 include/linux/huge_mm.h              |   16 +-
 include/linux/hugetlb.h              |    8 +-
 include/linux/mempolicy.h            |    8 +
 include/linux/migrate.h              |   47 ++++-
 include/linux/mm.h                   |   39 ++++
 include/linux/mm_types.h             |   31 ++++
 include/linux/mmzone.h               |   13 ++
 include/linux/rmap.h                 |   33 ++--
 include/linux/sched.h                |   27 +++
 include/linux/vm_event_item.h        |   12 +-
 include/linux/vmstat.h               |    8 +
 include/trace/events/migrate.h       |   51 +++++
 include/uapi/linux/mempolicy.h       |   15 +-
 init/Kconfig                         |   45 +++++
 kernel/fork.c                        |    3 +
 kernel/sched/core.c                  |   71 +++++--
 kernel/sched/fair.c                  |  227 +++++++++++++++++++++++
 kernel/sched/features.h              |   11 ++
 kernel/sched/sched.h                 |   12 ++
 kernel/sysctl.c                      |   45 ++++-
 mm/compaction.c                      |   15 +-
 mm/huge_memory.c                     |  108 ++++++++++-
 mm/hugetlb.c                         |   10 +-
 mm/internal.h                        |    7 +-
 mm/ksm.c                             |    6 +-
 mm/memcontrol.c                      |    7 +-
 mm/memory-failure.c                  |    7 +-
 mm/memory.c                          |  199 +++++++++++++++++++-
 mm/memory_hotplug.c                  |    3 +-
 mm/mempolicy.c                       |  283 +++++++++++++++++++++++++---
 mm/migrate.c                         |  337 +++++++++++++++++++++++++++++++++-
 mm/mmap.c                            |   10 +-
 mm/mprotect.c                        |  135 +++++++++++---
 mm/mremap.c                          |    2 +-
 mm/page_alloc.c                      |   10 +-
 mm/pgtable-generic.c                 |    9 +-
 mm/rmap.c                            |   66 +++----
 mm/vmstat.c                          |   16 +-
 45 files changed, 1940 insertions(+), 173 deletions(-)
 create mode 100644 include/trace/events/migrate.h



* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
                   ` (50 preceding siblings ...)
  2012-12-10 16:42 ` Srikar Dronamraju
@ 2012-12-13 13:21 ` Srikar Dronamraju
  51 siblings, 0 replies; 80+ messages in thread
From: Srikar Dronamraju @ 2012-12-13 13:21 UTC (permalink / raw)
  To: Mel Gorman, Ingo Molnar, Rik van Riel
  Cc: Peter Zijlstra, Andrea Arcangeli, Johannes Weiner, Hugh Dickins,
	Thomas Gleixner, Paul Turner, Hillf Danton, David Rientjes,
	Lee Schermerhorn, Alex Shi, Aneesh Kumar, Linus Torvalds,
	Andrew Morton, Linux-MM, LKML

* Mel Gorman <mgorman@suse.de> [2012-12-07 10:23:03]:

> This is a full release of all the patches so apologies for the flood.  V9 was
> just a MIPS build fix and did not justify a full release. V10 includes Ingo's
> scalability patches because even though they increase system CPU usage,
> they also helped in a number of test cases. It would be worthwhile trying
> to reduce the system CPU usage by looking closer at how rwsem works and
> dealing with the contended case a bit better. Otherwise the rate of change
> in the last few weeks has been tiny as the preliminary objectives had been
> met and I did not want to invalidate any testing other people had conducted.
> 
> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v10r3
> git tag:  git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git mm-balancenuma-v10

Here are the specjbb results on a 2 node 24 GB machine.
vm_1 was allocated 12 GB, while vm_2 and vm_3 were allocated 6 GB each.
All VMs were running the specjbb2005 workload.

All numbers presented are improvements/regressions relative to v3.7-rc8.

----------------------------------------------------------------------------------------------
|                      |     |                          nofit|                            fit|
----------------------------------------------------------------------------------------------
|                      |     |          noksm|            ksm|          noksm|            ksm|
----------------------------------------------------------------------------------------------
|                      |     |  nothp|    thp|  nothp|    thp|  nothp|    thp|  nothp|    thp|
----------------------------------------------------------------------------------------------
| autonuma-mels-rebase | vm_1|   2.48|  14.25|   1.80|  15.59|   8.16|  14.62|   8.56|  17.49|
| autonuma-mels-rebase | vm_2|  23.59|  18.67|  14.20|  23.25|  10.73|  13.18|  17.94|  21.72|
| autonuma-mels-rebase | vm_3|  16.19|  19.40|  14.42|  22.54|  11.08|  12.04|   9.79|  20.34|
----------------------------------------------------------------------------------------------
| mel-balancenuma v10r3| vm_1|   0.10|   1.49|   1.78|   4.00|  -1.01|  -1.16|  -1.02|  -0.60|
| mel-balancenuma v10r3| vm_2|   3.45|  -0.67|  -1.54|   2.65|  -2.83|  -7.10|   0.10|  -2.41|
| mel-balancenuma v10r3| vm_3|   0.56|   5.49|  -0.63|   0.09|  -7.41|  -4.52|  -0.77|  -1.80|
----------------------------------------------------------------------------------------------
| tip-master 11-dec    | vm_1|  -5.68|  12.34|  35.96|  13.33|  10.79|  15.22|   9.65|  12.80|
| tip-master 11-dec    | vm_2|  14.70|  15.54|  77.45|  15.10|  12.82|  11.20|  12.66|  na   |
| tip-master 11-dec    | vm_3|   6.66|  19.26|  na   |  14.93|   7.62|  14.72|  14.73|  12.34|
----------------------------------------------------------------------------------------------


There are a couple of na's. In those cases the test log, for some weird
reason, didn't have any data. This somehow seems to happen with the
tip/master kernel only. Maybe it's just a coincidence.
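
For reference, a sketch of how a relative figure like the ones in the
table could be computed, assuming they are percentage changes in
throughput against the v3.7-rc8 baseline. The units are not stated above,
so that is an assumption, and pct_change() is a hypothetical helper.

```c
#include <assert.h>

/*
 * Percentage change of `result` relative to `baseline`.
 * Positive means an improvement, negative a regression.
 */
double pct_change(double baseline, double result)
{
	assert(baseline > 0.0);
	return (result - baseline) / baseline * 100.0;
}
```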

-- 
Thanks and Regards
Srikar

PS: The benchmark was run under non-standard conditions, only for the
purpose of relative comparison of different kernels.



* Re: [PATCH] sched: Fix task_numa_fault() + KSM crash
  2012-12-10 12:44         ` [PATCH] sched: Fix task_numa_fault() + KSM crash Ingo Molnar
@ 2012-12-13 13:57           ` Srikar Dronamraju
  0 siblings, 0 replies; 80+ messages in thread
From: Srikar Dronamraju @ 2012-12-13 13:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mel Gorman, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML

* Ingo Molnar <mingo@kernel.org> [2012-12-10 13:44:58]:

> Srikar Dronamraju reported that the following assert triggers on 
> his box:
> 
>    kernel BUG at ../kernel/sched/fair.c:2371!
> 
>    Call Trace:
>      [<ffffffff8113cd0e>] __do_numa_page+0xde/0x160
>      [<ffffffff8113de9e>] handle_pte_fault+0x32e/0xcd0
>      [<ffffffffa01c22c0>] ? drop_large_spte+0x30/0x30 [kvm]
>      [<ffffffffa01bf215>] ? kvm_set_spte_hva+0x25/0x30 [kvm]
>      [<ffffffff8113eab9>] handle_mm_fault+0x279/0x760
>      [<ffffffff8115c024>] break_ksm+0x74/0xa0
>      [<ffffffff8115c222>] break_cow+0xa2/0xb0
>      [<ffffffff8115e38c>] ksm_scan_thread+0xb5c/0xd50
>      [<ffffffff810771c0>] ? wake_up_bit+0x40/0x40
>      [<ffffffff8115d830>] ? run_store+0x340/0x340
>      [<ffffffff8107692e>] kthread+0xce/0xe0
> 
> This means that task_numa_fault() was called for a kernel thread
> which has no fault tracking.
> 
> This scenario is actually possible if a kernel thread does
> fault processing on behalf of a user-space task - ignore
> the page fault in that case.
> 
> Also remove the (now never triggering) assert and robustify
> a nearby assert.
> 


I can confirm that with this change I don't see the assert anymore.

> Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  kernel/sched/fair.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9d11a8a..61c7a10 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2351,6 +2351,13 @@ void task_numa_fault(unsigned long addr, int node, int last_cpupid, int pages, b
>  	int priv;
>  	int idx;
> 
> +	/*
> +	 * Kernel threads might not have an mm but might still
> +	 * do fault processing (such as KSM):
> +	 */
> +	if (!p->numa_faults)
> +		return;
> +
>  	if (last_cpupid != cpu_pid_to_cpupid(-1, -1)) {
>  		/* Did we access it last time around? */
>  		if (last_pid == this_pid) {
> @@ -2367,8 +2374,8 @@ void task_numa_fault(unsigned long addr, int node, int last_cpupid, int pages, b
> 
>  	idx = 2*node + priv;
> 
> -	WARN_ON_ONCE(last_cpu == -1 || node == -1);
> -	BUG_ON(!p->numa_faults);
> +	if (WARN_ON_ONCE(last_cpu == -1 || node == -1))
> +		return;
> 
>  	p->numa_faults_curr[idx] += pages;
>  	shared_fault_tick(p, node, last_cpu, pages);
> 



* Re: [PATCH 00/49] Automatic NUMA Balancing v10
  2012-12-11 16:30             ` Mel Gorman
@ 2012-12-17 10:33               ` Ingo Molnar
  0 siblings, 0 replies; 80+ messages in thread
From: Ingo Molnar @ 2012-12-17 10:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Thomas Gleixner, Paul Turner, Hillf Danton,
	David Rientjes, Lee Schermerhorn, Alex Shi, Srikar Dronamraju,
	Aneesh Kumar, Linus Torvalds, Andrew Morton, Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> > > [...] Holding PTL across task_numa_fault is bad, but not 
> > > the bad we're looking for.
> > 
> > No, holding the PTL across task_numa_fault() is fine, 
> > because this bit got reworked in my tree rather 
> > significantly, see:
> > 
> >  6030a23a1c66 sched: Move the NUMA placement logic to a 
> >  worklet
> > 
> > and followup patches.
> 
> I believe I see your point. After that patch is applied 
> task_numa_fault() is a relatively small function and is no 
> longer calling task_numa_placement. Sure, PTL is held longer 
> than necessary but not enough to cause real scalability 
> issues.

Yes - my motivation for that was three-fold:

1) to push rebalancing into process context and thus make it
   essentially lockless and also potentially preemptable.

2) enable the flip-tasks logic, which relies on taking a
   balancing decision and acting on it immediately. If you are
   in process context then this is doable. If you are in a
   balancing irq context then not so much.

3) simplifying the 2M-emu loop was extra icing on the cake:
   instead of taking and dropping the PTL 512 times (possibly
   interleaving two threads on the same pmd, both of them
   taking/dropping the same set of locks?), it only takes the
   ptl once.

I'll revive this aspect, it has many positives.
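
The locking difference in point 3 can be sketched in userspace as below.
This is an illustration under stated assumptions, not the kernel code:
struct counting_lock and the update_*() helpers are hypothetical, and 512
is the number of 4K PTEs under one 2M pmd on x86-64.

```c
#include <assert.h>

#define PTES_PER_PMD 512

/* Hypothetical lock that only counts how often it is taken. */
struct counting_lock {
	unsigned long acquisitions;
	int held;
};

void lock_take(struct counting_lock *l)
{
	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
}

void lock_drop(struct counting_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Per-PTE variant: take and drop the lock around every single entry. */
void update_per_pte(struct counting_lock *ptl, int *ptes)
{
	int i;

	for (i = 0; i < PTES_PER_PMD; i++) {
		lock_take(ptl);
		ptes[i] = 1;
		lock_drop(ptl);
	}
}

/* Batched variant: take the lock once for the whole pmd range. */
void update_batched(struct counting_lock *ptl, int *ptes)
{
	int i;

	lock_take(ptl);
	for (i = 0; i < PTES_PER_PMD; i++)
		ptes[i] = 1;
	lock_drop(ptl);
}
```

The batched variant trades a longer hold time for far fewer lock
operations, which is the trade-off being argued about in this thread.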

> > > If the bug is indeed here, it's not obvious. I don't know 
> > > why I'm triggering it or why it only triggers for specjbb 
> > > as I cannot imagine what the JVM would be doing that is 
> > > that weird or that would not have triggered before. Maybe 
> > > we both suffer this type of problem but that numacore's 
> > > rate of migration is able to trigger it.
> > 
> > Agreed.
> 
> I spent some more time on this today and the bug is *really* 
> hard to trigger or at least I have been unable to trigger it 
> today. This begs the question why it triggered three times in 
> relatively quick succession separated by a few hours when 
> testing numacore on Dec 9th. Other tests ran between the 
> failures. The first failure results were discarded. I deleted 
> them to see if the same test reproduced it a second time (it 
> did).
>
> Of the three times this bug triggered in the last week, two 
> were unclear where they crashed but one showed that the bug 
> was triggered by the JVM's garbage collector. That at least is 
> a corner case and might explain why it's hard to trigger.
> 
> I feel extremely bad about how I reported this because even 
> though we differ in how we handle faults, I really cannot see 
> any difference that would explain this and I've looked long 
> enough. Triggering this by the kernel would *have* to be 
> something like missing a cache or TLB flush after page tables 
> have been modified or during migration but in most ways that 
> matter we share that logic. Where we differ, it shouldn't 
> matter.

Don't worry, I really think you reported a genuine bug, even if 
it's hard to hit.

> FWIW, numacore pulled yesterday completed the same tests 
> without any error this time but none of the commits since Dec 
> 9th would account for fixing it.

Correct. I think chances are that it's still latent. Either 
fixed in your version of the code, which will be hard to 
reconstruct - or it's an active upstream bug.

I'd not blame it on the JVM for a good while - JVMs are one of 
the most abused pieces of code on the planet, literally running 
millions of applications on thousands of kernel variants.

Could you try the patch below on latest upstream with 
CONFIG_NUMA_BALANCING=y, it increases migration bandwidth 
10-fold - does it make it easier to trigger the bug on the now 
upstream NUMA-balancing feature?

It will kill throughput on a number of your tests, but it should 
make all the NUMA-specific activities during the JVM test a lot 
more frequent.

Thanks,

	Ingo

diff --git a/mm/migrate.c b/mm/migrate.c
index 32efd80..8699e8f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1511,7 +1511,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
  */
 static unsigned int migrate_interval_millisecs __read_mostly = 100;
 static unsigned int pteupdate_interval_millisecs __read_mostly = 1000;
-static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
+static unsigned int ratelimit_pages __read_mostly = 1280 << (20 - PAGE_SHIFT);
 
 /* Returns true if NUMA migration is currently rate limited */
 bool migrate_ratelimited(int node)
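
For scale, the arithmetic behind the one-line change above can be
sketched in plain userspace C. This is only an illustration: the helper
names are made up, PAGE_SHIFT is passed in as a parameter (12 assumes 4K
pages), and it assumes the page budget applies per
migrate_interval_millisecs window, which is what the variable names
suggest rather than something the hunk itself shows.

```c
#include <assert.h>

/*
 * Pages allowed per rate-limit window for a budget given in megabytes.
 * Mirrors the expression in the hunk: mb << (20 - PAGE_SHIFT).
 * page_shift is a parameter here; 12 corresponds to 4K pages.
 */
unsigned long ratelimit_pages(unsigned long mb, unsigned int page_shift)
{
	assert(page_shift <= 20);	/* the shift assumes pages of at most 1MB */
	return mb << (20 - page_shift);
}

/*
 * Upper bound on migration bandwidth in MB/s, assuming the budget of
 * `mb` megabytes is refilled once every `interval_ms` milliseconds.
 */
unsigned long migrate_mb_per_sec(unsigned long mb, unsigned int interval_ms)
{
	assert(interval_ms > 0);
	return mb * 1000UL / interval_ms;
}
```

With 4K pages, ratelimit_pages(128, 12) is 32768 pages and
ratelimit_pages(1280, 12) is 327680 pages; with a 100 ms window that
would cap migration at roughly 1280 MB/s and 12800 MB/s respectively.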


* Re: [PATCH 29/49] mm: numa: Add pte updates, hinting and migration stats
  2012-12-07 10:23 ` [PATCH 29/49] mm: numa: Add pte updates, hinting and migration stats Mel Gorman
@ 2013-01-04 11:42   ` Simon Jeons
  2013-01-07 15:29     ` Mel Gorman
  0 siblings, 1 reply; 80+ messages in thread
From: Simon Jeons @ 2013-01-04 11:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Fri, 2012-12-07 at 10:23 +0000, Mel Gorman wrote:
> It is tricky to quantify the basic cost of automatic NUMA placement in a
> meaningful manner. This patch adds some vmstats that can be used as part
> of a basic costing model.

Hi Gorman, 

> 
> u    = basic unit = sizeof(void *)
> Ca   = cost of struct page access = sizeof(struct page) / u
> Cpte = Cost PTE access = Ca
> Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
> 	where Cpte is incurred twice for a read and a write and Wlock
> 	is a constant representing the cost of taking or releasing a
> 	lock
> Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
> Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u

Why is Cpagerw = Ca + PAGE_SIZE/u instead of Cpte + PAGE_SIZE/u?

> Ci = Cost of page isolation = Ca + Wi
> 	where Wi is a constant that should reflect the approximate cost
> 	of the locking operation
> Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
> 	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
> 	would imply that remote accesses are 20% more expensive
> 
> Balancing cost = Cpte * numa_pte_updates +
> 		Cnumahint * numa_hint_faults +
> 		Ci * numa_pages_migrated +
> 		Cpagecopy * numa_pages_migrated
> 

Since Cpagecopy already includes Ci, why is Ci counted twice?
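
To make the terms concrete, here is a minimal userspace sketch of the
costing model exactly as quoted above, Ci term included in both places.
The struct and function names are hypothetical, and the weights (Wi,
Wnuma, Cnumahint) and sizeof(struct page) are inputs the caller has to
assume; nothing here comes from the patch itself.

```c
#include <assert.h>

/* Model inputs; every value is an assumption supplied by the caller. */
struct numa_cost_params {
	double u;			/* basic unit: sizeof(void *) */
	double struct_page_size;	/* sizeof(struct page) on the target kernel */
	double page_size;		/* PAGE_SIZE */
	double w_iso;			/* Wi: approximate isolation locking cost */
	double w_numa;			/* Wnuma: NUMA factor, 1 is local */
	double c_numahint;		/* Cnumahint: cost of a minor fault */
};

/* Deltas of the vmstat counters added by the patch. */
struct numa_counters {
	unsigned long pte_updates;	/* numa_pte_updates */
	unsigned long hint_faults;	/* numa_hint_faults */
	unsigned long pages_migrated;	/* numa_pages_migrated */
};

/*
 * Balancing cost as written in the changelog. Cupdate is defined there
 * but does not appear in the total, so it is not computed here.
 */
double balancing_cost(const struct numa_cost_params *p,
		      const struct numa_counters *c)
{
	assert(p->u > 0);

	double ca = p->struct_page_size / p->u;		/* Ca */
	double cpte = ca;				/* Cpte = Ca */
	double cpagerw = ca + p->page_size / p->u;	/* Cpagerw */
	double ci = ca + p->w_iso;			/* Ci */
	double cpagecopy = cpagerw + cpagerw * p->w_numa +
			   ci + ci * p->w_numa;		/* Cpagecopy */

	return cpte * c->pte_updates +
	       p->c_numahint * c->hint_faults +
	       ci * c->pages_migrated +
	       cpagecopy * c->pages_migrated;
}
```

Dropping the standalone `ci * c->pages_migrated` term is a one-line
change if it turns out to be double counting.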

> Note that numa_pages_migrated is used as a measure of how many pages
> were isolated even though it would miss pages that failed to migrate. A
> vmstat counter could have been added for it but the isolation cost is
> pretty marginal in comparison to the overall cost so it seemed overkill.
> 
> The ideal way to measure automatic placement benefit would be to count
> the number of remote accesses versus local accesses and do something like
> 
> 	benefit = (remote_accesses_before - remote_accesses_after) * Wnuma
> 
> but the information is not readily available. As a workload converges, the
> expectation would be that the number of remote numa hints would reduce to 0.
> 
> 	convergence = numa_hint_faults_local / numa_hint_faults
> 		where this is measured for the last N number of
> 		numa hints recorded. When the workload is fully
> 		converged the value is 1.
> 

Is it better for convergence to tend to 0 or to 1? If it tends to 1,
aren't Cpte * numa_pte_updates + Cnumahint * numa_hint_faults just
wasted cost? What am I missing?
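
For reference, the convergence ratio quoted above could be computed from
two snapshots of vmstat-style "name value" text, roughly as follows.
This is a sketch under assumptions: the function names are hypothetical,
the counter names are the ones added to vmstat_text[] further down in
this patch, and the "last N hints" window is approximated by the delta
between the two snapshots.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up "name value" in vmstat-style text. The name must match a whole
 * field at the start of a line, so "numa_hint_faults" does not match
 * "numa_hint_faults_local". Returns 0 on success, -1 if absent.
 */
int vmstat_lookup(const char *text, const char *name, unsigned long long *out)
{
	assert(text && name && out);

	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, name, len) == 0 && line[len] == ' ') {
			*out = strtoull(line + len + 1, NULL, 10);
			return 0;
		}
		line = strchr(line, '\n');
		if (line)
			line++;		/* step past the newline */
	}
	return -1;
}

/*
 * convergence = local hint faults / all hint faults over the interval
 * between two snapshots. Returns -1.0 if a counter is missing or no
 * hinting faults happened in the interval.
 */
double numa_convergence(const char *before, const char *after)
{
	unsigned long long f0, f1, l0, l1;

	if (vmstat_lookup(before, "numa_hint_faults", &f0) ||
	    vmstat_lookup(after, "numa_hint_faults", &f1) ||
	    vmstat_lookup(before, "numa_hint_faults_local", &l0) ||
	    vmstat_lookup(after, "numa_hint_faults_local", &l1))
		return -1.0;
	if (f1 <= f0)
		return -1.0;

	return (double)(l1 - l0) / (double)(f1 - f0);
}
```

A value close to 1 means nearly all hinting faults in the interval hit
pages that were already on the faulting CPU's node.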

> This can measure if the placement policy is converging and how fast it is
> doing it.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  include/linux/vm_event_item.h |    6 ++++++
>  include/linux/vmstat.h        |    8 ++++++++
>  mm/huge_memory.c              |    5 +++++
>  mm/memory.c                   |   12 ++++++++++++
>  mm/mempolicy.c                |    2 ++
>  mm/migrate.c                  |    3 ++-
>  mm/vmstat.c                   |    6 ++++++
>  7 files changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index a1f750b..dded0af 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -38,6 +38,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
>  		KSWAPD_SKIP_CONGESTION_WAIT,
>  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> +#ifdef CONFIG_BALANCE_NUMA
> +		NUMA_PTE_UPDATES,
> +		NUMA_HINT_FAULTS,
> +		NUMA_HINT_FAULTS_LOCAL,
> +		NUMA_PAGE_MIGRATE,
> +#endif
>  #ifdef CONFIG_MIGRATION
>  		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
>  #endif
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 92a86b2..dffccfa 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -80,6 +80,14 @@ static inline void vm_events_fold_cpu(int cpu)
>  
>  #endif /* CONFIG_VM_EVENT_COUNTERS */
>  
> +#ifdef CONFIG_BALANCE_NUMA
> +#define count_vm_numa_event(x)     count_vm_event(x)
> +#define count_vm_numa_events(x, y) count_vm_events(x, y)
> +#else
> +#define count_vm_numa_event(x) do {} while (0)
> +#define count_vm_numa_events(x, y) do {} while (0)
> +#endif /* CONFIG_BALANCE_NUMA */
> +
>  #define __count_zone_vm_events(item, zone, delta) \
>  		__count_vm_events(item##_NORMAL - ZONE_NORMAL + \
>  		zone_idx(zone), delta)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b3d4c4b..66e73cc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1025,6 +1025,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	struct page *page = NULL;
>  	unsigned long haddr = addr & HPAGE_PMD_MASK;
>  	int target_nid;
> +	int current_nid = -1;
>  
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_same(pmd, *pmdp)))
> @@ -1033,6 +1034,10 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	page = pmd_page(pmd);
>  	get_page(page);
>  	spin_unlock(&mm->page_table_lock);
> +	current_nid = page_to_nid(page);
> +	count_vm_numa_event(NUMA_HINT_FAULTS);
> +	if (current_nid == numa_node_id())
> +		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
>  
>  	target_nid = mpol_misplaced(page, vma, haddr);
>  	if (target_nid == -1)
> diff --git a/mm/memory.c b/mm/memory.c
> index 1d6f85a..47f5dd1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3477,6 +3477,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	set_pte_at(mm, addr, ptep, pte);
>  	update_mmu_cache(vma, addr, ptep);
>  
> +	count_vm_numa_event(NUMA_HINT_FAULTS);
>  	page = vm_normal_page(vma, addr, pte);
>  	if (!page) {
>  		pte_unmap_unlock(ptep, ptl);
> @@ -3485,6 +3486,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	get_page(page);
>  	current_nid = page_to_nid(page);
> +	if (current_nid == numa_node_id())
> +		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
>  	target_nid = mpol_misplaced(page, vma, addr);
>  	pte_unmap_unlock(ptep, ptl);
>  	if (target_nid == -1) {
> @@ -3517,6 +3520,9 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	unsigned long offset;
>  	spinlock_t *ptl;
>  	bool numa = false;
> +	int local_nid = numa_node_id();
> +	unsigned long nr_faults = 0;
> +	unsigned long nr_faults_local = 0;
>  
>  	spin_lock(&mm->page_table_lock);
>  	pmd = *pmdp;
> @@ -3565,10 +3571,16 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		curr_nid = page_to_nid(page);
>  		task_numa_fault(curr_nid, 1);
>  
> +		nr_faults++;
> +		if (curr_nid == local_nid)
> +			nr_faults_local++;
> +
>  		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>  	}
>  	pte_unmap_unlock(orig_pte, ptl);
>  
> +	count_vm_numa_events(NUMA_HINT_FAULTS, nr_faults);
> +	count_vm_numa_events(NUMA_HINT_FAULTS_LOCAL, nr_faults_local);
>  	return 0;
>  }
>  #else
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index a7a62fe..516491f 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -583,6 +583,8 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
>  	BUILD_BUG_ON(_PAGE_NUMA != _PAGE_PROTNONE);
>  
>  	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
> +	if (nr_updated)
> +		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
>  
>  	return nr_updated;
>  }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 49878d7..4f55694 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1514,7 +1514,8 @@ int migrate_misplaced_page(struct page *page, int node)
>  		if (nr_remaining) {
>  			putback_lru_pages(&migratepages);
>  			isolated = 0;
> -		}
> +		} else
> +			count_vm_numa_event(NUMA_PAGE_MIGRATE);
>  	}
>  	BUG_ON(!list_empty(&migratepages));
>  out:
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3a067fa..cfa386da 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -774,6 +774,12 @@ const char * const vmstat_text[] = {
>  
>  	"pgrotated",
>  
> +#ifdef CONFIG_BALANCE_NUMA
> +	"numa_pte_updates",
> +	"numa_hint_faults",
> +	"numa_hint_faults_local",
> +	"numa_pages_migrated",
> +#endif
>  #ifdef CONFIG_MIGRATION
>  	"pgmigrate_success",
>  	"pgmigrate_fail",




* Re: [PATCH 25/49] mm: numa: Add fault driven placement and migration
  2012-12-07 10:23 ` [PATCH 25/49] mm: numa: Add fault driven placement and migration Mel Gorman
@ 2013-01-04 11:56   ` Simon Jeons
  0 siblings, 0 replies; 80+ messages in thread
From: Simon Jeons @ 2013-01-04 11:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Fri, 2012-12-07 at 10:23 +0000, Mel Gorman wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> NOTE: This patch is based on "sched, numa, mm: Add fault driven
> 	placement and migration policy" but as it throws away all the policy
> 	to just leave a basic foundation I had to drop the signed-offs-by.
> 
> This patch creates a bare-bones method for setting PTEs pte_numa in the
> context of the scheduler that when faulted later will be faulted onto the
> node the CPU is running on.  In itself this does nothing useful but any
> placement policy will fundamentally depend on receiving hints on placement
> from fault context and doing something intelligent about it.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  arch/sh/mm/Kconfig       |    1 +
>  arch/x86/Kconfig         |    2 +
>  include/linux/mm_types.h |   11 ++++
>  include/linux/sched.h    |   20 ++++++++
>  kernel/sched/core.c      |   13 +++++
>  kernel/sched/fair.c      |  125 ++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/features.h  |    7 +++
>  kernel/sched/sched.h     |    6 +++
>  kernel/sysctl.c          |   24 ++++++++-
>  mm/huge_memory.c         |    5 +-
>  mm/memory.c              |   14 +++++-
>  11 files changed, 224 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig
> index cb8f992..0f7c852 100644
> --- a/arch/sh/mm/Kconfig
> +++ b/arch/sh/mm/Kconfig
> @@ -111,6 +111,7 @@ config VSYSCALL
>  config NUMA
>  	bool "Non Uniform Memory Access (NUMA) Support"
>  	depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
> +	select ARCH_WANT_NUMA_VARIABLE_LOCALITY
>  	default n
>  	help
>  	  Some SH systems have many various memories scattered around
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 46c3bff..1137028 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -22,6 +22,8 @@ config X86
>  	def_bool y
>  	select HAVE_AOUT if X86_32
>  	select HAVE_UNSTABLE_SCHED_CLOCK
> +	select ARCH_SUPPORTS_NUMA_BALANCING
> +	select ARCH_WANTS_PROT_NUMA_PROT_NONE
>  	select HAVE_IDE
>  	select HAVE_OPROFILE
>  	select HAVE_PCSPKR_PLATFORM
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 31f8a3a..d82accb 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -398,6 +398,17 @@ struct mm_struct {
>  #ifdef CONFIG_CPUMASK_OFFSTACK
>  	struct cpumask cpumask_allocation;
>  #endif
> +#ifdef CONFIG_BALANCE_NUMA
> +	/*
> +	 * numa_next_scan is the next time when the PTEs will me marked

s/me/be

> +	 * pte_numa to gather statistics and migrate pages to new nodes
> +	 * if necessary
> +	 */
> +	unsigned long numa_next_scan;
> +
> +	/* numa_scan_seq prevents two threads setting pte_numa */
> +	int numa_scan_seq;
> +#endif
>  	struct uprobes_state uprobes_state;
>  };
>  
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0dd42a0..ac71181 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1479,6 +1479,14 @@ struct task_struct {
>  	short il_next;
>  	short pref_node_fork;
>  #endif
> +#ifdef CONFIG_BALANCE_NUMA
> +	int numa_scan_seq;
> +	int numa_migrate_seq;
> +	unsigned int numa_scan_period;
> +	u64 node_stamp;			/* migration stamp  */
> +	struct callback_head numa_work;
> +#endif /* CONFIG_BALANCE_NUMA */
> +
>  	struct rcu_head rcu;
>  
>  	/*
> @@ -1553,6 +1561,14 @@ struct task_struct {
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
>  #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
>  
> +#ifdef CONFIG_BALANCE_NUMA
> +extern void task_numa_fault(int node, int pages);
> +#else
> +static inline void task_numa_fault(int node, int pages)
> +{
> +}
> +#endif
> +
>  /*
>   * Priority of a process goes from 0..MAX_PRIO-1, valid RT
>   * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
> @@ -1990,6 +2006,10 @@ enum sched_tunable_scaling {
>  };
>  extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
>  
> +extern unsigned int sysctl_balance_numa_scan_period_min;
> +extern unsigned int sysctl_balance_numa_scan_period_max;
> +extern unsigned int sysctl_balance_numa_settle_count;
> +
>  #ifdef CONFIG_SCHED_DEBUG
>  extern unsigned int sysctl_sched_migration_cost;
>  extern unsigned int sysctl_sched_nr_migrate;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2d8927f..81fa185 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1533,6 +1533,19 @@ static void __sched_fork(struct task_struct *p)
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>  	INIT_HLIST_HEAD(&p->preempt_notifiers);
>  #endif
> +
> +#ifdef CONFIG_BALANCE_NUMA
> +	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
> +		p->mm->numa_next_scan = jiffies;
> +		p->mm->numa_scan_seq = 0;
> +	}
> +
> +	p->node_stamp = 0ULL;
> +	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> +	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> +	p->numa_scan_period = sysctl_balance_numa_scan_period_min;
> +	p->numa_work.next = &p->numa_work;
> +#endif /* CONFIG_BALANCE_NUMA */
>  }
>  
>  /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b800a1..b6d3ed7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -26,6 +26,8 @@
>  #include <linux/slab.h>
>  #include <linux/profile.h>
>  #include <linux/interrupt.h>
> +#include <linux/mempolicy.h>
> +#include <linux/task_work.h>
>  
>  #include <trace/events/sched.h>
>  
> @@ -776,6 +778,126 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   * Scheduling class queueing methods:
>   */
>  
> +#ifdef CONFIG_BALANCE_NUMA
> +/*
> + * numa task sample period in ms: 5s
> + */
> +unsigned int sysctl_balance_numa_scan_period_min = 5000;
> +unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
> +
> +static void task_numa_placement(struct task_struct *p)
> +{
> +	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
> +
> +	if (p->numa_scan_seq == seq)
> +		return;
> +	p->numa_scan_seq = seq;
> +
> +	/* FIXME: Scheduling placement policy hints go here */
> +}
> +
> +/*
> + * Got a PROT_NONE fault for a page on @node.
> + */
> +void task_numa_fault(int node, int pages)
> +{
> +	struct task_struct *p = current;
> +
> +	/* FIXME: Allocate task-specific structure for placement policy here */
> +
> +	task_numa_placement(p);
> +}
> +
> +/*
> + * The expensive part of numa migration is done from task_work context.
> + * Triggered from task_tick_numa().
> + */
> +void task_numa_work(struct callback_head *work)
> +{
> +	unsigned long migrate, next_scan, now = jiffies;
> +	struct task_struct *p = current;
> +	struct mm_struct *mm = p->mm;
> +
> +	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
> +
> +	work->next = work; /* protect against double add */
> +	/*
> +	 * Who cares about NUMA placement when they're dying.
> +	 *
> +	 * NOTE: make sure not to dereference p->mm before this check,
> +	 * exit_task_work() happens _after_ exit_mm() so we could be called
> +	 * without p->mm even though we still had it when we enqueued this
> +	 * work.
> +	 */
> +	if (p->flags & PF_EXITING)
> +		return;
> +
> +	/*
> +	 * Enforce maximal scan/migration frequency..
> +	 */
> +	migrate = mm->numa_next_scan;
> +	if (time_before(now, migrate))
> +		return;
> +
> +	if (p->numa_scan_period == 0)
> +		p->numa_scan_period = sysctl_balance_numa_scan_period_min;
> +
> +	next_scan = now + 2*msecs_to_jiffies(p->numa_scan_period);
> +	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
> +		return;
> +
> +	ACCESS_ONCE(mm->numa_scan_seq)++;
> +	{
> +		struct vm_area_struct *vma;
> +
> +		down_read(&mm->mmap_sem);
> +		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +			if (!vma_migratable(vma))
> +				continue;
> +			change_prot_numa(vma, vma->vm_start, vma->vm_end);
> +		}
> +		up_read(&mm->mmap_sem);
> +	}
> +}
> +
> +/*
> + * Drive the periodic memory faults..
> + */
> +void task_tick_numa(struct rq *rq, struct task_struct *curr)
> +{
> +	struct callback_head *work = &curr->numa_work;
> +	u64 period, now;
> +
> +	/*
> +	 * We don't care about NUMA placement if we don't have memory.
> +	 */
> +	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
> +		return;
> +
> +	/*
> +	 * Using runtime rather than walltime has the dual advantage that
> +	 * we (mostly) drive the selection from busy threads and that the
> +	 * task needs to have done some actual work before we bother with
> +	 * NUMA placement.
> +	 */
> +	now = curr->se.sum_exec_runtime;
> +	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
> +
> +	if (now - curr->node_stamp > period) {
> +		curr->node_stamp = now;
> +
> +		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
> +			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
> +			task_work_add(curr, work, true);
> +		}
> +	}
> +}
> +#else
> +static void task_tick_numa(struct rq *rq, struct task_struct *curr)
> +{
> +}
> +#endif /* CONFIG_BALANCE_NUMA */
> +
>  static void
>  account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> @@ -4954,6 +5076,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  		cfs_rq = cfs_rq_of(se);
>  		entity_tick(cfs_rq, se, queued);
>  	}
> +
> +	if (sched_feat_numa(NUMA))
> +		task_tick_numa(rq, curr);
>  }
>  
>  /*
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index eebefca..7cfd289 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -61,3 +61,10 @@ SCHED_FEAT(TTWU_QUEUE, true)
>  SCHED_FEAT(FORCE_SD_OVERLAP, false)
>  SCHED_FEAT(RT_RUNTIME_SHARE, true)
>  SCHED_FEAT(LB_MIN, false)
> +
> +/*
> + * Apply the automatic NUMA scheduling policy
> + */
> +#ifdef CONFIG_BALANCE_NUMA
> +SCHED_FEAT(NUMA,	true)
> +#endif
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7a7db09..9a43241 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -648,6 +648,12 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
>  #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
>  #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
>  
> +#ifdef CONFIG_BALANCE_NUMA
> +#define sched_feat_numa(x) sched_feat(x)
> +#else
> +#define sched_feat_numa(x) (0)
> +#endif
> +
>  static inline u64 global_rt_period(void)
>  {
>  	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 26f65ea..1359f51 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 100000;		/* 100 usecs */
>  static int max_sched_granularity_ns = NSEC_PER_SEC;	/* 1 second */
>  static int min_wakeup_granularity_ns;			/* 0 usecs */
>  static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
> +#ifdef CONFIG_SMP
>  static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
>  static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
> -#endif
> +#endif /* CONFIG_SMP */
> +#endif /* CONFIG_SCHED_DEBUG */
>  
>  #ifdef CONFIG_COMPACTION
>  static int min_extfrag_threshold;
> @@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
>  		.extra1		= &min_wakeup_granularity_ns,
>  		.extra2		= &max_wakeup_granularity_ns,
>  	},
> +#ifdef CONFIG_SMP
>  	{
>  		.procname	= "sched_tunable_scaling",
>  		.data		= &sysctl_sched_tunable_scaling,
> @@ -347,7 +350,24 @@ static struct ctl_table kern_table[] = {
>  		.extra1		= &zero,
>  		.extra2		= &one,
>  	},
> -#endif
> +#endif /* CONFIG_SMP */
> +#ifdef CONFIG_BALANCE_NUMA
> +	{
> +		.procname	= "balance_numa_scan_period_min_ms",
> +		.data		= &sysctl_balance_numa_scan_period_min,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
> +		.procname	= "balance_numa_scan_period_max_ms",
> +		.data		= &sysctl_balance_numa_scan_period_max,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +#endif /* CONFIG_BALANCE_NUMA */
> +#endif /* CONFIG_SCHED_DEBUG */
>  	{
>  		.procname	= "sched_rt_period_us",
>  		.data		= &sysctl_sched_rt_period,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 68e0412..b3d4c4b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1045,6 +1045,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 */
>  	split_huge_page(page);
>  	put_page(page);
> +
>  	return 0;
>  
>  clear_pmdnuma:
> @@ -1059,8 +1060,10 @@ clear_pmdnuma:
>  
>  out_unlock:
>  	spin_unlock(&mm->page_table_lock);
> -	if (page)
> +	if (page) {
>  		put_page(page);
> +		task_numa_fault(numa_node_id(), HPAGE_PMD_NR);
> +	}
>  	return 0;
>  }
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 1757ad8..1d6f85a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3454,7 +3454,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  {
>  	struct page *page = NULL;
>  	spinlock_t *ptl;
> -	int current_nid, target_nid;
> +	int current_nid = -1;
> +	int target_nid;
>  
>  	/*
>  	* The "pte" at this point cannot be used safely without
> @@ -3501,6 +3502,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		current_nid = target_nid;
>  
>  out:
> +	task_numa_fault(current_nid, 1);
>  	return 0;
>  }
>  
> @@ -3537,6 +3539,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = *pte;
>  		struct page *page;
> +		int curr_nid;
>  		if (!pte_present(pteval))
>  			continue;
>  		if (!pte_numa(pteval))
> @@ -3554,6 +3557,15 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		page = vm_normal_page(vma, addr, pteval);
>  		if (unlikely(!page))
>  			continue;
> +		/* only check non-shared pages */
> +		if (unlikely(page_mapcount(page) != 1))
> +			continue;
> +		pte_unmap_unlock(pte, ptl);
> +
> +		curr_nid = page_to_nid(page);
> +		task_numa_fault(curr_nid, 1);
> +
> +		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>  	}
>  	pte_unmap_unlock(orig_pte, ptl);
>  
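
As a reading aid for the quoted task_numa_work(): the time_before()
check followed by cmpxchg() on mm->numa_next_scan lets exactly one
thread of a process claim each scan window. A minimal userspace sketch
of that pattern, using C11 atomics in place of the kernel primitives
(the function names are hypothetical and the period is assumed to be in
the same tick unit as `now`):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Wraparound-safe "a is earlier than b", the same idea as time_before(). */
bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Try to claim the next scan window. Returns true for the single caller
 * that advanced *next_scan; callers that are too early, or that lost the
 * race to another thread, get false and should do nothing.
 */
bool claim_scan(_Atomic unsigned long *next_scan, unsigned long now,
		unsigned long period)
{
	assert(next_scan);

	unsigned long seen = atomic_load(next_scan);

	if (tick_before(now, seen))
		return false;	/* scan window has not opened yet */

	/*
	 * Succeeds only if *next_scan still holds the value we observed,
	 * so a concurrent winner makes every other caller fail here.
	 */
	return atomic_compare_exchange_strong(next_scan, &seen,
					      now + 2 * period);
}
```

The `now + 2 * period` mirrors the next_scan computation in the quoted
hunk; a thread that fails the compare-and-swap simply returns, as the
patch does.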




* Re: [PATCH 22/49] mm: mempolicy: Add MPOL_MF_LAZY
  2012-12-07 10:23 ` [PATCH 22/49] mm: mempolicy: Add MPOL_MF_LAZY Mel Gorman
@ 2013-01-05  5:18   ` Simon Jeons
  2013-01-07 15:14     ` Mel Gorman
  0 siblings, 1 reply; 80+ messages in thread
From: Simon Jeons @ 2013-01-05  5:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Fri, 2012-12-07 at 10:23 +0000, Mel Gorman wrote:
> From: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
> NOTE: Once again there is a lot of patch stealing and the end result
> 	is sufficiently different that I had to drop the signed-offs.
> 	Will re-add if the original authors are ok with that.
> 
> This patch adds another mbind() flag to request "lazy migration".  The
> flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
> pages are marked PROT_NONE. The pages will be migrated in the fault
> path on "first touch", if the policy dictates at that time.
> 
> "Lazy Migration" will allow testing of migrate-on-fault via mbind().
> Also allows applications to specify that only subsequently touched
> pages be migrated to obey new policy, instead of all pages in range.
> This can be useful for multi-threaded applications working on a
> large shared data area that is initialized by an initial thread
> resulting in all pages on one [or a few, if overflowed] nodes.
> After PROT_NONE, the pages in regions assigned to the worker threads
> will be automatically migrated local to the threads on 1st touch.
> 
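
For orientation, a small sketch of how the flag bits from the uapi hunk
below combine. The defines are copied from that hunk; the two helper
functions are hypothetical, and both the "reject anything outside
MPOL_MF_VALID" check and the "lazy only matters together with a MOVE
flag" reading are assumptions based on the description above, not code
quoted from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Values as they appear in the include/uapi/linux/mempolicy.h hunk. */
#define MPOL_MF_STRICT   (1 << 0)
#define MPOL_MF_MOVE     (1 << 1)
#define MPOL_MF_MOVE_ALL (1 << 2)
#define MPOL_MF_LAZY     (1 << 3)
#define MPOL_MF_VALID    (MPOL_MF_STRICT | MPOL_MF_MOVE | \
			  MPOL_MF_MOVE_ALL | MPOL_MF_LAZY)

static_assert(MPOL_MF_VALID == 0xf, "the four public flags occupy bits 0-3");

/* Assumed validation: any bit outside MPOL_MF_VALID is an error. */
int check_mbind_flags(unsigned long flags)
{
	if (flags & ~(unsigned long)MPOL_MF_VALID)
		return -EINVAL;
	return 0;
}

/*
 * MPOL_MF_LAZY is described as a modifier of MPOL_MF_MOVE*, so this
 * helper treats it as meaningful only when a MOVE flag is also set.
 */
int wants_lazy_migration(unsigned long flags)
{
	return (flags & MPOL_MF_LAZY) &&
	       (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL));
}
```

Under these assumptions, MPOL_MF_MOVE | MPOL_MF_LAZY passes validation
and requests migrate-on-fault, while bit 4 (now MPOL_MF_INTERNAL) is
rejected.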
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> ---
>  include/linux/mm.h             |    5 ++
>  include/uapi/linux/mempolicy.h |   13 ++-
>  mm/mempolicy.c                 |  185 ++++++++++++++++++++++++++++++++++++----
>  3 files changed, 185 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fa16152..471185e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1551,6 +1551,11 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
>  }
>  #endif
>  
> +#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
> +void change_prot_numa(struct vm_area_struct *vma,
> +			unsigned long start, unsigned long end);
> +#endif
> +
>  struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
>  int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
>  			unsigned long pfn, unsigned long size, pgprot_t);
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index 472de8a..6a1baae 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -49,9 +49,16 @@ enum mpol_rebind_step {
>  
>  /* Flags for mbind */
>  #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
> -#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
> -#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */
> -#define MPOL_MF_INTERNAL (1<<3)	/* Internal flags start here */
> +#define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
> +				   to policy */
> +#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
> +#define MPOL_MF_LAZY	 (1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
> +#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
> +
> +#define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
> +			 MPOL_MF_MOVE     | 	\
> +			 MPOL_MF_MOVE_ALL |	\
> +			 MPOL_MF_LAZY)
>  
>  /*
>   * Internal flags that share the struct mempolicy flags word with
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index df1466d..51d3ebd 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -90,6 +90,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/ctype.h>
>  #include <linux/mm_inline.h>
> +#include <linux/mmu_notifier.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/uaccess.h>
> @@ -565,6 +566,145 @@ static inline int check_pgd_range(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
> +/*
> + * Here we search for not shared page mappings (mapcount == 1) and we
> + * set up the pmd/pte_numa on those mappings so the very next access
> + * will fire a NUMA hinting page fault.
> + */
> +static int
> +change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
> +			unsigned long address)
> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte, *_pte;
> +	struct page *page;
> +	unsigned long _address, end;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +
> +	VM_BUG_ON(address & ~PAGE_MASK);
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		goto out;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd))
> +		goto out;
> +
> +	if (pmd_trans_huge_lock(pmd, vma) == 1) {
> +		int page_nid;
> +		ret = HPAGE_PMD_NR;
> +
> +		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +
> +		if (pmd_numa(*pmd)) {
> +			spin_unlock(&mm->page_table_lock);
> +			goto out;
> +		}
> +
> +		page = pmd_page(*pmd);
> +
> +		/* only check non-shared pages */
> +		if (page_mapcount(page) != 1) {
> +			spin_unlock(&mm->page_table_lock);
> +			goto out;
> +		}
> +
> +		page_nid = page_to_nid(page);
> +
> +		if (pmd_numa(*pmd)) {
> +			spin_unlock(&mm->page_table_lock);
> +			goto out;
> +		}
> +

Hi Gorman,

Since pmd_trans_huge_lock() has already taken mm->page_table_lock and
pmd_numa(*pmd) was checked just above under that lock, why check
pmd_numa(*pmd) again?

> +		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
> +		ret += HPAGE_PMD_NR;
> +		/* defer TLB flush to lower the overhead */
> +		spin_unlock(&mm->page_table_lock);
> +		goto out;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		goto out;
> +	VM_BUG_ON(!pmd_present(*pmd));
> +
> +	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
> +	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> +	for (_address = address, _pte = pte; _address < end;
> +	     _pte++, _address += PAGE_SIZE) {
> +		pte_t pteval = *_pte;
> +		if (!pte_present(pteval))
> +			continue;
> +		if (pte_numa(pteval))
> +			continue;
> +		page = vm_normal_page(vma, _address, pteval);
> +		if (unlikely(!page))
> +			continue;
> +		/* only check non-shared pages */
> +		if (page_mapcount(page) != 1)
> +			continue;
> +
> +		set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
> +
> +		/* defer TLB flush to lower the overhead */
> +		ret++;
> +	}
> +	pte_unmap_unlock(pte, ptl);
> +
> +	if (ret && !pmd_numa(*pmd)) {
> +		spin_lock(&mm->page_table_lock);
> +		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
> +		spin_unlock(&mm->page_table_lock);
> +		/* defer TLB flush to lower the overhead */
> +	}
> +
> +out:
> +	return ret;
> +}
> +
> +/* Assumes mmap_sem is held */
> +void
> +change_prot_numa(struct vm_area_struct *vma,
> +			unsigned long address, unsigned long end)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	int progress = 0;
> +
> +	while (address < end) {
> +		VM_BUG_ON(address < vma->vm_start ||
> +			  address + PAGE_SIZE > vma->vm_end);
> +
> +		progress += change_prot_numa_range(mm, vma, address);
> +		address = (address + PMD_SIZE) & PMD_MASK;
> +	}
> +
> +	/*
> +	 * Flush the TLB for the mm to start the NUMA hinting
> +	 * page faults after we finish scanning this vma part
> +	 * if there were any PTE updates
> +	 */
> +	if (progress) {
> +		mmu_notifier_invalidate_range_start(vma->vm_mm, address, end);
> +		flush_tlb_range(vma, address, end);
> +		mmu_notifier_invalidate_range_end(vma->vm_mm, address, end);
> +	}
> +}
> +#else
> +static unsigned long change_prot_numa(struct vm_area_struct *vma,
> +			unsigned long addr, unsigned long end)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
> +
>  /*
>   * Check if all pages in a range are on a set of nodes.
>   * If pagelist != NULL then isolate pages from the LRU and
> @@ -583,22 +723,32 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
>  		return ERR_PTR(-EFAULT);
>  	prev = NULL;
>  	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
> +		unsigned long endvma = vma->vm_end;
> +
> +		if (endvma > end)
> +			endvma = end;
> +		if (vma->vm_start > start)
> +			start = vma->vm_start;
> +
>  		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
>  			if (!vma->vm_next && vma->vm_end < end)
>  				return ERR_PTR(-EFAULT);
>  			if (prev && prev->vm_end < vma->vm_start)
>  				return ERR_PTR(-EFAULT);
>  		}
> -		if (!is_vm_hugetlb_page(vma) &&
> -		    ((flags & MPOL_MF_STRICT) ||
> +
> +		if (is_vm_hugetlb_page(vma))
> +			goto next;
> +
> +		if (flags & MPOL_MF_LAZY) {
> +			change_prot_numa(vma, start, endvma);
> +			goto next;
> +		}
> +
> +		if ((flags & MPOL_MF_STRICT) ||
>  		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
> -				vma_migratable(vma)))) {
> -			unsigned long endvma = vma->vm_end;
> +		      vma_migratable(vma))) {
>  
> -			if (endvma > end)
> -				endvma = end;
> -			if (vma->vm_start > start)
> -				start = vma->vm_start;
>  			err = check_pgd_range(vma, start, endvma, nodes,
>  						flags, private);
>  			if (err) {
> @@ -606,6 +756,7 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
>  				break;
>  			}
>  		}
> +next:
>  		prev = vma;
>  	}
>  	return first;
> @@ -1138,8 +1289,7 @@ static long do_mbind(unsigned long start, unsigned long len,
>  	int err;
>  	LIST_HEAD(pagelist);
>  
> -	if (flags & ~(unsigned long)(MPOL_MF_STRICT |
> -				     MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> +	if (flags & ~(unsigned long)MPOL_MF_VALID)
>  		return -EINVAL;
>  	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
>  		return -EPERM;
> @@ -1162,6 +1312,9 @@ static long do_mbind(unsigned long start, unsigned long len,
>  	if (IS_ERR(new))
>  		return PTR_ERR(new);
>  
> +	if (flags & MPOL_MF_LAZY)
> +		new->flags |= MPOL_F_MOF;
> +
>  	/*
>  	 * If we are using the default policy then operation
>  	 * on discontinuous address spaces is okay after all
> @@ -1198,13 +1351,15 @@ static long do_mbind(unsigned long start, unsigned long len,
>  	vma = check_range(mm, start, end, nmask,
>  			  flags | MPOL_MF_INVERT, &pagelist);
>  
> -	err = PTR_ERR(vma);
> -	if (!IS_ERR(vma)) {
> -		int nr_failed = 0;
> -
> +	err = PTR_ERR(vma);	/* maybe ... */
> +	if (!IS_ERR(vma) && mode != MPOL_NOOP)
>  		err = mbind_range(mm, start, end, new);
>  
> +	if (!err) {
> +		int nr_failed = 0;
> +
>  		if (!list_empty(&pagelist)) {
> +			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
>  			nr_failed = migrate_pages(&pagelist, new_vma_page,
>  						(unsigned long)vma,
>  						false, MIGRATE_SYNC,
> @@ -1213,7 +1368,7 @@ static long do_mbind(unsigned long start, unsigned long len,
>  				putback_lru_pages(&pagelist);
>  		}
>  
> -		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
> +		if (nr_failed && (flags & MPOL_MF_STRICT))
>  			err = -EIO;
>  	} else
>  		putback_lru_pages(&pagelist);




* Re: [PATCH 22/49] mm: mempolicy: Add MPOL_MF_LAZY
  2013-01-05  5:18   ` Simon Jeons
@ 2013-01-07 15:14     ` Mel Gorman
  0 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2013-01-07 15:14 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Fri, Jan 04, 2013 at 11:18:17PM -0600, Simon Jeons wrote:
> > +static int
> > +change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
> > +			unsigned long address)
> > +{
> > +	pgd_t *pgd;
> > +	pud_t *pud;
> > +	pmd_t *pmd;
> > +	pte_t *pte, *_pte;
> > +	struct page *page;
> > +	unsigned long _address, end;
> > +	spinlock_t *ptl;
> > +	int ret = 0;
> > +
> > +	VM_BUG_ON(address & ~PAGE_MASK);
> > +
> > +	pgd = pgd_offset(mm, address);
> > +	if (!pgd_present(*pgd))
> > +		goto out;
> > +
> > +	pud = pud_offset(pgd, address);
> > +	if (!pud_present(*pud))
> > +		goto out;
> > +
> > +	pmd = pmd_offset(pud, address);
> > +	if (pmd_none(*pmd))
> > +		goto out;
> > +
> > +	if (pmd_trans_huge_lock(pmd, vma) == 1) {
> > +		int page_nid;
> > +		ret = HPAGE_PMD_NR;
> > +
> > +		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > +
> > +		if (pmd_numa(*pmd)) {
> > +			spin_unlock(&mm->page_table_lock);
> > +			goto out;
> > +		}
> > +
> > +		page = pmd_page(*pmd);
> > +
> > +		/* only check non-shared pages */
> > +		if (page_mapcount(page) != 1) {
> > +			spin_unlock(&mm->page_table_lock);
> > +			goto out;
> > +		}
> > +
> > +		page_nid = page_to_nid(page);
> > +
> > +		if (pmd_numa(*pmd)) {
> > +			spin_unlock(&mm->page_table_lock);
> > +			goto out;
> > +		}
> > +
> 
> Hi Gorman,
> 
> Since pmd_trans_huge_lock() has already taken &mm->page_table_lock,
> why check pmd_numa(*pmd) again?
> 

It looks like an oversight. I've added a TODO item to clean it up when I
revisit NUMA balancing some time soon.

Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 29/49] mm: numa: Add pte updates, hinting and migration stats
  2013-01-04 11:42   ` Simon Jeons
@ 2013-01-07 15:29     ` Mel Gorman
  0 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2013-01-07 15:29 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Fri, Jan 04, 2013 at 05:42:24AM -0600, Simon Jeons wrote:
> On Fri, 2012-12-07 at 10:23 +0000, Mel Gorman wrote:
> > It is tricky to quantify the basic cost of automatic NUMA placement in a
> > meaningful manner. This patch adds some vmstats that can be used as part
> > of a basic costing model.
> 
> Hi Gorman, 
> 
> > 
> > u    = basic unit = sizeof(void *)
> > Ca   = cost of struct page access = sizeof(struct page) / u
> > Cpte = Cost PTE access = Ca
> > Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
> > 	where Cpte is incurred twice for a read and a write and Wlock
> > 	is a constant representing the cost of taking or releasing a
> > 	lock
> > Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
> > Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
> 
> > Why is Cpagerw = Ca + PAGE_SIZE/u instead of Cpte + PAGE_SIZE/u?
> 

Because I was thinking of the cost of just accessing the struct page. Arguably
it would be both Ca and Cpte and if I wanted to be very comprehensive I
would also take into account the potential cost of kmapping the page in
the 32-bit case but it'd be overkill. The cost of the PTE and struct page
is negligible in comparison to the actual copy.
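
To put rough numbers on that, here is a minimal sketch of the arithmetic.
The helper names are made up for illustration, and the sizes used in the
note below (8-byte pointers, 4096-byte pages, a 64-byte struct page) are
assumptions for a typical 64-bit configuration, not values taken from the
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Ca: cost of touching a struct page, in units of u = sizeof(void *). */
unsigned long cost_ca(size_t struct_page_size, size_t unit)
{
	assert(unit != 0);
	return struct_page_size / unit;
}

/* Cpagerw = Ca + PAGE_SIZE/u: cost to read or write a full page. */
unsigned long cost_pagerw(size_t struct_page_size, size_t page_size,
			  size_t unit)
{
	return cost_ca(struct_page_size, unit) + page_size / unit;
}
```

With the assumed sizes, cost_ca(64, 8) is 8 and cost_pagerw(64, 4096, 8)
is 520, so also adding Cpte (another 8) would move the result by under 2%.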

> > Ci = Cost of page isolation = Ca + Wi
> > 	where Wi is a constant that should reflect the approximate cost
> > 	of the locking operation
> > Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
> > 	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
> > 	would imply that remote accesses are 20% more expensive
> > 
> > Balancing cost = Cpte * numa_pte_updates +
> > 		Cnumahint * numa_hint_faults +
> > 		Ci * numa_pages_migrated +
> > 		Cpagecopy * numa_pages_migrated
> > 
> 
> > Since Cpagecopy already includes Ci, why count Ci twice?
> 

Good point. Interestingly when I went to fix this in mmtests I found
that I accounted for Ci properly there but got it wrong in the
changelog.
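
For reference, a sketch of the changelog formula with the duplicate Ci
term dropped, so isolation is only counted through Cpagecopy. This is not
the mmtests implementation; the struct and function names are hypothetical
and every weight is supplied by the caller.

```c
#include <assert.h>

/* Weights from the changelog's cost model; all values are caller-supplied. */
struct numa_costs {
	double cpte;      /* Cpte: cost of a PTE access */
	double cnumahint; /* Cnumahint: cost of a NUMA hinting fault */
	double cpagerw;   /* Cpagerw: cost to read or write a full page */
	double ci;        /* Ci: cost of isolating a page */
	double wnuma;     /* Wnuma: NUMA factor, 1.0 means local */
};

/* Cpagecopy = Cpagerw + Cpagerw*Wnuma + Ci + Ci*Wnuma (already includes Ci). */
double cost_pagecopy(const struct numa_costs *c)
{
	return c->cpagerw + c->cpagerw * c->wnuma + c->ci + c->ci * c->wnuma;
}

/* Balancing cost with no separate Ci * numa_pages_migrated term. */
double balancing_cost(const struct numa_costs *c,
		      unsigned long numa_pte_updates,
		      unsigned long numa_hint_faults,
		      unsigned long numa_pages_migrated)
{
	assert(c->wnuma >= 1.0);
	return c->cpte * numa_pte_updates +
	       c->cnumahint * numa_hint_faults +
	       cost_pagecopy(c) * numa_pages_migrated;
}
```

The three counter arguments correspond to the vmstat counters named in
the changelog; the weights are whatever constants the evaluation picks.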

> > Note that numa_pages_migrated is used as a measure of how many pages
> > were isolated even though it would miss pages that failed to migrate. A
> > vmstat counter could have been added for it but the isolation cost is
> > pretty marginal in comparison to the overall cost so it seemed overkill.
> > 
> > The ideal way to measure automatic placement benefit would be to count
> > the number of remote accesses versus local accesses and do something like
> > 
> > 	benefit = (remote_accesses_before - remote_accesses_after) * Wnuma
> > 
> > but the information is not readily available. As a workload converges, the
> > expectation would be that the number of remote numa hints would reduce to 0.
> > 
> > 	convergence = numa_hint_faults_local / numa_hint_faults
> > 		where this is measured for the last N number of
> > 		numa hints recorded. When the workload is fully
> > 		converged the value is 1.
> > 
> 
> Is it better for convergence to tend to 0 or to 1?

1 is better.

> If it tends to 1, are Cpte * numa_pte_updates + Cnumahint *
> numa_hint_faults just waste? What am I missing?
> 

I don't get the question, waste of what? None of these calculations are
used by the kernel. The kernel only maintains counters and the point of
the changelog was to illustrate how the counters can be used to do some
meaningful evaluation.
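
As an illustration of that kind of evaluation, here is a minimal userspace
sketch that computes the convergence ratio from /proc/vmstat-style text
("name value" per line). The function names are hypothetical. Note that it
uses the cumulative counters, so to approximate "the last N hints" a
caller would have to sample twice and work on the differences.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find "<name> <value>" in vmstat-style text; returns 0 on success. */
int vmstat_lookup(const char *text, const char *name, unsigned long *value)
{
	size_t len = strlen(name);
	const char *line = text;

	assert(text && name && value);
	while (line && *line) {
		/*
		 * Require a space after the name so that "numa_hint_faults"
		 * does not match the "numa_hint_faults_local" line.
		 */
		if (strncmp(line, name, len) == 0 && line[len] == ' ')
			return sscanf(line + len, "%lu", value) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* convergence = numa_hint_faults_local / numa_hint_faults, or -1.0. */
double numa_convergence(const char *vmstat_text)
{
	unsigned long local = 0, total = 0;

	if (vmstat_lookup(vmstat_text, "numa_hint_faults_local", &local) ||
	    vmstat_lookup(vmstat_text, "numa_hint_faults", &total) ||
	    total == 0)
		return -1.0;
	return (double)local / (double)total;
}
```

A value close to 1 means nearly all hinting faults were already local; -1.0
signals that the counters were missing or that no hinting faults occurred.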

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 44/49] mm: numa: Add THP migration for the NUMA working set scanning fault case.
       [not found]   ` <20130105084229.GA3208@hacker.(null)>
@ 2013-01-07 15:37     ` Mel Gorman
  0 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2013-01-07 15:37 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Thomas Gleixner, Paul Turner,
	Hillf Danton, David Rientjes, Lee Schermerhorn, Alex Shi,
	Srikar Dronamraju, Aneesh Kumar, Linus Torvalds, Andrew Morton,
	Linux-MM, LKML

On Sat, Jan 05, 2013 at 04:42:29PM +0800, Wanpeng Li wrote:
> >+int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
> >+{
> >+	int ret = 0;
> >
> > 	/* Avoid migrating to a node that is nearly full */
> > 	if (migrate_balanced_pgdat(pgdat, 1)) {
> 
> Hi Mel Gorman,
> 
> Passing nr_migrate_pages = 1 is not correct: since balancenuma also
> supports THP in this patchset, the parameter should be
> 1 << compound_order(page) instead of 1. I'd rather change it to something like:
> 

True. The impact is marginal because it only applies when a node is almost
full but it does mean that we do some unnecessary work before migration
fails anyway. I've added a TODO item to fix it when I next revisit NUMA
balancing. Thanks.
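
To illustrate the difference, a small userspace sketch follows. It does
not reproduce migrate_balanced_pgdat(); node_has_room() is a hypothetical
stand-in for a "target node is nearly full" check, and the order-9 figure
in the note below assumes 2MB huge pages on 4KB base pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of base pages in a compound page of the given order. */
unsigned long nr_pages_for_order(unsigned int order)
{
	assert(order < sizeof(unsigned long) * 8);
	return 1UL << order;
}

/*
 * Hypothetical model of the fullness check: the node has room only if
 * its free pages stay at or above the watermark after the migration.
 */
bool node_has_room(unsigned long free_pages, unsigned long watermark,
		   unsigned long nr_migrate_pages)
{
	return free_pages >= watermark + nr_migrate_pages;
}
```

With 1100 free pages and a watermark of 1000, asking for 1 page passes
the check, while asking for nr_pages_for_order(9) = 512 pages fails it.
That is the case where passing 1 for a huge page lets the migration start
and fail later.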

-- 
Mel Gorman
SUSE Labs


Thread overview: 80+ messages
2012-12-07 10:23 [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
2012-12-07 10:23 ` [PATCH 01/49] x86: mm: only do a local tlb flush in ptep_set_access_flags() Mel Gorman
2012-12-07 10:23 ` [PATCH 02/49] x86: mm: drop TLB flush from ptep_set_access_flags Mel Gorman
2012-12-07 10:23 ` [PATCH 03/49] mm,generic: only flush the local TLB in ptep_set_access_flags Mel Gorman
2012-12-07 10:23 ` [PATCH 04/49] x86/mm: Introduce pte_accessible() Mel Gorman
2012-12-07 10:23 ` [PATCH 05/49] mm: Only flush the TLB when clearing an accessible pte Mel Gorman
2012-12-07 10:23 ` [PATCH 06/49] mm: Count the number of pages affected in change_protection() Mel Gorman
2012-12-07 10:23 ` [PATCH 07/49] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users Mel Gorman
2012-12-07 10:23 ` [PATCH 08/49] mm: compaction: Move migration fail/success stats to migrate.c Mel Gorman
2012-12-07 10:23 ` [PATCH 09/49] mm: migrate: Add a tracepoint for migrate_pages Mel Gorman
2012-12-07 10:23 ` [PATCH 10/49] mm: compaction: Add scanned and isolated counters for compaction Mel Gorman
2012-12-07 10:23 ` [PATCH 11/49] mm: numa: define _PAGE_NUMA Mel Gorman
2012-12-07 10:23 ` [PATCH 12/49] mm: numa: pte_numa() and pmd_numa() Mel Gorman
2012-12-07 10:23 ` [PATCH 13/49] mm: numa: Support NUMA hinting page faults from gup/gup_fast Mel Gorman
2012-12-07 10:23 ` [PATCH 14/49] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte Mel Gorman
2012-12-07 10:23 ` [PATCH 15/49] mm: numa: Create basic numa page hinting infrastructure Mel Gorman
2012-12-07 10:23 ` [PATCH 16/49] mm: mempolicy: Make MPOL_LOCAL a real policy Mel Gorman
2012-12-07 10:23 ` [PATCH 17/49] mm: mempolicy: Add MPOL_NOOP Mel Gorman
2012-12-07 10:23 ` [PATCH 18/49] mm: mempolicy: Check for misplaced page Mel Gorman
2012-12-07 10:23 ` [PATCH 19/49] mm: migrate: Introduce migrate_misplaced_page() Mel Gorman
2012-12-07 10:23 ` [PATCH 20/49] mm: migrate: Drop the misplaced pages reference count if the target node is full Mel Gorman
2012-12-07 10:23 ` [PATCH 21/49] mm: mempolicy: Use _PAGE_NUMA to migrate pages Mel Gorman
2012-12-07 10:23 ` [PATCH 22/49] mm: mempolicy: Add MPOL_MF_LAZY Mel Gorman
2013-01-05  5:18   ` Simon Jeons
2013-01-07 15:14     ` Mel Gorman
2012-12-07 10:23 ` [PATCH 23/49] mm: mempolicy: Implement change_prot_numa() in terms of change_protection() Mel Gorman
2012-12-07 10:23 ` [PATCH 24/49] mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now Mel Gorman
2012-12-07 10:23 ` [PATCH 25/49] mm: numa: Add fault driven placement and migration Mel Gorman
2013-01-04 11:56   ` Simon Jeons
2012-12-07 10:23 ` [PATCH 26/49] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Mel Gorman
2012-12-07 10:23 ` [PATCH 27/49] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Mel Gorman
2012-12-07 10:23 ` [PATCH 28/49] mm: sched: numa: Implement slow start for working set sampling Mel Gorman
2012-12-07 10:23 ` [PATCH 29/49] mm: numa: Add pte updates, hinting and migration stats Mel Gorman
2013-01-04 11:42   ` Simon Jeons
2013-01-07 15:29     ` Mel Gorman
2012-12-07 10:23 ` [PATCH 30/49] mm: numa: Migrate on reference policy Mel Gorman
2012-12-07 10:23 ` [PATCH 31/49] mm: numa: Migrate pages handled during a pmd_numa hinting fault Mel Gorman
2012-12-07 10:23 ` [PATCH 32/49] mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting Mel Gorman
2012-12-07 10:23 ` [PATCH 33/49] mm: numa: Rate limit the amount of memory that is migrated between nodes Mel Gorman
2012-12-07 10:23 ` [PATCH 34/49] mm: numa: Rate limit setting of pte_numa if node is saturated Mel Gorman
2012-12-07 10:23 ` [PATCH 35/49] sched: numa: Slowly increase the scanning period as NUMA faults are handled Mel Gorman
2012-12-07 10:23 ` [PATCH 36/49] mm: numa: Introduce last_nid to the page frame Mel Gorman
2012-12-07 10:23 ` [PATCH 37/49] mm: numa: split_huge_page: Transfer last_nid on tail page Mel Gorman
2012-12-07 10:23 ` [PATCH 38/49] mm: numa: migrate: Set last_nid on newly allocated page Mel Gorman
2012-12-07 10:23 ` [PATCH 39/49] mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships Mel Gorman
2012-12-07 10:23 ` [PATCH 40/49] mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate Mel Gorman
2012-12-07 10:23 ` [PATCH 41/49] mm: sched: numa: Control enabling and disabling of NUMA balancing Mel Gorman
2012-12-07 10:23 ` [PATCH 42/49] mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG Mel Gorman
2012-12-07 10:23 ` [PATCH 43/49] mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node Mel Gorman
2012-12-07 10:23 ` [PATCH 44/49] mm: numa: Add THP migration for the NUMA working set scanning fault case Mel Gorman
     [not found]   ` <20130105084229.GA3208@hacker.(null)>
2013-01-07 15:37     ` Mel Gorman
2012-12-07 10:23 ` [PATCH 45/49] mm: numa: Add THP migration for the NUMA working set scanning fault case build fix Mel Gorman
2012-12-07 10:23 ` [PATCH 46/49] mm: numa: Account for failed allocations and isolations as migration failures Mel Gorman
2012-12-07 10:23 ` [PATCH 47/49] mm: migrate: Account a transhuge page properly when rate limiting Mel Gorman
2012-12-07 10:23 ` [PATCH 48/49] mm/rmap: Convert the struct anon_vma::mutex to an rwsem Mel Gorman
2012-12-07 10:23 ` [PATCH 49/49] mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable Mel Gorman
2012-12-07 11:01 ` [PATCH 00/49] Automatic NUMA Balancing v10 Ingo Molnar
2012-12-09 20:36   ` Mel Gorman
2012-12-09 21:17     ` Kirill A. Shutemov
2012-12-10  8:44       ` Mel Gorman
2012-12-10  5:07     ` Srikar Dronamraju
2012-12-10  6:28       ` Srikar Dronamraju
2012-12-10 12:44         ` [PATCH] sched: Fix task_numa_fault() + KSM crash Ingo Molnar
2012-12-13 13:57           ` Srikar Dronamraju
2012-12-10  8:46       ` [PATCH 00/49] Automatic NUMA Balancing v10 Mel Gorman
2012-12-10 12:35       ` Ingo Molnar
2012-12-10 11:39     ` Ingo Molnar
2012-12-10 11:53       ` Ingo Molnar
2012-12-10 15:24       ` Mel Gorman
2012-12-11  1:02         ` Mel Gorman
2012-12-11  8:52           ` Ingo Molnar
2012-12-11  9:18             ` Ingo Molnar
2012-12-11 15:22               ` Mel Gorman
2012-12-11 16:30             ` Mel Gorman
2012-12-17 10:33               ` Ingo Molnar
2012-12-10 16:42 ` Srikar Dronamraju
2012-12-10 19:23   ` Ingo Molnar
2012-12-10 23:35     ` Srikar Dronamraju
2012-12-10 23:40   ` Srikar Dronamraju
2012-12-13 13:21 ` Srikar Dronamraju
