* [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
@ 2013-10-07 10:28 ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This series has roughly the same goals as previous versions despite the
size. It reduces overhead of automatic balancing through scan rate reduction
and the avoidance of TLB flushes. It selects a preferred node and moves tasks
towards their memory as well as moving memory toward their task. It handles
shared pages and groups related tasks together. Some problems such as shared
page interleaving and properly dealing with processes that are larger than
a node are being deferred. This version should be ready for wider testing
in -tip.

Note that with kernel 3.12-rc3, automatic NUMA balancing will fail to boot if
CONFIG_JUMP_LABEL is configured. This is a separate bug that is currently
being dealt with.

Changelog since V8
o Rebased to v3.12-rc3
o Handle races against hotplug

Changelog since V7
o THP migration race and pagetable insertion fixes
o Do not handle PMDs in batch
o Shared page migration throttling
o Various changes to how last nid/pid information is recorded
o False pid match sanity checks when joining NUMA task groups
o Adapt scan rate based on local/remote fault statistics
o Periodic retry of migration to preferred node
o Limit scope of system-wide search
o Schedule threads on the same node as process that created them
o Cleanup numa_group on exec

Changelog since V6
o Group tasks that share pages together
o More scan avoidance of VMAs mapping pages that are not likely to migrate
o cpupid conversion, system-wide searching of tasks to balance with

Changelog since V6
o Various TLB flush optimisations
o Comment updates
o Sanitise task_numa_fault callsites for consistent semantics
o Revert some of the scanning adaption stuff
o Revert patch that defers scanning until task schedules on another node
o Start delayed scanning properly
o Avoid the same task always performing the PTE scan
o Continue PTE scanning even if migration is rate limited

Changelog since V5
o Add __GFP_NOWARN for numa hinting fault count
o Use is_huge_zero_page
o Favour moving tasks towards nodes with higher faults
o Optionally resist moving tasks towards nodes with lower faults
o Scan shared THP pages

Changelog since V4
o Added code that avoids overloading preferred nodes
o Swap tasks if nodes are overloaded and the swap does not impair locality

Changelog since V3
o Correct detection of unset last nid/pid information
o Dropped nr_preferred_running and replaced it with Peter's load balancing
o Pass in correct node information for THP hinting faults
o Pressure tasks sharing a THP page to move towards same node
o Do not set pmd_numa if false sharing is detected

Changelog since V2
o Reshuffle to match Peter's implied preference for layout
o Reshuffle to move private/shared split towards end of series to make it
  easier to evaluate the impact
o Use PID information to identify private accesses
o Set the floor for PTE scanning based on virtual address space scan rates
  instead of time
o Some locking improvements
o Do not preempt pinned tasks unless they are kernel threads

Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
  task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
  preferred node
o Laughably basic accounting of a compute overloaded node when selecting
  the preferred node.
o Applied review comments

This series integrates basic scheduler support for automatic NUMA balancing.
It was initially based on Peter Zijlstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).
There has been a tonne of additional work from both Peter and Rik van Riel.

Some reports indicate that the performance is getting close to manual
bindings for some workloads but your mileage will vary.

Patch 1 is a monolithic dump of patches that are destined for upstream that
	this series indirectly depends upon.

Patches 2-3 add sysctl documentation and comment fixlets

Patch 4 avoids accounting for a hinting fault if another thread handled the
	fault in parallel

Patches 5-6 avoid races with parallel THP migration and THP splits.

Patch 7 corrects a THP NUMA hint fault accounting bug

Patches 8-9 avoid TLB flushes during the PTE scan if no updates are made

Patch 10 sanitizes task_numa_fault callsites to have consistent semantics and
	always record the fault based on the correct location of the page.

Patch 11 closes races between THP migration and PMD clearing.

Patch 12 avoids trying to migrate the THP zero page

Patch 13 avoids the same task being selected to perform the PTE scan within
	a shared address space.

Patch 14 continues PTE scanning even if migration rate limited

Patch 15 notes that delaying the PTE scan until a task is scheduled on an
	alternative node misses the case where the task is only accessing
	shared memory on a partially loaded machine, and reverts the patch
	that introduced the delay.

Patch 16 initialises numa_next_scan properly so that PTE scanning is delayed
	when a process starts.

Patch 17 sets the scan rate proportional to the size of the task being
	scanned.

Patch 18 slows the scan rate if no hinting faults were trapped by an idle task.
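
For illustration only, a minimal userspace C sketch of the idea behind patch
17; the names and constants are invented and this is not the kernel
implementation. The point is that the scan period grows with the size of the
address space so the sampling cost per task stays roughly constant.

#include <stdio.h>

/* Illustrative constants, not the kernel's tunables. */
#define BASE_SCAN_PERIOD_MS   1000UL   /* period for a "reference" task size */
#define REFERENCE_TASK_PAGES  65536UL  /* size the base period is tuned for  */
#define MAX_SCAN_PERIOD_MS    60000UL

/*
 * Scale the scan period linearly with the task's mapped pages so that
 * larger address spaces are scanned less frequently.
 */
static unsigned long scan_period_ms(unsigned long task_pages)
{
	unsigned long period;

	if (task_pages == 0)
		return BASE_SCAN_PERIOD_MS;

	period = BASE_SCAN_PERIOD_MS * task_pages / REFERENCE_TASK_PAGES;
	if (period < BASE_SCAN_PERIOD_MS)
		period = BASE_SCAN_PERIOD_MS;
	if (period > MAX_SCAN_PERIOD_MS)
		period = MAX_SCAN_PERIOD_MS;
	return period;
}

int main(void)
{
	unsigned long sizes[] = { 16384, 65536, 1048576 };

	for (unsigned int i = 0; i < 3; i++)
		printf("%lu pages -> scan every %lu ms\n",
		       sizes[i], scan_period_ms(sizes[i]));
	return 0;
}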

Patch 19 tracks NUMA hinting faults per-task and per-node

Patches 20-24 select a preferred node at the end of a PTE scan based on which
	node incurred the highest number of NUMA faults. When the balancer
	is comparing two CPUs it will prefer to locate tasks on their
	preferred node. When the preferred node is initially selected, the
	task is rescheduled onto it if it is not running there already. This
	avoids waiting for the scheduler to move the task slowly.
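
For illustration only (plain C with invented names, not the scheduler code):
the preferred node is simply the one that absorbed the most hinting faults
for the task in the last scan window.

#include <stdio.h>

#define NR_NODES 4

/*
 * Pick the node that incurred the most NUMA hinting faults for this
 * task during the last scan window. In the series the scheduler then
 * tries to run the task on, or migrate it towards, that node.
 */
static int preferred_node(const unsigned long faults[NR_NODES])
{
	int best = 0;

	for (int nid = 1; nid < NR_NODES; nid++)
		if (faults[nid] > faults[best])
			best = nid;
	return best;
}

int main(void)
{
	unsigned long faults[NR_NODES] = { 120, 4500, 80, 300 };

	printf("preferred node: %d\n", preferred_node(faults));
	return 0;
}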

Patch 25 adds infrastructure to allow separate tracking of shared/private
	pages but treats all faults as if they are private accesses. Laying
	it out this way reduces churn later in the series when private
	fault detection is introduced

Patch 26 avoids some unnecessary allocation

Patches 27-28 kick away some training wheels and scan shared pages and
	small VMAs.

Patch 29 introduces private fault detection based on the PID of the faulting
	process and accounts for shared/private accesses differently.
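
For illustration, the classification can be pictured with the sketch below:
each page remembers a truncated copy of the pid that last faulted on it, and
a repeat fault by the same pid is treated as a private access. The field
width and helper names are invented; the truncation is also why patch 56
later has to check for false matches.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative: only the low bits of the pid fit in the page flags. */
#define PID_BITS 8
#define PID_MASK ((1u << PID_BITS) - 1)

struct fake_page {
	unsigned int last_pid;	/* truncated pid of the last faulting task */
};

/*
 * Classify a hinting fault: if the same (truncated) pid faulted on the
 * page last time, treat the access as private; otherwise shared.
 * Record the current pid either way for the next fault.
 */
static bool fault_is_private(struct fake_page *page, unsigned int pid)
{
	unsigned int this_pid = pid & PID_MASK;
	bool private = (page->last_pid == this_pid);

	page->last_pid = this_pid;
	return private;
}

int main(void)
{
	struct fake_page page = { .last_pid = 0 };

	printf("pid 1000: %s\n", fault_is_private(&page, 1000) ? "private" : "shared");
	printf("pid 1000: %s\n", fault_is_private(&page, 1000) ? "private" : "shared");
	printf("pid 1001: %s\n", fault_is_private(&page, 1001) ? "private" : "shared");
	return 0;
}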

Patch 30 avoids migrating memory immediately after the load balancer moves
	a task to another node in case it's a transient migration.

Patch 31 avoids scanning VMAs that do not migrate-on-fault which addresses
	a serious regression on a database performance test.

Patch 32 picks the least loaded CPU on the preferred node based on
	a scheduling domain common to both the source and destination
	NUMA nodes.

Patch 33 retries task migration if an earlier attempt failed

Patch 34 will begin task migration immediately if running on its preferred
	node

Patch 35 will avoid trapping hinting faults for shared read-only library
	pages as these never migrate anyway

Patch 36 avoids handling pmd hinting faults if none of the ptes below it were
	marked pte numa

Patches 37-38 introduce a mechanism for swapping tasks

Patch 39 uses a system-wide search to find tasks that can be swapped
	to improve the overall locality of the system.
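
As a rough model of that decision (not the actual implementation, which
weighs far more than this, and using invented names), a swap is worthwhile
when the two tasks would incur more of their recorded faults locally after
exchanging nodes than they do now:

#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 4

struct fake_task {
	int node;				/* node the task currently runs on */
	unsigned long faults[NR_NODES];		/* per-node hinting fault counts   */
};

/*
 * A swap improves overall locality when the combined fault count on the
 * nodes the tasks would move to is higher than on the nodes they occupy
 * now.
 */
static bool swap_improves_locality(const struct fake_task *a,
				   const struct fake_task *b)
{
	unsigned long before = a->faults[a->node] + b->faults[b->node];
	unsigned long after  = a->faults[b->node] + b->faults[a->node];

	return after > before;
}

int main(void)
{
	struct fake_task a = { .node = 0, .faults = { 100, 900, 0, 0 } };
	struct fake_task b = { .node = 1, .faults = { 800, 200, 0, 0 } };

	printf("swap worthwhile: %s\n",
	       swap_improves_locality(&a, &b) ? "yes" : "no");
	return 0;
}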

Patch 40 notes that the system-wide search may ignore the preferred node and
	will use the preferred node placement if it has spare compute
	capacity.

Patch 41 will perform a global search if a node that should have had capacity
	cannot have a task migrated to it

Patches 42-43 use cpupid to track pages so potential sharing tasks can
	be quickly found

Patch 44 reports the ID of the numa group a task belongs to.

Patch 45 copies the cpupid on page migration

Patch 46 avoids grouping based on read-only pages

Patch 47 stops handling pages within a PMD in batch as it distorts fault
	statistics and fails to flush TLBs correctly.

Patch 48 schedules new threads on the same node as the parent.

Patch 49 schedules tasks based on their numa group

Patch 50 cleans up a task's numa_group on exec

Patch 51 avoids parallel updates to group stats

Patch 52 adds some debugging aids

Patches 53-54 separately consider task and group weights when selecting the node to
	schedule a task on
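
For illustration, the combined scoring can be sketched as below with invented
names and a simple unweighted sum; the point is only that a node is judged by
the task's own faults plus those of its NUMA group, so threads sharing data
gravitate to the same node even when each thread's private fault count is
small.

#include <stdio.h>

#define NR_NODES 4

/*
 * Score a candidate node by the task's own hinting faults there plus
 * the faults accumulated by every task in its NUMA group.
 */
static unsigned long node_score(const unsigned long task_faults[NR_NODES],
				const unsigned long group_faults[NR_NODES],
				int nid)
{
	return task_faults[nid] + group_faults[nid];
}

int main(void)
{
	unsigned long task_faults[NR_NODES]  = { 50, 200, 10, 0 };
	unsigned long group_faults[NR_NODES] = { 4000, 300, 20, 0 };

	for (int nid = 0; nid < NR_NODES; nid++)
		printf("node %d score %lu\n", nid,
		       node_score(task_faults, group_faults, nid));
	return 0;
}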

Patch 56 checks if PID truncation may have caused false matches before joining tasks
	to a NUMA group

Patch 57 uses the false sharing detection information for scan rate adaptation later

Patch 58 adapts the scan rate based on local/remote faults
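
A minimal sketch of that direction with made-up thresholds and bounds (not
the sysctls the series actually uses): back off when most faults are already
local, speed up when they are mostly remote.

#include <stdio.h>

/* Illustrative bounds, not the kernel's tunables. */
#define MIN_SCAN_PERIOD_MS  1000UL
#define MAX_SCAN_PERIOD_MS 60000UL

/*
 * Adapt the scan period from the local/remote fault mix of the last
 * window: mostly-local faults mean placement is already good, so back
 * off; mostly-remote faults mean more scanning (and migration) may help.
 */
static unsigned long adapt_scan_period(unsigned long period_ms,
				       unsigned long local, unsigned long remote)
{
	unsigned long total = local + remote;

	if (total == 0)
		return period_ms;

	if (local * 100 >= total * 70)		/* >= 70% local: slow down */
		period_ms *= 2;
	else if (local * 100 <= total * 30)	/* <= 30% local: speed up  */
		period_ms /= 2;

	if (period_ms < MIN_SCAN_PERIOD_MS)
		period_ms = MIN_SCAN_PERIOD_MS;
	if (period_ms > MAX_SCAN_PERIOD_MS)
		period_ms = MAX_SCAN_PERIOD_MS;
	return period_ms;
}

int main(void)
{
	printf("%lu\n", adapt_scan_period(4000, 900, 100));  /* mostly local  */
	printf("%lu\n", adapt_scan_period(4000, 100, 900));  /* mostly remote */
	return 0;
}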

Patch 59 removes the periodic scan rate reset

Patches 60-61 throttle shared page migrations

Patch 62 avoids the use of atomics and protects the values with a spinlock instead

Patch 63 periodically retries migrating a task back to its preferred node

Kernel 3.12-rc3 is the testing baseline.

o account-v9		Patches 1-8
o periodretry-v9	Patches 1-63

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system.

specjbb
                   3.12.0-rc3            3.12.0-rc3
                 account-v9        periodretry-v9  
TPut 1      26187.00 (  0.00%)     25922.00 ( -1.01%)
TPut 2      55752.00 (  0.00%)     53928.00 ( -3.27%)
TPut 3      88878.00 (  0.00%)     84689.00 ( -4.71%)
TPut 4     111226.00 (  0.00%)    111843.00 (  0.55%)
TPut 5     138700.00 (  0.00%)    139712.00 (  0.73%)
TPut 6     173467.00 (  0.00%)    161226.00 ( -7.06%)
TPut 7     197609.00 (  0.00%)    194035.00 ( -1.81%)
TPut 8     220501.00 (  0.00%)    218853.00 ( -0.75%)
TPut 9     247997.00 (  0.00%)    244480.00 ( -1.42%)
TPut 10    275616.00 (  0.00%)    269962.00 ( -2.05%)
TPut 11    301610.00 (  0.00%)    301051.00 ( -0.19%)
TPut 12    326151.00 (  0.00%)    318040.00 ( -2.49%)
TPut 13    341671.00 (  0.00%)    346890.00 (  1.53%)
TPut 14    372805.00 (  0.00%)    367204.00 ( -1.50%)
TPut 15    390175.00 (  0.00%)    371538.00 ( -4.78%)
TPut 16    406716.00 (  0.00%)    409835.00 (  0.77%)
TPut 17    429094.00 (  0.00%)    436172.00 (  1.65%)
TPut 18    457167.00 (  0.00%)    456528.00 ( -0.14%)
TPut 19    476963.00 (  0.00%)    479680.00 (  0.57%)
TPut 20    492751.00 (  0.00%)    480019.00 ( -2.58%)
TPut 21    514952.00 (  0.00%)    511950.00 ( -0.58%)
TPut 22    521962.00 (  0.00%)    516450.00 ( -1.06%)
TPut 23    537268.00 (  0.00%)    532825.00 ( -0.83%)
TPut 24    541231.00 (  0.00%)    539425.00 ( -0.33%)
TPut 25    530459.00 (  0.00%)    538714.00 (  1.56%)
TPut 26    538837.00 (  0.00%)    524894.00 ( -2.59%)
TPut 27    534132.00 (  0.00%)    519628.00 ( -2.72%)
TPut 28    529470.00 (  0.00%)    519044.00 ( -1.97%)
TPut 29    504426.00 (  0.00%)    514158.00 (  1.93%)
TPut 30    514785.00 (  0.00%)    513080.00 ( -0.33%)
TPut 31    501018.00 (  0.00%)    492377.00 ( -1.72%)
TPut 32    488377.00 (  0.00%)    492108.00 (  0.76%)
TPut 33    484809.00 (  0.00%)    493612.00 (  1.82%)
TPut 34    473015.00 (  0.00%)    477716.00 (  0.99%)
TPut 35    451833.00 (  0.00%)    455368.00 (  0.78%)
TPut 36    445787.00 (  0.00%)    460138.00 (  3.22%)
TPut 37    446034.00 (  0.00%)    453011.00 (  1.56%)
TPut 38    433305.00 (  0.00%)    441966.00 (  2.00%)
TPut 39    431202.00 (  0.00%)    443747.00 (  2.91%)
TPut 40    420040.00 (  0.00%)    432818.00 (  3.04%)
TPut 41    416519.00 (  0.00%)    424105.00 (  1.82%)
TPut 42    426047.00 (  0.00%)    430164.00 (  0.97%)
TPut 43    421725.00 (  0.00%)    419106.00 ( -0.62%)
TPut 44    414340.00 (  0.00%)    425471.00 (  2.69%)
TPut 45    413836.00 (  0.00%)    418506.00 (  1.13%)
TPut 46    403636.00 (  0.00%)    421177.00 (  4.35%)
TPut 47    387726.00 (  0.00%)    388190.00 (  0.12%)
TPut 48    405375.00 (  0.00%)    418321.00 (  3.19%)

Mostly flat. Profiles were interesting because they showed heavy contention
on the mm->page_table_lock due to THP faults and migration. It is expected
that Kirill's page table lock split work will help here. At the time of
writing, that series has been rebased on top of this one for testing.

specjbb Peaks
                                3.12.0-rc3               3.12.0-rc3
                              account-v9           periodretry-v9  
 Expctd Warehouse          48.00 (  0.00%)          48.00 (  0.00%)
 Expctd Peak Bops      387726.00 (  0.00%)      388190.00 (  0.12%)
 Actual Warehouse          25.00 (  0.00%)          25.00 (  0.00%)
 Actual Peak Bops      541231.00 (  0.00%)      539425.00 ( -0.33%)
 SpecJBB Bops            8273.00 (  0.00%)        8537.00 (  3.19%)
 SpecJBB Bops/JVM        8273.00 (  0.00%)        8537.00 (  3.19%)

Minor gain in the overall specjbb score but the peak performance is
slightly lower.

          3.12.0-rc3  3.12.0-rc3
        account-v9  periodretry-v9  
User        44731.08    44820.18
System        189.53      124.16
Elapsed      1665.71     1666.42

                            3.12.0-rc3  3.12.0-rc3
                          account-v9  periodretry-v9  
Minor Faults                   3815276     4471086
Major Faults                       108         131
Compaction cost                  12002        3214
NUMA PTE updates              17955537     3849428
NUMA hint faults               3950201     3822150
NUMA hint local faults         1032610     1029273
NUMA hint local percent             26          26
NUMA pages migrated           11562658     3096443
AutoNUMA cost                    20096       19196

As with previous releases system CPU usage is generally lower with fewer
scans.

autonumabench
                                     3.12.0-rc3            3.12.0-rc3
                                   account-v9        periodretry-v9  
User    NUMA01               43871.21 (  0.00%)    53162.55 (-21.18%)
User    NUMA01_THEADLOCAL    25270.59 (  0.00%)    28868.37 (-14.24%)
User    NUMA02                2196.67 (  0.00%)     2110.35 (  3.93%)
User    NUMA02_SMT            1039.18 (  0.00%)     1035.41 (  0.36%)
System  NUMA01                 187.11 (  0.00%)      154.69 ( 17.33%)
System  NUMA01_THEADLOCAL      216.47 (  0.00%)       95.47 ( 55.90%)
System  NUMA02                   3.52 (  0.00%)        3.26 (  7.39%)
System  NUMA02_SMT               2.42 (  0.00%)        2.03 ( 16.12%)
Elapsed NUMA01                 970.59 (  0.00%)     1199.46 (-23.58%)
Elapsed NUMA01_THEADLOCAL      569.11 (  0.00%)      643.37 (-13.05%)
Elapsed NUMA02                  51.59 (  0.00%)       49.94 (  3.20%)
Elapsed NUMA02_SMT              49.73 (  0.00%)       50.29 ( -1.13%)
CPU     NUMA01                4539.00 (  0.00%)     4445.00 (  2.07%)
CPU     NUMA01_THEADLOCAL     4478.00 (  0.00%)     4501.00 ( -0.51%)
CPU     NUMA02                4264.00 (  0.00%)     4231.00 (  0.77%)
CPU     NUMA02_SMT            2094.00 (  0.00%)     2062.00 (  1.53%)

The numa01 case (an adverse workload) is hit quite badly, but it often is. The
numa01-threadlocal regression is of greater concern and will be examined
further. It is interesting to note that monitoring the workload affects
the results quite severely; these results were gathered with no monitoring.

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running per node on the system.

specjbb
                     3.12.0-rc3            3.12.0-rc3
                   account-v9        periodretry-v9  
Mean   1      30900.00 (  0.00%)     29541.50 ( -4.40%)
Mean   2      62820.50 (  0.00%)     63330.25 (  0.81%)
Mean   3      92803.00 (  0.00%)     92629.75 ( -0.19%)
Mean   4     119122.25 (  0.00%)    121981.75 (  2.40%)
Mean   5     142391.00 (  0.00%)    148290.50 (  4.14%)
Mean   6     151073.00 (  0.00%)    169823.75 ( 12.41%)
Mean   7     152618.50 (  0.00%)    166411.00 (  9.04%)
Mean   8     141284.25 (  0.00%)    153222.00 (  8.45%)
Mean   9     136055.25 (  0.00%)    139262.50 (  2.36%)
Mean   10    124290.50 (  0.00%)    133464.50 (  7.38%)
Mean   11    139939.25 (  0.00%)    159681.25 ( 14.11%)
Mean   12    137545.75 (  0.00%)    159829.50 ( 16.20%)
Mean   13    133607.25 (  0.00%)    157809.00 ( 18.11%)
Mean   14    135512.00 (  0.00%)    153510.50 ( 13.28%)
Mean   15    132730.75 (  0.00%)    151627.25 ( 14.24%)
Mean   16    129924.25 (  0.00%)    148248.00 ( 14.10%)
Mean   17    130339.00 (  0.00%)    149250.00 ( 14.51%)
Mean   18    124314.25 (  0.00%)    146486.50 ( 17.84%)
Mean   19    120331.25 (  0.00%)    143616.75 ( 19.35%)
Mean   20    118827.25 (  0.00%)    141381.50 ( 18.98%)
Mean   21    120938.25 (  0.00%)    138196.75 ( 14.27%)
Mean   22    118660.75 (  0.00%)    136879.50 ( 15.35%)
Mean   23    117005.75 (  0.00%)    134200.50 ( 14.70%)
Mean   24    112711.50 (  0.00%)    131302.50 ( 16.49%)
Mean   25    115458.50 (  0.00%)    129939.25 ( 12.54%)
Mean   26    114008.50 (  0.00%)    128834.50 ( 13.00%)
Mean   27    115063.50 (  0.00%)    128394.00 ( 11.59%)
Mean   28    114359.50 (  0.00%)    124072.50 (  8.49%)
Mean   29    113637.50 (  0.00%)    124954.50 (  9.96%)
Mean   30    113392.75 (  0.00%)    123941.75 (  9.30%)
Mean   31    115131.25 (  0.00%)    121477.75 (  5.51%)
Mean   32    112004.00 (  0.00%)    122235.00 (  9.13%)
Mean   33    111287.50 (  0.00%)    120992.50 (  8.72%)
Mean   34    111206.75 (  0.00%)    118769.75 (  6.80%)
Mean   35    108469.50 (  0.00%)    120061.50 ( 10.69%)
Mean   36    105932.00 (  0.00%)    118039.75 ( 11.43%)
Mean   37    107428.00 (  0.00%)    118295.75 ( 10.12%)
Mean   38    102804.75 (  0.00%)    120519.50 ( 17.23%)
Mean   39    104095.00 (  0.00%)    121461.50 ( 16.68%)
Mean   40    103460.00 (  0.00%)    122506.50 ( 18.41%)
Mean   41    100417.00 (  0.00%)    118570.50 ( 18.08%)
Mean   42    101025.75 (  0.00%)    120612.00 ( 19.39%)
Mean   43    100311.75 (  0.00%)    120743.50 ( 20.37%)
Mean   44    101769.00 (  0.00%)    120410.25 ( 18.32%)
Mean   45     99649.25 (  0.00%)    121260.50 ( 21.69%)
Mean   46    101178.50 (  0.00%)    121210.75 ( 19.80%)
Mean   47    101148.75 (  0.00%)    119994.25 ( 18.63%)
Mean   48    103446.00 (  0.00%)    120204.50 ( 16.20%)
Stddev 1        940.15 (  0.00%)      1277.19 (-35.85%)
Stddev 2        292.47 (  0.00%)      1851.80 (-533.15%)
Stddev 3       1750.78 (  0.00%)      1808.61 ( -3.30%)
Stddev 4        859.01 (  0.00%)      2790.10 (-224.80%)
Stddev 5       3236.13 (  0.00%)      1892.19 ( 41.53%)
Stddev 6       2489.07 (  0.00%)      2157.76 ( 13.31%)
Stddev 7       1981.85 (  0.00%)      4299.27 (-116.93%)
Stddev 8       2586.24 (  0.00%)      3090.27 (-19.49%)
Stddev 9       7250.82 (  0.00%)      4762.66 ( 34.32%)
Stddev 10      1242.89 (  0.00%)      1448.14 (-16.51%)
Stddev 11      1631.31 (  0.00%)      9758.25 (-498.19%)
Stddev 12      1964.66 (  0.00%)     17425.60 (-786.95%)
Stddev 13      2080.24 (  0.00%)     17824.45 (-756.84%)
Stddev 14      1362.07 (  0.00%)     18551.85 (-1262.03%)
Stddev 15      3142.86 (  0.00%)     20410.21 (-549.42%)
Stddev 16      2026.28 (  0.00%)     19767.72 (-875.57%)
Stddev 17      2059.98 (  0.00%)     19358.07 (-839.72%)
Stddev 18      2832.80 (  0.00%)     19434.41 (-586.05%)
Stddev 19      4248.17 (  0.00%)     19590.94 (-361.16%)
Stddev 20      3163.70 (  0.00%)     18608.43 (-488.19%)
Stddev 21      1046.22 (  0.00%)     17766.10 (-1598.13%)
Stddev 22      1458.72 (  0.00%)     16295.25 (-1017.09%)
Stddev 23      1453.80 (  0.00%)     16933.28 (-1064.76%)
Stddev 24      3387.76 (  0.00%)     17276.97 (-409.98%)
Stddev 25       467.26 (  0.00%)     17228.85 (-3587.21%)
Stddev 26       269.10 (  0.00%)     17614.19 (-6445.71%)
Stddev 27      1024.92 (  0.00%)     16197.85 (-1480.40%)
Stddev 28      2547.19 (  0.00%)     22532.91 (-784.62%)
Stddev 29      2496.51 (  0.00%)     21734.79 (-770.61%)
Stddev 30      1777.21 (  0.00%)     22407.22 (-1160.81%)
Stddev 31      2948.17 (  0.00%)     22046.59 (-647.81%)
Stddev 32      3045.75 (  0.00%)     21317.50 (-599.91%)
Stddev 33      3088.42 (  0.00%)     24073.34 (-679.47%)
Stddev 34      1695.86 (  0.00%)     25483.66 (-1402.69%)
Stddev 35      2392.89 (  0.00%)     22319.81 (-832.76%)
Stddev 36      1002.99 (  0.00%)     24788.30 (-2371.43%)
Stddev 37      1246.07 (  0.00%)     22969.98 (-1743.39%)
Stddev 38      3340.47 (  0.00%)     17764.75 (-431.80%)
Stddev 39       951.45 (  0.00%)     17467.43 (-1735.88%)
Stddev 40      1861.87 (  0.00%)     16746.88 (-799.47%)
Stddev 41      3019.63 (  0.00%)     22203.85 (-635.32%)
Stddev 42      3305.80 (  0.00%)     19226.07 (-481.59%)
Stddev 43      2149.96 (  0.00%)     19788.85 (-820.43%)
Stddev 44      4743.81 (  0.00%)     20232.47 (-326.50%)
Stddev 45      3701.87 (  0.00%)     19876.40 (-436.93%)
Stddev 46      3742.49 (  0.00%)     17963.46 (-379.99%)
Stddev 47      1637.98 (  0.00%)     20138.13 (-1129.45%)
Stddev 48      2192.84 (  0.00%)     16729.79 (-662.93%)
TPut   1     123600.00 (  0.00%)    118166.00 ( -4.40%)
TPut   2     251282.00 (  0.00%)    253321.00 (  0.81%)
TPut   3     371212.00 (  0.00%)    370519.00 ( -0.19%)
TPut   4     476489.00 (  0.00%)    487927.00 (  2.40%)
TPut   5     569564.00 (  0.00%)    593162.00 (  4.14%)
TPut   6     604292.00 (  0.00%)    679295.00 ( 12.41%)
TPut   7     610474.00 (  0.00%)    665644.00 (  9.04%)
TPut   8     565137.00 (  0.00%)    612888.00 (  8.45%)
TPut   9     544221.00 (  0.00%)    557050.00 (  2.36%)
TPut   10    497162.00 (  0.00%)    533858.00 (  7.38%)
TPut   11    559757.00 (  0.00%)    638725.00 ( 14.11%)
TPut   12    550183.00 (  0.00%)    639318.00 ( 16.20%)
TPut   13    534429.00 (  0.00%)    631236.00 ( 18.11%)
TPut   14    542048.00 (  0.00%)    614042.00 ( 13.28%)
TPut   15    530923.00 (  0.00%)    606509.00 ( 14.24%)
TPut   16    519697.00 (  0.00%)    592992.00 ( 14.10%)
TPut   17    521356.00 (  0.00%)    597000.00 ( 14.51%)
TPut   18    497257.00 (  0.00%)    585946.00 ( 17.84%)
TPut   19    481325.00 (  0.00%)    574467.00 ( 19.35%)
TPut   20    475309.00 (  0.00%)    565526.00 ( 18.98%)
TPut   21    483753.00 (  0.00%)    552787.00 ( 14.27%)
TPut   22    474643.00 (  0.00%)    547518.00 ( 15.35%)
TPut   23    468023.00 (  0.00%)    536802.00 ( 14.70%)
TPut   24    450846.00 (  0.00%)    525210.00 ( 16.49%)
TPut   25    461834.00 (  0.00%)    519757.00 ( 12.54%)
TPut   26    456034.00 (  0.00%)    515338.00 ( 13.00%)
TPut   27    460254.00 (  0.00%)    513576.00 ( 11.59%)
TPut   28    457438.00 (  0.00%)    496290.00 (  8.49%)
TPut   29    454550.00 (  0.00%)    499818.00 (  9.96%)
TPut   30    453571.00 (  0.00%)    495767.00 (  9.30%)
TPut   31    460525.00 (  0.00%)    485911.00 (  5.51%)
TPut   32    448016.00 (  0.00%)    488940.00 (  9.13%)
TPut   33    445150.00 (  0.00%)    483970.00 (  8.72%)
TPut   34    444827.00 (  0.00%)    475079.00 (  6.80%)
TPut   35    433878.00 (  0.00%)    480246.00 ( 10.69%)
TPut   36    423728.00 (  0.00%)    472159.00 ( 11.43%)
TPut   37    429712.00 (  0.00%)    473183.00 ( 10.12%)
TPut   38    411219.00 (  0.00%)    482078.00 ( 17.23%)
TPut   39    416380.00 (  0.00%)    485846.00 ( 16.68%)
TPut   40    413840.00 (  0.00%)    490026.00 ( 18.41%)
TPut   41    401668.00 (  0.00%)    474282.00 ( 18.08%)
TPut   42    404103.00 (  0.00%)    482448.00 ( 19.39%)
TPut   43    401247.00 (  0.00%)    482974.00 ( 20.37%)
TPut   44    407076.00 (  0.00%)    481641.00 ( 18.32%)
TPut   45    398597.00 (  0.00%)    485042.00 ( 21.69%)
TPut   46    404714.00 (  0.00%)    484843.00 ( 19.80%)
TPut   47    404595.00 (  0.00%)    479977.00 ( 18.63%)
TPut   48    413784.00 (  0.00%)    480818.00 ( 16.20%)

This is looking much better overall although I am concerned about the
increased variability between JVMs.

specjbb Peaks
                                3.12.0-rc3               3.12.0-rc3
                              account-v9           periodretry-v9  
 Expctd Warehouse          12.00 (  0.00%)          12.00 (  0.00%)
 Expctd Peak Bops      559757.00 (  0.00%)      638725.00 ( 14.11%)
 Actual Warehouse           8.00 (  0.00%)           7.00 (-12.50%)
 Actual Peak Bops      610474.00 (  0.00%)      679295.00 ( 11.27%)
 SpecJBB Bops          502292.00 (  0.00%)      582258.00 ( 15.92%)
 SpecJBB Bops/JVM      125573.00 (  0.00%)      145565.00 ( 15.92%)

Looking fine.

          3.12.0-rc3  3.12.0-rc3
        account-v9  periodretry-v9  
User       481412.08   481942.54
System       1301.91      578.20
Elapsed     10402.09    10404.47

                            3.12.0-rc3  3.12.0-rc3
                          account-v9  periodretry-v9  
Compaction cost                 105928       13748
NUMA PTE updates             457567880    45890118
NUMA hint faults              69831880    45725506
NUMA hint local faults        19303679    28637898
NUMA hint local percent             27          62
NUMA pages migrated          102050548    13244738
AutoNUMA cost                   354301      229200

System CPU usage is still way down, so now we are seeing large
improvements for less work. Previous tests had indicated that periodic
retrying of task migration was necessary for a good "local percent"
of local/remote faults. It implies that the load balancer and NUMA
scheduling may be making conflicting decisions.

While there is still plenty of future work it looks like this is ready
for wider testing.

 Documentation/sysctl/kernel.txt   |   76 +++
 fs/exec.c                         |    1 +
 fs/proc/array.c                   |    2 +
 include/linux/cpu.h               |   67 ++-
 include/linux/mempolicy.h         |    1 +
 include/linux/migrate.h           |    7 +-
 include/linux/mm.h                |  118 +++-
 include/linux/mm_types.h          |   17 +-
 include/linux/page-flags-layout.h |   28 +-
 include/linux/sched.h             |   67 ++-
 include/linux/sched/sysctl.h      |    1 -
 include/linux/stop_machine.h      |    1 +
 kernel/bounds.c                   |    4 +
 kernel/cpu.c                      |  227 ++++++--
 kernel/fork.c                     |    5 +-
 kernel/sched/core.c               |  184 ++++++-
 kernel/sched/debug.c              |   60 +-
 kernel/sched/fair.c               | 1092 ++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h           |   19 +-
 kernel/sched/idle_task.c          |    2 +-
 kernel/sched/rt.c                 |    5 +-
 kernel/sched/sched.h              |   27 +-
 kernel/sched/stop_task.c          |    2 +-
 kernel/stop_machine.c             |  272 +++++----
 kernel/sysctl.c                   |   21 +-
 mm/huge_memory.c                  |  119 +++-
 mm/memory.c                       |  158 ++----
 mm/mempolicy.c                    |   82 ++-
 mm/migrate.c                      |   49 +-
 mm/mm_init.c                      |   18 +-
 mm/mmzone.c                       |   14 +-
 mm/mprotect.c                     |   65 +--
 mm/page_alloc.c                   |    4 +-
 33 files changed, 2248 insertions(+), 567 deletions(-)

-- 
1.8.4


* [PATCH 01/63] hotplug: Optimize {get,put}_online_cpus()
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

NOTE: This is a placeholder only. A more comprehensive series is in
	progress but this patch on its own mitigates most of the
	overhead the migrate_swap patch is concerned with. It's
	expected that CPU hotplug locking series would go in before
	this series.

The current implementation of get_online_cpus() is global in nature
and thus not suited for any kind of common usage.

Re-implement the current recursive r/w cpu hotplug lock such that the
read side locks are as light as possible.

The current cpu hotplug lock is entirely reader biased; but since
readers are expensive there aren't a lot of them about and writer
starvation isn't a particular problem.

However by making the reader side more usable there is a fair chance
it will get used more and thus the starvation issue becomes a real
possibility.

Therefore this new implementation is fair, alternating readers and
writers; this however requires per-task state to allow the reader
recursion -- this new task_struct member is placed in a 4 byte hole on
64bit builds.

Many comments are contributed by Paul McKenney, and many previous
attempts were shown to be inadequate by both Paul and Oleg; many
thanks to them for persisting to poke holes in my attempts.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/cpu.h   |  67 ++++++++++++++-
 include/linux/sched.h |   3 +
 kernel/cpu.c          | 227 +++++++++++++++++++++++++++++++++++++-------------
 kernel/sched/core.c   |   2 +
 4 files changed, 237 insertions(+), 62 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 801ff9e..e520c76 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,8 @@
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
 
 struct device;
 
@@ -173,10 +175,69 @@ extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern int __cpuhp_state;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+	might_sleep();
+
+	/* Support reader recursion */
+	/* The value was >= 1 and remains so, reordering causes no harm. */
+	if (current->cpuhp_ref++)
+		return;
+
+	preempt_disable();
+	/*
+	 * We are in an RCU-sched read-side critical section, so the writer
+	 * cannot both change __cpuhp_state from readers_fast and start
+	 * checking counters while we are here. So if we see !__cpuhp_state,
+	 * we know that the writer won't be checking until we past the
+	 * preempt_enable() and that once the synchronize_sched() is done, the
+	 * writer will see anything we did within this RCU-sched read-side
+	 * critical section.
+	 */
+	if (likely(!__cpuhp_state))
+		__this_cpu_inc(__cpuhp_refcount);
+	else
+		__get_online_cpus(); /* Unconditional memory barrier. */
+	preempt_enable();
+	/*
+	 * The barrier() from preempt_enable() prevents the compiler from
+	 * bleeding the critical section out.
+	 */
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+	/* The value was >= 1 and remains so, reordering causes no harm. */
+	if (--current->cpuhp_ref)
+		return;
+
+	/*
+	 * The barrier() in preempt_disable() prevents the compiler from
+	 * bleeding the critical section out.
+	 */
+	preempt_disable();
+	/*
+	 * Same as in get_online_cpus().
+	 */
+	if (likely(!__cpuhp_state))
+		__this_cpu_dec(__cpuhp_refcount);
+	else
+		__put_online_cpus(); /* Unconditional memory barrier. */
+	preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
@@ -200,6 +261,8 @@ static inline void cpu_hotplug_driver_unlock(void)
 
 #else		/* CONFIG_HOTPLUG_CPU */
 
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
 static inline void cpu_hotplug_begin(void) {}
 static inline void cpu_hotplug_done(void) {}
 #define get_online_cpus()	do { } while (0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6682da3..5308d89 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1026,6 +1026,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
+#ifdef CONFIG_HOTPLUG_CPU
+	int cpuhp_ref;
+#endif
 	struct task_struct *last_wakee;
 	unsigned long wakee_flips;
 	unsigned long wakee_flip_decay_ts;
diff --git a/kernel/cpu.c b/kernel/cpu.c
index d7f07a2..dccf605 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,195 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-	struct task_struct *active_writer;
-	struct mutex lock; /* Synchronizes accesses to refcount, */
+enum { readers_fast = 0, readers_slow, readers_block };
+
+int __cpuhp_state;
+EXPORT_SYMBOL_GPL(__cpuhp_state);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static atomic_t cpuhp_waitcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+	p->cpuhp_ref = 0;
+}
+
+void __get_online_cpus(void)
+{
+again:
+	__this_cpu_inc(__cpuhp_refcount);
+
 	/*
-	 * Also blocks the new readers during
-	 * an ongoing cpu hotplug operation.
+	 * Due to having preemption disabled the decrement happens on
+	 * the same CPU as the increment, avoiding the
+	 * increment-on-one-CPU-and-decrement-on-another problem.
+	 *
+	 * And yes, if the reader misses the writer's assignment of
+	 * readers_block to __cpuhp_state, then the writer is
+	 * guaranteed to see the reader's increment.  Conversely, any
+	 * readers that increment their __cpuhp_refcount after the
+	 * writer looks are guaranteed to see the readers_block value,
+	 * which in turn means that they are guaranteed to immediately
+	 * decrement their __cpuhp_refcount, so that it doesn't matter
+	 * that the writer missed them.
 	 */
-	int refcount;
-} cpu_hotplug = {
-	.active_writer = NULL,
-	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-	.refcount = 0,
-};
 
-void get_online_cpus(void)
-{
-	might_sleep();
-	if (cpu_hotplug.active_writer == current)
+	smp_mb(); /* A matches D */
+
+	if (likely(__cpuhp_state != readers_block))
 		return;
-	mutex_lock(&cpu_hotplug.lock);
-	cpu_hotplug.refcount++;
-	mutex_unlock(&cpu_hotplug.lock);
 
+	/*
+	 * Make sure an outgoing writer sees the waitcount to ensure we
+	 * make progress.
+	 */
+	atomic_inc(&cpuhp_waitcount);
+
+	/*
+	 * Per the above comment; we still have preemption disabled and
+	 * will thus decrement on the same CPU as we incremented.
+	 */
+	__put_online_cpus();
+
+	/*
+	 * We either call schedule() in the wait, or we'll fall through
+	 * and reschedule on the preempt_enable() in get_online_cpus().
+	 */
+	preempt_enable_no_resched();
+	__wait_event(cpuhp_readers, __cpuhp_state != readers_block);
+	preempt_disable();
+
+	/*
+	 * Given we've still got preempt_disabled and new cpu_hotplug_begin()
+	 * must do a synchronize_sched() we're guaranteed a successfull
+	 * acquisition this time -- even if we wake the current
+	 * cpu_hotplug_end() now.
+	 */
+	if (atomic_dec_and_test(&cpuhp_waitcount))
+		wake_up(&cpuhp_writer);
+
+	goto again;
 }
-EXPORT_SYMBOL_GPL(get_online_cpus);
+EXPORT_SYMBOL_GPL(__get_online_cpus);
 
-void put_online_cpus(void)
+void __put_online_cpus(void)
 {
-	if (cpu_hotplug.active_writer == current)
-		return;
-	mutex_lock(&cpu_hotplug.lock);
+	smp_mb(); /* B matches C */
+	/*
+	 * In other words, if they see our decrement (presumably to aggregate
+	 * zero, as that is the only time it matters) they will also see our
+	 * critical section.
+	 */
+	this_cpu_dec(__cpuhp_refcount);
+
+	/* Prod writer to recheck readers_active */
+	wake_up(&cpuhp_writer);
+}
+EXPORT_SYMBOL_GPL(__put_online_cpus);
 
-	if (WARN_ON(!cpu_hotplug.refcount))
-		cpu_hotplug.refcount++; /* try to fix things up */
+#define per_cpu_sum(var)						\
+({ 									\
+ 	typeof(var) __sum = 0;						\
+ 	int cpu;							\
+ 	for_each_possible_cpu(cpu)					\
+ 		__sum += per_cpu(var, cpu);				\
+ 	__sum;								\
+})
 
-	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
-		wake_up_process(cpu_hotplug.active_writer);
-	mutex_unlock(&cpu_hotplug.lock);
+/*
+ * Return true if the modular sum of the __cpuhp_refcount per-CPU variables
+ * is zero. If this sum is zero, then it is stable due to the fact that if
+ * any newly arriving readers increment a given counter, they will
+ * immediately decrement that same counter.
+ */
+static bool cpuhp_readers_active_check(void)
+{
+	if (per_cpu_sum(__cpuhp_refcount) != 0)
+		return false;
 
+	/*
+	 * If we observed the decrement; ensure we see the entire critical
+	 * section.
+	 */
+
+	smp_mb(); /* C matches B */
+
+	return true;
 }
-EXPORT_SYMBOL_GPL(put_online_cpus);
 
 /*
- * This ensures that the hotplug operation can begin only when the
- * refcount goes to zero.
- *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
- * Since cpu_hotplug_begin() is always called after invoking
- * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- *   writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- *   non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
+ * This will notify new readers to block and wait for all active readers to
+ * complete.
  */
 void cpu_hotplug_begin(void)
 {
-	cpu_hotplug.active_writer = current;
+	/*
+	 * Since cpu_hotplug_begin() is always called after invoking
+	 * cpu_maps_update_begin(), we can be sure that only one writer is
+	 * active.
+	 */
+	lockdep_assert_held(&cpu_add_remove_lock);
 
-	for (;;) {
-		mutex_lock(&cpu_hotplug.lock);
-		if (likely(!cpu_hotplug.refcount))
-			break;
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		mutex_unlock(&cpu_hotplug.lock);
-		schedule();
-	}
+	/* Allow reader-in-writer recursion. */
+	current->cpuhp_ref++;
+
+	/* Notify readers to take the slow path. */
+	__cpuhp_state = readers_slow;
+
+	/* See percpu_down_write(); guarantees all readers take the slow path */
+	synchronize_sched();
+
+	/*
+	 * Notify new readers to block; up until now, and thus throughout the
+	 * longish synchronize_sched() above, new readers could still come in.
+	 */
+	__cpuhp_state = readers_block;
+
+	smp_mb(); /* D matches A */
+
+	/*
+	 * If they don't see our writer of readers_block to __cpuhp_state,
+	 * then we are guaranteed to see their __cpuhp_refcount increment, and
+	 * therefore will wait for them.
+	 */
+
+	/* Wait for all now active readers to complete. */
+	wait_event(cpuhp_writer, cpuhp_readers_active_check());
 }
 
 void cpu_hotplug_done(void)
 {
-	cpu_hotplug.active_writer = NULL;
-	mutex_unlock(&cpu_hotplug.lock);
+	/*
+	 * Signal the writer is done, no fast path yet.
+	 *
+	 * One reason that we cannot just immediately flip to readers_fast is
+	 * that new readers might fail to see the results of this writer's
+	 * critical section.
+	 */
+	__cpuhp_state = readers_slow;
+	wake_up_all(&cpuhp_readers);
+
+	/*
+	 * The wait_event()/wake_up_all() prevents the race where the readers
+	 * are delayed between fetching __cpuhp_state and blocking.
+	 */
+
+	/* See percpu_up_write(); readers will no longer attempt to block. */
+	synchronize_sched();
+
+	/* Let 'em rip */
+	__cpuhp_state = readers_fast;
+	current->cpuhp_ref--;
+
+	/*
+	 * Wait for any pending readers to be running. This ensures readers
+	 * after writer and avoids writers starving readers.
+	 */
+	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
 }
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ac63c9..2f3420c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1630,6 +1630,8 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_NUMA_BALANCING */
+
+	cpu_hotplug_init_task(p);
 }
 
 #ifdef CONFIG_NUMA_BALANCING
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 02/63] mm: numa: Document automatic NUMA balancing sysctls
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 9d4c1d1..1428c66 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -355,6 +355,72 @@ utilize.
 
 ==============================================================
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running.  Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases.  The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases.  The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a tasks scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
 osrelease, ostype & version:
 
 # cat osrelease
-- 
1.8.4
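
As a back-of-the-envelope illustration of how the "scan delay" and
"scan size" described in the documentation above combine into a scan
rate, here is a small editor's sketch (not part of the patch; the
values are arbitrary, not kernel defaults):

	#include <stdio.h>

	int main(void)
	{
		unsigned int scan_size_mb = 256;	/* numa_balancing_scan_size_mb */
		unsigned int scan_delay_ms = 1000;	/* current per-task "scan delay" */

		/* MB of address space unmapped/scanned per second at this delay */
		printf("scan rate: %u MB/s\n", scan_size_mb * 1000 / scan_delay_ms);
		return 0;
	}

Halving the scan delay doubles the rate at which pages are unmapped
and hence the NUMA hinting fault overhead, which is why the delay is
adapted to how well placed the pages already are.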


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 03/63] sched, numa: Comment fixlets
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Fix an 80-column violation and a PTE vs PMD reference.

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 8 ++++----
 mm/huge_memory.c    | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c70201..b22f52a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,10 +988,10 @@ void task_numa_work(struct callback_head *work)
 
 out:
 	/*
-	 * It is possible to reach the end of the VMA list but the last few VMAs are
-	 * not guaranteed to the vma_migratable. If they are not, we would find the
-	 * !migratable VMA on the next scan but not reset the scanner to the start
-	 * so check it now.
+	 * It is possible to reach the end of the VMA list but the last few
+	 * VMAs are not guaranteed to the vma_migratable. If they are not, we
+	 * would find the !migratable VMA on the next scan but not reset the
+	 * scanner to the start so check it now.
 	 */
 	if (vma)
 		mm->numa_scan_offset = start;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7489884..19dbb08 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1305,7 +1305,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(&mm->page_table_lock);
 	lock_page(page);
 
-	/* Confirm the PTE did not while locked */
+	/* Confirm the PMD did not change while page_table_lock was released */
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		unlock_page(page);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 04/63] mm: numa: Do not account for a hinting fault if we raced
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

If another task handled a hinting fault in parallel then do not
double-account for it.

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 19dbb08..dab2bab 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1325,8 +1325,11 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 check_same:
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(pmd, *pmdp)))
+	if (unlikely(!pmd_same(pmd, *pmdp))) {
+		/* Someone else took our fault */
+		current_nid = -1;
 		goto out_unlock;
+	}
 clear_pmdnuma:
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 05/63] mm: Wait for THP migrations to complete during NUMA hinting faults
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

The locking for migrating THP is unusual. While normal page migration
prevents parallel accesses using a migration PTE, THP migration relies on
a combination of the page_table_lock, the page lock and the existence of
the NUMA hinting PTE to guarantee safety, but there is a bug in the scheme.

If a THP page is currently being migrated and another thread traps a
fault on the same page, it checks whether the page is misplaced. If it is
not, then pmd_numa is cleared. The problem is that this check is made
without holding the page lock, meaning that the racing thread can still
be migrating the THP when the second thread clears the NUMA bit and
faults in a stale page.

This patch checks whether the page is potentially being migrated and, if
so, stalls on the page lock before checking whether the page is misplaced.
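
For illustration only, a simplified userspace analogue of the fix
(editor's sketch, hypothetical names, pthreads instead of the kernel
page lock): the placement decision is only made while holding the lock
that migration holds, so an in-flight migration is waited out rather
than raced against.

	#include <pthread.h>

	struct fake_page {
		pthread_mutex_t lock;	/* stands in for the page lock */
		int nid;		/* node the data currently lives on */
	};

	/* Hypothetical helper: node to migrate to, or -1 if correctly placed */
	static int placement_check(const struct fake_page *page, int this_nid)
	{
		return page->nid == this_nid ? -1 : this_nid;
	}

	static int handle_fault(struct fake_page *page, int this_nid)
	{
		int target_nid;

		pthread_mutex_lock(&page->lock);	/* serialise against migration */
		target_nid = placement_check(page, this_nid);
		pthread_mutex_unlock(&page->lock);

		return target_nid;			/* -1: leave the mapping alone */
	}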

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dab2bab..f362363 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1295,13 +1295,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
-	target_nid = mpol_misplaced(page, vma, haddr);
-	if (target_nid == -1) {
-		put_page(page);
-		goto clear_pmdnuma;
-	}
+	/*
+	 * Acquire the page lock to serialise THP migrations but avoid dropping
+	 * page_table_lock if at all possible
+	 */
+	if (trylock_page(page))
+		goto got_lock;
 
-	/* Acquire the page lock to serialise THP migrations */
+	/* Serialise against migrationa and check placement check placement */
 	spin_unlock(&mm->page_table_lock);
 	lock_page(page);
 
@@ -1312,9 +1313,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		put_page(page);
 		goto out_unlock;
 	}
-	spin_unlock(&mm->page_table_lock);
+
+got_lock:
+	target_nid = mpol_misplaced(page, vma, haddr);
+	if (target_nid == -1) {
+		unlock_page(page);
+		put_page(page);
+		goto clear_pmdnuma;
+	}
 
 	/* Migrate the THP to the requested node */
+	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
 	if (!migrated)
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 06/63] mm: Prevent parallel splits during THP migration
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

THP migrations are serialised by the page lock, but on its own that does
not prevent THP splits. If the page is split during THP migration then
the pmd_same checks will prevent page table corruption, but the unlock_page
and other fix-ups could still cause corruption. This patch takes the
anon_vma lock to prevent parallel splits during migration.

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 44 ++++++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f362363..1d6334f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1278,18 +1278,18 @@ out:
 int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
+	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int target_nid;
 	int current_nid = -1;
-	bool migrated;
+	bool migrated, page_locked;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
-	get_page(page);
 	current_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (current_nid == numa_node_id())
@@ -1299,12 +1299,29 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Acquire the page lock to serialise THP migrations but avoid dropping
 	 * page_table_lock if at all possible
 	 */
-	if (trylock_page(page))
-		goto got_lock;
+	page_locked = trylock_page(page);
+	target_nid = mpol_misplaced(page, vma, haddr);
+	if (target_nid == -1) {
+		/* If the page was locked, there are no parallel migrations */
+		if (page_locked) {
+			unlock_page(page);
+			goto clear_pmdnuma;
+		}
 
-	/* Serialise against migrationa and check placement check placement */
+		/* Otherwise wait for potential migrations and retry fault */
+		spin_unlock(&mm->page_table_lock);
+		wait_on_page_locked(page);
+		goto out;
+	}
+
+	/* Page is misplaced, serialise migrations and parallel THP splits */
+	get_page(page);
 	spin_unlock(&mm->page_table_lock);
-	lock_page(page);
+	if (!page_locked) {
+		lock_page(page);
+		page_locked = true;
+	}
+	anon_vma = page_lock_anon_vma_read(page);
 
 	/* Confirm the PMD did not change while page_table_lock was released */
 	spin_lock(&mm->page_table_lock);
@@ -1314,14 +1331,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_unlock;
 	}
 
-got_lock:
-	target_nid = mpol_misplaced(page, vma, haddr);
-	if (target_nid == -1) {
-		unlock_page(page);
-		put_page(page);
-		goto clear_pmdnuma;
-	}
-
 	/* Migrate the THP to the requested node */
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1330,6 +1339,8 @@ got_lock:
 		goto check_same;
 
 	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	if (anon_vma)
+		page_unlock_anon_vma_read(anon_vma);
 	return 0;
 
 check_same:
@@ -1346,6 +1357,11 @@ clear_pmdnuma:
 	update_mmu_cache_pmd(vma, addr, pmdp);
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+
+out:
+	if (anon_vma)
+		page_unlock_anon_vma_read(anon_vma);
+
 	if (current_nid != -1)
 		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
 	return 0;
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 07/63] mm: numa: Sanitize task_numa_fault() callsites
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

There are three callers of task_numa_fault():

 - do_huge_pmd_numa_page():
     Accounts against the current node, not the node where the
     page resides, unless we migrated, in which case it accounts
     against the node we migrated to.

 - do_numa_page():
     Accounts against the current node, not the node where the
     page resides, unless we migrated, in which case it accounts
     against the node we migrated to.

 - do_pmd_numa_page():
     Accounts not at all when the page isn't migrated, otherwise
     accounts against the node we migrated towards.

This seems wrong to me; all three sites should have the same
semantics. Furthermore, we should account against where the page
really is; we already know where the task is.

So modify all three sites to always account; we did, after all,
receive the fault. Always account to where the page is after
migration, regardless of success.

They all still differ on when they clear the PTE/PMD; ideally that
would get sorted too.
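
For reference, a minimal userspace sketch of the converged accounting
rule (editor's illustration with a stub standing in for the kernel's
task_numa_fault(); none of these names are part of the patch):

	#include <stdbool.h>
	#include <stdio.h>

	/* Stub for the kernel's task_numa_fault() accounting hook */
	static void task_numa_fault_stub(int nid, int nr_pages, bool migrated)
	{
		printf("account %d page(s) to node %d (migrated=%d)\n",
		       nr_pages, nid, migrated);
	}

	static void account_hinting_fault(int page_nid, int target_nid,
					  bool migrated, int nr_pages)
	{
		if (migrated)
			page_nid = target_nid;	/* page moved: charge its new node */

		if (page_nid != -1)		/* -1: we raced and the page went away */
			task_numa_fault_stub(page_nid, nr_pages, migrated);
	}

	int main(void)
	{
		account_hinting_fault(0, 1, true, 1);	/* migrated from node 0 to 1 */
		account_hinting_fault(0, -1, false, 1);	/* properly placed, still accounted */
		return 0;
	}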

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 25 +++++++++++++------------
 mm/memory.c      | 53 +++++++++++++++++++++--------------------------------
 2 files changed, 34 insertions(+), 44 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1d6334f..c3bb65f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1281,18 +1281,19 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	int page_nid = -1, this_nid = numa_node_id();
 	int target_nid;
-	int current_nid = -1;
-	bool migrated, page_locked;
+	bool page_locked;
+	bool migrated = false;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
-	current_nid = page_to_nid(page);
+	page_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	/*
@@ -1335,19 +1336,18 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
-	if (!migrated)
+	if (migrated)
+		page_nid = target_nid;
+	else
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
-	if (anon_vma)
-		page_unlock_anon_vma_read(anon_vma);
-	return 0;
+	goto out;
 
 check_same:
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		/* Someone else took our fault */
-		current_nid = -1;
+		page_nid = -1;
 		goto out_unlock;
 	}
 clear_pmdnuma:
@@ -1362,8 +1362,9 @@ out:
 	if (anon_vma)
 		page_unlock_anon_vma_read(anon_vma);
 
-	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+	if (page_nid != -1)
+		task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index ca00039..42ae82e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3519,12 +3519,12 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-				unsigned long addr, int current_nid)
+				unsigned long addr, int page_nid)
 {
 	get_page(page);
 
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (page_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	return mpol_misplaced(page, vma, addr);
@@ -3535,7 +3535,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int page_nid = -1;
 	int target_nid;
 	bool migrated = false;
 
@@ -3565,15 +3565,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
-	current_nid = page_to_nid(page);
-	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
+	page_nid = page_to_nid(page);
+	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
-		/*
-		 * Account for the fault against the current node if it not
-		 * being replaced regardless of where the page is located.
-		 */
-		current_nid = numa_node_id();
 		put_page(page);
 		goto out;
 	}
@@ -3581,11 +3576,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Migrate to the requested node */
 	migrated = migrate_misplaced_page(page, target_nid);
 	if (migrated)
-		current_nid = target_nid;
+		page_nid = target_nid;
 
 out:
-	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+	if (page_nid != -1)
+		task_numa_fault(page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3600,7 +3595,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
-	int local_nid = numa_node_id();
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3623,9 +3617,10 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page *page;
-		int curr_nid = local_nid;
+		int page_nid = -1;
 		int target_nid;
-		bool migrated;
+		bool migrated = false;
+
 		if (!pte_present(pteval))
 			continue;
 		if (!pte_numa(pteval))
@@ -3647,25 +3642,19 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(page_mapcount(page) != 1))
 			continue;
 
-		/*
-		 * Note that the NUMA fault is later accounted to either
-		 * the node that is currently running or where the page is
-		 * migrated to.
-		 */
-		curr_nid = local_nid;
-		target_nid = numa_migrate_prep(page, vma, addr,
-					       page_to_nid(page));
-		if (target_nid == -1) {
+		page_nid = page_to_nid(page);
+		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+		pte_unmap_unlock(pte, ptl);
+		if (target_nid != -1) {
+			migrated = migrate_misplaced_page(page, target_nid);
+			if (migrated)
+				page_nid = target_nid;
+		} else {
 			put_page(page);
-			continue;
 		}
 
-		/* Migrate to the requested node */
-		pte_unmap_unlock(pte, ptl);
-		migrated = migrate_misplaced_page(page, target_nid);
-		if (migrated)
-			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		if (page_nid != -1)
+			task_numa_fault(page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 08/63] mm: Close races between THP migration and PMD numa clearing
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

THP migration uses the page lock to guard against parallel allocations
but there are cases like this still open

Task A						Task B
do_huge_pmd_numa_page				do_huge_pmd_numa_page
lock_page
mpol_misplaced == -1
unlock_page
goto clear_pmdnuma
						lock_page
						mpol_misplaced == 2
						migrate_misplaced_transhuge
pmd = pmd_mknonnuma
set_pmd_at

During hours of testing, one crashed with weird errors and, while I have
no direct evidence, I suspect something like the race above happened.
This patch extends the page lock to be held until pmd_numa is cleared,
preventing a migration from starting in parallel while pmd_numa is being
cleared. It also flushes the old pmd entry and orders pagetable insertion
before rmap insertion.
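
The end result (see the huge_memory.c hunk below) is that the pmd_numa
clearing only ever runs with the page lock held, so a parallel migration
cannot start until it completes; roughly:

    clear_pmdnuma:
            BUG_ON(!PageLocked(page));
            pmd = pmd_mknonnuma(pmd);
            set_pmd_at(mm, haddr, pmdp, pmd);
            update_mmu_cache_pmd(vma, addr, pmdp);
            unlock_page(page);              /* migration may proceed only now */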

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 33 +++++++++++++++------------------
 mm/migrate.c     | 19 +++++++++++--------
 2 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bb65f..d4928769 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1304,24 +1304,25 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		/* If the page was locked, there are no parallel migrations */
-		if (page_locked) {
-			unlock_page(page);
+		if (page_locked)
 			goto clear_pmdnuma;
-		}
 
-		/* Otherwise wait for potential migrations and retry fault */
+		/*
+		 * Otherwise wait for potential migrations and retry. We do
+		 * relock and check_same as the page may no longer be mapped.
+		 * As the fault is being retried, do not account for it.
+		 */
 		spin_unlock(&mm->page_table_lock);
 		wait_on_page_locked(page);
+		page_nid = -1;
 		goto out;
 	}
 
 	/* Page is misplaced, serialise migrations and parallel THP splits */
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
-	if (!page_locked) {
+	if (!page_locked)
 		lock_page(page);
-		page_locked = true;
-	}
 	anon_vma = page_lock_anon_vma_read(page);
 
 	/* Confirm the PMD did not change while page_table_lock was released */
@@ -1329,32 +1330,28 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		unlock_page(page);
 		put_page(page);
+		page_nid = -1;
 		goto out_unlock;
 	}
 
-	/* Migrate the THP to the requested node */
+	/*
+	 * Migrate the THP to the requested node, returns with page unlocked
+	 * and pmd_numa cleared.
+	 */
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
 	if (migrated)
 		page_nid = target_nid;
-	else
-		goto check_same;
 
 	goto out;
-
-check_same:
-	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(pmd, *pmdp))) {
-		/* Someone else took our fault */
-		page_nid = -1;
-		goto out_unlock;
-	}
 clear_pmdnuma:
+	BUG_ON(!PageLocked(page));
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
 	VM_BUG_ON(pmd_numa(*pmdp));
 	update_mmu_cache_pmd(vma, addr, pmdp);
+	unlock_page(page);
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 9c8d5f5..ce8c3a0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1713,12 +1713,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		unlock_page(new_page);
 		put_page(new_page);		/* Free it */
 
-		unlock_page(page);
+		/* Retake the callers reference and putback on LRU */
+		get_page(page);
 		putback_lru_page(page);
-
-		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
-		isolated = 0;
-		goto out;
+		mod_zone_page_state(page_zone(page),
+			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
+		goto out_fail;
 	}
 
 	/*
@@ -1735,9 +1735,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 	entry = pmd_mkhuge(entry);
 
-	page_add_new_anon_rmap(new_page, vma, haddr);
-
+	pmdp_clear_flush(vma, haddr, pmd);
 	set_pmd_at(mm, haddr, pmd, entry);
+	page_add_new_anon_rmap(new_page, vma, haddr);
 	update_mmu_cache_pmd(vma, address, &entry);
 	page_remove_rmap(page);
 	/*
@@ -1756,7 +1756,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
 	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
 
-out:
 	mod_zone_page_state(page_zone(page),
 			NR_ISOLATED_ANON + page_lru,
 			-HPAGE_PMD_NR);
@@ -1765,6 +1764,10 @@ out:
 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 out_dropref:
+	entry = pmd_mknonnuma(entry);
+	set_pmd_at(mm, haddr, pmd, entry);
+	update_mmu_cache_pmd(vma, address, &entry);
+
 	unlock_page(page);
 	put_page(page);
 	return 0;
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 09/63] mm: Account for a THP NUMA hinting update as one PTE update
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

A THP PMD update is accounted for as 512 pages updated in vmstat. This is
a large difference when estimating the cost of automatic NUMA balancing and
can be misleading when comparing results with collapsed versus split
THP. This patch addresses the accounting issue.
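
For reference, on a typical x86-64 configuration with 4KiB base pages and
2MiB transparent hugepages the inflation factor is

    HPAGE_PMD_NR = HPAGE_PMD_SIZE / PAGE_SIZE = 2MiB / 4KiB = 512

so one THP protection change used to be reported as 512 PTE updates where,
after this patch, it is reported as one.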

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..2bbb648 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				split_huge_page_pmd(vma, addr, pmd);
 			else if (change_huge_pmd(vma, pmd, addr, newprot,
 						 prot_numa)) {
-				pages += HPAGE_PMD_NR;
+				pages++;
 				continue;
 			}
 			/* fall through */
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 10/63] mm: Do not flush TLB during protection change if !pte_present && !migration_entry
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA PTE scanning is expensive both in terms of the scanning itself and
the TLB flush if there are any updates. Currently non-present PTEs are
accounted for as an update and incur a TLB flush when that is only
necessary for anonymous migration entries. This patch addresses the
problem and should reduce TLB flushes.
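
In outline (surrounding context reconstructed approximately), the fix moves
the accounting inside the branch that actually rewrites the PTE, so only
write migration entries that are downgraded to read count as updates:

    swp_entry_t entry = pte_to_swp_entry(oldpte);

    if (is_write_migration_entry(entry)) {
            make_migration_entry_read(&entry);
            set_pte_at(mm, addr, pte, swp_entry_to_pte(entry));
            pages++;        /* only this case requires a TLB flush */
    }
    /* other non-present PTEs are untouched and no longer counted */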

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2bbb648..7bdbd4b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -101,8 +101,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				make_migration_entry_read(&entry);
 				set_pte_at(mm, addr, pte,
 					swp_entry_to_pte(entry));
+
+				pages++;
 			}
-			pages++;
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 11/63] mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA PTE scanning is expensive both in terms of the scanning itself and
the TLB flush if there are any updates. The TLB flush is avoided if no
PTEs are updated but there is a bug where transhuge PMDs are considered
to be updated even if they were already pmd_numa. This patch addresses
the problem and TLB flushes should be reduced.
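
The caller can then tell "locked but unchanged" apart from "modified"; a
sketch mirroring the mprotect.c hunk below:

    int nr_ptes = change_huge_pmd(vma, pmd, addr, newprot, prot_numa);

    if (nr_ptes) {
            if (nr_ptes == HPAGE_PMD_NR)
                    pages++;        /* PMD rewritten, TLB flush needed */
            continue;               /* 1: already pmd_numa, nothing to flush */
    }
    /* 0: PMD could not be locked as transhuge, fall through to PTE handling */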

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 19 ++++++++++++++++---
 mm/mprotect.c    | 14 ++++++++++----
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d4928769..de8d5cf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1458,6 +1458,12 @@ out:
 	return ret;
 }
 
+/*
+ * Returns
+ *  - 0 if PMD could not be locked
+ *  - 1 if PMD was locked but protections unchanged and TLB flush unnecessary
+ *  - HPAGE_PMD_NR if protections changed and TLB flush necessary
+ */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, pgprot_t newprot, int prot_numa)
 {
@@ -1466,9 +1472,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
 		pmd_t entry;
-		entry = pmdp_get_and_clear(mm, addr, pmd);
+		ret = 1;
 		if (!prot_numa) {
+			entry = pmdp_get_and_clear(mm, addr, pmd);
 			entry = pmd_modify(entry, newprot);
+			ret = HPAGE_PMD_NR;
 			BUG_ON(pmd_write(entry));
 		} else {
 			struct page *page = pmd_page(*pmd);
@@ -1476,12 +1484,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			/* only check non-shared pages */
 			if (page_mapcount(page) == 1 &&
 			    !pmd_numa(*pmd)) {
+				entry = pmdp_get_and_clear(mm, addr, pmd);
 				entry = pmd_mknuma(entry);
+				ret = HPAGE_PMD_NR;
 			}
 		}
-		set_pmd_at(mm, addr, pmd, entry);
+
+		/* Set PMD if cleared earlier */
+		if (ret == HPAGE_PMD_NR)
+			set_pmd_at(mm, addr, pmd, entry);
+
 		spin_unlock(&vma->vm_mm->page_table_lock);
-		ret = 1;
 	}
 
 	return ret;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7bdbd4b..2da33dc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -144,10 +144,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				split_huge_page_pmd(vma, addr, pmd);
-			else if (change_huge_pmd(vma, pmd, addr, newprot,
-						 prot_numa)) {
-				pages++;
-				continue;
+			else {
+				int nr_ptes = change_huge_pmd(vma, pmd, addr,
+						newprot, prot_numa);
+
+				if (nr_ptes) {
+					if (nr_ptes == HPAGE_PMD_NR)
+						pages++;
+
+					continue;
+				}
 			}
 			/* fall through */
 		}
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 12/63] mm: numa: Do not migrate or account for hinting faults on the zero page
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

The zero page is not replicated between nodes and is often shared between
processes. The data is read-only and likely to be cached in local CPUs
if heavily accessed, meaning that the remote memory access cost is less
of a concern. This patch prevents trapping faults on the zero page. For
tasks using the zero page this will reduce the number of PTE updates,
TLB flushes and hinting faults.
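
For the THP case the scanner simply skips the huge zero page when deciding
what to mark pmd_numa (a sketch of the change_huge_pmd() check from the
hunk below):

    if (page_mapcount(page) == 1 &&
        !is_huge_zero_page(page) &&     /* never trap the huge zero page */
        !pmd_numa(*pmd)) {
            entry = pmdp_get_and_clear(mm, addr, pmd);
            entry = pmd_mknuma(entry);
    }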

[peterz@infradead.org: Correct use of is_huge_zero_page]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 10 +++++++++-
 mm/memory.c      |  1 +
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de8d5cf..8677dbf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,6 +1291,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_unlock;
 
 	page = pmd_page(pmd);
+	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
@@ -1481,8 +1482,15 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		} else {
 			struct page *page = pmd_page(*pmd);
 
-			/* only check non-shared pages */
+			/*
+			 * Only check non-shared pages. Do not trap faults
+			 * against the zero page. The read-only data is likely
+			 * to be read-cached on the local CPU cache and it is
+			 * less useful to know about local vs remote hits on
+			 * the zero page.
+			 */
 			if (page_mapcount(page) == 1 &&
+			    !is_huge_zero_page(page) &&
 			    !pmd_numa(*pmd)) {
 				entry = pmdp_get_and_clear(mm, addr, pmd);
 				entry = pmd_mknuma(entry);
diff --git a/mm/memory.c b/mm/memory.c
index 42ae82e..ed51f15 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3564,6 +3564,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(ptep, ptl);
 		return 0;
 	}
+	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 13/63] sched: numa: Mitigate chance that same task always updates PTEs
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

With a trace_printk("working\n"); right after the cmpxchg in
task_numa_work() we can see that, for a 4-thread process, it's always the
same task winning the race and doing the protection change.

This is a problem since the task doing the protection change has a
penalty for taking faults -- it is busy when marking the PTEs. If it's
always the same task, the ->numa_faults[] statistics get severely skewed.

Avoid this by delaying the task doing the protection change such that
it is unlikely to win the privilege again.
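
Concretely (see the fair.c hunk below), the winner pushes its own scan
timestamp forward right after winning the cmpxchg in task_numa_work(), so a
sibling thread is likely to get there first next time:

    p->node_stamp += 2 * TICK_NSEC;

The traces below show the effect: before the change one thread does all the
scanning; afterwards the work rotates between the threads of the process.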

Before:

root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
      thread 0/0-3232  [022] ....   212.787402: task_numa_work: working
      thread 0/0-3232  [022] ....   212.888473: task_numa_work: working
      thread 0/0-3232  [022] ....   212.989538: task_numa_work: working
      thread 0/0-3232  [022] ....   213.090602: task_numa_work: working
      thread 0/0-3232  [022] ....   213.191667: task_numa_work: working
      thread 0/0-3232  [022] ....   213.292734: task_numa_work: working
      thread 0/0-3232  [022] ....   213.393804: task_numa_work: working
      thread 0/0-3232  [022] ....   213.494869: task_numa_work: working
      thread 0/0-3232  [022] ....   213.596937: task_numa_work: working
      thread 0/0-3232  [022] ....   213.699000: task_numa_work: working
      thread 0/0-3232  [022] ....   213.801067: task_numa_work: working
      thread 0/0-3232  [022] ....   213.903155: task_numa_work: working
      thread 0/0-3232  [022] ....   214.005201: task_numa_work: working
      thread 0/0-3232  [022] ....   214.107266: task_numa_work: working
      thread 0/0-3232  [022] ....   214.209342: task_numa_work: working

After:

root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
      thread 0/0-3253  [005] ....   136.865051: task_numa_work: working
      thread 0/2-3255  [026] ....   136.965134: task_numa_work: working
      thread 0/3-3256  [024] ....   137.065217: task_numa_work: working
      thread 0/3-3256  [024] ....   137.165302: task_numa_work: working
      thread 0/3-3256  [024] ....   137.265382: task_numa_work: working
      thread 0/0-3253  [004] ....   137.366465: task_numa_work: working
      thread 0/2-3255  [026] ....   137.466549: task_numa_work: working
      thread 0/0-3253  [004] ....   137.566629: task_numa_work: working
      thread 0/0-3253  [004] ....   137.666711: task_numa_work: working
      thread 0/1-3254  [028] ....   137.766799: task_numa_work: working
      thread 0/0-3253  [004] ....   137.866876: task_numa_work: working
      thread 0/2-3255  [026] ....   137.966960: task_numa_work: working
      thread 0/1-3254  [028] ....   138.067041: task_numa_work: working
      thread 0/2-3255  [026] ....   138.167123: task_numa_work: working
      thread 0/3-3256  [024] ....   138.267207: task_numa_work: working

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b22f52a..8b9ff79 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -946,6 +946,12 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	/*
+	 * Delay this task enough that another task of this mm will likely win
+	 * the next time around.
+	 */
+	p->node_stamp += 2 * TICK_NSEC;
+
+	/*
 	 * Do not set pte_numa if the current running node is rate-limited.
 	 * This loses statistics on the fault but if we are unwilling to
 	 * migrate to this node, it is less likely we can do useful work
@@ -1026,7 +1032,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	if (now - curr->node_stamp > period) {
 		if (!curr->node_stamp)
 			curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
-		curr->node_stamp = now;
+		curr->node_stamp += period;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
 			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 14/63] sched: numa: Continue PTE scanning even if migrate rate limited
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Avoiding marking PTEs pte_numa because a particular NUMA node is migrate rate
limited seems like a bad idea. Even if this node can't migrate any more, other
nodes might, and we want up-to-date information to make balancing decisions.
We already rate limit the actual migrations; this should leave enough
bandwidth to allow the non-migrating scanning. I think it's important we
keep up-to-date information if we're going to do placement based on it.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b9ff79..39be6af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -951,14 +951,6 @@ void task_numa_work(struct callback_head *work)
 	 */
 	p->node_stamp += 2 * TICK_NSEC;
 
-	/*
-	 * Do not set pte_numa if the current running node is rate-limited.
-	 * This loses statistics on the fault but if we are unwilling to
-	 * migrate to this node, it is less likely we can do useful work
-	 */
-	if (migrate_ratelimited(numa_node_id()))
-		return;
-
 	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 15/63] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node"
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

PTE scanning and NUMA hinting fault handling is expensive so commit
5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
on a new node") deferred the PTE scan until a task had been scheduled on
another node. The problem is that in the purely shared memory case
this may never happen and no NUMA hinting fault information will be
captured. We are not ruling out the possibility that something better
can be done here but, for now, the change is reverted and we depend
entirely on the scan_delay to avoid punishing short-lived processes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h | 10 ----------
 kernel/fork.c            |  3 ---
 kernel/sched/fair.c      | 18 ------------------
 kernel/sched/features.h  |  4 +---
 4 files changed, 1 insertion(+), 34 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d9851ee..b7adf1d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -428,20 +428,10 @@ struct mm_struct {
 
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
-
-	/*
-	 * The first node a task was scheduled on. If a task runs on
-	 * a different node than Make PTE Scan Go Now.
-	 */
-	int first_nid;
 #endif
 	struct uprobes_state uprobes_state;
 };
 
-/* first nid will either be a valid NID or one of these values */
-#define NUMA_PTE_SCAN_INIT	-1
-#define NUMA_PTE_SCAN_ACTIVE	-2
-
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
 #ifdef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/fork.c b/kernel/fork.c
index 086fe73..7192d91 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -817,9 +817,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
-#ifdef CONFIG_NUMA_BALANCING
-	mm->first_nid = NUMA_PTE_SCAN_INIT;
-#endif
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39be6af..148838c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,24 +901,6 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	/*
-	 * We do not care about task placement until a task runs on a node
-	 * other than the first one used by the address space. This is
-	 * largely because migrations are driven by what CPU the task
-	 * is running on. If it's never scheduled on another node, it'll
-	 * not migrate so why bother trapping the fault.
-	 */
-	if (mm->first_nid == NUMA_PTE_SCAN_INIT)
-		mm->first_nid = numa_node_id();
-	if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
-		/* Are we running on a new node yet? */
-		if (numa_node_id() == mm->first_nid &&
-		    !sched_feat_numa(NUMA_FORCE))
-			return;
-
-		mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
-	}
-
-	/*
 	 * Reset the scan period if enough time has gone by. Objective is that
 	 * scanning will be reduced if pages are properly placed. As tasks
 	 * can enter different phases this needs to be re-examined. Lacking
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 99399f8..cba5c61 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,10 +63,8 @@ SCHED_FEAT(LB_MIN, false)
 /*
  * Apply the automatic NUMA scheduling policy. Enabled automatically
  * at runtime if running on a NUMA machine. Can be controlled via
- * numa_balancing=. Allow PTE scanning to be forced on UMA machines
- * for debugging the core machinery.
+ * numa_balancing=
  */
 #ifdef CONFIG_NUMA_BALANCING
 SCHED_FEAT(NUMA,	false)
-SCHED_FEAT(NUMA_FORCE,	false)
 #endif
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 16/63] sched: numa: Initialise numa_next_scan properly
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

The scan delay and reset timestamps are currently initialised so that
scanning starts immediately instead of being delayed properly. Initialise
them properly at fork time and catch the case where a new mm has been
allocated with them still unset.
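
A minimal sketch of the intended initialisation (plain C for illustration,
not the kernel code; the millisecond values stand in for the jiffies-based
scan_delay and scan_period_reset defaults):

/*
 * Illustrative sketch, not kernel code: push the first scan and the
 * first period reset a full interval into the future, and treat a
 * zero timestamp as a freshly allocated, uninitialised mm.
 */
void init_or_repair_scan_times(unsigned long now_ms,
                               unsigned long *next_scan_ms,
                               unsigned long *next_reset_ms)
{
        if (!*next_scan_ms || !*next_reset_ms) {
                *next_scan_ms  = now_ms + 1000;   /* scan delay */
                *next_reset_ms = now_ms + 60000;  /* period reset */
        }
}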

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c | 4 ++--
 kernel/sched/fair.c | 7 +++++++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f3420c..681945e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1619,8 +1619,8 @@ static void __sched_fork(struct task_struct *p)
 
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
-		p->mm->numa_next_scan = jiffies;
-		p->mm->numa_next_reset = jiffies;
+		p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		p->mm->numa_next_reset = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 148838c..22c0c7c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -900,6 +900,13 @@ void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	if (!mm->numa_next_reset || !mm->numa_next_scan) {
+		mm->numa_next_scan = now +
+			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		mm->numa_next_reset = now +
+			msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
+	}
+
 	/*
 	 * Reset the scan period if enough time has gone by. Objective is that
 	 * scanning will be reduced if pages are properly placed. As tasks
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 17/63] sched: Set the scan rate proportional to the memory usage of the task being scanned
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size. This scan rate is independent of the size
of the task and, as an aside, it is further complicated by the fact that
numa_balancing_scan_size controls how many pages are marked pte_numa and
not how much virtual memory is scanned.

In combination, it is almost impossible to meaningfully tune the min and
max scan periods, and reasoning about performance is complex when the time
to complete a full scan is partially a function of the task's memory
size. This patch alters the semantics of the min and max tunables so that
they tune the length of time it takes to complete a scan of a task's
occupied virtual address space. Conceptually this is a lot easier to
understand. There is a "sanity" check that caps the scan rate based on the
maximum amount of virtual memory that should be scanned in a second. The
default cap of 2.5GB/sec seems arbitrary but it was chosen so that the
maximum scan rate after the patch roughly matches the maximum scan rate
before the patch was applied.

On a similar note, numa_scan_period is in milliseconds and not
jiffies. Properly placed pages slow the scanning rate, but adding 10 jiffies
to numa_scan_period means that the rate at which scanning slows depends on
HZ, which is confusing. Get rid of the jiffies_to_msecs conversion and treat
the value as milliseconds.
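
As a rough worked example of the new semantics, consider the standalone
sketch below (illustrative userspace code, not the kernel implementation;
the constants mirror the defaults introduced in this patch). A 1GB task
ends up scanning one 256MB window every 250ms so a full pass takes about a
second, while a 10GB task is clamped to the 100ms floor, i.e. the
2.5GB/sec cap.

/*
 * Illustrative userspace sketch, not kernel code: how the per-window
 * scan period falls out of a task's resident set size under the new
 * semantics. The constants mirror the defaults introduced below.
 */
#include <stdio.h>

#define SCAN_SIZE_MB        256     /* numa_balancing_scan_size_mb */
#define SCAN_PERIOD_MIN_MS  1000    /* numa_balancing_scan_period_min_ms */
#define MAX_SCAN_WINDOW_MB  2560    /* cap: never scan faster than 2.5GB/sec */

static unsigned int nr_scan_windows(unsigned long rss_mb)
{
        if (!rss_mb)
                rss_mb = SCAN_SIZE_MB;
        /* Round RSS up to a whole number of scan windows */
        return (rss_mb + SCAN_SIZE_MB - 1) / SCAN_SIZE_MB;
}

static unsigned int scan_period_ms(unsigned long rss_mb)
{
        unsigned int floor = 1000 / (MAX_SCAN_WINDOW_MB / SCAN_SIZE_MB);
        unsigned int scan = SCAN_PERIOD_MIN_MS / nr_scan_windows(rss_mb);

        return scan > floor ? scan : floor;
}

int main(void)
{
        /* 1GB of RSS: one 256MB window every 250ms, full pass in ~1s */
        printf("1GB  RSS: %ums per window\n", scan_period_ms(1024));
        /* 10GB of RSS: clamped to the 100ms floor, i.e. 2.5GB/sec */
        printf("10GB RSS: %ums per window\n", scan_period_ms(10240));
        return 0;
}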

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 +++---
 include/linux/sched.h           |  1 +
 kernel/sched/fair.c             | 88 +++++++++++++++++++++++++++++++++++------
 3 files changed, 83 insertions(+), 17 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 1428c66..8cd7e5f 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -403,15 +403,16 @@ workload pattern changes and minimises performance impact due to remote
 memory accesses. These sysctls control the thresholds for scan delays and
 the number of pages scanned.
 
-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
 
 numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
 when it initially forks.
 
-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
 
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5308d89..a8095ad 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1329,6 +1329,7 @@ struct task_struct {
 	int numa_scan_seq;
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
+	unsigned int numa_scan_period_max;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22c0c7c..c0092e5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -818,11 +818,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the tasks virtual memory size and
+ * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
-unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_max = 60000;
+unsigned int sysctl_numa_balancing_scan_period_reset = 60000;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -830,6 +832,51 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+	unsigned long rss = 0;
+	unsigned long nr_scan_pages;
+
+	/*
+	 * Calculations based on RSS as non-present and empty pages are skipped
+	 * by the PTE scanner and NUMA hinting faults should be trapped based
+	 * on resident pages
+	 */
+	nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+	rss = get_mm_rss(p->mm);
+	if (!rss)
+		rss = nr_scan_pages;
+
+	rss = round_up(rss, nr_scan_pages);
+	return rss / nr_scan_pages;
+}
+
+/* For sanitys sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */
+#define MAX_SCAN_WINDOW 2560
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+	unsigned int scan, floor;
+	unsigned int windows = 1;
+
+	if (sysctl_numa_balancing_scan_size < MAX_SCAN_WINDOW)
+		windows = MAX_SCAN_WINDOW / sysctl_numa_balancing_scan_size;
+	floor = 1000 / windows;
+
+	scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+	return max_t(unsigned int, floor, scan);
+}
+
+static unsigned int task_scan_max(struct task_struct *p)
+{
+	unsigned int smin = task_scan_min(p);
+	unsigned int smax;
+
+	/* Watch for min being lower than max due to floor calculations */
+	smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+	return max(smin, smax);
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq;
@@ -840,6 +887,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_scan_period_max = task_scan_max(p);
 
 	/* FIXME: Scheduling placement policy hints go here */
 }
@@ -860,9 +908,14 @@ void task_numa_fault(int node, int pages, bool migrated)
 	 * If pages are properly placed (did not migrate) then scan slower.
 	 * This is reset periodically in case of phase changes
 	 */
-        if (!migrated)
-		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
-			p->numa_scan_period + jiffies_to_msecs(10));
+	if (!migrated) {
+		/* Initialise if necessary */
+		if (!p->numa_scan_period_max)
+			p->numa_scan_period_max = task_scan_max(p);
+
+		p->numa_scan_period = min(p->numa_scan_period_max,
+			p->numa_scan_period + 10);
+	}
 
 	task_numa_placement(p);
 }
@@ -884,6 +937,7 @@ void task_numa_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
+	unsigned long nr_pte_updates = 0;
 	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -915,7 +969,7 @@ void task_numa_work(struct callback_head *work)
 	 */
 	migrate = mm->numa_next_reset;
 	if (time_after(now, migrate)) {
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+		p->numa_scan_period = task_scan_min(p);
 		next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
 		xchg(&mm->numa_next_reset, next_scan);
 	}
@@ -927,8 +981,10 @@ void task_numa_work(struct callback_head *work)
 	if (time_before(now, migrate))
 		return;
 
-	if (p->numa_scan_period == 0)
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+	if (p->numa_scan_period == 0) {
+		p->numa_scan_period_max = task_scan_max(p);
+		p->numa_scan_period = task_scan_min(p);
+	}
 
 	next_scan = now + msecs_to_jiffies(p->numa_scan_period);
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
@@ -965,7 +1021,15 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			pages -= change_prot_numa(vma, start, end);
+			nr_pte_updates += change_prot_numa(vma, start, end);
+
+			/*
+			 * Scan sysctl_numa_balancing_scan_size but ensure that
+			 * at least one PTE is updated so that unused virtual
+			 * address space is quickly skipped.
+			 */
+			if (nr_pte_updates)
+				pages -= (end - start) >> PAGE_SHIFT;
 
 			start = end;
 			if (pages <= 0)
@@ -1012,7 +1076,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 
 	if (now - curr->node_stamp > period) {
 		if (!curr->node_stamp)
-			curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+			curr->numa_scan_period = task_scan_min(curr);
 		curr->node_stamp += period;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 18/63] sched: numa: Slow scan rate if no NUMA hinting faults are being recorded
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA PTE scanning is slowed if a NUMA hinting fault is trapped and no page
is migrated. For long-lived but idle processes there may be no faults at
all, yet the scan rate stays high and simply wastes CPU. This patch slows
the scan rate for processes that are not trapping faults.
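
A minimal sketch of the backoff (plain C for illustration, not the kernel
code; the names are made up for this example):

/*
 * Illustrative sketch, not kernel code: if a complete pass over the
 * address space updated no PTEs, double the scan period, clamped to
 * the per-task maximum, so idle address spaces are scanned less often.
 */
unsigned int slow_scan_if_idle(unsigned int period_ms,
                               unsigned int period_max_ms,
                               unsigned long nr_pte_updates)
{
        if (!nr_pte_updates) {
                period_ms <<= 1;
                if (period_ms > period_max_ms)
                        period_ms = period_max_ms;
        }
        return period_ms;
}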

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0092e5..8cea7a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1039,6 +1039,18 @@ void task_numa_work(struct callback_head *work)
 
 out:
 	/*
+	 * If the whole process was scanned without updates then no NUMA
+	 * hinting faults are being recorded and scan rate should be lower.
+	 */
+	if (mm->numa_scan_offset == 0 && !nr_pte_updates) {
+		p->numa_scan_period = min(p->numa_scan_period_max,
+			p->numa_scan_period << 1);
+
+		next_scan = now + msecs_to_jiffies(p->numa_scan_period);
+		mm->numa_next_scan = next_scan;
+	}
+
+	/*
 	 * It is possible to reach the end of the VMA list but the last few
 	 * VMAs are not guaranteed to the vma_migratable. If they are not, we
 	 * would find the !migratable VMA on the next scan but not reset the
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 19/63] sched: Track NUMA hinting faults on per-node basis
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch tracks which nodes NUMA hinting faults were incurred on.
This information is later used to schedule a task on the node storing
the pages the task faults on most frequently.
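
The bookkeeping amounts to roughly the following (an illustrative
userspace sketch, not the kernel code; record_numa_fault and nr_node_ids
stand in for the kernel hook and node count):

#include <stdlib.h>

/*
 * Illustrative sketch, not kernel code: one counter per node,
 * allocated lazily on the first hinting fault and bumped by the
 * number of pages that faulted on that node.
 */
static unsigned long *numa_faults;

static void record_numa_fault(int node, int pages, int nr_node_ids)
{
        if (!numa_faults) {
                numa_faults = calloc(nr_node_ids, sizeof(*numa_faults));
                if (!numa_faults)
                        return; /* tracking is best-effort */
        }
        numa_faults[node] += pages;
}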

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 11 ++++++++++-
 kernel/sched/sched.h  | 12 ++++++++++++
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a8095ad..8828e40 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1332,6 +1332,8 @@ struct task_struct {
 	unsigned int numa_scan_period_max;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+
+	unsigned long *numa_faults;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 681945e..aad2e02 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1629,6 +1629,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
+	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	cpu_hotplug_init_task(p);
@@ -1892,6 +1893,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		task_numa_free(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8cea7a2..df300d9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -902,7 +902,14 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (!numabalancing_enabled)
 		return;
 
-	/* FIXME: Allocate task-specific structure for placement policy here */
+	/* Allocate buffer to track faults on a per-node basis */
+	if (unlikely(!p->numa_faults)) {
+		int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+		p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
+		if (!p->numa_faults)
+			return;
+	}
 
 	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
@@ -918,6 +925,8 @@ void task_numa_fault(int node, int pages, bool migrated)
 	}
 
 	task_numa_placement(p);
+
+	p->numa_faults[node] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b3c5653..6a955f4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
 #include <linux/tick.h>
+#include <linux/slab.h>
 
 #include "cpupri.h"
 #include "cpuacct.h"
@@ -552,6 +553,17 @@ static inline u64 rq_clock_task(struct rq *rq)
 	return rq->clock_task;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline void task_numa_free(struct task_struct *p)
+{
+	kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 20/63] sched: Select a preferred node with the most numa hinting faults
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch selects a preferred node for a task to run on based on the
NUMA hinting faults it has incurred. This information is later used to
migrate tasks towards that node during load balancing.
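
The decision itself is an argmax over the per-node counters, roughly as
below (an illustrative sketch, not the kernel code; the decay of the
counters is visible in the diff):

/*
 * Illustrative sketch, not kernel code: return the node with the most
 * recorded hinting faults, or -1 if no faults have been seen yet.
 */
int pick_preferred_nid(const unsigned long *numa_faults, int nr_nodes)
{
        unsigned long max_faults = 0;
        int nid, max_nid = -1;

        for (nid = 0; nid < nr_nodes; nid++) {
                if (numa_faults[nid] > max_faults) {
                        max_faults = numa_faults[nid];
                        max_nid = nid;
                }
        }
        return max_nid;
}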

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 17 +++++++++++++++--
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8828e40..83bc1f5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1334,6 +1334,7 @@ struct task_struct {
 	struct callback_head numa_work;
 
 	unsigned long *numa_faults;
+	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aad2e02..cecbbed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1628,6 +1628,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df300d9..5fdab8c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -879,7 +879,8 @@ static unsigned int task_scan_max(struct task_struct *p)
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq;
+	int seq, nid, max_nid = -1;
+	unsigned long max_faults = 0;
 
 	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
 		return;
@@ -889,7 +890,19 @@ static void task_numa_placement(struct task_struct *p)
 	p->numa_scan_seq = seq;
 	p->numa_scan_period_max = task_scan_max(p);
 
-	/* FIXME: Scheduling placement policy hints go here */
+	/* Find the node with the highest number of faults */
+	for_each_online_node(nid) {
+		unsigned long faults = p->numa_faults[nid];
+		p->numa_faults[nid] >>= 1;
+		if (faults > max_faults) {
+			max_faults = faults;
+			max_nid = nid;
+		}
+	}
+
+	/* Update the tasks preferred node if necessary */
+	if (max_faults && max_nid != p->numa_preferred_nid)
+		p->numa_preferred_nid = max_nid;
 }
 
 /*
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 21/63] sched: Update NUMA hinting faults once per scan
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA hinting fault counts and placement decisions are both recorded in the
same array, which distorts the samples in an unpredictable fashion. The
values accumulate linearly during the scan and then decay, creating a
sawtooth-like pattern in the per-node counts. It also means that placement
decisions are time sensitive. At best it means that it is very difficult to
state that the buffer holds a decaying average of past faulting behaviour.
At worst, it can confuse the load balancer if it sees one node with an
artificially high count due to very recent faulting activity, which may
create a bouncing effect.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.
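
The end-of-scan update then looks roughly like the sketch below
(illustrative only, not the kernel code): the history halves and absorbs
the window's counts, giving an exponentially decaying average.

/*
 * Illustrative sketch, not kernel code: fold one scan window's counts
 * into the decaying per-node history.
 */
void decay_and_fold(unsigned long *numa_faults,
                    unsigned long *numa_faults_buffer, int nr_nodes)
{
        int nid;

        for (nid = 0; nid < nr_nodes; nid++) {
                numa_faults[nid] >>= 1;                 /* decay history */
                numa_faults[nid] += numa_faults_buffer[nid];
                numa_faults_buffer[nid] = 0;            /* next window */
        }
}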

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h | 13 +++++++++++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 16 +++++++++++++---
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 83bc1f5..2e02757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1333,7 +1333,20 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	/*
+	 * Exponential decaying average of faults on a per-node basis.
+	 * Scheduling placement decisions are made based on the these counts.
+	 * The values remain static for the duration of a PTE scan
+	 */
 	unsigned long *numa_faults;
+
+	/*
+	 * numa_faults_buffer records faults per node during the current
+	 * scan window. When the scan completes, the counts in numa_faults
+	 * decay and these values are copied.
+	 */
+	unsigned long *numa_faults_buffer;
+
 	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cecbbed..201c953 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1631,6 +1631,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
+	p->numa_faults_buffer = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	cpu_hotplug_init_task(p);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5fdab8c..6227fb4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -892,8 +892,14 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
-		unsigned long faults = p->numa_faults[nid];
+		unsigned long faults;
+
+		/* Decay existing window and copy faults since last scan */
 		p->numa_faults[nid] >>= 1;
+		p->numa_faults[nid] += p->numa_faults_buffer[nid];
+		p->numa_faults_buffer[nid] = 0;
+
+		faults = p->numa_faults[nid];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -919,9 +925,13 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * nr_node_ids;
 
-		p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
+		/* numa_faults and numa_faults_buffer share the allocation */
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
 		if (!p->numa_faults)
 			return;
+
+		BUG_ON(p->numa_faults_buffer);
+		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
 	}
 
 	/*
@@ -939,7 +949,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 
 	task_numa_placement(p);
 
-	p->numa_faults[node] += pages;
+	p->numa_faults_buffer[node] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 22/63] sched: Favour moving tasks towards the preferred node
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch favours moving tasks towards the NUMA node that recorded a
higher number of NUMA faults during active load balancing. Ideally this is
self-reinforcing as the longer the task runs on that node, the more faults
it should incur, causing task_numa_placement to keep the task running on
that node. In reality a big weakness is that the node's CPUs can be
overloaded and it would be more efficient to queue tasks on an idle node
and migrate to the new node. This would require additional smarts in the
balancer, so for now the balancer simply prefers to place the task on the
preferred node for a number of PTE scans, which is controlled by the
numa_balancing_settle_count sysctl. Once settle_count scans have completed,
the scheduler is free to place the task on an alternative node if the load
is imbalanced.
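
The balancer-side check reduces to roughly the following (an illustrative
sketch, not the kernel code): while the task is still within its settle
period, a cross-node move counts as a locality improvement if the
destination is the preferred node or has recorded more faults than the
source.

/* Illustrative sketch, not kernel code. */
int migration_improves_locality(int src_nid, int dst_nid, int preferred_nid,
                                int migrate_seq, int settle_count,
                                const unsigned long *numa_faults)
{
        if (src_nid == dst_nid || migrate_seq >= settle_count)
                return 0;

        return dst_nid == preferred_nid ||
               numa_faults[dst_nid] > numa_faults[src_nid];
}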

[srikar@linux.vnet.ibm.com: Fixed statistics]
[peterz@infradead.org: Tunable and use higher faults instead of preferred]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt |  8 +++++-
 include/linux/sched.h           |  1 +
 kernel/sched/core.c             |  3 +-
 kernel/sched/fair.c             | 63 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h         |  7 +++++
 kernel/sysctl.c                 |  7 +++++
 6 files changed, 83 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 8cd7e5f..d48bca4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -375,7 +375,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
 
 ==============================================================
 
@@ -420,6 +421,11 @@ scanned for a given scan.
 numa_balancing_scan_period_reset is a blunt instrument that controls how
 often a tasks scan delay is reset to detect sudden changes in task behaviour.
 
+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2e02757..d5ae4bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -768,6 +768,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 201c953..3515c41 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1626,7 +1626,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_migrate_seq = 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
@@ -5661,6 +5661,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6227fb4..8c2b779 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -877,6 +877,15 @@ static unsigned int task_scan_max(struct task_struct *p)
 	return max(smin, smax);
 }
 
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -888,6 +897,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_migrate_seq++;
 	p->numa_scan_period_max = task_scan_max(p);
 
 	/* Find the node with the highest number of faults */
@@ -907,8 +917,10 @@ static void task_numa_placement(struct task_struct *p)
 	}
 
 	/* Update the tasks preferred node if necessary */
-	if (max_faults && max_nid != p->numa_preferred_nid)
+	if (max_faults && max_nid != p->numa_preferred_nid) {
 		p->numa_preferred_nid = max_nid;
+		p->numa_migrate_seq = 0;
+	}
 }
 
 /*
@@ -4070,6 +4082,38 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults ||
+	    !(env->sd->flags & SD_NUMA)) {
+		return false;
+	}
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (dst_nid == p->numa_preferred_nid ||
+	    p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+		return true;
+
+	return false;
+}
+#else
+static inline bool migrate_improves_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -4125,11 +4169,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) task is cache cold, or
-	 * 2) too many balance attempts have failed.
+	 * 1) destination numa is preferred
+	 * 2) task is cache cold, or
+	 * 3) too many balance attempts have failed.
 	 */
-
 	tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+
+	if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+		if (tsk_cache_hot) {
+			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+			schedstat_inc(p, se.statistics.nr_forced_migrations);
+		}
+#endif
+		return 1;
+	}
+
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index cba5c61..d9278ce 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -67,4 +67,11 @@ SCHED_FEAT(LB_MIN, false)
  */
 #ifdef CONFIG_NUMA_BALANCING
 SCHED_FEAT(NUMA,	false)
+
+/*
+ * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a
+ * higher number of hinting faults are recorded during active load
+ * balancing.
+ */
+SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
 #endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2f06f3..42f616a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -391,6 +391,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_settle_count",
+		.data           = &sysctl_numa_balancing_settle_count,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread
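
The tunable added above is registered in kern_table, so it should surface as
/proc/sys/kernel/numa_balancing_settle_count on kernels built with both
CONFIG_NUMA_BALANCING and CONFIG_SCHED_DEBUG. A minimal userspace reader,
sketched here on that assumption rather than taken from the patch, would be:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/numa_balancing_settle_count", "r");
	unsigned int settle;

	if (!f) {
		perror("numa_balancing_settle_count");
		return 1;
	}
	if (fscanf(f, "%u", &settle) == 1)
		printf("settle after %u scan periods\n", settle);
	fclose(f);
	return 0;
}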

* [PATCH 22/63] sched: Favour moving tasks towards the preferred node
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch favours moving tasks towards the NUMA node that recorded a higher
number of NUMA hinting faults during active load balancing. Ideally this is
self-reinforcing: the longer the task runs on that node, the more faults it
should incur there, causing task_numa_placement to keep the task running on
that node. In reality a big weakness is that the node's CPUs can be overloaded
and it would be more efficient to queue tasks on an idle node and migrate
memory to the new node. That would require additional smarts in the balancer,
so for now the balancer simply prefers to place the task on the preferred node
for a number of PTE scans, controlled by the numa_balancing_settle_count
sysctl. Once settle_count scans have completed, the scheduler is free to place
the task on an alternative node if the load is imbalanced.

[srikar@linux.vnet.ibm.com: Fixed statistics]
[peterz@infradead.org: Tunable and use higher faults instead of preferred]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt |  8 +++++-
 include/linux/sched.h           |  1 +
 kernel/sched/core.c             |  3 +-
 kernel/sched/fair.c             | 63 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h         |  7 +++++
 kernel/sysctl.c                 |  7 +++++
 6 files changed, 83 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 8cd7e5f..d48bca4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -375,7 +375,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
 
 ==============================================================
 
@@ -420,6 +421,11 @@ scanned for a given scan.
 numa_balancing_scan_period_reset is a blunt instrument that controls how
 often a tasks scan delay is reset to detect sudden changes in task behaviour.
 
+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2e02757..d5ae4bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -768,6 +768,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 201c953..3515c41 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1626,7 +1626,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_migrate_seq = 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
@@ -5661,6 +5661,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6227fb4..8c2b779 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -877,6 +877,15 @@ static unsigned int task_scan_max(struct task_struct *p)
 	return max(smin, smax);
 }
 
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -888,6 +897,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_migrate_seq++;
 	p->numa_scan_period_max = task_scan_max(p);
 
 	/* Find the node with the highest number of faults */
@@ -907,8 +917,10 @@ static void task_numa_placement(struct task_struct *p)
 	}
 
 	/* Update the tasks preferred node if necessary */
-	if (max_faults && max_nid != p->numa_preferred_nid)
+	if (max_faults && max_nid != p->numa_preferred_nid) {
 		p->numa_preferred_nid = max_nid;
+		p->numa_migrate_seq = 0;
+	}
 }
 
 /*
@@ -4070,6 +4082,38 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults ||
+	    !(env->sd->flags & SD_NUMA)) {
+		return false;
+	}
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (dst_nid == p->numa_preferred_nid ||
+	    p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+		return true;
+
+	return false;
+}
+#else
+static inline bool migrate_improves_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -4125,11 +4169,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) task is cache cold, or
-	 * 2) too many balance attempts have failed.
+	 * 1) destination numa is preferred
+	 * 2) task is cache cold, or
+	 * 3) too many balance attempts have failed.
 	 */
-
 	tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+
+	if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+		if (tsk_cache_hot) {
+			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+			schedstat_inc(p, se.statistics.nr_forced_migrations);
+		}
+#endif
+		return 1;
+	}
+
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index cba5c61..d9278ce 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -67,4 +67,11 @@ SCHED_FEAT(LB_MIN, false)
  */
 #ifdef CONFIG_NUMA_BALANCING
 SCHED_FEAT(NUMA,	false)
+
+/*
+ * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a
+ * higher number of hinting faults are recorded during active load
+ * balancing.
+ */
+SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
 #endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2f06f3..42f616a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -391,6 +391,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_settle_count",
+		.data           = &sysctl_numa_balancing_settle_count,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread
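
The core of the patch is the migrate_improves_locality() check. A standalone
sketch of that decision, with the lb_env and sched_feat() plumbing stripped
out and the fault counts invented for illustration, looks roughly like this:

#include <stdbool.h>
#include <stdio.h>

#define SETTLE_COUNT 3	/* stands in for sysctl_numa_balancing_settle_count */

static bool improves_locality(const unsigned long *faults, int src_nid,
			      int dst_nid, int preferred_nid, int migrate_seq)
{
	/* No fault statistics recorded yet for this task */
	if (!faults)
		return false;

	/* Same node, or the task has already settled: nothing to gain */
	if (src_nid == dst_nid || migrate_seq >= SETTLE_COUNT)
		return false;

	/* Favour the preferred node or any node with more recorded faults */
	return dst_nid == preferred_nid || faults[dst_nid] > faults[src_nid];
}

int main(void)
{
	unsigned long faults[2] = { 10, 40 };	/* node 0, node 1 */

	printf("move 0->1 early:   %d\n", improves_locality(faults, 0, 1, 1, 1));
	printf("move 0->1 settled: %d\n", improves_locality(faults, 0, 1, 1, 3));
	return 0;
}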

* [PATCH 23/63] sched: Resist moving tasks towards nodes with fewer hinting faults
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Just as "sched: Favour moving tasks towards the preferred node" favours
moving tasks towards nodes with a higher number of recorded NUMA hinting
faults, this patch resists moving tasks towards nodes with a lower number of
recorded faults.

[mgorman@suse.de: changelog]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c     | 33 +++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  8 ++++++++
 2 files changed, 41 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c2b779..21cad59 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4106,12 +4106,43 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 
 	return false;
 }
+
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
+		return false;
+
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+		return true;
+
+	return false;
+}
+
 #else
 static inline bool migrate_improves_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
 	return false;
 }
+
+static inline bool migrate_degrades_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
 #endif
 
 /*
@@ -4174,6 +4205,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) too many balance attempts have failed.
 	 */
 	tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+	if (!tsk_cache_hot)
+		tsk_cache_hot = migrate_degrades_locality(p, env);
 
 	if (migrate_improves_locality(p, env)) {
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d9278ce..5716929 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -74,4 +74,12 @@ SCHED_FEAT(NUMA,	false)
  * balancing.
  */
 SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
+
+/*
+ * NUMA_RESIST_LOWER will resist moving tasks towards nodes where a
+ * lower number of hinting faults have been recorded. As this has
+ * the potential to prevent a task ever migrating to a new node
+ * due to CPU overload it is disabled by default.
+ */
+SCHED_FEAT(NUMA_RESIST_LOWER, false)
 #endif
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 23/63] sched: Resist moving tasks towards nodes with fewer hinting faults
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Just as "sched: Favour moving tasks towards the preferred node" favours
moving tasks towards nodes with a higher number of recorded NUMA hinting
faults, this patch resists moving tasks towards nodes with a lower number of
recorded faults.

[mgorman@suse.de: changelog]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c     | 33 +++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  8 ++++++++
 2 files changed, 41 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c2b779..21cad59 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4106,12 +4106,43 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 
 	return false;
 }
+
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
+		return false;
+
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+		return true;
+
+	return false;
+}
+
 #else
 static inline bool migrate_improves_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
 	return false;
 }
+
+static inline bool migrate_degrades_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
 #endif
 
 /*
@@ -4174,6 +4205,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) too many balance attempts have failed.
 	 */
 	tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+	if (!tsk_cache_hot)
+		tsk_cache_hot = migrate_degrades_locality(p, env);
 
 	if (migrate_improves_locality(p, env)) {
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d9278ce..5716929 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -74,4 +74,12 @@ SCHED_FEAT(NUMA,	false)
  * balancing.
  */
 SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
+
+/*
+ * NUMA_RESIST_LOWER will resist moving tasks towards nodes where a
+ * lower number of hinting faults have been recorded. As this has
+ * the potential to prevent a task ever migrating to a new node
+ * due to CPU overload it is disabled by default.
+ */
+SCHED_FEAT(NUMA_RESIST_LOWER, false)
 #endif
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread
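
Taken together with the previous patch, can_migrate_task() now consults both
locality helpers, and the ordering matters: degrading locality only makes the
task look cache hot, while improving locality forces the migration through
even when the task is hot. A reduced model of that ordering, with the hotness
and failure checks collapsed into booleans for illustration, is:

#include <stdbool.h>
#include <stdio.h>

static bool allow_migration(bool improves_locality, bool degrades_locality,
			    bool cache_hot, bool too_many_failures)
{
	/* A move that degrades locality is treated like a cache-hot task */
	if (!cache_hot)
		cache_hot = degrades_locality;

	/* A move that improves locality is allowed even when cache hot */
	if (improves_locality)
		return true;

	return !cache_hot || too_many_failures;
}

int main(void)
{
	printf("hot but improves locality:  %d\n",
	       allow_migration(true, false, true, false));
	printf("cold but degrades locality: %d\n",
	       allow_migration(false, true, false, false));
	return 0;
}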

* [PATCH 24/63] sched: Reschedule task on preferred NUMA node once selected
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

A preferred node is selected based on the node on which the most NUMA
hinting faults were incurred. There is no guarantee that the task is running
on that node at the time, so this patch reschedules the task to run on the
idlest CPU of the preferred node once it is selected. This avoids waiting
for the load balancer to make a decision.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  | 19 +++++++++++++++++++
 kernel/sched/fair.c  | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  1 +
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3515c41..60e640d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4353,6 +4353,25 @@ fail:
 	return ret;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Migrate current task p to target_cpu */
+int migrate_task_to(struct task_struct *p, int target_cpu)
+{
+	struct migration_arg arg = { p, target_cpu };
+	int curr_cpu = task_cpu(p);
+
+	if (curr_cpu == target_cpu)
+		return 0;
+
+	if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
+		return -EINVAL;
+
+	/* TODO: This is not properly updating schedstats */
+
+	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
 /*
  * migration_cpu_stop - this will be executed by a highprio stopper thread
  * and performs thread migration by bumping thread off CPU then
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 21cad59..63677ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,31 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static unsigned long weighted_cpuload(const int cpu);
+
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int i, idlest_cpu = this_cpu;
+
+	BUG_ON(cpu_to_node(this_cpu) == nid);
+
+	rcu_read_lock();
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		load = weighted_cpuload(i);
+
+		if (load < min_load) {
+			min_load = load;
+			idlest_cpu = i;
+		}
+	}
+	rcu_read_unlock();
+
+	return idlest_cpu;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -916,10 +941,29 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/* Update the tasks preferred node if necessary */
+	/*
+	 * Record the preferred node as the node with the most faults,
+	 * requeue the task to be running on the idlest CPU on the
+	 * preferred node and reset the scanning rate to recheck
+	 * the working set placement.
+	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
+		int preferred_cpu;
+
+		/*
+		 * If the task is not on the preferred node then find the most
+		 * idle CPU to migrate to.
+		 */
+		preferred_cpu = task_cpu(p);
+		if (cpu_to_node(preferred_cpu) != max_nid) {
+			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+							     max_nid);
+		}
+
+		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 0;
+		migrate_task_to(p, preferred_cpu);
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6a955f4..dca80b8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -554,6 +554,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+extern int migrate_task_to(struct task_struct *p, int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 24/63] sched: Reschedule task on preferred NUMA node once selected
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

A preferred node is selected based on the node on which the most NUMA
hinting faults were incurred. There is no guarantee that the task is running
on that node at the time, so this patch reschedules the task to run on the
idlest CPU of the preferred node once it is selected. This avoids waiting
for the load balancer to make a decision.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  | 19 +++++++++++++++++++
 kernel/sched/fair.c  | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  1 +
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3515c41..60e640d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4353,6 +4353,25 @@ fail:
 	return ret;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Migrate current task p to target_cpu */
+int migrate_task_to(struct task_struct *p, int target_cpu)
+{
+	struct migration_arg arg = { p, target_cpu };
+	int curr_cpu = task_cpu(p);
+
+	if (curr_cpu == target_cpu)
+		return 0;
+
+	if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
+		return -EINVAL;
+
+	/* TODO: This is not properly updating schedstats */
+
+	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
 /*
  * migration_cpu_stop - this will be executed by a highprio stopper thread
  * and performs thread migration by bumping thread off CPU then
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 21cad59..63677ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,31 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static unsigned long weighted_cpuload(const int cpu);
+
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int i, idlest_cpu = this_cpu;
+
+	BUG_ON(cpu_to_node(this_cpu) == nid);
+
+	rcu_read_lock();
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		load = weighted_cpuload(i);
+
+		if (load < min_load) {
+			min_load = load;
+			idlest_cpu = i;
+		}
+	}
+	rcu_read_unlock();
+
+	return idlest_cpu;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -916,10 +941,29 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/* Update the tasks preferred node if necessary */
+	/*
+	 * Record the preferred node as the node with the most faults,
+	 * requeue the task to be running on the idlest CPU on the
+	 * preferred node and reset the scanning rate to recheck
+	 * the working set placement.
+	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
+		int preferred_cpu;
+
+		/*
+		 * If the task is not on the preferred node then find the most
+		 * idle CPU to migrate to.
+		 */
+		preferred_cpu = task_cpu(p);
+		if (cpu_to_node(preferred_cpu) != max_nid) {
+			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+							     max_nid);
+		}
+
+		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 0;
+		migrate_task_to(p, preferred_cpu);
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6a955f4..dca80b8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -554,6 +554,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+extern int migrate_task_to(struct task_struct *p, int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread
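
find_idlest_cpu_node() simply walks the CPUs of the preferred node and picks
the least loaded one. The same selection, modelled in plain userspace C with
an invented topology and load figures standing in for weighted_cpuload() and
cpumask_of_node(), is:

#include <limits.h>
#include <stdio.h>

static int find_idlest_cpu(const unsigned long *load, const int *node_cpus,
			   int nr_cpus, int this_cpu)
{
	unsigned long min_load = ULONG_MAX;
	int i, idlest_cpu = this_cpu;

	/* Pick the CPU on the target node with the lowest load */
	for (i = 0; i < nr_cpus; i++) {
		if (load[node_cpus[i]] < min_load) {
			min_load = load[node_cpus[i]];
			idlest_cpu = node_cpus[i];
		}
	}
	return idlest_cpu;
}

int main(void)
{
	int node1_cpus[] = { 4, 5, 6, 7 };	/* assumed 2-node, 8-CPU box */
	unsigned long load[8] = { 3, 1, 2, 0, 9, 2, 7, 5 };

	printf("idlest CPU on node 1: %d\n",
	       find_idlest_cpu(load, node1_cpus, 4, 0));
	return 0;
}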

* [PATCH 25/63] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared.  This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  5 +++--
 kernel/sched/fair.c   | 46 +++++++++++++++++++++++++++++++++++-----------
 mm/huge_memory.c      |  5 +++--
 mm/memory.c           |  8 ++++++--
 4 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d5ae4bd..8a3aa9e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1435,10 +1435,11 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages,
+				   bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63677ed..dce3545 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,20 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
+static inline unsigned long task_faults(struct task_struct *p, int nid)
+{
+	if (!p->numa_faults)
+		return 0;
+
+	return p->numa_faults[task_faults_idx(nid, 0)] +
+		p->numa_faults[task_faults_idx(nid, 1)];
+}
+
 static unsigned long weighted_cpuload(const int cpu);
 
 
@@ -928,13 +942,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
 
-		faults = p->numa_faults[nid];
+			/* Decay existing window, copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
+
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -970,16 +990,20 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv;
 
 	if (!numabalancing_enabled)
 		return;
 
+	/* For now, do not attempt to detect private/shared accesses */
+	priv = 1;
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
 		p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -987,7 +1011,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -1005,7 +1029,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 
 	task_numa_placement(p);
 
-	p->numa_faults_buffer[node] += pages;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
@@ -4145,7 +4169,7 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 		return false;
 
 	if (dst_nid == p->numa_preferred_nid ||
-	    p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+	    task_faults(p, dst_nid) > task_faults(p, src_nid))
 		return true;
 
 	return false;
@@ -4169,7 +4193,7 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
 		return false;
 
-	if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+	if (task_faults(p, dst_nid) < task_faults(p, src_nid))
 		return true;
 
 	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8677dbf..9142167 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1282,7 +1282,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
-	int target_nid;
+	int target_nid, last_nid = -1;
 	bool page_locked;
 	bool migrated = false;
 
@@ -1293,6 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
+	last_nid = page_nid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1361,7 +1362,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index ed51f15..24bc9b8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,6 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	int page_nid = -1;
+	int last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3566,6 +3567,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
+	last_nid = page_nid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3581,7 +3583,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(page_nid, 1, migrated);
+		task_numa_fault(last_nid, page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3596,6 +3598,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3643,6 +3646,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(page_mapcount(page) != 1))
 			continue;
 
+		last_nid = page_nid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
@@ -3655,7 +3659,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(page_nid, 1, migrated);
+			task_numa_fault(last_nid, page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 25/63] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared.  This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  5 +++--
 kernel/sched/fair.c   | 46 +++++++++++++++++++++++++++++++++++-----------
 mm/huge_memory.c      |  5 +++--
 mm/memory.c           |  8 ++++++--
 4 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d5ae4bd..8a3aa9e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1435,10 +1435,11 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages,
+				   bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63677ed..dce3545 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,20 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
+static inline unsigned long task_faults(struct task_struct *p, int nid)
+{
+	if (!p->numa_faults)
+		return 0;
+
+	return p->numa_faults[task_faults_idx(nid, 0)] +
+		p->numa_faults[task_faults_idx(nid, 1)];
+}
+
 static unsigned long weighted_cpuload(const int cpu);
 
 
@@ -928,13 +942,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
 
-		faults = p->numa_faults[nid];
+			/* Decay existing window, copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
+
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -970,16 +990,20 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv;
 
 	if (!numabalancing_enabled)
 		return;
 
+	/* For now, do not attempt to detect private/shared accesses */
+	priv = 1;
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
 		p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -987,7 +1011,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -1005,7 +1029,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 
 	task_numa_placement(p);
 
-	p->numa_faults_buffer[node] += pages;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
@@ -4145,7 +4169,7 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 		return false;
 
 	if (dst_nid == p->numa_preferred_nid ||
-	    p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+	    task_faults(p, dst_nid) > task_faults(p, src_nid))
 		return true;
 
 	return false;
@@ -4169,7 +4193,7 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
 		return false;
 
-	if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+	if (task_faults(p, dst_nid) < task_faults(p, src_nid))
 		return true;
 
 	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8677dbf..9142167 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1282,7 +1282,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
-	int target_nid;
+	int target_nid, last_nid = -1;
 	bool page_locked;
 	bool migrated = false;
 
@@ -1293,6 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
+	last_nid = page_nid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1361,7 +1362,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index ed51f15..24bc9b8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,6 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	int page_nid = -1;
+	int last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3566,6 +3567,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
+	last_nid = page_nid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3581,7 +3583,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(page_nid, 1, migrated);
+		task_numa_fault(last_nid, page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3596,6 +3598,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3643,6 +3646,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(page_mapcount(page) != 1))
 			continue;
 
+		last_nid = page_nid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
@@ -3655,7 +3659,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(page_nid, 1, migrated);
+			task_numa_fault(last_nid, page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread
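
The bookkeeping introduced here keeps two counters per node, laid out as
2 * nid + priv, with the decayed totals and the per-scan buffer sharing one
allocation. A userspace sketch of that layout and of the decay step performed
by task_numa_placement(), with the kernel allocation flags and most error
handling omitted, is:

#include <stdio.h>
#include <stdlib.h>

#define NR_NODES 2

static int task_faults_idx(int nid, int priv)
{
	return 2 * nid + priv;
}

int main(void)
{
	/* numa_faults and numa_faults_buffer share a single allocation */
	int entries = 2 * NR_NODES;
	unsigned long *numa_faults = calloc(2 * entries, sizeof(*numa_faults));
	unsigned long *numa_faults_buffer;
	int nid, priv;

	if (!numa_faults)
		return 1;
	numa_faults_buffer = numa_faults + entries;

	/* pretend one scan period recorded 8 private faults on node 1 */
	numa_faults_buffer[task_faults_idx(1, 1)] = 8;

	/* decay the existing window and fold in the faults since last scan */
	for (nid = 0; nid < NR_NODES; nid++) {
		for (priv = 0; priv < 2; priv++) {
			int i = task_faults_idx(nid, priv);

			numa_faults[i] >>= 1;
			numa_faults[i] += numa_faults_buffer[i];
			numa_faults_buffer[i] = 0;
		}
	}

	printf("node 1 private faults after decay: %lu\n",
	       numa_faults[task_faults_idx(1, 1)]);
	free(numa_faults);
	return 0;
}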

* [PATCH 26/63] sched: Check current->mm before allocating NUMA faults
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

task_numa_placement checks current->mm, but only after the buffers for
tracking faults have already been uselessly allocated. Move the check
earlier.

[peterz@infradead.org: Identified the problem]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dce3545..9eb384b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -930,8 +930,6 @@ static void task_numa_placement(struct task_struct *p)
 	int seq, nid, max_nid = -1;
 	unsigned long max_faults = 0;
 
-	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
-		return;
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
 		return;
@@ -998,6 +996,10 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!numabalancing_enabled)
 		return;
 
+	/* for example, ksmd faulting in a user's mm */
+	if (!p->mm)
+		return;
+
 	/* For now, do not attempt to detect private/shared accesses */
 	priv = 1;
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 26/63] sched: Check current->mm before allocating NUMA faults
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

task_numa_placement checks current->mm, but only after the buffers for
tracking faults have already been uselessly allocated. Move the check
earlier.

[peterz@infradead.org: Identified the problem]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dce3545..9eb384b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -930,8 +930,6 @@ static void task_numa_placement(struct task_struct *p)
 	int seq, nid, max_nid = -1;
 	unsigned long max_faults = 0;
 
-	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
-		return;
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
 		return;
@@ -998,6 +996,10 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!numabalancing_enabled)
 		return;
 
+	/* for example, ksmd faulting in a user's mm */
+	if (!p->mm)
+		return;
+
 	/* For now, do not attempt to detect private/shared accesses */
 	priv = 1;
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 27/63] mm: numa: Scan pages with elevated page_mapcount
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Currently automatic NUMA balancing is unable to distinguish between falsely
shared and private pages except by ignoring pages with an elevated
page_mapcount entirely. This avoids shared pages bouncing between the nodes
of the tasks using them, but it also ignores quite a lot of data.

This patch kicks away the training wheels in preparation for adding support
for identifying shared/private pages. The ordering is deliberate so that the
impact of the shared/private detection can be easily measured. Note that the
patch does not migrate shared, file-backed pages within VMAs marked VM_EXEC
as these are generally shared library pages. Migrating such pages is not
beneficial as there is an expectation that they are read-shared between
caches, and iTLB and iCache pressure is generally low.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |  7 ++++---
 mm/huge_memory.c        | 12 +++++-------
 mm/memory.c             |  7 ++-----
 mm/migrate.c            | 17 ++++++-----------
 mm/mprotect.c           |  4 +---
 5 files changed, 18 insertions(+), 29 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 8d3c57f..f5096b5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -90,11 +90,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+				  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+					 struct vm_area_struct *vma, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9142167..2a28c2c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1484,14 +1484,12 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			struct page *page = pmd_page(*pmd);
 
 			/*
-			 * Only check non-shared pages. Do not trap faults
-			 * against the zero page. The read-only data is likely
-			 * to be read-cached on the local CPU cache and it is
-			 * less useful to know about local vs remote hits on
-			 * the zero page.
+			 * Do not trap faults against the zero page. The
+			 * read-only data is likely to be read-cached on the
+			 * local CPU cache and it is less useful to know about
+			 * local vs remote hits on the zero page.
 			 */
-			if (page_mapcount(page) == 1 &&
-			    !is_huge_zero_page(page) &&
+			if (!is_huge_zero_page(page) &&
 			    !pmd_numa(*pmd)) {
 				entry = pmdp_get_and_clear(mm, addr, pmd);
 				entry = pmd_mknuma(entry);
diff --git a/mm/memory.c b/mm/memory.c
index 24bc9b8..3e3b4b8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3577,7 +3577,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, target_nid);
+	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated)
 		page_nid = target_nid;
 
@@ -3642,16 +3642,13 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pteval);
 		if (unlikely(!page))
 			continue;
-		/* only check non-shared pages */
-		if (unlikely(page_mapcount(page) != 1))
-			continue;
 
 		last_nid = page_nid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
 		if (target_nid != -1) {
-			migrated = migrate_misplaced_page(page, target_nid);
+			migrated = migrate_misplaced_page(page, vma, target_nid);
 			if (migrated)
 				page_nid = target_nid;
 		} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index ce8c3a0..f212944 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1599,7 +1599,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+			   int node)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
@@ -1607,10 +1608,11 @@ int migrate_misplaced_page(struct page *page, int node)
 	LIST_HEAD(migratepages);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
+	 * Don't migrate file pages that are mapped in multiple processes
+	 * with execute permissions as they are probably shared libraries.
 	 */
-	if (page_mapcount(page) != 1)
+	if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+	    (vma->vm_flags & VM_EXEC))
 		goto out;
 
 	/*
@@ -1661,13 +1663,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int page_lru = page_is_file_cache(page);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
-	 */
-	if (page_mapcount(page) != 1)
-		goto out_dropref;
-
-	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2da33dc..41e0292 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					if (last_nid != this_nid)
 						all_same_node = false;
 
-					/* only check non-shared pages */
-					if (!pte_numa(oldpte) &&
-					    page_mapcount(page) == 1) {
+					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
 					}
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 27/63] mm: numa: Scan pages with elevated page_mapcount
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Currently automatic NUMA balancing is unable to distinguish between falsely
shared and private pages except by ignoring pages with an elevated
page_mapcount entirely. This avoids shared pages bouncing between the nodes
of the tasks using them, but it also ignores quite a lot of data.

This patch kicks away the training wheels in preparation for adding support
for identifying shared/private pages. The ordering is deliberate so that the
impact of the shared/private detection can be easily measured. Note that the
patch does not migrate shared, file-backed pages within VMAs marked VM_EXEC
as these are generally shared library pages. Migrating such pages is not
beneficial as there is an expectation that they are read-shared between
caches, and iTLB and iCache pressure is generally low.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |  7 ++++---
 mm/huge_memory.c        | 12 +++++-------
 mm/memory.c             |  7 ++-----
 mm/migrate.c            | 17 ++++++-----------
 mm/mprotect.c           |  4 +---
 5 files changed, 18 insertions(+), 29 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 8d3c57f..f5096b5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -90,11 +90,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+				  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+					 struct vm_area_struct *vma, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9142167..2a28c2c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1484,14 +1484,12 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			struct page *page = pmd_page(*pmd);
 
 			/*
-			 * Only check non-shared pages. Do not trap faults
-			 * against the zero page. The read-only data is likely
-			 * to be read-cached on the local CPU cache and it is
-			 * less useful to know about local vs remote hits on
-			 * the zero page.
+			 * Do not trap faults against the zero page. The
+			 * read-only data is likely to be read-cached on the
+			 * local CPU cache and it is less useful to know about
+			 * local vs remote hits on the zero page.
 			 */
-			if (page_mapcount(page) == 1 &&
-			    !is_huge_zero_page(page) &&
+			if (!is_huge_zero_page(page) &&
 			    !pmd_numa(*pmd)) {
 				entry = pmdp_get_and_clear(mm, addr, pmd);
 				entry = pmd_mknuma(entry);
diff --git a/mm/memory.c b/mm/memory.c
index 24bc9b8..3e3b4b8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3577,7 +3577,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, target_nid);
+	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated)
 		page_nid = target_nid;
 
@@ -3642,16 +3642,13 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pteval);
 		if (unlikely(!page))
 			continue;
-		/* only check non-shared pages */
-		if (unlikely(page_mapcount(page) != 1))
-			continue;
 
 		last_nid = page_nid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
 		if (target_nid != -1) {
-			migrated = migrate_misplaced_page(page, target_nid);
+			migrated = migrate_misplaced_page(page, vma, target_nid);
 			if (migrated)
 				page_nid = target_nid;
 		} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index ce8c3a0..f212944 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1599,7 +1599,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+			   int node)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
@@ -1607,10 +1608,11 @@ int migrate_misplaced_page(struct page *page, int node)
 	LIST_HEAD(migratepages);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
+	 * Don't migrate file pages that are mapped in multiple processes
+	 * with execute permissions as they are probably shared libraries.
 	 */
-	if (page_mapcount(page) != 1)
+	if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+	    (vma->vm_flags & VM_EXEC))
 		goto out;
 
 	/*
@@ -1661,13 +1663,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int page_lru = page_is_file_cache(page);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
-	 */
-	if (page_mapcount(page) != 1)
-		goto out_dropref;
-
-	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2da33dc..41e0292 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					if (last_nid != this_nid)
 						all_same_node = false;
 
-					/* only check non-shared pages */
-					if (!pte_numa(oldpte) &&
-					    page_mapcount(page) == 1) {
+					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
 					}
-- 
1.8.4

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 28/63] sched: Remove check that skips small VMAs
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

task_numa_work skips small VMAs. At the time, the logic was intended to
reduce the scanning overhead, which was considerable. It is a dubious hack
at best. It would make much more sense to cache where faults have been
observed and only rescan those regions during subsequent PTE scans. Remove
this hack as motivation to do it properly in the future.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9eb384b..fb4fc66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1127,10 +1127,6 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma))
 			continue;
 
-		/* Skip small VMAs. They are not likely to be of relevance */
-		if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
-			continue;
-
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 29/63] sched: Set preferred NUMA node based on number of private faults
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.

To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that, in the general case, private
accesses are more important for cpu->memory locality. Also, no new
infrastructure is required to treat private pages properly but
interleaving for shared pages requires additional infrastructure.

To detect private accesses the pid of the last accessing task is required
but the storage requirements are high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults was measured. Shared
faults are not taken into consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and then
scheduled away later only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way, accounting for shared faults the same as
private faults can result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page, making the fault information unreliable.
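
For readers unfamiliar with the packing scheme, the stand-alone C sketch
below models the nid/pid helpers this patch adds. It is illustrative only:
the EX_* names and the 4-bit node field are assumptions for the example
(the kernel derives the node width from NODES_SHIFT); only the 8-bit PID
field matches the patch.

#include <stdio.h>

/*
 * Illustrative only: pack the last accessing node and a hash of the
 * accessing pid into a single int, in the style of the
 * nid_pid_to_nidpid() helpers added by this patch. Field widths are
 * assumptions for the example, not the kernel's.
 */
#define EX_PID_SHIFT	8
#define EX_PID_MASK	((1 << EX_PID_SHIFT) - 1)
#define EX_NID_SHIFT	4
#define EX_NID_MASK	((1 << EX_NID_SHIFT) - 1)

static int ex_nid_pid_to_nidpid(int nid, int pid)
{
	return ((nid & EX_NID_MASK) << EX_PID_SHIFT) | (pid & EX_PID_MASK);
}

static int ex_nidpid_to_pid(int nidpid)
{
	return nidpid & EX_PID_MASK;
}

static int ex_nidpid_to_nid(int nidpid)
{
	return (nidpid >> EX_PID_SHIFT) & EX_NID_MASK;
}

int main(void)
{
	int nidpid = ex_nid_pid_to_nidpid(2, 12345);

	/* Only the low 8 bits of the pid survive the packing */
	printf("nid=%d pid_hash=%d\n",
	       ex_nidpid_to_nid(nidpid), ex_nidpid_to_pid(nidpid));
	return 0;
}

Running it with pid 12345 reports pid_hash=57, which is why only
approximately private accesses can be detected: distinct pids that share
the low 8 bits will alias.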

[riel@redhat.com: Fix compilation error when !NUMA_BALANCING]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h                | 89 +++++++++++++++++++++++++++++----------
 include/linux/mm_types.h          |  4 +-
 include/linux/page-flags-layout.h | 28 +++++++-----
 kernel/sched/fair.c               | 12 ++++--
 mm/huge_memory.c                  |  8 ++--
 mm/memory.c                       | 16 +++----
 mm/mempolicy.c                    |  8 ++--
 mm/migrate.c                      |  4 +-
 mm/mm_init.c                      | 18 ++++----
 mm/mmzone.c                       | 14 +++---
 mm/mprotect.c                     | 26 ++++++++----
 mm/page_alloc.c                   |  4 +-
 12 files changed, 149 insertions(+), 82 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..bb412ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -581,11 +581,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
  * sets it, so none of the operations on it need to be atomic.
  */
 
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NID_PGOFF		(ZONES_PGOFF - LAST_NID_WIDTH)
+#define LAST_NIDPID_PGOFF	(ZONES_PGOFF - LAST_NIDPID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -595,7 +595,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NID_PGSHIFT	(LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
+#define LAST_NIDPID_PGSHIFT	(LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -617,7 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NID_MASK		((1UL << LAST_NID_WIDTH) - 1)
+#define LAST_NIDPID_MASK	((1UL << LAST_NIDPID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -661,48 +661,93 @@ static inline int page_to_nid(const struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int nid_pid_to_nidpid(int nid, int pid)
 {
-	return xchg(&page->_last_nid, nid);
+	return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
 }
 
-static inline int page_nid_last(struct page *page)
+static inline int nidpid_to_pid(int nidpid)
 {
-	return page->_last_nid;
+	return nidpid & LAST__PID_MASK;
 }
-static inline void page_nid_reset_last(struct page *page)
+
+static inline int nidpid_to_nid(int nidpid)
+{
+	return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+	return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+}
+
+static inline bool nidpid_nid_unset(int nidpid)
 {
-	page->_last_nid = -1;
+	return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+}
+
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+static inline int page_nidpid_xchg_last(struct page *page, int nid)
+{
+	return xchg(&page->_last_nidpid, nid);
+}
+
+static inline int page_nidpid_last(struct page *page)
+{
+	return page->_last_nidpid;
+}
+static inline void page_nidpid_reset_last(struct page *page)
+{
+	page->_last_nidpid = -1;
 }
 #else
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
 {
-	return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+	return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
 }
 
-extern int page_nid_xchg_last(struct page *page, int nid);
+extern int page_nidpid_xchg_last(struct page *page, int nidpid);
 
-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
 {
-	int nid = (1 << LAST_NID_SHIFT) - 1;
+	int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
 
-	page->flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
-	page->flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+	page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+	page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
 }
-#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
 #else
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
 {
 	return page_to_nid(page);
 }
 
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
 {
 	return page_to_nid(page);
 }
 
-static inline void page_nid_reset_last(struct page *page)
+static inline int nidpid_to_nid(int nidpid)
+{
+	return -1;
+}
+
+static inline int nidpid_to_pid(int nidpid)
+{
+	return -1;
+}
+
+static inline int nid_pid_to_nidpid(int nid, int pid)
+{
+	return -1;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+	return 1;
+}
+
+static inline void page_nidpid_reset_last(struct page *page)
 {
 }
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b7adf1d..38a902a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
 	void *shadow;
 #endif
 
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-	int _last_nid;
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+	int _last_nidpid;
 #endif
 }
 /*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 93506a1..02bc918 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -38,10 +38,10 @@
  * The last is when there is insufficient space in page->flags and a separate
  * lookup is necessary.
  *
- * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |          ... | FLAGS |
- *         " plus space for last_nid: |       NODE     | ZONE | LAST_NID ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE |          ... | FLAGS |
- *         " plus space for last_nid: | SECTION | NODE | ZONE | LAST_NID ... | FLAGS |
+ * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |             ... | FLAGS |
+ *      " plus space for last_nidpid: |       NODE     | ZONE | LAST_NIDPID ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
+ *      " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
  * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
  */
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -62,15 +62,21 @@
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#define LAST_NID_SHIFT NODES_SHIFT
+#define LAST__PID_SHIFT 8
+#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)
+
+#define LAST__NID_SHIFT NODES_SHIFT
+#define LAST__NID_MASK  ((1 << LAST__NID_SHIFT)-1)
+
+#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
 #else
-#define LAST_NID_SHIFT 0
+#define LAST_NIDPID_SHIFT 0
 #endif
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NID_WIDTH LAST_NID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
 #else
-#define LAST_NID_WIDTH 0
+#define LAST_NIDPID_WIDTH 0
 #endif
 
 /*
@@ -81,8 +87,8 @@
 #define NODE_NOT_IN_PAGE_FLAGS
 #endif
 
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NID_WIDTH == 0
-#define LAST_NID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
+#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
 #endif
 
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fb4fc66..f83da25 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,7 +988,7 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_nid, int node, int pages, bool migrated)
+void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
 	int priv;
@@ -1000,8 +1000,14 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!p->mm)
 		return;
 
-	/* For now, do not attempt to detect private/shared accesses */
-	priv = 1;
+	/*
+	 * First accesses are treated as private, otherwise consider accesses
+	 * to be private if the accessing pid has not changed
+	 */
+	if (!nidpid_pid_unset(last_nidpid))
+		priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+	else
+		priv = 1;
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a28c2c..0baf0e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1282,7 +1282,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
-	int target_nid, last_nid = -1;
+	int target_nid, last_nidpid = -1;
 	bool page_locked;
 	bool migrated = false;
 
@@ -1293,7 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
-	last_nid = page_nid_last(page);
+	last_nidpid = page_nidpid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1362,7 +1362,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
 
 	return 0;
 }
@@ -1682,7 +1682,7 @@ static void __split_huge_page_refcount(struct page *page,
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
-		page_nid_xchg_last(page_tail, page_nid_last(page));
+		page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index 3e3b4b8..cc7f206 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@
 
 #include "internal.h"
 
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nid.
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	int page_nid = -1;
-	int last_nid;
+	int last_nidpid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3567,7 +3567,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
-	last_nid = page_nid_last(page);
+	last_nidpid = page_nidpid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3583,7 +3583,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(last_nid, page_nid, 1, migrated);
+		task_numa_fault(last_nidpid, page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3598,7 +3598,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
-	int last_nid;
+	int last_nidpid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3643,7 +3643,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(!page))
 			continue;
 
-		last_nid = page_nid_last(page);
+		last_nidpid = page_nidpid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
@@ -3656,7 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(last_nid, page_nid, 1, migrated);
+			task_numa_fault(last_nidpid, page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0472964..aff1f1e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2348,9 +2348,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int last_nid;
+		int last_nidpid;
+		int this_nidpid;
 
 		polnid = numa_node_id();
+		this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
 
 		/*
 		 * Multi-stage node selection is used in conjunction
@@ -2373,8 +2375,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * it less likely we act on an unlikely task<->page
 		 * relation.
 		 */
-		last_nid = page_nid_xchg_last(page, polnid);
-		if (last_nid != polnid)
+		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
+		if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
 			goto out;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index f212944..22abf87 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1498,7 +1498,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
 	if (newpage)
-		page_nid_xchg_last(newpage, page_nid_last(page));
+		page_nidpid_xchg_last(newpage, page_nidpid_last(page));
 
 	return newpage;
 }
@@ -1675,7 +1675,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	if (!new_page)
 		goto out_fail;
 
-	page_nid_xchg_last(new_page, page_nid_last(page));
+	page_nidpid_xchg_last(new_page, page_nidpid_last(page));
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 633c088..467de57 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,26 +71,26 @@ void __init mminit_verify_pageflags_layout(void)
 	unsigned long or_mask, add_mask;
 
 	shift = 8 * sizeof(unsigned long);
-	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NID_SHIFT;
+	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastnid %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
-		LAST_NID_WIDTH,
+		LAST_NIDPID_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
-		"Section %d Node %d Zone %d Lastnid %d\n",
+		"Section %d Node %d Zone %d Lastnidpid %d\n",
 		SECTIONS_SHIFT,
 		NODES_SHIFT,
 		ZONES_SHIFT,
-		LAST_NID_SHIFT);
+		LAST_NIDPID_SHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
-		"Section %lu Node %lu Zone %lu Lastnid %lu\n",
+		"Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
 		(unsigned long)SECTIONS_PGSHIFT,
 		(unsigned long)NODES_PGSHIFT,
 		(unsigned long)ZONES_PGSHIFT,
-		(unsigned long)LAST_NID_PGSHIFT);
+		(unsigned long)LAST_NIDPID_PGSHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
 		"Node/Zone ID: %lu -> %lu\n",
 		(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -102,9 +102,9 @@ void __init mminit_verify_pageflags_layout(void)
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
 		"Node not in page flags");
 #endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
-		"Last nid not in page flags");
+		"Last nidpid not in page flags");
 #endif
 
 	if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 2ac0afb..25bb477 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -97,20 +97,20 @@ void lruvec_init(struct lruvec *lruvec)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
 }
 
-#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NID_NOT_IN_PAGE_FLAGS)
-int page_nid_xchg_last(struct page *page, int nid)
+#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NIDPID_NOT_IN_PAGE_FLAGS)
+int page_nidpid_xchg_last(struct page *page, int nidpid)
 {
 	unsigned long old_flags, flags;
-	int last_nid;
+	int last_nidpid;
 
 	do {
 		old_flags = flags = page->flags;
-		last_nid = page_nid_last(page);
+		last_nidpid = page_nidpid_last(page);
 
-		flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
-		flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+		flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+		flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
 	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
 
-	return last_nid;
+	return last_nidpid;
 }
 #endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 41e0292..f0b087d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,15 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa, bool *ret_all_same_node)
+		int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
-	bool all_same_node = true;
+	bool all_same_nidpid = true;
 	int last_nid = -1;
+	int last_pid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -63,11 +64,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
-					int this_nid = page_to_nid(page);
+					int nidpid = page_nidpid_last(page);
+					int this_nid = nidpid_to_nid(nidpid);
+					int this_pid = nidpid_to_pid(nidpid);
+
 					if (last_nid == -1)
 						last_nid = this_nid;
-					if (last_nid != this_nid)
-						all_same_node = false;
+					if (last_pid == -1)
+						last_pid = this_pid;
+					if (last_nid != this_nid ||
+					    last_pid != this_pid) {
+						all_same_nidpid = false;
+					}
 
 					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
@@ -107,7 +115,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-	*ret_all_same_node = all_same_node;
+	*ret_all_same_nidpid = all_same_nidpid;
 	return pages;
 }
 
@@ -134,7 +142,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
-	bool all_same_node;
+	bool all_same_nidpid;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -158,7 +166,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		pages += change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa, &all_same_node);
+				 dirty_accountable, prot_numa, &all_same_nidpid);
 
 		/*
 		 * If we are changing protections for NUMA hinting faults then
@@ -166,7 +174,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && all_same_node)
+		if (prot_numa && all_same_nidpid)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ee638f..f6301d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -626,7 +626,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	page_nid_reset_last(page);
+	page_nidpid_reset_last(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -4015,7 +4015,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		page_mapcount_reset(page);
-		page_nid_reset_last(page);
+		page_nidpid_reset_last(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 30/63] sched: Do not migrate memory immediately after switching node
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

The load balancer can move tasks between nodes and does not take NUMA
locality into account. With automatic NUMA balancing this may result in the
task's working set being migrated to the new node. However, as the fault
buffer will still store faults from the old node, the scheduler may decide to
reset the preferred node and migrate the task back, resulting in more
migrations.

The ideal would be that the scheduler did not migrate tasks with a heavy
memory footprint but this may result in nodes being overloaded. We could
also discard the fault information on task migration but this would still
cause all of the task's working set to be migrated. This patch simply avoids
migrating the memory for a short time after a task is migrated.
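
To make the gating easier to follow, here is a loose user-space model of
the idea. Everything below is an illustrative sketch rather than kernel
code: struct ex_task stands in for task_struct, how the sequence counter
advances is simplified, and only numa_preferred_nid, numa_migrate_seq and
the settle count of 4 are taken from the diff.

#include <stdbool.h>
#include <stdio.h>

struct ex_task {
	int numa_preferred_nid;
	int numa_migrate_seq;	/* 0 means "just moved, hold off on memory" */
};

static const int ex_settle_count = 4;

/* Modelled on move_task(): the load balancer moved the task off-node */
static void ex_move_task(struct ex_task *t, int src_nid, int dst_nid)
{
	if (src_nid != dst_nid && dst_nid != t->numa_preferred_nid)
		t->numa_migrate_seq = 0;
}

/* Simplified stand-in for the per-scan-sequence update */
static void ex_placement_tick(struct ex_task *t)
{
	if (t->numa_migrate_seq < ex_settle_count)
		t->numa_migrate_seq++;
}

/* Modelled on the mpol_misplaced() check: migrate memory only once settled */
static bool ex_may_migrate_memory(struct ex_task *t, int this_nid)
{
	return this_nid == t->numa_preferred_nid || t->numa_migrate_seq;
}

int main(void)
{
	struct ex_task t = { .numa_preferred_nid = 0, .numa_migrate_seq = 1 };

	ex_move_task(&t, 0, 1);		/* moved away from preferred node 0 */
	printf("right after move: %d\n", ex_may_migrate_memory(&t, 1));
	ex_placement_tick(&t);		/* one scan sequence later */
	printf("after settling:   %d\n", ex_may_migrate_memory(&t, 1));
	return 0;
}

The first check returns 0 and the second returns 1, matching the intent
described above: a short-lived move does not drag the working set along,
but a task that stays on the new node starts migrating memory again.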

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c |  2 +-
 kernel/sched/fair.c | 18 ++++++++++++++++--
 mm/mempolicy.c      | 12 ++++++++++++
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 60e640d..124bb40 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1626,7 +1626,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = 0;
+	p->numa_migrate_seq = 1;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f83da25..b7052ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -884,7 +884,7 @@ static unsigned int task_scan_max(struct task_struct *p)
  * the preferred node but still allow the scheduler to move the task again if
  * the nodes CPUs are overloaded.
  */
-unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
 
 static inline int task_faults_idx(int nid, int priv)
 {
@@ -980,7 +980,7 @@ static void task_numa_placement(struct task_struct *p)
 
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
-		p->numa_migrate_seq = 0;
+		p->numa_migrate_seq = 1;
 		migrate_task_to(p, preferred_cpu);
 	}
 }
@@ -4120,6 +4120,20 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 	set_task_cpu(p, env->dst_cpu);
 	activate_task(env->dst_rq, p, 0);
 	check_preempt_curr(env->dst_rq, p, 0);
+#ifdef CONFIG_NUMA_BALANCING
+	if (p->numa_preferred_nid != -1) {
+		int src_nid = cpu_to_node(env->src_cpu);
+		int dst_nid = cpu_to_node(env->dst_cpu);
+
+		/*
+		 * If the load balancer has moved the task then limit
+		 * migrations from taking place in the short term in
+		 * case this is a short-lived migration.
+		 */
+		if (src_nid != dst_nid && dst_nid != p->numa_preferred_nid)
+			p->numa_migrate_seq = 0;
+	}
+#endif
 }
 
 /*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index aff1f1e..196d8da 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2378,6 +2378,18 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
 		if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
 			goto out;
+
+#ifdef CONFIG_NUMA_BALANCING
+		/*
+		 * If the scheduler has just moved us away from our
+		 * preferred node, do not bother migrating pages yet.
+		 * This way a short and temporary process migration will
+		 * not cause excessive memory migration.
+		 */
+		if (polnid != current->numa_preferred_nid &&
+				!current->numa_migrate_seq)
+			goto out;
+#endif
 	}
 
 	if (curnid != polnid)
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 30/63] sched: Do not migrate memory immediately after switching node
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

The load balancer can move tasks between nodes and does not take NUMA
locality into account. With automatic NUMA balancing this may result in the
task's working set being migrated to the new node. However, as the fault
buffer will still store faults from the old node, the scheduler may decide
to reset the preferred node and migrate the task back, resulting in more
migrations.

Ideally the scheduler would not migrate tasks with a heavy memory footprint
at all, but that could leave nodes overloaded. We could also discard the
fault information on task migration, but that would still cause the whole
of the task's working set to be migrated. This patch simply avoids
migrating the memory for a short time after a task is migrated.
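
In outline, the throttle amounts to the check below. This is a simplified,
freestanding sketch of the decision made at hinting fault time; the struct
and the should_migrate_page() helper are illustrative rather than kernel
code, although the field names match the patch.

  #include <stdbool.h>

  struct numa_task {
          int numa_preferred_nid;         /* -1 until a preferred node is chosen */
          int numa_migrate_seq;           /* 0 => task was just moved off-node */
  };

  /* Decision at NUMA hinting fault time: leave the page alone while the
   * task has only just been moved away from its preferred node. */
  static bool should_migrate_page(const struct numa_task *p, int page_nid)
  {
          if (page_nid != p->numa_preferred_nid && p->numa_migrate_seq == 0)
                  return false;   /* short-lived move: keep memory where it is */
          return true;
  }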

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c |  2 +-
 kernel/sched/fair.c | 18 ++++++++++++++++--
 mm/mempolicy.c      | 12 ++++++++++++
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 60e640d..124bb40 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1626,7 +1626,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = 0;
+	p->numa_migrate_seq = 1;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f83da25..b7052ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -884,7 +884,7 @@ static unsigned int task_scan_max(struct task_struct *p)
  * the preferred node but still allow the scheduler to move the task again if
  * the nodes CPUs are overloaded.
  */
-unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
 
 static inline int task_faults_idx(int nid, int priv)
 {
@@ -980,7 +980,7 @@ static void task_numa_placement(struct task_struct *p)
 
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
-		p->numa_migrate_seq = 0;
+		p->numa_migrate_seq = 1;
 		migrate_task_to(p, preferred_cpu);
 	}
 }
@@ -4120,6 +4120,20 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 	set_task_cpu(p, env->dst_cpu);
 	activate_task(env->dst_rq, p, 0);
 	check_preempt_curr(env->dst_rq, p, 0);
+#ifdef CONFIG_NUMA_BALANCING
+	if (p->numa_preferred_nid != -1) {
+		int src_nid = cpu_to_node(env->src_cpu);
+		int dst_nid = cpu_to_node(env->dst_cpu);
+
+		/*
+		 * If the load balancer has moved the task then limit
+		 * migrations from taking place in the short term in
+		 * case this is a short-lived migration.
+		 */
+		if (src_nid != dst_nid && dst_nid != p->numa_preferred_nid)
+			p->numa_migrate_seq = 0;
+	}
+#endif
 }
 
 /*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index aff1f1e..196d8da 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2378,6 +2378,18 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
 		if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
 			goto out;
+
+#ifdef CONFIG_NUMA_BALANCING
+		/*
+		 * If the scheduler has just moved us away from our
+		 * preferred node, do not bother migrating pages yet.
+		 * This way a short and temporary process migration will
+		 * not cause excessive memory migration.
+		 */
+		if (polnid != current->numa_preferred_nid &&
+				!current->numa_migrate_seq)
+			goto out;
+#endif
 	}
 
 	if (curnid != polnid)
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 31/63] mm: numa: only unmap migrate-on-fault VMAs
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

A 90% regression was observed with a large Oracle performance test on a
4-node system. Profiles indicated that the overhead was due to contention
on sp_lock when looking up shared memory policies. These policies do not
have the appropriate flags to allow them to be automatically balanced, so
trapping faults on them is pointless. This patch skips VMAs whose policy
does not have MPOL_F_MOF set.
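
The scan-time filter reduces to a flag test on the VMA's effective policy.
A freestanding sketch follows; the policy_allows_numa_faults() helper and
the flag value are illustrative, only the flag name is the kernel's.

  #include <stdbool.h>

  #define MPOL_F_MOF      (1 << 3)        /* migrate-on-fault; value illustrative */

  struct mempolicy_lite {
          unsigned short flags;
  };

  /* Only VMAs whose effective policy allows migrate-on-fault are worth
   * trapping NUMA hinting faults on; everything else is skipped. */
  static bool policy_allows_numa_faults(const struct mempolicy_lite *pol)
  {
          return pol && (pol->flags & MPOL_F_MOF);
  }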

[riel@redhat.com: Initial patch]
Reviewed-by: Rik van Riel <riel@redhat.com>
Reported-and-tested-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mempolicy.h |  1 +
 kernel/sched/fair.c       |  2 +-
 mm/mempolicy.c            | 24 ++++++++++++++++++++++++
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index da6716b..ea4d249 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -136,6 +136,7 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 
 struct mempolicy *get_vma_policy(struct task_struct *tsk,
 		struct vm_area_struct *vma, unsigned long addr);
+bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7052ed..1789e3c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1130,7 +1130,7 @@ void task_numa_work(struct callback_head *work)
 		vma = mm->mmap;
 	}
 	for (; vma; vma = vma->vm_next) {
-		if (!vma_migratable(vma))
+		if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
 			continue;
 
 		do {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 196d8da..0e895a2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1679,6 +1679,30 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
 	return pol;
 }
 
+bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma)
+{
+	struct mempolicy *pol = get_task_policy(task);
+	if (vma) {
+		if (vma->vm_ops && vma->vm_ops->get_policy) {
+			bool ret = false;
+
+			pol = vma->vm_ops->get_policy(vma, vma->vm_start);
+			if (pol && (pol->flags & MPOL_F_MOF))
+				ret = true;
+			mpol_cond_put(pol);
+
+			return ret;
+		} else if (vma->vm_policy) {
+			pol = vma->vm_policy;
+		}
+	}
+
+	if (!pol)
+		return default_policy.flags & MPOL_F_MOF;
+
+	return pol->flags & MPOL_F_MOF;
+}
+
 static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 {
 	enum zone_type dynamic_policy_zone = policy_zone;
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 31/63] mm: numa: only unmap migrate-on-fault VMAs
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

A 90% regression was observed with a large Oracle performance test on a
4-node system. Profiles indicated that the overhead was due to contention
on sp_lock when looking up shared memory policies. These policies do not
have the appropriate flags to allow them to be automatically balanced, so
trapping faults on them is pointless. This patch skips VMAs whose policy
does not have MPOL_F_MOF set.

[riel@redhat.com: Initial patch]
Reviewed-by: Rik van Riel <riel@redhat.com>
Reported-and-tested-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mempolicy.h |  1 +
 kernel/sched/fair.c       |  2 +-
 mm/mempolicy.c            | 24 ++++++++++++++++++++++++
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index da6716b..ea4d249 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -136,6 +136,7 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 
 struct mempolicy *get_vma_policy(struct task_struct *tsk,
 		struct vm_area_struct *vma, unsigned long addr);
+bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7052ed..1789e3c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1130,7 +1130,7 @@ void task_numa_work(struct callback_head *work)
 		vma = mm->mmap;
 	}
 	for (; vma; vma = vma->vm_next) {
-		if (!vma_migratable(vma))
+		if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
 			continue;
 
 		do {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 196d8da..0e895a2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1679,6 +1679,30 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
 	return pol;
 }
 
+bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma)
+{
+	struct mempolicy *pol = get_task_policy(task);
+	if (vma) {
+		if (vma->vm_ops && vma->vm_ops->get_policy) {
+			bool ret = false;
+
+			pol = vma->vm_ops->get_policy(vma, vma->vm_start);
+			if (pol && (pol->flags & MPOL_F_MOF))
+				ret = true;
+			mpol_cond_put(pol);
+
+			return ret;
+		} else if (vma->vm_policy) {
+			pol = vma->vm_policy;
+		}
+	}
+
+	if (!pol)
+		return default_policy.flags & MPOL_F_MOF;
+
+	return pol->flags & MPOL_F_MOF;
+}
+
 static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 {
 	enum zone_type dynamic_policy_zone = policy_zone;
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 32/63] sched: Avoid overloading CPUs on a preferred NUMA node
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
find_idlest_cpu_node has two critical limitations. It does not take the
scheduling class into account when calculating the load and it is
unsuitable for use when comparing loads between NUMA nodes.

task_numa_find_cpu uses similar load calculations to wake_affine() when
selecting the least loaded CPU within a scheduling domain common to the
source and destination nodes. It avoids causing CPU load imbalances in
the machine by refusing to migrate if the relative load on the target
CPU is higher than on the source CPU.
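
The core of the comparison is wake_affine()-style: scale each side's load
by its CPU power, give the source the benefit of the domain's imbalance
threshold, charge the task's weight to the destination, and only move if
the destination still ends up no more loaded. A simplified sketch with
plain integers, ignoring the cgroup effective_load() term:

  /* Returns non-zero if moving a task of the given weight from the source
   * CPU to the destination CPU would leave the destination no more loaded
   * than the source (all values in arbitrary load units). */
  static int move_is_balanced(long src_load, long src_power,
                              long dst_load, long dst_power,
                              long weight, int imbalance_pct)
  {
          /* Source keeps the allowed imbalance and sheds the task's weight */
          long src_eff = (100 + (imbalance_pct - 100) / 2) * src_power
                                  * (src_load - weight);
          /* Destination is charged the task's weight on arrival */
          long dst_eff = 100 * dst_power * (dst_load + weight);

          return dst_eff <= src_eff;
  }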

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 131 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 102 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1789e3c..fd6e9e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,28 +901,114 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
 }
 
 static unsigned long weighted_cpuload(const int cpu);
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
+struct numa_stats {
+	unsigned long load;
+	s64 eff_load;
+	unsigned long faults;
+};
 
-static int
-find_idlest_cpu_node(int this_cpu, int nid)
-{
-	unsigned long load, min_load = ULONG_MAX;
-	int i, idlest_cpu = this_cpu;
+struct task_numa_env {
+	struct task_struct *p;
 
-	BUG_ON(cpu_to_node(this_cpu) == nid);
+	int src_cpu, src_nid;
+	int dst_cpu, dst_nid;
 
-	rcu_read_lock();
-	for_each_cpu(i, cpumask_of_node(nid)) {
-		load = weighted_cpuload(i);
+	struct numa_stats src_stats, dst_stats;
 
-		if (load < min_load) {
-			min_load = load;
-			idlest_cpu = i;
+	unsigned long best_load;
+	int best_cpu;
+};
+
+static int task_numa_migrate(struct task_struct *p)
+{
+	int node_cpu = cpumask_first(cpumask_of_node(p->numa_preferred_nid));
+	struct task_numa_env env = {
+		.p = p,
+		.src_cpu = task_cpu(p),
+		.src_nid = cpu_to_node(task_cpu(p)),
+		.dst_cpu = node_cpu,
+		.dst_nid = p->numa_preferred_nid,
+		.best_load = ULONG_MAX,
+		.best_cpu = task_cpu(p),
+	};
+	struct sched_domain *sd;
+	int cpu;
+	struct task_group *tg = task_group(p);
+	unsigned long weight;
+	bool balanced;
+	int imbalance_pct, idx = -1;
+
+	/*
+	 * Find the lowest common scheduling domain covering the nodes of both
+	 * the CPU the task is currently running on and the target NUMA node.
+	 */
+	rcu_read_lock();
+	for_each_domain(env.src_cpu, sd) {
+		if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
+			/*
+			 * busy_idx is used for the load decision as it is the
+			 * same index used by the regular load balancer for an
+			 * active cpu.
+			 */
+			idx = sd->busy_idx;
+			imbalance_pct = sd->imbalance_pct;
+			break;
 		}
 	}
 	rcu_read_unlock();
 
-	return idlest_cpu;
+	if (WARN_ON_ONCE(idx == -1))
+		return 0;
+
+	/*
+	 * XXX the below is mostly nicked from wake_affine(); we should
+	 * see about sharing a bit if at all possible; also it might want
+	 * some per entity weight love.
+	 */
+	weight = p->se.load.weight;
+	env.src_stats.load = source_load(env.src_cpu, idx);
+	env.src_stats.eff_load = 100 + (imbalance_pct - 100) / 2;
+	env.src_stats.eff_load *= power_of(env.src_cpu);
+	env.src_stats.eff_load *= env.src_stats.load + effective_load(tg, env.src_cpu, -weight, -weight);
+
+	for_each_cpu(cpu, cpumask_of_node(env.dst_nid)) {
+		env.dst_cpu = cpu;
+		env.dst_stats.load = target_load(cpu, idx);
+
+		/* If the CPU is idle, use it */
+		if (!env.dst_stats.load) {
+			env.best_cpu = cpu;
+			goto migrate;
+		}
+
+		/* Otherwise check the target CPU load */
+		env.dst_stats.eff_load = 100;
+		env.dst_stats.eff_load *= power_of(cpu);
+		env.dst_stats.eff_load *= env.dst_stats.load + effective_load(tg, cpu, weight, weight);
+
+		/*
+		 * Destination is considered balanced if the destination CPU is
+		 * less loaded than the source CPU. Unfortunately there is a
+		 * risk that a task running on a lightly loaded CPU will not
+		 * migrate to its preferred node due to load imbalances.
+		 */
+		balanced = (env.dst_stats.eff_load <= env.src_stats.eff_load);
+		if (!balanced)
+			continue;
+
+		if (env.dst_stats.eff_load < env.best_load) {
+			env.best_load = env.dst_stats.eff_load;
+			env.best_cpu = cpu;
+		}
+	}
+
+migrate:
+	return migrate_task_to(p, env.best_cpu);
 }
 
 static void task_numa_placement(struct task_struct *p)
@@ -966,22 +1052,10 @@ static void task_numa_placement(struct task_struct *p)
 	 * the working set placement.
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
-		int preferred_cpu;
-
-		/*
-		 * If the task is not on the preferred node then find the most
-		 * idle CPU to migrate to.
-		 */
-		preferred_cpu = task_cpu(p);
-		if (cpu_to_node(preferred_cpu) != max_nid) {
-			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
-							     max_nid);
-		}
-
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 1;
-		migrate_task_to(p, preferred_cpu);
+		task_numa_migrate(p);
 	}
 }
 
@@ -3292,7 +3366,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	struct sched_entity *se = tg->se[cpu];
 
-	if (!tg->parent)	/* the trivial, non-cgroup case */
+	if (!tg->parent || !wl)	/* the trivial, non-cgroup case */
 		return wl;
 
 	for_each_sched_entity(se) {
@@ -3345,8 +3419,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 }
 #else
 
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
-		unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	return wl;
 }
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 32/63] sched: Avoid overloading CPUs on a preferred NUMA node
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
find_idlest_cpu_node has two critical limitations. It does not take the
scheduling class into account when calculating the load and it is
unsuitable for use when comparing loads between NUMA nodes.

task_numa_find_cpu uses similar load calculations to wake_affine() when
selecting the least loaded CPU within a scheduling domain common to the
source and destination nodes. It avoids causing CPU load imbalances in
the machine by refusing to migrate if the relative load on the target
CPU is higher than on the source CPU.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 131 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 102 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1789e3c..fd6e9e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,28 +901,114 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
 }
 
 static unsigned long weighted_cpuload(const int cpu);
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
+struct numa_stats {
+	unsigned long load;
+	s64 eff_load;
+	unsigned long faults;
+};
 
-static int
-find_idlest_cpu_node(int this_cpu, int nid)
-{
-	unsigned long load, min_load = ULONG_MAX;
-	int i, idlest_cpu = this_cpu;
+struct task_numa_env {
+	struct task_struct *p;
 
-	BUG_ON(cpu_to_node(this_cpu) == nid);
+	int src_cpu, src_nid;
+	int dst_cpu, dst_nid;
 
-	rcu_read_lock();
-	for_each_cpu(i, cpumask_of_node(nid)) {
-		load = weighted_cpuload(i);
+	struct numa_stats src_stats, dst_stats;
 
-		if (load < min_load) {
-			min_load = load;
-			idlest_cpu = i;
+	unsigned long best_load;
+	int best_cpu;
+};
+
+static int task_numa_migrate(struct task_struct *p)
+{
+	int node_cpu = cpumask_first(cpumask_of_node(p->numa_preferred_nid));
+	struct task_numa_env env = {
+		.p = p,
+		.src_cpu = task_cpu(p),
+		.src_nid = cpu_to_node(task_cpu(p)),
+		.dst_cpu = node_cpu,
+		.dst_nid = p->numa_preferred_nid,
+		.best_load = ULONG_MAX,
+		.best_cpu = task_cpu(p),
+	};
+	struct sched_domain *sd;
+	int cpu;
+	struct task_group *tg = task_group(p);
+	unsigned long weight;
+	bool balanced;
+	int imbalance_pct, idx = -1;
+
+	/*
+	 * Find the lowest common scheduling domain covering the nodes of both
+	 * the CPU the task is currently running on and the target NUMA node.
+	 */
+	rcu_read_lock();
+	for_each_domain(env.src_cpu, sd) {
+		if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
+			/*
+			 * busy_idx is used for the load decision as it is the
+			 * same index used by the regular load balancer for an
+			 * active cpu.
+			 */
+			idx = sd->busy_idx;
+			imbalance_pct = sd->imbalance_pct;
+			break;
 		}
 	}
 	rcu_read_unlock();
 
-	return idlest_cpu;
+	if (WARN_ON_ONCE(idx == -1))
+		return 0;
+
+	/*
+	 * XXX the below is mostly nicked from wake_affine(); we should
+	 * see about sharing a bit if at all possible; also it might want
+	 * some per entity weight love.
+	 */
+	weight = p->se.load.weight;
+	env.src_stats.load = source_load(env.src_cpu, idx);
+	env.src_stats.eff_load = 100 + (imbalance_pct - 100) / 2;
+	env.src_stats.eff_load *= power_of(env.src_cpu);
+	env.src_stats.eff_load *= env.src_stats.load + effective_load(tg, env.src_cpu, -weight, -weight);
+
+	for_each_cpu(cpu, cpumask_of_node(env.dst_nid)) {
+		env.dst_cpu = cpu;
+		env.dst_stats.load = target_load(cpu, idx);
+
+		/* If the CPU is idle, use it */
+		if (!env.dst_stats.load) {
+			env.best_cpu = cpu;
+			goto migrate;
+		}
+
+		/* Otherwise check the target CPU load */
+		env.dst_stats.eff_load = 100;
+		env.dst_stats.eff_load *= power_of(cpu);
+		env.dst_stats.eff_load *= env.dst_stats.load + effective_load(tg, cpu, weight, weight);
+
+		/*
+		 * Destination is considered balanced if the destination CPU is
+		 * less loaded than the source CPU. Unfortunately there is a
+		 * risk that a task running on a lightly loaded CPU will not
+		 * migrate to its preferred node due to load imbalances.
+		 */
+		balanced = (env.dst_stats.eff_load <= env.src_stats.eff_load);
+		if (!balanced)
+			continue;
+
+		if (env.dst_stats.eff_load < env.best_load) {
+			env.best_load = env.dst_stats.eff_load;
+			env.best_cpu = cpu;
+		}
+	}
+
+migrate:
+	return migrate_task_to(p, env.best_cpu);
 }
 
 static void task_numa_placement(struct task_struct *p)
@@ -966,22 +1052,10 @@ static void task_numa_placement(struct task_struct *p)
 	 * the working set placement.
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
-		int preferred_cpu;
-
-		/*
-		 * If the task is not on the preferred node then find the most
-		 * idle CPU to migrate to.
-		 */
-		preferred_cpu = task_cpu(p);
-		if (cpu_to_node(preferred_cpu) != max_nid) {
-			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
-							     max_nid);
-		}
-
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 1;
-		migrate_task_to(p, preferred_cpu);
+		task_numa_migrate(p);
 	}
 }
 
@@ -3292,7 +3366,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	struct sched_entity *se = tg->se[cpu];
 
-	if (!tg->parent)	/* the trivial, non-cgroup case */
+	if (!tg->parent || !wl)	/* the trivial, non-cgroup case */
 		return wl;
 
 	for_each_sched_entity(se) {
@@ -3345,8 +3419,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 }
 #else
 
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
-		unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	return wl;
 }
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 33/63] sched: Retry migration of tasks to CPU on a preferred node
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

When a preferred node is selected for a task there is an attempt to migrate
the task to a CPU there. This may fail, in which case the task will only
migrate if the active load balancer takes action. That may never happen if
the conditions are not right. This patch checks at NUMA hinting fault
time whether another attempt should be made to migrate the task. It will
only make an attempt once every five seconds.
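
The backoff is a plain jiffies comparison. A freestanding sketch of the
pattern follows; jiffies, HZ and time_after() are stand-ins with the same
semantics as the kernel's, and the helper names are illustrative.

  #include <stdbool.h>

  static unsigned long jiffies;                   /* stand-in for the kernel counter */
  #define HZ              250                     /* illustrative tick rate */
  #define time_after(a, b)        ((long)((b) - (a)) < 0)

  struct numa_task {
          unsigned long numa_migrate_retry;       /* 0 => no retry pending */
  };

  /* On a failed migration, do not try again for roughly five seconds. */
  static void note_migrate_failure(struct numa_task *p)
  {
          p->numa_migrate_retry = jiffies + 5 * HZ;
  }

  /* At the next hinting fault, retry only once the backoff has expired. */
  static bool should_retry_migration(const struct numa_task *p)
  {
          return p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry);
  }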

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 30 +++++++++++++++++++++++-------
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8a3aa9e..4dd0c94 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,6 +1331,7 @@ struct task_struct {
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
 	unsigned int numa_scan_period_max;
+	unsigned long numa_migrate_retry;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd6e9e1..559175b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1011,6 +1011,23 @@ migrate:
 	return migrate_task_to(p, env.best_cpu);
 }
 
+/* Attempt to migrate a task to a CPU on the preferred node. */
+static void numa_migrate_preferred(struct task_struct *p)
+{
+	/* Success if task is already running on preferred CPU */
+	p->numa_migrate_retry = 0;
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+		return;
+
+	/* This task has no NUMA fault statistics yet */
+	if (unlikely(p->numa_preferred_nid == -1))
+		return;
+
+	/* Otherwise, try migrate to a CPU on the preferred node */
+	if (task_numa_migrate(p) != 0)
+		p->numa_migrate_retry = jiffies + HZ*5;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -1045,17 +1062,12 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/*
-	 * Record the preferred node as the node with the most faults,
-	 * requeue the task to be running on the idlest CPU on the
-	 * preferred node and reset the scanning rate to recheck
-	 * the working set placement.
-	 */
+	/* Preferred node as the node with the most faults */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 1;
-		task_numa_migrate(p);
+		numa_migrate_preferred(p);
 	}
 }
 
@@ -1111,6 +1123,10 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 
 	task_numa_placement(p);
 
+	/* Retry task to preferred node migration if it previously failed */
+	if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+		numa_migrate_preferred(p);
+
 	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
 }
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 33/63] sched: Retry migration of tasks to CPU on a preferred node
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

When a preferred node is selected for a task there is an attempt to migrate
the task to a CPU there. This may fail, in which case the task will only
migrate if the active load balancer takes action. That may never happen if
the conditions are not right. This patch checks at NUMA hinting fault
time whether another attempt should be made to migrate the task. It will
only make an attempt once every five seconds.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 30 +++++++++++++++++++++++-------
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8a3aa9e..4dd0c94 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,6 +1331,7 @@ struct task_struct {
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
 	unsigned int numa_scan_period_max;
+	unsigned long numa_migrate_retry;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd6e9e1..559175b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1011,6 +1011,23 @@ migrate:
 	return migrate_task_to(p, env.best_cpu);
 }
 
+/* Attempt to migrate a task to a CPU on the preferred node. */
+static void numa_migrate_preferred(struct task_struct *p)
+{
+	/* Success if task is already running on preferred CPU */
+	p->numa_migrate_retry = 0;
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+		return;
+
+	/* This task has no NUMA fault statistics yet */
+	if (unlikely(p->numa_preferred_nid == -1))
+		return;
+
+	/* Otherwise, try migrate to a CPU on the preferred node */
+	if (task_numa_migrate(p) != 0)
+		p->numa_migrate_retry = jiffies + HZ*5;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -1045,17 +1062,12 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/*
-	 * Record the preferred node as the node with the most faults,
-	 * requeue the task to be running on the idlest CPU on the
-	 * preferred node and reset the scanning rate to recheck
-	 * the working set placement.
-	 */
+	/* Preferred node as the node with the most faults */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 1;
-		task_numa_migrate(p);
+		numa_migrate_preferred(p);
 	}
 }
 
@@ -1111,6 +1123,10 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 
 	task_numa_placement(p);
 
+	/* Retry task to preferred node migration if it previously failed */
+	if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+		numa_migrate_preferred(p);
+
 	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
 }
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 34/63] sched: numa: increment numa_migrate_seq when task runs in correct location
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

When a task is found already running on its preferred node while memory
migration is temporarily disabled, increment numa_migrate_seq to indicate
that the task has settled and that memory should migrate towards it again.
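
In outline this is the inverse of the throttle introduced earlier: once the
task is seen on its preferred node, lift the throttle by bumping
numa_migrate_seq back above zero. A freestanding sketch (field names as in
the patch, the helper itself is illustrative):

  struct numa_task {
          int numa_preferred_nid;
          int numa_migrate_seq;           /* 0 => memory migration throttled */
  };

  static void settle_on_preferred_node(struct numa_task *p, int cur_nid)
  {
          if (cur_nid == p->numa_preferred_nid && p->numa_migrate_seq == 0)
                  p->numa_migrate_seq++;  /* re-enable migrate-on-fault */
  }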

[mgorman@suse.de: Only increment migrate_seq if migration temporarily disabled]
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 559175b..9a2e68e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1016,8 +1016,16 @@ static void numa_migrate_preferred(struct task_struct *p)
 {
 	/* Success if task is already running on preferred CPU */
 	p->numa_migrate_retry = 0;
-	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid) {
+		/*
+		 * If migration is temporarily disabled due to a task migration
+		 * then re-enable it now as the task is running on its
+		 * preferred node and memory should migrate locally
+		 */
+		if (!p->numa_migrate_seq)
+			p->numa_migrate_seq++;
 		return;
+	}
 
 	/* This task has no NUMA fault statistics yet */
 	if (unlikely(p->numa_preferred_nid == -1))
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 34/63] sched: numa: increment numa_migrate_seq when task runs in correct location
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

When a task is found already running on its preferred node while memory
migration is temporarily disabled, increment numa_migrate_seq to indicate
that the task has settled and that memory should migrate towards it again.

[mgorman@suse.de: Only increment migrate_seq if migration temporarily disabled]
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 559175b..9a2e68e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1016,8 +1016,16 @@ static void numa_migrate_preferred(struct task_struct *p)
 {
 	/* Success if task is already running on preferred CPU */
 	p->numa_migrate_retry = 0;
-	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid) {
+		/*
+		 * If migration is temporarily disabled due to a task migration
+		 * then re-enable it now as the task is running on its
+		 * preferred node and memory should migrate locally
+		 */
+		if (!p->numa_migrate_seq)
+			p->numa_migrate_seq++;
 		return;
+	}
 
 	/* This task has no NUMA fault statistics yet */
 	if (unlikely(p->numa_preferred_nid == -1))
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 35/63] sched: numa: Do not trap hinting faults for shared libraries
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA hinting faults will not migrate a shared executable page mapped by
multiple processes on the grounds that the data is probably in the CPU
cache already and the page may just bounce between tasks running on
multiple nodes. Even if the migration is avoided, there is still the
overhead of trapping the fault, updating the statistics, making scheduler
placement decisions based on the information, etc. If we are never going
to migrate the page, this is overhead for no gain and, worse, a process
may be placed on a sub-optimal node because of shared executable pages.
This patch avoids trapping faults for shared libraries entirely.
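
The filter itself is a small test on the VMA. A freestanding sketch
follows; the struct is trimmed down and skip_numa_scan() is illustrative,
but the flag logic mirrors the check added to task_numa_work().

  #include <stdbool.h>

  #define VM_READ         0x00000001UL
  #define VM_WRITE        0x00000002UL

  struct vma_lite {
          void *vm_mm;                    /* NULL for special mappings such as the vdso */
          void *vm_file;                  /* non-NULL for file-backed mappings */
          unsigned long vm_flags;
  };

  /* Skip hinting faults for mappings that will never be migrated anyway:
   * the vdso and read-only file-backed mappings (shared library text). */
  static bool skip_numa_scan(const struct vma_lite *vma)
  {
          if (!vma->vm_mm)
                  return true;
          return vma->vm_file &&
                 (vma->vm_flags & (VM_READ | VM_WRITE)) == VM_READ;
  }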

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a2e68e..8760231 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1231,6 +1231,16 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
 			continue;
 
+		/*
+		 * Shared library pages mapped by multiple processes are not
+		 * migrated as it is expected they are cache replicated. Avoid
+		 * hinting faults in read-only file-backed mappings or the vdso
+		 * as migrating the pages will be of marginal benefit.
+		 */
+		if (!vma->vm_mm ||
+		    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 35/63] sched: numa: Do not trap hinting faults for shared libraries
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA hinting faults will not migrate a shared executable page mapped by
multiple processes on the grounds that the data is probably in the CPU
cache already and the page may just bounce between tasks running on
multiple nodes. Even if the migration is avoided, there is still the
overhead of trapping the fault, updating the statistics, making scheduler
placement decisions based on the information, etc. If we are never going
to migrate the page, this is overhead for no gain and, worse, a process
may be placed on a sub-optimal node because of shared executable pages.
This patch avoids trapping faults for shared libraries entirely.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a2e68e..8760231 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1231,6 +1231,16 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
 			continue;
 
+		/*
+		 * Shared library pages mapped by multiple processes are not
+		 * migrated as it is expected they are cache replicated. Avoid
+		 * hinting faults in read-only file-backed mappings or the vdso
+		 * as migrating the pages will be of marginal benefit.
+		 */
+		if (!vma->vm_mm ||
+		    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 36/63] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Base page PMD faulting is meant to batch-handle NUMA hinting faults from
PTEs. However, even if no PTE faults would ever be handled within a
range, the kernel still traps PMD hinting faults. This patch avoids that
overhead.
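
The guard reduces to the condition below: only mark the PMD for batched
handling when the PTE pass actually converted at least one entry in the
range (a sketch of the decision only; parameter names follow the patch).

  /* Mark the PMD only if prot_numa is in effect, this range produced at
   * least one NUMA hinting PTE and every entry shared the same nid/pid. */
  static int should_mark_pmd(int prot_numa, unsigned long this_pages,
                             int all_same_nidpid)
  {
          return prot_numa && this_pages && all_same_nidpid;
  }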

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index f0b087d..5aae390 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -146,6 +146,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 	pmd = pmd_offset(pud, addr);
 	do {
+		unsigned long this_pages;
+
 		next = pmd_addr_end(addr, end);
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
@@ -165,8 +167,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		}
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		pages += change_pte_range(vma, pmd, addr, next, newprot,
+		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
 				 dirty_accountable, prot_numa, &all_same_nidpid);
+		pages += this_pages;
 
 		/*
 		 * If we are changing protections for NUMA hinting faults then
@@ -174,7 +177,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && all_same_nidpid)
+		if (prot_numa && this_pages && all_same_nidpid)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 36/63] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Base page PMD faulting is meant to batch-handle NUMA hinting faults from
PTEs. However, even if no PTE faults would ever be handled within a
range, the kernel still traps PMD hinting faults. This patch avoids that
overhead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index f0b087d..5aae390 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -146,6 +146,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 	pmd = pmd_offset(pud, addr);
 	do {
+		unsigned long this_pages;
+
 		next = pmd_addr_end(addr, end);
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
@@ -165,8 +167,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		}
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		pages += change_pte_range(vma, pmd, addr, next, newprot,
+		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
 				 dirty_accountable, prot_numa, &all_same_nidpid);
+		pages += this_pages;
 
 		/*
 		 * If we are changing protections for NUMA hinting faults then
@@ -174,7 +177,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && all_same_nidpid)
+		if (prot_numa && this_pages && all_same_nidpid)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 37/63] stop_machine: Introduce stop_two_cpus()
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Introduce stop_two_cpus() in order to allow controlled swapping of two
tasks. It repurposes the stop_machine() state machine but only stops
the two CPUs in question, which can be done with on-stack structures and
avoids machine-wide synchronization issues.

The ordering of CPUs is important to avoid deadlocks. If left unordered,
two CPUs calling stop_two_cpus() on each other simultaneously would attempt
to queue the works in the opposite order on each CPU, causing an AB-BA
style deadlock. By always having the lowest-numbered CPU do the queueing
of works, we can guarantee that works are always queued in the same order,
and deadlocks are avoided.
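
The same AB-BA avoidance can be illustrated with ordinary mutexes:
acquiring a pair of locks in one canonical order (here by address) means
two threads contending for the same pair can never each hold one lock and
wait for the other. This illustrates the ordering rule only, not the
stop_machine code:

  #include <pthread.h>
  #include <stdint.h>

  /* Take two mutexes in a fixed global order so that concurrent callers
   * locking the same pair cannot deadlock against each other. */
  static void lock_pair_ordered(pthread_mutex_t *a, pthread_mutex_t *b)
  {
          if ((uintptr_t)a > (uintptr_t)b) {      /* pick a canonical order */
                  pthread_mutex_t *tmp = a;

                  a = b;
                  b = tmp;
          }
          pthread_mutex_lock(a);
          pthread_mutex_lock(b);
  }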

[riel@redhat.com: Deadlock avoidance]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/stop_machine.h |   1 +
 kernel/stop_machine.c        | 272 +++++++++++++++++++++++++++----------------
 2 files changed, 175 insertions(+), 98 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 3b5e910..d2abbdb 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -28,6 +28,7 @@ struct cpu_stop_work {
 };
 
 int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg);
 void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
 			 struct cpu_stop_work *work_buf);
 int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..32a6c44 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -115,6 +115,166 @@ int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
 	return done.executed ? done.ret : -ENOENT;
 }
 
+/* This controls the threads on each CPU. */
+enum multi_stop_state {
+	/* Dummy starting state for thread. */
+	MULTI_STOP_NONE,
+	/* Awaiting everyone to be scheduled. */
+	MULTI_STOP_PREPARE,
+	/* Disable interrupts. */
+	MULTI_STOP_DISABLE_IRQ,
+	/* Run the function */
+	MULTI_STOP_RUN,
+	/* Exit */
+	MULTI_STOP_EXIT,
+};
+
+struct multi_stop_data {
+	int			(*fn)(void *);
+	void			*data;
+	/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
+	unsigned int		num_threads;
+	const struct cpumask	*active_cpus;
+
+	enum multi_stop_state	state;
+	atomic_t		thread_ack;
+};
+
+static void set_state(struct multi_stop_data *msdata,
+		      enum multi_stop_state newstate)
+{
+	/* Reset ack counter. */
+	atomic_set(&msdata->thread_ack, msdata->num_threads);
+	smp_wmb();
+	msdata->state = newstate;
+}
+
+/* Last one to ack a state moves to the next state. */
+static void ack_state(struct multi_stop_data *msdata)
+{
+	if (atomic_dec_and_test(&msdata->thread_ack))
+		set_state(msdata, msdata->state + 1);
+}
+
+/* This is the cpu_stop function which stops the CPU. */
+static int multi_cpu_stop(void *data)
+{
+	struct multi_stop_data *msdata = data;
+	enum multi_stop_state curstate = MULTI_STOP_NONE;
+	int cpu = smp_processor_id(), err = 0;
+	unsigned long flags;
+	bool is_active;
+
+	/*
+	 * When called from stop_machine_from_inactive_cpu(), irq might
+	 * already be disabled.  Save the state and restore it on exit.
+	 */
+	local_save_flags(flags);
+
+	if (!msdata->active_cpus)
+		is_active = cpu == cpumask_first(cpu_online_mask);
+	else
+		is_active = cpumask_test_cpu(cpu, msdata->active_cpus);
+
+	/* Simple state machine */
+	do {
+		/* Chill out and ensure we re-read multi_stop_state. */
+		cpu_relax();
+		if (msdata->state != curstate) {
+			curstate = msdata->state;
+			switch (curstate) {
+			case MULTI_STOP_DISABLE_IRQ:
+				local_irq_disable();
+				hard_irq_disable();
+				break;
+			case MULTI_STOP_RUN:
+				if (is_active)
+					err = msdata->fn(msdata->data);
+				break;
+			default:
+				break;
+			}
+			ack_state(msdata);
+		}
+	} while (curstate != MULTI_STOP_EXIT);
+
+	local_irq_restore(flags);
+	return err;
+}
+
+struct irq_cpu_stop_queue_work_info {
+	int cpu1;
+	int cpu2;
+	struct cpu_stop_work *work1;
+	struct cpu_stop_work *work2;
+};
+
+/*
+ * This function is always run with irqs and preemption disabled.
+ * This guarantees that both work1 and work2 get queued, before
+ * our local migrate thread gets the chance to preempt us.
+ */
+static void irq_cpu_stop_queue_work(void *arg)
+{
+	struct irq_cpu_stop_queue_work_info *info = arg;
+	cpu_stop_queue_work(info->cpu1, info->work1);
+	cpu_stop_queue_work(info->cpu2, info->work2);
+}
+
+/**
+ * stop_two_cpus - stops two cpus
+ * @cpu1: the cpu to stop
+ * @cpu2: the other cpu to stop
+ * @fn: function to execute
+ * @arg: argument to @fn
+ *
+ * Stops both the current and specified CPU and runs @fn on one of them.
+ *
+ * returns when both are completed.
+ */
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg)
+{
+	int call_cpu;
+	struct cpu_stop_done done;
+	struct cpu_stop_work work1, work2;
+	struct irq_cpu_stop_queue_work_info call_args;
+	struct multi_stop_data msdata = {
+		.fn = fn,
+		.data = arg,
+		.num_threads = 2,
+		.active_cpus = cpumask_of(cpu1),
+	};
+
+	work1 = work2 = (struct cpu_stop_work){
+		.fn = multi_cpu_stop,
+		.arg = &msdata,
+		.done = &done
+	};
+
+	call_args = (struct irq_cpu_stop_queue_work_info){
+		.cpu1 = cpu1,
+		.cpu2 = cpu2,
+		.work1 = &work1,
+		.work2 = &work2,
+	};
+
+	cpu_stop_init_done(&done, 2);
+	set_state(&msdata, MULTI_STOP_PREPARE);
+
+	/*
+	 * Queuing needs to be done by the lowest numbered CPU, to ensure
+	 * that works are always queued in the same order on every CPU.
+	 * This prevents deadlocks.
+	 */
+	call_cpu = min(cpu1, cpu2);
+
+	smp_call_function_single(call_cpu, &irq_cpu_stop_queue_work,
+				 &call_args, 0);
+
+	wait_for_completion(&done.completion);
+	return done.executed ? done.ret : -ENOENT;
+}
+
 /**
  * stop_one_cpu_nowait - stop a cpu but don't wait for completion
  * @cpu: cpu to stop
@@ -359,98 +519,14 @@ early_initcall(cpu_stop_init);
 
 #ifdef CONFIG_STOP_MACHINE
 
-/* This controls the threads on each CPU. */
-enum stopmachine_state {
-	/* Dummy starting state for thread. */
-	STOPMACHINE_NONE,
-	/* Awaiting everyone to be scheduled. */
-	STOPMACHINE_PREPARE,
-	/* Disable interrupts. */
-	STOPMACHINE_DISABLE_IRQ,
-	/* Run the function */
-	STOPMACHINE_RUN,
-	/* Exit */
-	STOPMACHINE_EXIT,
-};
-
-struct stop_machine_data {
-	int			(*fn)(void *);
-	void			*data;
-	/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
-	unsigned int		num_threads;
-	const struct cpumask	*active_cpus;
-
-	enum stopmachine_state	state;
-	atomic_t		thread_ack;
-};
-
-static void set_state(struct stop_machine_data *smdata,
-		      enum stopmachine_state newstate)
-{
-	/* Reset ack counter. */
-	atomic_set(&smdata->thread_ack, smdata->num_threads);
-	smp_wmb();
-	smdata->state = newstate;
-}
-
-/* Last one to ack a state moves to the next state. */
-static void ack_state(struct stop_machine_data *smdata)
-{
-	if (atomic_dec_and_test(&smdata->thread_ack))
-		set_state(smdata, smdata->state + 1);
-}
-
-/* This is the cpu_stop function which stops the CPU. */
-static int stop_machine_cpu_stop(void *data)
-{
-	struct stop_machine_data *smdata = data;
-	enum stopmachine_state curstate = STOPMACHINE_NONE;
-	int cpu = smp_processor_id(), err = 0;
-	unsigned long flags;
-	bool is_active;
-
-	/*
-	 * When called from stop_machine_from_inactive_cpu(), irq might
-	 * already be disabled.  Save the state and restore it on exit.
-	 */
-	local_save_flags(flags);
-
-	if (!smdata->active_cpus)
-		is_active = cpu == cpumask_first(cpu_online_mask);
-	else
-		is_active = cpumask_test_cpu(cpu, smdata->active_cpus);
-
-	/* Simple state machine */
-	do {
-		/* Chill out and ensure we re-read stopmachine_state. */
-		cpu_relax();
-		if (smdata->state != curstate) {
-			curstate = smdata->state;
-			switch (curstate) {
-			case STOPMACHINE_DISABLE_IRQ:
-				local_irq_disable();
-				hard_irq_disable();
-				break;
-			case STOPMACHINE_RUN:
-				if (is_active)
-					err = smdata->fn(smdata->data);
-				break;
-			default:
-				break;
-			}
-			ack_state(smdata);
-		}
-	} while (curstate != STOPMACHINE_EXIT);
-
-	local_irq_restore(flags);
-	return err;
-}
-
 int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 {
-	struct stop_machine_data smdata = { .fn = fn, .data = data,
-					    .num_threads = num_online_cpus(),
-					    .active_cpus = cpus };
+	struct multi_stop_data msdata = {
+		.fn = fn,
+		.data = data,
+		.num_threads = num_online_cpus(),
+		.active_cpus = cpus,
+	};
 
 	if (!stop_machine_initialized) {
 		/*
@@ -461,7 +537,7 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 		unsigned long flags;
 		int ret;
 
-		WARN_ON_ONCE(smdata.num_threads != 1);
+		WARN_ON_ONCE(msdata.num_threads != 1);
 
 		local_irq_save(flags);
 		hard_irq_disable();
@@ -472,8 +548,8 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 	}
 
 	/* Set the initial state and stop all online cpus. */
-	set_state(&smdata, STOPMACHINE_PREPARE);
-	return stop_cpus(cpu_online_mask, stop_machine_cpu_stop, &smdata);
+	set_state(&msdata, MULTI_STOP_PREPARE);
+	return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);
 }
 
 int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
@@ -513,25 +589,25 @@ EXPORT_SYMBOL_GPL(stop_machine);
 int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
 				  const struct cpumask *cpus)
 {
-	struct stop_machine_data smdata = { .fn = fn, .data = data,
+	struct multi_stop_data msdata = { .fn = fn, .data = data,
 					    .active_cpus = cpus };
 	struct cpu_stop_done done;
 	int ret;
 
 	/* Local CPU must be inactive and CPU hotplug in progress. */
 	BUG_ON(cpu_active(raw_smp_processor_id()));
-	smdata.num_threads = num_active_cpus() + 1;	/* +1 for local */
+	msdata.num_threads = num_active_cpus() + 1;	/* +1 for local */
 
 	/* No proper task established and can't sleep - busy wait for lock. */
 	while (!mutex_trylock(&stop_cpus_mutex))
 		cpu_relax();
 
 	/* Schedule work on other CPUs and execute directly for local CPU */
-	set_state(&smdata, STOPMACHINE_PREPARE);
+	set_state(&msdata, MULTI_STOP_PREPARE);
 	cpu_stop_init_done(&done, num_active_cpus());
-	queue_stop_cpus_work(cpu_active_mask, stop_machine_cpu_stop, &smdata,
+	queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
 			     &done);
-	ret = stop_machine_cpu_stop(&smdata);
+	ret = multi_cpu_stop(&msdata);
 
 	/* Busy wait for completion. */
 	while (!completion_done(&done.completion))
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 37/63] stop_machine: Introduce stop_two_cpus()
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Introduce stop_two_cpus() in order to allow controlled swapping of two
tasks. It repurposes the stop_machine() state machine but only stops
the two CPUs in question, which can be done with on-stack structures and
avoids machine-wide synchronization issues.

The ordering of CPUs is important to avoid deadlocks. If left unordered,
two CPUs calling stop_two_cpus() on each other simultaneously would attempt
to queue the works in the opposite order on each CPU, causing an AB-BA
style deadlock. By always having the lowest-numbered CPU do the queueing
of works, we can guarantee that works are always queued in the same order,
and deadlocks are avoided.

[riel@redhat.com: Deadlock avoidance]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/stop_machine.h |   1 +
 kernel/stop_machine.c        | 272 +++++++++++++++++++++++++++----------------
 2 files changed, 175 insertions(+), 98 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 3b5e910..d2abbdb 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -28,6 +28,7 @@ struct cpu_stop_work {
 };
 
 int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg);
 void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
 			 struct cpu_stop_work *work_buf);
 int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..32a6c44 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -115,6 +115,166 @@ int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
 	return done.executed ? done.ret : -ENOENT;
 }
 
+/* This controls the threads on each CPU. */
+enum multi_stop_state {
+	/* Dummy starting state for thread. */
+	MULTI_STOP_NONE,
+	/* Awaiting everyone to be scheduled. */
+	MULTI_STOP_PREPARE,
+	/* Disable interrupts. */
+	MULTI_STOP_DISABLE_IRQ,
+	/* Run the function */
+	MULTI_STOP_RUN,
+	/* Exit */
+	MULTI_STOP_EXIT,
+};
+
+struct multi_stop_data {
+	int			(*fn)(void *);
+	void			*data;
+	/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
+	unsigned int		num_threads;
+	const struct cpumask	*active_cpus;
+
+	enum multi_stop_state	state;
+	atomic_t		thread_ack;
+};
+
+static void set_state(struct multi_stop_data *msdata,
+		      enum multi_stop_state newstate)
+{
+	/* Reset ack counter. */
+	atomic_set(&msdata->thread_ack, msdata->num_threads);
+	smp_wmb();
+	msdata->state = newstate;
+}
+
+/* Last one to ack a state moves to the next state. */
+static void ack_state(struct multi_stop_data *msdata)
+{
+	if (atomic_dec_and_test(&msdata->thread_ack))
+		set_state(msdata, msdata->state + 1);
+}
+
+/* This is the cpu_stop function which stops the CPU. */
+static int multi_cpu_stop(void *data)
+{
+	struct multi_stop_data *msdata = data;
+	enum multi_stop_state curstate = MULTI_STOP_NONE;
+	int cpu = smp_processor_id(), err = 0;
+	unsigned long flags;
+	bool is_active;
+
+	/*
+	 * When called from stop_machine_from_inactive_cpu(), irq might
+	 * already be disabled.  Save the state and restore it on exit.
+	 */
+	local_save_flags(flags);
+
+	if (!msdata->active_cpus)
+		is_active = cpu == cpumask_first(cpu_online_mask);
+	else
+		is_active = cpumask_test_cpu(cpu, msdata->active_cpus);
+
+	/* Simple state machine */
+	do {
+		/* Chill out and ensure we re-read multi_stop_state. */
+		cpu_relax();
+		if (msdata->state != curstate) {
+			curstate = msdata->state;
+			switch (curstate) {
+			case MULTI_STOP_DISABLE_IRQ:
+				local_irq_disable();
+				hard_irq_disable();
+				break;
+			case MULTI_STOP_RUN:
+				if (is_active)
+					err = msdata->fn(msdata->data);
+				break;
+			default:
+				break;
+			}
+			ack_state(msdata);
+		}
+	} while (curstate != MULTI_STOP_EXIT);
+
+	local_irq_restore(flags);
+	return err;
+}
+
+struct irq_cpu_stop_queue_work_info {
+	int cpu1;
+	int cpu2;
+	struct cpu_stop_work *work1;
+	struct cpu_stop_work *work2;
+};
+
+/*
+ * This function is always run with irqs and preemption disabled.
+ * This guarantees that both work1 and work2 get queued, before
+ * our local migrate thread gets the chance to preempt us.
+ */
+static void irq_cpu_stop_queue_work(void *arg)
+{
+	struct irq_cpu_stop_queue_work_info *info = arg;
+	cpu_stop_queue_work(info->cpu1, info->work1);
+	cpu_stop_queue_work(info->cpu2, info->work2);
+}
+
+/**
+ * stop_two_cpus - stops two cpus
+ * @cpu1: the cpu to stop
+ * @cpu2: the other cpu to stop
+ * @fn: function to execute
+ * @arg: argument to @fn
+ *
+ * Stops both specified CPUs and runs @fn on one of them.
+ *
+ * returns when both are completed.
+ */
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg)
+{
+	int call_cpu;
+	struct cpu_stop_done done;
+	struct cpu_stop_work work1, work2;
+	struct irq_cpu_stop_queue_work_info call_args;
+	struct multi_stop_data msdata = {
+		.fn = fn,
+		.data = arg,
+		.num_threads = 2,
+		.active_cpus = cpumask_of(cpu1),
+	};
+
+	work1 = work2 = (struct cpu_stop_work){
+		.fn = multi_cpu_stop,
+		.arg = &msdata,
+		.done = &done
+	};
+
+	call_args = (struct irq_cpu_stop_queue_work_info){
+		.cpu1 = cpu1,
+		.cpu2 = cpu2,
+		.work1 = &work1,
+		.work2 = &work2,
+	};
+
+	cpu_stop_init_done(&done, 2);
+	set_state(&msdata, MULTI_STOP_PREPARE);
+
+	/*
+	 * Queuing needs to be done by the lowest numbered CPU, to ensure
+	 * that works are always queued in the same order on every CPU.
+	 * This prevents deadlocks.
+	 */
+	call_cpu = min(cpu1, cpu2);
+
+	smp_call_function_single(call_cpu, &irq_cpu_stop_queue_work,
+				 &call_args, 0);
+
+	wait_for_completion(&done.completion);
+	return done.executed ? done.ret : -ENOENT;
+}
+
 /**
  * stop_one_cpu_nowait - stop a cpu but don't wait for completion
  * @cpu: cpu to stop
@@ -359,98 +519,14 @@ early_initcall(cpu_stop_init);
 
 #ifdef CONFIG_STOP_MACHINE
 
-/* This controls the threads on each CPU. */
-enum stopmachine_state {
-	/* Dummy starting state for thread. */
-	STOPMACHINE_NONE,
-	/* Awaiting everyone to be scheduled. */
-	STOPMACHINE_PREPARE,
-	/* Disable interrupts. */
-	STOPMACHINE_DISABLE_IRQ,
-	/* Run the function */
-	STOPMACHINE_RUN,
-	/* Exit */
-	STOPMACHINE_EXIT,
-};
-
-struct stop_machine_data {
-	int			(*fn)(void *);
-	void			*data;
-	/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
-	unsigned int		num_threads;
-	const struct cpumask	*active_cpus;
-
-	enum stopmachine_state	state;
-	atomic_t		thread_ack;
-};
-
-static void set_state(struct stop_machine_data *smdata,
-		      enum stopmachine_state newstate)
-{
-	/* Reset ack counter. */
-	atomic_set(&smdata->thread_ack, smdata->num_threads);
-	smp_wmb();
-	smdata->state = newstate;
-}
-
-/* Last one to ack a state moves to the next state. */
-static void ack_state(struct stop_machine_data *smdata)
-{
-	if (atomic_dec_and_test(&smdata->thread_ack))
-		set_state(smdata, smdata->state + 1);
-}
-
-/* This is the cpu_stop function which stops the CPU. */
-static int stop_machine_cpu_stop(void *data)
-{
-	struct stop_machine_data *smdata = data;
-	enum stopmachine_state curstate = STOPMACHINE_NONE;
-	int cpu = smp_processor_id(), err = 0;
-	unsigned long flags;
-	bool is_active;
-
-	/*
-	 * When called from stop_machine_from_inactive_cpu(), irq might
-	 * already be disabled.  Save the state and restore it on exit.
-	 */
-	local_save_flags(flags);
-
-	if (!smdata->active_cpus)
-		is_active = cpu == cpumask_first(cpu_online_mask);
-	else
-		is_active = cpumask_test_cpu(cpu, smdata->active_cpus);
-
-	/* Simple state machine */
-	do {
-		/* Chill out and ensure we re-read stopmachine_state. */
-		cpu_relax();
-		if (smdata->state != curstate) {
-			curstate = smdata->state;
-			switch (curstate) {
-			case STOPMACHINE_DISABLE_IRQ:
-				local_irq_disable();
-				hard_irq_disable();
-				break;
-			case STOPMACHINE_RUN:
-				if (is_active)
-					err = smdata->fn(smdata->data);
-				break;
-			default:
-				break;
-			}
-			ack_state(smdata);
-		}
-	} while (curstate != STOPMACHINE_EXIT);
-
-	local_irq_restore(flags);
-	return err;
-}
-
 int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 {
-	struct stop_machine_data smdata = { .fn = fn, .data = data,
-					    .num_threads = num_online_cpus(),
-					    .active_cpus = cpus };
+	struct multi_stop_data msdata = {
+		.fn = fn,
+		.data = data,
+		.num_threads = num_online_cpus(),
+		.active_cpus = cpus,
+	};
 
 	if (!stop_machine_initialized) {
 		/*
@@ -461,7 +537,7 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 		unsigned long flags;
 		int ret;
 
-		WARN_ON_ONCE(smdata.num_threads != 1);
+		WARN_ON_ONCE(msdata.num_threads != 1);
 
 		local_irq_save(flags);
 		hard_irq_disable();
@@ -472,8 +548,8 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 	}
 
 	/* Set the initial state and stop all online cpus. */
-	set_state(&smdata, STOPMACHINE_PREPARE);
-	return stop_cpus(cpu_online_mask, stop_machine_cpu_stop, &smdata);
+	set_state(&msdata, MULTI_STOP_PREPARE);
+	return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);
 }
 
 int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
@@ -513,25 +589,25 @@ EXPORT_SYMBOL_GPL(stop_machine);
 int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
 				  const struct cpumask *cpus)
 {
-	struct stop_machine_data smdata = { .fn = fn, .data = data,
+	struct multi_stop_data msdata = { .fn = fn, .data = data,
 					    .active_cpus = cpus };
 	struct cpu_stop_done done;
 	int ret;
 
 	/* Local CPU must be inactive and CPU hotplug in progress. */
 	BUG_ON(cpu_active(raw_smp_processor_id()));
-	smdata.num_threads = num_active_cpus() + 1;	/* +1 for local */
+	msdata.num_threads = num_active_cpus() + 1;	/* +1 for local */
 
 	/* No proper task established and can't sleep - busy wait for lock. */
 	while (!mutex_trylock(&stop_cpus_mutex))
 		cpu_relax();
 
 	/* Schedule work on other CPUs and execute directly for local CPU */
-	set_state(&smdata, STOPMACHINE_PREPARE);
+	set_state(&msdata, MULTI_STOP_PREPARE);
 	cpu_stop_init_done(&done, num_active_cpus());
-	queue_stop_cpus_work(cpu_active_mask, stop_machine_cpu_stop, &smdata,
+	queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
 			     &done);
-	ret = stop_machine_cpu_stop(&smdata);
+	ret = multi_cpu_stop(&msdata);
 
 	/* Busy wait for completion. */
 	while (!completion_done(&done.completion))
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 38/63] sched: Introduce migrate_swap()
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Use the new stop_two_cpus() to implement migrate_swap(), a function that
flips two tasks between their respective cpus.

I'm fairly sure there's a less crude way than employing the stop_two_cpus()
method, but everything I tried either got horribly fragile and/or complex. So
keep it simple for now.

The notable detail is how we 'migrate' tasks that aren't runnable
anymore. We'll make it appear like we migrated them before they went to
sleep. The sole difference is the previous cpu in the wakeup path, so we
override this.
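
A toy userspace model of that wake_cpu trick follows (purely illustrative;
struct toy_task and its fields are invented and are not the kernel's
task_struct). A queued task is moved between runqueues, while a sleeping
task is "migrated" simply by changing the CPU the wakeup path will use.

#include <stdbool.h>
#include <stdio.h>

struct toy_task {
	const char *name;
	bool on_rq;	/* currently queued on a runqueue? */
	int cpu;	/* runqueue the task sits on, if on_rq */
	int wake_cpu;	/* CPU the wakeup path will start from */
};

static void toy_migrate_swap_task(struct toy_task *p, int cpu)
{
	if (p->on_rq) {
		/* Stands in for deactivate_task/set_task_cpu/activate_task. */
		p->cpu = cpu;
		p->wake_cpu = cpu;
	} else {
		/* Pretend the migration happened before the task slept: only
		 * the "previous CPU" seen by the wakeup path changes. */
		p->wake_cpu = cpu;
	}
}

int main(void)
{
	struct toy_task runner  = { "runner",  true,  0, 0 };
	struct toy_task sleeper = { "sleeper", false, 1, 1 };

	/* Swap their CPUs the way migrate_swap() would. */
	toy_migrate_swap_task(&runner, 1);
	toy_migrate_swap_task(&sleeper, 0);

	printf("%s runs on CPU %d, %s will wake on CPU %d\n",
	       runner.name, runner.cpu, sleeper.name, sleeper.wake_cpu);
	return 0;
}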

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h    |   2 +
 kernel/sched/core.c      | 106 ++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/fair.c      |   3 +-
 kernel/sched/idle_task.c |   2 +-
 kernel/sched/rt.c        |   5 +--
 kernel/sched/sched.h     |   4 +-
 kernel/sched/stop_task.c |   2 +-
 7 files changed, 110 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4dd0c94..703b256 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1033,6 +1033,8 @@ struct task_struct {
 	struct task_struct *last_wakee;
 	unsigned long wakee_flips;
 	unsigned long wakee_flip_decay_ts;
+
+	int wake_cpu;
 #endif
 	int on_rq;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 124bb40..0862196 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1017,6 +1017,102 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	__set_task_cpu(p, new_cpu);
 }
 
+static void __migrate_swap_task(struct task_struct *p, int cpu)
+{
+	if (p->on_rq) {
+		struct rq *src_rq, *dst_rq;
+
+		src_rq = task_rq(p);
+		dst_rq = cpu_rq(cpu);
+
+		deactivate_task(src_rq, p, 0);
+		set_task_cpu(p, cpu);
+		activate_task(dst_rq, p, 0);
+		check_preempt_curr(dst_rq, p, 0);
+	} else {
+		/*
+		 * Task isn't running anymore; make it appear like we migrated
+		 * it before it went to sleep. This means on wakeup we make the
+		 * previous cpu our target instead of where it really is.
+		 */
+		p->wake_cpu = cpu;
+	}
+}
+
+struct migration_swap_arg {
+	struct task_struct *src_task, *dst_task;
+	int src_cpu, dst_cpu;
+};
+
+static int migrate_swap_stop(void *data)
+{
+	struct migration_swap_arg *arg = data;
+	struct rq *src_rq, *dst_rq;
+	int ret = -EAGAIN;
+
+	src_rq = cpu_rq(arg->src_cpu);
+	dst_rq = cpu_rq(arg->dst_cpu);
+
+	double_rq_lock(src_rq, dst_rq);
+	if (task_cpu(arg->dst_task) != arg->dst_cpu)
+		goto unlock;
+
+	if (task_cpu(arg->src_task) != arg->src_cpu)
+		goto unlock;
+
+	if (!cpumask_test_cpu(arg->dst_cpu, tsk_cpus_allowed(arg->src_task)))
+		goto unlock;
+
+	if (!cpumask_test_cpu(arg->src_cpu, tsk_cpus_allowed(arg->dst_task)))
+		goto unlock;
+
+	__migrate_swap_task(arg->src_task, arg->dst_cpu);
+	__migrate_swap_task(arg->dst_task, arg->src_cpu);
+
+	ret = 0;
+
+unlock:
+	double_rq_unlock(src_rq, dst_rq);
+
+	return ret;
+}
+
+/*
+ * Cross migrate two tasks
+ */
+int migrate_swap(struct task_struct *cur, struct task_struct *p)
+{
+	struct migration_swap_arg arg;
+	int ret = -EINVAL;
+
+	get_online_cpus();
+
+	arg = (struct migration_swap_arg){
+		.src_task = cur,
+		.src_cpu = task_cpu(cur),
+		.dst_task = p,
+		.dst_cpu = task_cpu(p),
+	};
+
+	if (arg.src_cpu == arg.dst_cpu)
+		goto out;
+
+	if (!cpu_active(arg.src_cpu) || !cpu_active(arg.dst_cpu))
+		goto out;
+
+	if (!cpumask_test_cpu(arg.dst_cpu, tsk_cpus_allowed(arg.src_task)))
+		goto out;
+
+	if (!cpumask_test_cpu(arg.src_cpu, tsk_cpus_allowed(arg.dst_task)))
+		goto out;
+
+	ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
+
+out:
+	put_online_cpus();
+	return ret;
+}
+
 struct migration_arg {
 	struct task_struct *task;
 	int dest_cpu;
@@ -1236,9 +1332,9 @@ out:
  * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
  */
 static inline
-int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
 {
-	int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
+	cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
 
 	/*
 	 * In order not to call set_task_cpu() on a blocking task we need
@@ -1513,7 +1609,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	if (p->sched_class->task_waking)
 		p->sched_class->task_waking(p);
 
-	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
@@ -1752,7 +1848,7 @@ void wake_up_new_task(struct task_struct *p)
 	 *  - cpus_allowed can change in the fork path
 	 *  - any previously selected cpu might disappear through hotplug
 	 */
-	set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
+	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 
 	/* Initialize new task's runnable average */
@@ -2080,7 +2176,7 @@ void sched_exec(void)
 	int dest_cpu;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
+	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
 		goto unlock;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8760231..b19c044 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3706,11 +3706,10 @@ done:
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d8da010..516c3d9 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -9,7 +9,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 01970c8..d81866d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1169,13 +1169,10 @@ static void yield_task_rt(struct rq *rq)
 static int find_lowest_rq(struct task_struct *task);
 
 static int
-select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	struct task_struct *curr;
 	struct rq *rq;
-	int cpu;
-
-	cpu = task_cpu(p);
 
 	if (p->nr_cpus_allowed == 1)
 		goto out;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dca80b8..e9ab96c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -555,6 +555,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 
 #ifdef CONFIG_NUMA_BALANCING
 extern int migrate_task_to(struct task_struct *p, int cpu);
+extern int migrate_swap(struct task_struct *, struct task_struct *);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
@@ -732,6 +733,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 	 */
 	smp_wmb();
 	task_thread_info(p)->cpu = cpu;
+	p->wake_cpu = cpu;
 #endif
 }
 
@@ -987,7 +989,7 @@ struct sched_class {
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
-	int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
+	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index e08fbee..47197de 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -11,7 +11,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_stop(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	return task_cpu(p); /* stop tasks as never migrate */
 }
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 39/63] sched: numa: Use a system-wide search to find swap/migration candidates
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch implements a system-wide search for swap/migration candidates
based on total NUMA hinting faults. It has a balance limit, but it does not
properly consider total node balance.

In the old scheme a task selected a preferred node based on the highest
number of private faults recorded on the node. In this scheme, the preferred
node is based on the total number of faults. If the preferred node for a
task changes then task_numa_migrate will search the whole system looking
for tasks to swap with that would improve both the overall compute
balance and minimise the expected number of remote NUMA hinting faults.
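
To make the "total number of faults" selection concrete, here is a minimal
sketch in plain userspace C with invented numbers (the real code sums
p->numa_faults over the private and shared slots per node):

#include <stdio.h>

#define NR_TOY_NODES 2

/* faults[nid][0] = private faults, faults[nid][1] = shared faults */
static int pick_preferred_nid(unsigned long faults[][2], int nr_nodes)
{
	unsigned long max_faults = 0;
	int nid, max_nid = -1;

	for (nid = 0; nid < nr_nodes; nid++) {
		/* New scheme: total of private + shared, not private only. */
		unsigned long total = faults[nid][0] + faults[nid][1];

		if (total > max_faults) {
			max_faults = total;
			max_nid = nid;
		}
	}
	return max_nid;
}

int main(void)
{
	unsigned long faults[NR_TOY_NODES][2] = { { 40, 10 }, { 25, 60 } };

	/* Node 1 wins on total faults (85 vs 50). */
	printf("preferred nid: %d\n", pick_preferred_nid(faults, NR_TOY_NODES));
	return 0;
}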

Note that there is no guarantee that the node the source task is placed
on by task_numa_migrate() has any relationship to the newly selected
task->numa_preferred_nid due to compute overloading.
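
The system-wide search described above scores every candidate swap by the
combined fault differential of the two tasks and rejects swaps that would
leave the node loads too far apart. A hedged sketch of that scoring, with
illustrative names and numbers rather than the kernel's API:

#include <stdio.h>

struct toy_numa_task {
	long faults[2];		/* NUMA hinting faults per node: [node0, node1] */
	long h_load;		/* contribution to runqueue load */
};

/* Improvement of swapping p (on src_nid) with cur (on dst_nid): the sum of
 * the two tasks' fault differentials. Higher is better. */
static long swap_improvement(const struct toy_numa_task *p,
			     const struct toy_numa_task *cur,
			     int src_nid, int dst_nid)
{
	return (p->faults[dst_nid] - p->faults[src_nid]) +
	       (cur->faults[src_nid] - cur->faults[dst_nid]);
}

/* Does the swap keep the node loads within imbalance_pct of each other? */
static int swap_keeps_balance(long src_load, long dst_load,
			      const struct toy_numa_task *p,
			      const struct toy_numa_task *cur,
			      int imbalance_pct)
{
	long tmp;

	dst_load += p->h_load - cur->h_load;
	src_load += cur->h_load - p->h_load;
	if (dst_load < src_load) {	/* make src_load the smaller */
		tmp = dst_load;
		dst_load = src_load;
		src_load = tmp;
	}
	return src_load * imbalance_pct >= dst_load * 100;
}

int main(void)
{
	struct toy_numa_task p   = { .faults = { 100, 400 }, .h_load = 1024 };
	struct toy_numa_task cur = { .faults = { 300, 150 }, .h_load = 1024 };

	printf("improvement: %ld\n", swap_improvement(&p, &cur, 0, 1));
	printf("balanced:    %d\n", swap_keeps_balance(4096, 4096, &p, &cur, 112));
	return 0;
}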

[riel@redhat.com: Do not swap with tasks that cannot run on source cpu]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  |   4 +
 kernel/sched/fair.c  | 253 ++++++++++++++++++++++++++++++++++++---------------
 kernel/sched/sched.h |  13 +++
 3 files changed, 199 insertions(+), 71 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0862196..18f9bbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5241,6 +5241,7 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -5257,6 +5258,9 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
+
+	sd = lowest_flag_domain(cpu, SD_NUMA);
+	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b19c044..59abe50 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -816,6 +816,8 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static unsigned long task_h_load(struct task_struct *p);
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
@@ -906,12 +908,40 @@ static unsigned long target_load(int cpu, int type);
 static unsigned long power_of(int cpu);
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
+/* Cached statistics for all CPUs within a node */
 struct numa_stats {
+	unsigned long nr_running;
 	unsigned long load;
-	s64 eff_load;
-	unsigned long faults;
+
+	/* Total compute capacity of CPUs on a node */
+	unsigned long power;
+
+	/* Approximate capacity in terms of runnable tasks on a node */
+	unsigned long capacity;
+	int has_capacity;
 };
 
+/*
+ * XXX borrowed from update_sg_lb_stats
+ */
+static void update_numa_stats(struct numa_stats *ns, int nid)
+{
+	int cpu;
+
+	memset(ns, 0, sizeof(*ns));
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		ns->nr_running += rq->nr_running;
+		ns->load += weighted_cpuload(cpu);
+		ns->power += power_of(cpu);
+	}
+
+	ns->load = (ns->load * SCHED_POWER_SCALE) / ns->power;
+	ns->capacity = DIV_ROUND_CLOSEST(ns->power, SCHED_POWER_SCALE);
+	ns->has_capacity = (ns->nr_running < ns->capacity);
+}
+
 struct task_numa_env {
 	struct task_struct *p;
 
@@ -920,95 +950,178 @@ struct task_numa_env {
 
 	struct numa_stats src_stats, dst_stats;
 
-	unsigned long best_load;
+	int imbalance_pct, idx;
+
+	struct task_struct *best_task;
+	long best_imp;
 	int best_cpu;
 };
 
+static void task_numa_assign(struct task_numa_env *env,
+			     struct task_struct *p, long imp)
+{
+	if (env->best_task)
+		put_task_struct(env->best_task);
+	if (p)
+		get_task_struct(p);
+
+	env->best_task = p;
+	env->best_imp = imp;
+	env->best_cpu = env->dst_cpu;
+}
+
+/*
+ * This checks if the overall compute and NUMA accesses of the system would
+ * be improved if the source task was migrated to the target dst_cpu, taking
+ * into account that it might be best if the task running on the dst_cpu
+ * were exchanged with the source task.
+ */
+static void task_numa_compare(struct task_numa_env *env, long imp)
+{
+	struct rq *src_rq = cpu_rq(env->src_cpu);
+	struct rq *dst_rq = cpu_rq(env->dst_cpu);
+	struct task_struct *cur;
+	long dst_load, src_load;
+	long load;
+
+	rcu_read_lock();
+	cur = ACCESS_ONCE(dst_rq->curr);
+	if (cur->pid == 0) /* idle */
+		cur = NULL;
+
+	/*
+	 * "imp" is the fault differential for the source task between the
+	 * source and destination node. Calculate the total differential for
+	 * the source task and potential destination task. The more negative
+	 * the value is, the more remote accesses that would be expected to
+	 * be incurred if the tasks were swapped.
+	 */
+	if (cur) {
+		/* Skip this swap candidate if cannot move to the source cpu */
+		if (!cpumask_test_cpu(env->src_cpu, tsk_cpus_allowed(cur)))
+			goto unlock;
+
+		imp += task_faults(cur, env->src_nid) -
+		       task_faults(cur, env->dst_nid);
+	}
+
+	if (imp < env->best_imp)
+		goto unlock;
+
+	if (!cur) {
+		/* Is there capacity at our destination? */
+		if (env->src_stats.has_capacity &&
+		    !env->dst_stats.has_capacity)
+			goto unlock;
+
+		goto balance;
+	}
+
+	/* Balance doesn't matter much if we're running a task per cpu */
+	if (src_rq->nr_running == 1 && dst_rq->nr_running == 1)
+		goto assign;
+
+	/*
+	 * In the overloaded case, try and keep the load balanced.
+	 */
+balance:
+	dst_load = env->dst_stats.load;
+	src_load = env->src_stats.load;
+
+	/* XXX missing power terms */
+	load = task_h_load(env->p);
+	dst_load += load;
+	src_load -= load;
+
+	if (cur) {
+		load = task_h_load(cur);
+		dst_load -= load;
+		src_load += load;
+	}
+
+	/* make src_load the smaller */
+	if (dst_load < src_load)
+		swap(dst_load, src_load);
+
+	if (src_load * env->imbalance_pct < dst_load * 100)
+		goto unlock;
+
+assign:
+	task_numa_assign(env, cur, imp);
+unlock:
+	rcu_read_unlock();
+}
+
 static int task_numa_migrate(struct task_struct *p)
 {
-	int node_cpu = cpumask_first(cpumask_of_node(p->numa_preferred_nid));
 	struct task_numa_env env = {
 		.p = p,
+
 		.src_cpu = task_cpu(p),
 		.src_nid = cpu_to_node(task_cpu(p)),
-		.dst_cpu = node_cpu,
-		.dst_nid = p->numa_preferred_nid,
-		.best_load = ULONG_MAX,
-		.best_cpu = task_cpu(p),
+
+		.imbalance_pct = 112,
+
+		.best_task = NULL,
+		.best_imp = 0,
+		.best_cpu = -1
 	};
 	struct sched_domain *sd;
-	int cpu;
-	struct task_group *tg = task_group(p);
-	unsigned long weight;
-	bool balanced;
-	int imbalance_pct, idx = -1;
+	unsigned long faults;
+	int nid, cpu, ret;
 
 	/*
-	 * Find the lowest common scheduling domain covering the nodes of both
-	 * the CPU the task is currently running on and the target NUMA node.
+	 * Pick the lowest SD_NUMA domain, as that would have the smallest
+	 * imbalance and would be the first to start moving tasks about.
+	 *
+	 * And we want to avoid any moving of tasks about, as that would create
+	 * random movement of tasks -- counter the numa conditions we're trying
+	 * to satisfy here.
 	 */
 	rcu_read_lock();
-	for_each_domain(env.src_cpu, sd) {
-		if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
-			/*
-			 * busy_idx is used for the load decision as it is the
-			 * same index used by the regular load balancer for an
-			 * active cpu.
-			 */
-			idx = sd->busy_idx;
-			imbalance_pct = sd->imbalance_pct;
-			break;
-		}
-	}
+	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
+	env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
 	rcu_read_unlock();
 
-	if (WARN_ON_ONCE(idx == -1))
-		return 0;
+	faults = task_faults(p, env.src_nid);
+	update_numa_stats(&env.src_stats, env.src_nid);
 
-	/*
-	 * XXX the below is mostly nicked from wake_affine(); we should
-	 * see about sharing a bit if at all possible; also it might want
-	 * some per entity weight love.
-	 */
-	weight = p->se.load.weight;
-	env.src_stats.load = source_load(env.src_cpu, idx);
-	env.src_stats.eff_load = 100 + (imbalance_pct - 100) / 2;
-	env.src_stats.eff_load *= power_of(env.src_cpu);
-	env.src_stats.eff_load *= env.src_stats.load + effective_load(tg, env.src_cpu, -weight, -weight);
-
-	for_each_cpu(cpu, cpumask_of_node(env.dst_nid)) {
-		env.dst_cpu = cpu;
-		env.dst_stats.load = target_load(cpu, idx);
-
-		/* If the CPU is idle, use it */
-		if (!env.dst_stats.load) {
-			env.best_cpu = cpu;
-			goto migrate;
-		}
+	/* Find an alternative node with relatively better statistics */
+	for_each_online_node(nid) {
+		long imp;
 
-		/* Otherwise check the target CPU load */
-		env.dst_stats.eff_load = 100;
-		env.dst_stats.eff_load *= power_of(cpu);
-		env.dst_stats.eff_load *= env.dst_stats.load + effective_load(tg, cpu, weight, weight);
+		if (nid == env.src_nid)
+			continue;
 
-		/*
-		 * Destination is considered balanced if the destination CPU is
-		 * less loaded than the source CPU. Unfortunately there is a
-		 * risk that a task running on a lightly loaded CPU will not
-		 * migrate to its preferred node due to load imbalances.
-		 */
-		balanced = (env.dst_stats.eff_load <= env.src_stats.eff_load);
-		if (!balanced)
+		/* Only consider nodes that recorded more faults */
+		imp = task_faults(p, nid) - faults;
+		if (imp < 0)
 			continue;
 
-		if (env.dst_stats.eff_load < env.best_load) {
-			env.best_load = env.dst_stats.eff_load;
-			env.best_cpu = cpu;
+		env.dst_nid = nid;
+		update_numa_stats(&env.dst_stats, env.dst_nid);
+		for_each_cpu(cpu, cpumask_of_node(nid)) {
+			/* Skip this CPU if the source task cannot migrate */
+			if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+				continue;
+
+			env.dst_cpu = cpu;
+			task_numa_compare(&env, imp);
 		}
 	}
 
-migrate:
-	return migrate_task_to(p, env.best_cpu);
+	/* No better CPU than the current one was found. */
+	if (env.best_cpu == -1)
+		return -EAGAIN;
+
+	if (env.best_task == NULL) {
+		int ret = migrate_task_to(p, env.best_cpu);
+		return ret;
+	}
+
+	ret = migrate_swap(p, env.best_task);
+	put_task_struct(env.best_task);
+	return ret;
 }
 
 /* Attempt to migrate a task to a CPU on the preferred node. */
@@ -1050,7 +1163,7 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
-		unsigned long faults;
+		unsigned long faults = 0;
 		int priv, i;
 
 		for (priv = 0; priv < 2; priv++) {
@@ -1060,10 +1173,10 @@ static void task_numa_placement(struct task_struct *p)
 			p->numa_faults[i] >>= 1;
 			p->numa_faults[i] += p->numa_faults_buffer[i];
 			p->numa_faults_buffer[i] = 0;
+
+			faults += p->numa_faults[i];
 		}
 
-		/* Find maximum private faults */
-		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -4452,8 +4565,6 @@ static int move_one_task(struct lb_env *env)
 	return 0;
 }
 
-static unsigned long task_h_load(struct task_struct *p);
-
 static const unsigned int sched_nr_migrate_break = 32;
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e9ab96c..fd3f7b6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -607,9 +607,22 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 	return hsd;
 }
 
+static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd;
+
+	for_each_domain(cpu, sd) {
+		if (sd->flags & flag)
+			break;
+	}
+
+	return sd;
+}
+
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 
 struct sched_group_power {
 	atomic_t ref;
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 40/63] sched: numa: Favor placing a task on the preferred node
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

A task's preferred node is selected based on the number of faults
recorded for a node, but the actual task_numa_migrate() conducts a global
search regardless of the preferred nid. This patch checks if the
preferred nid has capacity and, if so, searches for a CPU within that
node. This avoids a global search when the preferred node is not
overloaded.
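
A minimal sketch of the capacity test that gates the global search
(illustrative userspace C; TOY_POWER_SCALE stands in for SCHED_POWER_SCALE
and the per-CPU numbers are invented):

#include <stdio.h>

#define TOY_POWER_SCALE 1024

struct toy_cpu {
	unsigned long nr_running;
	unsigned long power;
};

/* Sum runnable tasks and compute power over the node's CPUs, convert the
 * power into an approximate task capacity, and report whether another task
 * still fits. */
static int preferred_node_has_capacity(const struct toy_cpu *cpus, int nr_cpus)
{
	unsigned long nr_running = 0, power = 0, capacity;
	int i;

	for (i = 0; i < nr_cpus; i++) {
		nr_running += cpus[i].nr_running;
		power += cpus[i].power;
	}
	/* Round to the nearest whole task, like DIV_ROUND_CLOSEST(). */
	capacity = (power + TOY_POWER_SCALE / 2) / TOY_POWER_SCALE;
	return nr_running < capacity;
}

int main(void)
{
	struct toy_cpu node[] = {
		{ .nr_running = 1, .power = 1024 },
		{ .nr_running = 0, .power = 1024 },
	};

	if (preferred_node_has_capacity(node, 2))
		printf("preferred node has capacity: search it first\n");
	else
		printf("preferred node is full: do the system-wide search\n");
	return 0;
}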

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59abe50..722baab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1052,6 +1052,20 @@ unlock:
 	rcu_read_unlock();
 }
 
+static void task_numa_find_cpu(struct task_numa_env *env, long imp)
+{
+	int cpu;
+
+	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
+		/* Skip this CPU if the source task cannot migrate */
+		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
+			continue;
+
+		env->dst_cpu = cpu;
+		task_numa_compare(env, imp);
+	}
+}
+
 static int task_numa_migrate(struct task_struct *p)
 {
 	struct task_numa_env env = {
@@ -1068,7 +1082,8 @@ static int task_numa_migrate(struct task_struct *p)
 	};
 	struct sched_domain *sd;
 	unsigned long faults;
-	int nid, cpu, ret;
+	int nid, ret;
+	long imp;
 
 	/*
 	 * Pick the lowest SD_NUMA domain, as that would have the smallest
@@ -1085,28 +1100,29 @@ static int task_numa_migrate(struct task_struct *p)
 
 	faults = task_faults(p, env.src_nid);
 	update_numa_stats(&env.src_stats, env.src_nid);
+	env.dst_nid = p->numa_preferred_nid;
+	imp = task_faults(env.p, env.dst_nid) - faults;
+	update_numa_stats(&env.dst_stats, env.dst_nid);
 
-	/* Find an alternative node with relatively better statistics */
-	for_each_online_node(nid) {
-		long imp;
-
-		if (nid == env.src_nid)
-			continue;
-
-		/* Only consider nodes that recorded more faults */
-		imp = task_faults(p, nid) - faults;
-		if (imp < 0)
-			continue;
+	/*
+	 * If the preferred nid has capacity then use it. Otherwise find an
+	 * alternative node with relatively better statistics.
+	 */
+	if (env.dst_stats.has_capacity) {
+		task_numa_find_cpu(&env, imp);
+	} else {
+		for_each_online_node(nid) {
+			if (nid == env.src_nid || nid == p->numa_preferred_nid)
+				continue;
 
-		env.dst_nid = nid;
-		update_numa_stats(&env.dst_stats, env.dst_nid);
-		for_each_cpu(cpu, cpumask_of_node(nid)) {
-			/* Skip this CPU if the source task cannot migrate */
-			if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+			/* Only consider nodes that recorded more faults */
+			imp = task_faults(env.p, nid) - faults;
+			if (imp < 0)
 				continue;
 
-			env.dst_cpu = cpu;
-			task_numa_compare(&env, imp);
+			env.dst_nid = nid;
+			update_numa_stats(&env.dst_stats, env.dst_nid);
+			task_numa_find_cpu(&env, imp);
 		}
 	}
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 41/63] sched: numa: fix placement of workloads spread across multiple nodes
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

The load balancer will spread workloads across multiple NUMA nodes in
order to balance the load on the system. This means that sometimes a
task's preferred node has available capacity, but moving the task there
will not succeed because that would create too large an imbalance.

In that case, other NUMA nodes need to be considered.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 722baab..9dd35cb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1104,13 +1104,12 @@ static int task_numa_migrate(struct task_struct *p)
 	imp = task_faults(env.p, env.dst_nid) - faults;
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
-	/*
-	 * If the preferred nid has capacity then use it. Otherwise find an
-	 * alternative node with relatively better statistics.
-	 */
-	if (env.dst_stats.has_capacity) {
+	/* If the preferred nid has capacity, try to use it. */
+	if (env.dst_stats.has_capacity)
 		task_numa_find_cpu(&env, imp);
-	} else {
+
+	/* No space available on the preferred nid. Look elsewhere. */
+	if (env.best_cpu == -1) {
 		for_each_online_node(nid) {
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 42/63] mm: numa: Change page last {nid,pid} into {cpu,pid}
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Change the per-page last fault tracking to use cpu,pid instead of
nid,pid. This will allow us to look up the alternate task more
easily. Note that even though it is the cpu that is stored in the page
flags, the mpol_misplaced decision is still based on the node.
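
As a standalone illustration of the cpu,pid packing (the field widths and
the ex_* names below are assumptions for the example; the kernel derives the
cpu width from NR_CPUS_BITS and keeps only the low 8 bits of the pid):

#include <stdio.h>

#define EX_PID_SHIFT	8
#define EX_PID_MASK	((1 << EX_PID_SHIFT) - 1)
#define EX_CPU_SHIFT	10	/* enough for 1024 CPUs; an assumption */
#define EX_CPU_MASK	((1 << EX_CPU_SHIFT) - 1)

static int ex_encode_cpupid(int cpu, int pid)
{
	return ((cpu & EX_CPU_MASK) << EX_PID_SHIFT) | (pid & EX_PID_MASK);
}

static int ex_cpupid_to_cpu(int cpupid)
{
	return (cpupid >> EX_PID_SHIFT) & EX_CPU_MASK;
}

static int ex_cpupid_to_pid(int cpupid)
{
	return cpupid & EX_PID_MASK;
}

int main(void)
{
	int cpupid = ex_encode_cpupid(42, 12345);

	/* Only the low 8 bits of the pid survive: 12345 & 0xff == 57 */
	printf("cpu=%d pid=%d\n", ex_cpupid_to_cpu(cpupid), ex_cpupid_to_pid(cpupid));
	return 0;
}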

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h                | 90 ++++++++++++++++++++++-----------------
 include/linux/mm_types.h          |  4 +-
 include/linux/page-flags-layout.h | 22 +++++-----
 kernel/bounds.c                   |  4 ++
 kernel/sched/fair.c               |  6 +--
 mm/huge_memory.c                  |  8 ++--
 mm/memory.c                       | 16 +++----
 mm/mempolicy.c                    | 16 ++++---
 mm/migrate.c                      |  4 +-
 mm/mm_init.c                      | 18 ++++----
 mm/mmzone.c                       | 14 +++---
 mm/mprotect.c                     | 28 ++++++------
 mm/page_alloc.c                   |  4 +-
 13 files changed, 125 insertions(+), 109 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bb412ce..ce464cd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -581,11 +581,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
  * sets it, so none of the operations on it need to be atomic.
  */
 
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | ... | FLAGS | */
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NIDPID_PGOFF	(ZONES_PGOFF - LAST_NIDPID_WIDTH)
+#define LAST_CPUPID_PGOFF	(ZONES_PGOFF - LAST_CPUPID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -595,7 +595,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NIDPID_PGSHIFT	(LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
+#define LAST_CPUPID_PGSHIFT	(LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -617,7 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NIDPID_MASK	((1UL << LAST_NIDPID_WIDTH) - 1)
+#define LAST_CPUPID_MASK	((1UL << LAST_CPUPID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -661,96 +661,106 @@ static inline int page_to_nid(const struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-static inline int nid_pid_to_nidpid(int nid, int pid)
+static inline int cpu_pid_to_cpupid(int cpu, int pid)
 {
-	return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
+	return ((cpu & LAST__CPU_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
 }
 
-static inline int nidpid_to_pid(int nidpid)
+static inline int cpupid_to_pid(int cpupid)
 {
-	return nidpid & LAST__PID_MASK;
+	return cpupid & LAST__PID_MASK;
 }
 
-static inline int nidpid_to_nid(int nidpid)
+static inline int cpupid_to_cpu(int cpupid)
 {
-	return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+	return (cpupid >> LAST__PID_SHIFT) & LAST__CPU_MASK;
 }
 
-static inline bool nidpid_pid_unset(int nidpid)
+static inline int cpupid_to_nid(int cpupid)
 {
-	return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+	return cpu_to_node(cpupid_to_cpu(cpupid));
 }
 
-static inline bool nidpid_nid_unset(int nidpid)
+static inline bool cpupid_pid_unset(int cpupid)
 {
-	return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+	return cpupid_to_pid(cpupid) == (-1 & LAST__PID_MASK);
 }
 
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-static inline int page_nidpid_xchg_last(struct page *page, int nid)
+static inline bool cpupid_cpu_unset(int cpupid)
 {
-	return xchg(&page->_last_nidpid, nid);
+	return cpupid_to_cpu(cpupid) == (-1 & LAST__CPU_MASK);
 }
 
-static inline int page_nidpid_last(struct page *page)
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
-	return page->_last_nidpid;
+	return xchg(&page->_last_cpupid, cpupid);
 }
-static inline void page_nidpid_reset_last(struct page *page)
+
+static inline int page_cpupid_last(struct page *page)
+{
+	return page->_last_cpupid;
+}
+static inline void page_cpupid_reset_last(struct page *page)
 {
-	page->_last_nidpid = -1;
+	page->_last_cpupid = -1;
 }
 #else
-static inline int page_nidpid_last(struct page *page)
+static inline int page_cpupid_last(struct page *page)
 {
-	return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
+	return (page->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
 }
 
-extern int page_nidpid_xchg_last(struct page *page, int nidpid);
+extern int page_cpupid_xchg_last(struct page *page, int cpupid);
 
-static inline void page_nidpid_reset_last(struct page *page)
+static inline void page_cpupid_reset_last(struct page *page)
 {
-	int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
+	int cpupid = (1 << LAST_CPUPID_SHIFT) - 1;
 
-	page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
-	page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
+	page->flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+	page->flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
 }
-#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
-#else
-static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
+#endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
+#else /* !CONFIG_NUMA_BALANCING */
+static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
-	return page_to_nid(page);
+	return page_to_nid(page); /* XXX */
 }
 
-static inline int page_nidpid_last(struct page *page)
+static inline int page_cpupid_last(struct page *page)
 {
-	return page_to_nid(page);
+	return page_to_nid(page); /* XXX */
 }
 
-static inline int nidpid_to_nid(int nidpid)
+static inline int cpupid_to_nid(int cpupid)
 {
 	return -1;
 }
 
-static inline int nidpid_to_pid(int nidpid)
+static inline int cpupid_to_pid(int cpupid)
 {
 	return -1;
 }
 
-static inline int nid_pid_to_nidpid(int nid, int pid)
+static inline int cpupid_to_cpu(int cpupid)
 {
 	return -1;
 }
 
-static inline bool nidpid_pid_unset(int nidpid)
+static inline int cpu_pid_to_cpupid(int nid, int pid)
+{
+	return -1;
+}
+
+static inline bool cpupid_pid_unset(int cpupid)
 {
 	return 1;
 }
 
-static inline void page_nidpid_reset_last(struct page *page)
+static inline void page_cpupid_reset_last(struct page *page)
 {
 }
-#endif
+#endif /* CONFIG_NUMA_BALANCING */
 
 static inline struct zone *page_zone(const struct page *page)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 38a902a..a30f9ca 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
 	void *shadow;
 #endif
 
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-	int _last_nidpid;
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+	int _last_cpupid;
 #endif
 }
 /*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 02bc918..da52366 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -39,9 +39,9 @@
  * lookup is necessary.
  *
  * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |             ... | FLAGS |
- *      " plus space for last_nidpid: |       NODE     | ZONE | LAST_NIDPID ... | FLAGS |
+ *      " plus space for last_cpupid: |       NODE     | ZONE | LAST_CPUPID ... | FLAGS |
  * classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
- *      " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
+ *      " plus space for last_cpupid: | SECTION | NODE | ZONE | LAST_CPUPID ... | FLAGS |
  * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
  */
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -65,18 +65,18 @@
 #define LAST__PID_SHIFT 8
 #define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)
 
-#define LAST__NID_SHIFT NODES_SHIFT
-#define LAST__NID_MASK  ((1 << LAST__NID_SHIFT)-1)
+#define LAST__CPU_SHIFT NR_CPUS_BITS
+#define LAST__CPU_MASK  ((1 << LAST__CPU_SHIFT)-1)
 
-#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
+#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT+LAST__CPU_SHIFT)
 #else
-#define LAST_NIDPID_SHIFT 0
+#define LAST_CPUPID_SHIFT 0
 #endif
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
 #else
-#define LAST_NIDPID_WIDTH 0
+#define LAST_CPUPID_WIDTH 0
 #endif
 
 /*
@@ -87,8 +87,8 @@
 #define NODE_NOT_IN_PAGE_FLAGS
 #endif
 
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
-#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
+#define LAST_CPUPID_NOT_IN_PAGE_FLAGS
 #endif
 
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 0c9b862..e8ca97b 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -10,6 +10,7 @@
 #include <linux/mmzone.h>
 #include <linux/kbuild.h>
 #include <linux/page_cgroup.h>
+#include <linux/log2.h>
 
 void foo(void)
 {
@@ -17,5 +18,8 @@ void foo(void)
 	DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
 	DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
 	DEFINE(NR_PCG_FLAGS, __NR_PCG_FLAGS);
+#ifdef CONFIG_SMP
+	DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
+#endif
 	/* End of constants */
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9dd35cb..af35be1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1210,7 +1210,7 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
+void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
 	int priv;
@@ -1226,8 +1226,8 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 	 * First accesses are treated as private, otherwise consider accesses
 	 * to be private if the accessing pid has not changed
 	 */
-	if (!nidpid_pid_unset(last_nidpid))
-		priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+	if (!cpupid_pid_unset(last_cpupid))
+		priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
 	else
 		priv = 1;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0baf0e4..becf92c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1282,7 +1282,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
-	int target_nid, last_nidpid = -1;
+	int target_nid, last_cpupid = -1;
 	bool page_locked;
 	bool migrated = false;
 
@@ -1293,7 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
-	last_nidpid = page_nidpid_last(page);
+	last_cpupid = page_cpupid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1362,7 +1362,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, migrated);
 
 	return 0;
 }
@@ -1682,7 +1682,7 @@ static void __split_huge_page_refcount(struct page *page,
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
-		page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
+		page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index cc7f206..5162e6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@
 
 #include "internal.h"
 
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	int page_nid = -1;
-	int last_nidpid;
+	int last_cpupid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3567,7 +3567,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
-	last_nidpid = page_nidpid_last(page);
+	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3583,7 +3583,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(last_nidpid, page_nid, 1, migrated);
+		task_numa_fault(last_cpupid, page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3598,7 +3598,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
-	int last_nidpid;
+	int last_cpupid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3643,7 +3643,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(!page))
 			continue;
 
-		last_nidpid = page_nidpid_last(page);
+		last_cpupid = page_cpupid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
@@ -3656,7 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(last_nidpid, page_nid, 1, migrated);
+			task_numa_fault(last_cpupid, page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0e895a2..a5867ef 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2324,6 +2324,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	struct zone *zone;
 	int curnid = page_to_nid(page);
 	unsigned long pgoff;
+	int thiscpu = raw_smp_processor_id();
+	int thisnid = cpu_to_node(thiscpu);
 	int polnid = -1;
 	int ret = -1;
 
@@ -2372,11 +2374,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int last_nidpid;
-		int this_nidpid;
+		int last_cpupid;
+		int this_cpupid;
 
-		polnid = numa_node_id();
-		this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
+		polnid = thisnid;
+		this_cpupid = cpu_pid_to_cpupid(thiscpu, current->pid);
 
 		/*
 		 * Multi-stage node selection is used in conjunction
@@ -2399,8 +2401,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * it less likely we act on an unlikely task<->page
 		 * relation.
 		 */
-		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
-		if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
+		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid)
 			goto out;
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -2410,7 +2412,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * This way a short and temporary process migration will
 		 * not cause excessive memory migration.
 		 */
-		if (polnid != current->numa_preferred_nid &&
+		if (thisnid != current->numa_preferred_nid &&
 				!current->numa_migrate_seq)
 			goto out;
 #endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 22abf87..c85f3fc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1498,7 +1498,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
 	if (newpage)
-		page_nidpid_xchg_last(newpage, page_nidpid_last(page));
+		page_cpupid_xchg_last(newpage, page_cpupid_last(page));
 
 	return newpage;
 }
@@ -1675,7 +1675,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	if (!new_page)
 		goto out_fail;
 
-	page_nidpid_xchg_last(new_page, page_nidpid_last(page));
+	page_cpupid_xchg_last(new_page, page_cpupid_last(page));
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 467de57..68562e9 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,26 +71,26 @@ void __init mminit_verify_pageflags_layout(void)
 	unsigned long or_mask, add_mask;
 
 	shift = 8 * sizeof(unsigned long);
-	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
+	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_CPUPID_SHIFT;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastcpupid %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
-		LAST_NIDPID_WIDTH,
+		LAST_CPUPID_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
-		"Section %d Node %d Zone %d Lastnidpid %d\n",
+		"Section %d Node %d Zone %d Lastcpupid %d\n",
 		SECTIONS_SHIFT,
 		NODES_SHIFT,
 		ZONES_SHIFT,
-		LAST_NIDPID_SHIFT);
+		LAST_CPUPID_SHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
-		"Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
+		"Section %lu Node %lu Zone %lu Lastcpupid %lu\n",
 		(unsigned long)SECTIONS_PGSHIFT,
 		(unsigned long)NODES_PGSHIFT,
 		(unsigned long)ZONES_PGSHIFT,
-		(unsigned long)LAST_NIDPID_PGSHIFT);
+		(unsigned long)LAST_CPUPID_PGSHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
 		"Node/Zone ID: %lu -> %lu\n",
 		(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -102,9 +102,9 @@ void __init mminit_verify_pageflags_layout(void)
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
 		"Node not in page flags");
 #endif
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
-		"Last nidpid not in page flags");
+		"Last cpupid not in page flags");
 #endif
 
 	if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 25bb477..2c70c3a 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -97,20 +97,20 @@ void lruvec_init(struct lruvec *lruvec)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
 }
 
-#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NIDPID_NOT_IN_PAGE_FLAGS)
-int page_nidpid_xchg_last(struct page *page, int nidpid)
+#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
+int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
 	unsigned long old_flags, flags;
-	int last_nidpid;
+	int last_cpupid;
 
 	do {
 		old_flags = flags = page->flags;
-		last_nidpid = page_nidpid_last(page);
+		last_cpupid = page_cpupid_last(page);
 
-		flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
-		flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
+		flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+		flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
 	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
 
-	return last_nidpid;
+	return last_cpupid;
 }
 #endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 5aae390..9a74855 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,14 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
+		int dirty_accountable, int prot_numa, bool *ret_all_same_cpupid)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
-	bool all_same_nidpid = true;
-	int last_nid = -1;
+	bool all_same_cpupid = true;
+	int last_cpu = -1;
 	int last_pid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
@@ -64,17 +64,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
-					int nidpid = page_nidpid_last(page);
-					int this_nid = nidpid_to_nid(nidpid);
-					int this_pid = nidpid_to_pid(nidpid);
+					int cpupid = page_cpupid_last(page);
+					int this_cpu = cpupid_to_cpu(cpupid);
+					int this_pid = cpupid_to_pid(cpupid);
 
-					if (last_nid == -1)
-						last_nid = this_nid;
+					if (last_cpu == -1)
+						last_cpu = this_cpu;
 					if (last_pid == -1)
 						last_pid = this_pid;
-					if (last_nid != this_nid ||
+					if (last_cpu != this_cpu ||
 					    last_pid != this_pid) {
-						all_same_nidpid = false;
+						all_same_cpupid = false;
 					}
 
 					if (!pte_numa(oldpte)) {
@@ -115,7 +115,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-	*ret_all_same_nidpid = all_same_nidpid;
+	*ret_all_same_cpupid = all_same_cpupid;
 	return pages;
 }
 
@@ -142,7 +142,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
-	bool all_same_nidpid;
+	bool all_same_cpupid;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -168,7 +168,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa, &all_same_nidpid);
+				 dirty_accountable, prot_numa, &all_same_cpupid);
 		pages += this_pages;
 
 		/*
@@ -177,7 +177,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && this_pages && all_same_nidpid)
+		if (prot_numa && this_pages && all_same_cpupid)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f6301d8..83ec477 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -626,7 +626,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	page_nidpid_reset_last(page);
+	page_cpupid_reset_last(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -4015,7 +4015,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		page_mapcount_reset(page);
-		page_nidpid_reset_last(page);
+		page_cpupid_reset_last(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 43/63] sched: numa: Use {cpu, pid} to create task groups for shared faults
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

While parallel applications tend to align their data on the cache
boundary, they tend not to align on the page or THP boundary.
Consequently, tasks that partition their data can still "false-share"
pages, presenting a problem for optimal NUMA placement.

This patch uses NUMA hinting faults to chain tasks together into
numa_groups. As well as storing the NID a task was running on when
accessing a page, a truncated representation of the faulting PID is
stored. If subsequent faults are from different PIDs, it is reasonable
to assume that those two tasks share a page and are candidates for
being grouped together. Note that this patch makes no scheduling
decisions based on the grouping information.
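
A toy, userspace-only sketch of the grouping rule (all names and structures
below are simplified stand-ins, not the kernel's; locking, refcounting and
the per-group fault counters are omitted). A fault whose last_cpupid pid
bits do not match the current task triggers an attempt to join the other
task's group:

#include <stdio.h>
#include <stdbool.h>

#define PID_BITS 8
#define PID_MASK ((1 << PID_BITS) - 1)

struct toy_group { int id; int nr_tasks; };
struct toy_task  { int pid; struct toy_group *group; };

static bool cpupid_matches(const struct toy_task *t, int last_cpupid)
{
	return (t->pid & PID_MASK) == (last_cpupid & PID_MASK);
}

/*
 * Join the other group only when it is at least as large as ours; the real
 * code additionally tie-breaks on the group address when the sizes match.
 */
static void maybe_group(struct toy_task *p, struct toy_task *other)
{
	if (p->group == other->group)
		return;
	if (p->group->nr_tasks > other->group->nr_tasks)
		return;		/* the other task will join us instead */
	p->group->nr_tasks--;
	p->group = other->group;
	p->group->nr_tasks++;
}

int main(void)
{
	struct toy_group g1 = { 1, 1 }, g2 = { 2, 2 };
	struct toy_task a = { 0x1201, &g1 }, b = { 0x3404, &g2 };
	int last_cpupid = (7 << PID_BITS) | (b.pid & PID_MASK); /* b faulted last, on cpu 7 */

	if (!cpupid_matches(&a, last_cpupid))
		maybe_group(&a, &b);

	printf("task a is now in group %d\n", a.group->id);
	return 0;
}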

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h    |  11 ++++
 include/linux/sched.h |   3 +
 kernel/sched/core.c   |   3 +
 kernel/sched/fair.c   | 165 +++++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h  |   5 +-
 mm/memory.c           |   8 +++
 6 files changed, 182 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ce464cd..81443d5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -691,6 +691,12 @@ static inline bool cpupid_cpu_unset(int cpupid)
 	return cpupid_to_cpu(cpupid) == (-1 & LAST__CPU_MASK);
 }
 
+static inline bool __cpupid_match_pid(pid_t task_pid, int cpupid)
+{
+	return (task_pid & LAST__PID_MASK) == cpupid_to_pid(cpupid);
+}
+
+#define cpupid_match_pid(task, cpupid) __cpupid_match_pid(task->pid, cpupid)
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
@@ -760,6 +766,11 @@ static inline bool cpupid_pid_unset(int cpupid)
 static inline void page_cpupid_reset_last(struct page *page)
 {
 }
+
+static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
+{
+	return false;
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static inline struct zone *page_zone(const struct page *page)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 703b256..505d4ac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1337,6 +1337,9 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	struct list_head numa_entry;
+	struct numa_group *numa_group;
+
 	/*
 	 * Exponential decaying average of faults on a per-node basis.
 	 * Scheduling placement decisions are made based on the these counts.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 18f9bbe..0a6899b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1728,6 +1728,9 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 	p->numa_faults_buffer = NULL;
+
+	INIT_LIST_HEAD(&p->numa_entry);
+	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	cpu_hotplug_init_task(p);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index af35be1..339b1e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -888,6 +888,17 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
 
+struct numa_group {
+	atomic_t refcount;
+
+	spinlock_t lock; /* nr_tasks, tasks */
+	int nr_tasks;
+	struct list_head task_list;
+
+	struct rcu_head rcu;
+	atomic_long_t faults[0];
+};
+
 static inline int task_faults_idx(int nid, int priv)
 {
 	return 2 * nid + priv;
@@ -1182,7 +1193,10 @@ static void task_numa_placement(struct task_struct *p)
 		int priv, i;
 
 		for (priv = 0; priv < 2; priv++) {
+			long diff;
+
 			i = task_faults_idx(nid, priv);
+			diff = -p->numa_faults[i];
 
 			/* Decay existing window, copy faults since last scan */
 			p->numa_faults[i] >>= 1;
@@ -1190,6 +1204,11 @@ static void task_numa_placement(struct task_struct *p)
 			p->numa_faults_buffer[i] = 0;
 
 			faults += p->numa_faults[i];
+			diff += p->numa_faults[i];
+			if (p->numa_group) {
+				/* safe because we can only change our own group */
+				atomic_long_add(diff, &p->numa_group->faults[i]);
+			}
 		}
 
 		if (faults > max_faults) {
@@ -1207,6 +1226,131 @@ static void task_numa_placement(struct task_struct *p)
 	}
 }
 
+static inline int get_numa_group(struct numa_group *grp)
+{
+	return atomic_inc_not_zero(&grp->refcount);
+}
+
+static inline void put_numa_group(struct numa_group *grp)
+{
+	if (atomic_dec_and_test(&grp->refcount))
+		kfree_rcu(grp, rcu);
+}
+
+static void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	spin_lock(l1);
+	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
+static void task_numa_group(struct task_struct *p, int cpupid)
+{
+	struct numa_group *grp, *my_grp;
+	struct task_struct *tsk;
+	bool join = false;
+	int cpu = cpupid_to_cpu(cpupid);
+	int i;
+
+	if (unlikely(!p->numa_group)) {
+		unsigned int size = sizeof(struct numa_group) +
+				    2*nr_node_ids*sizeof(atomic_long_t);
+
+		grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+		if (!grp)
+			return;
+
+		atomic_set(&grp->refcount, 1);
+		spin_lock_init(&grp->lock);
+		INIT_LIST_HEAD(&grp->task_list);
+
+		for (i = 0; i < 2*nr_node_ids; i++)
+			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
+
+		list_add(&p->numa_entry, &grp->task_list);
+		grp->nr_tasks++;
+		rcu_assign_pointer(p->numa_group, grp);
+	}
+
+	rcu_read_lock();
+	tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
+
+	if (!cpupid_match_pid(tsk, cpupid))
+		goto unlock;
+
+	grp = rcu_dereference(tsk->numa_group);
+	if (!grp)
+		goto unlock;
+
+	my_grp = p->numa_group;
+	if (grp == my_grp)
+		goto unlock;
+
+	/*
+	 * Only join the other group if its bigger; if we're the bigger group,
+	 * the other task will join us.
+	 */
+	if (my_grp->nr_tasks > grp->nr_tasks)
+		goto unlock;
+
+	/*
+	 * Tie-break on the grp address.
+	 */
+	if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp)
+		goto unlock;
+
+	if (!get_numa_group(grp))
+		goto unlock;
+
+	join = true;
+
+unlock:
+	rcu_read_unlock();
+
+	if (!join)
+		return;
+
+	for (i = 0; i < 2*nr_node_ids; i++) {
+		atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
+		atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+	}
+
+	double_lock(&my_grp->lock, &grp->lock);
+
+	list_move(&p->numa_entry, &grp->task_list);
+	my_grp->nr_tasks--;
+	grp->nr_tasks++;
+
+	spin_unlock(&my_grp->lock);
+	spin_unlock(&grp->lock);
+
+	rcu_assign_pointer(p->numa_group, grp);
+
+	put_numa_group(my_grp);
+}
+
+void task_numa_free(struct task_struct *p)
+{
+	struct numa_group *grp = p->numa_group;
+	int i;
+
+	if (grp) {
+		for (i = 0; i < 2*nr_node_ids; i++)
+			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
+
+		spin_lock(&grp->lock);
+		list_del(&p->numa_entry);
+		grp->nr_tasks--;
+		spin_unlock(&grp->lock);
+		rcu_assign_pointer(p->numa_group, NULL);
+		put_numa_group(grp);
+	}
+
+	kfree(p->numa_faults);
+}
+
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
@@ -1222,15 +1366,6 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 	if (!p->mm)
 		return;
 
-	/*
-	 * First accesses are treated as private, otherwise consider accesses
-	 * to be private if the accessing pid has not changed
-	 */
-	if (!cpupid_pid_unset(last_cpupid))
-		priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
-	else
-		priv = 1;
-
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
@@ -1245,6 +1380,18 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 	}
 
 	/*
+	 * First accesses are treated as private, otherwise consider accesses
+	 * to be private if the accessing pid has not changed
+	 */
+	if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) {
+		priv = 1;
+	} else {
+		priv = cpupid_match_pid(p, last_cpupid);
+		if (!priv)
+			task_numa_group(p, last_cpupid);
+	}
+
+	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
 	 * This is reset periodically in case of phase changes
 	 */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fd3f7b6..9aab230 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -556,10 +556,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 #ifdef CONFIG_NUMA_BALANCING
 extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *, struct task_struct *);
-static inline void task_numa_free(struct task_struct *p)
-{
-	kfree(p->numa_faults);
-}
+extern void task_numa_free(struct task_struct *p);
 #else /* CONFIG_NUMA_BALANCING */
 static inline void task_numa_free(struct task_struct *p)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 5162e6d..c57efa2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2719,6 +2719,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		get_page(dirty_page);
 
 reuse:
+		/*
+		 * Clear the pages cpupid information as the existing
+		 * information potentially belongs to a now completely
+		 * unrelated process.
+		 */
+		if (old_page)
+			page_cpupid_xchg_last(old_page, (1 << LAST_CPUPID_SHIFT) - 1);
+
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = pte_mkyoung(orig_pte);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-- 
1.8.4
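
The double_lock() helper added above is worth a note: it sorts the two
group locks by address before taking them, so two tasks concurrently
joining each other's groups can never deadlock on the pair of locks. A
minimal userspace sketch of the same idiom (the pthread mutexes and the
function name are purely illustrative, this is not kernel code):

#include <pthread.h>

/* Always acquire the lower-addressed lock first; with one global order
 * on the locks, two threads taking the same pair can never end up in an
 * ABBA deadlock.
 */
static void double_lock_ordered(pthread_mutex_t *l1, pthread_mutex_t *l2)
{
	if (l1 > l2) {
		pthread_mutex_t *tmp = l1;
		l1 = l2;
		l2 = tmp;
	}
	pthread_mutex_lock(l1);
	pthread_mutex_lock(l2);
}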


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 44/63] sched: numa: Report a NUMA task group ID
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

It is desirable to model from userspace how the scheduler groups tasks
over time. This patch adds an ID to the numa_group and reports it via
/proc/PID/status.
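
For example, a monitoring tool could poll the new field with something
like the sketch below (userspace, illustrative; the helper name is made
up here and only the "Ngid:" line format added by this patch is assumed,
with 0 meaning the task is not in a NUMA group):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the NUMA group ID of @pid, 0 if ungrouped, -1 on error. */
static int read_numa_group_id(pid_t pid)
{
	char path[64], line[256];
	int ngid = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Ngid: %d", &ngid) == 1)
			break;
	}
	fclose(f);
	return ngid;
}

int main(void)
{
	printf("Ngid of self: %d\n", read_numa_group_id(getpid()));
	return 0;
}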

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/proc/array.c       | 2 ++
 include/linux/sched.h | 5 +++++
 kernel/sched/fair.c   | 7 +++++++
 3 files changed, 14 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index cbd0f1b..1bd2077 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -183,6 +183,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 	seq_printf(m,
 		"State:\t%s\n"
 		"Tgid:\t%d\n"
+		"Ngid:\t%d\n"
 		"Pid:\t%d\n"
 		"PPid:\t%d\n"
 		"TracerPid:\t%d\n"
@@ -190,6 +191,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
 		task_tgid_nr_ns(p, ns),
+		task_numa_group_id(p),
 		pid_nr_ns(pid, ns),
 		ppid, tpid,
 		from_kuid_munged(user_ns, cred->uid),
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 505d4ac..1618417 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1442,12 +1442,17 @@ struct task_struct {
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
+extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   bool migrated)
 {
 }
+static inline pid_t task_numa_group_id(struct task_struct *p)
+{
+	return 0;
+}
 static inline void set_numabalancing_state(bool enabled)
 {
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 339b1e1..2f60f05 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -893,12 +893,18 @@ struct numa_group {
 
 	spinlock_t lock; /* nr_tasks, tasks */
 	int nr_tasks;
+	pid_t gid;
 	struct list_head task_list;
 
 	struct rcu_head rcu;
 	atomic_long_t faults[0];
 };
 
+pid_t task_numa_group_id(struct task_struct *p)
+{
+	return p->numa_group ? p->numa_group->gid : 0;
+}
+
 static inline int task_faults_idx(int nid, int priv)
 {
 	return 2 * nid + priv;
@@ -1265,6 +1271,7 @@ static void task_numa_group(struct task_struct *p, int cpupid)
 		atomic_set(&grp->refcount, 1);
 		spin_lock_init(&grp->lock);
 		INIT_LIST_HEAD(&grp->task_list);
+		grp->gid = p->pid;
 
 		for (i = 0; i < 2*nr_node_ids; i++)
 			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 45/63] mm: numa: copy cpupid on page migration
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

After page migration, the new page has its cpupid information unset. This
makes every fault on a recently migrated page look like a first NUMA
fault, leading to another page migration.

Copying the cpupid over at page migration time should prevent erroneous
migrations of recently migrated pages.
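
As a simplified model of the hunk below (not the kernel implementation;
the helper name is invented): page_cpupid_xchg_last() stores a new value
and returns the previous one, so the two calls transfer the tag from the
old page to the new page while leaving the old page unset:

/* Illustrative only: move a last-cpupid tag from one page to another. */
static int transfer_cpupid(int *old_tag, int *new_tag)
{
	int cpupid = *old_tag;	/* xchg returns the previous value */
	*old_tag = -1;		/* and leaves the source unset (-1) */
	*new_tag = cpupid;	/* install it on the new page */
	return cpupid;
}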

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/migrate.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index c85f3fc..0626af6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -443,6 +443,8 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
  */
 void migrate_page_copy(struct page *newpage, struct page *page)
 {
+	int cpupid;
+
 	if (PageHuge(page) || PageTransHuge(page))
 		copy_huge_page(newpage, page);
 	else
@@ -479,6 +481,13 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 			__set_page_dirty_nobuffers(newpage);
  	}
 
+	/*
+	 * Copy NUMA information to the new page, to prevent over-eager
+	 * future migrations of this same page.
+	 */
+	cpupid = page_cpupid_xchg_last(page, -1);
+	page_cpupid_xchg_last(newpage, cpupid);
+
 	mlock_migrate_page(newpage, page);
 	ksm_migrate_page(newpage, page);
 	/*
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 46/63] mm: numa: Do not group on RO pages
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

And here's a little something to make sure not the whole world ends up
in a single group.

While we don't migrate shared executable pages, we do scan/fault on
them. And since everybody links to libc, everybody ends up in the same
group.
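
Condensed into one illustrative helper (invented name, not the kernel
function), the grouping decision after this patch is roughly: only feed
task_numa_group() when the last accessor was a different task and the
mapping was writable. TNF_NO_GROUP is the flag introduced by this patch.

/* Sketch of the !priv && !(flags & TNF_NO_GROUP) check used below. */
static int should_try_group(int priv, int flags)
{
	if (priv)			/* we touched it last: private access */
		return 0;
	if (flags & TNF_NO_GROUP)	/* read-only mapping, e.g. libc text */
		return 0;
	return 1;			/* shared and writable: try to group */
}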

[riel@redhat.com: mapcount 1]
Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  7 +++++--
 kernel/sched/fair.c   |  5 +++--
 mm/huge_memory.c      | 15 +++++++++++++--
 mm/memory.c           | 30 ++++++++++++++++++++++++++----
 4 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1618417..56c31c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1440,13 +1440,16 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+#define TNF_MIGRATED	0x01
+#define TNF_NO_GROUP	0x02
+
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
-				   bool migrated)
+				   int flags)
 {
 }
 static inline pid_t task_numa_group_id(struct task_struct *p)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f60f05..a9ce454 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1361,9 +1361,10 @@ void task_numa_free(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
+void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 {
 	struct task_struct *p = current;
+	bool migrated = flags & TNF_MIGRATED;
 	int priv;
 
 	if (!numabalancing_enabled)
@@ -1394,7 +1395,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 		priv = 1;
 	} else {
 		priv = cpupid_match_pid(p, last_cpupid);
-		if (!priv)
+		if (!priv && !(flags & TNF_NO_GROUP))
 			task_numa_group(p, last_cpupid);
 	}
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index becf92c..7ab4e32 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1285,6 +1285,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int target_nid, last_cpupid = -1;
 	bool page_locked;
 	bool migrated = false;
+	int flags = 0;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1299,6 +1300,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	/*
+	 * Avoid grouping on DSO/COW pages in specific and RO pages
+	 * in general, RO pages shouldn't hurt as much anyway since
+	 * they can be in shared cache state.
+	 */
+	if (!pmd_write(pmd))
+		flags |= TNF_NO_GROUP;
+
+	/*
 	 * Acquire the page lock to serialise THP migrations but avoid dropping
 	 * page_table_lock if at all possible
 	 */
@@ -1343,8 +1352,10 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
-	if (migrated)
+	if (migrated) {
+		flags |= TNF_MIGRATED;
 		page_nid = target_nid;
+	}
 
 	goto out;
 clear_pmdnuma:
@@ -1362,7 +1373,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index c57efa2..eba846b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3547,6 +3547,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int last_cpupid;
 	int target_nid;
 	bool migrated = false;
+	int flags = 0;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3575,6 +3576,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
+	/*
+	 * Avoid grouping on DSO/COW pages in specific and RO pages
+	 * in general, RO pages shouldn't hurt as much anyway since
+	 * they can be in shared cache state.
+	 */
+	if (!pte_write(pte))
+		flags |= TNF_NO_GROUP;
+
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
@@ -3586,12 +3595,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/* Migrate to the requested node */
 	migrated = migrate_misplaced_page(page, vma, target_nid);
-	if (migrated)
+	if (migrated) {
 		page_nid = target_nid;
+		flags |= TNF_MIGRATED;
+	}
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(last_cpupid, page_nid, 1, migrated);
+		task_numa_fault(last_cpupid, page_nid, 1, flags);
 	return 0;
 }
 
@@ -3632,6 +3643,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		int page_nid = -1;
 		int target_nid;
 		bool migrated = false;
+		int flags = 0;
 
 		if (!pte_present(pteval))
 			continue;
@@ -3651,20 +3663,30 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(!page))
 			continue;
 
+		/*
+		 * Avoid grouping on DSO/COW pages in specific and RO pages
+		 * in general, RO pages shouldn't hurt as much anyway since
+		 * they can be in shared cache state.
+		 */
+		if (!pte_write(pteval))
+			flags |= TNF_NO_GROUP;
+
 		last_cpupid = page_cpupid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
 		if (target_nid != -1) {
 			migrated = migrate_misplaced_page(page, vma, target_nid);
-			if (migrated)
+			if (migrated) {
 				page_nid = target_nid;
+				flags |= TNF_MIGRATED;
+			}
 		} else {
 			put_page(page);
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(last_cpupid, page_nid, 1, migrated);
+			task_numa_fault(last_cpupid, page_nid, 1, flags);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 47/63] mm: numa: Do not batch handle PMD pages
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

With the THP migration races closed it is still possible to occasionally
see corruption. The problem is related to handling PMD pages in batch.
When a page fault is handled it can be assumed that the page being
faulted will also be flushed from the TLB. The same flushing does not
happen when handling PMD pages in batch. Fixing it is straightforward, but
there are a number of reasons not to:

1. Multiple TLB flushes may have to be sent depending on what pages get
   migrated
2. The handling of PMDs in batch means that faults get accounted to
   the task that is handling the fault. While care is taken to only
   mark PMDs where the last CPU and PID match, it can still have problems
   due to PID truncation when matching PIDs (see the sketch after this list).
3. Batching on the PMD level may reduce faults but setting pmd_numa
   requires taking a heavy lock that can contend with THP migration
   and handling the fault requires the release/acquisition of the PTL
   for every page migrated. It's still pretty heavy.
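
To see why point 2 matters, note that only the low bits of the PID fit in
the cpupid field, so two unrelated tasks can alias after truncation. A
standalone illustration (the 8-bit mask and the names are for the example
only; the real field width depends on the cpupid layout):

#include <stdio.h>

#define EXAMPLE_PID_MASK 0xff	/* illustrative truncation width */

int main(void)
{
	int pid_a = 0x1234;	/* two different tasks ...            */
	int pid_b = 0x4334;	/* ... whose low bits happen to match */

	printf("truncated pids match: %d\n",
	       (pid_a & EXAMPLE_PID_MASK) == (pid_b & EXAMPLE_PID_MASK));
	return 0;
}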

PMD batch handling is not something that people have ever been happy
with. This patch removes it; later patches will deal with the additional
fault overhead using more intelligent migration rate adaptation.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/memory.c   | 101 ++--------------------------------------------------------
 mm/mprotect.c |  47 ++-------------------------
 2 files changed, 4 insertions(+), 144 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eba846b..9898eeb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3606,103 +3606,6 @@ out:
 	return 0;
 }
 
-/* NUMA hinting page fault entry point for regular pmds */
-#ifdef CONFIG_NUMA_BALANCING
-static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		     unsigned long addr, pmd_t *pmdp)
-{
-	pmd_t pmd;
-	pte_t *pte, *orig_pte;
-	unsigned long _addr = addr & PMD_MASK;
-	unsigned long offset;
-	spinlock_t *ptl;
-	bool numa = false;
-	int last_cpupid;
-
-	spin_lock(&mm->page_table_lock);
-	pmd = *pmdp;
-	if (pmd_numa(pmd)) {
-		set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
-		numa = true;
-	}
-	spin_unlock(&mm->page_table_lock);
-
-	if (!numa)
-		return 0;
-
-	/* we're in a page fault so some vma must be in the range */
-	BUG_ON(!vma);
-	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
-	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
-	VM_BUG_ON(offset >= PMD_SIZE);
-	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
-	pte += offset >> PAGE_SHIFT;
-	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
-		pte_t pteval = *pte;
-		struct page *page;
-		int page_nid = -1;
-		int target_nid;
-		bool migrated = false;
-		int flags = 0;
-
-		if (!pte_present(pteval))
-			continue;
-		if (!pte_numa(pteval))
-			continue;
-		if (addr >= vma->vm_end) {
-			vma = find_vma(mm, addr);
-			/* there's a pte present so there must be a vma */
-			BUG_ON(!vma);
-			BUG_ON(addr < vma->vm_start);
-		}
-		if (pte_numa(pteval)) {
-			pteval = pte_mknonnuma(pteval);
-			set_pte_at(mm, addr, pte, pteval);
-		}
-		page = vm_normal_page(vma, addr, pteval);
-		if (unlikely(!page))
-			continue;
-
-		/*
-		 * Avoid grouping on DSO/COW pages in specific and RO pages
-		 * in general, RO pages shouldn't hurt as much anyway since
-		 * they can be in shared cache state.
-		 */
-		if (!pte_write(pteval))
-			flags |= TNF_NO_GROUP;
-
-		last_cpupid = page_cpupid_last(page);
-		page_nid = page_to_nid(page);
-		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
-		pte_unmap_unlock(pte, ptl);
-		if (target_nid != -1) {
-			migrated = migrate_misplaced_page(page, vma, target_nid);
-			if (migrated) {
-				page_nid = target_nid;
-				flags |= TNF_MIGRATED;
-			}
-		} else {
-			put_page(page);
-		}
-
-		if (page_nid != -1)
-			task_numa_fault(last_cpupid, page_nid, 1, flags);
-
-		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
-	}
-	pte_unmap_unlock(orig_pte, ptl);
-
-	return 0;
-}
-#else
-static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		     unsigned long addr, pmd_t *pmdp)
-{
-	BUG();
-	return 0;
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3841,8 +3744,8 @@ retry:
 		}
 	}
 
-	if (pmd_numa(*pmd))
-		return do_pmd_numa_page(mm, vma, address, pmd);
+	/* THP should already have been handled */
+	BUG_ON(pmd_numa(*pmd));
 
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9a74855..a0302ac 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,15 +37,12 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa, bool *ret_all_same_cpupid)
+		int dirty_accountable, int prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
-	bool all_same_cpupid = true;
-	int last_cpu = -1;
-	int last_pid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -64,19 +61,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
-					int cpupid = page_cpupid_last(page);
-					int this_cpu = cpupid_to_cpu(cpupid);
-					int this_pid = cpupid_to_pid(cpupid);
-
-					if (last_cpu == -1)
-						last_cpu = this_cpu;
-					if (last_pid == -1)
-						last_pid = this_pid;
-					if (last_cpu != this_cpu ||
-					    last_pid != this_pid) {
-						all_same_cpupid = false;
-					}
-
 					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
@@ -115,26 +99,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-	*ret_all_same_cpupid = all_same_cpupid;
 	return pages;
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmd)
-{
-	spin_lock(&mm->page_table_lock);
-	set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
-	spin_unlock(&mm->page_table_lock);
-}
-#else
-static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmd)
-{
-	BUG();
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, int dirty_accountable, int prot_numa)
@@ -142,7 +109,6 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
-	bool all_same_cpupid;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -168,17 +134,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa, &all_same_cpupid);
+				 dirty_accountable, prot_numa);
 		pages += this_pages;
-
-		/*
-		 * If we are changing protections for NUMA hinting faults then
-		 * set pmd_numa if the examined pages were all on the same
-		 * node. This allows a regular PMD to be handled as one fault
-		 * and effectively batches the taking of the PTL
-		 */
-		if (prot_numa && this_pages && all_same_cpupid)
-			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
 	return pages;
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 48/63] sched: numa: stay on the same node if CLONE_VM
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

A newly spawned thread inside a process should stay on the same
NUMA node as its parent. This prevents processes from being "torn"
across multiple NUMA nodes every time they spawn a new thread.
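
The rule itself is tiny; as a sketch (illustrative helper name, mirroring
the hunk added to __sched_fork() below):

#include <linux/sched.h>	/* CLONE_VM, if built outside the kernel tree */

/* Threads (CLONE_VM) share their parent's address space, so keep its
 * preferred node; a fresh address space starts with no preference (-1).
 */
static int child_preferred_nid(unsigned long clone_flags, int parent_nid)
{
	return (clone_flags & CLONE_VM) ? parent_nid : -1;
}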

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  2 +-
 kernel/fork.c         |  2 +-
 kernel/sched/core.c   | 14 +++++++++-----
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 56c31c7..d61b531 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2011,7 +2011,7 @@ extern void wake_up_new_task(struct task_struct *tsk);
 #else
  static inline void kick_process(struct task_struct *tsk) { }
 #endif
-extern void sched_fork(struct task_struct *p);
+extern void sched_fork(unsigned long clone_flags, struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 extern void proc_caches_init(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index 7192d91..c93be06 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1310,7 +1310,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
-	sched_fork(p);
+	sched_fork(clone_flags, p);
 
 	retval = perf_event_init_task(p);
 	if (retval)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0a6899b..3ce73aa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1691,7 +1691,7 @@ int wake_up_state(struct task_struct *p, unsigned int state)
  *
  * __sched_fork() is basic setup used by init_idle() too:
  */
-static void __sched_fork(struct task_struct *p)
+static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	p->on_rq			= 0;
 
@@ -1720,11 +1720,15 @@ static void __sched_fork(struct task_struct *p)
 		p->mm->numa_scan_seq = 0;
 	}
 
+	if (clone_flags & CLONE_VM)
+		p->numa_preferred_nid = current->numa_preferred_nid;
+	else
+		p->numa_preferred_nid = -1;
+
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = 1;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
-	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 	p->numa_faults_buffer = NULL;
@@ -1758,12 +1762,12 @@ void set_numabalancing_state(bool enabled)
 /*
  * fork()/clone()-time setup:
  */
-void sched_fork(struct task_struct *p)
+void sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	unsigned long flags;
 	int cpu = get_cpu();
 
-	__sched_fork(p);
+	__sched_fork(clone_flags, p);
 	/*
 	 * We mark the process as running here. This guarantees that
 	 * nobody will actually run it, and a signal or other external
@@ -4292,7 +4296,7 @@ void init_idle(struct task_struct *idle, int cpu)
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
-	__sched_fork(idle);
+	__sched_fork(0, idle);
 	idle->state = TASK_RUNNING;
 	idle->se.exec_start = sched_clock();
 
-- 
1.8.4



* [PATCH 49/63] sched: numa: use group fault statistics in numa placement
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch uses the fraction of faults on a particular node, for both the
task and its group, to figure out the best node on which to place a task.
If the task and group statistics disagree on what the preferred node should
be, a full rescan selects the node with the best combined weight.
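For illustration only (not part of the patch): a small standalone C model of
the combined weight, using the 1000x task and 1200x group multipliers from
the hunk below with invented fault counts. The larger group multiplier pulls
the task towards the group's node even when its own faults point elsewhere.

/*
 * Rough userspace model of task_weight() + group_weight(); the per-node
 * fault counts are hypothetical.
 */
#include <stdio.h>

#define NR_NODES 2

static unsigned long task_w(unsigned long faults[], unsigned long total, int nid)
{
        return total ? 1000 * faults[nid] / total : 0;
}

static unsigned long group_w(unsigned long faults[], unsigned long total, int nid)
{
        return total ? 1200 * faults[nid] / total : 0;
}

int main(void)
{
        unsigned long tfaults[NR_NODES] = { 700, 300 };   /* task prefers node 0 */
        unsigned long gfaults[NR_NODES] = { 2000, 6000 }; /* group prefers node 1 */
        unsigned long ttotal = 1000, gtotal = 8000;
        unsigned long w, best_w = 0;
        int nid, best = -1;

        for (nid = 0; nid < NR_NODES; nid++) {
                w = task_w(tfaults, ttotal, nid) + group_w(gfaults, gtotal, nid);
                printf("node %d: combined weight %lu\n", nid, w);
                if (w > best_w) {
                        best_w = w;
                        best = nid;
                }
        }
        printf("best node: %d\n", best);    /* node 1 wins on group weight */
        return 0;
}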

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |   1 +
 kernel/sched/fair.c   | 124 +++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 108 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d61b531..17eb13f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1346,6 +1346,7 @@ struct task_struct {
 	 * The values remain static for the duration of a PTE scan
 	 */
 	unsigned long *numa_faults;
+	unsigned long total_numa_faults;
 
 	/*
 	 * numa_faults_buffer records faults per node during the current
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a9ce454..f9070f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -897,6 +897,7 @@ struct numa_group {
 	struct list_head task_list;
 
 	struct rcu_head rcu;
+	atomic_long_t total_faults;
 	atomic_long_t faults[0];
 };
 
@@ -919,6 +920,51 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
 		p->numa_faults[task_faults_idx(nid, 1)];
 }
 
+static inline unsigned long group_faults(struct task_struct *p, int nid)
+{
+	if (!p->numa_group)
+		return 0;
+
+	return atomic_long_read(&p->numa_group->faults[2*nid]) +
+	       atomic_long_read(&p->numa_group->faults[2*nid+1]);
+}
+
+/*
+ * These return the fraction of accesses done by a particular task, or
+ * task group, on a particular numa node.  The group weight is given a
+ * larger multiplier, in order to group tasks together that are almost
+ * evenly spread out between numa nodes.
+ */
+static inline unsigned long task_weight(struct task_struct *p, int nid)
+{
+	unsigned long total_faults;
+
+	if (!p->numa_faults)
+		return 0;
+
+	total_faults = p->total_numa_faults;
+
+	if (!total_faults)
+		return 0;
+
+	return 1000 * task_faults(p, nid) / total_faults;
+}
+
+static inline unsigned long group_weight(struct task_struct *p, int nid)
+{
+	unsigned long total_faults;
+
+	if (!p->numa_group)
+		return 0;
+
+	total_faults = atomic_long_read(&p->numa_group->total_faults);
+
+	if (!total_faults)
+		return 0;
+
+	return 1200 * group_faults(p, nid) / total_faults;
+}
+
 static unsigned long weighted_cpuload(const int cpu);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
@@ -1018,8 +1064,10 @@ static void task_numa_compare(struct task_numa_env *env, long imp)
 		if (!cpumask_test_cpu(env->src_cpu, tsk_cpus_allowed(cur)))
 			goto unlock;
 
-		imp += task_faults(cur, env->src_nid) -
-		       task_faults(cur, env->dst_nid);
+		imp += task_weight(cur, env->src_nid) +
+		       group_weight(cur, env->src_nid) -
+		       task_weight(cur, env->dst_nid) -
+		       group_weight(cur, env->dst_nid);
 	}
 
 	if (imp < env->best_imp)
@@ -1098,7 +1146,7 @@ static int task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1
 	};
 	struct sched_domain *sd;
-	unsigned long faults;
+	unsigned long weight;
 	int nid, ret;
 	long imp;
 
@@ -1115,10 +1163,10 @@ static int task_numa_migrate(struct task_struct *p)
 	env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
 	rcu_read_unlock();
 
-	faults = task_faults(p, env.src_nid);
+	weight = task_weight(p, env.src_nid) + group_weight(p, env.src_nid);
 	update_numa_stats(&env.src_stats, env.src_nid);
 	env.dst_nid = p->numa_preferred_nid;
-	imp = task_faults(env.p, env.dst_nid) - faults;
+	imp = task_weight(p, env.dst_nid) + group_weight(p, env.dst_nid) - weight;
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
 	/* If the preferred nid has capacity, try to use it. */
@@ -1131,8 +1179,8 @@ static int task_numa_migrate(struct task_struct *p)
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
-			/* Only consider nodes that recorded more faults */
-			imp = task_faults(env.p, nid) - faults;
+			/* Only consider nodes where both task and groups benefit */
+			imp = task_weight(p, nid) + group_weight(p, nid) - weight;
 			if (imp < 0)
 				continue;
 
@@ -1183,8 +1231,8 @@ static void numa_migrate_preferred(struct task_struct *p)
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq, nid, max_nid = -1;
-	unsigned long max_faults = 0;
+	int seq, nid, max_nid = -1, max_group_nid = -1;
+	unsigned long max_faults = 0, max_group_faults = 0;
 
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
@@ -1195,7 +1243,7 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
-		unsigned long faults = 0;
+		unsigned long faults = 0, group_faults = 0;
 		int priv, i;
 
 		for (priv = 0; priv < 2; priv++) {
@@ -1211,9 +1259,12 @@ static void task_numa_placement(struct task_struct *p)
 
 			faults += p->numa_faults[i];
 			diff += p->numa_faults[i];
+			p->total_numa_faults += diff;
 			if (p->numa_group) {
 				/* safe because we can only change our own group */
 				atomic_long_add(diff, &p->numa_group->faults[i]);
+				atomic_long_add(diff, &p->numa_group->total_faults);
+				group_faults += atomic_long_read(&p->numa_group->faults[i]);
 			}
 		}
 
@@ -1221,6 +1272,27 @@ static void task_numa_placement(struct task_struct *p)
 			max_faults = faults;
 			max_nid = nid;
 		}
+
+		if (group_faults > max_group_faults) {
+			max_group_faults = group_faults;
+			max_group_nid = nid;
+		}
+	}
+
+	/*
+	 * If the preferred task and group nids are different,
+	 * iterate over the nodes again to find the best place.
+	 */
+	if (p->numa_group && max_nid != max_group_nid) {
+		unsigned long weight, max_weight = 0;
+
+		for_each_online_node(nid) {
+			weight = task_weight(p, nid) + group_weight(p, nid);
+			if (weight > max_weight) {
+				max_weight = weight;
+				max_nid = nid;
+			}
+		}
 	}
 
 	/* Preferred node as the node with the most faults */
@@ -1276,6 +1348,8 @@ static void task_numa_group(struct task_struct *p, int cpupid)
 		for (i = 0; i < 2*nr_node_ids; i++)
 			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
 
+		atomic_long_set(&grp->total_faults, p->total_numa_faults);
+
 		list_add(&p->numa_entry, &grp->task_list);
 		grp->nr_tasks++;
 		rcu_assign_pointer(p->numa_group, grp);
@@ -1323,6 +1397,8 @@ unlock:
 		atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
 		atomic_long_add(p->numa_faults[i], &grp->faults[i]);
 	}
+	atomic_long_sub(p->total_numa_faults, &my_grp->total_faults);
+	atomic_long_add(p->total_numa_faults, &grp->total_faults);
 
 	double_lock(&my_grp->lock, &grp->lock);
 
@@ -1347,6 +1423,8 @@ void task_numa_free(struct task_struct *p)
 		for (i = 0; i < 2*nr_node_ids; i++)
 			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
 
+		atomic_long_sub(p->total_numa_faults, &grp->total_faults);
+
 		spin_lock(&grp->lock);
 		list_del(&p->numa_entry);
 		grp->nr_tasks--;
@@ -1385,6 +1463,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 
 		BUG_ON(p->numa_faults_buffer);
 		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+		p->total_numa_faults = 0;
 	}
 
 	/*
@@ -4571,12 +4650,17 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 	src_nid = cpu_to_node(env->src_cpu);
 	dst_nid = cpu_to_node(env->dst_cpu);
 
-	if (src_nid == dst_nid ||
-	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+	if (src_nid == dst_nid)
 		return false;
 
-	if (dst_nid == p->numa_preferred_nid ||
-	    task_faults(p, dst_nid) > task_faults(p, src_nid))
+	/* Always encourage migration to the preferred node. */
+	if (dst_nid == p->numa_preferred_nid)
+		return true;
+
+	/* After the task has settled, check if the new node is better. */
+	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
+			task_weight(p, dst_nid) + group_weight(p, dst_nid) >
+			task_weight(p, src_nid) + group_weight(p, src_nid))
 		return true;
 
 	return false;
@@ -4596,11 +4680,17 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	src_nid = cpu_to_node(env->src_cpu);
 	dst_nid = cpu_to_node(env->dst_cpu);
 
-	if (src_nid == dst_nid ||
-	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+	if (src_nid == dst_nid)
 		return false;
 
-	if (task_faults(p, dst_nid) < task_faults(p, src_nid))
+	/* Migrating away from the preferred node is always bad. */
+	if (src_nid == p->numa_preferred_nid)
+		return true;
+
+	/* After the task has settled, check if the new node is worse. */
+	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
+			task_weight(p, dst_nid) + group_weight(p, dst_nid) <
+			task_weight(p, src_nid) + group_weight(p, src_nid))
 		return true;
 
 	return false;
-- 
1.8.4



* [PATCH 50/63] sched: numa: call task_numa_free from do_execve
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

It is possible for a task in a numa group to call exec, and
have the new (unrelated) executable inherit the numa group
association from its former self.

This has the potential to break numa grouping, and is trivial
to fix.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/exec.c             | 1 +
 include/linux/sched.h | 4 ++++
 kernel/sched/fair.c   | 9 ++++++++-
 kernel/sched/sched.h  | 5 -----
 4 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 8875dd1..2ea437e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1547,6 +1547,7 @@ static int do_execve_common(const char *filename,
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	acct_update_integrals(current);
+	task_numa_free(current);
 	free_bprm(bprm);
 	if (displaced)
 		put_files_struct(displaced);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 17eb13f..5315607 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1448,6 +1448,7 @@ struct task_struct {
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
+extern void task_numa_free(struct task_struct *p);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
@@ -1460,6 +1461,9 @@ static inline pid_t task_numa_group_id(struct task_struct *p)
 static inline void set_numabalancing_state(bool enabled)
 {
 }
+static inline void task_numa_free(struct task_struct *p)
+{
+}
 #endif
 
 static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f9070f2..d5873e5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1418,6 +1418,7 @@ void task_numa_free(struct task_struct *p)
 {
 	struct numa_group *grp = p->numa_group;
 	int i;
+	void *numa_faults = p->numa_faults;
 
 	if (grp) {
 		for (i = 0; i < 2*nr_node_ids; i++)
@@ -1433,7 +1434,9 @@ void task_numa_free(struct task_struct *p)
 		put_numa_group(grp);
 	}
 
-	kfree(p->numa_faults);
+	p->numa_faults = NULL;
+	p->numa_faults_buffer = NULL;
+	kfree(numa_faults);
 }
 
 /*
@@ -1452,6 +1455,10 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 	if (!p->mm)
 		return;
 
+	/* Do not worry about placement if exiting */
+	if (p->state == TASK_DEAD)
+		return;
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9aab230..13fe790 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -556,11 +556,6 @@ static inline u64 rq_clock_task(struct rq *rq)
 #ifdef CONFIG_NUMA_BALANCING
 extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *, struct task_struct *);
-extern void task_numa_free(struct task_struct *p);
-#else /* CONFIG_NUMA_BALANCING */
-static inline void task_numa_free(struct task_struct *p)
-{
-}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_SMP
-- 
1.8.4



* [PATCH 51/63] sched: numa: Prevent parallel updates to group stats during placement
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Having multiple tasks in a group go through task_numa_placement
simultaneously can lead to a task picking the wrong node to run on, because
the group statistics may be in the middle of an update. This patch avoids
such parallel updates by holding the numa_group lock during placement
decisions.
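For illustration only (not part of the patch): a userspace analogy of the
serialisation, using a pthread mutex in place of the numa_group spinlock;
the struct and function names are invented.

/*
 * Without the lock, a reader can observe the per-node counters while
 * another thread is half-way through updating them and pick a node
 * based on an inconsistent snapshot.
 */
#include <pthread.h>
#include <stdio.h>

#define NR_NODES 2

struct group_stats {
        pthread_mutex_t lock;
        unsigned long faults[NR_NODES];
};

static int pick_node(struct group_stats *grp)
{
        int nid, best = 0;

        pthread_mutex_lock(&grp->lock); /* placement reads a consistent snapshot */
        for (nid = 1; nid < NR_NODES; nid++)
                if (grp->faults[nid] > grp->faults[best])
                        best = nid;
        pthread_mutex_unlock(&grp->lock);

        return best;
}

int main(void)
{
        struct group_stats grp = {
                .lock = PTHREAD_MUTEX_INITIALIZER,
                .faults = { 10, 40 },
        };

        printf("preferred node: %d\n", pick_node(&grp));
        return 0;
}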

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d5873e5..dc0c376 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1233,6 +1233,7 @@ static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1, max_group_nid = -1;
 	unsigned long max_faults = 0, max_group_faults = 0;
+	spinlock_t *group_lock = NULL;
 
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
@@ -1241,6 +1242,12 @@ static void task_numa_placement(struct task_struct *p)
 	p->numa_migrate_seq++;
 	p->numa_scan_period_max = task_scan_max(p);
 
+	/* If the task is part of a group prevent parallel updates to group stats */
+	if (p->numa_group) {
+		group_lock = &p->numa_group->lock;
+		spin_lock(group_lock);
+	}
+
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
 		unsigned long faults = 0, group_faults = 0;
@@ -1279,20 +1286,24 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/*
-	 * If the preferred task and group nids are different,
-	 * iterate over the nodes again to find the best place.
-	 */
-	if (p->numa_group && max_nid != max_group_nid) {
-		unsigned long weight, max_weight = 0;
-
-		for_each_online_node(nid) {
-			weight = task_weight(p, nid) + group_weight(p, nid);
-			if (weight > max_weight) {
-				max_weight = weight;
-				max_nid = nid;
+	if (p->numa_group) {
+		/*
+		 * If the preferred task and group nids are different,
+		 * iterate over the nodes again to find the best place.
+		 */
+		if (max_nid != max_group_nid) {
+			unsigned long weight, max_weight = 0;
+
+			for_each_online_node(nid) {
+				weight = task_weight(p, nid) + group_weight(p, nid);
+				if (weight > max_weight) {
+					max_weight = weight;
+					max_nid = nid;
+				}
 			}
 		}
+
+		spin_unlock(group_lock);
 	}
 
 	/* Preferred node as the node with the most faults */
-- 
1.8.4



* [PATCH 52/63] sched: numa: add debugging
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-5giqjcqnc93a89q01ymtjxpr@git.kernel.org
---
 include/linux/sched.h |  6 ++++++
 kernel/sched/debug.c  | 60 +++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c   |  5 ++++-
 3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5315607..390004b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1356,6 +1356,7 @@ struct task_struct {
 	unsigned long *numa_faults_buffer;
 
 	int numa_preferred_nid;
+	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
@@ -2587,6 +2588,11 @@ static inline unsigned int task_cpu(const struct task_struct *p)
 	return task_thread_info(p)->cpu;
 }
 
+static inline int task_node(const struct task_struct *p)
+{
+	return cpu_to_node(task_cpu(p));
+}
+
 extern void set_task_cpu(struct task_struct *p, unsigned int cpu);
 
 #else
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1965599..e6ba5e3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -15,6 +15,7 @@
 #include <linux/seq_file.h>
 #include <linux/kallsyms.h>
 #include <linux/utsname.h>
+#include <linux/mempolicy.h>
 
 #include "sched.h"
 
@@ -137,6 +138,9 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 	SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld",
 		0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
 #endif
+#ifdef CONFIG_NUMA_BALANCING
+	SEQ_printf(m, " %d", cpu_to_node(task_cpu(p)));
+#endif
 #ifdef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, " %s", task_group_path(task_group(p)));
 #endif
@@ -159,7 +163,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 	read_lock_irqsave(&tasklist_lock, flags);
 
 	do_each_thread(g, p) {
-		if (!p->on_rq || task_cpu(p) != rq_cpu)
+		if (task_cpu(p) != rq_cpu)
 			continue;
 
 		print_task(m, rq, p);
@@ -345,7 +349,7 @@ static void sched_debug_header(struct seq_file *m)
 	cpu_clk = local_clock();
 	local_irq_restore(flags);
 
-	SEQ_printf(m, "Sched Debug Version: v0.10, %s %.*s\n",
+	SEQ_printf(m, "Sched Debug Version: v0.11, %s %.*s\n",
 		init_utsname()->release,
 		(int)strcspn(init_utsname()->version, " "),
 		init_utsname()->version);
@@ -488,6 +492,56 @@ static int __init init_sched_debug_procfs(void)
 
 __initcall(init_sched_debug_procfs);
 
+#define __P(F) \
+	SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)F)
+#define P(F) \
+	SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)p->F)
+#define __PN(F) \
+	SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)F))
+#define PN(F) \
+	SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)p->F))
+
+
+static void sched_show_numa(struct task_struct *p, struct seq_file *m)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	struct mempolicy *pol;
+	int node, i;
+
+	if (p->mm)
+		P(mm->numa_scan_seq);
+
+	task_lock(p);
+	pol = p->mempolicy;
+	if (pol && !(pol->flags & MPOL_F_MORON))
+		pol = NULL;
+	mpol_get(pol);
+	task_unlock(p);
+
+	SEQ_printf(m, "numa_migrations, %ld\n", xchg(&p->numa_pages_migrated, 0));
+
+	for_each_online_node(node) {
+		for (i = 0; i < 2; i++) {
+			unsigned long nr_faults = -1;
+			int cpu_current, home_node;
+
+			if (p->numa_faults)
+				nr_faults = p->numa_faults[2*node + i];
+
+			cpu_current = !i ? (task_node(p) == node) :
+				(pol && node_isset(node, pol->v.nodes));
+
+			home_node = (p->numa_preferred_nid == node);
+
+			SEQ_printf(m, "numa_faults, %d, %d, %d, %d, %ld\n",
+				i, node, cpu_current, home_node, nr_faults);
+		}
+	}
+
+	mpol_put(pol);
+#endif
+}
+
 void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 {
 	unsigned long nr_switches;
@@ -591,6 +645,8 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 		SEQ_printf(m, "%-45s:%21Ld\n",
 			   "clock-delta", (long long)(t1-t0));
 	}
+
+	sched_show_numa(p, m);
 }
 
 void proc_sched_set_task(struct task_struct *p)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dc0c376..58d1070 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1137,7 +1137,7 @@ static int task_numa_migrate(struct task_struct *p)
 		.p = p,
 
 		.src_cpu = task_cpu(p),
-		.src_nid = cpu_to_node(task_cpu(p)),
+		.src_nid = task_node(p),
 
 		.imbalance_pct = 112,
 
@@ -1515,6 +1515,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 	if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
 		numa_migrate_preferred(p);
 
+	if (migrated)
+		p->numa_pages_migrated += pages;
+
 	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
 }
 
-- 
1.8.4



* [PATCH 53/63] sched: numa: Decide whether to favour task or group weights based on swap candidate relationships
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

This patch separately considers task and group affinities when searching
for swap candidates during task NUMA placement. If either task is not part
of a group, or both tasks belong to the same group, then the task weights
are compared. Otherwise the group weights are compared.
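For illustration only (not part of the patch): a standalone sketch of the
rule for choosing which weights to compare; the toy_task type and the
numbers are invented, and the real code works on task_struct and numa_group.

#include <stdio.h>

struct toy_task {
        int group_id;                   /* 0 means "no numa group" */
        long task_weight[2];            /* per-node task weight */
        long group_weight[2];           /* per-node group weight */
};

/*
 * Sketch of the candidate comparison: when either task has no group, or
 * both share one, only the task weights matter; otherwise the group
 * weights decide.
 */
static long swap_improvement(struct toy_task *cur, struct toy_task *p,
                             int src_nid, int dst_nid)
{
        if (!cur->group_id || !p->group_id || cur->group_id == p->group_id)
                return cur->task_weight[src_nid] - cur->task_weight[dst_nid];

        return cur->group_weight[src_nid] - cur->group_weight[dst_nid];
}

int main(void)
{
        struct toy_task cur = { .group_id = 1,
                                .task_weight  = { 600, 400 },
                                .group_weight = { 200, 800 } };
        struct toy_task p   = { .group_id = 2 };

        /* different groups: the group weights (200 vs 800) decide */
        printf("improvement: %ld\n", swap_improvement(&cur, &p, 0, 1));
        return 0;
}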

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 59 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 36 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 58d1070..e7da6f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1039,13 +1039,15 @@ static void task_numa_assign(struct task_numa_env *env,
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
-static void task_numa_compare(struct task_numa_env *env, long imp)
+static void task_numa_compare(struct task_numa_env *env,
+			      long taskimp, long groupimp)
 {
 	struct rq *src_rq = cpu_rq(env->src_cpu);
 	struct rq *dst_rq = cpu_rq(env->dst_cpu);
 	struct task_struct *cur;
 	long dst_load, src_load;
 	long load;
+	long imp = (groupimp > 0) ? groupimp : taskimp;
 
 	rcu_read_lock();
 	cur = ACCESS_ONCE(dst_rq->curr);
@@ -1064,10 +1066,19 @@ static void task_numa_compare(struct task_numa_env *env, long imp)
 		if (!cpumask_test_cpu(env->src_cpu, tsk_cpus_allowed(cur)))
 			goto unlock;
 
-		imp += task_weight(cur, env->src_nid) +
-		       group_weight(cur, env->src_nid) -
-		       task_weight(cur, env->dst_nid) -
-		       group_weight(cur, env->dst_nid);
+		/*
+		 * If dst and source tasks are in the same NUMA group, or not
+		 * in any group then look only at task weights otherwise give
+		 * priority to the group weights.
+		 */
+		if (!cur->numa_group || !env->p->numa_group ||
+		    cur->numa_group == env->p->numa_group) {
+			imp = taskimp + task_weight(cur, env->src_nid) -
+			      task_weight(cur, env->dst_nid);
+		} else {
+			imp = groupimp + group_weight(cur, env->src_nid) -
+			       group_weight(cur, env->dst_nid);
+		}
 	}
 
 	if (imp < env->best_imp)
@@ -1117,7 +1128,8 @@ unlock:
 	rcu_read_unlock();
 }
 
-static void task_numa_find_cpu(struct task_numa_env *env, long imp)
+static void task_numa_find_cpu(struct task_numa_env *env,
+				long taskimp, long groupimp)
 {
 	int cpu;
 
@@ -1127,7 +1139,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, long imp)
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, imp);
+		task_numa_compare(env, taskimp, groupimp);
 	}
 }
 
@@ -1146,9 +1158,9 @@ static int task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1
 	};
 	struct sched_domain *sd;
-	unsigned long weight;
+	unsigned long taskweight, groupweight;
 	int nid, ret;
-	long imp;
+	long taskimp, groupimp;
 
 	/*
 	 * Pick the lowest SD_NUMA domain, as that would have the smallest
@@ -1163,15 +1175,17 @@ static int task_numa_migrate(struct task_struct *p)
 	env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
 	rcu_read_unlock();
 
-	weight = task_weight(p, env.src_nid) + group_weight(p, env.src_nid);
+	taskweight = task_weight(p, env.src_nid);
+	groupweight = group_weight(p, env.src_nid);
 	update_numa_stats(&env.src_stats, env.src_nid);
 	env.dst_nid = p->numa_preferred_nid;
-	imp = task_weight(p, env.dst_nid) + group_weight(p, env.dst_nid) - weight;
+	taskimp = task_weight(p, env.dst_nid) - taskweight;
+	groupimp = group_weight(p, env.dst_nid) - groupweight;
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
 	/* If the preferred nid has capacity, try to use it. */
 	if (env.dst_stats.has_capacity)
-		task_numa_find_cpu(&env, imp);
+		task_numa_find_cpu(&env, taskimp, groupimp);
 
 	/* No space available on the preferred nid. Look elsewhere. */
 	if (env.best_cpu == -1) {
@@ -1180,13 +1194,14 @@ static int task_numa_migrate(struct task_struct *p)
 				continue;
 
 			/* Only consider nodes where both task and groups benefit */
-			imp = task_weight(p, nid) + group_weight(p, nid) - weight;
-			if (imp < 0)
+			taskimp = task_weight(p, nid) - taskweight;
+			groupimp = group_weight(p, nid) - groupweight;
+			if (taskimp < 0 && groupimp < 0)
 				continue;
 
 			env.dst_nid = nid;
 			update_numa_stats(&env.dst_stats, env.dst_nid);
-			task_numa_find_cpu(&env, imp);
+			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
 
@@ -4678,10 +4693,9 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 	if (dst_nid == p->numa_preferred_nid)
 		return true;
 
-	/* After the task has settled, check if the new node is better. */
-	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
-			task_weight(p, dst_nid) + group_weight(p, dst_nid) >
-			task_weight(p, src_nid) + group_weight(p, src_nid))
+	/* If both task and group weight improve, this move is a winner. */
+	if (task_weight(p, dst_nid) > task_weight(p, src_nid) &&
+	    group_weight(p, dst_nid) > group_weight(p, src_nid))
 		return true;
 
 	return false;
@@ -4708,10 +4722,9 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	if (src_nid == p->numa_preferred_nid)
 		return true;
 
-	/* After the task has settled, check if the new node is worse. */
-	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
-			task_weight(p, dst_nid) + group_weight(p, dst_nid) <
-			task_weight(p, src_nid) + group_weight(p, src_nid))
+	/* If either task or group weight get worse, don't do it. */
+	if (task_weight(p, dst_nid) < task_weight(p, src_nid) ||
+	    group_weight(p, dst_nid) < group_weight(p, src_nid))
 		return true;
 
 	return false;
-- 
1.8.4



* [PATCH 54/63] sched: numa: fix task or group comparison
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

This patch separately considers task and group affinities when
searching for swap candidates during NUMA placement. If tasks
are part of the same group, or no group at all, the task weights
are considered.

Some hysteresis is added to prevent tasks within one group from
getting bounced between NUMA nodes due to tiny differences.

If tasks are part of different groups, the code compares group
weights, in order to favor keeping the tasks of a group together.

The patch also changes the group weight multiplier to be the
same as the task weight multiplier, since the two are no longer
added up like before.
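
As a rough illustration of that hysteresis (a stand-alone sketch with invented
numbers, not part of the patch), damping the improvement by 1/16 means a swap
between tasks of the same group must show roughly 6% more benefit before it
can displace the current best candidate:

#include <stdio.h>

/*
 * Stand-alone sketch: the improvement "imp" computed for two tasks in the
 * same NUMA group is reduced by 1/16 (~6%) before being compared against
 * best_imp, so tiny weight differences do not trigger swaps.
 */
static long apply_hysteresis(long imp)
{
	return imp - imp / 16;
}

int main(void)
{
	long samples[] = { 16, 100, 400 };	/* invented improvement values */
	int i;

	for (i = 0; i < 3; i++)
		printf("imp %4ld -> %4ld after hysteresis\n",
		       samples[i], apply_hysteresis(samples[i]));
	return 0;
}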

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7da6f2..6daf82e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -962,7 +962,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
 	if (!total_faults)
 		return 0;
 
-	return 1200 * group_faults(p, nid) / total_faults;
+	return 1000 * group_faults(p, nid) / total_faults;
 }
 
 static unsigned long weighted_cpuload(const int cpu);
@@ -1068,16 +1068,34 @@ static void task_numa_compare(struct task_numa_env *env,
 
 		/*
 		 * If dst and source tasks are in the same NUMA group, or not
-		 * in any group then look only at task weights otherwise give
-		 * priority to the group weights.
+		 * in any group then look only at task weights.
 		 */
-		if (!cur->numa_group || !env->p->numa_group ||
-		    cur->numa_group == env->p->numa_group) {
+		if (cur->numa_group == env->p->numa_group) {
 			imp = taskimp + task_weight(cur, env->src_nid) -
 			      task_weight(cur, env->dst_nid);
+			/*
+			 * Add some hysteresis to prevent swapping the
+			 * tasks within a group over tiny differences.
+			 */
+			if (cur->numa_group)
+				imp -= imp/16;
 		} else {
-			imp = groupimp + group_weight(cur, env->src_nid) -
-			       group_weight(cur, env->dst_nid);
+			/*
+			 * Compare the group weights. If a task is all by
+			 * itself (not part of a group), use the task weight
+			 * instead.
+			 */
+			if (env->p->numa_group)
+				imp = groupimp;
+			else
+				imp = taskimp;
+
+			if (cur->numa_group)
+				imp += group_weight(cur, env->src_nid) -
+				       group_weight(cur, env->dst_nid);
+			else
+				imp += task_weight(cur, env->src_nid) -
+				       task_weight(cur, env->dst_nid);
 		}
 	}
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 55/63] sched: numa: Avoid migrating tasks that are placed on their preferred node
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

This patch classifies scheduler domains and runqueues into types depending
on the number of tasks that care about their NUMA placement and the number
that are currently running on their preferred node. The types are

regular: There are tasks running that do not care about their NUMA
	placement.

remote: There are tasks running that care about their placement but are
	currently running on a node remote to their ideal placement

all: No distinction

To implement this the patch tracks the number of tasks that are optimally
NUMA placed (rq->nr_preferred_running) and the number of tasks running
that care about their placement (nr_numa_running). The load balancer
uses this information to avoid migrating ideally placed NUMA tasks as long
as better options for load balancing exist. For example, it will not
consider balancing between a group whose tasks are all perfectly placed
and a group with remote tasks.
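
The classification boils down to two comparisons per runqueue. A stand-alone
sketch of the rule, using invented counter values rather than the kernel's
rq and sg_lb_stats fields:

#include <stdio.h>

/* Ordered so that "more specific" types compare greater: regular < remote < all */
enum fbq_type { regular, remote, all };

/*
 * Stand-alone sketch of the classification rule: a queue is "regular" if it
 * runs tasks with no NUMA preference, "remote" if all tasks have a preference
 * but some run on the wrong node, and "all" if every task is already on its
 * preferred node.
 */
static enum fbq_type classify(unsigned int nr_running,
			      unsigned int nr_numa_running,
			      unsigned int nr_preferred_running)
{
	if (nr_running > nr_numa_running)
		return regular;
	if (nr_running > nr_preferred_running)
		return remote;
	return all;
}

int main(void)
{
	static const char *names[] = { "regular", "remote", "all" };

	printf("%s\n", names[classify(4, 2, 1)]);	/* two tasks don't care */
	printf("%s\n", names[classify(4, 4, 2)]);	/* all care, two misplaced */
	printf("%s\n", names[classify(4, 4, 4)]);	/* everyone well placed */
	return 0;
}

find_busiest_queue() then skips any runqueue whose type is numerically greater
than the type computed for the busiest group, so well placed NUMA tasks are
only considered when no better source of load exists.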

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  |  29 +++++++++++++
 kernel/sched/fair.c  | 120 +++++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |   5 +++
 3 files changed, 142 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ce73aa..86497b8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4473,6 +4473,35 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
 
 	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
+
+/*
+ * Requeue a task on a given node and accurately track the number of NUMA
+ * tasks on the runqueues
+ */
+void sched_setnuma(struct task_struct *p, int nid)
+{
+	struct rq *rq;
+	unsigned long flags;
+	bool on_rq, running;
+
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	p->numa_preferred_nid = nid;
+	p->numa_migrate_seq = 1;
+
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
 #endif
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6daf82e..88225b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -888,6 +888,18 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
 
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_numa_running += (p->numa_preferred_nid != -1);
+	rq->nr_preferred_running += (p->numa_preferred_nid == task_node(p));
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_numa_running -= (p->numa_preferred_nid != -1);
+	rq->nr_preferred_running -= (p->numa_preferred_nid == task_node(p));
+}
+
 struct numa_group {
 	atomic_t refcount;
 
@@ -1227,6 +1239,8 @@ static int task_numa_migrate(struct task_struct *p)
 	if (env.best_cpu == -1)
 		return -EAGAIN;
 
+	sched_setnuma(p, env.dst_nid);
+
 	if (env.best_task == NULL) {
 		int ret = migrate_task_to(p, env.best_cpu);
 		return ret;
@@ -1342,8 +1356,7 @@ static void task_numa_placement(struct task_struct *p)
 	/* Preferred node as the node with the most faults */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		/* Update the preferred nid and migrate task if possible */
-		p->numa_preferred_nid = max_nid;
-		p->numa_migrate_seq = 1;
+		sched_setnuma(p, max_nid);
 		numa_migrate_preferred(p);
 	}
 }
@@ -1741,6 +1754,14 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1750,8 +1771,12 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
-		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	if (entity_is_task(se)) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		account_numa_enqueue(rq, task_of(se));
+		list_add(&se->group_node, &rq->cfs_tasks);
+	}
 #endif
 	cfs_rq->nr_running++;
 }
@@ -1762,8 +1787,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
+	}
 	cfs_rq->nr_running--;
 }
 
@@ -4605,6 +4632,8 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
+enum fbq_type { regular, remote, all };
+
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
 #define LBF_SOME_PINNED 0x04
@@ -4630,6 +4659,8 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+
+	enum fbq_type		fbq_type;
 };
 
 /*
@@ -5089,6 +5120,10 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned int nr_numa_running;
+	unsigned int nr_preferred_running;
+#endif
 };
 
 /*
@@ -5417,6 +5452,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
+#ifdef CONFIG_NUMA_BALANCING
+		sgs->nr_numa_running += rq->nr_numa_running;
+		sgs->nr_preferred_running += rq->nr_preferred_running;
+#endif
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
@@ -5491,14 +5530,43 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	return false;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
+{
+	if (sgs->sum_nr_running > sgs->nr_numa_running)
+		return regular;
+	if (sgs->sum_nr_running > sgs->nr_preferred_running)
+		return remote;
+	return all;
+}
+
+static inline enum fbq_type fbq_classify_rq(struct rq *rq)
+{
+	if (rq->nr_running > rq->nr_numa_running)
+		return regular;
+	if (rq->nr_running > rq->nr_preferred_running)
+		return remote;
+	return all;
+}
+#else
+static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
+{
+	return all;
+}
+
+static inline enum fbq_type fbq_classify_rq(struct rq *rq)
+{
+	return regular;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
  * @balance: Should we balance.
  * @sds: variable to hold the statistics for this sched_domain.
  */
-static inline void update_sd_lb_stats(struct lb_env *env,
-					struct sd_lb_stats *sds)
+static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
@@ -5548,6 +5616,9 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 
 		sg = sg->next;
 	} while (sg != env->sd->groups);
+
+	if (env->sd->flags & SD_NUMA)
+		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
 }
 
 /**
@@ -5851,15 +5922,39 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long power = power_of(i);
-		unsigned long capacity = DIV_ROUND_CLOSEST(power,
-							   SCHED_POWER_SCALE);
-		unsigned long wl;
+		unsigned long power, capacity, wl;
+		enum fbq_type rt;
+
+		rq = cpu_rq(i);
+		rt = fbq_classify_rq(rq);
 
+		/*
+		 * We classify groups/runqueues into three groups:
+		 *  - regular: there are !numa tasks
+		 *  - remote:  there are numa tasks that run on the 'wrong' node
+		 *  - all:     there is no distinction
+		 *
+		 * In order to avoid migrating ideally placed numa tasks,
+		 * ignore those when there's better options.
+		 *
+		 * If we ignore the actual busiest queue to migrate another
+		 * task, the next balance pass can still reduce the busiest
+		 * queue by moving tasks around inside the node.
+		 *
+		 * If we cannot move enough load due to this classification
+		 * the next pass will adjust the group classification and
+		 * allow migration of more tasks.
+		 *
+		 * Both cases only affect the total convergence complexity.
+		 */
+		if (rt > env->fbq_type)
+			continue;
+
+		power = power_of(i);
+		capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
 		if (!capacity)
 			capacity = fix_small_capacity(env->sd, group);
 
-		rq = cpu_rq(i);
 		wl = weighted_cpuload(i);
 
 		/*
@@ -5975,6 +6070,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+		.fbq_type	= all,
 	};
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 13fe790..f9e1b5e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -409,6 +409,10 @@ struct rq {
 	 * remote CPUs use both these fields when doing load calculation.
 	 */
 	unsigned int nr_running;
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned int nr_numa_running;
+	unsigned int nr_preferred_running;
+#endif
 	#define CPU_LOAD_IDX_MAX 5
 	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
 	unsigned long last_load_update_tick;
@@ -554,6 +558,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int node);
 extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *, struct task_struct *);
 #endif /* CONFIG_NUMA_BALANCING */
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 56/63] sched: numa: be more careful about joining numa groups
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

Due to the way the pid is truncated, and tasks are moved between
CPUs by the scheduler, it is possible for the current task_numa_fault
to group together tasks that do not actually share any memory.

This patch adds a few easy sanity checks to task_numa_fault, joining
tasks together if they share the same tsk->mm, or if the fault was on
a page with an elevated mapcount, in a shared VMA.
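
The grouping decision therefore reduces to two extra checks. A minimal sketch
with the predicates flattened into booleans (illustration only, not the
task_numa_group() signature):

#include <stdbool.h>

/*
 * Minimal sketch of the new join policy: a cpupid match alone is no longer
 * enough. Join only when the tasks share an address space (same tsk->mm)
 * or the fault hit a page that is genuinely shared (TNF_SHARED).
 */
static bool should_join_group(bool same_mm, bool shared_fault)
{
	bool join = false;

	if (same_mm)		/* threads of the same process always group */
		join = true;
	if (shared_fault)	/* elevated mapcount in a VM_SHARED mapping */
		join = true;

	return join;
}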

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 16 +++++++++++-----
 mm/memory.c           |  7 +++++++
 3 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 390004b..b859621 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1444,6 +1444,7 @@ struct task_struct {
 
 #define TNF_MIGRATED	0x01
 #define TNF_NO_GROUP	0x02
+#define TNF_SHARED	0x04
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 88225b7..baa2276 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1381,7 +1381,7 @@ static void double_lock(spinlock_t *l1, spinlock_t *l2)
 	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
 }
 
-static void task_numa_group(struct task_struct *p, int cpupid)
+static void task_numa_group(struct task_struct *p, int cpupid, int flags)
 {
 	struct numa_group *grp, *my_grp;
 	struct task_struct *tsk;
@@ -1439,10 +1439,16 @@ static void task_numa_group(struct task_struct *p, int cpupid)
 	if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp)
 		goto unlock;
 
-	if (!get_numa_group(grp))
-		goto unlock;
+	/* Always join threads in the same process. */
+	if (tsk->mm == current->mm)
+		join = true;
+
+	/* Simple filter to avoid false positives due to PID collisions */
+	if (flags & TNF_SHARED)
+		join = true;
 
-	join = true;
+	if (join && !get_numa_group(grp))
+		join = false;
 
 unlock:
 	rcu_read_unlock();
@@ -1539,7 +1545,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 	} else {
 		priv = cpupid_match_pid(p, last_cpupid);
 		if (!priv && !(flags & TNF_NO_GROUP))
-			task_numa_group(p, last_cpupid);
+			task_numa_group(p, last_cpupid, flags);
 	}
 
 	/*
diff --git a/mm/memory.c b/mm/memory.c
index 9898eeb..823720c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3584,6 +3584,13 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte_write(pte))
 		flags |= TNF_NO_GROUP;
 
+	/*
+	 * Flag if the page is shared between multiple address spaces. This
+	 * is later used when determining whether to group tasks together
+	 */
+	if (page_mapcount(page) > 1 && (vma->vm_flags & VM_SHARED))
+		flags |= TNF_SHARED;
+
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 57/63] sched: numa: Take false sharing into account when adapting scan rate
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Scan rate is altered based on whether shared/private faults dominated.
task_numa_group() may detect false sharing but that information is not
taken into account when adapting the scan rate. Take it into account.
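
A minimal sketch of the feedback path (illustration only; the real code passes
priv back from task_numa_group() by pointer):

#include <stdbool.h>

/*
 * Sketch only: task_numa_group() reports back whether the access was treated
 * as genuinely shared. If the group was not joined because the sharing looked
 * false, the fault is accounted as private when the scan rate is adapted.
 */
static void record_sharing_outcome(bool joined_group, int *priv)
{
	*priv = !joined_group;
}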

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index baa2276..03698f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1381,7 +1381,8 @@ static void double_lock(spinlock_t *l1, spinlock_t *l2)
 	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
 }
 
-static void task_numa_group(struct task_struct *p, int cpupid, int flags)
+static void task_numa_group(struct task_struct *p, int cpupid, int flags,
+			int *priv)
 {
 	struct numa_group *grp, *my_grp;
 	struct task_struct *tsk;
@@ -1447,6 +1448,9 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags)
 	if (flags & TNF_SHARED)
 		join = true;
 
+	/* Update priv based on whether false sharing was detected */
+	*priv = !join;
+
 	if (join && !get_numa_group(grp))
 		join = false;
 
@@ -1545,7 +1549,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 	} else {
 		priv = cpupid_match_pid(p, last_cpupid);
 		if (!priv && !(flags & TNF_NO_GROUP))
-			task_numa_group(p, last_cpupid, flags);
+			task_numa_group(p, last_cpupid, flags, &priv);
 	}
 
 	/*
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 58/63] sched: numa: adjust scan rate in task_numa_placement
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

Adjust numa_scan_period in task_numa_placement, depending on how much
useful work the numa code can do. The more local faults there are in a
given scan window the longer the period (and hence the slower the scan rate)
during the next window. If there are excessive shared faults then the scan
period will decrease, with the amount of scaling depending on the ratio of
shared to private faults. If the preferred node changes then the
scan rate is reset to recheck if the task is properly placed.
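
The adjustment itself is simple slot arithmetic. The stand-alone sketch below
copies the constants from the patch and approximates the kernel's rounding to
show how the fault counts translate into a period change:

#include <stdio.h>

#define NUMA_PERIOD_SLOTS	10
#define NUMA_PERIOD_THRESHOLD	3

/*
 * Stand-alone illustration (not the kernel function): compute the signed
 * change to the scan period from the local/remote and private/shared
 * fault counts gathered during the last scan window.
 */
static long period_diff(unsigned long period,
			unsigned long local, unsigned long remote,
			unsigned long private, unsigned long shared)
{
	long period_slot = (period + NUMA_PERIOD_SLOTS - 1) / NUMA_PERIOD_SLOTS;
	long ratio = local * NUMA_PERIOD_SLOTS / (local + remote);
	long diff;

	if (ratio >= NUMA_PERIOD_THRESHOLD) {
		/* Mostly local faults: lengthen the period (scan slower) */
		long slot = ratio - NUMA_PERIOD_THRESHOLD;

		if (!slot)
			slot = 1;
		diff = slot * period_slot;
	} else {
		/* Mostly remote: shorten the period, scaled by the private share */
		long priv_ratio = (private * NUMA_PERIOD_SLOTS +
				   private + shared - 1) / (private + shared);

		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
		diff = diff * priv_ratio / NUMA_PERIOD_SLOTS;
	}
	return diff;
}

int main(void)
{
	/* 1000ms period, 90% local faults: period grows by several slots */
	printf("%+ld\n", period_diff(1000, 90, 10, 50, 50));
	/* 90% remote and mostly private faults: period shrinks */
	printf("%+ld\n", period_diff(1000, 10, 90, 80, 20));
	return 0;
}

The kernel then clamps p->numa_scan_period + diff between task_scan_min() and
task_scan_max() before applying it.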

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |   9 ++++
 kernel/sched/fair.c   | 112 +++++++++++++++++++++++++++++++++++++++-----------
 mm/huge_memory.c      |   4 +-
 mm/memory.c           |   9 ++--
 4 files changed, 105 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b859621..c1bd367 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1355,6 +1355,14 @@ struct task_struct {
 	 */
 	unsigned long *numa_faults_buffer;
 
+	/*
+	 * numa_faults_locality tracks if faults recorded during the last
+	 * scan window were remote/local. The task scan period is adapted
+	 * based on the locality of the faults with different weights
+	 * depending on whether they were shared or private faults
+	 */
+	unsigned long numa_faults_locality[2];
+
 	int numa_preferred_nid;
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -1445,6 +1453,7 @@ struct task_struct {
 #define TNF_MIGRATED	0x01
 #define TNF_NO_GROUP	0x02
 #define TNF_SHARED	0x04
+#define TNF_FAULT_LOCAL	0x08
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03698f5..d8514c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1241,6 +1241,12 @@ static int task_numa_migrate(struct task_struct *p)
 
 	sched_setnuma(p, env.dst_nid);
 
+	/*
+	 * Reset the scan period if the task is being rescheduled on an
+	 * alternative node to recheck if the tasks is now properly placed.
+	 */
+	p->numa_scan_period = task_scan_min(p);
+
 	if (env.best_task == NULL) {
 		int ret = migrate_task_to(p, env.best_cpu);
 		return ret;
@@ -1276,10 +1282,86 @@ static void numa_migrate_preferred(struct task_struct *p)
 		p->numa_migrate_retry = jiffies + HZ*5;
 }
 
+/*
+ * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
+ * increments. The more local the fault statistics are, the higher the scan
+ * period will be for the next scan window. If local/remote ratio is below
+ * NUMA_PERIOD_THRESHOLD (where range of ratio is 1..NUMA_PERIOD_SLOTS) the
+ * scan period will decrease
+ */
+#define NUMA_PERIOD_SLOTS 10
+#define NUMA_PERIOD_THRESHOLD 3
+
+/*
+ * Increase the scan period (slow down scanning) if the majority of
+ * our memory is already on our local node, or if the majority of
+ * the page accesses are shared with other processes.
+ * Otherwise, decrease the scan period.
+ */
+static void update_task_scan_period(struct task_struct *p,
+			unsigned long shared, unsigned long private)
+{
+	unsigned int period_slot;
+	int ratio;
+	int diff;
+
+	unsigned long remote = p->numa_faults_locality[0];
+	unsigned long local = p->numa_faults_locality[1];
+
+	/*
+	 * If there were no record hinting faults then either the task is
+	 * completely idle or all activity is areas that are not of interest
+	 * to automatic numa balancing. Scan slower
+	 */
+	if (local + shared == 0) {
+		p->numa_scan_period = min(p->numa_scan_period_max,
+			p->numa_scan_period << 1);
+
+		p->mm->numa_next_scan = jiffies +
+			msecs_to_jiffies(p->numa_scan_period);
+
+		return;
+	}
+
+	/*
+	 * Prepare to scale scan period relative to the current period.
+	 *	 == NUMA_PERIOD_THRESHOLD scan period stays the same
+	 *       <  NUMA_PERIOD_THRESHOLD scan period decreases (scan faster)
+	 *	 >= NUMA_PERIOD_THRESHOLD scan period increases (scan slower)
+	 */
+	period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS);
+	ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
+	if (ratio >= NUMA_PERIOD_THRESHOLD) {
+		int slot = ratio - NUMA_PERIOD_THRESHOLD;
+		if (!slot)
+			slot = 1;
+		diff = slot * period_slot;
+	} else {
+		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
+
+		/*
+		 * Scale scan rate increases based on sharing. There is an
+		 * inverse relationship between the degree of sharing and
+		 * the adjustment made to the scanning period. Broadly
+		 * speaking the intent is that there is little point
+		 * scanning faster if shared accesses dominate as it may
+		 * simply bounce migrations uselessly
+		 */
+		period_slot = DIV_ROUND_UP(diff, NUMA_PERIOD_SLOTS);
+		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
+		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
+	}
+
+	p->numa_scan_period = clamp(p->numa_scan_period + diff,
+			task_scan_min(p), task_scan_max(p));
+	memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1, max_group_nid = -1;
 	unsigned long max_faults = 0, max_group_faults = 0;
+	unsigned long fault_types[2] = { 0, 0 };
 	spinlock_t *group_lock = NULL;
 
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1309,6 +1391,7 @@ static void task_numa_placement(struct task_struct *p)
 			/* Decay existing window, copy faults since last scan */
 			p->numa_faults[i] >>= 1;
 			p->numa_faults[i] += p->numa_faults_buffer[i];
+			fault_types[priv] += p->numa_faults_buffer[i];
 			p->numa_faults_buffer[i] = 0;
 
 			faults += p->numa_faults[i];
@@ -1333,6 +1416,8 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
+	update_task_scan_period(p, fault_types[0], fault_types[1]);
+
 	if (p->numa_group) {
 		/*
 		 * If the preferred task and group nids are different,
@@ -1538,6 +1623,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 		BUG_ON(p->numa_faults_buffer);
 		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 		p->total_numa_faults = 0;
+		memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
 	}
 
 	/*
@@ -1552,19 +1638,6 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 			task_numa_group(p, last_cpupid, flags, &priv);
 	}
 
-	/*
-	 * If pages are properly placed (did not migrate) then scan slower.
-	 * This is reset periodically in case of phase changes
-	 */
-	if (!migrated) {
-		/* Initialise if necessary */
-		if (!p->numa_scan_period_max)
-			p->numa_scan_period_max = task_scan_max(p);
-
-		p->numa_scan_period = min(p->numa_scan_period_max,
-			p->numa_scan_period + 10);
-	}
-
 	task_numa_placement(p);
 
 	/* Retry task to preferred node migration if it previously failed */
@@ -1575,6 +1648,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 		p->numa_pages_migrated += pages;
 
 	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+	p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
@@ -1702,18 +1776,6 @@ void task_numa_work(struct callback_head *work)
 
 out:
 	/*
-	 * If the whole process was scanned without updates then no NUMA
-	 * hinting faults are being recorded and scan rate should be lower.
-	 */
-	if (mm->numa_scan_offset == 0 && !nr_pte_updates) {
-		p->numa_scan_period = min(p->numa_scan_period_max,
-			p->numa_scan_period << 1);
-
-		next_scan = now + msecs_to_jiffies(p->numa_scan_period);
-		mm->numa_next_scan = next_scan;
-	}
-
-	/*
 	 * It is possible to reach the end of the VMA list but the last few
 	 * VMAs are not guaranteed to the vma_migratable. If they are not, we
 	 * would find the !migratable VMA on the next scan but not reset the
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7ab4e32..1be2a1f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1296,8 +1296,10 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page_nid = page_to_nid(page);
 	last_cpupid = page_cpupid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (page_nid == this_nid)
+	if (page_nid == this_nid) {
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+		flags |= TNF_FAULT_LOCAL;
+	}
 
 	/*
 	 * Avoid grouping on DSO/COW pages in specific and RO pages
diff --git a/mm/memory.c b/mm/memory.c
index 823720c..1c7501f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3527,13 +3527,16 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-				unsigned long addr, int page_nid)
+				unsigned long addr, int page_nid,
+				int *flags)
 {
 	get_page(page);
 
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (page_nid == numa_node_id())
+	if (page_nid == numa_node_id()) {
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+		*flags |= TNF_FAULT_LOCAL;
+	}
 
 	return mpol_misplaced(page, vma, addr);
 }
@@ -3593,7 +3596,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
-	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+	target_nid = numa_migrate_prep(page, vma, addr, page_nid, &flags);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
 		put_page(page);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 58/63] sched: numa: adjust scan rate in task_numa_placement
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

Adjust numa_scan_period in task_numa_placement, depending on how much
useful work the numa code can do. The more local faults there are in a
given scan window the longer the period (and hence the slower the scan rate)
during the next window. If there are excessive shared faults then the scan
period will decrease, with the amount of scaling depending on the ratio of
shared to private faults. If the preferred node changes then the
scan rate is reset to recheck if the task is properly placed.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |   9 ++++
 kernel/sched/fair.c   | 112 +++++++++++++++++++++++++++++++++++++++-----------
 mm/huge_memory.c      |   4 +-
 mm/memory.c           |   9 ++--
 4 files changed, 105 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b859621..c1bd367 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1355,6 +1355,14 @@ struct task_struct {
 	 */
 	unsigned long *numa_faults_buffer;
 
+	/*
+	 * numa_faults_locality tracks if faults recorded during the last
+	 * scan window were remote/local. The task scan period is adapted
+	 * based on the locality of the faults with different weights
+	 * depending on whether they were shared or private faults
+	 */
+	unsigned long numa_faults_locality[2];
+
 	int numa_preferred_nid;
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -1445,6 +1453,7 @@ struct task_struct {
 #define TNF_MIGRATED	0x01
 #define TNF_NO_GROUP	0x02
 #define TNF_SHARED	0x04
+#define TNF_FAULT_LOCAL	0x08
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03698f5..d8514c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1241,6 +1241,12 @@ static int task_numa_migrate(struct task_struct *p)
 
 	sched_setnuma(p, env.dst_nid);
 
+	/*
+	 * Reset the scan period if the task is being rescheduled on an
+	 * alternative node to recheck if the tasks is now properly placed.
+	 */
+	p->numa_scan_period = task_scan_min(p);
+
 	if (env.best_task == NULL) {
 		int ret = migrate_task_to(p, env.best_cpu);
 		return ret;
@@ -1276,10 +1282,86 @@ static void numa_migrate_preferred(struct task_struct *p)
 		p->numa_migrate_retry = jiffies + HZ*5;
 }
 
+/*
+ * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
+ * increments. The more local the fault statistics are, the higher the scan
+ * period will be for the next scan window. If local/remote ratio is below
+ * NUMA_PERIOD_THRESHOLD (where range of ratio is 1..NUMA_PERIOD_SLOTS) the
+ * scan period will decrease
+ */
+#define NUMA_PERIOD_SLOTS 10
+#define NUMA_PERIOD_THRESHOLD 3
+
+/*
+ * Increase the scan period (slow down scanning) if the majority of
+ * our memory is already on our local node, or if the majority of
+ * the page accesses are shared with other processes.
+ * Otherwise, decrease the scan period.
+ */
+static void update_task_scan_period(struct task_struct *p,
+			unsigned long shared, unsigned long private)
+{
+	unsigned int period_slot;
+	int ratio;
+	int diff;
+
+	unsigned long remote = p->numa_faults_locality[0];
+	unsigned long local = p->numa_faults_locality[1];
+
+	/*
+	 * If there were no record hinting faults then either the task is
+	 * completely idle or all activity is areas that are not of interest
+	 * to automatic numa balancing. Scan slower
+	 */
+	if (local + shared == 0) {
+		p->numa_scan_period = min(p->numa_scan_period_max,
+			p->numa_scan_period << 1);
+
+		p->mm->numa_next_scan = jiffies +
+			msecs_to_jiffies(p->numa_scan_period);
+
+		return;
+	}
+
+	/*
+	 * Prepare to scale scan period relative to the current period.
+	 *	 == NUMA_PERIOD_THRESHOLD scan period stays the same
+	 *       <  NUMA_PERIOD_THRESHOLD scan period decreases (scan faster)
+	 *	 >= NUMA_PERIOD_THRESHOLD scan period increases (scan slower)
+	 */
+	period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS);
+	ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
+	if (ratio >= NUMA_PERIOD_THRESHOLD) {
+		int slot = ratio - NUMA_PERIOD_THRESHOLD;
+		if (!slot)
+			slot = 1;
+		diff = slot * period_slot;
+	} else {
+		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
+
+		/*
+		 * Scale scan rate increases based on sharing. There is an
+		 * inverse relationship between the degree of sharing and
+		 * the adjustment made to the scanning period. Broadly
+		 * speaking the intent is that there is little point
+		 * scanning faster if shared accesses dominate as it may
+		 * simply bounce migrations uselessly
+		 */
+		period_slot = DIV_ROUND_UP(diff, NUMA_PERIOD_SLOTS);
+		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
+		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
+	}
+
+	p->numa_scan_period = clamp(p->numa_scan_period + diff,
+			task_scan_min(p), task_scan_max(p));
+	memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1, max_group_nid = -1;
 	unsigned long max_faults = 0, max_group_faults = 0;
+	unsigned long fault_types[2] = { 0, 0 };
 	spinlock_t *group_lock = NULL;
 
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1309,6 +1391,7 @@ static void task_numa_placement(struct task_struct *p)
 			/* Decay existing window, copy faults since last scan */
 			p->numa_faults[i] >>= 1;
 			p->numa_faults[i] += p->numa_faults_buffer[i];
+			fault_types[priv] += p->numa_faults_buffer[i];
 			p->numa_faults_buffer[i] = 0;
 
 			faults += p->numa_faults[i];
@@ -1333,6 +1416,8 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
+	update_task_scan_period(p, fault_types[0], fault_types[1]);
+
 	if (p->numa_group) {
 		/*
 		 * If the preferred task and group nids are different,
@@ -1538,6 +1623,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 		BUG_ON(p->numa_faults_buffer);
 		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 		p->total_numa_faults = 0;
+		memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
 	}
 
 	/*
@@ -1552,19 +1638,6 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 			task_numa_group(p, last_cpupid, flags, &priv);
 	}
 
-	/*
-	 * If pages are properly placed (did not migrate) then scan slower.
-	 * This is reset periodically in case of phase changes
-	 */
-	if (!migrated) {
-		/* Initialise if necessary */
-		if (!p->numa_scan_period_max)
-			p->numa_scan_period_max = task_scan_max(p);
-
-		p->numa_scan_period = min(p->numa_scan_period_max,
-			p->numa_scan_period + 10);
-	}
-
 	task_numa_placement(p);
 
 	/* Retry task to preferred node migration if it previously failed */
@@ -1575,6 +1648,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 		p->numa_pages_migrated += pages;
 
 	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+	p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
@@ -1702,18 +1776,6 @@ void task_numa_work(struct callback_head *work)
 
 out:
 	/*
-	 * If the whole process was scanned without updates then no NUMA
-	 * hinting faults are being recorded and scan rate should be lower.
-	 */
-	if (mm->numa_scan_offset == 0 && !nr_pte_updates) {
-		p->numa_scan_period = min(p->numa_scan_period_max,
-			p->numa_scan_period << 1);
-
-		next_scan = now + msecs_to_jiffies(p->numa_scan_period);
-		mm->numa_next_scan = next_scan;
-	}
-
-	/*
 	 * It is possible to reach the end of the VMA list but the last few
 	 * VMAs are not guaranteed to be vma_migratable. If they are not, we
 	 * would find the !migratable VMA on the next scan but not reset the
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7ab4e32..1be2a1f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1296,8 +1296,10 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page_nid = page_to_nid(page);
 	last_cpupid = page_cpupid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (page_nid == this_nid)
+	if (page_nid == this_nid) {
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+		flags |= TNF_FAULT_LOCAL;
+	}
 
 	/*
 	 * Avoid grouping on DSO/COW pages in specific and RO pages
diff --git a/mm/memory.c b/mm/memory.c
index 823720c..1c7501f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3527,13 +3527,16 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-				unsigned long addr, int page_nid)
+				unsigned long addr, int page_nid,
+				int *flags)
 {
 	get_page(page);
 
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (page_nid == numa_node_id())
+	if (page_nid == numa_node_id()) {
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+		*flags |= TNF_FAULT_LOCAL;
+	}
 
 	return mpol_misplaced(page, vma, addr);
 }
@@ -3593,7 +3596,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
-	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+	target_nid = numa_migrate_prep(page, vma, addr, page_nid, &flags);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
 		put_page(page);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread
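
The arithmetic in update_task_scan_period() above is easiest to check with concrete numbers. Below is a small, self-contained C sketch of the same slot-based adjustment; the constants and control flow mirror the hunk above, but the function is a stand-alone illustration (it omits the final clamp against task_scan_min()/task_scan_max(), which needs task state) and is not kernel code.

#include <stdio.h>

#define NUMA_PERIOD_SLOTS	10
#define NUMA_PERIOD_THRESHOLD	3
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Signed change (in ms) that update_task_scan_period() would apply. */
static long scan_period_delta(unsigned long period,
			      unsigned long local, unsigned long remote,
			      unsigned long private, unsigned long shared)
{
	unsigned long period_slot = DIV_ROUND_UP(period, NUMA_PERIOD_SLOTS);
	long ratio, diff;

	if (local + remote == 0 || private + shared == 0)
		return 0;	/* no recorded faults; the kernel handles this separately */

	ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
	if (ratio >= NUMA_PERIOD_THRESHOLD) {
		/* Mostly local faults: lengthen the period, scan slower. */
		long slot = ratio - NUMA_PERIOD_THRESHOLD;
		if (!slot)
			slot = 1;
		diff = slot * period_slot;
	} else {
		/* Mostly remote faults: shorten the period, scan faster... */
		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * (long)period_slot;

		/* ...but scale the speed-up down if accesses are shared. */
		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
				     private + shared);
		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
	}
	return diff;
}

int main(void)
{
	/* 1000ms period, 90% local faults: +600ms, scanning slows down. */
	printf("%+ld\n", scan_period_delta(1000, 900, 100, 800, 200));
	/* 1000ms period, 10% local, mostly private: -160ms, scanning speeds up. */
	printf("%+ld\n", scan_period_delta(1000, 100, 900, 800, 200));
	return 0;
}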

* [PATCH 59/63] sched: numa: Remove the numa_balancing_scan_period_reset sysctl
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

With scan rate adaptations based on whether the workload has properly
converged or not, there should be no need for the scan period reset
hammer. Get rid of it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 +++--------
 include/linux/mm_types.h        |  3 ---
 include/linux/sched/sysctl.h    |  1 -
 kernel/sched/core.c             |  1 -
 kernel/sched/fair.c             | 18 +-----------------
 kernel/sysctl.c                 |  7 -------
 6 files changed, 4 insertions(+), 37 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index d48bca4..84f1780 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,15 +374,13 @@ guarantee. If the target workload is already bound to NUMA nodes then this
 feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
-numa_balancing_settle_count sysctls.
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
 
 ==============================================================
 
 numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_size_mb
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
 
 Automatic NUMA balancing scans tasks address space and unmaps pages to
 detect if pages are properly placed or if the data should be migrated to a
@@ -418,9 +416,6 @@ rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
 
-numa_balancing_scan_period_reset is a blunt instrument that controls how
-often a tasks scan delay is reset to detect sudden changes in task behaviour.
-
 numa_balancing_settle_count is how many scan periods must complete before
 the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a30f9ca..a3198e5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -420,9 +420,6 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
-	/* numa_next_reset is when the PTE scanner period will be reset */
-	unsigned long numa_next_reset;
-
 	/* Restart point for scanning and setting pte_numa */
 	unsigned long numa_scan_offset;
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..10d16c4f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -47,7 +47,6 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
-extern unsigned int sysctl_numa_balancing_scan_period_reset;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_settle_count;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 86497b8..07d7c11 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1716,7 +1716,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-		p->mm->numa_next_reset = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d8514c8..38ec714 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -826,7 +826,6 @@ static unsigned long task_h_load(struct task_struct *p);
  */
 unsigned int sysctl_numa_balancing_scan_period_min = 1000;
 unsigned int sysctl_numa_balancing_scan_period_max = 60000;
-unsigned int sysctl_numa_balancing_scan_period_reset = 60000;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -1685,24 +1684,9 @@ void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
-	if (!mm->numa_next_reset || !mm->numa_next_scan) {
+	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
 			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-		mm->numa_next_reset = now +
-			msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
-	}
-
-	/*
-	 * Reset the scan period if enough time has gone by. Objective is that
-	 * scanning will be reduced if pages are properly placed. As tasks
-	 * can enter different phases this needs to be re-examined. Lacking
-	 * proper tracking of reference behaviour, this blunt hammer is used.
-	 */
-	migrate = mm->numa_next_reset;
-	if (time_after(now, migrate)) {
-		p->numa_scan_period = task_scan_min(p);
-		next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
-		xchg(&mm->numa_next_reset, next_scan);
 	}
 
 	/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 42f616a..e509b90 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -371,13 +371,6 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname	= "numa_balancing_scan_period_reset",
-		.data		= &sysctl_numa_balancing_scan_period_reset,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
-	{
 		.procname	= "numa_balancing_scan_period_max_ms",
 		.data		= &sysctl_numa_balancing_scan_period_max,
 		.maxlen		= sizeof(unsigned int),
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread
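
After this removal, scan-rate behaviour is controlled entirely by the sysctls that remain in the documentation hunk above. They are exposed as files under /proc/sys/kernel/, so the current values can be read directly; a minimal C sketch (assuming CONFIG_NUMA_BALANCING is enabled, otherwise the files are absent):

#include <stdio.h>

/* Scan-rate tunables that remain after this patch. */
static const char * const tunables[] = {
	"numa_balancing_scan_delay_ms",
	"numa_balancing_scan_period_min_ms",
	"numa_balancing_scan_period_max_ms",
	"numa_balancing_scan_size_mb",
};

int main(void)
{
	unsigned int i;

	for (i = 0; i < sizeof(tunables) / sizeof(tunables[0]); i++) {
		char path[128];
		long val;
		FILE *f;

		snprintf(path, sizeof(path), "/proc/sys/kernel/%s",
			 tunables[i]);
		f = fopen(path, "r");
		if (!f) {
			perror(path);
			continue;
		}
		if (fscanf(f, "%ld", &val) == 1)
			printf("%s = %ld\n", tunables[i], val);
		fclose(f);
	}
	return 0;
}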

* [PATCH 59/63] sched: numa: Remove the numa_balancing_scan_period_reset sysctl
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

With scan rate adaptations based on whether the workload has properly
converged or not, there should be no need for the scan period reset
hammer. Get rid of it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 +++--------
 include/linux/mm_types.h        |  3 ---
 include/linux/sched/sysctl.h    |  1 -
 kernel/sched/core.c             |  1 -
 kernel/sched/fair.c             | 18 +-----------------
 kernel/sysctl.c                 |  7 -------
 6 files changed, 4 insertions(+), 37 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index d48bca4..84f1780 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,15 +374,13 @@ guarantee. If the target workload is already bound to NUMA nodes then this
 feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
-numa_balancing_settle_count sysctls.
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
 
 ==============================================================
 
 numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_size_mb
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
 
 Automatic NUMA balancing scans tasks address space and unmaps pages to
 detect if pages are properly placed or if the data should be migrated to a
@@ -418,9 +416,6 @@ rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
 
-numa_balancing_scan_period_reset is a blunt instrument that controls how
-often a tasks scan delay is reset to detect sudden changes in task behaviour.
-
 numa_balancing_settle_count is how many scan periods must complete before
 the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a30f9ca..a3198e5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -420,9 +420,6 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
-	/* numa_next_reset is when the PTE scanner period will be reset */
-	unsigned long numa_next_reset;
-
 	/* Restart point for scanning and setting pte_numa */
 	unsigned long numa_scan_offset;
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..10d16c4f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -47,7 +47,6 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
-extern unsigned int sysctl_numa_balancing_scan_period_reset;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_settle_count;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 86497b8..07d7c11 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1716,7 +1716,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-		p->mm->numa_next_reset = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d8514c8..38ec714 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -826,7 +826,6 @@ static unsigned long task_h_load(struct task_struct *p);
  */
 unsigned int sysctl_numa_balancing_scan_period_min = 1000;
 unsigned int sysctl_numa_balancing_scan_period_max = 60000;
-unsigned int sysctl_numa_balancing_scan_period_reset = 60000;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -1685,24 +1684,9 @@ void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
-	if (!mm->numa_next_reset || !mm->numa_next_scan) {
+	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
 			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-		mm->numa_next_reset = now +
-			msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
-	}
-
-	/*
-	 * Reset the scan period if enough time has gone by. Objective is that
-	 * scanning will be reduced if pages are properly placed. As tasks
-	 * can enter different phases this needs to be re-examined. Lacking
-	 * proper tracking of reference behaviour, this blunt hammer is used.
-	 */
-	migrate = mm->numa_next_reset;
-	if (time_after(now, migrate)) {
-		p->numa_scan_period = task_scan_min(p);
-		next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
-		xchg(&mm->numa_next_reset, next_scan);
 	}
 
 	/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 42f616a..e509b90 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -371,13 +371,6 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname	= "numa_balancing_scan_period_reset",
-		.data		= &sysctl_numa_balancing_scan_period_reset,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
-	{
 		.procname	= "numa_balancing_scan_period_max_ms",
 		.data		= &sysctl_numa_balancing_scan_period_max,
 		.maxlen		= sizeof(unsigned int),
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 60/63] mm: numa: revert temporarily disabling of NUMA migration
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

With the scan rate code working (at least for multi-instance specjbb),
the large hammer that is "sched: Do not migrate memory immediately after
switching node" can be replaced with something smarter. Revert temporarily
migration disabling and all traces of numa_migrate_seq.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 -
 kernel/sched/core.c   |  2 --
 kernel/sched/fair.c   | 25 +------------------------
 mm/mempolicy.c        | 12 ------------
 4 files changed, 1 insertion(+), 39 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c1bd367..0f6b1b3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1330,7 +1330,6 @@ struct task_struct {
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	int numa_scan_seq;
-	int numa_migrate_seq;
 	unsigned int numa_scan_period;
 	unsigned int numa_scan_period_max;
 	unsigned long numa_migrate_retry;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 07d7c11..6ecf72b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1726,7 +1726,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = 1;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
@@ -4493,7 +4492,6 @@ void sched_setnuma(struct task_struct *p, int nid)
 		p->sched_class->put_prev_task(rq, p);
 
 	p->numa_preferred_nid = nid;
-	p->numa_migrate_seq = 1;
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 38ec714..ceffce9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1261,16 +1261,8 @@ static void numa_migrate_preferred(struct task_struct *p)
 {
 	/* Success if task is already running on preferred CPU */
 	p->numa_migrate_retry = 0;
-	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid) {
-		/*
-		 * If migration is temporarily disabled due to a task migration
-		 * then re-enable it now as the task is running on its
-		 * preferred node and memory should migrate locally
-		 */
-		if (!p->numa_migrate_seq)
-			p->numa_migrate_seq++;
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
 		return;
-	}
 
 	/* This task has no NUMA fault statistics yet */
 	if (unlikely(p->numa_preferred_nid == -1))
@@ -1367,7 +1359,6 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
-	p->numa_migrate_seq++;
 	p->numa_scan_period_max = task_scan_max(p);
 
 	/* If the task is part of a group prevent parallel updates to group stats */
@@ -4729,20 +4720,6 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 	set_task_cpu(p, env->dst_cpu);
 	activate_task(env->dst_rq, p, 0);
 	check_preempt_curr(env->dst_rq, p, 0);
-#ifdef CONFIG_NUMA_BALANCING
-	if (p->numa_preferred_nid != -1) {
-		int src_nid = cpu_to_node(env->src_cpu);
-		int dst_nid = cpu_to_node(env->dst_cpu);
-
-		/*
-		 * If the load balancer has moved the task then limit
-		 * migrations from taking place in the short term in
-		 * case this is a short-lived migration.
-		 */
-		if (src_nid != dst_nid && dst_nid != p->numa_preferred_nid)
-			p->numa_migrate_seq = 0;
-	}
-#endif
 }
 
 /*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a5867ef..2929c24 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2404,18 +2404,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid)
 			goto out;
-
-#ifdef CONFIG_NUMA_BALANCING
-		/*
-		 * If the scheduler has just moved us away from our
-		 * preferred node, do not bother migrating pages yet.
-		 * This way a short and temporary process migration will
-		 * not cause excessive memory migration.
-		 */
-		if (thisnid != current->numa_preferred_nid &&
-				!current->numa_migrate_seq)
-			goto out;
-#endif
 	}
 
 	if (curnid != polnid)
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 60/63] mm: numa: revert temporarily disabling of NUMA migration
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

With the scan rate code working (at least for multi-instance specjbb),
the large hammer that is "sched: Do not migrate memory immediately after
switching node" can be replaced with something smarter. Revert temporarily
migration disabling and all traces of numa_migrate_seq.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 -
 kernel/sched/core.c   |  2 --
 kernel/sched/fair.c   | 25 +------------------------
 mm/mempolicy.c        | 12 ------------
 4 files changed, 1 insertion(+), 39 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c1bd367..0f6b1b3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1330,7 +1330,6 @@ struct task_struct {
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	int numa_scan_seq;
-	int numa_migrate_seq;
 	unsigned int numa_scan_period;
 	unsigned int numa_scan_period_max;
 	unsigned long numa_migrate_retry;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 07d7c11..6ecf72b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1726,7 +1726,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = 1;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
@@ -4493,7 +4492,6 @@ void sched_setnuma(struct task_struct *p, int nid)
 		p->sched_class->put_prev_task(rq, p);
 
 	p->numa_preferred_nid = nid;
-	p->numa_migrate_seq = 1;
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 38ec714..ceffce9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1261,16 +1261,8 @@ static void numa_migrate_preferred(struct task_struct *p)
 {
 	/* Success if task is already running on preferred CPU */
 	p->numa_migrate_retry = 0;
-	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid) {
-		/*
-		 * If migration is temporarily disabled due to a task migration
-		 * then re-enable it now as the task is running on its
-		 * preferred node and memory should migrate locally
-		 */
-		if (!p->numa_migrate_seq)
-			p->numa_migrate_seq++;
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
 		return;
-	}
 
 	/* This task has no NUMA fault statistics yet */
 	if (unlikely(p->numa_preferred_nid == -1))
@@ -1367,7 +1359,6 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
-	p->numa_migrate_seq++;
 	p->numa_scan_period_max = task_scan_max(p);
 
 	/* If the task is part of a group prevent parallel updates to group stats */
@@ -4729,20 +4720,6 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 	set_task_cpu(p, env->dst_cpu);
 	activate_task(env->dst_rq, p, 0);
 	check_preempt_curr(env->dst_rq, p, 0);
-#ifdef CONFIG_NUMA_BALANCING
-	if (p->numa_preferred_nid != -1) {
-		int src_nid = cpu_to_node(env->src_cpu);
-		int dst_nid = cpu_to_node(env->dst_cpu);
-
-		/*
-		 * If the load balancer has moved the task then limit
-		 * migrations from taking place in the short term in
-		 * case this is a short-lived migration.
-		 */
-		if (src_nid != dst_nid && dst_nid != p->numa_preferred_nid)
-			p->numa_migrate_seq = 0;
-	}
-#endif
 }
 
 /*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a5867ef..2929c24 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2404,18 +2404,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid)
 			goto out;
-
-#ifdef CONFIG_NUMA_BALANCING
-		/*
-		 * If the scheduler has just moved us away from our
-		 * preferred node, do not bother migrating pages yet.
-		 * This way a short and temporary process migration will
-		 * not cause excessive memory migration.
-		 */
-		if (thisnid != current->numa_preferred_nid &&
-				!current->numa_migrate_seq)
-			goto out;
-#endif
 	}
 
 	if (curnid != polnid)
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 61/63] sched: numa: skip some page migrations after a shared fault
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

Shared faults can lead to lots of unnecessary page migrations,
slowing down the system, and causing private faults to hit the
per-pgdat migration ratelimit.

This patch adds sysctl numa_balancing_migrate_deferred, which specifies
how many shared page migrations to skip unconditionally, after each page
migration that is skipped because it is a shared fault.

This reduces the number of page migrations back and forth in
shared fault situations. It also gives a strong preference to
the tasks that are already running where most of the memory is,
and to moving the other tasks near that memory.

Testing this with a much higher scan rate than the default
still seems to result in fewer page migrations than before.

Memory seems to be somewhat better consolidated than previously,
with multi-instance specjbb runs on a 4 node system.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 10 ++++++++-
 include/linux/sched.h           |  5 ++++-
 kernel/sched/fair.c             |  8 +++++++
 kernel/sysctl.c                 |  7 ++++++
 mm/mempolicy.c                  | 48 ++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 84f1780..4273b2d 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -375,7 +375,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
-numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
+numa_balancing_scan_size_mb, numa_balancing_settle_count sysctls and
+numa_balancing_migrate_deferred.
 
 ==============================================================
 
@@ -421,6 +422,13 @@ the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
 preferred node is overloaded.
 
+numa_balancing_migrate_deferred is how many page migrations get skipped
+unconditionally, after a page migration is skipped because a page is shared
+with other tasks. This reduces page migration overhead, and determines
+how much stronger the scheduler's "move task near its memory" policy becomes,
+versus the "move memory near its task" memory management policy, for workloads
+with shared memory.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0f6b1b3..b737b72 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1332,6 +1332,8 @@ struct task_struct {
 	int numa_scan_seq;
 	unsigned int numa_scan_period;
 	unsigned int numa_scan_period_max;
+	int numa_preferred_nid;
+	int numa_migrate_deferred;
 	unsigned long numa_migrate_retry;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
@@ -1362,7 +1364,6 @@ struct task_struct {
 	 */
 	unsigned long numa_faults_locality[2];
 
-	int numa_preferred_nid;
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
@@ -1459,6 +1460,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 extern void task_numa_free(struct task_struct *p);
+
+extern unsigned int sysctl_numa_balancing_migrate_deferred;
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ceffce9..9e2271b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -833,6 +833,14 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * After skipping a page migration on a shared page, skip N more numa page
+ * migrations unconditionally. This reduces the number of NUMA migrations
+ * in shared memory workloads, and has the effect of pulling tasks towards
+ * where their memory lives, over pulling the memory towards the task.
+ */
+unsigned int sysctl_numa_balancing_migrate_deferred = 16;
+
 static unsigned int task_nr_scan_windows(struct task_struct *p)
 {
 	unsigned long rss = 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e509b90..a159e1f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -391,6 +391,13 @@ static struct ctl_table kern_table[] = {
 		.mode           = 0644,
 		.proc_handler   = proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_migrate_deferred",
+		.data           = &sysctl_numa_balancing_migrate_deferred,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2929c24..71cb253 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2301,6 +2301,35 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
+{
+	/* Never defer a private fault */
+	if (cpupid_match_pid(p, last_cpupid))
+		return false;
+
+	if (p->numa_migrate_deferred) {
+		p->numa_migrate_deferred--;
+		return true;
+	}
+	return false;
+}
+
+static inline void defer_numa_migrate(struct task_struct *p)
+{
+	p->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred;
+}
+#else
+static inline bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
+{
+	return false;
+}
+
+static inline void defer_numa_migrate(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 /**
  * mpol_misplaced - check whether current page node is valid in policy
  *
@@ -2402,7 +2431,24 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * relation.
 		 */
 		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
-		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid)
+		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
+
+			/* See sysctl_numa_balancing_migrate_deferred comment */
+			if (!cpupid_match_pid(current, last_cpupid))
+				defer_numa_migrate(current);
+
+			goto out;
+		}
+
+		/*
+		 * The quadratic filter above reduces extraneous migration
+		 * of shared pages somewhat. This code reduces it even more,
+		 * reducing the overhead of page migrations of shared pages.
+		 * This makes workloads with shared pages rely more on
+		 * "move task near its memory", and less on "move memory
+		 * towards its task", which is exactly what we want.
+		 */
+		if (numa_migrate_deferred(current, last_cpupid))
 			goto out;
 	}
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread
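
The deferral logic is a small per-task state machine: one filtered shared fault arms a countdown, and while the countdown is non-zero every subsequent shared fault skips migration outright. A user-space model of just that control flow (the sysctl and the task_struct field are stubbed as plain variables; this is an illustration, not the kernel code path):

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for sysctl_numa_balancing_migrate_deferred and
 * p->numa_migrate_deferred. */
static unsigned int migrate_deferred_max = 16;
static int numa_migrate_deferred;

/* Called when a shared-fault migration was just filtered out. */
static void defer_numa_migrate(void)
{
	numa_migrate_deferred = migrate_deferred_max;
}

/* Should this fault skip page migration unconditionally? */
static bool migration_is_deferred(bool private_fault)
{
	if (private_fault)
		return false;		/* never defer a private fault */

	if (numa_migrate_deferred) {
		numa_migrate_deferred--;
		return true;
	}
	return false;
}

int main(void)
{
	int skipped = 0, considered = 0, i;

	defer_numa_migrate();		/* one shared fault hit the filter */

	for (i = 0; i < 20; i++) {	/* twenty more shared faults follow */
		if (migration_is_deferred(false))
			skipped++;
		else
			considered++;
	}
	printf("skipped %d, considered %d\n", skipped, considered);
	return 0;
}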

* [PATCH 61/63] sched: numa: skip some page migrations after a shared fault
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

Shared faults can lead to lots of unnecessary page migrations,
slowing down the system, and causing private faults to hit the
per-pgdat migration ratelimit.

This patch adds sysctl numa_balancing_migrate_deferred, which specifies
how many shared page migrations to skip unconditionally, after each page
migration that is skipped because it is a shared fault.

This reduces the number of page migrations back and forth in
shared fault situations. It also gives a strong preference to
the tasks that are already running where most of the memory is,
and to moving the other tasks near that memory.

Testing this with a much higher scan rate than the default
still seems to result in fewer page migrations than before.

Memory seems to be somewhat better consolidated than previously,
with multi-instance specjbb runs on a 4 node system.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 10 ++++++++-
 include/linux/sched.h           |  5 ++++-
 kernel/sched/fair.c             |  8 +++++++
 kernel/sysctl.c                 |  7 ++++++
 mm/mempolicy.c                  | 48 ++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 84f1780..4273b2d 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -375,7 +375,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
-numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
+numa_balancing_scan_size_mb, numa_balancing_settle_count sysctls and
+numa_balancing_migrate_deferred.
 
 ==============================================================
 
@@ -421,6 +422,13 @@ the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
 preferred node is overloaded.
 
+numa_balancing_migrate_deferred is how many page migrations get skipped
+unconditionally, after a page migration is skipped because a page is shared
+with other tasks. This reduces page migration overhead, and determines
+how much stronger the scheduler's "move task near its memory" policy becomes,
+versus the "move memory near its task" memory management policy, for workloads
+with shared memory.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0f6b1b3..b737b72 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1332,6 +1332,8 @@ struct task_struct {
 	int numa_scan_seq;
 	unsigned int numa_scan_period;
 	unsigned int numa_scan_period_max;
+	int numa_preferred_nid;
+	int numa_migrate_deferred;
 	unsigned long numa_migrate_retry;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
@@ -1362,7 +1364,6 @@ struct task_struct {
 	 */
 	unsigned long numa_faults_locality[2];
 
-	int numa_preferred_nid;
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
@@ -1459,6 +1460,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 extern void task_numa_free(struct task_struct *p);
+
+extern unsigned int sysctl_numa_balancing_migrate_deferred;
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ceffce9..9e2271b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -833,6 +833,14 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * After skipping a page migration on a shared page, skip N more numa page
+ * migrations unconditionally. This reduces the number of NUMA migrations
+ * in shared memory workloads, and has the effect of pulling tasks towards
+ * where their memory lives, over pulling the memory towards the task.
+ */
+unsigned int sysctl_numa_balancing_migrate_deferred = 16;
+
 static unsigned int task_nr_scan_windows(struct task_struct *p)
 {
 	unsigned long rss = 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e509b90..a159e1f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -391,6 +391,13 @@ static struct ctl_table kern_table[] = {
 		.mode           = 0644,
 		.proc_handler   = proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_migrate_deferred",
+		.data           = &sysctl_numa_balancing_migrate_deferred,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2929c24..71cb253 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2301,6 +2301,35 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
+{
+	/* Never defer a private fault */
+	if (cpupid_match_pid(p, last_cpupid))
+		return false;
+
+	if (p->numa_migrate_deferred) {
+		p->numa_migrate_deferred--;
+		return true;
+	}
+	return false;
+}
+
+static inline void defer_numa_migrate(struct task_struct *p)
+{
+	p->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred;
+}
+#else
+static inline bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
+{
+	return false;
+}
+
+static inline void defer_numa_migrate(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 /**
  * mpol_misplaced - check whether current page node is valid in policy
  *
@@ -2402,7 +2431,24 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * relation.
 		 */
 		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
-		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid)
+		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
+
+			/* See sysctl_numa_balancing_migrate_deferred comment */
+			if (!cpupid_match_pid(current, last_cpupid))
+				defer_numa_migrate(current);
+
+			goto out;
+		}
+
+		/*
+		 * The quadratic filter above reduces extraneous migration
+		 * of shared pages somewhat. This code reduces it even more,
+		 * reducing the overhead of page migrations of shared pages.
+		 * This makes workloads with shared pages rely more on
+		 * "move task near its memory", and less on "move memory
+		 * towards its task", which is exactly what we want.
+		 */
+		if (numa_migrate_deferred(current, last_cpupid))
 			goto out;
 	}
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 62/63] sched: numa: use unsigned longs for numa group fault stats
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

As Peter says "If you're going to hold locks you can also do away with all
that atomic_long_*() nonsense". Lock aquisition moved slightly to protect
the updates.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 49 ++++++++++++++++++++-----------------------------
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9e2271b..f45dd4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -916,8 +916,8 @@ struct numa_group {
 	struct list_head task_list;
 
 	struct rcu_head rcu;
-	atomic_long_t total_faults;
-	atomic_long_t faults[0];
+	unsigned long total_faults;
+	unsigned long faults[0];
 };
 
 pid_t task_numa_group_id(struct task_struct *p)
@@ -944,8 +944,7 @@ static inline unsigned long group_faults(struct task_struct *p, int nid)
 	if (!p->numa_group)
 		return 0;
 
-	return atomic_long_read(&p->numa_group->faults[2*nid]) +
-	       atomic_long_read(&p->numa_group->faults[2*nid+1]);
+	return p->numa_group->faults[2*nid] + p->numa_group->faults[2*nid+1];
 }
 
 /*
@@ -971,17 +970,10 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
 
 static inline unsigned long group_weight(struct task_struct *p, int nid)
 {
-	unsigned long total_faults;
-
-	if (!p->numa_group)
-		return 0;
-
-	total_faults = atomic_long_read(&p->numa_group->total_faults);
-
-	if (!total_faults)
+	if (!p->numa_group || !p->numa_group->total_faults)
 		return 0;
 
-	return 1000 * group_faults(p, nid) / total_faults;
+	return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
 }
 
 static unsigned long weighted_cpuload(const int cpu);
@@ -1397,9 +1389,9 @@ static void task_numa_placement(struct task_struct *p)
 			p->total_numa_faults += diff;
 			if (p->numa_group) {
 				/* safe because we can only change our own group */
-				atomic_long_add(diff, &p->numa_group->faults[i]);
-				atomic_long_add(diff, &p->numa_group->total_faults);
-				group_faults += atomic_long_read(&p->numa_group->faults[i]);
+				p->numa_group->faults[i] += diff;
+				p->numa_group->total_faults += diff;
+				group_faults += p->numa_group->faults[i];
 			}
 		}
 
@@ -1475,7 +1467,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 
 	if (unlikely(!p->numa_group)) {
 		unsigned int size = sizeof(struct numa_group) +
-				    2*nr_node_ids*sizeof(atomic_long_t);
+				    2*nr_node_ids*sizeof(unsigned long);
 
 		grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 		if (!grp)
@@ -1487,9 +1479,9 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 		grp->gid = p->pid;
 
 		for (i = 0; i < 2*nr_node_ids; i++)
-			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
+			grp->faults[i] = p->numa_faults[i];
 
-		atomic_long_set(&grp->total_faults, p->total_numa_faults);
+		grp->total_faults = p->total_numa_faults;
 
 		list_add(&p->numa_entry, &grp->task_list);
 		grp->nr_tasks++;
@@ -1543,14 +1535,14 @@ unlock:
 	if (!join)
 		return;
 
+	double_lock(&my_grp->lock, &grp->lock);
+
 	for (i = 0; i < 2*nr_node_ids; i++) {
-		atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
-		atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+		my_grp->faults[i] -= p->numa_faults[i];
+		grp->faults[i] += p->numa_faults[i];
 	}
-	atomic_long_sub(p->total_numa_faults, &my_grp->total_faults);
-	atomic_long_add(p->total_numa_faults, &grp->total_faults);
-
-	double_lock(&my_grp->lock, &grp->lock);
+	my_grp->total_faults -= p->total_numa_faults;
+	grp->total_faults += p->total_numa_faults;
 
 	list_move(&p->numa_entry, &grp->task_list);
 	my_grp->nr_tasks--;
@@ -1571,12 +1563,11 @@ void task_numa_free(struct task_struct *p)
 	void *numa_faults = p->numa_faults;
 
 	if (grp) {
+		spin_lock(&grp->lock);
 		for (i = 0; i < 2*nr_node_ids; i++)
-			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
-
-		atomic_long_sub(p->total_numa_faults, &grp->total_faults);
+			grp->faults[i] -= p->numa_faults[i];
+		grp->total_faults -= p->total_numa_faults;
 
-		spin_lock(&grp->lock);
 		list_del(&p->numa_entry);
 		grp->nr_tasks--;
 		spin_unlock(&grp->lock);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread
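
The reshuffle above only works because double_lock() takes both group locks before the per-node counters move between groups. The usual way to build such a helper without risking an ABBA deadlock is to acquire the two locks in a fixed (for example, address) order. The sketch below shows that ordering idiom with pthread mutexes; it illustrates the pattern only and is not the kernel's spinlock helper.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Take two locks in a stable order so concurrent group joins cannot
 * deadlock against each other. */
static void double_lock(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {
		pthread_mutex_t *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(a);
	pthread_mutex_lock(b);
}

static void double_unlock(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(a);
	pthread_mutex_unlock(b);
}

static pthread_mutex_t old_grp_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t new_grp_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long old_grp_faults = 100, new_grp_faults;

int main(void)
{
	/* Move fault statistics between groups under both locks, the way
	 * the patch does before list_move(). */
	double_lock(&old_grp_lock, &new_grp_lock);
	new_grp_faults += old_grp_faults;
	old_grp_faults = 0;
	double_unlock(&old_grp_lock, &new_grp_lock);

	printf("old=%lu new=%lu\n", old_grp_faults, new_grp_faults);
	return 0;
}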

* [PATCH 62/63] sched: numa: use unsigned longs for numa group fault stats
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

As Peter says "If you're going to hold locks you can also do away with all
that atomic_long_*() nonsense". Lock aquisition moved slightly to protect
the updates.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 49 ++++++++++++++++++++-----------------------------
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9e2271b..f45dd4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -916,8 +916,8 @@ struct numa_group {
 	struct list_head task_list;
 
 	struct rcu_head rcu;
-	atomic_long_t total_faults;
-	atomic_long_t faults[0];
+	unsigned long total_faults;
+	unsigned long faults[0];
 };
 
 pid_t task_numa_group_id(struct task_struct *p)
@@ -944,8 +944,7 @@ static inline unsigned long group_faults(struct task_struct *p, int nid)
 	if (!p->numa_group)
 		return 0;
 
-	return atomic_long_read(&p->numa_group->faults[2*nid]) +
-	       atomic_long_read(&p->numa_group->faults[2*nid+1]);
+	return p->numa_group->faults[2*nid] + p->numa_group->faults[2*nid+1];
 }
 
 /*
@@ -971,17 +970,10 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
 
 static inline unsigned long group_weight(struct task_struct *p, int nid)
 {
-	unsigned long total_faults;
-
-	if (!p->numa_group)
-		return 0;
-
-	total_faults = atomic_long_read(&p->numa_group->total_faults);
-
-	if (!total_faults)
+	if (!p->numa_group || !p->numa_group->total_faults)
 		return 0;
 
-	return 1000 * group_faults(p, nid) / total_faults;
+	return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
 }
 
 static unsigned long weighted_cpuload(const int cpu);
@@ -1397,9 +1389,9 @@ static void task_numa_placement(struct task_struct *p)
 			p->total_numa_faults += diff;
 			if (p->numa_group) {
 				/* safe because we can only change our own group */
-				atomic_long_add(diff, &p->numa_group->faults[i]);
-				atomic_long_add(diff, &p->numa_group->total_faults);
-				group_faults += atomic_long_read(&p->numa_group->faults[i]);
+				p->numa_group->faults[i] += diff;
+				p->numa_group->total_faults += diff;
+				group_faults += p->numa_group->faults[i];
 			}
 		}
 
@@ -1475,7 +1467,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 
 	if (unlikely(!p->numa_group)) {
 		unsigned int size = sizeof(struct numa_group) +
-				    2*nr_node_ids*sizeof(atomic_long_t);
+				    2*nr_node_ids*sizeof(unsigned long);
 
 		grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 		if (!grp)
@@ -1487,9 +1479,9 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 		grp->gid = p->pid;
 
 		for (i = 0; i < 2*nr_node_ids; i++)
-			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
+			grp->faults[i] = p->numa_faults[i];
 
-		atomic_long_set(&grp->total_faults, p->total_numa_faults);
+		grp->total_faults = p->total_numa_faults;
 
 		list_add(&p->numa_entry, &grp->task_list);
 		grp->nr_tasks++;
@@ -1543,14 +1535,14 @@ unlock:
 	if (!join)
 		return;
 
+	double_lock(&my_grp->lock, &grp->lock);
+
 	for (i = 0; i < 2*nr_node_ids; i++) {
-		atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
-		atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+		my_grp->faults[i] -= p->numa_faults[i];
+		grp->faults[i] += p->numa_faults[i];
 	}
-	atomic_long_sub(p->total_numa_faults, &my_grp->total_faults);
-	atomic_long_add(p->total_numa_faults, &grp->total_faults);
-
-	double_lock(&my_grp->lock, &grp->lock);
+	my_grp->total_faults -= p->total_numa_faults;
+	grp->total_faults += p->total_numa_faults;
 
 	list_move(&p->numa_entry, &grp->task_list);
 	my_grp->nr_tasks--;
@@ -1571,12 +1563,11 @@ void task_numa_free(struct task_struct *p)
 	void *numa_faults = p->numa_faults;
 
 	if (grp) {
+		spin_lock(&grp->lock);
 		for (i = 0; i < 2*nr_node_ids; i++)
-			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
-
-		atomic_long_sub(p->total_numa_faults, &grp->total_faults);
+			grp->faults[i] -= p->numa_faults[i];
+		grp->total_faults -= p->total_numa_faults;
 
-		spin_lock(&grp->lock);
 		list_del(&p->numa_entry);
 		grp->nr_tasks--;
 		spin_unlock(&grp->lock);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [PATCH 63/63] sched: numa: periodically retry task_numa_migrate
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

Short spikes of CPU load can lead to a task being migrated
away from its preferred node for temporary reasons.

It is important that the task is migrated back to where it
belongs, in order to avoid migrating too much memory to its
new location, and generally disturbing a task's NUMA location.

This patch fixes NUMA placement for 4 specjbb instances on
a 4 node system. Without this patch, things take longer to
converge, and processes are not always completely on their
own node.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f45dd4c..1d5ea2d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1259,18 +1259,19 @@ static int task_numa_migrate(struct task_struct *p)
 /* Attempt to migrate a task to a CPU on the preferred node. */
 static void numa_migrate_preferred(struct task_struct *p)
 {
-	/* Success if task is already running on preferred CPU */
-	p->numa_migrate_retry = 0;
-	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+	/* This task has no NUMA fault statistics yet */
+	if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults))
 		return;
 
-	/* This task has no NUMA fault statistics yet */
-	if (unlikely(p->numa_preferred_nid == -1))
+	/* Periodically retry migrating the task to the preferred node */
+	p->numa_migrate_retry = jiffies + HZ;
+
+	/* Success if task is already running on preferred CPU */
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
 		return;
 
 	/* Otherwise, try migrate to a CPU on the preferred node */
-	if (task_numa_migrate(p) != 0)
-		p->numa_migrate_retry = jiffies + HZ*5;
+	task_numa_migrate(p);
 }
 
 /*
@@ -1629,8 +1630,11 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 
 	task_numa_placement(p);
 
-	/* Retry task to preferred node migration if it previously failed */
-	if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+	/*
+	 * Retry task to preferred node migration periodically, in case it
+	 * previously failed, or the scheduler moved us.
+	 */
+	if (time_after(jiffies, p->numa_migrate_retry))
 		numa_migrate_preferred(p);
 
 	if (migrated)
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread
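
The retry logic above boils down to "attempt a migration at most once per interval, whether or not the previous attempt succeeded", with the next-attempt time re-armed before the attempt is made. A user-space analogue of that throttling pattern, using CLOCK_MONOTONIC in place of jiffies (the names mirror the patch, but this is an illustration only):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

#define RETRY_INTERVAL_NS 1000000000ULL	/* roughly "jiffies + HZ": one second */

static unsigned long long numa_migrate_retry;	/* earliest next attempt */

static unsigned long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Analogue of numa_migrate_preferred(): re-arm the timer, then try. */
static void migrate_preferred(void)
{
	numa_migrate_retry = now_ns() + RETRY_INTERVAL_NS;
	printf("attempting migration\n");
	/* task_numa_migrate() would run here; failure is fine, we retry later */
}

/* Analogue of the check at the end of task_numa_fault(). */
static void on_fault(void)
{
	if (now_ns() > numa_migrate_retry)
		migrate_preferred();
}

int main(void)
{
	int i;

	/* Only the first of these faults triggers a migration attempt;
	 * the rest fall inside the retry interval. */
	for (i = 0; i < 5; i++)
		on_fault();
	return 0;
}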

* [PATCH 63/63] sched: numa: periodically retry task_numa_migrate
@ 2013-10-07 10:29   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

Short spikes of CPU load can lead to a task being migrated
away from its preferred node for temporary reasons.

It is important that the task is migrated back to where it
belongs, in order to avoid migrating too much memory to its
new location, and generally disturbing a task's NUMA location.

This patch fixes NUMA placement for 4 specjbb instances on
a 4 node system. Without this patch, things take longer to
converge, and processes are not always completely on their
own node.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f45dd4c..1d5ea2d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1259,18 +1259,19 @@ static int task_numa_migrate(struct task_struct *p)
 /* Attempt to migrate a task to a CPU on the preferred node. */
 static void numa_migrate_preferred(struct task_struct *p)
 {
-	/* Success if task is already running on preferred CPU */
-	p->numa_migrate_retry = 0;
-	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+	/* This task has no NUMA fault statistics yet */
+	if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults))
 		return;
 
-	/* This task has no NUMA fault statistics yet */
-	if (unlikely(p->numa_preferred_nid == -1))
+	/* Periodically retry migrating the task to the preferred node */
+	p->numa_migrate_retry = jiffies + HZ;
+
+	/* Success if task is already running on preferred CPU */
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
 		return;
 
 	/* Otherwise, try migrate to a CPU on the preferred node */
-	if (task_numa_migrate(p) != 0)
-		p->numa_migrate_retry = jiffies + HZ*5;
+	task_numa_migrate(p);
 }
 
 /*
@@ -1629,8 +1630,11 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 
 	task_numa_placement(p);
 
-	/* Retry task to preferred node migration if it previously failed */
-	if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+	/*
+	 * Retry task to preferred node migration periodically, in case it
+	 * previously failed, or the scheduler moved us.
+	 */
+	if (time_after(jiffies, p->numa_migrate_retry))
 		numa_migrate_preferred(p);
 
 	if (migrated)
-- 
1.8.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* Re: [PATCH 02/63] mm: numa: Document automatic NUMA balancing sysctls
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 12:46     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 12:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 03/63] sched, numa: Comment fixlets
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 12:46     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 12:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Fix a 80 column violation and a PTE vs PMD reference.
> 
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 04/63] mm: numa: Do not account for a hinting fault if we raced
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 12:47     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 12:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> If another task handled a hinting fault in parallel then do not double
> account for it.
> 
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 05/63] mm: Wait for THP migrations to complete during NUMA hinting faults
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 13:55     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 13:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> The locking for migrating THP is unusual. While normal page migration
> prevents parallel accesses using a migration PTE, THP migration relies on
> a combination of the page_table_lock, the page lock and the existence of
> the NUMA hinting PTE to guarantee safety but there is a bug in the scheme.
> 
> If a THP page is currently being migrated and another thread traps a
> fault on the same page it checks if the page is misplaced. If it is not,
> then pmd_numa is cleared. The problem is that it checks if the page is
> misplaced without holding the page lock meaning that the racing thread
> can be migrating the THP when the second thread clears the NUMA bit
> and faults a stale page.
> 
> This patch checks if the page is potentially being migrated and stalls
> using the lock_page if it is potentially being migrated before checking
> if the page is misplaced or not.
> 
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 06/63] mm: Prevent parallel splits during THP migration
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 14:01     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 14:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> THP migrations are serialised by the page lock but on its own that does
> not prevent THP splits. If the page is split during THP migration then
> the pmd_same checks will prevent page table corruption but the unlock page
> and other fix-ups potentially will cause corruption. This patch takes the
> anon_vma lock to prevent parallel splits during migration.
> 
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 07/63] mm: numa: Sanitize task_numa_fault() callsites
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 14:02     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 14:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> There are three callers of task_numa_fault():
> 
>  - do_huge_pmd_numa_page():
>      Accounts against the current node, not the node where the
>      page resides, unless we migrated, in which case it accounts
>      against the node we migrated to.
> 
>  - do_numa_page():
>      Accounts against the current node, not the node where the
>      page resides, unless we migrated, in which case it accounts
>      against the node we migrated to.
> 
>  - do_pmd_numa_page():
>      Accounts not at all when the page isn't migrated, otherwise
>      accounts against the node we migrated towards.
> 
> This seems wrong to me; all three sites should have the same
> semantics; furthermore we should account against where the page
> really is, we already know where the task is.
> 
> So modify all three sites to always account; we did after all receive
> the fault; and always account to where the page is after migration,
> regardless of success.
> 
> They all still differ on when they clear the PTE/PMD; ideally that
> would get sorted too.
> 
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 08/63] mm: Close races between THP migration and PMD numa clearing
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 14:02     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 14:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> THP migration uses the page lock to guard against parallel allocations
> but there are cases like this still open
> 
> Task A						Task B
> do_huge_pmd_numa_page				do_huge_pmd_numa_page
> lock_page
> mpol_misplaced == -1
> unlock_page
> goto clear_pmdnuma
> 						lock_page
> 						mpol_misplaced == 2
> 						migrate_misplaced_transhuge
> pmd = pmd_mknonnuma
> set_pmd_at
> 
> During hours of testing, one crashed with weird errors and while I have
> no direct evidence, I suspect something like the race above happened.
> This patch extends the page lock to being held until the pmd_numa is
> cleared to prevent migration starting in parallel while the pmd_numa is
> being cleared. It also flushes the old pmd entry and orders pagetable
> insertion before rmap insertion.
> 
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 09/63] mm: Account for a THP NUMA hinting update as one PTE update
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 14:02     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 14:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> A THP PMD update is accounted for as 512 pages updated in vmstat.  This is
> a large difference when estimating the cost of automatic NUMA balancing and
> can be misleading when comparing results that had collapsed versus split
> THP. This patch addresses the accounting issue.
> 
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 10/63] mm: Do not flush TLB during protection change if !pte_present && !migration_entry
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 15:12     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 15:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> NUMA PTE scanning is expensive both in terms of the scanning itself and
> the TLB flush if there are any updates. Currently non-present PTEs are
> accounted for as an update and incur a TLB flush where it is only
> necessary for anonymous migration entries. This patch addresses the
> problem and should reduce TLB flushes.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 12/63] mm: numa: Do not migrate or account for hinting faults on the zero page
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 17:10     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 17:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> The zero page is not replicated between nodes and is often shared between
> processes. The data is read-only and likely to be cached in local CPUs
> if heavily accessed meaning that the remote memory access cost is less
> of a concern. This patch prevents trapping faults on the zero pages. For
> tasks using the zero page this will reduce the number of PTE updates,
> TLB flushes and hinting faults.
> 
> [peterz@infradead.org: Correct use of is_huge_zero_page]
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 13/63] sched: numa: Mitigate chance that same task always updates PTEs
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 17:24     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 17:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> With a trace_printk("working\n"); right after the cmpxchg in
> task_numa_work() we can see that of a 4-thread process, it's always the
> same task winning the race and doing the protection change.
> 
> This is a problem since the task doing the protection change has a
> penalty for taking faults -- it is busy when marking the PTEs. If it's
> always the same task the ->numa_faults[] get severely skewed.
> 
> Avoid this by delaying the task doing the protection change such that
> it is unlikely to win the privilege again.

> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 14/63] sched: numa: Continue PTE scanning even if migrate rate limited
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 17:24     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 17:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Avoiding marking PTEs pte_numa because a particular NUMA node is migrate rate
> limited seems like a bad idea. Even if this node can't migrate any more, other
> nodes might, and we want up-to-date information to make balancing decisions.
> We already rate limit the actual migrations; this should leave enough
> bandwidth to allow the non-migrating scanning. I think it's important we
> keep up-to-date information if we're going to do placement based on it.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 15/63] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node"
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 17:42     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 17:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> PTE scanning and NUMA hinting fault handling is expensive so commit
> 5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
> on a new node") deferred the PTE scan until a task had been scheduled on
> another node. The problem is that in the purely shared memory case that
> this may never happen and no NUMA hinting fault information will be
> captured. We are not ruling out the possibility that something better
> can be done here but for now, this patch needs to be reverted and depend
> entirely on the scan_delay to avoid punishing short-lived processes.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 16/63] sched: numa: Initialise numa_next_scan properly
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 17:44     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 17:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> Scan delay logic and resets are currently initialised to start scanning
> immediately instead of delaying properly. Initialise them properly at
> fork time and catch when a new mm has been allocated.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 17/63] sched: Set the scan rate proportional to the memory usage of the task being scanned
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 17:44     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 17:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> The NUMA PTE scan rate is controlled with a combination of the
> numa_balancing_scan_period_min, numa_balancing_scan_period_max and
> numa_balancing_scan_size. This scan rate is independent of the size
> of the task and as an aside it is further complicated by the fact that
> numa_balancing_scan_size controls how many pages are marked pte_numa and
> not how much virtual memory is scanned.
> 
> In combination, it is almost impossible to meaningfully tune the min and
> max scan periods and reasoning about performance is complex when the time
> to complete a full scan is partially a function of the task's memory
> size. This patch alters the semantics of the min and max tunables to be
> about tuning the length of time it takes to complete a scan of a task's occupied
> virtual address space. Conceptually this is a lot easier to understand. There
> is a "sanity" check to ensure the scan rate is never extremely fast based on
> the amount of virtual memory that should be scanned in a second. The default
> of 2.5G seems arbitrary but it is to have the maximum scan rate after the
> patch roughly match the maximum scan rate before the patch was applied.
> 
> On a similar note, numa_scan_period is in milliseconds and not
> jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies
> to numa_scan_period means that the rate at which scanning slows depends on HZ, which
> is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread
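
To make the new meaning of the tunables concrete, here is a small stand-alone sketch of the clamp described in the changelog above (the helper name, the rounded 2500MB/s figure and the example sizes are illustrative assumptions, not the kernel code):

#include <stdio.h>

/*
 * Given how quickly the task's occupied address space is asked to be
 * covered, never let the implied scan rate exceed a fixed ceiling.
 */
unsigned long clamped_scan_period(unsigned long task_vsz_mb,
                                  unsigned long requested_period_ms)
{
        const unsigned long max_mb_per_sec = 2500;      /* ~2.5GB/s ceiling */
        unsigned long floor_ms = (task_vsz_mb * 1000) / max_mb_per_sec;

        return requested_period_ms > floor_ms ? requested_period_ms : floor_ms;
}

int main(void)
{
        /* A 40GB task asked to be covered every 1000ms would imply 40GB/s;
         * the clamp stretches the period to roughly 16 seconds instead. */
        printf("%lums\n", clamped_scan_period(40 * 1024, 1000));
        /* A 1GB task is unaffected by the ceiling. */
        printf("%lums\n", clamped_scan_period(1024, 1000));
        return 0;
}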

* Re: [PATCH 18/63] sched: numa: Slow scan rate if no NUMA hinting faults are being recorded
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 18:02     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> NUMA PTE scanning slows if a NUMA hinting fault was trapped and no page
> was migrated. For long-lived but idle processes there may be no faults
> but the scan rate will be high and just waste CPU. This patch will slow
> the scan rate for processes that are not trapping faults.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread
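
The back-off described above has roughly this shape; the sketch below is a hypothetical stand-alone helper, not the actual fair.c heuristic, which differs in detail:

/*
 * If the last scan window recorded no hinting faults, lengthen the next
 * window up to a cap; any recorded fault pulls the period back towards
 * its minimum so an active task is sampled more often again.
 */
unsigned int next_scan_period_ms(unsigned int period_ms,
                                 unsigned int min_ms, unsigned int max_ms,
                                 unsigned long faults_in_window)
{
        if (faults_in_window == 0) {
                period_ms *= 2;
                if (period_ms > max_ms)
                        period_ms = max_ms;
        } else if (period_ms > min_ms) {
                period_ms = min_ms;
        }
        return period_ms;
}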

* Re: [PATCH 19/63] sched: Track NUMA hinting faults on per-node basis
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 18:02     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> This patch tracks what nodes numa hinting faults were incurred on.
> This information is later used to schedule a task on the node storing
> the pages most frequently faulted by the task.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 20/63] sched: Select a preferred node with the most numa hinting faults
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 18:04     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> This patch selects a preferred node for a task to run on based on the
> NUMA hinting faults. This information is later used to migrate tasks
> towards the node during balancing.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 21/63] sched: Update NUMA hinting faults once per scan
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-07 18:39     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:28 AM, Mel Gorman wrote:
> NUMA hinting fault counts and placement decisions are both recorded in the
> same array which distorts the samples in an unpredictable fashion. The values
> linearly accumulate during the scan and then decay creating a sawtooth-like
> pattern in the per-node counts. It also means that placement decisions are
> time sensitive. At best it means that it is very difficult to state that
> the buffer holds a decaying average of past faulting behaviour. At worst,
> it can confuse the load balancer if it sees one node with an artificially high
> count due to very recent faulting activity and may create a bouncing effect.
> 
> This patch adds a second array. numa_faults stores the historical data
> which is used for placement decisions. numa_faults_buffer holds the
> fault activity during the current scan window. When the scan completes,
> numa_faults decays and the values from numa_faults_buffer are copied
> across.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread
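
The end-of-window fold described above can be pictured with a short stand-alone sketch (the four-node array and the halving decay are illustrative assumptions rather than the kernel's exact code):

#define NR_NODES 4      /* assumed small machine for the example */

/*
 * When a scan window completes: age the long-term per-node history,
 * fold in the counts gathered during the window, then clear the buffer
 * so the next window starts from zero.
 */
void fold_fault_stats(unsigned long numa_faults[NR_NODES],
                      unsigned long numa_faults_buffer[NR_NODES])
{
        int nid;

        for (nid = 0; nid < NR_NODES; nid++) {
                numa_faults[nid] /= 2;                   /* decay old history */
                numa_faults[nid] += numa_faults_buffer[nid];
                numa_faults_buffer[nid] = 0;             /* fresh window */
        }
}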

* Re: [PATCH 22/63] sched: Favour moving tasks towards the preferred node
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 18:39     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> This patch favours moving tasks towards the NUMA node that recorded a higher
> number of NUMA faults during active load balancing.  Ideally this is
> self-reinforcing as the longer the task runs on that node, the more faults
> it should incur, causing task_numa_placement to keep the task running on that
> node. In reality a big weakness is that the node's CPUs can be overloaded
> and it would be more efficient to queue tasks on an idle node and migrate
> to the new node. This would require additional smarts in the balancer so
> for now the balancer will simply prefer to place the task on the preferred
> node for a number of PTE scans, which is controlled by the numa_balancing_settle_count
> sysctl. Once the settle_count number of scans has completed the scheduler
> is free to place the task on an alternative node if the load is imbalanced.
> 
> [srikar@linux.vnet.ibm.com: Fixed statistics]
> [peterz@infradead.org: Tunable and use higher faults instead of preferred]
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread
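
As a toy illustration of the bias described above (hypothetical helper, not the scheduler code): with per-node fault counts available, moving the task is only treated as an improvement when the destination node has recorded more of its hinting faults than the source node.

#include <stdbool.h>

/* Favour a destination CPU whose node has seen more of this task's
 * NUMA hinting faults than the node it currently runs on. */
bool move_improves_locality(const unsigned long *numa_faults,
                            int src_nid, int dst_nid)
{
        return numa_faults[dst_nid] > numa_faults[src_nid];
}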

* Re: [PATCH 23/63] sched: Resist moving tasks towards nodes with fewer hinting faults
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 18:40     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Just as "sched: Favour moving tasks towards the preferred node" favours
> moving tasks towards nodes with a higher number of recorded NUMA hinting
> faults, this patch resists moving tasks towards nodes with lower faults.
> 
> [mgorman@suse.de: changelog]
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 24/63] sched: Reschedule task on preferred NUMA node once selected
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 18:40     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> A preferred node is selected based on the node the most NUMA hinting
> faults were incurred on. There is no guarantee that the task is running
> on that node at the time so this patch reschedules the task to run on
> the most idle CPU of the selected node when selected. This avoids
> waiting for the balancer to make a decision.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 25/63] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 18:41     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Ideally it would be possible to distinguish between NUMA hinting faults
> that are private to a task and those that are shared.  This patch prepares
> infrastructure for separately accounting shared and private faults by
> allocating the necessary buffers and passing in relevant information. For
> now, all faults are treated as private and detection will be introduced
> later.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 26/63] sched: Check current->mm before allocating NUMA faults
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 18:41     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> task_numa_placement checks current->mm, but only after buffers for faults
> have already been uselessly allocated. Move the check earlier.
> 
> [peterz@infradead.org: Identified the problem]
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 26/63] sched: Check current->mm before allocating NUMA faults
@ 2013-10-07 18:41     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> task_numa_placement checks current->mm, but only after buffers for faults
> have already been uselessly allocated. Move the check earlier.
> 
> [peterz@infradead.org: Identified the problem]
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 27/63] mm: numa: Scan pages with elevated page_mapcount
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 18:43     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Currently automatic NUMA balancing is unable to distinguish between falsely
> shared and private pages except by ignoring pages with an elevated
> page_mapcount entirely. This avoids shared pages bouncing between the
> nodes whose tasks are using them, but it also ignores quite a lot of data.
> 
> This patch kicks away the training wheels in preparation for adding support
> for identifying shared/private pages. The ordering is so that the impact of
> the shared/private detection can be easily measured. Note that the patch
> does not migrate shared, file-backed pages within VMAs marked VM_EXEC as
> these are generally shared library pages. Migrating such pages is not
> beneficial as there is an expectation that they are read-shared between
> caches and that iTLB and iCache pressure is generally low.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 27/63] mm: numa: Scan pages with elevated page_mapcount
@ 2013-10-07 18:43     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Currently automatic NUMA balancing is unable to distinguish between falsely
> shared and private pages except by ignoring pages with an elevated
> page_mapcount entirely. This avoids shared pages bouncing between the
> nodes whose tasks are using them, but it also ignores quite a lot of data.
> 
> This patch kicks away the training wheels in preparation for adding support
> for identifying shared/private pages. The ordering is so that the impact of
> the shared/private detection can be easily measured. Note that the patch
> does not migrate shared, file-backed pages within VMAs marked VM_EXEC as
> these are generally shared library pages. Migrating such pages is not
> beneficial as there is an expectation that they are read-shared between
> caches and that iTLB and iCache pressure is generally low.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread
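
A hedged sketch of the migration filter described above: pages with an
elevated mapcount are now considered for migration, except when they sit in
a file-backed, executable mapping, which is assumed to be shared library
text. The structs and the VM_EXEC value are simplified stand-ins for the
real VMA and page state.

#include <stdbool.h>

#define MODEL_VM_EXEC 0x4UL             /* illustrative flag value */

struct model_vma {
        unsigned long flags;
        bool file_backed;
};

struct model_page {
        int mapcount;
};

/* Should a NUMA hinting fault migrate this page? */
static bool numa_migrate_page(const struct model_page *page,
                              const struct model_vma *vma)
{
        /* Shared library text: read-shared and cache-hot, leave it alone. */
        if (page->mapcount > 1 && vma->file_backed &&
            (vma->flags & MODEL_VM_EXEC))
                return false;
        return true;
}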

* Re: [PATCH 28/63] sched: Remove check that skips small VMAs
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 18:44     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> task_numa_work skips small VMAs. At the time the logic was to reduce the
> scanning overhead which was considerable. It is a dubious hack at best.
> It would make much more sense to cache where faults have been observed
> and only rescan those regions during subsequent PTE scans. Remove this
> hack as motivation to do it properly in the future.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 28/63] sched: Remove check that skips small VMAs
@ 2013-10-07 18:44     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> task_numa_work skips small VMAs. At the time the logic was to reduce the
> scanning overhead which was considerable. It is a dubious hack at best.
> It would make much more sense to cache where faults have been observed
> and only rescan those regions during subsequent PTE scans. Remove this
> hack as motivation to do it properly in the future.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 29/63] sched: Set preferred NUMA node based on number of private faults
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 18:45     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Ideally it would be possible to distinguish between NUMA hinting faults that
> are private to a task and those that are shared. If treated identically
> there is a risk that shared pages bounce between nodes depending on
> the order they are referenced by tasks. Ultimately what is desirable is
> that task private pages remain local to the task while shared pages are
> interleaved between sharing tasks running on different nodes to give good
> average performance. This is further complicated by THP as even
> applications that partition their data may not be partitioning on a huge
> page boundary.
> 
> To start with, this patch assumes that multi-threaded or multi-process
> applications partition their data and that in general the private accesses
> are more important for cpu->memory locality in the general case. Also,
> no new infrastructure is required to treat private pages properly but
> interleaving for shared pages requires additional infrastructure.
> 
> To detect private accesses the pid of the last accessing task is required
> but the storage requirements are high. This patch borrows heavily from
> Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
> to encode some bits from the last accessing task in the page flags as
> well as the node information. Collisions will occur but it is better than
> just depending on the node information. Node information is then used to
> determine if a page needs to migrate. The PID information is used to detect
> private/shared accesses. The preferred NUMA node is selected based on where
> the maximum number of approximately private faults were measured. Shared
> faults are not taken into consideration for a few reasons.
> 
> First, if there are many tasks sharing the page then they'll all move
> towards the same node. The node will be compute overloaded and then
> scheduled away later only to bounce back again. Alternatively the shared
> tasks would just bounce around nodes because the fault information is
> effectively noise. Either way accounting for shared faults the same as
> private faults can result in lower performance overall.
> 
> The second reason is based on a hypothetical workload that has a small
> number of very important, heavily accessed private pages but a large shared
> array. The shared array would dominate the number of faults and be selected
> as a preferred node even though it's the wrong decision.
> 
> The third reason is that multiple threads in a process will race each
> other to fault the shared page making the fault information unreliable.
> 
> [riel@redhat.com: Fix compilation error when !NUMA_BALANCING]
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 29/63] sched: Set preferred NUMA node based on number of private faults
@ 2013-10-07 18:45     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Ideally it would be possible to distinguish between NUMA hinting faults that
> are private to a task and those that are shared. If treated identically
> there is a risk that shared pages bounce between nodes depending on
> the order they are referenced by tasks. Ultimately what is desirable is
> that task private pages remain local to the task while shared pages are
> interleaved between sharing tasks running on different nodes to give good
> average performance. This is further complicated by THP as even
> applications that partition their data may not be partitioning on a huge
> page boundary.
> 
> To start with, this patch assumes that multi-threaded or multi-process
> applications partition their data and that in general the private accesses
> are more important for cpu->memory locality in the general case. Also,
> no new infrastructure is required to treat private pages properly but
> interleaving for shared pages requires additional infrastructure.
> 
> To detect private accesses the pid of the last accessing task is required
> but the storage requirements are high. This patch borrows heavily from
> Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
> to encode some bits from the last accessing task in the page flags as
> well as the node information. Collisions will occur but it is better than
> just depending on the node information. Node information is then used to
> determine if a page needs to migrate. The PID information is used to detect
> private/shared accesses. The preferred NUMA node is selected based on where
> the maximum number of approximately private faults were measured. Shared
> faults are not taken into consideration for a few reasons.
> 
> First, if there are many tasks sharing the page then they'll all move
> towards the same node. The node will be compute overloaded and then
> scheduled away later only to bounce back again. Alternatively the shared
> tasks would just bounce around nodes because the fault information is
> effectively noise. Either way accounting for shared faults the same as
> private faults can result in lower performance overall.
> 
> The second reason is based on a hypothetical workload that has a small
> number of very important, heavily accessed private pages but a large shared
> array. The shared array would dominate the number of faults and be selected
> as a preferred node even though it's the wrong decision.
> 
> The third reason is that multiple threads in a process will race each
> other to fault the shared page making the fault information unreliable.
> 
> [riel@redhat.com: Fix compilation error when !NUMA_BALANCING]
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread
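
The last nid/pid tracking described in patch 29 can be modelled with a small
packing helper like the one below. The 8-bit pid field is an arbitrary
choice for this sketch; the point is only that a truncated pid is stored
next to the node and that collisions are tolerated as noise.

#define LAST_PID_BITS 8
#define LAST_PID_MASK ((1 << LAST_PID_BITS) - 1)

static inline int nidpid_encode(int nid, int pid)
{
        return (nid << LAST_PID_BITS) | (pid & LAST_PID_MASK);
}

static inline int nidpid_to_nid(int nidpid)
{
        return nidpid >> LAST_PID_BITS;
}

static inline int nidpid_to_pid(int nidpid)
{
        return nidpid & LAST_PID_MASK;
}

/* A fault looks private if the same (truncated) pid faulted here last. */
static inline int fault_is_private(int last_nidpid, int this_pid)
{
        return nidpid_to_pid(last_nidpid) == (this_pid & LAST_PID_MASK);
}

The node half of the encoding drives the migrate-or-not decision, while the
pid half only classifies the fault as private or shared for the placement
statistics.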

* Re: [PATCH 32/63] sched: Avoid overloading CPUs on a preferred NUMA node
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 18:58     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
> find_idlest_cpu_node has two critical limitations. It does not take the
> scheduling class into account when calculating the load and it is unsuitable
> for use when comparing loads between NUMA nodes.
> 
> task_numa_find_cpu uses similar load calculations to wake_affine() when
> selecting the least loaded CPU within a scheduling domain common to the
> source and destination nodes. It avoids causing CPU load imbalances in
> the machine by refusing to migrate if the relative load on the target
> CPU is higher than the source CPU.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 32/63] sched: Avoid overloading CPUs on a preferred NUMA node
@ 2013-10-07 18:58     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
> find_idlest_cpu_node has two critical limitations. It does not take the
> scheduling class into account when calculating the load and it is unsuitable
> for use when comparing loads between NUMA nodes.
> 
> task_numa_find_cpu uses similar load calculations to wake_affine() when
> selecting the least loaded CPU within a scheduling domain common to the
> source and destination nodes. It avoids causing CPU load imbalances in
> the machine by refusing to migrate if the relative load on the target
> CPU is higher than the source CPU.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 33/63] sched: Retry migration of tasks to CPU on a preferred node
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 18:58     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> When a preferred node is selected for a task there is an attempt to migrate
> the task to a CPU there. This may fail, in which case the task will only
> migrate if the active load balancer takes action. This may never happen if
> the conditions are not right. This patch will check at NUMA hinting fault
> time if another attempt should be made to migrate the task. It will only
> make an attempt once every five seconds.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 33/63] sched: Retry migration of tasks to CPU on a preferred node
@ 2013-10-07 18:58     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> When a preferred node is selected for a task there is an attempt to migrate
> the task to a CPU there. This may fail, in which case the task will only
> migrate if the active load balancer takes action. This may never happen if
> the conditions are not right. This patch will check at NUMA hinting fault
> time if another attempt should be made to migrate the task. It will only
> make an attempt once every five seconds.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread
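
The periodic retry in patch 33 amounts to a cheap check at hinting fault
time. The sketch below is a simplified model: the field names and the
five-second interval follow the description above, but the time handling is
reduced to plain integers.

#define MIGRATE_RETRY_INTERVAL 5        /* seconds */

struct retry_task {
        int numa_preferred_nid;
        int current_nid;
        long numa_migrate_retry;        /* earliest time for the next try */
};

static void numa_maybe_retry_migrate(struct retry_task *t, long now)
{
        if (t->current_nid == t->numa_preferred_nid)
                return;                 /* already where we want to be */
        if (now < t->numa_migrate_retry)
                return;                 /* too soon since the last attempt */

        t->numa_migrate_retry = now + MIGRATE_RETRY_INTERVAL;
        /* ...attempt to move the task to a CPU on numa_preferred_nid... */
}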

* Re: [PATCH 35/63] sched: numa: Do not trap hinting faults for shared libraries
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:04     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> NUMA hinting faults will not migrate a shared executable page mapped by
> multiple processes on the grounds that the data is probably in the CPU
> cache already and the page may just bounce between tasks running on multiple
> nodes. Even if the migration is avoided, there is still the overhead of
> trapping the fault, updating the statistics, making scheduler placement
> decisions based on the information etc. If we are never going to migrate
> the page, it is overhead for no gain and, worse, a process may be placed on
> a sub-optimal node for shared executable pages. This patch avoids trapping
> faults for shared libraries entirely.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>
-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 35/63] sched: numa: Do not trap hinting faults for shared libraries
@ 2013-10-07 19:04     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> NUMA hinting faults will not migrate a shared executable page mapped by
> multiple processes on the grounds that the data is probably in the CPU
> cache already and the page may just bounce between tasks running on multiple
> nodes. Even if the migration is avoided, there is still the overhead of
> trapping the fault, updating the statistics, making scheduler placement
> decisions based on the information etc. If we are never going to migrate
> the page, it is overhead for no gain and, worse, a process may be placed on
> a sub-optimal node for shared executable pages. This patch avoids trapping
> faults for shared libraries entirely.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>
-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread
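
Where patch 27 declined to migrate shared library pages, patch 35 avoids
even trapping the faults. A scanner-side filter along these lines is all
that is needed; the struct below is a simplified stand-in for the VMA
attributes the real check would look at.

#include <stdbool.h>

struct scan_vma {
        bool file_backed;
        bool shared;
        bool executable;
};

/* Is this VMA worth setting up NUMA hinting faults for at all? */
static bool vma_is_scannable(const struct scan_vma *vma)
{
        /* Likely shared library text: the faults would never migrate it. */
        if (vma->file_backed && vma->shared && vma->executable)
                return false;
        return true;
}

Skipping the VMA up front saves the fault trap, the statistics update and
the placement work for pages that were never going to move anyway.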

* Re: [PATCH 36/63] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:06     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Base page PMD faulting is meant to batch handle NUMA hinting faults from
> PTEs. However, even if no PTE faults would ever be handled within a
> range the kernel still traps PMD hinting faults. This patch avoids the
> overhead.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 36/63] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults
@ 2013-10-07 19:06     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Base page PMD faulting is meant to batch handle NUMA hinting faults from
> PTEs. However, even if no PTE faults would ever be handled within a
> range the kernel still traps PMD hinting faults. This patch avoids the
> overhead.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 38/63] sched: Introduce migrate_swap()
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:06     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Use the new stop_two_cpus() to implement migrate_swap(), a function that
> flips two tasks between their respective cpus.
> 
> I'm fairly sure there's a less crude way than employing the stop_two_cpus()
> method, but everything I tried either got horribly fragile and/or complex. So
> keep it simple for now.
> 
> The notable detail is how we 'migrate' tasks that aren't runnable
> anymore. We'll make it appear like we migrated them before they went to
> sleep. The sole difference is the previous cpu in the wakeup path, so we
> override this.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 38/63] sched: Introduce migrate_swap()
@ 2013-10-07 19:06     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Use the new stop_two_cpus() to implement migrate_swap(), a function that
> flips two tasks between their respective cpus.
> 
> I'm fairly sure there's a less crude way than employing the stop_two_cpus()
> method, but everything I tried either got horribly fragile and/or complex. So
> keep it simple for now.
> 
> The notable detail is how we 'migrate' tasks that aren't runnable
> anymore. We'll make it appear like we migrated them before they went to
> sleep. The sole difference is the previous cpu in the wakeup path, so we
> override this.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 39/63] sched: numa: Use a system-wide search to find swap/migration candidates
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:07     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> This patch implements a system-wide search for swap/migration candidates
> based on total NUMA hinting faults. It has a balance limit, however it
> doesn't properly consider total node balance.
> 
> In the old scheme a task selected a preferred node based on the highest
> number of private faults recorded on the node. In this scheme, the preferred
> node is based on the total number of faults. If the preferred node for a
> task changes then task_numa_migrate will search the whole system looking
> for tasks to swap with that would improve both the overall compute
> balance and minimise the expected number of remote NUMA hinting faults.
> 
> Note that there is no guarantee that the node the source task is placed
> on by task_numa_migrate() has any relationship to the newly selected
> task->numa_preferred_nid due to compute overloading.
> 
> [riel@redhat.com: Do not swap with tasks that cannot run on source cpu]
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 39/63] sched: numa: Use a system-wide search to find swap/migration candidates
@ 2013-10-07 19:07     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> This patch implements a system-wide search for swap/migration candidates
> based on total NUMA hinting faults. It has a balance limit, however it
> doesn't properly consider total node balance.
> 
> In the old scheme a task selected a preferred node based on the highest
> number of private faults recorded on the node. In this scheme, the preferred
> node is based on the total number of faults. If the preferred node for a
> task changes then task_numa_migrate will search the whole system looking
> for tasks to swap with that would improve both the overall compute
> balance and minimise the expected number of remote NUMA hinting faults.
> 
> Note that there is no guarantee that the node the source task is placed
> on by task_numa_migrate() has any relationship to the newly selected
> task->numa_preferred_nid due to compute overloading.
> 
> [riel@redhat.com: Do not swap with tasks that cannot run on source cpu]
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread
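
A compressed model of the system-wide candidate search in patch 39: walk
every CPU, estimate how many faults would become local if the task moved
there (and how many the displaced task would gain or lose in a swap), and
keep the best positive improvement. All the types, sizes and the scoring
function are assumptions made for this sketch.

#define MODEL_CPUS  16
#define MODEL_NODES 4

struct search_task {
        unsigned long faults[MODEL_NODES];      /* hinting faults per node */
};

struct cpu_slot {
        int nid;                        /* node this CPU belongs to */
        struct search_task *curr;       /* task running there, or NULL */
};

static struct cpu_slot cpus[MODEL_CPUS];

static long score(const struct search_task *t, int nid)
{
        return (long)t->faults[nid];
}

/* Best destination CPU for @t, currently on @src_nid, or -1 if none helps. */
static int find_swap_candidate(const struct search_task *t, int src_nid)
{
        long best_gain = 0;
        int cpu, best_cpu = -1;

        for (cpu = 0; cpu < MODEL_CPUS; cpu++) {
                int dst_nid = cpus[cpu].nid;
                /* Gain for the moving task... */
                long gain = score(t, dst_nid) - score(t, src_nid);

                /* ...plus the gain or loss of the task it would swap with. */
                if (cpus[cpu].curr)
                        gain += score(cpus[cpu].curr, src_nid) -
                                score(cpus[cpu].curr, dst_nid);

                if (gain > best_gain) {
                        best_gain = gain;
                        best_cpu = cpu;
                }
        }
        return best_cpu;
}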

* Re: [PATCH 40/63] sched: numa: Favor placing a task on the preferred node
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:07     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> A task's preferred node is selected based on the number of faults
> recorded for a node but the actual task_numa_migrate() conducts a global
> search regardless of the preferred nid. This patch checks if the
> preferred nid has capacity and if so, searches for a CPU within that
> node. This avoids a global search when the preferred node is not
> overloaded.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 40/63] sched: numa: Favor placing a task on the preferred node
@ 2013-10-07 19:07     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> A task's preferred node is selected based on the number of faults
> recorded for a node but the actual task_numa_migrate() conducts a global
> search regardless of the preferred nid. This patch checks if the
> preferred nid has capacity and if so, searches for a CPU within that
> node. This avoids a global search when the preferred node is not
> overloaded.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread
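
Patch 40 is essentially a fast path in front of that search. In rough form,
and with node_has_capacity(), find_cpu_on_node() and find_cpu_system_wide()
as illustrative stubs rather than real kernel functions:

struct placed_task {
        int preferred_nid;
};

/* Stubs standing in for the real capacity check and CPU searches. */
static int node_has_capacity(int nid) { (void)nid; return 1; }
static int find_cpu_on_node(const struct placed_task *t, int nid)
{ (void)t; (void)nid; return 0; }
static int find_cpu_system_wide(const struct placed_task *t)
{ (void)t; return 0; }

static int task_numa_place(const struct placed_task *t)
{
        if (node_has_capacity(t->preferred_nid)) {
                int cpu = find_cpu_on_node(t, t->preferred_nid);

                if (cpu >= 0)
                        return cpu;
        }
        /* Preferred node overloaded or full: fall back to a global search. */
        return find_cpu_system_wide(t);
}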

* Re: [PATCH 42/63] mm: numa: Change page last {nid,pid} into {cpu,pid}
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:08     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Change the per page last fault tracking to use cpu,pid instead of
> nid,pid. This will allow us to try and lookup the alternate task more
> easily. Note that even though it is the cpu that is stored in the page
> flags, the mpol_misplaced decision is still based on the node.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 42/63] mm: numa: Change page last {nid,pid} into {cpu,pid}
@ 2013-10-07 19:08     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Change the per page last fault tracking to use cpu,pid instead of
> nid,pid. This will allow us to try and lookup the alternate task more
> easily. Note that even though it is the cpu that is stored in the page
> flags, the mpol_misplaced decision is still based on the node.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 43/63] sched: numa: Use {cpu, pid} to create task groups for shared faults
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:09     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> While parallel applications tend to align their data on the cache
> boundary, they tend not to align on the page or THP boundary.
> Consequently tasks that partition their data can still "false-share"
> pages presenting a problem for optimal NUMA placement.
> 
> This patch uses NUMA hinting faults to chain tasks together into
> numa_groups. As well as storing the NID a task was running on when
> accessing a page, a truncated representation of the faulting PID is
> stored. If subsequent faults are from different PIDs it is reasonable
> to assume that those two tasks share a page and are candidates for
> being grouped together. Note that this patch makes no scheduling
> decisions based on the grouping information.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 43/63] sched: numa: Use {cpu, pid} to create task groups for shared faults
@ 2013-10-07 19:09     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> While parallel applications tend to align their data on the cache
> boundary, they tend not to align on the page or THP boundary.
> Consequently tasks that partition their data can still "false-share"
> pages presenting a problem for optimal NUMA placement.
> 
> This patch uses NUMA hinting faults to chain tasks together into
> numa_groups. As well as storing the NID a task was running on when
> accessing a page, a truncated representation of the faulting PID is
> stored. If subsequent faults are from different PIDs it is reasonable
> to assume that those two tasks share a page and are candidates for
> being grouped together. Note that this patch makes no scheduling
> decisions based on the grouping information.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread
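
The grouping in patch 43 boils down to: when a task takes a hinting fault on
a page whose last recorded {cpu, pid} belongs to a different task, put the
two tasks in the same numa_group. The toy model below assumes every task
already owns a (possibly single-member) group and ignores reference
counting, locking and group merging.

struct numa_group_model {
        int id;
        int nr_tasks;
};

struct grouped_task {
        struct numa_group_model *group;
};

/* @self faulted on a page last touched by @other: share a group. */
static void join_groups(struct grouped_task *self, struct grouped_task *other)
{
        struct numa_group_model *src = self->group;
        struct numa_group_model *dst = other->group;

        if (src == dst)
                return;

        /* Keep it simple here: always move self into the other group. */
        src->nr_tasks--;
        dst->nr_tasks++;
        self->group = dst;
}

As the changelog notes, this only builds up the grouping information;
nothing schedules on it yet.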

* Re: [PATCH 44/63] sched: numa: Report a NUMA task group ID
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:09     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> It is desirable to model from userspace how the scheduler groups tasks
> over time. This patch adds an ID to the numa_group and reports it via
> /proc/PID/status.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 44/63] sched: numa: Report a NUMA task group ID
@ 2013-10-07 19:09     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> It is desirable to model from userspace how the scheduler groups tasks
> over time. This patch adds an ID to the numa_group and reports it via
> /proc/PID/status.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 46/63] mm: numa: Do not group on RO pages
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:10     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> And here's a little something to make sure not the whole world ends up
> in a single group.
> 
> As while we don't migrate shared executable pages, we do scan/fault on
> them. And since everybody links to libc, everybody ends up in the same
> group.
> 
> [riel@redhat.com: mapcount 1]
> Suggested-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 46/63] mm: numa: Do not group on RO pages
@ 2013-10-07 19:10     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> And here's a little something to make sure not the whole world ends up
> in a single group.
> 
> As while we don't migrate shared executable pages, we do scan/fault on
> them. And since everybody links to libc, everybody ends up in the same
> group.
> 
> [riel@redhat.com: mapcount 1]
> Suggested-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread
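
The "do not group on RO pages" rule can be expressed as a one-line filter on
the fault: only let it contribute to grouping when the page is writable or
mapped exactly once. The struct is an assumed stand-in for the bits of fault
state the real check inspects.

#include <stdbool.h>

struct fault_info {
        bool page_writable;
        int mapcount;
};

/* Should this hinting fault be allowed to join tasks into a group? */
static bool fault_counts_for_grouping(const struct fault_info *f)
{
        return f->page_writable || f->mapcount == 1;
}

Read-only text that everybody maps, libc being the obvious example, is
exactly what this keeps from pulling every task into one giant group.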

* Re: [PATCH 47/63] mm: numa: Do not batch handle PMD pages
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:11     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> With the THP migration races closed it is still possible to occasionally
> see corruption. The problem is related to handling PMD pages in batch.
> When a page fault is handled it can be assumed that the page being
> faulted will also be flushed from the TLB. The same flushing does not
> happen when handling PMD pages in batch. Fixing it is straightforward but
> there are a number of reasons not to:
> 
> 1. Multiple TLB flushes may have to be sent depending on what pages get
>    migrated
> 2. The handling of PMDs in batch means that faults get accounted to
>    the task that is handling the fault. While care is taken to only
>    mark PMDs where the last CPU and PID match it can still have problems
>    due to PID truncation when matching PIDs.
> 3. Batching on the PMD level may reduce faults but setting pmd_numa
>    requires taking a heavy lock that can contend with THP migration
>    and handling the fault requires the release/acquisition of the PTL
>    for every page migrated. It's still pretty heavy.
> 
> PMD batch handling is not something that people ever have been happy
> with. This patch removes it and later patches will deal with the
> additional fault overhead using more intelligent migrate rate adaptation.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 47/63] mm: numa: Do not batch handle PMD pages
@ 2013-10-07 19:11     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> With the THP migration races closed it is still possible to occasionally
> see corruption. The problem is related to handling PMD pages in batch.
> When a page fault is handled it can be assumed that the page being
> faulted will also be flushed from the TLB. The same flushing does not
> happen when handling PMD pages in batch. Fixing it is straightforward but
> there are a number of reasons not to:
> 
> 1. Multiple TLB flushes may have to be sent depending on what pages get
>    migrated
> 2. The handling of PMDs in batch means that faults get accounted to
>    the task that is handling the fault. While care is taken to only
>    mark PMDs where the last CPU and PID match it can still have problems
>    due to PID truncation when matching PIDs.
> 3. Batching on the PMD level may reduce faults but setting pmd_numa
>    requires taking a heavy lock that can contend with THP migration
>    and handling the fault requires the release/acquisition of the PTL
>    for every page migrated. It's still pretty heavy.
> 
> PMD batch handling is not something that people ever have been happy
> with. This patch removes it and later patches will deal with the
> additional fault overhead using more intelligent migrate rate adaptation.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 51/63] sched: numa: Prevent parallel updates to group stats during placement
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:13     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Having multiple tasks in a group go through task_numa_placement
> simultaneously can lead to a task picking a wrong node to run on, because
> the group stats may be in the middle of an update. This patch avoids
> parallel updates by holding the numa_group lock during placement
> decisions.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 51/63] sched: numa: Prevent parallel updates to group stats during placement
@ 2013-10-07 19:13     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Having multiple tasks in a group go through task_numa_placement
> simultaneously can lead to a task picking a wrong node to run on, because
> the group stats may be in the middle of an update. This patch avoids
> parallel updates by holding the numa_group lock during placement
> decisions.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 52/63] sched: numa: add debugging
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:13     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> From: Ingo Molnar <mingo@kernel.org>
> 
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/n/tip-5giqjcqnc93a89q01ymtjxpr@git.kernel.org

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 52/63] sched: numa: add debugging
@ 2013-10-07 19:13     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> From: Ingo Molnar <mingo@kernel.org>
> 
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/n/tip-5giqjcqnc93a89q01ymtjxpr@git.kernel.org

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 55/63] sched: numa: Avoid migrating tasks that are placed on their preferred node
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:14     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> This patch classifies scheduler domains and runqueues into types depending
> on the number of tasks that care about their NUMA placement and the number
> that are currently running on their preferred node. The types are
> 
> regular: There are tasks running that do not care about their NUMA
> 	placement.
> 
> remote: There are tasks running that care about their placement but are
> 	currently running on a node remote to their ideal placement
> 
> all: No distinction
> 
> To implement this the patch tracks the number of tasks that are optimally
> NUMA placed (rq->nr_preferred_running) and the number of tasks running
> that care about their placement (nr_numa_running). The load balancer
> uses this information to avoid migrating ideally placed NUMA tasks as long
> as better options for load balancing exist. For example, it will not
> consider balancing between a group whose tasks are all perfectly placed
> and a group with remote tasks.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 55/63] sched: numa: Avoid migrating tasks that are placed on their preferred node
@ 2013-10-07 19:14     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> This patch classifies scheduler domains and runqueues into types depending
> on the number of tasks that care about their NUMA placement and the number
> that are currently running on their preferred node. The types are
> 
> regular: There are tasks running that do not care about their NUMA
> 	placement.
> 
> remote: There are tasks running that care about their placement but are
> 	currently running on a node remote to their ideal placement
> 
> all: No distinction
> 
> To implement this the patch tracks the number of tasks that are optimally
> NUMA placed (rq->nr_preferred_running) and the number of tasks running
> that care about their placement (nr_numa_running). The load balancer
> uses this information to avoid migrating ideally placed NUMA tasks as long
> as better options for load balancing exist. For example, it will not
> consider balancing between a group whose tasks are all perfectly placed
> and a group with remote tasks.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread
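
The classification in patch 55 can be written as a tiny decision function
over the two new counters. The struct below is a stand-in for the runqueue
fields the patch adds; the enum names mirror the regular/remote/all types in
the changelog.

enum rq_numa_type { RQ_REGULAR, RQ_REMOTE, RQ_ALL };

struct rq_model {
        int nr_running;
        int nr_numa_running;            /* tasks that care about placement */
        int nr_preferred_running;       /* tasks already ideally placed */
};

static enum rq_numa_type classify_rq(const struct rq_model *rq)
{
        if (rq->nr_running > rq->nr_numa_running)
                return RQ_REGULAR;      /* some tasks do not care at all */
        if (rq->nr_running > rq->nr_preferred_running)
                return RQ_REMOTE;       /* all care, some are badly placed */
        return RQ_ALL;                  /* nothing useful to distinguish */
}

The load balancer can then prefer pulling from "regular" or "remote" queues
before it disturbs a queue whose tasks are all where they want to be.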

* Re: [PATCH 57/63] sched: numa: Take false sharing into account when adapting scan rate
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:14     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Scan rate is altered based on whether shared/private faults dominated.
> task_numa_group() may detect false sharing but that information is not
> taken into account when adapting the scan rate. Take it into account.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 57/63] sched: numa: Take false sharing into account when adapting scan rate
@ 2013-10-07 19:14     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Scan rate is altered based on whether shared/private faults dominated.
> task_numa_group() may detect false sharing but that information is not
> taken into account when adapting the scan rate. Take it into account.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread
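
As a rough model of scan-period adaptation with false sharing folded in:
faults that task_numa_group() reclassifies as shared are counted on the
shared side, and the shared/private ratio then grows or shrinks the scan
period. The constants and the doubling/halving policy are arbitrary choices
for this sketch.

#define SCAN_PERIOD_MIN_MS    100u
#define SCAN_PERIOD_MAX_MS  60000u

struct scan_state {
        unsigned int scan_period_ms;
};

static void update_scan_period(struct scan_state *s,
                               unsigned long private_faults,
                               unsigned long shared_faults)
{
        if (private_faults >= shared_faults) {
                /* Mostly private: scanning pays off, scan more often. */
                s->scan_period_ms /= 2;
                if (s->scan_period_ms < SCAN_PERIOD_MIN_MS)
                        s->scan_period_ms = SCAN_PERIOD_MIN_MS;
        } else {
                /* Mostly shared or falsely shared: back off. */
                s->scan_period_ms *= 2;
                if (s->scan_period_ms > SCAN_PERIOD_MAX_MS)
                        s->scan_period_ms = SCAN_PERIOD_MAX_MS;
        }
}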

* Re: [PATCH 59/63] sched: numa: Remove the numa_balancing_scan_period_reset sysctl
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:14     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> With scan rate adaptions based on whether the workload has properly
> converged or not there should be no need for the scan period reset
> hammer. Get rid of it.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 59/63] sched: numa: Remove the numa_balancing_scan_period_reset sysctl
@ 2013-10-07 19:14     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> With scan rate adaptions based on whether the workload has properly
> converged or not there should be no need for the scan period reset
> hammer. Get rid of it.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 62/63] sched: numa: use unsigned longs for numa group fault stats
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-07 19:15     ` Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> As Peter says "If you're going to hold locks you can also do away with all
> that atomic_long_*() nonsense". Lock aquisition moved slightly to protect
> the updates.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 62/63] sched: numa: use unsigned longs for numa group fault stats
@ 2013-10-07 19:15     ` Rik van Riel
  0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 10/07/2013 06:29 AM, Mel Gorman wrote:
> As Peter says "If you're going to hold locks you can also do away with all
> that atomic_long_*() nonsense". Lock aquisition moved slightly to protect
> the updates.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 340+ messages in thread
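
A user-space analogue of the change in patch 62: once every update to the
group statistics is made under the group lock anyway, plain unsigned long
counters are sufficient and per-counter atomics add nothing. The pthread
mutex here stands in for the kernel's group lock.

#include <pthread.h>

#define MODEL_NODES 8

struct group_stats {
        pthread_mutex_t lock;
        unsigned long faults[MODEL_NODES];      /* per-node fault counts */
};

static void group_account_fault(struct group_stats *g, int nid)
{
        pthread_mutex_lock(&g->lock);
        g->faults[nid]++;               /* serialised by the lock */
        pthread_mutex_unlock(&g->lock);
}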

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 11:03   ` Ingo Molnar
  -1 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-10-09 11:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> This series has roughly the same goals as previous versions despite the
> size. It reduces overhead of automatic balancing through scan rate reduction
> and the avoidance of TLB flushes. It selects a preferred node and moves tasks
> towards their memory as well as moving memory toward their task. It handles
> shared pages and groups related tasks together. Some problems such as shared
> page interleaving and properly dealing with processes that are larger than
> a node are being deferred. This version should be ready for wider testing
> in -tip.

Thanks Mel - the series looks really nice. I've applied the patches to 
tip:sched/core and will push them out later today if they pass testing 
here.

> Note that with kernel 3.12-rc3 that numa balancing will fail to boot if 
> CONFIG_JUMP_LABEL is configured. This is a separate bug that is 
> currently being dealt with.

Okay, this is about:

  https://lkml.org/lkml/2013/9/30/308

Note that Peter and I have seen no crashes so far, and we boot with 
CONFIG_JUMP_LABEL=y and CONFIG_NUMA_BALANCING=y. It seems like an 
unrelated bug in any case, perhaps related to specific details in your 
kernel image?

2)

I also noticed a small Kconfig annoyance:

config NUMA_BALANCING_DEFAULT_ENABLED
        bool "Automatically enable NUMA aware memory/task placement"
        default y
        depends on NUMA_BALANCING
        help
          If set, autonumic NUMA balancing will be enabled if running on a NUMA
          machine.

config NUMA_BALANCING
        bool "Memory placement aware NUMA scheduler"
        depends on ARCH_SUPPORTS_NUMA_BALANCING
        depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
        depends on SMP && NUMA && MIGRATION
        help
          This option adds support for automatic NUM

the NUMA_BALANCING_DEFAULT_ENABLED option should come after the 
NUMA_BALANCING entries - things like 'make oldconfig' produce weird output 
otherwise.

3)

Plus in addition to PeterZ's build fix I noticed this new build warning on 
i386 UP kernels:

 kernel/sched/fair.c:819:22: warning: 'task_h_load' declared 'static' but never defined [-Wunused-function]

Introduced here I think:

    sched/numa: Use a system-wide search to find swap/migration candidates
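
As an aside, the warning pattern reproduces standalone (with a stand-in
struct, not the fair.c code): a static function whose declaration is always
visible but whose only definition sits behind a config #ifdef is flagged on
builds where that #ifdef is off.

/* warn-sketch.c: with CONFIG_SMP unset, "gcc -Wall -c warn-sketch.c"
 * reports: 'task_h_load' declared 'static' but never defined.
 * Stand-in type, illustrative only.
 */
struct task_struct { unsigned long load; };

static unsigned long task_h_load(struct task_struct *p);	/* always declared */

#ifdef CONFIG_SMP
static unsigned long task_h_load(struct task_struct *p)	/* defined only on SMP */
{
	return p->load;
}
#endif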

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-09 11:03   ` Ingo Molnar
@ 2013-10-09 11:11     ` Ingo Molnar
  -1 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-10-09 11:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML


* Ingo Molnar <mingo@kernel.org> wrote:

> 3)
> 
> Plus in addition to PeterZ's build fix I noticed this new build warning on 
> i386 UP kernels:
> 
>  kernel/sched/fair.c:819:22: warning: 'task_h_load' declared 'static' but never defined [-Wunused-function]
> 
> Introduced here I think:
> 
>     sched/numa: Use a system-wide search to find swap/migration candidates

4)

allyes builds fail on x86 32-bit:

   mm/mmzone.c:101:5: error: redefinition of ‘page_cpupid_xchg_last’

The reason is the mismatch in definitions:

 mm.h:

  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS

 mmzone.c:

  #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_IN_PAGE_FLAGS)

Note the missing 'NOT_' in the latter line. I've changed it to:

  #if defined(CONFIG_NUMA_BALANCING) && defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-09 11:11     ` Ingo Molnar
@ 2013-10-09 11:13       ` Ingo Molnar
  -1 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-10-09 11:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML


* Ingo Molnar <mingo@kernel.org> wrote:

>  mmzone.c:
> 
>   #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_IN_PAGE_FLAGS)
> 
> Note the missing 'NOT_' in the latter line. I've changed it to:
> 
>   #if defined(CONFIG_NUMA_BALANCING) && defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)

Actually, I think it should be:

   #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)

I'll fold back this fix to keep it bisectable on 32-bit platforms.
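
To make the intent of that pairing clearer, a simplified sketch with stand-in
names (not the real mm.h/mmzone.c): roughly, the header side supplies an
inline helper when the cpupid has its own field, and the mmzone.c side builds
the out-of-line, flags-based helper only in the complementary case, so exactly
one definition exists for any configuration.

/* guard-sketch.c: illustrative stand-ins only. */
struct page_sketch { unsigned long flags; int _last_cpupid; };

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
/* "mm.h" side: dedicated field, inline helper in the header. */
static inline int cpupid_xchg_last_sketch(struct page_sketch *page, int cpupid)
{
	int old = page->_last_cpupid;

	page->_last_cpupid = cpupid;
	return old;
}
#else
int cpupid_xchg_last_sketch(struct page_sketch *page, int cpupid);
#endif

/* "mmzone.c" side: out-of-line helper only in the complementary case. */
#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
int cpupid_xchg_last_sketch(struct page_sketch *page, int cpupid)
{
	int old = (int)(page->flags & 0xff);	/* pretend it lives in flags */

	page->flags = (page->flags & ~0xffUL) | ((unsigned long)cpupid & 0xff);
	return old;
}
#endif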

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-09 11:03   ` Ingo Molnar
@ 2013-10-09 12:05     ` Peter Zijlstra
  -1 siblings, 0 replies; 340+ messages in thread
From: Peter Zijlstra @ 2013-10-09 12:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Oct 09, 2013 at 01:03:54PM +0200, Ingo Molnar wrote:
>  kernel/sched/fair.c:819:22: warning: 'task_h_load' declared 'static' but never defined [-Wunused-function]

Not too pretty, but it avoids the warning:

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -681,6 +681,8 @@ static u64 sched_vslice(struct cfs_rq *c
 }
 
 #ifdef CONFIG_SMP
+static unsigned long task_h_load(struct task_struct *p);
+
 static inline void __update_task_entity_contrib(struct sched_entity *se);
 
 /* Give new task start runnable values to heavy its load in infant time */
@@ -816,8 +818,6 @@ update_stats_curr_start(struct cfs_rq *c
  * Scheduling class queueing methods:
  */
 
-static unsigned long task_h_load(struct task_struct *p);
-
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-09 12:05     ` Peter Zijlstra
@ 2013-10-09 12:48       ` Ingo Molnar
  -1 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-10-09 12:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Oct 09, 2013 at 01:03:54PM +0200, Ingo Molnar wrote:
> >  kernel/sched/fair.c:819:22: warning: 'task_h_load' declared 'static' but never defined [-Wunused-function]
> 
> Not too pretty, but it avoids the warning:
> 
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -681,6 +681,8 @@ static u64 sched_vslice(struct cfs_rq *c
>  }
>  
>  #ifdef CONFIG_SMP
> +static unsigned long task_h_load(struct task_struct *p);
> +
>  static inline void __update_task_entity_contrib(struct sched_entity *se);
>  
>  /* Give new task start runnable values to heavy its load in infant time */
> @@ -816,8 +818,6 @@ update_stats_curr_start(struct cfs_rq *c
>   * Scheduling class queueing methods:
>   */
>  
> -static unsigned long task_h_load(struct task_struct *p);
> -
>  #ifdef CONFIG_NUMA_BALANCING

Hm, so we really want to do a split-up of this file once things have 
calmed down - that will address such dependency issues.

Until then this fix will do, I've backmerged it to the originating patch.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-07 10:28 ` Mel Gorman
                   ` (64 preceding siblings ...)
  (?)
@ 2013-10-09 16:28 ` Ingo Molnar
  2013-10-09 16:29   ` Ingo Molnar
  2013-10-09 17:08     ` Peter Zijlstra
  -1 siblings, 2 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-10-09 16:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

[-- Attachment #1: Type: text/plain, Size: 952 bytes --]


Hm, so I'm seeing boot crashes with the config attached:

 INIT: version 2.86 booting
 BUG: unable to handle kernel paging request at eaf10f40
 IP: [<b103e0ef>] task_work_run+0x52/0x87
 *pde = 3fbf9067 *pte = 3af10060
 Oops: 0000 [#1] DEBUG_PAGEALLOC
 CPU: 0 PID: 171 Comm: hostname Tainted: G        W    3.12.0-rc4-01668-gfd71a04-dirty #229484
 task: eaf157a0 ti: eacf2000 task.ti: eacf2000

Note that the config does not have NUMA_BALANCING enabled. With another 
config I also had a failed bootup due to the OOM killer kicking in. That 
didn't have NUMA_BALANCING enabled either.

Yet this all started today, after merging the NUMA patches.

Any ideas?

Thanks,

	Ingo

[-- Attachment #2: config --]
[-- Type: text/plain, Size: 112297 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/i386 3.12.0-rc4 Kernel Configuration
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf32-i386"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CPU_AUTOPROBE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
# CONFIG_ZONE_DMA32 is not set
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-ecx -fcall-saved-edx"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
# CONFIG_SWAP is not set
# CONFIG_SYSVIPC is not set
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_FHANDLE is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y
# CONFIG_AUDIT_LOGINUID_IMMUTABLE is not set

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_CHIP=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_DEBUG=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_KTIME_SCALAR=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_HZ_PERIODIC=y
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ=y
# CONFIG_HIGH_RES_TIMERS is not set

#
# CPU/Task time and stats accounting
#
# CONFIG_TICK_CPU_ACCOUNTING is not set
CONFIG_IRQ_TIME_ACCOUNTING=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set

#
# RCU Subsystem
#
CONFIG_TINY_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_RCU_STALL_COMMON is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=20
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE=y
# CONFIG_CGROUPS is not set
# CONFIG_CHECKPOINT_RESTORE is not set
# CONFIG_NAMESPACES is not set
# CONFIG_UIDGID_STRICT_TYPE_CHECKS is not set
# CONFIG_SCHED_AUTOGROUP is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
# CONFIG_RD_BZIP2 is not set
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
# CONFIG_RD_LZO is not set
# CONFIG_RD_LZ4 is not set
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_EXPERT=y
# CONFIG_UID16 is not set
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
# CONFIG_SIGNALFD is not set
# CONFIG_TIMERFD is not set
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_PCI_QUIRKS=y
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_USE_VMALLOC=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
CONFIG_DEBUG_PERF_USE_VMALLOC=y
# CONFIG_VM_EVENT_COUNTERS is not set
# CONFIG_COMPAT_BRK is not set
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
# CONFIG_PROFILING is not set
CONFIG_TRACEPOINTS=y
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_JUMP_LABEL=y
CONFIG_UPROBES=y
# CONFIG_HAVE_64BIT_ALIGNED_ACCESS is not set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_IPC_PARSE_VERSION=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_MODULES_USE_ELF_REL=y
CONFIG_CLONE_BACKWARDS=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_OLD_SIGACTION=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
# CONFIG_MODULES is not set
CONFIG_BLOCK=y
# CONFIG_LBDAF is not set
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_INTEGRITY=y
CONFIG_BLK_CMDLINE_PARSER=y

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
CONFIG_ACORN_PARTITION=y
CONFIG_ACORN_PARTITION_CUMANA=y
# CONFIG_ACORN_PARTITION_EESOX is not set
# CONFIG_ACORN_PARTITION_ICS is not set
# CONFIG_ACORN_PARTITION_ADFS is not set
CONFIG_ACORN_PARTITION_POWERTEC=y
# CONFIG_ACORN_PARTITION_RISCIX is not set
# CONFIG_AIX_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
CONFIG_LDM_PARTITION=y
# CONFIG_LDM_DEBUG is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_CMDLINE_PARTITION=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
CONFIG_DEFAULT_NOOP=y
CONFIG_DEFAULT_IOSCHED="noop"
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_FREEZER=y

#
# Processor type and features
#
# CONFIG_ZONE_DMA is not set
# CONFIG_SMP is not set
# CONFIG_X86_MPPARSE is not set
CONFIG_GOLDFISH=y
CONFIG_X86_EXTENDED_PLATFORM=y
CONFIG_X86_GOLDFISH=y
# CONFIG_X86_INTEL_CE is not set
CONFIG_X86_WANT_INTEL_MID=y
# CONFIG_X86_RDC321X is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_X86_32_IRIS=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_KVMTOOL_TEST_ENABLE=y
CONFIG_HYPERVISOR_GUEST=y
# CONFIG_PARAVIRT is not set
# CONFIG_XEN_PRIVILEGED_GUEST is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MELAN is not set
# CONFIG_MGEODEGX1 is not set
CONFIG_MGEODE_LX=y
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_INTERNODE_CACHE_SHIFT=5
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_USE_3DNOW=y
CONFIG_X86_TSC=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=4
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_PROCESSOR_SELECT=y
# CONFIG_CPU_SUP_INTEL is not set
# CONFIG_CPU_SUP_CYRIX_32 is not set
# CONFIG_CPU_SUP_AMD is not set
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_TRANSMETA_32=y
# CONFIG_CPU_SUP_UMC_32 is not set
# CONFIG_HPET_TIMER is not set
# CONFIG_DMI is not set
CONFIG_NR_CPUS=1
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
# CONFIG_X86_MCE_AMD is not set
CONFIG_X86_ANCIENT_MCE=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_X86_THERMAL_VECTOR=y
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
CONFIG_I8K=y
CONFIG_X86_REBOOTFIXUPS=y
# CONFIG_MICROCODE is not set
# CONFIG_MICROCODE_INTEL_EARLY is not set
# CONFIG_MICROCODE_AMD_EARLY is not set
# CONFIG_X86_MSR is not set
CONFIG_X86_CPUID=y
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
# CONFIG_VMSPLIT_3G is not set
CONFIG_VMSPLIT_3G_OPT=y
# CONFIG_VMSPLIT_2G is not set
# CONFIG_VMSPLIT_2G_OPT is not set
# CONFIG_VMSPLIT_1G is not set
CONFIG_PAGE_OFFSET=0xB0000000
CONFIG_HIGHMEM=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ILLEGAL_POINTER_VALUE=0
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_SPARSEMEM_STATIC=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_ARCH_DISCARD_MEMBLOCK=y
CONFIG_MEMORY_ISOLATION=y
# CONFIG_HAVE_BOOTMEM_INFO_NODE is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=999999
CONFIG_BALLOON_COMPACTION=y
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_PHYS_ADDR_T_64BIT is not set
CONFIG_ZONE_DMA_FLAG=0
# CONFIG_BOUNCE is not set
CONFIG_NEED_BOUNCE_POOL=y
CONFIG_VIRT_TO_BUS=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_HWPOISON_INJECT=y
CONFIG_TRANSPARENT_HUGEPAGE=y
# CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS is not set
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_NEED_PER_CPU_KM=y
# CONFIG_CLEANCACHE is not set
# CONFIG_CMA is not set
# CONFIG_ZBUD is not set
CONFIG_HIGHPTE=y
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW=64
CONFIG_MATH_EMULATION=y
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
# CONFIG_X86_SMAP is not set
# CONFIG_SECCOMP is not set
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
# CONFIG_SCHED_HRTICK is not set
# CONFIG_KEXEC is not set
CONFIG_CRASH_DUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_X86_NEED_RELOCS=y
CONFIG_PHYSICAL_ALIGN=0x1000000
# CONFIG_COMPAT_VDSO is not set
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE=""
# CONFIG_CMDLINE_OVERRIDE is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management and ACPI options
#
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_PM_SLEEP=y
CONFIG_PM_AUTOSLEEP=y
CONFIG_PM_WAKELOCKS=y
CONFIG_PM_WAKELOCKS_LIMIT=100
CONFIG_PM_WAKELOCKS_GC=y
CONFIG_PM_RUNTIME=y
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_ADVANCED_DEBUG is not set
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_PM_SLEEP_DEBUG=y
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
CONFIG_WQ_POWER_EFFICIENT_DEFAULT=y
# CONFIG_ACPI is not set
CONFIG_SFI=y
# CONFIG_APM is not set

#
# CPU Frequency scaling
#
# CONFIG_CPU_FREQ is not set

#
# CPU Idle
#
# CONFIG_CPU_IDLE is not set
# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
CONFIG_PCI_GODIRECT=y
# CONFIG_PCI_GOOLPC is not set
# CONFIG_PCI_GOANY is not set
CONFIG_PCI_DIRECT=y
CONFIG_PCI_DOMAINS=y
# CONFIG_PCI_CNB20LE_QUIRK is not set
# CONFIG_PCIEPORTBUS is not set
CONFIG_PCI_MSI=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
# CONFIG_PCI_STUB is not set
# CONFIG_HT_IRQ is not set
CONFIG_PCI_ATS=y
CONFIG_PCI_IOV=y
# CONFIG_PCI_PRI is not set
CONFIG_PCI_PASID=y

#
# PCI host controller drivers
#
CONFIG_ISA_DMA_API=y
CONFIG_ISA=y
# CONFIG_EISA is not set
CONFIG_SCx200=y
# CONFIG_SCx200HR_TIMER is not set
CONFIG_OLPC=y
# CONFIG_OLPC_XO1_PM is not set
# CONFIG_ALIX is not set
CONFIG_NET5501=y
CONFIG_PCCARD=y
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
# CONFIG_CARDBUS is not set

#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
# CONFIG_YENTA_TI is not set
CONFIG_YENTA_TOSHIBA=y
CONFIG_PD6729=y
CONFIG_I82092=y
CONFIG_I82365=y
CONFIG_TCIC=y
CONFIG_PCMCIA_PROBE=y
CONFIG_PCCARD_NONSTATIC=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_CPCI=y
# CONFIG_HOTPLUG_PCI_CPCI_ZT5550 is not set
CONFIG_HOTPLUG_PCI_CPCI_GENERIC=y
CONFIG_HOTPLUG_PCI_SHPC=y
# CONFIG_RAPIDIO is not set
CONFIG_X86_SYSFB=y

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
# CONFIG_BINFMT_SCRIPT is not set
CONFIG_HAVE_AOUT=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_MISC=y
CONFIG_COREDUMP=y
CONFIG_HAVE_ATOMIC_IOMAP=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_DIAG=y
CONFIG_UNIX=y
CONFIG_UNIX_DIAG=y
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
# CONFIG_XFRM_SUB_POLICY is not set
CONFIG_XFRM_MIGRATE=y
# CONFIG_XFRM_STATISTICS is not set
CONFIG_NET_KEY=y
CONFIG_NET_KEY_MIGRATE=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_FIB_TRIE_STATS=y
CONFIG_IP_MULTIPLE_TABLES=y
# CONFIG_IP_ROUTE_MULTIPATH is not set
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_ROUTE_CLASSID=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
CONFIG_IP_PNP_BOOTP=y
# CONFIG_IP_PNP_RARP is not set
CONFIG_NET_IPIP=y
CONFIG_NET_IPGRE_DEMUX=y
CONFIG_NET_IP_TUNNEL=y
# CONFIG_NET_IPGRE is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_PIMSM_V1=y
# CONFIG_IP_PIMSM_V2 is not set
CONFIG_SYN_COOKIES=y
CONFIG_NET_IPVTI=y
CONFIG_INET_AH=y
CONFIG_INET_ESP=y
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
CONFIG_INET_TUNNEL=y
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
# CONFIG_INET_UDP_DIAG is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
# CONFIG_IPV6_PRIVACY is not set
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=y
CONFIG_INET6_ESP=y
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
CONFIG_INET6_TUNNEL=y
CONFIG_INET6_XFRM_MODE_TRANSPORT=y
CONFIG_INET6_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_BEET=y
CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION=y
CONFIG_IPV6_SIT=y
# CONFIG_IPV6_SIT_6RD is not set
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=y
CONFIG_IPV6_GRE=y
# CONFIG_IPV6_MULTIPLE_TABLES is not set
CONFIG_IPV6_MROUTE=y
CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y
# CONFIG_IPV6_PIMSM_V2 is not set
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
CONFIG_NETWORK_PHY_TIMESTAMPING=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_DEBUG=y
CONFIG_NETFILTER_ADVANCED=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=y
CONFIG_NETFILTER_NETLINK_ACCT=y
CONFIG_NETFILTER_NETLINK_QUEUE=y
CONFIG_NETFILTER_NETLINK_LOG=y
CONFIG_NF_CONNTRACK=y
CONFIG_NF_CONNTRACK_MARK=y
# CONFIG_NF_CONNTRACK_SECMARK is not set
# CONFIG_NF_CONNTRACK_ZONES is not set
CONFIG_NF_CONNTRACK_PROCFS=y
CONFIG_NF_CONNTRACK_EVENTS=y
CONFIG_NF_CONNTRACK_TIMEOUT=y
# CONFIG_NF_CONNTRACK_TIMESTAMP is not set
CONFIG_NF_CONNTRACK_LABELS=y
# CONFIG_NF_CT_PROTO_DCCP is not set
CONFIG_NF_CT_PROTO_SCTP=y
# CONFIG_NF_CT_PROTO_UDPLITE is not set
CONFIG_NF_CONNTRACK_AMANDA=y
CONFIG_NF_CONNTRACK_FTP=y
CONFIG_NF_CONNTRACK_H323=y
CONFIG_NF_CONNTRACK_IRC=y
CONFIG_NF_CONNTRACK_BROADCAST=y
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
CONFIG_NF_CONNTRACK_SNMP=y
# CONFIG_NF_CONNTRACK_PPTP is not set
CONFIG_NF_CONNTRACK_SANE=y
CONFIG_NF_CONNTRACK_SIP=y
# CONFIG_NF_CONNTRACK_TFTP is not set
CONFIG_NF_CT_NETLINK=y
# CONFIG_NF_CT_NETLINK_TIMEOUT is not set
# CONFIG_NETFILTER_NETLINK_QUEUE_CT is not set
CONFIG_NETFILTER_SYNPROXY=y
CONFIG_NETFILTER_XTABLES=y

#
# Xtables combined modules
#
CONFIG_NETFILTER_XT_MARK=y
CONFIG_NETFILTER_XT_CONNMARK=y

#
# Xtables targets
#
CONFIG_NETFILTER_XT_TARGET_AUDIT=y
CONFIG_NETFILTER_XT_TARGET_CHECKSUM=y
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=y
CONFIG_NETFILTER_XT_TARGET_CONNMARK=y
CONFIG_NETFILTER_XT_TARGET_CT=y
CONFIG_NETFILTER_XT_TARGET_DSCP=y
CONFIG_NETFILTER_XT_TARGET_HL=y
CONFIG_NETFILTER_XT_TARGET_HMARK=y
CONFIG_NETFILTER_XT_TARGET_IDLETIMER=y
CONFIG_NETFILTER_XT_TARGET_LED=y
CONFIG_NETFILTER_XT_TARGET_LOG=y
CONFIG_NETFILTER_XT_TARGET_MARK=y
CONFIG_NETFILTER_XT_TARGET_NFLOG=y
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=y
# CONFIG_NETFILTER_XT_TARGET_NOTRACK is not set
CONFIG_NETFILTER_XT_TARGET_RATEEST=y
CONFIG_NETFILTER_XT_TARGET_TEE=y
CONFIG_NETFILTER_XT_TARGET_TPROXY=y
CONFIG_NETFILTER_XT_TARGET_TRACE=y
CONFIG_NETFILTER_XT_TARGET_SECMARK=y
CONFIG_NETFILTER_XT_TARGET_TCPMSS=y
CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP=y

#
# Xtables matches
#
# CONFIG_NETFILTER_XT_MATCH_ADDRTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_BPF is not set
CONFIG_NETFILTER_XT_MATCH_CLUSTER=y
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=y
CONFIG_NETFILTER_XT_MATCH_CONNLABEL=y
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=y
# CONFIG_NETFILTER_XT_MATCH_CONNMARK is not set
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=y
# CONFIG_NETFILTER_XT_MATCH_CPU is not set
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DEVGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
CONFIG_NETFILTER_XT_MATCH_ECN=y
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=y
CONFIG_NETFILTER_XT_MATCH_HELPER=y
CONFIG_NETFILTER_XT_MATCH_HL=y
CONFIG_NETFILTER_XT_MATCH_IPRANGE=y
CONFIG_NETFILTER_XT_MATCH_IPVS=y
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
CONFIG_NETFILTER_XT_MATCH_LIMIT=y
# CONFIG_NETFILTER_XT_MATCH_MAC is not set
CONFIG_NETFILTER_XT_MATCH_MARK=y
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=y
CONFIG_NETFILTER_XT_MATCH_NFACCT=y
# CONFIG_NETFILTER_XT_MATCH_OSF is not set
CONFIG_NETFILTER_XT_MATCH_OWNER=y
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=y
CONFIG_NETFILTER_XT_MATCH_QUOTA=y
CONFIG_NETFILTER_XT_MATCH_RATEEST=y
CONFIG_NETFILTER_XT_MATCH_REALM=y
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
# CONFIG_NETFILTER_XT_MATCH_SOCKET is not set
# CONFIG_NETFILTER_XT_MATCH_STATE is not set
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
CONFIG_NETFILTER_XT_MATCH_STRING=y
CONFIG_NETFILTER_XT_MATCH_TCPMSS=y
CONFIG_NETFILTER_XT_MATCH_TIME=y
CONFIG_NETFILTER_XT_MATCH_U32=y
# CONFIG_IP_SET is not set
CONFIG_IP_VS=y
CONFIG_IP_VS_IPV6=y
# CONFIG_IP_VS_DEBUG is not set
CONFIG_IP_VS_TAB_BITS=12

#
# IPVS transport protocol load balancing support
#
# CONFIG_IP_VS_PROTO_TCP is not set
# CONFIG_IP_VS_PROTO_UDP is not set
# CONFIG_IP_VS_PROTO_AH_ESP is not set
# CONFIG_IP_VS_PROTO_ESP is not set
# CONFIG_IP_VS_PROTO_AH is not set
# CONFIG_IP_VS_PROTO_SCTP is not set

#
# IPVS scheduler
#
# CONFIG_IP_VS_RR is not set
CONFIG_IP_VS_WRR=y
CONFIG_IP_VS_LC=y
# CONFIG_IP_VS_WLC is not set
CONFIG_IP_VS_LBLC=y
CONFIG_IP_VS_LBLCR=y
CONFIG_IP_VS_DH=y
CONFIG_IP_VS_SH=y
CONFIG_IP_VS_SED=y
CONFIG_IP_VS_NQ=y

#
# IPVS SH scheduler
#
CONFIG_IP_VS_SH_TAB_BITS=8

#
# IPVS application helper
#
# CONFIG_IP_VS_NFCT is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=y
# CONFIG_NF_CONNTRACK_IPV4 is not set
CONFIG_IP_NF_IPTABLES=y
# CONFIG_IP_NF_MATCH_AH is not set
CONFIG_IP_NF_MATCH_ECN=y
# CONFIG_IP_NF_MATCH_RPFILTER is not set
CONFIG_IP_NF_MATCH_TTL=y
CONFIG_IP_NF_FILTER=y
# CONFIG_IP_NF_TARGET_REJECT is not set
CONFIG_IP_NF_TARGET_SYNPROXY=y
CONFIG_IP_NF_TARGET_ULOG=y
CONFIG_IP_NF_MANGLE=y
CONFIG_IP_NF_TARGET_ECN=y
# CONFIG_IP_NF_TARGET_TTL is not set
CONFIG_IP_NF_RAW=y
CONFIG_IP_NF_SECURITY=y
# CONFIG_IP_NF_ARPTABLES is not set

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV6=y
# CONFIG_NF_CONNTRACK_IPV6 is not set
CONFIG_IP6_NF_IPTABLES=y
CONFIG_IP6_NF_MATCH_AH=y
CONFIG_IP6_NF_MATCH_EUI64=y
CONFIG_IP6_NF_MATCH_FRAG=y
# CONFIG_IP6_NF_MATCH_OPTS is not set
CONFIG_IP6_NF_MATCH_HL=y
CONFIG_IP6_NF_MATCH_IPV6HEADER=y
CONFIG_IP6_NF_MATCH_MH=y
CONFIG_IP6_NF_MATCH_RPFILTER=y
# CONFIG_IP6_NF_MATCH_RT is not set
# CONFIG_IP6_NF_TARGET_HL is not set
CONFIG_IP6_NF_FILTER=y
# CONFIG_IP6_NF_TARGET_REJECT is not set
CONFIG_IP6_NF_TARGET_SYNPROXY=y
CONFIG_IP6_NF_MANGLE=y
CONFIG_IP6_NF_RAW=y
# CONFIG_IP6_NF_SECURITY is not set

#
# DECnet: Netfilter Configuration
#
CONFIG_DECNET_NF_GRABULATOR=y
CONFIG_IP_DCCP=y
CONFIG_INET_DCCP_DIAG=y

#
# DCCP CCIDs Configuration
#
# CONFIG_IP_DCCP_CCID2_DEBUG is not set
CONFIG_IP_DCCP_CCID3=y
CONFIG_IP_DCCP_CCID3_DEBUG=y
CONFIG_IP_DCCP_TFRC_LIB=y
CONFIG_IP_DCCP_TFRC_DEBUG=y

#
# DCCP Kernel Hacking
#
CONFIG_IP_DCCP_DEBUG=y
CONFIG_IP_SCTP=y
# CONFIG_SCTP_DBG_OBJCNT is not set
CONFIG_SCTP_DEFAULT_COOKIE_HMAC_MD5=y
# CONFIG_SCTP_DEFAULT_COOKIE_HMAC_SHA1 is not set
# CONFIG_SCTP_DEFAULT_COOKIE_HMAC_NONE is not set
CONFIG_SCTP_COOKIE_HMAC_MD5=y
# CONFIG_SCTP_COOKIE_HMAC_SHA1 is not set
# CONFIG_RDS is not set
CONFIG_TIPC=y
CONFIG_TIPC_PORTS=8191
# CONFIG_TIPC_MEDIA_IB is not set
CONFIG_ATM=y
CONFIG_ATM_CLIP=y
# CONFIG_ATM_CLIP_NO_ICMP is not set
CONFIG_ATM_LANE=y
CONFIG_ATM_MPOA=y
CONFIG_ATM_BR2684=y
# CONFIG_ATM_BR2684_IPFILTER is not set
# CONFIG_L2TP is not set
CONFIG_STP=y
CONFIG_GARP=y
# CONFIG_BRIDGE is not set
CONFIG_HAVE_NET_DSA=y
CONFIG_VLAN_8021Q=y
CONFIG_VLAN_8021Q_GVRP=y
# CONFIG_VLAN_8021Q_MVRP is not set
CONFIG_DECNET=y
CONFIG_DECNET_ROUTER=y
CONFIG_LLC=y
CONFIG_LLC2=y
CONFIG_IPX=y
CONFIG_IPX_INTERN=y
CONFIG_ATALK=y
CONFIG_DEV_APPLETALK=y
CONFIG_LTPC=y
# CONFIG_COPS is not set
CONFIG_IPDDP=y
CONFIG_IPDDP_ENCAP=y
CONFIG_X25=y
CONFIG_LAPB=y
CONFIG_PHONET=y
CONFIG_IEEE802154=y
CONFIG_IEEE802154_6LOWPAN=y
CONFIG_MAC802154=y
# CONFIG_NET_SCHED is not set
CONFIG_DCB=y
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
CONFIG_VSOCKETS=y
CONFIG_VMWARE_VMCI_VSOCKETS=y
CONFIG_NETLINK_MMAP=y
CONFIG_NETLINK_DIAG=y
CONFIG_NET_MPLS_GSO=y
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_DROP_MONITOR is not set
CONFIG_HAMRADIO=y

#
# Packet Radio protocols
#
CONFIG_AX25=y
# CONFIG_AX25_DAMA_SLAVE is not set
CONFIG_NETROM=y
CONFIG_ROSE=y

#
# AX.25 network device drivers
#
CONFIG_MKISS=y
CONFIG_6PACK=y
CONFIG_BPQETHER=y
CONFIG_DMASCC=y
CONFIG_SCC=y
CONFIG_SCC_DELAY=y
CONFIG_SCC_TRXECHO=y
CONFIG_BAYCOM_SER_FDX=y
CONFIG_BAYCOM_SER_HDX=y
CONFIG_BAYCOM_PAR=y
CONFIG_BAYCOM_EPP=y
CONFIG_YAM=y
CONFIG_CAN=y
# CONFIG_CAN_RAW is not set
CONFIG_CAN_BCM=y
CONFIG_CAN_GW=y

#
# CAN Device Drivers
#
# CONFIG_CAN_VCAN is not set
# CONFIG_CAN_SLCAN is not set
CONFIG_CAN_DEV=y
# CONFIG_CAN_CALC_BITTIMING is not set
CONFIG_CAN_LEDS=y
# CONFIG_CAN_JANZ_ICAN3 is not set
CONFIG_PCH_CAN=y
CONFIG_CAN_GRCAN=y
CONFIG_CAN_SJA1000=y
CONFIG_CAN_SJA1000_ISA=y
CONFIG_CAN_SJA1000_PLATFORM=y
# CONFIG_CAN_SJA1000_OF_PLATFORM is not set
CONFIG_CAN_EMS_PCMCIA=y
CONFIG_CAN_EMS_PCI=y
CONFIG_CAN_PEAK_PCMCIA=y
CONFIG_CAN_PEAK_PCI=y
# CONFIG_CAN_PEAK_PCIEC is not set
# CONFIG_CAN_KVASER_PCI is not set
CONFIG_CAN_PLX_PCI=y
CONFIG_CAN_TSCAN1=y
# CONFIG_CAN_C_CAN is not set
CONFIG_CAN_CC770=y
CONFIG_CAN_CC770_ISA=y
CONFIG_CAN_CC770_PLATFORM=y

#
# CAN USB interfaces
#
# CONFIG_CAN_EMS_USB is not set
CONFIG_CAN_ESD_USB2=y
CONFIG_CAN_KVASER_USB=y
# CONFIG_CAN_PEAK_USB is not set
CONFIG_CAN_8DEV_USB=y
CONFIG_CAN_SOFTING=y
# CONFIG_CAN_SOFTING_CS is not set
CONFIG_CAN_DEBUG_DEVICES=y
CONFIG_IRDA=y

#
# IrDA protocols
#
CONFIG_IRLAN=y
# CONFIG_IRCOMM is not set
# CONFIG_IRDA_ULTRA is not set

#
# IrDA options
#
# CONFIG_IRDA_CACHE_LAST_LSAP is not set
CONFIG_IRDA_FAST_RR=y
# CONFIG_IRDA_DEBUG is not set

#
# Infrared-port device drivers
#

#
# SIR device drivers
#
CONFIG_IRTTY_SIR=y

#
# Dongle support
#
# CONFIG_DONGLE is not set
CONFIG_KINGSUN_DONGLE=y
# CONFIG_KSDAZZLE_DONGLE is not set
# CONFIG_KS959_DONGLE is not set

#
# FIR device drivers
#
# CONFIG_USB_IRDA is not set
# CONFIG_SIGMATEL_FIR is not set
CONFIG_NSC_FIR=y
# CONFIG_WINBOND_FIR is not set
CONFIG_TOSHIBA_FIR=y
CONFIG_SMC_IRCC_FIR=y
# CONFIG_ALI_FIR is not set
CONFIG_VLSI_FIR=y
CONFIG_VIA_FIR=y
CONFIG_MCS_FIR=y
CONFIG_BT=y
CONFIG_BT_RFCOMM=y
CONFIG_BT_RFCOMM_TTY=y
# CONFIG_BT_BNEP is not set
CONFIG_BT_HIDP=y

#
# Bluetooth device drivers
#
# CONFIG_BT_HCIBTUSB is not set
CONFIG_BT_HCIBTSDIO=y
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBPA10X is not set
CONFIG_BT_HCIBFUSB=y
CONFIG_BT_HCIDTL1=y
CONFIG_BT_HCIBT3C=y
# CONFIG_BT_HCIBLUECARD is not set
CONFIG_BT_HCIBTUART=y
CONFIG_BT_HCIVHCI=y
# CONFIG_BT_MRVL is not set
CONFIG_BT_WILINK=y
CONFIG_AF_RXRPC=y
# CONFIG_AF_RXRPC_DEBUG is not set
CONFIG_RXKAD=y
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
CONFIG_WIRELESS_EXT=y
CONFIG_WEXT_CORE=y
CONFIG_WEXT_PROC=y
CONFIG_WEXT_SPY=y
CONFIG_WEXT_PRIV=y
CONFIG_CFG80211=y
# CONFIG_NL80211_TESTMODE is not set
CONFIG_CFG80211_DEVELOPER_WARNINGS=y
# CONFIG_CFG80211_REG_DEBUG is not set
# CONFIG_CFG80211_CERTIFICATION_ONUS is not set
# CONFIG_CFG80211_DEFAULT_PS is not set
CONFIG_CFG80211_DEBUGFS=y
# CONFIG_CFG80211_INTERNAL_REGDB is not set
CONFIG_CFG80211_WEXT=y
CONFIG_LIB80211=y
CONFIG_LIB80211_CRYPT_WEP=y
CONFIG_LIB80211_CRYPT_CCMP=y
CONFIG_LIB80211_CRYPT_TKIP=y
CONFIG_LIB80211_DEBUG=y
CONFIG_MAC80211=y
# CONFIG_MAC80211_RC_PID is not set
# CONFIG_MAC80211_RC_MINSTREL is not set
CONFIG_MAC80211_RC_DEFAULT=""

#
# Some wireless drivers require a rate control algorithm
#
# CONFIG_MAC80211_MESH is not set
CONFIG_MAC80211_LEDS=y
CONFIG_MAC80211_DEBUGFS=y
# CONFIG_MAC80211_MESSAGE_TRACING is not set
CONFIG_MAC80211_DEBUG_MENU=y
CONFIG_MAC80211_NOINLINE=y
# CONFIG_MAC80211_VERBOSE_DEBUG is not set
CONFIG_MAC80211_MLME_DEBUG=y
# CONFIG_MAC80211_STA_DEBUG is not set
CONFIG_MAC80211_HT_DEBUG=y
CONFIG_MAC80211_IBSS_DEBUG=y
# CONFIG_MAC80211_PS_DEBUG is not set
CONFIG_MAC80211_TDLS_DEBUG=y
CONFIG_MAC80211_DEBUG_COUNTERS=y
CONFIG_WIMAX=y
CONFIG_WIMAX_DEBUG_LEVEL=8
# CONFIG_RFKILL is not set
# CONFIG_RFKILL_REGULATOR is not set
CONFIG_NET_9P=y
CONFIG_NET_9P_VIRTIO=y
CONFIG_NET_9P_RDMA=y
CONFIG_NET_9P_DEBUG=y
# CONFIG_CAIF is not set
CONFIG_CEPH_LIB=y
CONFIG_CEPH_LIB_PRETTYDEBUG=y
CONFIG_CEPH_LIB_USE_DNS_RESOLVER=y
CONFIG_NFC=y
CONFIG_NFC_NCI=y
CONFIG_NFC_HCI=y
# CONFIG_NFC_SHDLC is not set

#
# Near Field Communication (NFC) devices
#
# CONFIG_NFC_PN533 is not set
CONFIG_NFC_WILINK=y
# CONFIG_NFC_SIM is not set
CONFIG_NFC_PN544=y
CONFIG_NFC_MICROREAD=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
# CONFIG_DEVTMPFS_MOUNT is not set
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_FIRMWARE_IN_KERNEL is not set
CONFIG_EXTRA_FIRMWARE=""
CONFIG_FW_LOADER_USER_HELPER=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_GENERIC_CPU_DEVICES is not set
CONFIG_REGMAP=y
CONFIG_REGMAP_I2C=y
CONFIG_REGMAP_MMIO=y
CONFIG_REGMAP_IRQ=y
# CONFIG_DMA_SHARED_BUFFER is not set

#
# Bus devices
#
# CONFIG_CONNECTOR is not set
# CONFIG_MTD is not set
CONFIG_OF=y

#
# Device Tree and Open Firmware support
#
# CONFIG_PROC_DEVICETREE is not set
# CONFIG_OF_SELFTEST is not set
CONFIG_OF_PROMTREE=y
CONFIG_OF_ADDRESS=y
CONFIG_OF_IRQ=y
CONFIG_OF_NET=y
CONFIG_OF_MDIO=y
CONFIG_OF_PCI=y
CONFIG_OF_PCI_IRQ=y
CONFIG_PARPORT=y
CONFIG_PARPORT_PC=y
CONFIG_PARPORT_SERIAL=y
# CONFIG_PARPORT_PC_FIFO is not set
CONFIG_PARPORT_PC_SUPERIO=y
CONFIG_PARPORT_PC_PCMCIA=y
# CONFIG_PARPORT_GSC is not set
CONFIG_PARPORT_AX88796=y
CONFIG_PARPORT_1284=y
CONFIG_PARPORT_NOT_PC=y
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_ISAPNP=y
CONFIG_PNPBIOS=y
# CONFIG_PNPBIOS_PROC_FS is not set
# CONFIG_PNPACPI is not set
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_FD=y
# CONFIG_PARIDE is not set
CONFIG_BLK_DEV_PCIESSD_MTIP32XX=y
CONFIG_BLK_CPQ_DA=y
CONFIG_BLK_CPQ_CISS_DA=y
# CONFIG_CISS_SCSI_TAPE is not set
# CONFIG_BLK_DEV_DAC960 is not set
CONFIG_BLK_DEV_UMEM=y
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_DRBD is not set
CONFIG_BLK_DEV_NBD=y
# CONFIG_BLK_DEV_NVME is not set
CONFIG_BLK_DEV_OSD=y
CONFIG_BLK_DEV_SX8=y
# CONFIG_BLK_DEV_RAM is not set
CONFIG_CDROM_PKTCDVD=y
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
# CONFIG_ATA_OVER_ETH is not set
CONFIG_VIRTIO_BLK=y
# CONFIG_BLK_DEV_HD is not set
# CONFIG_BLK_DEV_RBD is not set
# CONFIG_BLK_DEV_RSXX is not set

#
# Misc devices
#
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_AD525X_DPOT is not set
CONFIG_DUMMY_IRQ=y
CONFIG_IBM_ASM=y
CONFIG_PHANTOM=y
CONFIG_SGI_IOC4=y
CONFIG_TIFM_CORE=y
CONFIG_TIFM_7XX1=y
CONFIG_ICS932S401=y
# CONFIG_ATMEL_SSC is not set
CONFIG_ENCLOSURE_SERVICES=y
CONFIG_CS5535_MFGPT=y
CONFIG_CS5535_MFGPT_DEFAULT_IRQ=7
CONFIG_CS5535_CLOCK_EVENT_SRC=y
CONFIG_HP_ILO=y
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
CONFIG_ISL29020=y
CONFIG_SENSORS_TSL2550=y
CONFIG_SENSORS_BH1780=y
# CONFIG_SENSORS_BH1770 is not set
CONFIG_SENSORS_APDS990X=y
# CONFIG_HMC6352 is not set
CONFIG_DS1682=y
# CONFIG_VMWARE_BALLOON is not set
# CONFIG_BMP085_I2C is not set
# CONFIG_PCH_PHUB is not set
CONFIG_USB_SWITCH_FSA9480=y
# CONFIG_SRAM is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
CONFIG_EEPROM_AT24=y
CONFIG_EEPROM_LEGACY=y
# CONFIG_EEPROM_MAX6875 is not set
CONFIG_EEPROM_93CX6=y
CONFIG_CB710_CORE=y
CONFIG_CB710_DEBUG=y
CONFIG_CB710_DEBUG_ASSUMPTIONS=y

#
# Texas Instruments shared transport line discipline
#
CONFIG_TI_ST=y
# CONFIG_SENSORS_LIS3_I2C is not set

#
# Altera FPGA firmware download module
#
# CONFIG_ALTERA_STAPL is not set
CONFIG_VMWARE_VMCI=y
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
CONFIG_RAID_ATTRS=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=y
CONFIG_SCSI_NETLINK=y
# CONFIG_SCSI_PROC_FS is not set

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
# CONFIG_BLK_DEV_SR is not set
CONFIG_CHR_DEV_SG=y
CONFIG_CHR_DEV_SCH=y
# CONFIG_SCSI_ENCLOSURE is not set
# CONFIG_SCSI_MULTI_LUN is not set
# CONFIG_SCSI_CONSTANTS is not set
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_FC_ATTRS=y
CONFIG_SCSI_FC_TGT_ATTRS=y
CONFIG_SCSI_ISCSI_ATTRS=y
CONFIG_SCSI_SAS_ATTRS=y
CONFIG_SCSI_SAS_LIBSAS=y
# CONFIG_SCSI_SAS_ATA is not set
# CONFIG_SCSI_SAS_HOST_SMP is not set
CONFIG_SCSI_SRP_ATTRS=y
# CONFIG_SCSI_SRP_TGT_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=y
CONFIG_ISCSI_BOOT_SYSFS=y
# CONFIG_SCSI_CXGB3_ISCSI is not set
CONFIG_SCSI_CXGB4_ISCSI=y
# CONFIG_SCSI_BNX2_ISCSI is not set
CONFIG_SCSI_BNX2X_FCOE=y
CONFIG_BE2ISCSI=y
CONFIG_BLK_DEV_3W_XXXX_RAID=y
CONFIG_SCSI_HPSA=y
CONFIG_SCSI_3W_9XXX=y
# CONFIG_SCSI_3W_SAS is not set
CONFIG_SCSI_7000FASST=y
# CONFIG_SCSI_ACARD is not set
CONFIG_SCSI_AHA152X=y
# CONFIG_SCSI_AHA1542 is not set
CONFIG_SCSI_AACRAID=y
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=5000
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
# CONFIG_SCSI_AIC7XXX_OLD is not set
CONFIG_SCSI_AIC79XX=y
CONFIG_AIC79XX_CMDS_PER_DEVICE=32
CONFIG_AIC79XX_RESET_DELAY_MS=5000
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC94XX=y
# CONFIG_AIC94XX_DEBUG is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_MVUMI is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
CONFIG_SCSI_IN2000=y
CONFIG_SCSI_ARCMSR=y
CONFIG_SCSI_ESAS2R=y
# CONFIG_MEGARAID_NEWGEN is not set
CONFIG_MEGARAID_LEGACY=y
CONFIG_MEGARAID_SAS=y
CONFIG_SCSI_MPT2SAS=y
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
# CONFIG_SCSI_MPT2SAS_LOGGING is not set
CONFIG_SCSI_MPT3SAS=y
CONFIG_SCSI_MPT3SAS_MAX_SGE=128
# CONFIG_SCSI_MPT3SAS_LOGGING is not set
CONFIG_SCSI_UFSHCD=y
CONFIG_SCSI_UFSHCD_PCI=y
CONFIG_SCSI_UFSHCD_PLATFORM=y
CONFIG_SCSI_HPTIOP=y
# CONFIG_SCSI_BUSLOGIC is not set
CONFIG_VMWARE_PVSCSI=y
CONFIG_LIBFC=y
CONFIG_LIBFCOE=y
# CONFIG_FCOE is not set
CONFIG_FCOE_FNIC=y
CONFIG_SCSI_DMX3191D=y
# CONFIG_SCSI_DTC3280 is not set
CONFIG_SCSI_EATA=y
# CONFIG_SCSI_EATA_TAGGED_QUEUE is not set
CONFIG_SCSI_EATA_LINKED_COMMANDS=y
CONFIG_SCSI_EATA_MAX_TAGS=16
CONFIG_SCSI_FUTURE_DOMAIN=y
CONFIG_SCSI_GDTH=y
# CONFIG_SCSI_ISCI is not set
CONFIG_SCSI_GENERIC_NCR5380=y
# CONFIG_SCSI_GENERIC_NCR5380_MMIO is not set
# CONFIG_SCSI_GENERIC_NCR53C400 is not set
CONFIG_SCSI_IPS=y
CONFIG_SCSI_INITIO=y
CONFIG_SCSI_INIA100=y
# CONFIG_SCSI_PPA is not set
CONFIG_SCSI_IMM=y
# CONFIG_SCSI_IZIP_EPP16 is not set
# CONFIG_SCSI_IZIP_SLOW_CTR is not set
CONFIG_SCSI_NCR53C406A=y
CONFIG_SCSI_STEX=y
CONFIG_SCSI_SYM53C8XX_2=y
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
CONFIG_SCSI_SYM53C8XX_MMIO=y
CONFIG_SCSI_IPR=y
# CONFIG_SCSI_IPR_TRACE is not set
CONFIG_SCSI_IPR_DUMP=y
# CONFIG_SCSI_PAS16 is not set
# CONFIG_SCSI_QLOGIC_FAS is not set
CONFIG_SCSI_QLOGIC_1280=y
CONFIG_SCSI_QLA_FC=y
CONFIG_TCM_QLA2XXX=y
CONFIG_SCSI_QLA_ISCSI=y
CONFIG_SCSI_LPFC=y
CONFIG_SCSI_LPFC_DEBUG_FS=y
CONFIG_SCSI_SYM53C416=y
CONFIG_SCSI_DC395x=y
CONFIG_SCSI_DC390T=y
CONFIG_SCSI_T128=y
# CONFIG_SCSI_U14_34F is not set
CONFIG_SCSI_ULTRASTOR=y
CONFIG_SCSI_NSP32=y
# CONFIG_SCSI_DEBUG is not set
CONFIG_SCSI_PMCRAID=y
CONFIG_SCSI_PM8001=y
# CONFIG_SCSI_SRP is not set
CONFIG_SCSI_BFA_FC=y
CONFIG_SCSI_VIRTIO=y
# CONFIG_SCSI_CHELSIO_FCOE is not set
# CONFIG_SCSI_LOWLEVEL_PCMCIA is not set
CONFIG_SCSI_DH=y
CONFIG_SCSI_DH_RDAC=y
CONFIG_SCSI_DH_HP_SW=y
CONFIG_SCSI_DH_EMC=y
CONFIG_SCSI_DH_ALUA=y
CONFIG_SCSI_OSD_INITIATOR=y
CONFIG_SCSI_OSD_ULD=y
CONFIG_SCSI_OSD_DPRINT_SENSE=1
# CONFIG_SCSI_OSD_DEBUG is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
# CONFIG_ATA_VERBOSE_ERROR is not set
# CONFIG_SATA_PMP is not set

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
CONFIG_SATA_AHCI_PLATFORM=y
CONFIG_AHCI_IMX=y
# CONFIG_SATA_INIC162X is not set
# CONFIG_SATA_ACARD_AHCI is not set
CONFIG_SATA_SIL24=y
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
# CONFIG_PDC_ADMA is not set
CONFIG_SATA_QSTOR=y
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
CONFIG_SATA_HIGHBANK=y
CONFIG_SATA_MV=y
CONFIG_SATA_NV=y
CONFIG_SATA_PROMISE=y
# CONFIG_SATA_RCAR is not set
# CONFIG_SATA_SIL is not set
CONFIG_SATA_SIS=y
CONFIG_SATA_SVW=y
CONFIG_SATA_ULI=y
CONFIG_SATA_VIA=y
CONFIG_SATA_VITESSE=y

#
# PATA SFF controllers with BMDMA
#
CONFIG_PATA_ALI=y
CONFIG_PATA_AMD=y
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
CONFIG_PATA_ATP867X=y
CONFIG_PATA_CMD64X=y
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
CONFIG_PATA_CS5535=y
CONFIG_PATA_CS5536=y
# CONFIG_PATA_CYPRESS is not set
CONFIG_PATA_EFAR=y
CONFIG_PATA_HPT366=y
CONFIG_PATA_HPT37X=y
CONFIG_PATA_HPT3X2N=y
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
CONFIG_PATA_IT821X=y
CONFIG_PATA_JMICRON=y
CONFIG_PATA_MARVELL=y
# CONFIG_PATA_NETCELL is not set
CONFIG_PATA_NINJA32=y
# CONFIG_PATA_NS87415 is not set
CONFIG_PATA_OLDPIIX=y
CONFIG_PATA_OPTIDMA=y
CONFIG_PATA_PDC2027X=y
CONFIG_PATA_PDC_OLD=y
CONFIG_PATA_RADISYS=y
CONFIG_PATA_RDC=y
CONFIG_PATA_SC1200=y
CONFIG_PATA_SCH=y
CONFIG_PATA_SERVERWORKS=y
# CONFIG_PATA_SIL680 is not set
CONFIG_PATA_SIS=y
CONFIG_PATA_TOSHIBA=y
CONFIG_PATA_TRIFLEX=y
CONFIG_PATA_VIA=y
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
CONFIG_PATA_CMD640_PCI=y
CONFIG_PATA_ISAPNP=y
CONFIG_PATA_MPIIX=y
CONFIG_PATA_NS87410=y
CONFIG_PATA_OPTI=y
# CONFIG_PATA_PCMCIA is not set
CONFIG_PATA_PLATFORM=y
CONFIG_PATA_OF_PLATFORM=y
# CONFIG_PATA_QDI is not set
CONFIG_PATA_RZ1000=y
CONFIG_PATA_WINBOND_VLB=y

#
# Generic fallback / legacy drivers
#
CONFIG_ATA_GENERIC=y
CONFIG_PATA_LEGACY=y
# CONFIG_MD is not set
CONFIG_TARGET_CORE=y
CONFIG_TCM_IBLOCK=y
CONFIG_TCM_FILEIO=y
# CONFIG_TCM_PSCSI is not set
CONFIG_LOOPBACK_TARGET=y
# CONFIG_TCM_FC is not set
CONFIG_ISCSI_TARGET=y
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
CONFIG_FIREWIRE_NOSY=y
CONFIG_I2O=y
CONFIG_I2O_LCT_NOTIFY_ON_CHANGES=y
# CONFIG_I2O_EXT_ADAPTEC is not set
CONFIG_I2O_CONFIG=y
# CONFIG_I2O_CONFIG_OLD_IOCTL is not set
CONFIG_I2O_BUS=y
# CONFIG_I2O_BLOCK is not set
CONFIG_I2O_SCSI=y
CONFIG_I2O_PROC=y
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_MII=y
CONFIG_NET_CORE=y
CONFIG_BONDING=y
# CONFIG_DUMMY is not set
CONFIG_EQUALIZER=y
CONFIG_NET_FC=y
CONFIG_NET_TEAM=y
CONFIG_NET_TEAM_MODE_BROADCAST=y
CONFIG_NET_TEAM_MODE_ROUNDROBIN=y
CONFIG_NET_TEAM_MODE_RANDOM=y
CONFIG_NET_TEAM_MODE_ACTIVEBACKUP=y
CONFIG_NET_TEAM_MODE_LOADBALANCE=y
CONFIG_MACVLAN=y
CONFIG_MACVTAP=y
CONFIG_VXLAN=y
CONFIG_NETCONSOLE=y
# CONFIG_NETCONSOLE_DYNAMIC is not set
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_TUN is not set
# CONFIG_VETH is not set
CONFIG_VIRTIO_NET=y
CONFIG_NLMON=y
CONFIG_SUNGEM_PHY=y
# CONFIG_ARCNET is not set
CONFIG_ATM_DRIVERS=y
CONFIG_ATM_DUMMY=y
# CONFIG_ATM_TCP is not set
# CONFIG_ATM_LANAI is not set
CONFIG_ATM_ENI=y
# CONFIG_ATM_ENI_DEBUG is not set
# CONFIG_ATM_ENI_TUNE_BURST is not set
# CONFIG_ATM_FIRESTREAM is not set
CONFIG_ATM_ZATM=y
CONFIG_ATM_ZATM_DEBUG=y
CONFIG_ATM_NICSTAR=y
CONFIG_ATM_NICSTAR_USE_SUNI=y
CONFIG_ATM_NICSTAR_USE_IDT77105=y
CONFIG_ATM_IDT77252=y
# CONFIG_ATM_IDT77252_DEBUG is not set
# CONFIG_ATM_IDT77252_RCV_ALL is not set
CONFIG_ATM_IDT77252_USE_SUNI=y
# CONFIG_ATM_AMBASSADOR is not set
CONFIG_ATM_HORIZON=y
# CONFIG_ATM_HORIZON_DEBUG is not set
# CONFIG_ATM_IA is not set
CONFIG_ATM_FORE200E=y
# CONFIG_ATM_FORE200E_USE_TASKLET is not set
CONFIG_ATM_FORE200E_TX_RETRY=16
CONFIG_ATM_FORE200E_DEBUG=0
CONFIG_ATM_HE=y
# CONFIG_ATM_HE_USE_SUNI is not set
CONFIG_ATM_SOLOS=y

#
# CAIF transport drivers
#
CONFIG_VHOST_NET=y
CONFIG_VHOST_RING=y
CONFIG_VHOST=y

#
# Distributed Switch Architecture drivers
#
# CONFIG_NET_DSA_MV88E6XXX is not set
# CONFIG_NET_DSA_MV88E6060 is not set
# CONFIG_NET_DSA_MV88E6XXX_NEED_PPU is not set
# CONFIG_NET_DSA_MV88E6131 is not set
# CONFIG_NET_DSA_MV88E6123_61_65 is not set
CONFIG_ETHERNET=y
CONFIG_MDIO=y
CONFIG_NET_VENDOR_3COM=y
# CONFIG_EL3 is not set
# CONFIG_3C515 is not set
# CONFIG_PCMCIA_3C574 is not set
CONFIG_PCMCIA_3C589=y
CONFIG_VORTEX=y
CONFIG_TYPHOON=y
CONFIG_NET_VENDOR_ADAPTEC=y
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_NET_VENDOR_ALTEON is not set
CONFIG_NET_VENDOR_AMD=y
CONFIG_AMD8111_ETH=y
# CONFIG_LANCE is not set
CONFIG_PCNET32=y
CONFIG_PCMCIA_NMCLAN=y
# CONFIG_NI65 is not set
# CONFIG_NET_VENDOR_ARC is not set
# CONFIG_NET_VENDOR_ATHEROS is not set
# CONFIG_NET_CADENCE is not set
CONFIG_NET_VENDOR_BROADCOM=y
CONFIG_B44=y
CONFIG_B44_PCI_AUTOSELECT=y
CONFIG_B44_PCICORE_AUTOSELECT=y
CONFIG_B44_PCI=y
CONFIG_BNX2=y
CONFIG_CNIC=y
CONFIG_TIGON3=y
# CONFIG_BNX2X is not set
# CONFIG_NET_VENDOR_BROCADE is not set
# CONFIG_NET_CALXEDA_XGMAC is not set
CONFIG_NET_VENDOR_CHELSIO=y
CONFIG_CHELSIO_T1=y
# CONFIG_CHELSIO_T1_1G is not set
# CONFIG_CHELSIO_T3 is not set
CONFIG_CHELSIO_T4=y
# CONFIG_CHELSIO_T4VF is not set
# CONFIG_NET_VENDOR_CIRRUS is not set
# CONFIG_NET_VENDOR_CISCO is not set
CONFIG_DNET=y
CONFIG_NET_VENDOR_DEC=y
CONFIG_NET_TULIP=y
CONFIG_DE2104X=y
CONFIG_DE2104X_DSL=0
CONFIG_TULIP=y
CONFIG_TULIP_MWI=y
CONFIG_TULIP_MMIO=y
CONFIG_TULIP_NAPI=y
# CONFIG_TULIP_NAPI_HW_MITIGATION is not set
CONFIG_DE4X5=y
CONFIG_WINBOND_840=y
CONFIG_DM9102=y
CONFIG_ULI526X=y
# CONFIG_NET_VENDOR_DLINK is not set
# CONFIG_NET_VENDOR_EMULEX is not set
CONFIG_NET_VENDOR_EXAR=y
CONFIG_S2IO=y
# CONFIG_VXGE is not set
# CONFIG_NET_VENDOR_FUJITSU is not set
CONFIG_NET_VENDOR_HP=y
# CONFIG_HP100 is not set
# CONFIG_NET_VENDOR_INTEL is not set
# CONFIG_IP1000 is not set
CONFIG_JME=y
# CONFIG_NET_VENDOR_MARVELL is not set
CONFIG_NET_VENDOR_MELLANOX=y
CONFIG_MLX4_EN=y
# CONFIG_MLX4_EN_DCB is not set
CONFIG_MLX4_CORE=y
CONFIG_MLX4_DEBUG=y
CONFIG_MLX5_CORE=y
CONFIG_NET_VENDOR_MICREL=y
CONFIG_KS8851_MLL=y
CONFIG_KSZ884X_PCI=y
# CONFIG_NET_VENDOR_MYRI is not set
CONFIG_FEALNX=y
CONFIG_NET_VENDOR_NATSEMI=y
CONFIG_NATSEMI=y
# CONFIG_NS83820 is not set
CONFIG_NET_VENDOR_8390=y
CONFIG_PCMCIA_AXNET=y
CONFIG_NE2000=y
CONFIG_NE2K_PCI=y
CONFIG_PCMCIA_PCNET=y
# CONFIG_ULTRA is not set
CONFIG_WD80x3=y
CONFIG_NET_VENDOR_NVIDIA=y
CONFIG_FORCEDETH=y
# CONFIG_NET_VENDOR_OKI is not set
CONFIG_ETHOC=y
CONFIG_NET_PACKET_ENGINE=y
# CONFIG_HAMACHI is not set
CONFIG_YELLOWFIN=y
# CONFIG_NET_VENDOR_QLOGIC is not set
# CONFIG_NET_VENDOR_REALTEK is not set
CONFIG_SH_ETH=y
# CONFIG_NET_VENDOR_RDC is not set
CONFIG_NET_VENDOR_SEEQ=y
# CONFIG_NET_VENDOR_SILAN is not set
# CONFIG_NET_VENDOR_SIS is not set
# CONFIG_SFC is not set
CONFIG_NET_VENDOR_SMSC=y
CONFIG_SMC9194=y
# CONFIG_PCMCIA_SMC91C92 is not set
CONFIG_EPIC100=y
# CONFIG_SMSC911X is not set
CONFIG_SMSC9420=y
CONFIG_NET_VENDOR_STMICRO=y
# CONFIG_STMMAC_ETH is not set
CONFIG_NET_VENDOR_SUN=y
# CONFIG_HAPPYMEAL is not set
CONFIG_SUNGEM=y
CONFIG_CASSINI=y
CONFIG_NIU=y
CONFIG_NET_VENDOR_TEHUTI=y
# CONFIG_TEHUTI is not set
# CONFIG_NET_VENDOR_TI is not set
CONFIG_NET_VENDOR_VIA=y
# CONFIG_VIA_RHINE is not set
CONFIG_VIA_VELOCITY=y
# CONFIG_NET_VENDOR_WIZNET is not set
CONFIG_NET_VENDOR_XIRCOM=y
CONFIG_PCMCIA_XIRC2PS=y
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
CONFIG_SKFP=y
CONFIG_HIPPI=y
# CONFIG_ROADRUNNER is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
CONFIG_AT803X_PHY=y
# CONFIG_AMD_PHY is not set
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
CONFIG_QSEMI_PHY=y
# CONFIG_LXT_PHY is not set
CONFIG_CICADA_PHY=y
# CONFIG_VITESSE_PHY is not set
CONFIG_SMSC_PHY=y
CONFIG_BROADCOM_PHY=y
CONFIG_BCM87XX_PHY=y
CONFIG_ICPLUS_PHY=y
# CONFIG_REALTEK_PHY is not set
CONFIG_NATIONAL_PHY=y
CONFIG_STE10XP=y
CONFIG_LSI_ET1011C_PHY=y
CONFIG_MICREL_PHY=y
# CONFIG_FIXED_PHY is not set
CONFIG_MDIO_BITBANG=y
# CONFIG_MDIO_GPIO is not set
CONFIG_MDIO_BUS_MUX=y
CONFIG_MDIO_BUS_MUX_GPIO=y
CONFIG_MDIO_BUS_MUX_MMIOREG=y
# CONFIG_PLIP is not set
# CONFIG_PPP is not set
CONFIG_SLIP=y
CONFIG_SLHC=y
CONFIG_SLIP_COMPRESSED=y
CONFIG_SLIP_SMART=y
CONFIG_SLIP_MODE_SLIP6=y

#
# USB Network Adapters
#
CONFIG_USB_CATC=y
# CONFIG_USB_KAWETH is not set
CONFIG_USB_PEGASUS=y
CONFIG_USB_RTL8150=y
CONFIG_USB_RTL8152=y
CONFIG_USB_USBNET=y
CONFIG_USB_NET_AX8817X=y
CONFIG_USB_NET_AX88179_178A=y
CONFIG_USB_NET_CDCETHER=y
CONFIG_USB_NET_CDC_EEM=y
CONFIG_USB_NET_CDC_NCM=y
CONFIG_USB_NET_CDC_MBIM=y
CONFIG_USB_NET_DM9601=y
# CONFIG_USB_NET_SR9700 is not set
CONFIG_USB_NET_SMSC75XX=y
CONFIG_USB_NET_SMSC95XX=y
CONFIG_USB_NET_GL620A=y
CONFIG_USB_NET_NET1080=y
# CONFIG_USB_NET_PLUSB is not set
# CONFIG_USB_NET_MCS7830 is not set
CONFIG_USB_NET_RNDIS_HOST=y
CONFIG_USB_NET_CDC_SUBSET=y
CONFIG_USB_ALI_M5632=y
# CONFIG_USB_AN2720 is not set
# CONFIG_USB_BELKIN is not set
# CONFIG_USB_ARMLINUX is not set
# CONFIG_USB_EPSON2888 is not set
# CONFIG_USB_KC2190 is not set
CONFIG_USB_NET_ZAURUS=y
CONFIG_USB_NET_CX82310_ETH=y
CONFIG_USB_NET_KALMIA=y
CONFIG_USB_NET_QMI_WWAN=y
CONFIG_USB_NET_INT51X1=y
# CONFIG_USB_CDC_PHONET is not set
CONFIG_USB_IPHETH=y
# CONFIG_USB_SIERRA_NET is not set
# CONFIG_USB_VL600 is not set
CONFIG_WLAN=y
# CONFIG_PCMCIA_RAYCS is not set
CONFIG_LIBERTAS_THINFIRM=y
CONFIG_LIBERTAS_THINFIRM_DEBUG=y
# CONFIG_LIBERTAS_THINFIRM_USB is not set
# CONFIG_AIRO is not set
CONFIG_ATMEL=y
# CONFIG_PCI_ATMEL is not set
# CONFIG_PCMCIA_ATMEL is not set
# CONFIG_AT76C50X_USB is not set
CONFIG_AIRO_CS=y
CONFIG_PCMCIA_WL3501=y
CONFIG_PRISM54=y
# CONFIG_USB_ZD1201 is not set
# CONFIG_USB_NET_RNDIS_WLAN is not set
CONFIG_RTL8180=y
# CONFIG_RTL8187 is not set
CONFIG_ADM8211=y
# CONFIG_MAC80211_HWSIM is not set
CONFIG_MWL8K=y
CONFIG_ATH_COMMON=y
CONFIG_ATH_CARDS=y
# CONFIG_ATH_DEBUG is not set
CONFIG_ATH5K=y
# CONFIG_ATH5K_DEBUG is not set
# CONFIG_ATH5K_TRACER is not set
CONFIG_ATH5K_PCI=y
CONFIG_ATH9K_HW=y
CONFIG_ATH9K_COMMON=y
# CONFIG_ATH9K_BTCOEX_SUPPORT is not set
CONFIG_ATH9K=y
CONFIG_ATH9K_PCI=y
CONFIG_ATH9K_AHB=y
CONFIG_ATH9K_DEBUGFS=y
# CONFIG_ATH9K_LEGACY_RATE_CONTROL is not set
CONFIG_ATH9K_HTC=y
CONFIG_ATH9K_HTC_DEBUGFS=y
CONFIG_CARL9170=y
# CONFIG_CARL9170_LEDS is not set
CONFIG_CARL9170_DEBUGFS=y
CONFIG_CARL9170_WPC=y
# CONFIG_CARL9170_HWRNG is not set
CONFIG_ATH6KL=y
CONFIG_ATH6KL_SDIO=y
CONFIG_ATH6KL_USB=y
CONFIG_ATH6KL_DEBUG=y
# CONFIG_ATH6KL_TRACING is not set
# CONFIG_AR5523 is not set
CONFIG_WIL6210=y
CONFIG_WIL6210_ISR_COR=y
CONFIG_WIL6210_TRACING=y
CONFIG_ATH10K=y
# CONFIG_ATH10K_PCI is not set
# CONFIG_ATH10K_DEBUG is not set
# CONFIG_ATH10K_DEBUGFS is not set
# CONFIG_ATH10K_TRACING is not set
# CONFIG_B43 is not set
# CONFIG_B43LEGACY is not set
CONFIG_BRCMUTIL=y
CONFIG_BRCMFMAC=y
CONFIG_BRCMFMAC_SDIO=y
# CONFIG_BRCMFMAC_USB is not set
CONFIG_BRCM_TRACING=y
# CONFIG_BRCMDBG is not set
CONFIG_HOSTAP=y
# CONFIG_HOSTAP_FIRMWARE is not set
# CONFIG_HOSTAP_PLX is not set
CONFIG_HOSTAP_PCI=y
CONFIG_HOSTAP_CS=y
CONFIG_IPW2100=y
# CONFIG_IPW2100_MONITOR is not set
CONFIG_IPW2100_DEBUG=y
CONFIG_IPW2200=y
CONFIG_IPW2200_MONITOR=y
# CONFIG_IPW2200_RADIOTAP is not set
# CONFIG_IPW2200_PROMISCUOUS is not set
CONFIG_IPW2200_QOS=y
CONFIG_IPW2200_DEBUG=y
CONFIG_LIBIPW=y
# CONFIG_LIBIPW_DEBUG is not set
CONFIG_IWLWIFI=y
CONFIG_IWLDVM=y
CONFIG_IWLMVM=y

#
# Debugging Options
#
# CONFIG_IWLWIFI_DEBUG is not set
CONFIG_IWLWIFI_DEBUGFS=y
# CONFIG_IWLWIFI_DEVICE_TRACING is not set
CONFIG_IWLEGACY=y
CONFIG_IWL4965=y
CONFIG_IWL3945=y

#
# iwl3945 / iwl4965 Debugging Options
#
# CONFIG_IWLEGACY_DEBUG is not set
# CONFIG_IWLEGACY_DEBUGFS is not set
# CONFIG_LIBERTAS is not set
CONFIG_HERMES=y
CONFIG_HERMES_PRISM=y
# CONFIG_HERMES_CACHE_FW_ON_INIT is not set
CONFIG_PLX_HERMES=y
# CONFIG_TMD_HERMES is not set
# CONFIG_NORTEL_HERMES is not set
# CONFIG_PCI_HERMES is not set
CONFIG_PCMCIA_HERMES=y
CONFIG_PCMCIA_SPECTRUM=y
# CONFIG_ORINOCO_USB is not set
CONFIG_P54_COMMON=y
# CONFIG_P54_USB is not set
# CONFIG_P54_PCI is not set
CONFIG_P54_LEDS=y
CONFIG_RT2X00=y
# CONFIG_RT2400PCI is not set
CONFIG_RT2500PCI=y
CONFIG_RT61PCI=y
CONFIG_RT2800PCI=y
CONFIG_RT2800PCI_RT33XX=y
CONFIG_RT2800PCI_RT35XX=y
CONFIG_RT2800PCI_RT53XX=y
# CONFIG_RT2800PCI_RT3290 is not set
# CONFIG_RT2500USB is not set
# CONFIG_RT73USB is not set
CONFIG_RT2800USB=y
# CONFIG_RT2800USB_RT33XX is not set
# CONFIG_RT2800USB_RT35XX is not set
# CONFIG_RT2800USB_RT3573 is not set
CONFIG_RT2800USB_RT53XX=y
CONFIG_RT2800USB_RT55XX=y
# CONFIG_RT2800USB_UNKNOWN is not set
CONFIG_RT2800_LIB=y
CONFIG_RT2X00_LIB_MMIO=y
CONFIG_RT2X00_LIB_PCI=y
CONFIG_RT2X00_LIB_USB=y
CONFIG_RT2X00_LIB=y
CONFIG_RT2X00_LIB_FIRMWARE=y
CONFIG_RT2X00_LIB_CRYPTO=y
CONFIG_RT2X00_LIB_LEDS=y
# CONFIG_RT2X00_LIB_DEBUGFS is not set
CONFIG_RT2X00_DEBUG=y
CONFIG_RTL_CARDS=y
# CONFIG_RTL8192CE is not set
CONFIG_RTL8192SE=y
# CONFIG_RTL8192DE is not set
CONFIG_RTL8723AE=y
CONFIG_RTL8188EE=y
CONFIG_RTL8192CU=y
CONFIG_RTLWIFI=y
CONFIG_RTLWIFI_PCI=y
CONFIG_RTLWIFI_USB=y
CONFIG_RTLWIFI_DEBUG=y
CONFIG_RTL8192C_COMMON=y
CONFIG_WL_TI=y
CONFIG_WL1251=y
# CONFIG_WL1251_SDIO is not set
CONFIG_WL12XX=y
# CONFIG_WL18XX is not set
CONFIG_WLCORE=y
# CONFIG_WLCORE_SDIO is not set
CONFIG_ZD1211RW=y
CONFIG_ZD1211RW_DEBUG=y
# CONFIG_MWIFIEX is not set
CONFIG_CW1200=y
CONFIG_CW1200_WLAN_SDIO=y

#
# WiMAX Wireless Broadband devices
#
# CONFIG_WIMAX_I2400M_USB is not set
CONFIG_WAN=y
# CONFIG_HDLC is not set
# CONFIG_DLCI is not set
CONFIG_LAPBETHER=y
CONFIG_X25_ASY=y
# CONFIG_SBNI is not set
# CONFIG_IEEE802154_DRIVERS is not set
CONFIG_VMXNET3=y
# CONFIG_ISDN is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=y
CONFIG_INPUT_SPARSEKMAP=y
CONFIG_INPUT_MATRIXKMAP=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=y
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ADP5588=y
CONFIG_KEYBOARD_ADP5589=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1070 is not set
CONFIG_KEYBOARD_QT2160=y
# CONFIG_KEYBOARD_LKKBD is not set
CONFIG_KEYBOARD_GPIO=y
CONFIG_KEYBOARD_GPIO_POLLED=y
CONFIG_KEYBOARD_TCA6416=y
CONFIG_KEYBOARD_TCA8418=y
CONFIG_KEYBOARD_MATRIX=y
CONFIG_KEYBOARD_LM8323=y
CONFIG_KEYBOARD_LM8333=y
CONFIG_KEYBOARD_MAX7359=y
CONFIG_KEYBOARD_MCS=y
CONFIG_KEYBOARD_MPR121=y
CONFIG_KEYBOARD_NEWTON=y
CONFIG_KEYBOARD_OPENCORES=y
CONFIG_KEYBOARD_GOLDFISH_EVENTS=y
CONFIG_KEYBOARD_STOWAWAY=y
CONFIG_KEYBOARD_SUNKBD=y
CONFIG_KEYBOARD_TC3589X=y
CONFIG_KEYBOARD_TWL4030=y
CONFIG_KEYBOARD_XTKBD=y
CONFIG_KEYBOARD_CROS_EC=y
# CONFIG_INPUT_MOUSE is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_88PM860X is not set
# CONFIG_TOUCHSCREEN_AD7879 is not set
# CONFIG_TOUCHSCREEN_ATMEL_MXT is not set
CONFIG_TOUCHSCREEN_AUO_PIXCIR=y
# CONFIG_TOUCHSCREEN_BU21013 is not set
CONFIG_TOUCHSCREEN_CY8CTMG110=y
CONFIG_TOUCHSCREEN_CYTTSP_CORE=y
CONFIG_TOUCHSCREEN_CYTTSP_I2C=y
# CONFIG_TOUCHSCREEN_CYTTSP4_CORE is not set
CONFIG_TOUCHSCREEN_DA9034=y
CONFIG_TOUCHSCREEN_DYNAPRO=y
CONFIG_TOUCHSCREEN_HAMPSHIRE=y
CONFIG_TOUCHSCREEN_EETI=y
CONFIG_TOUCHSCREEN_EGALAX=y
CONFIG_TOUCHSCREEN_FUJITSU=y
# CONFIG_TOUCHSCREEN_ILI210X is not set
CONFIG_TOUCHSCREEN_GUNZE=y
CONFIG_TOUCHSCREEN_ELO=y
CONFIG_TOUCHSCREEN_WACOM_W8001=y
CONFIG_TOUCHSCREEN_WACOM_I2C=y
CONFIG_TOUCHSCREEN_MAX11801=y
CONFIG_TOUCHSCREEN_MCS5000=y
CONFIG_TOUCHSCREEN_MMS114=y
CONFIG_TOUCHSCREEN_MTOUCH=y
# CONFIG_TOUCHSCREEN_INEXIO is not set
CONFIG_TOUCHSCREEN_MK712=y
CONFIG_TOUCHSCREEN_HTCPEN=y
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_EDT_FT5X06 is not set
CONFIG_TOUCHSCREEN_TOUCHRIGHT=y
CONFIG_TOUCHSCREEN_TOUCHWIN=y
CONFIG_TOUCHSCREEN_TI_AM335X_TSC=y
# CONFIG_TOUCHSCREEN_PIXCIR is not set
CONFIG_TOUCHSCREEN_WM831X=y
CONFIG_TOUCHSCREEN_USB_COMPOSITE=y
# CONFIG_TOUCHSCREEN_USB_EGALAX is not set
CONFIG_TOUCHSCREEN_USB_PANJIT=y
# CONFIG_TOUCHSCREEN_USB_3M is not set
CONFIG_TOUCHSCREEN_USB_ITM=y
CONFIG_TOUCHSCREEN_USB_ETURBO=y
# CONFIG_TOUCHSCREEN_USB_GUNZE is not set
# CONFIG_TOUCHSCREEN_USB_DMC_TSC10 is not set
CONFIG_TOUCHSCREEN_USB_IRTOUCH=y
CONFIG_TOUCHSCREEN_USB_IDEALTEK=y
CONFIG_TOUCHSCREEN_USB_GENERAL_TOUCH=y
# CONFIG_TOUCHSCREEN_USB_GOTOP is not set
# CONFIG_TOUCHSCREEN_USB_JASTEC is not set
# CONFIG_TOUCHSCREEN_USB_ELO is not set
# CONFIG_TOUCHSCREEN_USB_E2I is not set
# CONFIG_TOUCHSCREEN_USB_ZYTRONIC is not set
# CONFIG_TOUCHSCREEN_USB_ETT_TC45USB is not set
# CONFIG_TOUCHSCREEN_USB_NEXIO is not set
CONFIG_TOUCHSCREEN_USB_EASYTOUCH=y
CONFIG_TOUCHSCREEN_TOUCHIT213=y
# CONFIG_TOUCHSCREEN_TSC_SERIO is not set
CONFIG_TOUCHSCREEN_TSC2007=y
# CONFIG_TOUCHSCREEN_ST1232 is not set
CONFIG_TOUCHSCREEN_TPS6507X=y
CONFIG_INPUT_MISC=y
CONFIG_INPUT_88PM860X_ONKEY=y
# CONFIG_INPUT_88PM80X_ONKEY is not set
CONFIG_INPUT_AD714X=y
# CONFIG_INPUT_AD714X_I2C is not set
CONFIG_INPUT_BMA150=y
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_MMA8450 is not set
CONFIG_INPUT_MPU3050=y
# CONFIG_INPUT_APANEL is not set
CONFIG_INPUT_GP2A=y
CONFIG_INPUT_GPIO_TILT_POLLED=y
# CONFIG_INPUT_WISTRON_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
CONFIG_INPUT_KEYSPAN_REMOTE=y
# CONFIG_INPUT_KXTJ9 is not set
CONFIG_INPUT_POWERMATE=y
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_TWL4030_PWRBUTTON=y
CONFIG_INPUT_TWL4030_VIBRA=y
CONFIG_INPUT_TWL6040_VIBRA=y
CONFIG_INPUT_UINPUT=y
CONFIG_INPUT_PCF50633_PMU=y
CONFIG_INPUT_PCF8574=y
# CONFIG_INPUT_PWM_BEEPER is not set
# CONFIG_INPUT_GPIO_ROTARY_ENCODER is not set
CONFIG_INPUT_WM831X_ON=y
CONFIG_INPUT_ADXL34X=y
# CONFIG_INPUT_ADXL34X_I2C is not set
# CONFIG_INPUT_IMS_PCU is not set
# CONFIG_INPUT_CMA3000 is not set
CONFIG_INPUT_IDEAPAD_SLIDEBAR=y

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
CONFIG_SERIO_CT82C710=y
CONFIG_SERIO_PARKBD=y
CONFIG_SERIO_PCIPS2=y
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
CONFIG_SERIO_ALTERA_PS2=y
CONFIG_SERIO_PS2MULT=y
CONFIG_SERIO_ARC_PS2=y
CONFIG_SERIO_APBPS2=y
CONFIG_SERIO_OLPC_APSP=y
CONFIG_GAMEPORT=y
CONFIG_GAMEPORT_NS558=y
CONFIG_GAMEPORT_L4=y
# CONFIG_GAMEPORT_EMU10K1 is not set
CONFIG_GAMEPORT_FM801=y

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
# CONFIG_SERIAL_NONSTANDARD is not set
CONFIG_NOZOMI=y
CONFIG_N_GSM=y
# CONFIG_TRACE_SINK is not set
# CONFIG_GOLDFISH_TTY is not set
CONFIG_DEVKMEM=y

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_DEPRECATED_OPTIONS=y
# CONFIG_SERIAL_8250_PNP is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
# CONFIG_SERIAL_8250_CS is not set
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
# CONFIG_SERIAL_8250_MANY_PORTS is not set
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y
CONFIG_SERIAL_8250_DW=y

#
# Non-8250 serial port support
#
CONFIG_SERIAL_KGDB_NMI=y
CONFIG_SERIAL_MFD_HSU=y
CONFIG_SERIAL_MFD_HSU_CONSOLE=y
CONFIG_SERIAL_UARTLITE=y
# CONFIG_SERIAL_UARTLITE_CONSOLE is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_CONSOLE_POLL=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_OF_PLATFORM is not set
# CONFIG_SERIAL_SCCNXP is not set
CONFIG_SERIAL_TIMBERDALE=y
CONFIG_SERIAL_ALTERA_JTAGUART=y
CONFIG_SERIAL_ALTERA_JTAGUART_CONSOLE=y
CONFIG_SERIAL_ALTERA_JTAGUART_CONSOLE_BYPASS=y
CONFIG_SERIAL_ALTERA_UART=y
CONFIG_SERIAL_ALTERA_UART_MAXPORTS=4
CONFIG_SERIAL_ALTERA_UART_BAUDRATE=115200
# CONFIG_SERIAL_ALTERA_UART_CONSOLE is not set
CONFIG_SERIAL_PCH_UART=y
CONFIG_SERIAL_PCH_UART_CONSOLE=y
CONFIG_SERIAL_XILINX_PS_UART=y
CONFIG_SERIAL_XILINX_PS_UART_CONSOLE=y
# CONFIG_SERIAL_ARC is not set
CONFIG_SERIAL_RP2=y
CONFIG_SERIAL_RP2_NR_UARTS=32
CONFIG_SERIAL_FSL_LPUART=y
# CONFIG_SERIAL_FSL_LPUART_CONSOLE is not set
CONFIG_SERIAL_ST_ASC=y
# CONFIG_SERIAL_ST_ASC_CONSOLE is not set
# CONFIG_TTY_PRINTK is not set
CONFIG_PRINTER=y
# CONFIG_LP_CONSOLE is not set
CONFIG_PPDEV=y
CONFIG_HVC_DRIVER=y
CONFIG_VIRTIO_CONSOLE=y
CONFIG_IPMI_HANDLER=y
CONFIG_IPMI_PANIC_EVENT=y
CONFIG_IPMI_PANIC_STRING=y
CONFIG_IPMI_DEVICE_INTERFACE=y
CONFIG_IPMI_SI=y
CONFIG_IPMI_WATCHDOG=y
CONFIG_IPMI_POWEROFF=y
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_TIMERIOMEM=y
CONFIG_HW_RANDOM_INTEL=y
CONFIG_HW_RANDOM_AMD=y
CONFIG_HW_RANDOM_GEODE=y
CONFIG_HW_RANDOM_VIA=y
CONFIG_HW_RANDOM_VIRTIO=y
CONFIG_HW_RANDOM_TPM=y
# CONFIG_NVRAM is not set
CONFIG_DTLK=y
# CONFIG_R3964 is not set
CONFIG_APPLICOM=y
# CONFIG_SONYPI is not set

#
# PCMCIA character devices
#
CONFIG_SYNCLINK_CS=y
CONFIG_CARDMAN_4000=y
CONFIG_CARDMAN_4040=y
# CONFIG_IPWIRELESS is not set
CONFIG_MWAVE=y
# CONFIG_SCx200_GPIO is not set
CONFIG_PC8736x_GPIO=y
CONFIG_NSC_GPIO=y
CONFIG_RAW_DRIVER=y
CONFIG_MAX_RAW_DEVS=256
CONFIG_HANGCHECK_TIMER=y
CONFIG_TCG_TPM=y
CONFIG_TCG_TIS=y
CONFIG_TCG_TIS_I2C_INFINEON=y
CONFIG_TCG_NSC=y
# CONFIG_TCG_ATMEL is not set
CONFIG_TCG_INFINEON=y
# CONFIG_TCG_ST33_I2C is not set
CONFIG_TELCLOCK=y
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
# CONFIG_I2C_COMPAT is not set
CONFIG_I2C_CHARDEV=y
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_SMBUS=y
CONFIG_I2C_ALGOBIT=y
CONFIG_I2C_ALGOPCA=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
CONFIG_I2C_ALI1535=y
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
CONFIG_I2C_AMD756=y
CONFIG_I2C_AMD756_S4882=y
# CONFIG_I2C_AMD8111 is not set
CONFIG_I2C_I801=y
CONFIG_I2C_ISCH=y
CONFIG_I2C_ISMT=y
CONFIG_I2C_PIIX4=y
CONFIG_I2C_NFORCE2=y
CONFIG_I2C_NFORCE2_S4985=y
CONFIG_I2C_SIS5595=y
CONFIG_I2C_SIS630=y
# CONFIG_I2C_SIS96X is not set
CONFIG_I2C_VIA=y
CONFIG_I2C_VIAPRO=y

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_CBUS_GPIO is not set
CONFIG_I2C_DESIGNWARE_CORE=y
CONFIG_I2C_DESIGNWARE_PCI=y
CONFIG_I2C_EG20T=y
CONFIG_I2C_GPIO=y
CONFIG_I2C_KEMPLD=y
# CONFIG_I2C_OCORES is not set
CONFIG_I2C_PCA_PLATFORM=y
# CONFIG_I2C_PXA is not set
# CONFIG_I2C_PXA_PCI is not set
CONFIG_I2C_SIMTEC=y
CONFIG_I2C_XILINX=y

#
# External I2C/SMBus adapter drivers
#
CONFIG_I2C_DIOLAN_U2C=y
CONFIG_I2C_PARPORT=y
CONFIG_I2C_PARPORT_LIGHT=y
CONFIG_I2C_TAOS_EVM=y
CONFIG_I2C_TINY_USB=y
# CONFIG_I2C_VIPERBOARD is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_ELEKTOR is not set
# CONFIG_I2C_PCA_ISA is not set
CONFIG_SCx200_ACB=y
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
CONFIG_I2C_DEBUG_BUS=y
# CONFIG_SPI is not set
CONFIG_HSI=y
CONFIG_HSI_BOARDINFO=y

#
# HSI clients
#
CONFIG_HSI_CHAR=y

#
# PPS support
#
CONFIG_PPS=y
# CONFIG_PPS_DEBUG is not set

#
# PPS clients support
#
CONFIG_PPS_CLIENT_KTIMER=y
CONFIG_PPS_CLIENT_LDISC=y
# CONFIG_PPS_CLIENT_PARPORT is not set
CONFIG_PPS_CLIENT_GPIO=y

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=y
# CONFIG_DP83640_PHY is not set
# CONFIG_PTP_1588_CLOCK_PCH is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIO_DEVRES=y
CONFIG_GPIOLIB=y
CONFIG_OF_GPIO=y
CONFIG_DEBUG_GPIO=y
CONFIG_GPIO_SYSFS=y
CONFIG_GPIO_GENERIC=y
CONFIG_GPIO_MAX730X=y

#
# Memory mapped GPIO drivers:
#
CONFIG_GPIO_GENERIC_PLATFORM=y
CONFIG_GPIO_IT8761E=y
# CONFIG_GPIO_F7188X is not set
CONFIG_GPIO_TS5500=y
CONFIG_GPIO_SCH=y
CONFIG_GPIO_ICH=y
CONFIG_GPIO_VX855=y
CONFIG_GPIO_GRGPIO=y

#
# I2C GPIO expanders:
#
CONFIG_GPIO_MAX7300=y
CONFIG_GPIO_MAX732X=y
# CONFIG_GPIO_MAX732X_IRQ is not set
CONFIG_GPIO_PCA953X=y
CONFIG_GPIO_PCA953X_IRQ=y
CONFIG_GPIO_PCF857X=y
CONFIG_GPIO_SX150X=y
CONFIG_GPIO_TC3589X=y
CONFIG_GPIO_TWL4030=y
CONFIG_GPIO_TWL6040=y
# CONFIG_GPIO_WM831X is not set
CONFIG_GPIO_WM8350=y
# CONFIG_GPIO_WM8994 is not set
CONFIG_GPIO_ADP5588=y
# CONFIG_GPIO_ADP5588_IRQ is not set
CONFIG_GPIO_ADNP=y

#
# PCI GPIO expanders:
#
CONFIG_GPIO_CS5535=y
CONFIG_GPIO_BT8XX=y
CONFIG_GPIO_AMD8111=y
CONFIG_GPIO_LANGWELL=y
CONFIG_GPIO_PCH=y
CONFIG_GPIO_ML_IOH=y
CONFIG_GPIO_SODAVILLE=y
CONFIG_GPIO_TIMBERDALE=y
# CONFIG_GPIO_RDC321X is not set

#
# SPI GPIO expanders:
#
CONFIG_GPIO_MCP23S08=y

#
# AC97 GPIO expanders:
#

#
# LPC GPIO expanders:
#
CONFIG_GPIO_KEMPLD=y

#
# MODULbus GPIO expanders:
#
CONFIG_GPIO_JANZ_TTL=y
# CONFIG_GPIO_TPS65910 is not set

#
# USB GPIO expanders:
#
# CONFIG_GPIO_VIPERBOARD is not set
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
CONFIG_POWER_SUPPLY_DEBUG=y
CONFIG_PDA_POWER=y
# CONFIG_GENERIC_ADC_BATTERY is not set
# CONFIG_WM831X_BACKUP is not set
# CONFIG_WM831X_POWER is not set
CONFIG_WM8350_POWER=y
# CONFIG_TEST_POWER is not set
CONFIG_BATTERY_88PM860X=y
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_OLPC is not set
CONFIG_BATTERY_SBS=y
CONFIG_BATTERY_BQ27x00=y
# CONFIG_BATTERY_BQ27X00_I2C is not set
# CONFIG_BATTERY_BQ27X00_PLATFORM is not set
CONFIG_BATTERY_DA9030=y
CONFIG_BATTERY_MAX17040=y
# CONFIG_BATTERY_MAX17042 is not set
CONFIG_BATTERY_TWL4030_MADC=y
CONFIG_CHARGER_88PM860X=y
CONFIG_CHARGER_PCF50633=y
CONFIG_BATTERY_RX51=y
# CONFIG_CHARGER_ISP1704 is not set
# CONFIG_CHARGER_MAX8903 is not set
CONFIG_CHARGER_TWL4030=y
CONFIG_CHARGER_LP8727=y
CONFIG_CHARGER_LP8788=y
CONFIG_CHARGER_GPIO=y
# CONFIG_CHARGER_MANAGER is not set
# CONFIG_CHARGER_MAX8998 is not set
CONFIG_CHARGER_BQ2415X=y
CONFIG_CHARGER_BQ24190=y
CONFIG_CHARGER_SMB347=y
CONFIG_CHARGER_TPS65090=y
CONFIG_BATTERY_GOLDFISH=y
# CONFIG_POWER_RESET is not set
# CONFIG_POWER_AVS is not set
CONFIG_HWMON=y
CONFIG_HWMON_VID=y
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_AD7414 is not set
CONFIG_SENSORS_AD7418=y
CONFIG_SENSORS_ADM1021=y
CONFIG_SENSORS_ADM1025=y
CONFIG_SENSORS_ADM1026=y
CONFIG_SENSORS_ADM1029=y
# CONFIG_SENSORS_ADM1031 is not set
CONFIG_SENSORS_ADM9240=y
CONFIG_SENSORS_ADT7X10=y
CONFIG_SENSORS_ADT7410=y
CONFIG_SENSORS_ADT7411=y
CONFIG_SENSORS_ADT7462=y
CONFIG_SENSORS_ADT7470=y
CONFIG_SENSORS_ADT7475=y
CONFIG_SENSORS_ASC7621=y
# CONFIG_SENSORS_K8TEMP is not set
CONFIG_SENSORS_K10TEMP=y
CONFIG_SENSORS_FAM15H_POWER=y
CONFIG_SENSORS_ASB100=y
CONFIG_SENSORS_ATXP1=y
CONFIG_SENSORS_DS620=y
CONFIG_SENSORS_DS1621=y
CONFIG_SENSORS_I5K_AMB=y
# CONFIG_SENSORS_F71805F is not set
CONFIG_SENSORS_F71882FG=y
CONFIG_SENSORS_F75375S=y
CONFIG_SENSORS_FSCHMD=y
CONFIG_SENSORS_G760A=y
# CONFIG_SENSORS_G762 is not set
# CONFIG_SENSORS_GL518SM is not set
CONFIG_SENSORS_GL520SM=y
# CONFIG_SENSORS_GPIO_FAN is not set
CONFIG_SENSORS_HIH6130=y
CONFIG_SENSORS_HTU21=y
CONFIG_SENSORS_CORETEMP=y
CONFIG_SENSORS_IBMAEM=y
# CONFIG_SENSORS_IBMPEX is not set
CONFIG_SENSORS_IIO_HWMON=y
CONFIG_SENSORS_IT87=y
# CONFIG_SENSORS_JC42 is not set
CONFIG_SENSORS_LINEAGE=y
CONFIG_SENSORS_LM63=y
CONFIG_SENSORS_LM73=y
# CONFIG_SENSORS_LM75 is not set
CONFIG_SENSORS_LM77=y
CONFIG_SENSORS_LM78=y
CONFIG_SENSORS_LM80=y
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
CONFIG_SENSORS_LTC4245=y
# CONFIG_SENSORS_LTC4261 is not set
CONFIG_SENSORS_LM95234=y
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
CONFIG_SENSORS_MAX16065=y
CONFIG_SENSORS_MAX1619=y
CONFIG_SENSORS_MAX1668=y
CONFIG_SENSORS_MAX197=y
CONFIG_SENSORS_MAX6639=y
CONFIG_SENSORS_MAX6642=y
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MAX6697 is not set
# CONFIG_SENSORS_MCP3021 is not set
CONFIG_SENSORS_NCT6775=y
CONFIG_SENSORS_NTC_THERMISTOR=y
CONFIG_SENSORS_PC87360=y
# CONFIG_SENSORS_PC87427 is not set
CONFIG_SENSORS_PCF8591=y
CONFIG_PMBUS=y
CONFIG_SENSORS_PMBUS=y
CONFIG_SENSORS_ADM1275=y
# CONFIG_SENSORS_LM25066 is not set
CONFIG_SENSORS_LTC2978=y
CONFIG_SENSORS_MAX16064=y
CONFIG_SENSORS_MAX34440=y
CONFIG_SENSORS_MAX8688=y
# CONFIG_SENSORS_UCD9000 is not set
CONFIG_SENSORS_UCD9200=y
CONFIG_SENSORS_ZL6100=y
CONFIG_SENSORS_SHT15=y
# CONFIG_SENSORS_SHT21 is not set
CONFIG_SENSORS_SIS5595=y
CONFIG_SENSORS_SMM665=y
CONFIG_SENSORS_DME1737=y
# CONFIG_SENSORS_EMC1403 is not set
CONFIG_SENSORS_EMC2103=y
# CONFIG_SENSORS_EMC6W201 is not set
CONFIG_SENSORS_SMSC47M1=y
# CONFIG_SENSORS_SMSC47M192 is not set
CONFIG_SENSORS_SMSC47B397=y
# CONFIG_SENSORS_SCH56XX_COMMON is not set
CONFIG_SENSORS_ADS1015=y
CONFIG_SENSORS_ADS7828=y
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA209 is not set
CONFIG_SENSORS_INA2XX=y
CONFIG_SENSORS_THMC50=y
CONFIG_SENSORS_TMP102=y
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_TWL4030_MADC is not set
CONFIG_SENSORS_VIA_CPUTEMP=y
# CONFIG_SENSORS_VIA686A is not set
CONFIG_SENSORS_VT1211=y
CONFIG_SENSORS_VT8231=y
# CONFIG_SENSORS_W83781D is not set
CONFIG_SENSORS_W83791D=y
# CONFIG_SENSORS_W83792D is not set
CONFIG_SENSORS_W83793=y
CONFIG_SENSORS_W83795=y
CONFIG_SENSORS_W83795_FANCTRL=y
CONFIG_SENSORS_W83L785TS=y
CONFIG_SENSORS_W83L786NG=y
CONFIG_SENSORS_W83627HF=y
CONFIG_SENSORS_W83627EHF=y
CONFIG_SENSORS_WM831X=y
CONFIG_SENSORS_WM8350=y
CONFIG_SENSORS_APPLESMC=y
CONFIG_THERMAL=y
# CONFIG_THERMAL_HWMON is not set
# CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE is not set
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE=y
CONFIG_THERMAL_GOV_FAIR_SHARE=y
# CONFIG_THERMAL_GOV_STEP_WISE is not set
CONFIG_THERMAL_GOV_USER_SPACE=y
CONFIG_THERMAL_EMULATION=y
CONFIG_X86_PKG_TEMP_THERMAL=y

#
# Texas Instruments thermal drivers
#
# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
CONFIG_SSB=y
CONFIG_SSB_SPROM=y
CONFIG_SSB_PCIHOST_POSSIBLE=y
CONFIG_SSB_PCIHOST=y
# CONFIG_SSB_B43_PCI_BRIDGE is not set
CONFIG_SSB_PCMCIAHOST_POSSIBLE=y
# CONFIG_SSB_PCMCIAHOST is not set
CONFIG_SSB_SDIOHOST_POSSIBLE=y
# CONFIG_SSB_SDIOHOST is not set
CONFIG_SSB_SILENT=y
CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
CONFIG_SSB_DRIVER_PCICORE=y
CONFIG_SSB_DRIVER_GPIO=y
CONFIG_BCMA_POSSIBLE=y

#
# Broadcom specific AMBA
#
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
CONFIG_MFD_CORE=y
CONFIG_MFD_CS5535=y
CONFIG_MFD_AS3711=y
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_AAT2870_CORE is not set
CONFIG_MFD_CROS_EC=y
CONFIG_MFD_CROS_EC_I2C=y
CONFIG_PMIC_DA903X=y
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_MFD_DA9063 is not set
# CONFIG_MFD_MC13XXX_I2C is not set
CONFIG_HTC_PASIC3=y
CONFIG_HTC_I2CPLD=y
CONFIG_LPC_ICH=y
CONFIG_LPC_SCH=y
CONFIG_MFD_JANZ_CMODIO=y
CONFIG_MFD_KEMPLD=y
CONFIG_MFD_88PM800=y
# CONFIG_MFD_88PM805 is not set
CONFIG_MFD_88PM860X=y
# CONFIG_MFD_MAX77686 is not set
CONFIG_MFD_MAX77693=y
# CONFIG_MFD_MAX8907 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
CONFIG_MFD_MAX8998=y
CONFIG_MFD_VIPERBOARD=y
# CONFIG_MFD_RETU is not set
CONFIG_MFD_PCF50633=y
# CONFIG_PCF50633_ADC is not set
CONFIG_PCF50633_GPIO=y
CONFIG_MFD_RDC321X=y
CONFIG_MFD_RTSX_PCI=y
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_SEC_CORE is not set
CONFIG_MFD_SI476X_CORE=y
CONFIG_MFD_SM501=y
CONFIG_MFD_SM501_GPIO=y
# CONFIG_MFD_SMSC is not set
CONFIG_ABX500_CORE=y
# CONFIG_AB3100_CORE is not set
# CONFIG_MFD_STMPE is not set
CONFIG_MFD_SYSCON=y
CONFIG_MFD_TI_AM335X_TSCADC=y
CONFIG_MFD_LP8788=y
# CONFIG_MFD_PALMAS is not set
CONFIG_TPS6105X=y
# CONFIG_TPS65010 is not set
CONFIG_TPS6507X=y
CONFIG_MFD_TPS65090=y
CONFIG_MFD_TPS65217=y
# CONFIG_MFD_TPS6586X is not set
CONFIG_MFD_TPS65910=y
CONFIG_MFD_TPS65912=y
# CONFIG_MFD_TPS65912_I2C is not set
# CONFIG_MFD_TPS80031 is not set
CONFIG_TWL4030_CORE=y
CONFIG_TWL4030_MADC=y
CONFIG_MFD_TWL4030_AUDIO=y
CONFIG_TWL6040_CORE=y
CONFIG_MFD_WL1273_CORE=y
CONFIG_MFD_LM3533=y
CONFIG_MFD_TIMBERDALE=y
CONFIG_MFD_TC3589X=y
# CONFIG_MFD_TMIO is not set
CONFIG_MFD_VX855=y
# CONFIG_MFD_ARIZONA_I2C is not set
CONFIG_MFD_WM8400=y
CONFIG_MFD_WM831X=y
CONFIG_MFD_WM831X_I2C=y
CONFIG_MFD_WM8350=y
CONFIG_MFD_WM8350_I2C=y
CONFIG_MFD_WM8994=y
CONFIG_REGULATOR=y
CONFIG_REGULATOR_DEBUG=y
# CONFIG_REGULATOR_DUMMY is not set
CONFIG_REGULATOR_FIXED_VOLTAGE=y
CONFIG_REGULATOR_VIRTUAL_CONSUMER=y
CONFIG_REGULATOR_USERSPACE_CONSUMER=y
CONFIG_REGULATOR_88PM800=y
CONFIG_REGULATOR_88PM8607=y
CONFIG_REGULATOR_AD5398=y
CONFIG_REGULATOR_ANATOP=y
CONFIG_REGULATOR_AS3711=y
CONFIG_REGULATOR_DA903X=y
CONFIG_REGULATOR_DA9210=y
# CONFIG_REGULATOR_FAN53555 is not set
CONFIG_REGULATOR_GPIO=y
CONFIG_REGULATOR_ISL6271A=y
CONFIG_REGULATOR_LP3971=y
CONFIG_REGULATOR_LP3972=y
# CONFIG_REGULATOR_LP872X is not set
CONFIG_REGULATOR_LP8755=y
# CONFIG_REGULATOR_LP8788 is not set
CONFIG_REGULATOR_MAX1586=y
CONFIG_REGULATOR_MAX8649=y
CONFIG_REGULATOR_MAX8660=y
# CONFIG_REGULATOR_MAX8952 is not set
# CONFIG_REGULATOR_MAX8973 is not set
CONFIG_REGULATOR_MAX8998=y
# CONFIG_REGULATOR_MAX77693 is not set
CONFIG_REGULATOR_PCF50633=y
CONFIG_REGULATOR_PFUZE100=y
CONFIG_REGULATOR_TPS51632=y
CONFIG_REGULATOR_TPS6105X=y
CONFIG_REGULATOR_TPS62360=y
CONFIG_REGULATOR_TPS65023=y
# CONFIG_REGULATOR_TPS6507X is not set
CONFIG_REGULATOR_TPS65090=y
CONFIG_REGULATOR_TPS65217=y
CONFIG_REGULATOR_TPS65910=y
CONFIG_REGULATOR_TWL4030=y
CONFIG_REGULATOR_WM831X=y
CONFIG_REGULATOR_WM8350=y
# CONFIG_REGULATOR_WM8400 is not set
CONFIG_REGULATOR_WM8994=y
CONFIG_MEDIA_SUPPORT=y

#
# Multimedia core support
#
# CONFIG_MEDIA_CAMERA_SUPPORT is not set
# CONFIG_MEDIA_ANALOG_TV_SUPPORT is not set
CONFIG_MEDIA_DIGITAL_TV_SUPPORT=y
CONFIG_MEDIA_RADIO_SUPPORT=y
# CONFIG_MEDIA_RC_SUPPORT is not set
CONFIG_VIDEO_DEV=y
CONFIG_VIDEO_V4L2=y
CONFIG_VIDEO_ADV_DEBUG=y
# CONFIG_VIDEO_FIXED_MINOR_RANGES is not set
CONFIG_VIDEOBUF_GEN=y
CONFIG_VIDEOBUF_DMA_SG=y
CONFIG_VIDEOBUF_VMALLOC=y
# CONFIG_VIDEO_V4L2_INT_DEVICE is not set
CONFIG_DVB_CORE=y
# CONFIG_DVB_NET is not set
CONFIG_TTPCI_EEPROM=y
CONFIG_DVB_MAX_ADAPTERS=8
CONFIG_DVB_DYNAMIC_MINORS=y

#
# Media drivers
#
CONFIG_MEDIA_USB_SUPPORT=y

#
# Analog/digital TV USB devices
#
CONFIG_VIDEO_AU0828=y
CONFIG_VIDEO_AU0828_V4L2=y

#
# Digital TV USB devices
#
# CONFIG_DVB_USB_V2 is not set
CONFIG_DVB_TTUSB_BUDGET=y
CONFIG_DVB_TTUSB_DEC=y
# CONFIG_SMS_USB_DRV is not set
# CONFIG_DVB_B2C2_FLEXCOP_USB is not set

#
# Webcam, TV (analog/digital) USB devices
#
# CONFIG_VIDEO_EM28XX is not set
CONFIG_MEDIA_PCI_SUPPORT=y

#
# Media capture/analog/hybrid TV support
#
CONFIG_VIDEO_CX25821=y
# CONFIG_VIDEO_SAA7134 is not set
# CONFIG_VIDEO_SAA7164 is not set

#
# Media digital TV PCI Adapters
#
CONFIG_DVB_AV7110=y
CONFIG_DVB_AV7110_OSD=y
CONFIG_DVB_BUDGET_CORE=y
CONFIG_DVB_BUDGET=y
CONFIG_DVB_BUDGET_AV=y
# CONFIG_DVB_BUDGET_PATCH is not set
CONFIG_DVB_B2C2_FLEXCOP_PCI=y
# CONFIG_DVB_B2C2_FLEXCOP_PCI_DEBUG is not set
CONFIG_DVB_PLUTO2=y
CONFIG_DVB_PT1=y
CONFIG_DVB_NGENE=y
CONFIG_DVB_DDBRIDGE=y

#
# Supported MMC/SDIO adapters
#
CONFIG_SMS_SDIO_DRV=y
# CONFIG_RADIO_ADAPTERS is not set
CONFIG_MEDIA_COMMON_OPTIONS=y

#
# common driver options
#
CONFIG_VIDEO_BTCX=y
CONFIG_VIDEO_TVEEPROM=y
CONFIG_CYPRESS_FIRMWARE=y
CONFIG_DVB_B2C2_FLEXCOP=y
CONFIG_VIDEO_SAA7146=y
CONFIG_VIDEO_SAA7146_VV=y
CONFIG_SMS_SIANO_MDTV=y

#
# Media ancillary drivers (tuners, sensors, i2c, frontends)
#
# CONFIG_MEDIA_SUBDRV_AUTOSELECT is not set

#
# Encoders, decoders, sensors and other helper chips
#

#
# Audio decoders, processors and mixers
#
# CONFIG_VIDEO_TVAUDIO is not set
CONFIG_VIDEO_TDA7432=y
CONFIG_VIDEO_TDA9840=y
CONFIG_VIDEO_TEA6415C=y
# CONFIG_VIDEO_TEA6420 is not set
CONFIG_VIDEO_MSP3400=y
# CONFIG_VIDEO_CS5345 is not set
CONFIG_VIDEO_CS53L32A=y
CONFIG_VIDEO_TLV320AIC23B=y
CONFIG_VIDEO_UDA1342=y
CONFIG_VIDEO_WM8775=y
CONFIG_VIDEO_WM8739=y
CONFIG_VIDEO_VP27SMPX=y
# CONFIG_VIDEO_SONY_BTF_MPX is not set

#
# RDS decoders
#
# CONFIG_VIDEO_SAA6588 is not set

#
# Video decoders
#
# CONFIG_VIDEO_ADV7180 is not set
# CONFIG_VIDEO_ADV7183 is not set
CONFIG_VIDEO_BT819=y
# CONFIG_VIDEO_BT856 is not set
# CONFIG_VIDEO_BT866 is not set
CONFIG_VIDEO_KS0127=y
CONFIG_VIDEO_ML86V7667=y
# CONFIG_VIDEO_SAA7110 is not set
CONFIG_VIDEO_SAA711X=y
CONFIG_VIDEO_SAA7191=y
# CONFIG_VIDEO_TVP514X is not set
CONFIG_VIDEO_TVP5150=y
# CONFIG_VIDEO_TVP7002 is not set
CONFIG_VIDEO_TW2804=y
CONFIG_VIDEO_TW9903=y
CONFIG_VIDEO_TW9906=y
CONFIG_VIDEO_VPX3220=y

#
# Video and audio decoders
#
CONFIG_VIDEO_SAA717X=y
# CONFIG_VIDEO_CX25840 is not set

#
# Video encoders
#
CONFIG_VIDEO_SAA7127=y
CONFIG_VIDEO_SAA7185=y
# CONFIG_VIDEO_ADV7170 is not set
# CONFIG_VIDEO_ADV7175 is not set
CONFIG_VIDEO_ADV7343=y
CONFIG_VIDEO_ADV7393=y
# CONFIG_VIDEO_AK881X is not set
CONFIG_VIDEO_THS8200=y

#
# Camera sensor devices
#

#
# Flash devices
#

#
# Video improvement chips
#
CONFIG_VIDEO_UPD64031A=y
CONFIG_VIDEO_UPD64083=y

#
# Miscellaneous helper chips
#
CONFIG_VIDEO_THS7303=y
CONFIG_VIDEO_M52790=y

#
# Sensors used on soc_camera driver
#
CONFIG_MEDIA_TUNER=y

#
# Customize TV tuners
#
# CONFIG_MEDIA_TUNER_SIMPLE is not set
CONFIG_MEDIA_TUNER_TDA8290=y
CONFIG_MEDIA_TUNER_TDA827X=y
CONFIG_MEDIA_TUNER_TDA18271=y
CONFIG_MEDIA_TUNER_TDA9887=y
# CONFIG_MEDIA_TUNER_TEA5761 is not set
# CONFIG_MEDIA_TUNER_TEA5767 is not set
# CONFIG_MEDIA_TUNER_MT20XX is not set
CONFIG_MEDIA_TUNER_MT2060=y
CONFIG_MEDIA_TUNER_MT2063=y
CONFIG_MEDIA_TUNER_MT2266=y
CONFIG_MEDIA_TUNER_MT2131=y
CONFIG_MEDIA_TUNER_QT1010=y
CONFIG_MEDIA_TUNER_XC2028=y
CONFIG_MEDIA_TUNER_XC5000=y
CONFIG_MEDIA_TUNER_XC4000=y
CONFIG_MEDIA_TUNER_MXL5005S=y
CONFIG_MEDIA_TUNER_MXL5007T=y
# CONFIG_MEDIA_TUNER_MC44S803 is not set
CONFIG_MEDIA_TUNER_MAX2165=y
CONFIG_MEDIA_TUNER_TDA18218=y
CONFIG_MEDIA_TUNER_FC0011=y
CONFIG_MEDIA_TUNER_FC0012=y
CONFIG_MEDIA_TUNER_FC0013=y
CONFIG_MEDIA_TUNER_TDA18212=y
CONFIG_MEDIA_TUNER_E4000=y
CONFIG_MEDIA_TUNER_FC2580=y
CONFIG_MEDIA_TUNER_TUA9001=y
# CONFIG_MEDIA_TUNER_IT913X is not set
# CONFIG_MEDIA_TUNER_R820T is not set

#
# Customise DVB Frontends
#

#
# Multistandard (satellite) frontends
#
CONFIG_DVB_STB0899=y
CONFIG_DVB_STB6100=y
# CONFIG_DVB_STV090x is not set
CONFIG_DVB_STV6110x=y

#
# Multistandard (cable + terrestrial) frontends
#
CONFIG_DVB_DRXK=y
CONFIG_DVB_TDA18271C2DD=y

#
# DVB-S (satellite) frontends
#
CONFIG_DVB_CX24110=y
CONFIG_DVB_CX24123=y
# CONFIG_DVB_MT312 is not set
CONFIG_DVB_ZL10036=y
CONFIG_DVB_ZL10039=y
CONFIG_DVB_S5H1420=y
CONFIG_DVB_STV0288=y
CONFIG_DVB_STB6000=y
CONFIG_DVB_STV0299=y
# CONFIG_DVB_STV6110 is not set
# CONFIG_DVB_STV0900 is not set
CONFIG_DVB_TDA8083=y
# CONFIG_DVB_TDA10086 is not set
# CONFIG_DVB_TDA8261 is not set
CONFIG_DVB_VES1X93=y
CONFIG_DVB_TUNER_ITD1000=y
CONFIG_DVB_TUNER_CX24113=y
CONFIG_DVB_TDA826X=y
CONFIG_DVB_TUA6100=y
CONFIG_DVB_CX24116=y
CONFIG_DVB_SI21XX=y
CONFIG_DVB_TS2020=y
CONFIG_DVB_DS3000=y
# CONFIG_DVB_MB86A16 is not set
# CONFIG_DVB_TDA10071 is not set

#
# DVB-T (terrestrial) frontends
#
CONFIG_DVB_SP8870=y
CONFIG_DVB_SP887X=y
CONFIG_DVB_CX22700=y
# CONFIG_DVB_CX22702 is not set
CONFIG_DVB_S5H1432=y
CONFIG_DVB_DRXD=y
CONFIG_DVB_L64781=y
CONFIG_DVB_TDA1004X=y
CONFIG_DVB_NXT6000=y
# CONFIG_DVB_MT352 is not set
# CONFIG_DVB_ZL10353 is not set
# CONFIG_DVB_DIB3000MB is not set
CONFIG_DVB_DIB3000MC=y
CONFIG_DVB_DIB7000M=y
CONFIG_DVB_DIB7000P=y
# CONFIG_DVB_DIB9000 is not set
CONFIG_DVB_TDA10048=y
CONFIG_DVB_AF9013=y
CONFIG_DVB_EC100=y
CONFIG_DVB_HD29L2=y
CONFIG_DVB_STV0367=y
# CONFIG_DVB_CXD2820R is not set
CONFIG_DVB_RTL2830=y
# CONFIG_DVB_RTL2832 is not set

#
# DVB-C (cable) frontends
#
CONFIG_DVB_VES1820=y
# CONFIG_DVB_TDA10021 is not set
CONFIG_DVB_TDA10023=y
CONFIG_DVB_STV0297=y

#
# ATSC (North American/Korean Terrestrial/Cable DTV) frontends
#
CONFIG_DVB_NXT200X=y
# CONFIG_DVB_OR51211 is not set
CONFIG_DVB_OR51132=y
CONFIG_DVB_BCM3510=y
# CONFIG_DVB_LGDT330X is not set
CONFIG_DVB_LGDT3305=y
CONFIG_DVB_LG2160=y
CONFIG_DVB_S5H1409=y
CONFIG_DVB_AU8522=y
CONFIG_DVB_AU8522_DTV=y
CONFIG_DVB_AU8522_V4L=y
CONFIG_DVB_S5H1411=y

#
# ISDB-T (terrestrial) frontends
#
CONFIG_DVB_S921=y
# CONFIG_DVB_DIB8000 is not set
CONFIG_DVB_MB86A20S=y

#
# Digital terrestrial only tuners/PLL
#
# CONFIG_DVB_PLL is not set
CONFIG_DVB_TUNER_DIB0070=y
CONFIG_DVB_TUNER_DIB0090=y

#
# SEC control devices for DVB-S
#
# CONFIG_DVB_LNBP21 is not set
CONFIG_DVB_LNBP22=y
# CONFIG_DVB_ISL6405 is not set
CONFIG_DVB_ISL6421=y
# CONFIG_DVB_ISL6423 is not set
CONFIG_DVB_A8293=y
CONFIG_DVB_LGS8GL5=y
# CONFIG_DVB_LGS8GXX is not set
CONFIG_DVB_ATBM8830=y
# CONFIG_DVB_TDA665x is not set
CONFIG_DVB_IX2505V=y
# CONFIG_DVB_IT913X_FE is not set
CONFIG_DVB_M88RS2000=y
CONFIG_DVB_AF9033=y

#
# Tools to develop new frontends
#
# CONFIG_DVB_DUMMY_FE is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_ALI=y
# CONFIG_AGP_ATI is not set
# CONFIG_AGP_AMD is not set
CONFIG_AGP_INTEL=y
CONFIG_AGP_NVIDIA=y
CONFIG_AGP_SIS=y
CONFIG_AGP_SWORKS=y
CONFIG_AGP_VIA=y
# CONFIG_AGP_EFFICEON is not set
# CONFIG_VGA_ARB is not set
# CONFIG_DRM is not set
CONFIG_VGASTATE=y
CONFIG_VIDEO_OUTPUT_CONTROL=y
CONFIG_FB=y
CONFIG_FIRMWARE_EDID=y
CONFIG_FB_DDC=y
# CONFIG_FB_BOOT_VESA_SUPPORT is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=y
CONFIG_FB_SYS_COPYAREA=y
CONFIG_FB_SYS_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_SYS_FOPS=y
CONFIG_FB_DEFERRED_IO=y
CONFIG_FB_HECUBA=y
CONFIG_FB_SVGALIB=y
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
CONFIG_FB_PM2=y
CONFIG_FB_PM2_FIFO_DISCONNECT=y
# CONFIG_FB_CYBER2000 is not set
CONFIG_FB_ARC=y
# CONFIG_FB_ASILIANT is not set
CONFIG_FB_IMSTT=y
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_VESA is not set
CONFIG_FB_N411=y
CONFIG_FB_HGA=y
CONFIG_FB_S1D13XXX=y
CONFIG_FB_NVIDIA=y
CONFIG_FB_NVIDIA_I2C=y
CONFIG_FB_NVIDIA_DEBUG=y
# CONFIG_FB_NVIDIA_BACKLIGHT is not set
CONFIG_FB_RIVA=y
# CONFIG_FB_RIVA_I2C is not set
CONFIG_FB_RIVA_DEBUG=y
# CONFIG_FB_RIVA_BACKLIGHT is not set
CONFIG_FB_I740=y
CONFIG_FB_I810=y
CONFIG_FB_I810_GTF=y
CONFIG_FB_I810_I2C=y
CONFIG_FB_LE80578=y
CONFIG_FB_CARILLO_RANCH=y
# CONFIG_FB_INTEL is not set
CONFIG_FB_MATROX=y
# CONFIG_FB_MATROX_MILLENIUM is not set
CONFIG_FB_MATROX_MYSTIQUE=y
# CONFIG_FB_MATROX_G is not set
CONFIG_FB_MATROX_I2C=y
# CONFIG_FB_RADEON is not set
CONFIG_FB_ATY128=y
CONFIG_FB_ATY128_BACKLIGHT=y
# CONFIG_FB_ATY is not set
CONFIG_FB_S3=y
CONFIG_FB_S3_DDC=y
CONFIG_FB_SAVAGE=y
# CONFIG_FB_SAVAGE_I2C is not set
CONFIG_FB_SAVAGE_ACCEL=y
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
CONFIG_FB_NEOMAGIC=y
# CONFIG_FB_KYRO is not set
CONFIG_FB_3DFX=y
CONFIG_FB_3DFX_ACCEL=y
# CONFIG_FB_3DFX_I2C is not set
CONFIG_FB_VOODOO1=y
CONFIG_FB_VT8623=y
CONFIG_FB_TRIDENT=y
CONFIG_FB_ARK=y
# CONFIG_FB_PM3 is not set
CONFIG_FB_CARMINE=y
# CONFIG_FB_CARMINE_DRAM_EVAL is not set
CONFIG_CARMINE_DRAM_CUSTOM=y
# CONFIG_FB_GEODE is not set
# CONFIG_FB_TMIO is not set
CONFIG_FB_SM501=y
CONFIG_FB_SMSCUFX=y
# CONFIG_FB_UDL is not set
# CONFIG_FB_GOLDFISH is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
# CONFIG_FB_AUO_K190X is not set
# CONFIG_FB_SIMPLE is not set
CONFIG_EXYNOS_VIDEO=y
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_GENERIC is not set
CONFIG_BACKLIGHT_LM3533=y
# CONFIG_BACKLIGHT_PWM is not set
CONFIG_BACKLIGHT_DA903X=y
CONFIG_BACKLIGHT_SAHARA=y
# CONFIG_BACKLIGHT_WM831X is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
CONFIG_BACKLIGHT_88PM860X=y
# CONFIG_BACKLIGHT_PCF50633 is not set
CONFIG_BACKLIGHT_LM3630=y
# CONFIG_BACKLIGHT_LM3639 is not set
# CONFIG_BACKLIGHT_LP855X is not set
CONFIG_BACKLIGHT_LP8788=y
CONFIG_BACKLIGHT_OT200=y
CONFIG_BACKLIGHT_PANDORA=y
CONFIG_BACKLIGHT_TPS65217=y
# CONFIG_BACKLIGHT_AS3711 is not set
CONFIG_BACKLIGHT_GPIO=y
# CONFIG_BACKLIGHT_LV5207LP is not set
CONFIG_BACKLIGHT_BD6107=y

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
# CONFIG_MDA_CONSOLE is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
CONFIG_LOGO=y
CONFIG_LOGO_LINUX_MONO=y
# CONFIG_LOGO_LINUX_VGA16 is not set
# CONFIG_LOGO_LINUX_CLUT224 is not set
CONFIG_FB_SSD1307=y
# CONFIG_SOUND is not set

#
# HID support
#
CONFIG_HID=y
CONFIG_HID_BATTERY_STRENGTH=y
# CONFIG_HIDRAW is not set
# CONFIG_UHID is not set
# CONFIG_HID_GENERIC is not set

#
# Special HID drivers
#
# CONFIG_HID_A4TECH is not set
# CONFIG_HID_ACRUX is not set
CONFIG_HID_APPLE=y
CONFIG_HID_APPLEIR=y
CONFIG_HID_AUREAL=y
CONFIG_HID_BELKIN=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_CYPRESS=y
CONFIG_HID_DRAGONRISE=y
CONFIG_DRAGONRISE_FF=y
CONFIG_HID_EMS_FF=y
CONFIG_HID_ELECOM=y
CONFIG_HID_ELO=y
CONFIG_HID_EZKEY=y
CONFIG_HID_HOLTEK=y
# CONFIG_HOLTEK_FF is not set
CONFIG_HID_HUION=y
CONFIG_HID_KEYTOUCH=y
CONFIG_HID_KYE=y
CONFIG_HID_UCLOGIC=y
CONFIG_HID_WALTOP=y
CONFIG_HID_GYRATION=y
# CONFIG_HID_ICADE is not set
CONFIG_HID_TWINHAN=y
CONFIG_HID_KENSINGTON=y
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LENOVO_TPKBD is not set
# CONFIG_HID_LOGITECH is not set
CONFIG_HID_MAGICMOUSE=y
CONFIG_HID_MICROSOFT=y
# CONFIG_HID_MONTEREY is not set
# CONFIG_HID_MULTITOUCH is not set
CONFIG_HID_NTRIG=y
CONFIG_HID_ORTEK=y
CONFIG_HID_PANTHERLORD=y
# CONFIG_PANTHERLORD_FF is not set
CONFIG_HID_PETALYNX=y
CONFIG_HID_PICOLCD=y
CONFIG_HID_PICOLCD_FB=y
CONFIG_HID_PICOLCD_BACKLIGHT=y
# CONFIG_HID_PICOLCD_LEDS is not set
CONFIG_HID_PRIMAX=y
CONFIG_HID_ROCCAT=y
CONFIG_HID_SAITEK=y
CONFIG_HID_SAMSUNG=y
CONFIG_HID_SONY=y
CONFIG_HID_SPEEDLINK=y
CONFIG_HID_STEELSERIES=y
CONFIG_HID_SUNPLUS=y
CONFIG_HID_GREENASIA=y
# CONFIG_GREENASIA_FF is not set
# CONFIG_HID_SMARTJOYPLUS is not set
CONFIG_HID_TIVO=y
# CONFIG_HID_TOPSEED is not set
CONFIG_HID_THINGM=y
CONFIG_HID_THRUSTMASTER=y
CONFIG_THRUSTMASTER_FF=y
# CONFIG_HID_WACOM is not set
CONFIG_HID_WIIMOTE=y
CONFIG_HID_XINMO=y
CONFIG_HID_ZEROPLUS=y
CONFIG_ZEROPLUS_FF=y
CONFIG_HID_ZYDACRON=y
CONFIG_HID_SENSOR_HUB=y

#
# USB HID support
#
CONFIG_USB_HID=y
# CONFIG_HID_PID is not set
# CONFIG_USB_HIDDEV is not set

#
# I2C HID support
#
# CONFIG_I2C_HID is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
# CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set

#
# Miscellaneous USB options
#
# CONFIG_USB_DEFAULT_PERSIST is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
CONFIG_USB_OTG_WHITELIST=y
# CONFIG_USB_OTG_BLACKLIST_HUB is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
CONFIG_USB_C67X00_HCD=y
# CONFIG_USB_XHCI_HCD is not set
CONFIG_USB_EHCI_HCD=y
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
CONFIG_USB_EHCI_PCI=y
CONFIG_USB_EHCI_HCD_PLATFORM=y
CONFIG_USB_OXU210HP_HCD=y
# CONFIG_USB_ISP116X_HCD is not set
CONFIG_USB_ISP1760_HCD=y
# CONFIG_USB_ISP1362_HCD is not set
CONFIG_USB_FUSBH200_HCD=y
CONFIG_USB_FOTG210_HCD=y
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_HCD_PCI is not set
CONFIG_USB_OHCI_HCD_SSB=y
CONFIG_USB_OHCI_HCD_PLATFORM=y
CONFIG_USB_UHCI_HCD=y
CONFIG_USB_U132_HCD=y
CONFIG_USB_SL811_HCD=y
CONFIG_USB_SL811_HCD_ISO=y
# CONFIG_USB_SL811_CS is not set
CONFIG_USB_R8A66597_HCD=y
# CONFIG_USB_RENESAS_USBHS_HCD is not set
CONFIG_USB_HCD_SSB=y
# CONFIG_USB_HCD_TEST_MODE is not set
CONFIG_USB_MUSB_HDRC=y
# CONFIG_USB_MUSB_HOST is not set
# CONFIG_USB_MUSB_GADGET is not set
CONFIG_USB_MUSB_DUAL_ROLE=y
CONFIG_USB_MUSB_TUSB6010=y
# CONFIG_USB_MUSB_DSPS is not set
# CONFIG_USB_MUSB_UX500 is not set
CONFIG_MUSB_PIO_ONLY=y
CONFIG_USB_RENESAS_USBHS=y

#
# USB Device Class drivers
#
CONFIG_USB_ACM=y
CONFIG_USB_PRINTER=y
CONFIG_USB_WDM=y
CONFIG_USB_TMC=y

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=y
CONFIG_USB_STORAGE_DEBUG=y
CONFIG_USB_STORAGE_REALTEK=y
CONFIG_REALTEK_AUTOPM=y
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
CONFIG_USB_STORAGE_JUMPSHOT=y
CONFIG_USB_STORAGE_ALAUDA=y
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
CONFIG_USB_STORAGE_ENE_UB6250=y

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USB_DWC3 is not set
CONFIG_USB_CHIPIDEA=y
CONFIG_USB_CHIPIDEA_UDC=y
# CONFIG_USB_CHIPIDEA_HOST is not set
CONFIG_USB_CHIPIDEA_DEBUG=y

#
# USB port drivers
#
CONFIG_USB_USS720=y
CONFIG_USB_SERIAL=y
# CONFIG_USB_SERIAL_CONSOLE is not set
# CONFIG_USB_SERIAL_GENERIC is not set
CONFIG_USB_SERIAL_SIMPLE=y
# CONFIG_USB_SERIAL_AIRCABLE is not set
CONFIG_USB_SERIAL_ARK3116=y
CONFIG_USB_SERIAL_BELKIN=y
CONFIG_USB_SERIAL_CH341=y
CONFIG_USB_SERIAL_WHITEHEAT=y
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=y
CONFIG_USB_SERIAL_CP210X=y
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
CONFIG_USB_SERIAL_EMPEG=y
CONFIG_USB_SERIAL_FTDI_SIO=y
CONFIG_USB_SERIAL_VISOR=y
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
CONFIG_USB_SERIAL_F81232=y
CONFIG_USB_SERIAL_GARMIN=y
# CONFIG_USB_SERIAL_IPW is not set
CONFIG_USB_SERIAL_IUU=y
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
# CONFIG_USB_SERIAL_KEYSPAN is not set
CONFIG_USB_SERIAL_KLSI=y
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
CONFIG_USB_SERIAL_MCT_U232=y
CONFIG_USB_SERIAL_METRO=y
CONFIG_USB_SERIAL_MOS7720=y
# CONFIG_USB_SERIAL_MOS7715_PARPORT is not set
CONFIG_USB_SERIAL_MOS7840=y
CONFIG_USB_SERIAL_NAVMAN=y
CONFIG_USB_SERIAL_PL2303=y
CONFIG_USB_SERIAL_OTI6858=y
# CONFIG_USB_SERIAL_QCAUX is not set
# CONFIG_USB_SERIAL_QUALCOMM is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
# CONFIG_USB_SERIAL_SAFE is not set
CONFIG_USB_SERIAL_SIERRAWIRELESS=y
# CONFIG_USB_SERIAL_SYMBOL is not set
CONFIG_USB_SERIAL_TI=y
CONFIG_USB_SERIAL_CYBERJACK=y
CONFIG_USB_SERIAL_XIRCOM=y
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
CONFIG_USB_SERIAL_OPTICON=y
CONFIG_USB_SERIAL_XSENS_MT=y
CONFIG_USB_SERIAL_WISHBONE=y
# CONFIG_USB_SERIAL_ZTE is not set
# CONFIG_USB_SERIAL_SSU100 is not set
CONFIG_USB_SERIAL_QT2=y
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
CONFIG_USB_EMI62=y
CONFIG_USB_EMI26=y
CONFIG_USB_ADUTUX=y
CONFIG_USB_SEVSEG=y
CONFIG_USB_RIO500=y
CONFIG_USB_LEGOTOWER=y
CONFIG_USB_LCD=y
CONFIG_USB_LED=y
CONFIG_USB_CYPRESS_CY7C63=y
CONFIG_USB_CYTHERM=y
CONFIG_USB_IDMOUSE=y
CONFIG_USB_FTDI_ELAN=y
# CONFIG_USB_APPLEDISPLAY is not set
CONFIG_USB_SISUSBVGA=y
CONFIG_USB_SISUSBVGA_CON=y
CONFIG_USB_LD=y
CONFIG_USB_TRANCEVIBRATOR=y
CONFIG_USB_IOWARRIOR=y
CONFIG_USB_TEST=y
CONFIG_USB_EHSET_TEST_FIXTURE=y
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
CONFIG_USB_EZUSB_FX2=y
CONFIG_USB_HSIC_USB3503=y
# CONFIG_USB_ATM is not set

#
# USB Physical Layer drivers
#
CONFIG_USB_PHY=y
CONFIG_NOP_USB_XCEIV=y
# CONFIG_AM335X_PHY_USB is not set
CONFIG_SAMSUNG_USBPHY=y
CONFIG_SAMSUNG_USB2PHY=y
CONFIG_SAMSUNG_USB3PHY=y
CONFIG_USB_GPIO_VBUS=y
# CONFIG_USB_ISP1301 is not set
CONFIG_USB_RCAR_PHY=y
CONFIG_USB_GADGET=y
CONFIG_USB_GADGET_DEBUG=y
# CONFIG_USB_GADGET_DEBUG_FILES is not set
CONFIG_USB_GADGET_DEBUG_FS=y
CONFIG_USB_GADGET_VBUS_DRAW=2
CONFIG_USB_GADGET_STORAGE_NUM_BUFFERS=2

#
# USB Peripheral Controller
#
# CONFIG_USB_FUSB300 is not set
# CONFIG_USB_FOTG210_UDC is not set
CONFIG_USB_R8A66597=y
# CONFIG_USB_RENESAS_USBHS_UDC is not set
CONFIG_USB_PXA27X=y
# CONFIG_USB_MV_UDC is not set
# CONFIG_USB_MV_U3D is not set
# CONFIG_USB_M66592 is not set
CONFIG_USB_AMD5536UDC=y
CONFIG_USB_NET2272=y
CONFIG_USB_NET2272_DMA=y
# CONFIG_USB_NET2280 is not set
# CONFIG_USB_GOKU is not set
# CONFIG_USB_EG20T is not set
CONFIG_USB_DUMMY_HCD=y
CONFIG_USB_LIBCOMPOSITE=y
CONFIG_USB_U_ETHER=y
CONFIG_USB_F_NCM=y
# CONFIG_USB_CONFIGFS is not set
# CONFIG_USB_ZERO is not set
# CONFIG_USB_ETH is not set
CONFIG_USB_G_NCM=y
# CONFIG_USB_GADGETFS is not set
# CONFIG_USB_FUNCTIONFS is not set
# CONFIG_USB_MASS_STORAGE is not set
# CONFIG_USB_GADGET_TARGET is not set
# CONFIG_USB_G_SERIAL is not set
# CONFIG_USB_G_PRINTER is not set
# CONFIG_USB_CDC_COMPOSITE is not set
# CONFIG_USB_G_NOKIA is not set
# CONFIG_USB_G_ACM_MS is not set
# CONFIG_USB_G_MULTI is not set
# CONFIG_USB_G_HID is not set
# CONFIG_USB_G_DBGP is not set
# CONFIG_USB_G_WEBCAM is not set
# CONFIG_UWB is not set
CONFIG_MMC=y
CONFIG_MMC_DEBUG=y
CONFIG_MMC_UNSAFE_RESUME=y
CONFIG_MMC_CLKGATE=y

#
# MMC/SD/SDIO Card Drivers
#
CONFIG_MMC_BLOCK=y
CONFIG_MMC_BLOCK_MINORS=8
CONFIG_MMC_BLOCK_BOUNCE=y
CONFIG_SDIO_UART=y
CONFIG_MMC_TEST=y

#
# MMC/SD/SDIO Host Controller Drivers
#
CONFIG_MMC_SDHCI=y
CONFIG_MMC_SDHCI_PCI=y
CONFIG_MMC_RICOH_MMC=y
CONFIG_MMC_SDHCI_PLTFM=y
# CONFIG_MMC_WBSD is not set
# CONFIG_MMC_TIFM_SD is not set
CONFIG_MMC_GOLDFISH=y
CONFIG_MMC_SDRICOH_CS=y
# CONFIG_MMC_CB710 is not set
# CONFIG_MMC_VIA_SDMMC is not set
# CONFIG_MMC_VUB300 is not set
CONFIG_MMC_USHC=y
CONFIG_MMC_REALTEK_PCI=y
CONFIG_MEMSTICK=y
# CONFIG_MEMSTICK_DEBUG is not set

#
# MemoryStick drivers
#
# CONFIG_MEMSTICK_UNSAFE_RESUME is not set
# CONFIG_MSPRO_BLOCK is not set
CONFIG_MS_BLOCK=y

#
# MemoryStick Host Controller Drivers
#
CONFIG_MEMSTICK_TIFM_MS=y
CONFIG_MEMSTICK_JMICRON_38X=y
CONFIG_MEMSTICK_R592=y
CONFIG_MEMSTICK_REALTEK_PCI=y
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
# CONFIG_LEDS_88PM860X is not set
# CONFIG_LEDS_LM3530 is not set
CONFIG_LEDS_LM3533=y
CONFIG_LEDS_LM3642=y
CONFIG_LEDS_PCA9532=y
# CONFIG_LEDS_PCA9532_GPIO is not set
CONFIG_LEDS_GPIO=y
# CONFIG_LEDS_LP3944 is not set
CONFIG_LEDS_LP55XX_COMMON=y
# CONFIG_LEDS_LP5521 is not set
CONFIG_LEDS_LP5523=y
# CONFIG_LEDS_LP5562 is not set
CONFIG_LEDS_LP8501=y
# CONFIG_LEDS_LP8788 is not set
# CONFIG_LEDS_PCA955X is not set
CONFIG_LEDS_PCA963X=y
# CONFIG_LEDS_WM831X_STATUS is not set
CONFIG_LEDS_WM8350=y
CONFIG_LEDS_DA903X=y
CONFIG_LEDS_PWM=y
CONFIG_LEDS_REGULATOR=y
# CONFIG_LEDS_BD2802 is not set
CONFIG_LEDS_LT3593=y
CONFIG_LEDS_TCA6507=y
CONFIG_LEDS_LM355x=y
CONFIG_LEDS_OT200=y
CONFIG_LEDS_BLINKM=y

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
CONFIG_LEDS_TRIGGER_TIMER=y
# CONFIG_LEDS_TRIGGER_ONESHOT is not set
CONFIG_LEDS_TRIGGER_HEARTBEAT=y
CONFIG_LEDS_TRIGGER_BACKLIGHT=y
CONFIG_LEDS_TRIGGER_CPU=y
CONFIG_LEDS_TRIGGER_GPIO=y
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
CONFIG_LEDS_TRIGGER_TRANSIENT=y
CONFIG_LEDS_TRIGGER_CAMERA=y
# CONFIG_ACCESSIBILITY is not set
CONFIG_INFINIBAND=y
# CONFIG_INFINIBAND_USER_MAD is not set
# CONFIG_INFINIBAND_USER_ACCESS is not set
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_MTHCA=y
CONFIG_INFINIBAND_MTHCA_DEBUG=y
CONFIG_INFINIBAND_AMSO1100=y
CONFIG_INFINIBAND_AMSO1100_DEBUG=y
# CONFIG_INFINIBAND_CXGB4 is not set
CONFIG_MLX4_INFINIBAND=y
CONFIG_MLX5_INFINIBAND=y
CONFIG_INFINIBAND_NES=y
# CONFIG_INFINIBAND_NES_DEBUG is not set
# CONFIG_INFINIBAND_OCRDMA is not set
CONFIG_INFINIBAND_IPOIB=y
CONFIG_INFINIBAND_IPOIB_CM=y
# CONFIG_INFINIBAND_IPOIB_DEBUG is not set
CONFIG_INFINIBAND_SRP=y
# CONFIG_INFINIBAND_SRPT is not set
# CONFIG_INFINIBAND_ISER is not set
CONFIG_INFINIBAND_ISERT=y
# CONFIG_EDAC is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
# CONFIG_RTC_HCTOSYS is not set
# CONFIG_RTC_SYSTOHC is not set
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
# CONFIG_RTC_INTF_DEV is not set
CONFIG_RTC_DRV_TEST=y

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_88PM860X is not set
CONFIG_RTC_DRV_88PM80X=y
CONFIG_RTC_DRV_DS1307=y
CONFIG_RTC_DRV_DS1374=y
CONFIG_RTC_DRV_DS1672=y
CONFIG_RTC_DRV_DS3232=y
CONFIG_RTC_DRV_LP8788=y
CONFIG_RTC_DRV_MAX6900=y
# CONFIG_RTC_DRV_MAX8998 is not set
CONFIG_RTC_DRV_RS5C372=y
CONFIG_RTC_DRV_ISL1208=y
CONFIG_RTC_DRV_ISL12022=y
CONFIG_RTC_DRV_X1205=y
CONFIG_RTC_DRV_PCF2127=y
CONFIG_RTC_DRV_PCF8523=y
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
CONFIG_RTC_DRV_BQ32K=y
CONFIG_RTC_DRV_TWL4030=y
CONFIG_RTC_DRV_TPS65910=y
# CONFIG_RTC_DRV_S35390A is not set
CONFIG_RTC_DRV_FM3130=y
CONFIG_RTC_DRV_RX8581=y
# CONFIG_RTC_DRV_RX8025 is not set
CONFIG_RTC_DRV_EM3027=y
CONFIG_RTC_DRV_RV3029C2=y

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
CONFIG_RTC_DRV_DS1286=y
CONFIG_RTC_DRV_DS1511=y
# CONFIG_RTC_DRV_DS1553 is not set
CONFIG_RTC_DRV_DS1742=y
# CONFIG_RTC_DRV_STK17TA8 is not set
CONFIG_RTC_DRV_M48T86=y
CONFIG_RTC_DRV_M48T35=y
# CONFIG_RTC_DRV_M48T59 is not set
CONFIG_RTC_DRV_MSM6242=y
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
CONFIG_RTC_DRV_V3020=y
# CONFIG_RTC_DRV_DS2404 is not set
CONFIG_RTC_DRV_WM831X=y
# CONFIG_RTC_DRV_WM8350 is not set
CONFIG_RTC_DRV_PCF50633=y

#
# on-CPU RTC drivers
#
CONFIG_RTC_DRV_SNVS=y
CONFIG_RTC_DRV_MOXART=y

#
# HID Sensor RTC drivers
#
CONFIG_RTC_DRV_HID_SENSOR_TIME=y
# CONFIG_DMADEVICES is not set
# CONFIG_AUXDISPLAY is not set
CONFIG_UIO=y
# CONFIG_UIO_CIF is not set
CONFIG_UIO_PDRV_GENIRQ=y
CONFIG_UIO_DMEM_GENIRQ=y
# CONFIG_UIO_AEC is not set
CONFIG_UIO_SERCOS3=y
CONFIG_UIO_PCI_GENERIC=y
# CONFIG_UIO_NETX is not set
CONFIG_UIO_MF624=y
# CONFIG_VIRT_DRIVERS is not set
CONFIG_VIRTIO=y

#
# Virtio drivers
#
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_BALLOON=y
CONFIG_VIRTIO_MMIO=y
# CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_DELL_LAPTOP is not set
CONFIG_SENSORS_HDAPS=y
# CONFIG_IBM_RTL is not set
# CONFIG_SAMSUNG_LAPTOP is not set
# CONFIG_GOLDFISH_PIPE is not set

#
# Hardware Spinlock drivers
#
CONFIG_CLKSRC_I8253=y
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
# CONFIG_ARM_ARCH_TIMER_EVTSTREAM is not set
CONFIG_MAILBOX=y
CONFIG_IOMMU_SUPPORT=y
CONFIG_OF_IOMMU=y

#
# Remoteproc drivers
#
# CONFIG_STE_MODEM_RPROC is not set

#
# Rpmsg drivers
#
# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
CONFIG_MEMORY=y
CONFIG_IIO=y
CONFIG_IIO_BUFFER=y
CONFIG_IIO_BUFFER_CB=y
CONFIG_IIO_KFIFO_BUF=y
CONFIG_IIO_TRIGGERED_BUFFER=y
CONFIG_IIO_TRIGGER=y
CONFIG_IIO_CONSUMERS_PER_TRIGGER=2

#
# Accelerometers
#
# CONFIG_BMA180 is not set
CONFIG_HID_SENSOR_ACCEL_3D=y
CONFIG_IIO_ST_ACCEL_3AXIS=y
CONFIG_IIO_ST_ACCEL_I2C_3AXIS=y

#
# Analog to digital converters
#
CONFIG_EXYNOS_ADC=y
CONFIG_LP8788_ADC=y
CONFIG_MAX1363=y
# CONFIG_NAU7802 is not set
# CONFIG_TI_ADC081C is not set
CONFIG_TI_AM335X_ADC=y
CONFIG_TWL6030_GPADC=y
CONFIG_VIPERBOARD_ADC=y

#
# Amplifiers
#

#
# Hid Sensor IIO Common
#
CONFIG_HID_SENSOR_IIO_COMMON=y
CONFIG_HID_SENSOR_IIO_TRIGGER=y
# CONFIG_HID_SENSOR_ENUM_BASE_QUIRKS is not set
CONFIG_IIO_ST_SENSORS_I2C=y
CONFIG_IIO_ST_SENSORS_CORE=y

#
# Digital to analog converters
#
# CONFIG_AD5064 is not set
CONFIG_AD5380=y
CONFIG_AD5446=y
# CONFIG_MAX517 is not set
CONFIG_MCP4725=y

#
# Frequency Synthesizers DDS/PLL
#

#
# Clock Generator/Distribution
#

#
# Phase-Locked Loop (PLL) frequency synthesizers
#

#
# Digital gyroscope sensors
#
CONFIG_HID_SENSOR_GYRO_3D=y
CONFIG_IIO_ST_GYRO_3AXIS=y
CONFIG_IIO_ST_GYRO_I2C_3AXIS=y
CONFIG_ITG3200=y

#
# Inertial measurement units
#
CONFIG_INV_MPU6050_IIO=y

#
# Light sensors
#
CONFIG_ADJD_S311=y
CONFIG_APDS9300=y
CONFIG_HID_SENSOR_ALS=y
# CONFIG_SENSORS_LM3533 is not set
# CONFIG_SENSORS_TSL2563 is not set
CONFIG_VCNL4000=y

#
# Magnetometer sensors
#
# CONFIG_AK8975 is not set
CONFIG_HID_SENSOR_MAGNETOMETER_3D=y
CONFIG_IIO_ST_MAGN_3AXIS=y
CONFIG_IIO_ST_MAGN_I2C_3AXIS=y

#
# Triggers - standalone
#
# CONFIG_IIO_INTERRUPT_TRIGGER is not set
CONFIG_IIO_SYSFS_TRIGGER=y

#
# Pressure sensors
#
CONFIG_IIO_ST_PRESS=y
CONFIG_IIO_ST_PRESS_I2C=y

#
# Temperature sensors
#
# CONFIG_TMP006 is not set
# CONFIG_NTB is not set
CONFIG_VME_BUS=y

#
# VME Bridge Drivers
#
CONFIG_VME_CA91CX42=y
CONFIG_VME_TSI148=y

#
# VME Board Drivers
#
# CONFIG_VMIVME_7805 is not set

#
# VME Device Drivers
#
CONFIG_PWM=y
CONFIG_PWM_SYSFS=y
CONFIG_PWM_PCA9685=y
# CONFIG_PWM_TWL is not set
CONFIG_PWM_TWL_LED=y
CONFIG_IRQCHIP=y
CONFIG_IPACK_BUS=y
CONFIG_BOARD_TPCI200=y
# CONFIG_SERIAL_IPOCTAL is not set
CONFIG_RESET_CONTROLLER=y
CONFIG_FMC=y
# CONFIG_FMC_FAKEDEV is not set
CONFIG_FMC_TRIVIAL=y
CONFIG_FMC_WRITE_EEPROM=y
CONFIG_FMC_CHARDEV=y

#
# Firmware Drivers
#
CONFIG_EDD=y
CONFIG_EDD_OFF=y
CONFIG_FIRMWARE_MEMMAP=y
# CONFIG_DELL_RBU is not set
CONFIG_DCDBAS=y
CONFIG_ISCSI_IBFT_FIND=y
# CONFIG_ISCSI_IBFT is not set
CONFIG_GOOGLE_FIRMWARE=y

#
# Google Firmware Drivers
#

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
# CONFIG_EXT4_FS_POSIX_ACL is not set
CONFIG_EXT4_FS_SECURITY=y
CONFIG_EXT4_DEBUG=y
CONFIG_JBD=y
CONFIG_JBD_DEBUG=y
CONFIG_JBD2=y
CONFIG_JBD2_DEBUG=y
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
CONFIG_REISERFS_PROC_INFO=y
CONFIG_REISERFS_FS_XATTR=y
# CONFIG_REISERFS_FS_POSIX_ACL is not set
CONFIG_REISERFS_FS_SECURITY=y
# CONFIG_JFS_FS is not set
CONFIG_XFS_FS=y
CONFIG_XFS_QUOTA=y
# CONFIG_XFS_POSIX_ACL is not set
CONFIG_XFS_RT=y
CONFIG_XFS_DEBUG=y
# CONFIG_OCFS2_FS is not set
CONFIG_BTRFS_FS=y
# CONFIG_BTRFS_FS_POSIX_ACL is not set
# CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y
# CONFIG_BTRFS_DEBUG is not set
CONFIG_BTRFS_ASSERT=y
# CONFIG_NILFS2_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
# CONFIG_DNOTIFY is not set
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
CONFIG_QUOTA=y
# CONFIG_QUOTA_NETLINK_INTERFACE is not set
CONFIG_PRINT_QUOTA_WARNING=y
CONFIG_QUOTA_DEBUG=y
CONFIG_QUOTA_TREE=y
CONFIG_QFMT_V1=y
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_AUTOFS4_FS=y
CONFIG_FUSE_FS=y
CONFIG_CUSE=y
CONFIG_GENERIC_ACL=y

#
# Caches
#
CONFIG_FSCACHE=y
CONFIG_FSCACHE_STATS=y
CONFIG_FSCACHE_HISTOGRAM=y
# CONFIG_FSCACHE_DEBUG is not set
# CONFIG_FSCACHE_OBJECT_LIST is not set
CONFIG_CACHEFILES=y
CONFIG_CACHEFILES_DEBUG=y
CONFIG_CACHEFILES_HISTOGRAM=y

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
# CONFIG_ZISOFS is not set
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=y
CONFIG_NTFS_DEBUG=y
CONFIG_NTFS_RW=y

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
# CONFIG_PROC_VMCORE is not set
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_CONFIGFS_FS=y
# CONFIG_MISC_FILESYSTEMS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V2=y
# CONFIG_NFS_V3 is not set
CONFIG_NFS_V4=y
CONFIG_NFS_SWAP=y
# CONFIG_NFS_V4_1 is not set
# CONFIG_ROOT_NFS is not set
# CONFIG_NFS_FSCACHE is not set
CONFIG_NFS_USE_LEGACY_DNS=y
CONFIG_NFS_DEBUG=y
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
# CONFIG_NFSD_V3_ACL is not set
CONFIG_NFSD_V4=y
CONFIG_NFSD_V4_SECURITY_LABEL=y
CONFIG_NFSD_FAULT_INJECTION=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
CONFIG_SUNRPC_XPRT_RDMA=y
CONFIG_SUNRPC_SWAP=y
CONFIG_SUNRPC_DEBUG=y
CONFIG_CEPH_FS=y
CONFIG_CEPH_FSCACHE=y
CONFIG_CIFS=y
CONFIG_CIFS_STATS=y
CONFIG_CIFS_STATS2=y
CONFIG_CIFS_WEAK_PW_HASH=y
CONFIG_CIFS_UPCALL=y
# CONFIG_CIFS_XATTR is not set
CONFIG_CIFS_DEBUG=y
# CONFIG_CIFS_DEBUG2 is not set
CONFIG_CIFS_DFS_UPCALL=y
CONFIG_CIFS_SMB2=y
CONFIG_CIFS_FSCACHE=y
CONFIG_NCP_FS=y
# CONFIG_NCPFS_PACKET_SIGNING is not set
# CONFIG_NCPFS_IOCTL_LOCKING is not set
# CONFIG_NCPFS_STRONG is not set
# CONFIG_NCPFS_NFS_NS is not set
# CONFIG_NCPFS_OS2_NS is not set
# CONFIG_NCPFS_SMALLDOS is not set
CONFIG_NCPFS_NLS=y
# CONFIG_NCPFS_EXTRAS is not set
# CONFIG_CODA_FS is not set
CONFIG_AFS_FS=y
# CONFIG_AFS_DEBUG is not set
# CONFIG_AFS_FSCACHE is not set
CONFIG_9P_FS=y
CONFIG_9P_FSCACHE=y
CONFIG_9P_FS_POSIX_ACL=y
# CONFIG_9P_FS_SECURITY is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=y
CONFIG_NLS_CODEPAGE_737=y
CONFIG_NLS_CODEPAGE_775=y
CONFIG_NLS_CODEPAGE_850=y
CONFIG_NLS_CODEPAGE_852=y
CONFIG_NLS_CODEPAGE_855=y
CONFIG_NLS_CODEPAGE_857=y
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
CONFIG_NLS_CODEPAGE_863=y
# CONFIG_NLS_CODEPAGE_864 is not set
CONFIG_NLS_CODEPAGE_865=y
CONFIG_NLS_CODEPAGE_866=y
CONFIG_NLS_CODEPAGE_869=y
CONFIG_NLS_CODEPAGE_936=y
CONFIG_NLS_CODEPAGE_950=y
# CONFIG_NLS_CODEPAGE_932 is not set
CONFIG_NLS_CODEPAGE_949=y
# CONFIG_NLS_CODEPAGE_874 is not set
CONFIG_NLS_ISO8859_8=y
CONFIG_NLS_CODEPAGE_1250=y
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
CONFIG_NLS_ISO8859_2=y
# CONFIG_NLS_ISO8859_3 is not set
CONFIG_NLS_ISO8859_4=y
# CONFIG_NLS_ISO8859_5 is not set
CONFIG_NLS_ISO8859_6=y
CONFIG_NLS_ISO8859_7=y
CONFIG_NLS_ISO8859_9=y
CONFIG_NLS_ISO8859_13=y
# CONFIG_NLS_ISO8859_14 is not set
CONFIG_NLS_ISO8859_15=y
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
CONFIG_NLS_MAC_CELTIC=y
# CONFIG_NLS_MAC_CENTEURO is not set
CONFIG_NLS_MAC_CROATIAN=y
CONFIG_NLS_MAC_CYRILLIC=y
# CONFIG_NLS_MAC_GAELIC is not set
CONFIG_NLS_MAC_GREEK=y
CONFIG_NLS_MAC_ICELAND=y
CONFIG_NLS_MAC_INUIT=y
CONFIG_NLS_MAC_ROMANIAN=y
# CONFIG_NLS_MAC_TURKISH is not set
CONFIG_NLS_UTF8=y
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y

#
# printk and dmesg options
#
# CONFIG_PRINTK_TIME is not set
CONFIG_DEFAULT_MESSAGE_LOGLEVEL=4
CONFIG_BOOT_PRINTK_DELAY=y
CONFIG_DYNAMIC_DEBUG=y

#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_REDUCED=y
# CONFIG_ENABLE_WARN_DEPRECATED is not set
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=1024
# CONFIG_STRIP_ASM_SYMS is not set
CONFIG_READABLE_ASM=y
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_SECTION_MISMATCH=y
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_KERNEL=y

#
# Memory Debugging
#
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_WANT_PAGE_DEBUG_FLAGS=y
CONFIG_PAGE_GUARD=y
# CONFIG_DEBUG_OBJECTS is not set
CONFIG_DEBUG_SLAB=y
# CONFIG_DEBUG_SLAB_LEAK is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_DEBUG_STACK_USAGE is not set
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_RB=y
CONFIG_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_MEMORY_INIT is not set
# CONFIG_DEBUG_HIGHMEM is not set
CONFIG_HAVE_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_HAVE_ARCH_KMEMCHECK=y
# CONFIG_DEBUG_SHIRQ is not set

#
# Debug Lockups and Hangs
#
CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=1
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
# CONFIG_DEBUG_RT_MUTEXES is not set
CONFIG_RT_MUTEX_TESTER=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_LOCK_ALLOC=y
# CONFIG_PROVE_LOCKING is not set
CONFIG_LOCKDEP=y
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_LOCKDEP is not set
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_KOBJECT_RELEASE is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_WRITECOUNT is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
CONFIG_DEBUG_NOTIFIERS=y
# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
CONFIG_NOTIFIER_ERROR_INJECTION=y
# CONFIG_PM_NOTIFIER_ERROR_INJECT is not set
CONFIG_FAULT_INJECTION=y
# CONFIG_FAILSLAB is not set
# CONFIG_FAIL_PAGE_ALLOC is not set
# CONFIG_FAIL_MAKE_REQUEST is not set
# CONFIG_FAIL_IO_TIMEOUT is not set
# CONFIG_FAIL_MMC_REQUEST is not set
CONFIG_FAULT_INJECTION_DEBUG_FS=y
CONFIG_FAULT_INJECTION_STACKTRACE_FILTER=y
# CONFIG_LATENCYTOP is not set
CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS=y
CONFIG_DEBUG_STRICT_USER_COPY_CHECKS=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_RING_BUFFER_ALLOW_SWAP=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
# CONFIG_FUNCTION_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_TRACER_SNAPSHOT=y
CONFIG_TRACER_SNAPSHOT_PER_CPU_SWAP=y
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
# CONFIG_STACK_TRACER is not set
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_UPROBE_EVENT=y
CONFIG_PROBE_EVENTS=y
# CONFIG_FTRACE_STARTUP_TEST is not set
CONFIG_MMIOTRACE=y
# CONFIG_RING_BUFFER_BENCHMARK is not set
CONFIG_RING_BUFFER_STARTUP_TEST=y

#
# Runtime Testing
#
CONFIG_LKDTM=y
# CONFIG_TEST_LIST_SORT is not set
CONFIG_BACKTRACE_SELF_TEST=y
CONFIG_RBTREE_TEST=y
CONFIG_ATOMIC64_SELFTEST=y
# CONFIG_TEST_STRING_HELPERS is not set
CONFIG_TEST_KSTRTOX=y
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y
CONFIG_KGDB_TESTS=y
# CONFIG_KGDB_TESTS_ON_BOOT is not set
# CONFIG_KGDB_LOW_LEVEL_TRAP is not set
# CONFIG_KGDB_KDB is not set
# CONFIG_STRICT_DEVMEM is not set
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
CONFIG_X86_PTDUMP=y
CONFIG_DEBUG_RODATA=y
CONFIG_DEBUG_RODATA_TEST=y
CONFIG_DOUBLEFAULT=y
# CONFIG_DEBUG_TLBFLUSH is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
CONFIG_DEBUG_BOOT_PARAMS=y
CONFIG_CPA_DEBUG=y
CONFIG_OPTIMIZE_INLINING=y
CONFIG_DEBUG_NMI_SELFTEST=y
# CONFIG_X86_DEBUG_STATIC_CPU_HAS is not set

#
# Security options
#
CONFIG_KEYS=y
CONFIG_TRUSTED_KEYS=y
CONFIG_ENCRYPTED_KEYS=y
# CONFIG_KEYS_DEBUG_PROC_KEYS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_NETWORK_XFRM=y
CONFIG_SECURITY_PATH=y
# CONFIG_SECURITY_SELINUX is not set
# CONFIG_SECURITY_SMACK is not set
CONFIG_SECURITY_TOMOYO=y
CONFIG_SECURITY_TOMOYO_MAX_ACCEPT_ENTRY=2048
CONFIG_SECURITY_TOMOYO_MAX_AUDIT_LOG=1024
# CONFIG_SECURITY_TOMOYO_OMIT_USERSPACE_LOADER is not set
CONFIG_SECURITY_TOMOYO_POLICY_LOADER="/sbin/tomoyo-init"
CONFIG_SECURITY_TOMOYO_ACTIVATION_TRIGGER="/sbin/init"
CONFIG_SECURITY_APPARMOR=y
CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=1
# CONFIG_SECURITY_APPARMOR_HASH is not set
# CONFIG_SECURITY_YAMA is not set
CONFIG_INTEGRITY=y
CONFIG_INTEGRITY_SIGNATURE=y
CONFIG_INTEGRITY_AUDIT=y
# CONFIG_INTEGRITY_ASYMMETRIC_KEYS is not set
CONFIG_IMA=y
CONFIG_IMA_MEASURE_PCR_IDX=10
CONFIG_IMA_APPRAISE=y
# CONFIG_EVM is not set
CONFIG_DEFAULT_SECURITY_TOMOYO=y
# CONFIG_DEFAULT_SECURITY_APPARMOR is not set
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_DEFAULT_SECURITY="tomoyo"
CONFIG_XOR_BLOCKS=y
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_FIPS=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP=y
CONFIG_CRYPTO_PCOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
CONFIG_CRYPTO_USER=y
# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
CONFIG_CRYPTO_GF128MUL=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_WORKQUEUE=y
CONFIG_CRYPTO_CRYPTD=y
CONFIG_CRYPTO_AUTHENC=y
CONFIG_CRYPTO_ABLK_HELPER_X86=y

#
# Authenticated Encryption with Associated Data
#
CONFIG_CRYPTO_CCM=y
CONFIG_CRYPTO_GCM=y
CONFIG_CRYPTO_SEQIV=y

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=y
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=y
CONFIG_CRYPTO_LRW=y
CONFIG_CRYPTO_PCBC=y
CONFIG_CRYPTO_XTS=y

#
# Hash modes
#
CONFIG_CRYPTO_CMAC=y
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_XCBC=y
CONFIG_CRYPTO_VMAC=y

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32C_INTEL=y
CONFIG_CRYPTO_CRC32=y
CONFIG_CRYPTO_CRC32_PCLMUL=y
CONFIG_CRYPTO_CRCT10DIF=y
CONFIG_CRYPTO_GHASH=y
CONFIG_CRYPTO_MD4=y
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=y
CONFIG_CRYPTO_RMD128=y
CONFIG_CRYPTO_RMD160=y
CONFIG_CRYPTO_RMD256=y
CONFIG_CRYPTO_RMD320=y
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=y
# CONFIG_CRYPTO_TGR192 is not set
CONFIG_CRYPTO_WP512=y

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_AES_586=y
CONFIG_CRYPTO_AES_NI_INTEL=y
CONFIG_CRYPTO_ANUBIS=y
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_BLOWFISH is not set
CONFIG_CRYPTO_CAMELLIA=y
CONFIG_CRYPTO_CAST_COMMON=y
CONFIG_CRYPTO_CAST5=y
CONFIG_CRYPTO_CAST6=y
CONFIG_CRYPTO_DES=y
CONFIG_CRYPTO_FCRYPT=y
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
CONFIG_CRYPTO_SALSA20_586=y
# CONFIG_CRYPTO_SEED is not set
CONFIG_CRYPTO_SERPENT=y
# CONFIG_CRYPTO_SERPENT_SSE2_586 is not set
CONFIG_CRYPTO_TEA=y
# CONFIG_CRYPTO_TWOFISH is not set
CONFIG_CRYPTO_TWOFISH_COMMON=y
CONFIG_CRYPTO_TWOFISH_586=y

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=y
CONFIG_CRYPTO_ZLIB=y
CONFIG_CRYPTO_LZO=y
CONFIG_CRYPTO_LZ4=y
CONFIG_CRYPTO_LZ4HC=y

#
# Random Number Generation
#
CONFIG_CRYPTO_ANSI_CPRNG=y
CONFIG_CRYPTO_USER_API=y
CONFIG_CRYPTO_USER_API_HASH=y
# CONFIG_CRYPTO_USER_API_SKCIPHER is not set
# CONFIG_CRYPTO_HW is not set
CONFIG_ASYMMETRIC_KEY_TYPE=y
CONFIG_ASYMMETRIC_PUBLIC_KEY_SUBTYPE=y
CONFIG_PUBLIC_KEY_ALGO_RSA=y
# CONFIG_X509_CERTIFICATE_PARSER is not set
CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y
# CONFIG_LGUEST is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_RAID6_PQ=y
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_IO=y
CONFIG_PERCPU_RWSEM=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
CONFIG_CRC32_SELFTEST=y
# CONFIG_CRC32_SLICEBY8 is not set
# CONFIG_CRC32_SLICEBY4 is not set
CONFIG_CRC32_SARWATE=y
# CONFIG_CRC32_BIT is not set
CONFIG_CRC7=y
CONFIG_LIBCRC32C=y
CONFIG_CRC8=y
CONFIG_AUDIT_GENERIC=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_LZ4_COMPRESS=y
CONFIG_LZ4HC_COMPRESS=y
CONFIG_LZ4_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
# CONFIG_XZ_DEC_IA64 is not set
# CONFIG_XZ_DEC_ARM is not set
# CONFIG_XZ_DEC_ARMTHUMB is not set
# CONFIG_XZ_DEC_SPARC is not set
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=y
CONFIG_TEXTSEARCH_BM=y
CONFIG_TEXTSEARCH_FSM=y
CONFIG_BTREE=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_DQL=y
CONFIG_NLATTR=y
CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE=y
CONFIG_AVERAGE=y
CONFIG_CLZ_TAB=y
CONFIG_CORDIC=y
# CONFIG_DDR is not set
CONFIG_MPILIB=y
CONFIG_SIGNATURE=y
CONFIG_OID_REGISTRY=y
CONFIG_FONT_SUPPORT=y
CONFIG_FONTS=y
# CONFIG_FONT_8x8 is not set
CONFIG_FONT_8x16=y
CONFIG_FONT_6x11=y
# CONFIG_FONT_7x14 is not set
# CONFIG_FONT_PEARL_8x8 is not set
CONFIG_FONT_ACORN_8x8=y
CONFIG_FONT_MINI_4x6=y
# CONFIG_FONT_SUN8x16 is not set
# CONFIG_FONT_SUN12x22 is not set
CONFIG_FONT_10x18=y


* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-09 16:28 ` Ingo Molnar
@ 2013-10-09 16:29   ` Ingo Molnar
  2013-10-09 16:57       ` Ingo Molnar
  2013-10-09 17:08     ` Peter Zijlstra
  1 sibling, 1 reply; 340+ messages in thread
From: Ingo Molnar @ 2013-10-09 16:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

[-- Attachment #1: Type: text/plain, Size: 32 bytes --]


full crashlog attached.

	Ingo

[-- Attachment #2: crash.log --]
[-- Type: text/plain, Size: 256706 bytes --]

Linux version 3.12.0-rc4-01668-gfd71a04-dirty (mingo@earth5) (gcc version 4.7.2 20120921 (Red Hat 4.7.2-2) (GCC) ) #229484 Wed Oct 9 16:59:58 CEST 2013
KERNEL supported cpus:
  Centaur CentaurHauls
  Transmeta GenuineTMx86
  Transmeta TransmetaCPU
CPU: vendor_id 'AuthenticAMD' unknown, using generic init.
CPU: Your system may be unstable.
e820: BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009f7ff] usable
BIOS-e820: [mem 0x000000000009f800-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x000000003ffeffff] usable
BIOS-e820: [mem 0x000000003fff0000-0x000000003fff2fff] ACPI NVS
BIOS-e820: [mem 0x000000003fff3000-0x000000003fffffff] ACPI data
BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
BIOS-e820: [mem 0x00000000fec00000-0x00000000ffffffff] reserved
console [earlyser0] enabled
debug: ignoring loglevel setting.
Notice: NX (Execute Disable) protection cannot be enabled: non-PAE kernel!
e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
e820: remove [mem 0x000a0000-0x000fffff] usable
e820: last_pfn = 0x3fff0 max_arch_pfn = 0x100000
MTRR default type: uncachable
MTRR fixed ranges enabled:
  00000-9FFFF write-back
  A0000-BFFFF uncachable
  C0000-C7FFF write-protect
  C8000-FFFFF uncachable
MTRR variable ranges enabled:
  0 base 0000000000 mask FFC0000000 write-back
  1 disabled
  2 disabled
  3 disabled
  4 disabled
  5 disabled
  6 disabled
  7 disabled
x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
Scanning 1 areas for low memory corruption
initial memory mapped: [mem 0x00000000-0x037fffff]
Base memory trampoline at [b009b000] 9b000 size 16384
init_memory_mapping: [mem 0x00000000-0x000fffff]
 [mem 0x00000000-0x000fffff] page 4k
init_memory_mapping: [mem 0x3f800000-0x3fbfffff]
 [mem 0x3f800000-0x3fbfffff] page 4k
BRK [0x0335b000, 0x0335bfff] PGTABLE
init_memory_mapping: [mem 0x38000000-0x3f7fffff]
 [mem 0x38000000-0x3f7fffff] page 4k
BRK [0x0335c000, 0x0335cfff] PGTABLE
BRK [0x0335d000, 0x0335dfff] PGTABLE
BRK [0x0335e000, 0x0335efff] PGTABLE
BRK [0x0335f000, 0x0335ffff] PGTABLE
BRK [0x03360000, 0x03360fff] PGTABLE
init_memory_mapping: [mem 0x00100000-0x37ffffff]
 [mem 0x00100000-0x37ffffff] page 4k
init_memory_mapping: [mem 0x3fc00000-0x3ffeffff]
 [mem 0x3fc00000-0x3ffeffff] page 4k
0MB HIGHMEM available.
1023MB LOWMEM available.
  mapped low ram: 0 - 3fff0000
  low ram: 0 - 3fff0000
Zone ranges:
  Normal   [mem 0x00001000-0x3ffeffff]
  HighMem  empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x00001000-0x0009efff]
  node   0: [mem 0x00100000-0x3ffeffff]
On node 0 totalpages: 262030
free_area_init_node: node 0, pgdat b2cac760, node_mem_map ef214024
  Normal zone: 2304 pages used for memmap
  Normal zone: 0 pages reserved
  Normal zone: 262030 pages, LIFO batch:31
Using APIC driver default
SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
No local APIC present or hardware disabled
APIC: disable apic facility
APIC: switched to apic NOOP
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic_noop.c:113 noop_apic_read+0x29/0x3f()
CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.0-rc4-01668-gfd71a04-dirty #229484
 b286bb7c b2b4def8 b235a604 b2b4df28 b1028784 b286ec98 00000000 00000000
 b286bb7c 00000071 b1015bba b1015bba 00000071 00000000 b2dc0000 b2b4df38
 b102882f 00000009 00000000 b2b4df40 b1015bba b2b4df48 b10142e8 b2b4df58
Call Trace:
 [<b235a604>] dump_stack+0x16/0x18
 [<b1028784>] warn_slowpath_common+0x73/0x89
 [<b1015bba>] ? noop_apic_read+0x29/0x3f
 [<b102882f>] warn_slowpath_null+0x1d/0x1f
 [<b1015bba>] noop_apic_read+0x29/0x3f
 [<b10142e8>] read_apic_id+0x14/0x1f
 [<b2d0ed8d>] init_apic_mappings+0xea/0x140
 [<b2d080f3>] setup_arch+0xa45/0xab3
 [<b23557b4>] ? printk+0x38/0x3a
 [<b2d057e7>] start_kernel+0xb9/0x354
 [<b2d05384>] i386_start_kernel+0x12e/0x131
---[ end trace a7919e7f17c0a725 ]---
nr_irqs_gsi: 16
e820: [mem 0x40000000-0xdfffffff] available for PCI devices
pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
pcpu-alloc: [0] 0 
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 259726
Kernel command line: root=/dev/sda1 earlyprintk=ttyS0,115200,keep console=ttyS0,115200 debug initcall_debug enforcing=0 apic=verbose ignore_loglevel sysrq_always_enabled selinux=0 nmi_watchdog=0 3 panic=1 3
sysrq: sysrq always enabled.
PID hash table entries: 4096 (order: 2, 16384 bytes)
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Initializing CPU#0
Initializing HighMem for node 0 (00000000:00000000)
Memory: 1000788K/1048120K available (19921K kernel code, 1764K rwdata, 8020K rodata, 732K init, 5700K bss, 47332K reserved, 0K highmem)
virtual kernel memory layout:
    fixmap  : 0xfffa3000 - 0xfffff000   ( 368 kB)
    pkmap   : 0xff800000 - 0xffc00000   (4096 kB)
    vmalloc : 0xf07f0000 - 0xff7fe000   ( 240 MB)
    lowmem  : 0xb0000000 - 0xefff0000   (1023 MB)
      .init : 0xb2d05000 - 0xb2dbc000   ( 732 kB)
      .data : 0xb237475f - 0xb2d040e0   (9790 kB)
      .text : 0xb1000000 - 0xb237475f   (19921 kB)
Checking if this processor honours the WP bit even in supervisor mode...Ok.
NR_IRQS:2304 nr_irqs:256 16
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic_noop.c:119 noop_apic_write+0x26/0x3b()
CPU: 0 PID: 0 Comm: swapper Tainted: G        W    3.12.0-rc4-01668-gfd71a04-dirty #229484
 b286bb7c b2b4df4c b235a604 b2b4df7c b1028784 b286ec98 00000000 00000000
 b286bb7c 00000077 b1015b7c b1015b7c 00000077 b2d91900 b2b54460 b2b4df8c
 b102882f 00000009 00000000 b2b4df94 b1015b7c b2b4df9c b2d0eb47 b2b4dfb4
Call Trace:
 [<b235a604>] dump_stack+0x16/0x18
 [<b1028784>] warn_slowpath_common+0x73/0x89
 [<b1015b7c>] ? noop_apic_write+0x26/0x3b
 [<b102882f>] warn_slowpath_null+0x1d/0x1f
 [<b1015b7c>] noop_apic_write+0x26/0x3b
 [<b2d0eb47>] init_bsp_APIC+0x64/0xb4
 [<b2d081bc>] init_ISA_irqs+0x16/0x46
 [<b2d0821c>] native_init_IRQ+0xa/0x1ae
 [<b2d08210>] init_IRQ+0x24/0x26
 [<b2d0591f>] start_kernel+0x1f1/0x354
 [<b2d054d3>] ? repair_env_string+0x5e/0x5e
 [<b2d05384>] i386_start_kernel+0x12e/0x131
---[ end trace a7919e7f17c0a726 ]---
CPU 0 irqstacks, hard=b008c000 soft=b008e000
spurious 8259A interrupt: IRQ7.
Console: colour VGA+ 80x25
console [ttyS0] enabled
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES:  8
... MAX_LOCK_DEPTH:          48
... MAX_LOCKDEP_KEYS:        8191
... CLASSHASH_SIZE:          4096
... MAX_LOCKDEP_ENTRIES:     16384
... MAX_LOCKDEP_CHAINS:      32768
... CHAINHASH_SIZE:          16384
 memory used by lock dependency info: 3551 kB
 per task-struct memory footprint: 1152 bytes
------------------------
| Locking API testsuite:
----------------------------------------------------------------------------
                                 | spin |wlock |rlock |mutex | wsem | rsem |
  --------------------------------------------------------------------------
                     A-A deadlock:                     A-A deadlock:failed|failed|failed|failed|  ok  |  ok  |failed|failed|failed|failed|failed|failed|

                 A-B-B-A deadlock:                 A-B-B-A deadlock:failed|failed|failed|failed|  ok  |  ok  |failed|failed|failed|failed|failed|failed|

             A-B-B-C-C-A deadlock:             A-B-B-C-C-A deadlock:failed|failed|failed|failed|  ok  |  ok  |failed|failed|failed|failed|failed|failed|

             A-B-C-A-B-C deadlock:             A-B-C-A-B-C deadlock:failed|failed|failed|failed|  ok  |  ok  |failed|failed|failed|failed|failed|failed|

         A-B-B-C-C-D-D-A deadlock:         A-B-B-C-C-D-D-A deadlock:failed|failed|failed|failed|  ok  |  ok  |failed|failed|failed|failed|failed|failed|

         A-B-C-D-B-D-D-A deadlock:         A-B-C-D-B-D-D-A deadlock:failed|failed|failed|failed|  ok  |  ok  |failed|failed|failed|failed|failed|failed|

         A-B-C-D-B-C-D-A deadlock:         A-B-C-D-B-C-D-A deadlock:failed|failed|failed|failed|  ok  |  ok  |failed|failed|failed|failed|failed|failed|

                    double unlock:                    double unlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |

                  initialize held:                  initialize held:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |

                 bad unlock order:                 bad unlock order:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |

  --------------------------------------------------------------------------
              recursive read-lock:              recursive read-lock:             |             |  ok  |  ok  |             |             |failed|failed|

           recursive read-lock #2:           recursive read-lock #2:             |             |  ok  |  ok  |             |             |failed|failed|

            mixed read-write-lock:            mixed read-write-lock:             |             |failed|failed|             |             |failed|failed|

            mixed write-read-lock:            mixed write-read-lock:             |             |failed|failed|             |             |failed|failed|

  --------------------------------------------------------------------------
     hard-irqs-on + irq-safe-A/12:     hard-irqs-on + irq-safe-A/12:failed|failed|failed|failed|  ok  |  ok  |

     soft-irqs-on + irq-safe-A/12:     soft-irqs-on + irq-safe-A/12:failed|failed|failed|failed|  ok  |  ok  |

     hard-irqs-on + irq-safe-A/21:     hard-irqs-on + irq-safe-A/21:failed|failed|failed|failed|  ok  |  ok  |

     soft-irqs-on + irq-safe-A/21:     soft-irqs-on + irq-safe-A/21:failed|failed|failed|failed|  ok  |  ok  |

       sirq-safe-A => hirqs-on/12:       sirq-safe-A => hirqs-on/12:failed|failed|failed|failed|  ok  |  ok  |

       sirq-safe-A => hirqs-on/21:       sirq-safe-A => hirqs-on/21:failed|failed|failed|failed|  ok  |  ok  |

         hard-safe-A + irqs-on/12:         hard-safe-A + irqs-on/12:failed|failed|failed|failed|  ok  |  ok  |

         soft-safe-A + irqs-on/12:         soft-safe-A + irqs-on/12:failed|failed|failed|failed|  ok  |  ok  |

         hard-safe-A + irqs-on/21:         hard-safe-A + irqs-on/21:failed|failed|failed|failed|  ok  |  ok  |

         soft-safe-A + irqs-on/21:         soft-safe-A + irqs-on/21:failed|failed|failed|failed|  ok  |  ok  |

    hard-safe-A + unsafe-B #1/123:    hard-safe-A + unsafe-B #1/123:failed|failed|failed|failed|  ok  |  ok  |

    soft-safe-A + unsafe-B #1/123:    soft-safe-A + unsafe-B #1/123:failed|failed|failed|failed|  ok  |  ok  |

    hard-safe-A + unsafe-B #1/132:    hard-safe-A + unsafe-B #1/132:failed|failed|failed|failed|  ok  |  ok  |

    soft-safe-A + unsafe-B #1/132:    soft-safe-A + unsafe-B #1/132:failed|failed|failed|failed|  ok  |  ok  |

    hard-safe-A + unsafe-B #1/213:    hard-safe-A + unsafe-B #1/213:failed|failed|failed|failed|  ok  |  ok  |

    soft-safe-A + unsafe-B #1/213:    soft-safe-A + unsafe-B #1/213:failed|failed|failed|failed|  ok  |  ok  |

    hard-safe-A + unsafe-B #1/231:    hard-safe-A + unsafe-B #1/231:failed|failed|failed|failed|  ok  |  ok  |

    soft-safe-A + unsafe-B #1/231:    soft-safe-A + unsafe-B #1/231:failed|failed|failed|failed|  ok  |  ok  |

    hard-safe-A + unsafe-B #1/312:    hard-safe-A + unsafe-B #1/312:failed|failed|failed|failed|  ok  |  ok  |

    soft-safe-A + unsafe-B #1/312:    soft-safe-A + unsafe-B #1/312:failed|failed|failed|failed|  ok  |  ok  |

    hard-safe-A + unsafe-B #1/321:    hard-safe-A + unsafe-B #1/321:failed|failed|failed|failed|  ok  |  ok  |

    soft-safe-A + unsafe-B #1/321:    soft-safe-A + unsafe-B #1/321:failed|failed|failed|failed|  ok  |  ok  |

    hard-safe-A + unsafe-B #2/123:    hard-safe-A + unsafe-B #2/123:failed|failed|failed|failed|  ok  |  ok  |

    soft-safe-A + unsafe-B #2/123:    soft-safe-A + unsafe-B #2/123:failed|failed|failed|failed|  ok  |  ok  |

    hard-safe-A + unsafe-B #2/132:    hard-safe-A + unsafe-B #2/132:failed|failed|failed|failed|  ok  |  ok  |

    soft-safe-A + unsafe-B #2/132:    soft-safe-A + unsafe-B #2/132:failed|failed|failed|failed|  ok  |  ok  |

    hard-safe-A + unsafe-B #2/213:    hard-safe-A + unsafe-B #2/213:failed|failed|failed|failed|  ok  |  ok  |

    soft-safe-A + unsafe-B #2/213:    soft-safe-A + unsafe-B #2/213:failed|failed|failed|failed|  ok  |  ok  |

    hard-safe-A + unsafe-B #2/231:    hard-safe-A + unsafe-B #2/231:failed|failed|failed|failed|  ok  |  ok  |

    soft-safe-A + unsafe-B #2/231:    soft-safe-A + unsafe-B #2/231:failed|failed|failed|failed|  ok  |  ok  |

    hard-safe-A + unsafe-B #2/312:    hard-safe-A + unsafe-B #2/312:failed|failed|failed|failed|  ok  |  ok  |

    soft-safe-A + unsafe-B #2/312:    soft-safe-A + unsafe-B #2/312:failed|failed|failed|failed|  ok  |  ok  |

    hard-safe-A + unsafe-B #2/321:    hard-safe-A + unsafe-B #2/321:failed|failed|failed|failed|  ok  |  ok  |

    soft-safe-A + unsafe-B #2/321:    soft-safe-A + unsafe-B #2/321:failed|failed|failed|failed|  ok  |  ok  |

      hard-irq lock-inversion/123:      hard-irq lock-inversion/123:failed|failed|failed|failed|  ok  |  ok  |

      soft-irq lock-inversion/123:      soft-irq lock-inversion/123:failed|failed|failed|failed|  ok  |  ok  |

      hard-irq lock-inversion/132:      hard-irq lock-inversion/132:failed|failed|failed|failed|  ok  |  ok  |

      soft-irq lock-inversion/132:      soft-irq lock-inversion/132:failed|failed|failed|failed|  ok  |  ok  |

      hard-irq lock-inversion/213:      hard-irq lock-inversion/213:failed|failed|failed|failed|  ok  |  ok  |

      soft-irq lock-inversion/213:      soft-irq lock-inversion/213:failed|failed|failed|failed|  ok  |  ok  |

      hard-irq lock-inversion/231:      hard-irq lock-inversion/231:failed|failed|failed|failed|  ok  |  ok  |

      soft-irq lock-inversion/231:      soft-irq lock-inversion/231:failed|failed|failed|failed|  ok  |  ok  |

      hard-irq lock-inversion/312:      hard-irq lock-inversion/312:failed|failed|failed|failed|  ok  |  ok  |

      soft-irq lock-inversion/312:      soft-irq lock-inversion/312:failed|failed|failed|failed|  ok  |  ok  |

      hard-irq lock-inversion/321:      hard-irq lock-inversion/321:failed|failed|failed|failed|  ok  |  ok  |

      soft-irq lock-inversion/321:      soft-irq lock-inversion/321:failed|failed|failed|failed|  ok  |  ok  |

      hard-irq read-recursion/123:      hard-irq read-recursion/123:  ok  |  ok  |

      soft-irq read-recursion/123:      soft-irq read-recursion/123:  ok  |  ok  |

      hard-irq read-recursion/132:      hard-irq read-recursion/132:  ok  |  ok  |

      soft-irq read-recursion/132:      soft-irq read-recursion/132:  ok  |  ok  |

      hard-irq read-recursion/213:      hard-irq read-recursion/213:  ok  |  ok  |

      soft-irq read-recursion/213:      soft-irq read-recursion/213:  ok  |  ok  |

      hard-irq read-recursion/231:      hard-irq read-recursion/231:  ok  |  ok  |

      soft-irq read-recursion/231:      soft-irq read-recursion/231:  ok  |  ok  |

      hard-irq read-recursion/312:      hard-irq read-recursion/312:  ok  |  ok  |

      soft-irq read-recursion/312:      soft-irq read-recursion/312:  ok  |  ok  |

      hard-irq read-recursion/321:      hard-irq read-recursion/321:  ok  |  ok  |

      soft-irq read-recursion/321:      soft-irq read-recursion/321:  ok  |  ok  |

  --------------------------------------------------------------------------
  | Wound/wait tests |
  ---------------------
                  ww api failures:                  ww api failures:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |

               ww contexts mixing:               ww contexts mixing:failed|failed|  ok  |  ok  |

             finishing ww context:             finishing ww context:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |

               locking mismatches:               locking mismatches:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |

                 EDEADLK handling:                 EDEADLK handling:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |

           spinlock nest unlocked:           spinlock nest unlocked:  ok  |  ok  |

  -----------------------------------------------------
                                 |block | try  |context|
  -----------------------------------------------------
                          context:                          context:failed|failed|  ok  |  ok  |  ok  |  ok  |

                              try:                              try:failed|failed|  ok  |  ok  |failed|failed|

                            block:                            block:failed|failed|  ok  |  ok  |failed|failed|

                         spinlock:                         spinlock:failed|failed|  ok  |  ok  |failed|failed|

--------------------------------------------------------
141 out of 253 testcases failed, as expected. |
----------------------------------------------------
tsc: Fast TSC calibration using PIT
tsc: Detected 2010.210 MHz processor
Calibrating delay loop (skipped), value calculated using timer frequency.. Calibrating delay loop (skipped), value calculated using timer frequency.. 4020.42 BogoMIPS (lpj=2010210)
4020.42 BogoMIPS (lpj=2010210)
pid_max: default: 32768 minimum: 301
Security Framework initialized
TOMOYO Linux initialized
AppArmor: AppArmor disabled by boot time parameter
Mount-cache hash table entries: 512
mce: CPU supports 5 MCE banks
mce: unknown CPU type - not enabling MCE support
Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
tlb_flushall_shift: -1
Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
tlb_flushall_shift: -1
CPU: CPU: AuthenticAMD AuthenticAMD AMD Athlon(tm) 64 X2 Dual Core Processor 3800+AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ (fam: 0f, model: 23 (fam: 0f, model: 23, stepping: 02)
, stepping: 02)
calling  set_real_mode_permissions+0x0/0x7c @ 1
initcall set_real_mode_permissions+0x0/0x7c returned 0 after 0 usecs
calling  trace_init_flags_sys_exit+0x0/0x11 @ 1
initcall trace_init_flags_sys_exit+0x0/0x11 returned 0 after 0 usecs
calling  trace_init_flags_sys_enter+0x0/0x11 @ 1
initcall trace_init_flags_sys_enter+0x0/0x11 returned 0 after 0 usecs
calling  init_hw_perf_events+0x0/0x4bb @ 1
Performance Events: Performance Events: no PMU driver, software events only.
no PMU driver, software events only.
initcall init_hw_perf_events+0x0/0x4bb returned 0 after 1953 usecs
calling  register_trigger_all_cpu_backtrace+0x0/0x13 @ 1
initcall register_trigger_all_cpu_backtrace+0x0/0x13 returned 0 after 0 usecs
calling  spawn_ksoftirqd+0x0/0x17 @ 1
initcall spawn_ksoftirqd+0x0/0x17 returned 0 after 0 usecs
calling  init_workqueues+0x0/0x2cf @ 1
initcall init_workqueues+0x0/0x2cf returned 0 after 0 usecs
calling  relay_init+0x0/0x7 @ 1
initcall relay_init+0x0/0x7 returned 0 after 0 usecs
calling  tracer_alloc_buffers+0x0/0x18c @ 1
initcall tracer_alloc_buffers+0x0/0x18c returned 0 after 976 usecs
calling  init_events+0x0/0x5d @ 1
initcall init_events+0x0/0x5d returned 0 after 0 usecs
calling  init_trace_printk+0x0/0x7 @ 1
initcall init_trace_printk+0x0/0x7 returned 0 after 0 usecs
calling  event_trace_memsetup+0x0/0x5a @ 1
initcall event_trace_memsetup+0x0/0x5a returned 0 after 0 usecs
calling  init_ftrace_syscalls+0x0/0x5e @ 1
initcall init_ftrace_syscalls+0x0/0x5e returned 0 after 976 usecs
calling  dynamic_debug_init+0x0/0x22c @ 1
initcall dynamic_debug_init+0x0/0x22c returned 0 after 2929 usecs
enabled ExtINT on CPU#0
No ESR for 82489DX.
Using local APIC timer interrupts.
calibrating APIC timer ...
Using local APIC timer interrupts.
calibrating APIC timer ...
... lapic delta = 0
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at kernel/time/clockevents.c:49 clockevent_delta2ns+0x53/0xb7()
CPU: 0 PID: 1 Comm: swapper Tainted: G        W    3.12.0-rc4-01668-gfd71a04-dirty #229484
 b2879283 b2879283 b010dedc b010dedc b235a604 b235a604 b010df0c b010df0c b1028784 b1028784 b286ec98 b286ec98 00000000 00000000 00000001 00000001

 b2879283 b2879283 00000031 00000031 b10613a9 b10613a9 b10613a9 b10613a9 00000031 00000031 00000000 00000000 b2b5a120 b2b5a120 b010df1c b010df1c

 b102882f b102882f 00000009 00000009 00000000 00000000 b010df34 b010df34 b10613a9 b10613a9 00000000 00000000 00000000 00000000 7fffffff 7fffffff

Call Trace:
 [<b235a604>] dump_stack+0x16/0x18
 [<b1028784>] warn_slowpath_common+0x73/0x89
 [<b10613a9>] ? clockevent_delta2ns+0x53/0xb7
 [<b102882f>] warn_slowpath_null+0x1d/0x1f
 [<b10613a9>] clockevent_delta2ns+0x53/0xb7
 [<b2d0e6fe>] setup_boot_APIC_clock+0x1e4/0x3d1
 [<b2d0ef11>] APIC_init_uniprocessor+0xef/0xfb
 [<b2d05c2a>] kernel_init_freeable+0x5a/0x178
 [<b1047d39>] ? finish_task_switch.constprop.64+0x28/0x9d
 [<b2350e9e>] kernel_init+0xb/0xc3
 [<b23735d7>] ret_from_kernel_thread+0x1b/0x28
 [<b2350e93>] ? rest_init+0xb7/0xb7
---[ end trace a7919e7f17c0a727 ]---
..... delta 0
..... mult: 1
..... calibration result: 0
..... CPU clock speed is 2009.0995 MHz.
..... host bus clock speed is 0.0000 MHz.
APIC frequency too slow, disabling apic timer
devtmpfs: initialized
calling  init_mmap_min_addr+0x0/0x11 @ 1
initcall init_mmap_min_addr+0x0/0x11 returned 0 after 0 usecs
calling  net_ns_init+0x0/0x142 @ 1
initcall net_ns_init+0x0/0x142 returned 0 after 0 usecs
calling  reboot_init+0x0/0x7 @ 1
initcall reboot_init+0x0/0x7 returned 0 after 0 usecs
calling  init_lapic_sysfs+0x0/0x1e @ 1
initcall init_lapic_sysfs+0x0/0x1e returned 0 after 0 usecs
calling  wq_sysfs_init+0x0/0x11 @ 1
initcall wq_sysfs_init+0x0/0x11 returned 0 after 0 usecs
calling  ksysfs_init+0x0/0x7a @ 1
initcall ksysfs_init+0x0/0x7a returned 0 after 0 usecs
calling  pm_init+0x0/0x7a @ 1
initcall pm_init+0x0/0x7a returned 0 after 0 usecs
calling  init_jiffies_clocksource+0x0/0xf @ 1
initcall init_jiffies_clocksource+0x0/0xf returned 0 after 0 usecs
calling  init_wakeup_tracer+0x0/0x1d @ 1
initcall init_wakeup_tracer+0x0/0x1d returned 0 after 0 usecs
calling  event_trace_enable+0x0/0xf1 @ 1
initcall event_trace_enable+0x0/0xf1 returned 0 after 2929 usecs
calling  init_zero_pfn+0x0/0x25 @ 1
initcall init_zero_pfn+0x0/0x25 returned 0 after 0 usecs
calling  memory_failure_init+0x0/0x91 @ 1
initcall memory_failure_init+0x0/0x91 returned 0 after 0 usecs
calling  fsnotify_init+0x0/0x2c @ 1
initcall fsnotify_init+0x0/0x2c returned 0 after 0 usecs
calling  filelock_init+0x0/0x48 @ 1
initcall filelock_init+0x0/0x48 returned 0 after 0 usecs
calling  init_aout_binfmt+0x0/0x13 @ 1
initcall init_aout_binfmt+0x0/0x13 returned 0 after 0 usecs
calling  init_misc_binfmt+0x0/0x28 @ 1
initcall init_misc_binfmt+0x0/0x28 returned 0 after 0 usecs
calling  init_elf_binfmt+0x0/0x13 @ 1
initcall init_elf_binfmt+0x0/0x13 returned 0 after 0 usecs
calling  debugfs_init+0x0/0x4c @ 1
initcall debugfs_init+0x0/0x4c returned 0 after 0 usecs
calling  securityfs_init+0x0/0x43 @ 1
initcall securityfs_init+0x0/0x43 returned 0 after 0 usecs
calling  calibrate_xor_blocks+0x0/0x189 @ 1
xor: measuring software checksum speed
   pIII_sse  :  3716.000 MB/sec
   prefetch64-sse:  4840.000 MB/sec
xor: using function: prefetch64-sse (4840.000 MB/sec)
initcall calibrate_xor_blocks+0x0/0x189 returned 0 after 23437 usecs
calling  prandom_init+0x0/0x85 @ 1
initcall prandom_init+0x0/0x85 returned 0 after 0 usecs
calling  test_atomic64+0x0/0x629 @ 1
atomic64 test passed for i386+ platform with CX8 and with SSE
initcall test_atomic64+0x0/0x629 returned 0 after 976 usecs
calling  sfi_sysfs_init+0x0/0xc3 @ 1
initcall sfi_sysfs_init+0x0/0xc3 returned 0 after 0 usecs
calling  virtio_init+0x0/0x22 @ 1
initcall virtio_init+0x0/0x22 returned 0 after 0 usecs
calling  regulator_init+0x0/0x69 @ 1
regulator-dummy: no parameters
initcall regulator_init+0x0/0x69 returned 0 after 976 usecs
calling  early_resume_init+0x0/0x1c3 @ 1
RTC time: 20:27:31, date: 10/05/13
initcall early_resume_init+0x0/0x1c3 returned 0 after 976 usecs
calling  bsp_pm_check_init+0x0/0x11 @ 1
initcall bsp_pm_check_init+0x0/0x11 returned 0 after 0 usecs
calling  sock_init+0x0/0x89 @ 1
initcall sock_init+0x0/0x89 returned 0 after 0 usecs
calling  netpoll_init+0x0/0x39 @ 1
initcall netpoll_init+0x0/0x39 returned 0 after 0 usecs
calling  netlink_proto_init+0x0/0x197 @ 1
NET: Registered protocol family 16
initcall netlink_proto_init+0x0/0x197 returned 0 after 976 usecs
calling  olpc_init+0x0/0xfc @ 1
initcall olpc_init+0x0/0xfc returned 0 after 0 usecs
calling  bdi_class_init+0x0/0x3c @ 1
initcall bdi_class_init+0x0/0x3c returned 0 after 0 usecs
calling  kobject_uevent_init+0x0/0xf @ 1
initcall kobject_uevent_init+0x0/0xf returned 0 after 976 usecs
calling  gpiolib_sysfs_init+0x0/0x78 @ 1
initcall gpiolib_sysfs_init+0x0/0x78 returned 0 after 0 usecs
calling  pcibus_class_init+0x0/0x14 @ 1
initcall pcibus_class_init+0x0/0x14 returned 0 after 0 usecs
calling  pci_driver_init+0x0/0xf @ 1
initcall pci_driver_init+0x0/0xf returned 0 after 0 usecs
calling  backlight_class_init+0x0/0x4c @ 1
initcall backlight_class_init+0x0/0x4c returned 0 after 0 usecs
calling  video_output_class_init+0x0/0x14 @ 1
initcall video_output_class_init+0x0/0x14 returned 0 after 0 usecs
calling  anatop_regulator_init+0x0/0x11 @ 1
initcall anatop_regulator_init+0x0/0x11 returned 0 after 0 usecs
calling  tty_class_init+0x0/0x2b @ 1
initcall tty_class_init+0x0/0x2b returned 0 after 0 usecs
calling  vtconsole_class_init+0x0/0xc9 @ 1
initcall vtconsole_class_init+0x0/0xc9 returned 0 after 976 usecs
calling  wakeup_sources_debugfs_init+0x0/0x2f @ 1
initcall wakeup_sources_debugfs_init+0x0/0x2f returned 0 after 0 usecs
calling  regmap_initcall+0x0/0xc @ 1
initcall regmap_initcall+0x0/0xc returned 0 after 0 usecs
calling  syscon_init+0x0/0x11 @ 1
initcall syscon_init+0x0/0x11 returned 0 after 0 usecs
calling  hsi_init+0x0/0xf @ 1
initcall hsi_init+0x0/0xf returned 0 after 976 usecs
calling  i2c_init+0x0/0x35 @ 1
initcall i2c_init+0x0/0x35 returned 0 after 0 usecs
calling  arch_kdebugfs_init+0x0/0x2b4 @ 1
initcall arch_kdebugfs_init+0x0/0x2b4 returned 0 after 0 usecs
calling  init_pit_clocksource+0x0/0x15 @ 1
initcall init_pit_clocksource+0x0/0x15 returned 0 after 0 usecs
calling  mtrr_if_init+0x0/0x56 @ 1
initcall mtrr_if_init+0x0/0x56 returned 0 after 0 usecs
calling  kdump_buf_page_init+0x0/0x38 @ 1
initcall kdump_buf_page_init+0x0/0x38 returned 0 after 0 usecs
calling  olpc_ec_init_module+0x0/0x11 @ 1
initcall olpc_ec_init_module+0x0/0x11 returned 0 after 0 usecs
calling  pci_arch_init+0x0/0x55 @ 1
PCI: Using configuration type 1 for base access
initcall pci_arch_init+0x0/0x55 returned 0 after 1953 usecs
calling  topology_init+0x0/0x13 @ 1
Missing cpus node, bailing out
initcall topology_init+0x0/0x13 returned 0 after 976 usecs
calling  mtrr_init_finialize+0x0/0x30 @ 1
initcall mtrr_init_finialize+0x0/0x30 returned 0 after 0 usecs
calling  param_sysfs_init+0x0/0x2be @ 1
initcall param_sysfs_init+0x0/0x2be returned 0 after 37109 usecs
calling  pm_sysrq_init+0x0/0x16 @ 1
initcall pm_sysrq_init+0x0/0x16 returned 0 after 0 usecs
calling  default_bdi_init+0x0/0x78 @ 1
initcall default_bdi_init+0x0/0x78 returned 0 after 0 usecs
calling  init_bio+0x0/0xee @ 1
bio: create slab <bio-0> at 0
initcall init_bio+0x0/0xee returned 0 after 976 usecs
calling  fsnotify_notification_init+0x0/0x9c @ 1
initcall fsnotify_notification_init+0x0/0x9c returned 0 after 0 usecs
calling  cryptomgr_init+0x0/0xf @ 1
initcall cryptomgr_init+0x0/0xf returned 0 after 0 usecs
calling  cryptd_init+0x0/0x80 @ 1
initcall cryptd_init+0x0/0x80 returned 0 after 0 usecs
calling  blk_settings_init+0x0/0x1d @ 1
initcall blk_settings_init+0x0/0x1d returned 0 after 0 usecs
calling  blk_ioc_init+0x0/0x2f @ 1
initcall blk_ioc_init+0x0/0x2f returned 0 after 0 usecs
calling  blk_softirq_init+0x0/0x2a @ 1
initcall blk_softirq_init+0x0/0x2a returned 0 after 0 usecs
calling  blk_iopoll_setup+0x0/0x2a @ 1
initcall blk_iopoll_setup+0x0/0x2a returned 0 after 0 usecs
calling  genhd_device_init+0x0/0x6a @ 1
initcall genhd_device_init+0x0/0x6a returned 0 after 0 usecs
calling  blk_dev_integrity_init+0x0/0x2f @ 1
initcall blk_dev_integrity_init+0x0/0x2f returned 0 after 0 usecs
calling  raid6_select_algo+0x0/0x1e9 @ 1
raid6: mmxx1     1648 MB/s
raid6: mmxx2     2992 MB/s
raid6: sse1x1     464 MB/s
raid6: sse1x2     757 MB/s
raid6: sse2x1     777 MB/s
raid6: sse2x2    1312 MB/s
raid6: int32x1    347 MB/s
raid6: int32x2    546 MB/s
raid6: int32x4    718 MB/s
raid6: int32x8    714 MB/s
raid6: using algorithm mmxx2 (2992 MB/s)
raid6: using intx1 recovery algorithm
initcall raid6_select_algo+0x0/0x1e9 returned 0 after 177734 usecs
calling  gpiolib_debugfs_init+0x0/0x2a @ 1
initcall gpiolib_debugfs_init+0x0/0x2a returned 0 after 0 usecs
calling  max7300_init+0x0/0x11 @ 1
initcall max7300_init+0x0/0x11 returned 0 after 0 usecs
calling  max732x_init+0x0/0x11 @ 1
initcall max732x_init+0x0/0x11 returned 0 after 0 usecs
calling  mcp23s08_init+0x0/0x11 @ 1
initcall mcp23s08_init+0x0/0x11 returned 0 after 0 usecs
calling  pca953x_init+0x0/0x11 @ 1
initcall pca953x_init+0x0/0x11 returned 0 after 0 usecs
calling  pcf857x_init+0x0/0x11 @ 1
initcall pcf857x_init+0x0/0x11 returned 0 after 0 usecs
calling  sx150x_init+0x0/0x11 @ 1
initcall sx150x_init+0x0/0x11 returned 0 after 0 usecs
calling  tc3589x_gpio_init+0x0/0x11 @ 1
initcall tc3589x_gpio_init+0x0/0x11 returned 0 after 0 usecs
calling  gpio_twl4030_init+0x0/0x11 @ 1
initcall gpio_twl4030_init+0x0/0x11 returned 0 after 0 usecs
calling  wm8350_gpio_init+0x0/0x11 @ 1
initcall wm8350_gpio_init+0x0/0x11 returned 0 after 0 usecs
calling  pwm_debugfs_init+0x0/0x2a @ 1
initcall pwm_debugfs_init+0x0/0x2a returned 0 after 0 usecs
calling  pwm_sysfs_init+0x0/0x14 @ 1
initcall pwm_sysfs_init+0x0/0x14 returned 0 after 0 usecs
calling  pci_slot_init+0x0/0x3d @ 1
initcall pci_slot_init+0x0/0x3d returned 0 after 0 usecs
calling  fbmem_init+0x0/0x96 @ 1
initcall fbmem_init+0x0/0x96 returned 0 after 0 usecs
calling  pnp_init+0x0/0xf @ 1
initcall pnp_init+0x0/0xf returned 0 after 0 usecs
calling  regulator_fixed_voltage_init+0x0/0x11 @ 1
initcall regulator_fixed_voltage_init+0x0/0x11 returned 0 after 0 usecs
calling  pm8607_regulator_init+0x0/0x11 @ 1
initcall pm8607_regulator_init+0x0/0x11 returned 0 after 0 usecs
calling  ad5398_init+0x0/0x11 @ 1
initcall ad5398_init+0x0/0x11 returned 0 after 0 usecs
calling  as3711_regulator_init+0x0/0x11 @ 1
initcall as3711_regulator_init+0x0/0x11 returned 0 after 0 usecs
calling  da903x_regulator_init+0x0/0x11 @ 1
initcall da903x_regulator_init+0x0/0x11 returned 0 after 0 usecs
calling  gpio_regulator_init+0x0/0x11 @ 1
initcall gpio_regulator_init+0x0/0x11 returned 0 after 0 usecs
calling  isl6271a_init+0x0/0x11 @ 1
initcall isl6271a_init+0x0/0x11 returned 0 after 0 usecs
calling  lp3972_module_init+0x0/0x11 @ 1
initcall lp3972_module_init+0x0/0x11 returned 0 after 0 usecs
calling  lp8755_init+0x0/0x11 @ 1
initcall lp8755_init+0x0/0x11 returned 0 after 0 usecs
calling  max1586_pmic_init+0x0/0x11 @ 1
initcall max1586_pmic_init+0x0/0x11 returned 0 after 0 usecs
calling  max8649_init+0x0/0x11 @ 1
initcall max8649_init+0x0/0x11 returned 0 after 0 usecs
calling  max8660_init+0x0/0x11 @ 1
initcall max8660_init+0x0/0x11 returned 0 after 0 usecs
calling  max8998_pmic_init+0x0/0x11 @ 1
initcall max8998_pmic_init+0x0/0x11 returned 0 after 0 usecs
calling  tps51632_init+0x0/0x11 @ 1
initcall tps51632_init+0x0/0x11 returned 0 after 0 usecs
calling  pcf50633_regulator_init+0x0/0x11 @ 1
initcall pcf50633_regulator_init+0x0/0x11 returned 0 after 0 usecs
calling  tps6105x_regulator_init+0x0/0x11 @ 1
initcall tps6105x_regulator_init+0x0/0x11 returned 0 after 0 usecs
calling  tps62360_init+0x0/0x11 @ 1
initcall tps62360_init+0x0/0x11 returned 0 after 0 usecs
calling  tps_65023_init+0x0/0x11 @ 1
initcall tps_65023_init+0x0/0x11 returned 0 after 0 usecs
calling  tps65090_regulator_init+0x0/0x11 @ 1
initcall tps65090_regulator_init+0x0/0x11 returned 0 after 0 usecs
calling  tps65217_regulator_init+0x0/0x11 @ 1
initcall tps65217_regulator_init+0x0/0x11 returned 0 after 0 usecs
calling  tps65910_init+0x0/0x11 @ 1
initcall tps65910_init+0x0/0x11 returned 0 after 0 usecs
calling  twlreg_init+0x0/0x11 @ 1
initcall twlreg_init+0x0/0x11 returned 0 after 0 usecs
calling  wm831x_dcdc_init+0x0/0x8a @ 1
initcall wm831x_dcdc_init+0x0/0x8a returned 0 after 0 usecs
calling  wm831x_isink_init+0x0/0x30 @ 1
initcall wm831x_isink_init+0x0/0x30 returned 0 after 0 usecs
calling  wm831x_ldo_init+0x0/0x6a @ 1
initcall wm831x_ldo_init+0x0/0x6a returned 0 after 0 usecs
calling  wm8350_regulator_init+0x0/0x11 @ 1
initcall wm8350_regulator_init+0x0/0x11 returned 0 after 0 usecs
calling  misc_init+0x0/0xad @ 1
initcall misc_init+0x0/0xad returned 0 after 0 usecs
calling  tifm_init+0x0/0x8c @ 1
initcall tifm_init+0x0/0x8c returned 0 after 0 usecs
calling  pm860x_i2c_init+0x0/0x30 @ 1
initcall pm860x_i2c_init+0x0/0x30 returned 0 after 0 usecs
calling  pm800_i2c_init+0x0/0x11 @ 1
initcall pm800_i2c_init+0x0/0x11 returned 0 after 0 usecs
calling  tc3589x_init+0x0/0x11 @ 1
initcall tc3589x_init+0x0/0x11 returned 0 after 0 usecs
calling  wm8400_module_init+0x0/0x30 @ 1
initcall wm8400_module_init+0x0/0x30 returned 0 after 0 usecs
calling  wm831x_i2c_init+0x0/0x30 @ 1
initcall wm831x_i2c_init+0x0/0x30 returned 0 after 0 usecs
calling  wm8350_i2c_init+0x0/0x11 @ 1
initcall wm8350_i2c_init+0x0/0x11 returned 0 after 0 usecs
calling  tps6105x_init+0x0/0x11 @ 1
initcall tps6105x_init+0x0/0x11 returned 0 after 0 usecs
calling  tps6507x_i2c_init+0x0/0x11 @ 1
initcall tps6507x_i2c_init+0x0/0x11 returned 0 after 0 usecs
calling  tps65217_init+0x0/0x11 @ 1
initcall tps65217_init+0x0/0x11 returned 0 after 0 usecs
calling  tps65910_i2c_init+0x0/0x11 @ 1
initcall tps65910_i2c_init+0x0/0x11 returned 0 after 0 usecs
calling  da903x_init+0x0/0x11 @ 1
initcall da903x_init+0x0/0x11 returned 0 after 0 usecs
calling  lp8788_init+0x0/0x11 @ 1
initcall lp8788_init+0x0/0x11 returned 0 after 0 usecs
calling  max77693_i2c_init+0x0/0x11 @ 1
initcall max77693_i2c_init+0x0/0x11 returned 0 after 0 usecs
calling  max8998_i2c_init+0x0/0x11 @ 1
initcall max8998_i2c_init+0x0/0x11 returned 0 after 0 usecs
calling  pcf50633_init+0x0/0x11 @ 1
initcall pcf50633_init+0x0/0x11 returned 0 after 0 usecs
calling  tps65090_init+0x0/0x11 @ 1
initcall tps65090_init+0x0/0x11 returned 0 after 0 usecs
calling  lm3533_i2c_init+0x0/0x11 @ 1
initcall lm3533_i2c_init+0x0/0x11 returned 0 after 0 usecs
calling  as3711_i2c_init+0x0/0x11 @ 1
initcall as3711_i2c_init+0x0/0x11 returned 0 after 0 usecs
calling  init_scsi+0x0/0x80 @ 1
SCSI subsystem initialized
initcall init_scsi+0x0/0x80 returned 0 after 976 usecs
calling  ata_init+0x0/0x2a1 @ 1
libata version 3.00 loaded.
initcall ata_init+0x0/0x2a1 returned 0 after 976 usecs
calling  phy_init+0x0/0x28 @ 1
initcall phy_init+0x0/0x28 returned 0 after 0 usecs
calling  init_pcmcia_cs+0x0/0x32 @ 1
initcall init_pcmcia_cs+0x0/0x32 returned 0 after 0 usecs
calling  usb_init+0x0/0x146 @ 1
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
initcall usb_init+0x0/0x146 returned 0 after 3906 usecs
calling  usb_phy_gen_xceiv_init+0x0/0x11 @ 1
initcall usb_phy_gen_xceiv_init+0x0/0x11 returned 0 after 0 usecs
calling  usb_udc_init+0x0/0x45 @ 1
initcall usb_udc_init+0x0/0x45 returned 0 after 0 usecs
calling  serio_init+0x0/0x2e @ 1
initcall serio_init+0x0/0x2e returned 0 after 0 usecs
calling  gameport_init+0x0/0x2e @ 1
initcall gameport_init+0x0/0x2e returned 0 after 0 usecs
calling  input_init+0x0/0xf9 @ 1
initcall input_init+0x0/0xf9 returned 0 after 0 usecs
calling  tca6416_keypad_init+0x0/0x11 @ 1
initcall tca6416_keypad_init+0x0/0x11 returned 0 after 0 usecs
calling  tca8418_keypad_init+0x0/0x11 @ 1
initcall tca8418_keypad_init+0x0/0x11 returned 0 after 0 usecs
calling  rtc_init+0x0/0x44 @ 1
initcall rtc_init+0x0/0x44 returned 0 after 0 usecs
calling  i2c_gpio_init+0x0/0x30 @ 1
initcall i2c_gpio_init+0x0/0x30 returned 0 after 0 usecs
calling  videodev_init+0x0/0x7f @ 1
Linux video capture interface: v2.00
initcall videodev_init+0x0/0x7f returned 0 after 976 usecs
calling  init_dvbdev+0x0/0xc2 @ 1
initcall init_dvbdev+0x0/0xc2 returned 0 after 0 usecs
calling  pps_init+0x0/0xa6 @ 1
pps_core: LinuxPPS API ver. 1 registered
pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
initcall pps_init+0x0/0xa6 returned 0 after 1953 usecs
calling  ptp_init+0x0/0x8c @ 1
PTP clock support registered
initcall ptp_init+0x0/0x8c returned 0 after 976 usecs
calling  power_supply_class_init+0x0/0x35 @ 1
initcall power_supply_class_init+0x0/0x35 returned 0 after 0 usecs
calling  hwmon_init+0x0/0xea @ 1
initcall hwmon_init+0x0/0xea returned 0 after 976 usecs
calling  mmc_init+0x0/0x80 @ 1
initcall mmc_init+0x0/0x80 returned 0 after 0 usecs
calling  leds_init+0x0/0x32 @ 1
initcall leds_init+0x0/0x32 returned 0 after 0 usecs
calling  iio_init+0x0/0x85 @ 1
initcall iio_init+0x0/0x85 returned 0 after 0 usecs
calling  pci_subsys_init+0x0/0x44 @ 1
PCI: Probing PCI hardware
PCI: root bus 00: using default resources
PCI: Probing PCI hardware (bus 00)
PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
pci_bus 0000:00: root bus resource [mem 0x00000000-0xffffffff]
pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
pci 0000:00:00.0: [10de:005e] type 00 class 0x058000
pci 0000:00:01.0: [10de:0050] type 00 class 0x060100
pci 0000:00:01.1: [10de:0052] type 00 class 0x0c0500
pci 0000:00:01.1: reg 0x10: [io  0xe400-0xe41f]
pci 0000:00:01.1: reg 0x20: [io  0x4c00-0x4c3f]
pci 0000:00:01.1: reg 0x24: [io  0x4c40-0x4c7f]
pci 0000:00:01.1: PME# supported from D3hot D3cold
pci 0000:00:02.0: [10de:005a] type 00 class 0x0c0310
pci 0000:00:02.0: reg 0x10: [mem 0xda004000-0xda004fff]
pci 0000:00:02.0: supports D1 D2
pci 0000:00:02.0: PME# supported from D0 D1 D2 D3hot D3cold
pci 0000:00:02.1: [10de:005b] type 00 class 0x0c0320
pci 0000:00:02.1: reg 0x10: [mem 0xfeb00000-0xfeb000ff]
pci 0000:00:02.1: supports D1 D2
pci 0000:00:02.1: PME# supported from D0 D1 D2 D3hot D3cold
pci 0000:00:04.0: [10de:0059] type 00 class 0x040100
pci 0000:00:04.0: reg 0x10: [io  0xdc00-0xdcff]
pci 0000:00:04.0: reg 0x14: [io  0xe000-0xe0ff]
pci 0000:00:04.0: reg 0x18: [mem 0xda003000-0xda003fff]
pci 0000:00:04.0: supports D1 D2
pci 0000:00:06.0: [10de:0053] type 00 class 0x01018a
pci 0000:00:06.0: reg 0x20: [io  0xf000-0xf00f]
pci 0000:00:07.0: [10de:0054] type 00 class 0x010185
pci 0000:00:07.0: reg 0x10: [io  0x09f0-0x09f7]
pci 0000:00:07.0: reg 0x14: [io  0x0bf0-0x0bf3]
pci 0000:00:07.0: reg 0x18: [io  0x0970-0x0977]
pci 0000:00:07.0: reg 0x1c: [io  0x0b70-0x0b73]
pci 0000:00:07.0: reg 0x20: [io  0xd800-0xd80f]
pci 0000:00:07.0: reg 0x24: [mem 0xda002000-0xda002fff]
pci 0000:00:08.0: [10de:0055] type 00 class 0x010185
pci 0000:00:08.0: reg 0x10: [io  0x09e0-0x09e7]
pci 0000:00:08.0: reg 0x14: [io  0x0be0-0x0be3]
pci 0000:00:08.0: reg 0x18: [io  0x0960-0x0967]
pci 0000:00:08.0: reg 0x1c: [io  0x0b60-0x0b63]
pci 0000:00:08.0: reg 0x20: [io  0xc400-0xc40f]
pci 0000:00:08.0: reg 0x24: [mem 0xda001000-0xda001fff]
pci 0000:00:09.0: [10de:005c] type 01 class 0x060401
pci 0000:00:0a.0: [10de:0057] type 00 class 0x068000
pci 0000:00:0a.0: reg 0x10: [mem 0xda000000-0xda000fff]
pci 0000:00:0a.0: reg 0x14: [io  0xb000-0xb007]
pci 0000:00:0a.0: supports D1 D2
pci 0000:00:0a.0: PME# supported from D0 D1 D2 D3hot D3cold
pci 0000:00:0b.0: [10de:005d] type 01 class 0x060400
pci 0000:00:0b.0: PME# supported from D0 D1 D2 D3hot D3cold
pci 0000:00:0c.0: [10de:005d] type 01 class 0x060400
pci 0000:00:0c.0: PME# supported from D0 D1 D2 D3hot D3cold
pci 0000:00:0d.0: [10de:005d] type 01 class 0x060400
pci 0000:00:0d.0: PME# supported from D0 D1 D2 D3hot D3cold
pci 0000:00:0e.0: [10de:005d] type 01 class 0x060400
pci 0000:00:0e.0: PME# supported from D0 D1 D2 D3hot D3cold
pci 0000:00:18.0: [1022:1100] type 00 class 0x060000
pci 0000:00:18.1: [1022:1101] type 00 class 0x060000
pci 0000:00:18.2: [1022:1102] type 00 class 0x060000
pci 0000:00:18.3: [1022:1103] type 00 class 0x060000
pci 0000:00:09.0: PCI bridge to [bus 05] (subtractive decode)
pci 0000:00:09.0:   bridge window [io  0x0000-0xffff] (subtractive decode)
pci 0000:00:09.0:   bridge window [mem 0x00000000-0xffffffff] (subtractive decode)
pci 0000:00:0b.0: PCI bridge to [bus 04]
pci 0000:00:0c.0: PCI bridge to [bus 03]
pci 0000:00:0d.0: PCI bridge to [bus 02]
pci 0000:01:00.0: [1002:5b60] type 00 class 0x030000
pci 0000:01:00.0: reg 0x10: [mem 0xd0000000-0xd7ffffff pref]
pci 0000:01:00.0: reg 0x14: [io  0xa000-0xa0ff]
pci 0000:01:00.0: reg 0x18: [mem 0xd9000000-0xd900ffff]
pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
pci 0000:01:00.0: supports D1 D2
pci 0000:01:00.1: [1002:5b70] type 00 class 0x038000
pci 0000:01:00.1: reg 0x10: [mem 0xd9010000-0xd901ffff]
pci 0000:01:00.1: supports D1 D2
pci 0000:00:0e.0: PCI bridge to [bus 01]
pci 0000:00:0e.0:   bridge window [io  0xa000-0xafff]
pci 0000:00:0e.0:   bridge window [mem 0xd8000000-0xd9ffffff]
pci 0000:00:0e.0:   bridge window [mem 0xd0000000-0xd7ffffff 64bit pref]
pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 05
pci 0000:00:00.0: default IRQ router [10de:005e]
PCI: pci_cache_line_size set to 64 bytes
e820: reserve RAM buffer [mem 0x0009f800-0x0009ffff]
e820: reserve RAM buffer [mem 0x3fff0000-0x3fffffff]
initcall pci_subsys_init+0x0/0x44 returned 0 after 85937 usecs
calling  proto_init+0x0/0xf @ 1
initcall proto_init+0x0/0xf returned 0 after 0 usecs
calling  net_dev_init+0x0/0x197 @ 1
initcall net_dev_init+0x0/0x197 returned 0 after 0 usecs
calling  neigh_init+0x0/0xa4 @ 1
initcall neigh_init+0x0/0xa4 returned 0 after 0 usecs
calling  fib_rules_init+0x0/0xbd @ 1
initcall fib_rules_init+0x0/0xbd returned 0 after 0 usecs
calling  genl_init+0x0/0x76 @ 1
initcall genl_init+0x0/0x76 returned 0 after 0 usecs
calling  cipso_v4_init+0x0/0x76 @ 1
initcall cipso_v4_init+0x0/0x76 returned 0 after 0 usecs
calling  irda_init+0x0/0x88 @ 1
NET: Registered protocol family 23
initcall irda_init+0x0/0x88 returned 0 after 976 usecs
calling  bt_init+0x0/0x89 @ 1
Bluetooth: Core ver 2.16
NET: Registered protocol family 31
Bluetooth: HCI device and connection manager initialized
Bluetooth: HCI socket layer initialized
Bluetooth: L2CAP socket layer initialized
Bluetooth: SCO socket layer initialized
initcall bt_init+0x0/0x89 returned 0 after 5859 usecs
calling  atm_init+0x0/0xd0 @ 1
NET: Registered protocol family 8
NET: Registered protocol family 20
initcall atm_init+0x0/0xd0 returned 0 after 1953 usecs
calling  cfg80211_init+0x0/0xd7 @ 1
cfg80211: Calling CRDA to update world regulatory domain
initcall cfg80211_init+0x0/0xd7 returned 0 after 1953 usecs
calling  wireless_nlevent_init+0x0/0xf @ 1
initcall wireless_nlevent_init+0x0/0xf returned 0 after 0 usecs
calling  ieee80211_init+0x0/0xa @ 1
initcall ieee80211_init+0x0/0xa returned 0 after 0 usecs
calling  netlbl_init+0x0/0x7d @ 1
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
initcall netlbl_init+0x0/0x7d returned 0 after 3906 usecs
calling  wpan_phy_class_init+0x0/0x33 @ 1
initcall wpan_phy_class_init+0x0/0x33 returned 0 after 0 usecs
calling  nfc_init+0x0/0x8f @ 1
nfc: nfc_init: NFC Core ver 0.1
NET: Registered protocol family 39
initcall nfc_init+0x0/0x8f returned 0 after 1953 usecs
calling  nfc_hci_init+0x0/0xa @ 1
initcall nfc_hci_init+0x0/0xa returned 0 after 0 usecs
calling  nmi_warning_debugfs+0x0/0x24 @ 1
initcall nmi_warning_debugfs+0x0/0x24 returned 0 after 0 usecs
calling  clocksource_done_booting+0x0/0x3b @ 1
Switched to clocksource pit
initcall clocksource_done_booting+0x0/0x3b returned 0 after 1025 usecs
calling  tracer_init_debugfs+0x0/0x158 @ 1
initcall tracer_init_debugfs+0x0/0x158 returned 0 after 608 usecs
calling  init_trace_printk_function_export+0x0/0x33 @ 1
initcall init_trace_printk_function_export+0x0/0x33 returned 0 after 14 usecs
calling  event_trace_init+0x0/0x1bb @ 1
initcall event_trace_init+0x0/0x1bb returned 0 after 98030 usecs
calling  init_uprobe_trace+0x0/0x59 @ 1
initcall init_uprobe_trace+0x0/0x59 returned 0 after 28 usecs
calling  init_pipe_fs+0x0/0x3d @ 1
initcall init_pipe_fs+0x0/0x3d returned 0 after 103 usecs
calling  eventpoll_init+0x0/0x10d @ 1
initcall eventpoll_init+0x0/0x10d returned 0 after 42 usecs
calling  anon_inode_init+0x0/0x52 @ 1
initcall anon_inode_init+0x0/0x52 returned 0 after 62 usecs
calling  fscache_init+0x0/0x193 @ 1
FS-Cache: Loaded
initcall fscache_init+0x0/0x193 returned 0 after 1112 usecs
calling  cachefiles_init+0x0/0x99 @ 1
CacheFiles: Loaded
initcall cachefiles_init+0x0/0x99 returned 0 after 1510 usecs
calling  tomoyo_initerface_init+0x0/0x17c @ 1
initcall tomoyo_initerface_init+0x0/0x17c returned 0 after 200 usecs
calling  aa_create_aafs+0x0/0xa7 @ 1
initcall aa_create_aafs+0x0/0xa7 returned 0 after 4 usecs
calling  blk_scsi_ioctl_init+0x0/0x288 @ 1
initcall blk_scsi_ioctl_init+0x0/0x288 returned 0 after 4 usecs
calling  dynamic_debug_init_debugfs+0x0/0x6a @ 1
initcall dynamic_debug_init_debugfs+0x0/0x6a returned 0 after 26 usecs
calling  pnp_system_init+0x0/0xf @ 1
initcall pnp_system_init+0x0/0xf returned 0 after 39 usecs
calling  pnpbios_init+0x0/0x337 @ 1
PnPBIOS: Scanning system for PnP BIOS support...
PnPBIOS: Found PnP BIOS installation structure at 0xb00fc550
PnPBIOS: PnP BIOS version 1.0, entry 0xf0000:0xc580, dseg 0xf0000
pnp 00:00: [irq 2]
pnp 00:00: [io  0x0020-0x0021]
pnp 00:00: [io  0x00a0-0x00a1]
pnp 00:00: Plug and Play BIOS device, IDs PNP0000 (active)
pnp 00:01: [dma 4]
pnp 00:01: [io  0x0000-0x000f]
pnp 00:01: [io  0x0081-0x0083]
pnp 00:01: [io  0x0087]
pnp 00:01: [io  0x0089-0x008b]
pnp 00:01: [io  0x008f-0x0091]
pnp 00:01: [io  0x00c0-0x00df]
pnp 00:01: Plug and Play BIOS device, IDs PNP0200 (active)
pnp 00:02: [irq 0]
pnp 00:02: [io  0x0040-0x0043]
pnp 00:02: Plug and Play BIOS device, IDs PNP0100 (active)
pnp 00:03: [irq 8]
pnp 00:03: [io  0x0070-0x0071]
pnp 00:03: Plug and Play BIOS device, IDs PNP0b00 (active)
pnp 00:04: [irq 1]
pnp 00:04: [io  0x0060]
pnp 00:04: [io  0x0064]
pnp 00:04: Plug and Play BIOS device, IDs PNP0303 (active)
pnp 00:05: [io  0x0061]
pnp 00:05: Plug and Play BIOS device, IDs PNP0800 (active)
pnp 00:06: [irq 13]
pnp 00:06: [io  0x00f0-0x00ff]
pnp 00:06: Plug and Play BIOS device, IDs PNP0c04 (active)
pnp 00:07: [mem 0x00000000-0x0009ffff]
pnp 00:07: [mem 0xfffe0000-0xffffffff]
pnp 00:07: [mem 0xfec00000-0xfec0ffff]
pnp 00:07: [mem 0xfee00000-0xfeefffff]
pnp 00:07: [mem 0xfefffc00-0xfeffffff]
pnp 00:07: [mem 0x00100000-0x00ffffff]
system 00:07: [mem 0x00000000-0x0009ffff] could not be reserved
system 00:07: [mem 0xfffe0000-0xffffffff] has been reserved
system 00:07: [mem 0xfec00000-0xfec0ffff] has been reserved
system 00:07: [mem 0xfee00000-0xfeefffff] has been reserved
system 00:07: [mem 0xfefffc00-0xfeffffff] has been reserved
system 00:07: [mem 0x00100000-0x00ffffff] could not be reserved
system 00:07: Plug and Play BIOS device, IDs PNP0c01 (active)
pnp 00:08: [mem 0x000f0000-0x000f3fff]
pnp 00:08: [mem 0x000f4000-0x000f7fff]
pnp 00:08: [mem 0x000f8000-0x000fbfff]
pnp 00:08: [mem 0x000fc000-0x000fffff]
system 00:08: [mem 0x000f0000-0x000f3fff] could not be reserved
system 00:08: [mem 0x000f4000-0x000f7fff] could not be reserved
system 00:08: [mem 0x000f8000-0x000fbfff] could not be reserved
system 00:08: [mem 0x000fc000-0x000fffff] could not be reserved
system 00:08: Plug and Play BIOS device, IDs PNP0c02 (active)
pnp 00:09: [io  0x0290-0x029f]
pnp 00:09: [io  0x04d0-0x04d1]
pnp 00:09: [io  0x0cf8-0x0cff]
pnp 00:09: Plug and Play BIOS device, IDs PNP0a03 (active)
pnp 00:0b: [irq 4]
pnp 00:0b: [io  0x03f8-0x03ff]
pnp 00:0b: Plug and Play BIOS device, IDs PNP0501 (active)
pnp 00:0c: [dma 2]
pnp 00:0c: [io  0x03f0-0x03f5]
pnp 00:0c: [io  0x03f7]
pnp 00:0c: [irq 6]
pnp 00:0c: Plug and Play BIOS device, IDs PNP0700 (active)
pnp 00:0e: [dma 3]
pnp 00:0e: [irq 7]
pnp 00:0e: [io  0x0378-0x037f]
pnp 00:0e: [io  0x0778-0x077f]
pnp 00:0e: Plug and Play BIOS device, IDs PNP0401 (active)
pnp 00:0f: [irq 10]
pnp 00:0f: [io  0x0330-0x0333]
pnp 00:0f: Plug and Play BIOS device, IDs PNPb006 (active)
pnp 00:10: [io  0x0201]
pnp 00:10: Plug and Play BIOS device, IDs PNPb02f (active)
PnPBIOS: 15 nodes reported by PnP BIOS; 15 recorded by driver
initcall pnpbios_init+0x0/0x337 returned 0 after 76635 usecs
calling  chr_dev_init+0x0/0xc3 @ 1
initcall chr_dev_init+0x0/0xc3 returned 0 after 5687 usecs
calling  firmware_class_init+0x0/0x109 @ 1
initcall firmware_class_init+0x0/0x109 returned 0 after 29 usecs
calling  init_pcmcia_bus+0x0/0x5e @ 1
initcall init_pcmcia_bus+0x0/0x5e returned 0 after 55 usecs
calling  thermal_init+0x0/0xb5 @ 1
initcall thermal_init+0x0/0xb5 returned 0 after 85 usecs
calling  ssb_modinit+0x0/0x3f @ 1
initcall ssb_modinit+0x0/0x3f returned 0 after 39 usecs
calling  pcibios_assign_resources+0x0/0x8f @ 1
pci 0000:00:09.0: PCI bridge to [bus 05]
pci 0000:00:0b.0: PCI bridge to [bus 04]
pci 0000:00:0c.0: PCI bridge to [bus 03]
pci 0000:00:0d.0: PCI bridge to [bus 02]
pci 0000:01:00.0: BAR 6: assigned [mem 0xd8000000-0xd801ffff pref]
pci 0000:00:0e.0: PCI bridge to [bus 01]
pci 0000:00:0e.0:   bridge window [io  0xa000-0xafff]
pci 0000:00:0e.0:   bridge window [mem 0xd8000000-0xd9ffffff]
pci 0000:00:0e.0:   bridge window [mem 0xd0000000-0xd7ffffff 64bit pref]
pci_bus 0000:00: resource 4 [io  0x0000-0xffff]
pci_bus 0000:00: resource 5 [mem 0x00000000-0xffffffff]
pci_bus 0000:05: resource 4 [io  0x0000-0xffff]
pci_bus 0000:05: resource 5 [mem 0x00000000-0xffffffff]
pci_bus 0000:01: resource 0 [io  0xa000-0xafff]
pci_bus 0000:01: resource 1 [mem 0xd8000000-0xd9ffffff]
pci_bus 0000:01: resource 2 [mem 0xd0000000-0xd7ffffff 64bit pref]
initcall pcibios_assign_resources+0x0/0x8f returned 0 after 16287 usecs
calling  sysctl_core_init+0x0/0x23 @ 1
initcall sysctl_core_init+0x0/0x23 returned 0 after 36 usecs
calling  inet_init+0x0/0x26f @ 1
NET: Registered protocol family 2
TCP established hash table entries: 8192 (order: 4, 65536 bytes)
TCP bind hash table entries: 8192 (order: 6, 294912 bytes)
TCP: Hash tables configured (established 8192 bind 8192)
TCP: reno registered
UDP hash table entries: 512 (order: 3, 40960 bytes)
UDP-Lite hash table entries: 512 (order: 3, 40960 bytes)
initcall inet_init+0x0/0x26f returned 0 after 8885 usecs
calling  ipv4_offload_init+0x0/0x4e @ 1
initcall ipv4_offload_init+0x0/0x4e returned 0 after 11 usecs
calling  af_unix_init+0x0/0x4d @ 1
NET: Registered protocol family 1
initcall af_unix_init+0x0/0x4d returned 0 after 1086 usecs
calling  ipv6_offload_init+0x0/0x6b @ 1
initcall ipv6_offload_init+0x0/0x6b returned 0 after 4 usecs
calling  init_sunrpc+0x0/0x64 @ 1
RPC: Registered named UNIX socket transport module.
RPC: Registered udp transport module.
RPC: Registered tcp transport module.
RPC: Registered tcp NFSv4.1 backchannel transport module.
initcall init_sunrpc+0x0/0x64 returned 0 after 4243 usecs
calling  pci_apply_final_quirks+0x0/0x10f @ 1
pci 0000:00:00.0: Found enabled HT MSI Mapping
pci 0000:00:0b.0: Found disabled HT MSI Mapping
pci 0000:00:00.0: Found enabled HT MSI Mapping
pci 0000:00:0c.0: Found disabled HT MSI Mapping
pci 0000:00:00.0: Found enabled HT MSI Mapping
pci 0000:00:0d.0: Found disabled HT MSI Mapping
pci 0000:00:00.0: Found enabled HT MSI Mapping
pci 0000:00:0e.0: Found disabled HT MSI Mapping
pci 0000:00:00.0: Found enabled HT MSI Mapping
pci 0000:01:00.0: Boot video device
PCI: CLS 32 bytes, default 64
initcall pci_apply_final_quirks+0x0/0x10f returned 0 after 74263 usecs
calling  populate_rootfs+0x0/0x93 @ 1
initcall populate_rootfs+0x0/0x93 returned 0 after 307 usecs
calling  pci_iommu_init+0x0/0x34 @ 1
initcall pci_iommu_init+0x0/0x34 returned 0 after 4 usecs
calling  i8259A_init_ops+0x0/0x20 @ 1
initcall i8259A_init_ops+0x0/0x20 returned 0 after 4 usecs
calling  sbf_init+0x0/0xc9 @ 1
initcall sbf_init+0x0/0xc9 returned 0 after 4 usecs
calling  init_tsc_clocksource+0x0/0xac @ 1
initcall init_tsc_clocksource+0x0/0xac returned 0 after 24 usecs
calling  add_rtc_cmos+0x0/0x84 @ 1
initcall add_rtc_cmos+0x0/0x84 returned 0 after 5 usecs
calling  i8237A_init_ops+0x0/0x11 @ 1
initcall i8237A_init_ops+0x0/0x11 returned 0 after 4 usecs
calling  cache_sysfs_init+0x0/0x1c4 @ 1
initcall cache_sysfs_init+0x0/0x1c4 returned 0 after 4 usecs
calling  thermal_throttle_init_device+0x0/0xb8 @ 1
initcall thermal_throttle_init_device+0x0/0xb8 returned 0 after 4 usecs
calling  amd_ibs_init+0x0/0x1ce @ 1
initcall amd_ibs_init+0x0/0x1ce returned -19 after 4 usecs
calling  cpuid_init+0x0/0xe7 @ 1
initcall cpuid_init+0x0/0xe7 returned 0 after 188 usecs
calling  ioapic_init_ops+0x0/0x11 @ 1
initcall ioapic_init_ops+0x0/0x11 returned 0 after 4 usecs
calling  add_pcspkr+0x0/0x3c @ 1
initcall add_pcspkr+0x0/0x3c returned 0 after 67 usecs
calling  start_periodic_check_for_corruption+0x0/0x54 @ 1
Scanning for low memory corruption every 60 seconds
initcall start_periodic_check_for_corruption+0x0/0x54 returned 0 after 1177 usecs
calling  add_bus_probe+0x0/0x21 @ 1
initcall add_bus_probe+0x0/0x21 returned 0 after 4 usecs
calling  sysfb_init+0x0/0x80 @ 1
initcall sysfb_init+0x0/0x80 returned 0 after 57 usecs
calling  start_pageattr_test+0x0/0x4b @ 1
initcall start_pageattr_test+0x0/0x4b returned 0 after 104 usecs
calling  pt_dump_init+0x0/0x6d @ 1
initcall pt_dump_init+0x0/0x6d returned 0 after 19 usecs
calling  aes_init+0x0/0xf @ 1
initcall aes_init+0x0/0xf returned 0 after 196 usecs
calling  init+0x0/0xf @ 1
initcall init+0x0/0xf returned 0 after 178 usecs
calling  init+0x0/0xf @ 1
initcall init+0x0/0xf returned 0 after 55 usecs
calling  aesni_init+0x0/0x32 @ 1
initcall aesni_init+0x0/0x32 returned -19 after 4 usecs
calling  crc32c_intel_mod_init+0x0/0x24 @ 1
initcall crc32c_intel_mod_init+0x0/0x24 returned -19 after 4 usecs
calling  crc32_pclmul_mod_init+0x0/0x31 @ 1
PCLMULQDQ-NI instructions are not detected.
initcall crc32_pclmul_mod_init+0x0/0x31 returned -19 after 798 usecs
calling  net5501_init+0x0/0x104 @ 1
initcall net5501_init+0x0/0x104 returned 0 after 4 usecs
calling  goldfish_init+0x0/0x3f @ 1
initcall goldfish_init+0x0/0x3f returned 0 after 65 usecs
calling  iris_init+0x0/0xb1 @ 1
The force parameter has not been set to 1. The Iris poweroff handler will not be installed.
initcall iris_init+0x0/0xb1 returned -19 after 1113 usecs
calling  olpc_create_platform_devices+0x0/0x1c @ 1
initcall olpc_create_platform_devices+0x0/0x1c returned 0 after 4 usecs
calling  scx200_init+0x0/0x23 @ 1
NatSemi SCx200 Driver
initcall scx200_init+0x0/0x23 returned 0 after 1028 usecs
calling  proc_execdomains_init+0x0/0x27 @ 1
initcall proc_execdomains_init+0x0/0x27 returned 0 after 14 usecs
calling  ioresources_init+0x0/0x44 @ 1
initcall ioresources_init+0x0/0x44 returned 0 after 19 usecs
calling  uid_cache_init+0x0/0x81 @ 1
initcall uid_cache_init+0x0/0x81 returned 0 after 27 usecs
calling  init_posix_timers+0x0/0x1dd @ 1
initcall init_posix_timers+0x0/0x1dd returned 0 after 16 usecs
calling  init_posix_cpu_timers+0x0/0x9b @ 1
initcall init_posix_cpu_timers+0x0/0x9b returned 0 after 4 usecs
calling  init_sched_debug_procfs+0x0/0x30 @ 1
initcall init_sched_debug_procfs+0x0/0x30 returned 0 after 12 usecs
calling  irq_gc_init_ops+0x0/0x11 @ 1
initcall irq_gc_init_ops+0x0/0x11 returned 0 after 4 usecs
calling  irq_debugfs_init+0x0/0x30 @ 1
initcall irq_debugfs_init+0x0/0x30 returned 0 after 16 usecs
calling  irq_pm_init_ops+0x0/0x11 @ 1
initcall irq_pm_init_ops+0x0/0x11 returned 0 after 17 usecs
calling  timekeeping_init_ops+0x0/0x11 @ 1
initcall timekeeping_init_ops+0x0/0x11 returned 0 after 4 usecs
calling  init_clocksource_sysfs+0x0/0x58 @ 1
initcall init_clocksource_sysfs+0x0/0x58 returned 0 after 108 usecs
calling  init_timer_list_procfs+0x0/0x30 @ 1
initcall init_timer_list_procfs+0x0/0x30 returned 0 after 12 usecs
calling  alarmtimer_init+0x0/0x168 @ 1
initcall alarmtimer_init+0x0/0x168 returned 0 after 109 usecs
calling  clockevents_init_sysfs+0x0/0x7a @ 1
initcall clockevents_init_sysfs+0x0/0x7a returned 0 after 150 usecs
calling  lockdep_proc_init+0x0/0x4a @ 1
initcall lockdep_proc_init+0x0/0x4a returned 0 after 19 usecs
calling  futex_init+0x0/0x61 @ 1
initcall futex_init+0x0/0x61 returned 0 after 17 usecs
calling  init_rttest+0x0/0x139 @ 1
Initializing RT-Tester: OK
initcall init_rttest+0x0/0x139 returned 0 after 1478 usecs
calling  proc_dma_init+0x0/0x27 @ 1
initcall proc_dma_init+0x0/0x27 returned 0 after 12 usecs
calling  kallsyms_init+0x0/0x2a @ 1
initcall kallsyms_init+0x0/0x2a returned 0 after 18 usecs
calling  backtrace_regression_test+0x0/0xdb @ 1
====[ backtrace testing ]===========
Testing a backtrace from process context.
The following trace is a kernel self test and not a bug!
CPU: 0 PID: 1 Comm: swapper Tainted: G        W    3.12.0-rc4-01668-gfd71a04-dirty #229484
 b1069b79 b010debc b235a604 b010def8 b1069ba8 b2871e98 00000282 00d17a94
 00000976 b2c82480 000006aa b010def8 b105c5f4 00000001 00000000 00000000
 b1069b79 000006aa b010df78 b2d05af4 b2cc09a0 b2864daa b1069b79 00000001
Call Trace:
 [<b1069b79>] ? backtrace_test_irq_callback+0x14/0x14
 [<b235a604>] dump_stack+0x16/0x18
 [<b1069ba8>] backtrace_regression_test+0x2f/0xdb
 [<b105c5f4>] ? ktime_get+0x58/0xc6
 [<b1069b79>] ? backtrace_test_irq_callback+0x14/0x14
 [<b2d05af4>] do_one_initcall+0x72/0x14e
 [<b1069b79>] ? backtrace_test_irq_callback+0x14/0x14
 [<b103eedc>] ? parse_args+0x243/0x384
 [<b1038bed>] ? __usermodehelper_set_disable_depth+0x3c/0x42
 [<b2d05ca6>] kernel_init_freeable+0xd6/0x178
 [<b2d05475>] ? loglevel+0x2b/0x2b
 [<b2350e9e>] kernel_init+0xb/0xc3
 [<b23735d7>] ret_from_kernel_thread+0x1b/0x28
 [<b2350e93>] ? rest_init+0xb7/0xb7
Testing a backtrace from irq context.
The following trace is a kernel self test and not a bug!
CPU: 0 PID: 3 Comm: ksoftirqd/0 Tainted: G        W    3.12.0-rc4-01668-gfd71a04-dirty #229484
 00000006 b012feb8 b235a604 b012fec0 b1069b6d b012fecc b102be74 00000040
 b012ff0c b102ba58 b012fedc 00000046 b1049660 b012ff58 00000286 b012ff48
 04208040 fffb712d 0000000a 00000100 00000001 b0055a20 b2b5db40 b01137a0
Call Trace:
 [<b235a604>] dump_stack+0x16/0x18
 [<b1069b6d>] backtrace_test_irq_callback+0x8/0x14
 [<b102be74>] tasklet_action+0x65/0x6a
 [<b102ba58>] __do_softirq+0xa9/0x1e8
 [<b1049660>] ? complete+0x42/0x4a
 [<b102bbae>] run_ksoftirqd+0x17/0x3a
 [<b1046629>] smpboot_thread_fn+0x101/0x117
 [<b1046528>] ? lg_global_unlock+0x29/0x29
 [<b1040519>] kthread+0x8e/0x9a
 [<b23735d7>] ret_from_kernel_thread+0x1b/0x28
 [<b104048b>] ? __kthread_unpark+0x29/0x29
Testing a saved backtrace.
The following trace is a kernel self test and not a bug!
  [<b100a677>] save_stack_trace+0x2a/0x44
  [<b1069c37>] backtrace_regression_test+0xbe/0xdb
  [<b2d05af4>] do_one_initcall+0x72/0x14e
  [<b2d05ca6>] kernel_init_freeable+0xd6/0x178
  [<b2350e9e>] kernel_init+0xb/0xc3
  [<b23735d7>] ret_from_kernel_thread+0x1b/0x28
  [<ffffffff>] 0xffffffff
====[ end of backtrace testing ]====
initcall backtrace_regression_test+0x0/0xdb returned 0 after 95106 usecs
calling  audit_init+0x0/0x14c @ 1
audit: initializing netlink socket (disabled)
type=2000 audit(1381004849.342:1): initialized
initcall audit_init+0x0/0x14c returned 0 after 2452 usecs
calling  audit_watch_init+0x0/0x31 @ 1
initcall audit_watch_init+0x0/0x31 returned 0 after 13 usecs
calling  audit_tree_init+0x0/0x42 @ 1
initcall audit_tree_init+0x0/0x42 returned 0 after 10 usecs
calling  hung_task_init+0x0/0x56 @ 1
initcall hung_task_init+0x0/0x56 returned 0 after 44 usecs
calling  utsname_sysctl_init+0x0/0x11 @ 1
initcall utsname_sysctl_init+0x0/0x11 returned 0 after 19 usecs
calling  init_mmio_trace+0x0/0xf @ 1
initcall init_mmio_trace+0x0/0xf returned 0 after 5 usecs
calling  init_blk_tracer+0x0/0x54 @ 1
initcall init_blk_tracer+0x0/0x54 returned 0 after 6 usecs
calling  perf_event_sysfs_init+0x0/0x91 @ 1
initcall perf_event_sysfs_init+0x0/0x91 returned 0 after 184 usecs
calling  init_uprobes+0x0/0x4f @ 1
initcall init_uprobes+0x0/0x4f returned 0 after 18 usecs
calling  init_per_zone_wmark_min+0x0/0x9a @ 1
initcall init_per_zone_wmark_min+0x0/0x9a returned 0 after 68 usecs
calling  kswapd_init+0x0/0x13 @ 1
initcall kswapd_init+0x0/0x13 returned 0 after 46 usecs
calling  extfrag_debug_init+0x0/0x7d @ 1
initcall extfrag_debug_init+0x0/0x7d returned 0 after 40 usecs
calling  setup_vmstat+0x0/0x8a @ 1
initcall setup_vmstat+0x0/0x8a returned 0 after 35 usecs
calling  mm_sysfs_init+0x0/0x22 @ 1
initcall mm_sysfs_init+0x0/0x22 returned 0 after 8 usecs
calling  slab_proc_init+0x0/0x2a @ 1
initcall slab_proc_init+0x0/0x2a returned 0 after 11 usecs
calling  init_reserve_notifier+0x0/0x7 @ 1
initcall init_reserve_notifier+0x0/0x7 returned 0 after 4 usecs
calling  init_admin_reserve+0x0/0x25 @ 1
initcall init_admin_reserve+0x0/0x25 returned 0 after 4 usecs
calling  init_user_reserve+0x0/0x25 @ 1
initcall init_user_reserve+0x0/0x25 returned 0 after 4 usecs
calling  proc_vmalloc_init+0x0/0x2a @ 1
initcall proc_vmalloc_init+0x0/0x2a returned 0 after 12 usecs
calling  ksm_init+0x0/0x15b @ 1
initcall ksm_init+0x0/0x15b returned 0 after 114 usecs
calling  slab_proc_init+0x0/0x7 @ 1
initcall slab_proc_init+0x0/0x7 returned 0 after 3 usecs
calling  cpucache_init+0x0/0xcc @ 1
initcall cpucache_init+0x0/0xcc returned 0 after 4 usecs
calling  hugepage_init+0x0/0x125 @ 1
initcall hugepage_init+0x0/0x125 returned 0 after 146 usecs
calling  pfn_inject_init+0x0/0x131 @ 1
initcall pfn_inject_init+0x0/0x131 returned 0 after 95 usecs
calling  fcntl_init+0x0/0x2f @ 1
initcall fcntl_init+0x0/0x2f returned 0 after 15 usecs
calling  proc_filesystems_init+0x0/0x27 @ 1
initcall proc_filesystems_init+0x0/0x27 returned 0 after 12 usecs
calling  dio_init+0x0/0x32 @ 1
initcall dio_init+0x0/0x32 returned 0 after 14 usecs
calling  fsnotify_mark_init+0x0/0x46 @ 1
initcall fsnotify_mark_init+0x0/0x46 returned 0 after 253 usecs
calling  inotify_user_setup+0x0/0x78 @ 1
initcall inotify_user_setup+0x0/0x78 returned 0 after 25 usecs
calling  fanotify_user_setup+0x0/0x5a @ 1
initcall fanotify_user_setup+0x0/0x5a returned 0 after 24 usecs
calling  aio_setup+0x0/0x8a @ 1
initcall aio_setup+0x0/0x8a returned 0 after 30 usecs
calling  proc_locks_init+0x0/0x27 @ 1
initcall proc_locks_init+0x0/0x27 returned 0 after 12 usecs
calling  init_mbcache+0x0/0x11 @ 1
initcall init_mbcache+0x0/0x11 returned 0 after 4 usecs
calling  dquot_init+0x0/0xfb @ 1
VFS: Disk quotas dquot_6.5.2
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
initcall dquot_init+0x0/0xfb returned 0 after 1600 usecs
calling  init_v1_quota_format+0x0/0xf @ 1
initcall init_v1_quota_format+0x0/0xf returned 0 after 12 usecs
calling  init_v2_quota_format+0x0/0x1d @ 1
initcall init_v2_quota_format+0x0/0x1d returned 0 after 4 usecs
calling  proc_cmdline_init+0x0/0x27 @ 1
initcall proc_cmdline_init+0x0/0x27 returned 0 after 12 usecs
calling  proc_consoles_init+0x0/0x27 @ 1
initcall proc_consoles_init+0x0/0x27 returned 0 after 11 usecs
calling  proc_cpuinfo_init+0x0/0x27 @ 1
initcall proc_cpuinfo_init+0x0/0x27 returned 0 after 12 usecs
calling  proc_devices_init+0x0/0x27 @ 1
initcall proc_devices_init+0x0/0x27 returned 0 after 13 usecs
calling  proc_interrupts_init+0x0/0x27 @ 1
initcall proc_interrupts_init+0x0/0x27 returned 0 after 11 usecs
calling  proc_loadavg_init+0x0/0x27 @ 1
initcall proc_loadavg_init+0x0/0x27 returned 0 after 12 usecs
calling  proc_meminfo_init+0x0/0x27 @ 1
initcall proc_meminfo_init+0x0/0x27 returned 0 after 11 usecs
calling  proc_stat_init+0x0/0x27 @ 1
initcall proc_stat_init+0x0/0x27 returned 0 after 12 usecs
calling  proc_uptime_init+0x0/0x27 @ 1
initcall proc_uptime_init+0x0/0x27 returned 0 after 11 usecs
calling  proc_version_init+0x0/0x27 @ 1
initcall proc_version_init+0x0/0x27 returned 0 after 11 usecs
calling  proc_softirqs_init+0x0/0x27 @ 1
initcall proc_softirqs_init+0x0/0x27 returned 0 after 11 usecs
calling  proc_kcore_init+0x0/0x7e @ 1
initcall proc_kcore_init+0x0/0x7e returned 0 after 20 usecs
calling  proc_kmsg_init+0x0/0x2a @ 1
initcall proc_kmsg_init+0x0/0x2a returned 0 after 12 usecs
calling  proc_page_init+0x0/0x4a @ 1
initcall proc_page_init+0x0/0x4a returned 0 after 19 usecs
calling  configfs_init+0x0/0xa8 @ 1
initcall configfs_init+0x0/0xa8 returned 0 after 23 usecs
calling  init_devpts_fs+0x0/0x52 @ 1
initcall init_devpts_fs+0x0/0x52 returned 0 after 82 usecs
calling  init_reiserfs_fs+0x0/0x70 @ 1
initcall init_reiserfs_fs+0x0/0x70 returned 0 after 27 usecs
calling  init_ext3_fs+0x0/0x76 @ 1
initcall init_ext3_fs+0x0/0x76 returned 0 after 48 usecs
calling  init_ext2_fs+0x0/0x76 @ 1
initcall init_ext2_fs+0x0/0x76 returned 0 after 38 usecs
calling  ext4_init_fs+0x0/0x1e6 @ 1
initcall ext4_init_fs+0x0/0x1e6 returned 0 after 161 usecs
calling  journal_init+0x0/0xeb @ 1
initcall journal_init+0x0/0xeb returned 0 after 67 usecs
calling  journal_init+0x0/0x120 @ 1
initcall journal_init+0x0/0x120 returned 0 after 87 usecs
calling  init_ramfs_fs+0x0/0x45 @ 1
initcall init_ramfs_fs+0x0/0x45 returned 0 after 4 usecs
calling  init_fat_fs+0x0/0x4c @ 1
initcall init_fat_fs+0x0/0x4c returned 0 after 26 usecs
calling  init_vfat_fs+0x0/0xf @ 1
initcall init_vfat_fs+0x0/0xf returned 0 after 4 usecs
calling  init_msdos_fs+0x0/0xf @ 1
initcall init_msdos_fs+0x0/0xf returned 0 after 4 usecs
calling  init_iso9660_fs+0x0/0x66 @ 1
initcall init_iso9660_fs+0x0/0x66 returned 0 after 16 usecs
calling  init_nfs_fs+0x0/0x159 @ 1
initcall init_nfs_fs+0x0/0x159 returned 0 after 432 usecs
calling  init_nfs_v2+0x0/0x11 @ 1
initcall init_nfs_v2+0x0/0x11 returned 0 after 13 usecs
calling  init_nfs_v4+0x0/0x40 @ 1
NFS: Registering the id_resolver key type
Key type id_resolver registered
Key type id_legacy registered
initcall init_nfs_v4+0x0/0x40 returned 0 after 3527 usecs
calling  init_nfsd+0x0/0x113 @ 1
Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
initcall init_nfsd+0x0/0x113 returned 0 after 1421 usecs
calling  init_nlm+0x0/0x3d @ 1
initcall init_nlm+0x0/0x3d returned 0 after 30 usecs
calling  init_nls_cp437+0x0/0xf @ 1
initcall init_nls_cp437+0x0/0xf returned 0 after 13 usecs
calling  init_nls_cp737+0x0/0xf @ 1
initcall init_nls_cp737+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp775+0x0/0xf @ 1
initcall init_nls_cp775+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp850+0x0/0xf @ 1
initcall init_nls_cp850+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp852+0x0/0xf @ 1
initcall init_nls_cp852+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp855+0x0/0xf @ 1
initcall init_nls_cp855+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp857+0x0/0xf @ 1
initcall init_nls_cp857+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp863+0x0/0xf @ 1
initcall init_nls_cp863+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp865+0x0/0xf @ 1
initcall init_nls_cp865+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp866+0x0/0xf @ 1
initcall init_nls_cp866+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp869+0x0/0xf @ 1
initcall init_nls_cp869+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp936+0x0/0xf @ 1
initcall init_nls_cp936+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp949+0x0/0xf @ 1
initcall init_nls_cp949+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp950+0x0/0xf @ 1
initcall init_nls_cp950+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp1250+0x0/0xf @ 1
initcall init_nls_cp1250+0x0/0xf returned 0 after 17 usecs
calling  init_nls_ascii+0x0/0xf @ 1
initcall init_nls_ascii+0x0/0xf returned 0 after 4 usecs
calling  init_nls_iso8859_1+0x0/0xf @ 1
initcall init_nls_iso8859_1+0x0/0xf returned 0 after 4 usecs
calling  init_nls_iso8859_2+0x0/0xf @ 1
initcall init_nls_iso8859_2+0x0/0xf returned 0 after 4 usecs
calling  init_nls_iso8859_4+0x0/0xf @ 1
initcall init_nls_iso8859_4+0x0/0xf returned 0 after 4 usecs
calling  init_nls_iso8859_6+0x0/0xf @ 1
initcall init_nls_iso8859_6+0x0/0xf returned 0 after 4 usecs
calling  init_nls_iso8859_7+0x0/0xf @ 1
initcall init_nls_iso8859_7+0x0/0xf returned 0 after 4 usecs
calling  init_nls_cp1255+0x0/0xf @ 1
initcall init_nls_cp1255+0x0/0xf returned 0 after 4 usecs
calling  init_nls_iso8859_9+0x0/0xf @ 1
initcall init_nls_iso8859_9+0x0/0xf returned 0 after 4 usecs
calling  init_nls_iso8859_13+0x0/0xf @ 1
initcall init_nls_iso8859_13+0x0/0xf returned 0 after 4 usecs
calling  init_nls_iso8859_15+0x0/0xf @ 1
initcall init_nls_iso8859_15+0x0/0xf returned 0 after 4 usecs
calling  init_nls_utf8+0x0/0x1f @ 1
initcall init_nls_utf8+0x0/0x1f returned 0 after 4 usecs
calling  init_nls_macceltic+0x0/0xf @ 1
initcall init_nls_macceltic+0x0/0xf returned 0 after 4 usecs
calling  init_nls_maccroatian+0x0/0xf @ 1
initcall init_nls_maccroatian+0x0/0xf returned 0 after 4 usecs
calling  init_nls_maccyrillic+0x0/0xf @ 1
initcall init_nls_maccyrillic+0x0/0xf returned 0 after 4 usecs
calling  init_nls_macgreek+0x0/0xf @ 1
initcall init_nls_macgreek+0x0/0xf returned 0 after 4 usecs
calling  init_nls_maciceland+0x0/0xf @ 1
initcall init_nls_maciceland+0x0/0xf returned 0 after 17 usecs
calling  init_nls_macinuit+0x0/0xf @ 1
initcall init_nls_macinuit+0x0/0xf returned 0 after 4 usecs
calling  init_nls_macromanian+0x0/0xf @ 1
initcall init_nls_macromanian+0x0/0xf returned 0 after 4 usecs
calling  init_cifs+0x0/0x45a @ 1
FS-Cache: Netfs 'cifs' registered for caching
Key type cifs.spnego registered
initcall init_cifs+0x0/0x45a returned 0 after 2218 usecs
calling  init_ncp_fs+0x0/0x66 @ 1
initcall init_ncp_fs+0x0/0x66 returned 0 after 23 usecs
calling  init_ntfs_fs+0x0/0x23c @ 1
NTFS driver 2.1.30 [Flags: R/W DEBUG].
initcall init_ntfs_fs+0x0/0x23c returned 0 after 998 usecs
calling  init_autofs4_fs+0x0/0x24 @ 1
initcall init_autofs4_fs+0x0/0x24 returned 0 after 102 usecs
calling  fuse_init+0x0/0x19d @ 1
fuse init (API version 7.22)
initcall fuse_init+0x0/0x19d returned 0 after 1317 usecs
calling  cuse_init+0x0/0x93 @ 1
initcall cuse_init+0x0/0x93 returned 0 after 90 usecs
calling  init_udf_fs+0x0/0x66 @ 1
initcall init_udf_fs+0x0/0x66 returned 0 after 18 usecs
calling  init_xfs_fs+0x0/0xcf @ 1
SGI XFS with security attributes, realtime, debug enabled
initcall init_xfs_fs+0x0/0xcf returned 0 after 1705 usecs
calling  init_v9fs+0x0/0xf3 @ 1
9p: Installing v9fs 9p2000 file system support
FS-Cache: Netfs '9p' registered for caching
initcall init_v9fs+0x0/0xf3 returned 0 after 1137 usecs
calling  init_btrfs_fs+0x0/0x11f @ 1
bio: create slab <bio-1> at 1
Btrfs loaded, assert=on
btrfs: selftest: Running btrfs free space cache tests
btrfs: selftest: Running extent only tests
btrfs: selftest: Running bitmap only tests
btrfs: selftest: Running bitmap and extent tests
btrfs: selftest: Free space cache tests finished
initcall init_btrfs_fs+0x0/0x11f returned 0 after 6991 usecs
calling  init_ceph+0x0/0x146 @ 1
FS-Cache: Netfs 'ceph' registered for caching
ceph: loaded (mds proto 32)
initcall init_ceph+0x0/0x146 returned 0 after 2158 usecs
calling  init_mqueue_fs+0x0/0x9e @ 1
initcall init_mqueue_fs+0x0/0x9e returned 0 after 115 usecs
calling  key_proc_init+0x0/0x37 @ 1
initcall key_proc_init+0x0/0x37 returned 0 after 13 usecs
calling  crypto_wq_init+0x0/0x41 @ 1
initcall crypto_wq_init+0x0/0x41 returned 0 after 50 usecs
calling  crypto_algapi_init+0x0/0xc @ 1
initcall crypto_algapi_init+0x0/0xc returned 0 after 27 usecs
calling  skcipher_module_init+0x0/0x11 @ 1
initcall skcipher_module_init+0x0/0x11 returned 0 after 3 usecs
calling  chainiv_module_init+0x0/0xf @ 1
initcall chainiv_module_init+0x0/0xf returned 0 after 5 usecs
calling  eseqiv_module_init+0x0/0xf @ 1
initcall eseqiv_module_init+0x0/0xf returned 0 after 4 usecs
calling  seqiv_module_init+0x0/0xf @ 1
initcall seqiv_module_init+0x0/0xf returned 0 after 4 usecs
calling  crypto_user_init+0x0/0x42 @ 1
initcall crypto_user_init+0x0/0x42 returned 0 after 26 usecs
calling  crypto_cmac_module_init+0x0/0xf @ 1
initcall crypto_cmac_module_init+0x0/0xf returned 0 after 4 usecs
calling  hmac_module_init+0x0/0xf @ 1
initcall hmac_module_init+0x0/0xf returned 0 after 4 usecs
calling  vmac_module_init+0x0/0xf @ 1
initcall vmac_module_init+0x0/0xf returned 0 after 4 usecs
calling  crypto_xcbc_module_init+0x0/0xf @ 1
initcall crypto_xcbc_module_init+0x0/0xf returned 0 after 4 usecs
calling  crypto_null_mod_init+0x0/0x41 @ 1
initcall crypto_null_mod_init+0x0/0x41 returned 0 after 289 usecs
calling  md4_mod_init+0x0/0xf @ 1
initcall md4_mod_init+0x0/0xf returned 0 after 184 usecs
calling  md5_mod_init+0x0/0xf @ 1
initcall md5_mod_init+0x0/0xf returned 0 after 176 usecs
calling  rmd128_mod_init+0x0/0xf @ 1
initcall rmd128_mod_init+0x0/0xf returned 0 after 210 usecs
calling  rmd160_mod_init+0x0/0xf @ 1
initcall rmd160_mod_init+0x0/0xf returned 0 after 229 usecs
calling  rmd256_mod_init+0x0/0xf @ 1
initcall rmd256_mod_init+0x0/0xf returned 0 after 198 usecs
calling  rmd320_mod_init+0x0/0xf @ 1
initcall rmd320_mod_init+0x0/0xf returned 0 after 214 usecs
calling  sha1_generic_mod_init+0x0/0xf @ 1
initcall sha1_generic_mod_init+0x0/0xf returned 0 after 175 usecs
calling  sha256_generic_mod_init+0x0/0x14 @ 1
initcall sha256_generic_mod_init+0x0/0x14 returned 0 after 351 usecs
calling  sha512_generic_mod_init+0x0/0x14 @ 1
initcall sha512_generic_mod_init+0x0/0x14 returned 0 after 545 usecs
calling  wp512_mod_init+0x0/0x14 @ 1
initcall wp512_mod_init+0x0/0x14 returned 0 after 1128 usecs
calling  crypto_ecb_module_init+0x0/0xf @ 1
initcall crypto_ecb_module_init+0x0/0xf returned 0 after 4 usecs
calling  crypto_cbc_module_init+0x0/0xf @ 1
initcall crypto_cbc_module_init+0x0/0xf returned 0 after 4 usecs
calling  crypto_pcbc_module_init+0x0/0xf @ 1
initcall crypto_pcbc_module_init+0x0/0xf returned 0 after 4 usecs
calling  crypto_module_init+0x0/0xf @ 1
initcall crypto_module_init+0x0/0xf returned 0 after 4 usecs
calling  crypto_module_init+0x0/0xf @ 1
initcall crypto_module_init+0x0/0xf returned 0 after 4 usecs
calling  crypto_ctr_module_init+0x0/0x33 @ 1
initcall crypto_ctr_module_init+0x0/0x33 returned 0 after 4 usecs
calling  crypto_gcm_module_init+0x0/0x95 @ 1
initcall crypto_gcm_module_init+0x0/0x95 returned 0 after 5 usecs
calling  crypto_ccm_module_init+0x0/0x4d @ 1
initcall crypto_ccm_module_init+0x0/0x4d returned 0 after 4 usecs
calling  des_generic_mod_init+0x0/0x14 @ 1
initcall des_generic_mod_init+0x0/0x14 returned 0 after 249 usecs
calling  fcrypt_mod_init+0x0/0xf @ 1
initcall fcrypt_mod_init+0x0/0xf returned 0 after 111 usecs
calling  serpent_mod_init+0x0/0x14 @ 1
initcall serpent_mod_init+0x0/0x14 returned 0 after 250 usecs
calling  aes_init+0x0/0xf @ 1
initcall aes_init+0x0/0xf returned 0 after 133 usecs
calling  camellia_init+0x0/0xf @ 1
initcall camellia_init+0x0/0xf returned 0 after 126 usecs
calling  cast5_mod_init+0x0/0xf @ 1
initcall cast5_mod_init+0x0/0xf returned 0 after 137 usecs
calling  cast6_mod_init+0x0/0xf @ 1
initcall cast6_mod_init+0x0/0xf returned 0 after 139 usecs
calling  arc4_init+0x0/0x14 @ 1
initcall arc4_init+0x0/0x14 returned 0 after 605 usecs
calling  tea_mod_init+0x0/0x14 @ 1
initcall tea_mod_init+0x0/0x14 returned 0 after 329 usecs
calling  anubis_mod_init+0x0/0xf @ 1
initcall anubis_mod_init+0x0/0xf returned 0 after 168 usecs
calling  deflate_mod_init+0x0/0xf @ 1
initcall deflate_mod_init+0x0/0xf returned 0 after 527 usecs
calling  zlib_mod_init+0x0/0xf @ 1
initcall zlib_mod_init+0x0/0xf returned 0 after 727 usecs
calling  michael_mic_init+0x0/0xf @ 1
initcall michael_mic_init+0x0/0xf returned 0 after 234 usecs
calling  crc32c_mod_init+0x0/0xf @ 1
initcall crc32c_mod_init+0x0/0xf returned 0 after 306 usecs
calling  crc32_mod_init+0x0/0xf @ 1
alg: No test for crc32 (crc32-table)
initcall crc32_mod_init+0x0/0xf returned 0 after 609 usecs
calling  crct10dif_mod_init+0x0/0xf @ 1
initcall crct10dif_mod_init+0x0/0xf returned 0 after 142 usecs
calling  crypto_authenc_module_init+0x0/0xf @ 1
initcall crypto_authenc_module_init+0x0/0xf returned 0 after 4 usecs
calling  crypto_authenc_esn_module_init+0x0/0xf @ 1
initcall crypto_authenc_esn_module_init+0x0/0xf returned 0 after 4 usecs
calling  lzo_mod_init+0x0/0xf @ 1
initcall lzo_mod_init+0x0/0xf returned 0 after 107 usecs
calling  lz4_mod_init+0x0/0xf @ 1
alg: No test for lz4 (lz4-generic)
initcall lz4_mod_init+0x0/0xf returned 0 after 271 usecs
calling  lz4hc_mod_init+0x0/0xf @ 1
alg: No test for lz4hc (lz4hc-generic)
initcall lz4hc_mod_init+0x0/0xf returned 0 after 946 usecs
calling  krng_mod_init+0x0/0xf @ 1
alg: No test for stdrng (krng)
initcall krng_mod_init+0x0/0xf returned 0 after 1546 usecs
calling  prng_mod_init+0x0/0x14 @ 1
alg: No test for fips(ansi_cprng) (fips_ansi_cprng)
initcall prng_mod_init+0x0/0x14 returned 0 after 8854 usecs
calling  ghash_mod_init+0x0/0xf @ 1
initcall ghash_mod_init+0x0/0xf returned 0 after 198 usecs
calling  af_alg_init+0x0/0x35 @ 1
NET: Registered protocol family 38
initcall af_alg_init+0x0/0x35 returned 0 after 1229 usecs
calling  algif_hash_init+0x0/0xf @ 1
initcall algif_hash_init+0x0/0xf returned 0 after 13 usecs
calling  asymmetric_key_init+0x0/0xf @ 1
Key type asymmetric registered
initcall asymmetric_key_init+0x0/0xf returned 0 after 1527 usecs
calling  proc_genhd_init+0x0/0x44 @ 1
initcall proc_genhd_init+0x0/0x44 returned 0 after 27 usecs
calling  bsg_init+0x0/0x154 @ 1
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
initcall bsg_init+0x0/0x154 returned 0 after 1127 usecs
calling  noop_init+0x0/0xf @ 1
io scheduler noop registered (default)
initcall noop_init+0x0/0xf returned 0 after 928 usecs
calling  deadline_init+0x0/0xf @ 1
io scheduler deadline registered
initcall deadline_init+0x0/0xf returned 0 after 889 usecs
calling  cfq_init+0x0/0x8b @ 1
io scheduler cfq registered
initcall cfq_init+0x0/0x8b returned 0 after 1019 usecs
calling  test_kstrtox_init+0x0/0xb32 @ 1
initcall test_kstrtox_init+0x0/0xb32 returned -22 after 63 usecs
calling  btree_module_init+0x0/0x2f @ 1
initcall btree_module_init+0x0/0x2f returned 0 after 18 usecs
calling  crc_t10dif_mod_init+0x0/0x35 @ 1
initcall crc_t10dif_mod_init+0x0/0x35 returned 0 after 5 usecs
calling  crc32test_init+0x0/0x339 @ 1
crc32: CRC_LE_BITS = 8, CRC_BE BITS = 8
crc32: self tests passed, processed 225944 bytes in 0 nsec
crc32c: CRC_LE_BITS = 8
crc32c: self tests passed, processed 225944 bytes in 0 nsec
initcall crc32test_init+0x0/0x339 returned 0 after 6803 usecs
calling  libcrc32c_mod_init+0x0/0x24 @ 1
initcall libcrc32c_mod_init+0x0/0x24 returned 0 after 5 usecs
calling  init_kmp+0x0/0xf @ 1
initcall init_kmp+0x0/0xf returned 0 after 13 usecs
calling  init_bm+0x0/0xf @ 1
initcall init_bm+0x0/0xf returned 0 after 4 usecs
calling  init_fsm+0x0/0xf @ 1
initcall init_fsm+0x0/0xf returned 0 after 4 usecs
calling  audit_classes_init+0x0/0x4f @ 1
initcall audit_classes_init+0x0/0x4f returned 0 after 28 usecs
calling  err_inject_init+0x0/0x1e @ 1
initcall err_inject_init+0x0/0x1e returned 0 after 25 usecs
calling  digsig_init+0x0/0x36 @ 1
initcall digsig_init+0x0/0x36 returned 0 after 5 usecs
calling  rbtree_test_init+0x0/0x208 @ 1
rbtree testing -> 20477 cycles
augmented rbtree testing -> 28701 cycles
initcall rbtree_test_init+0x0/0x208 returned -11 after 2519126 usecs
calling  bgpio_driver_init+0x0/0x11 @ 1
initcall bgpio_driver_init+0x0/0x11 returned 0 after 40 usecs
calling  adnp_i2c_driver_init+0x0/0x11 @ 1
initcall adnp_i2c_driver_init+0x0/0x11 returned 0 after 52 usecs
calling  adp5588_gpio_driver_init+0x0/0x11 @ 1
initcall adp5588_gpio_driver_init+0x0/0x11 returned 0 after 25 usecs
calling  amd_gpio_init+0x0/0x155 @ 1
initcall amd_gpio_init+0x0/0x155 returned -19 after 19 usecs
calling  bt8xxgpio_pci_driver_init+0x0/0x16 @ 1
initcall bt8xxgpio_pci_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  cs5535_gpio_driver_init+0x0/0x11 @ 1
initcall cs5535_gpio_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  grgpio_driver_init+0x0/0x11 @ 1
initcall grgpio_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  ichx_gpio_driver_init+0x0/0x11 @ 1
initcall ichx_gpio_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  it8761e_gpio_init+0x0/0x180 @ 1
initcall it8761e_gpio_init+0x0/0x180 returned -19 after 46 usecs
calling  ttl_driver_init+0x0/0x11 @ 1
initcall ttl_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  kempld_gpio_driver_init+0x0/0x11 @ 1
initcall kempld_gpio_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  lnw_gpio_init+0x0/0x16 @ 1
initcall lnw_gpio_init+0x0/0x16 returned 0 after 40 usecs
calling  ioh_gpio_driver_init+0x0/0x16 @ 1
initcall ioh_gpio_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  pch_gpio_driver_init+0x0/0x16 @ 1
initcall pch_gpio_driver_init+0x0/0x16 returned 0 after 33 usecs
calling  sch_gpio_driver_init+0x0/0x11 @ 1
initcall sch_gpio_driver_init+0x0/0x11 returned 0 after 33 usecs
calling  sdv_gpio_driver_init+0x0/0x16 @ 1
initcall sdv_gpio_driver_init+0x0/0x16 returned 0 after 33 usecs
calling  timbgpio_platform_driver_init+0x0/0x11 @ 1
initcall timbgpio_platform_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  ts5500_dio_driver_init+0x0/0x11 @ 1
initcall ts5500_dio_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  gpo_twl6040_driver_init+0x0/0x11 @ 1
initcall gpo_twl6040_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  vx855gpio_driver_init+0x0/0x11 @ 1
initcall vx855gpio_driver_init+0x0/0x11 returned 0 after 33 usecs
calling  pca9685_i2c_driver_init+0x0/0x11 @ 1
initcall pca9685_i2c_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  twl_pwmled_driver_init+0x0/0x11 @ 1
initcall twl_pwmled_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  pci_proc_init+0x0/0x64 @ 1
initcall pci_proc_init+0x0/0x64 returned 0 after 222 usecs
calling  pci_hotplug_init+0x0/0x4f @ 1
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
initcall pci_hotplug_init+0x0/0x4f returned 0 after 500 usecs
calling  cpcihp_generic_init+0x0/0x428 @ 1
cpcihp_generic: Generic port I/O CompactPCI Hot Plug Driver version: 0.1
cpcihp_generic: not configured, disabling.
initcall cpcihp_generic_init+0x0/0x428 returned -22 after 2431 usecs
calling  shpcd_init+0x0/0x5f @ 1
shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
initcall shpcd_init+0x0/0x5f returned 0 after 750 usecs
calling  fb_console_init+0x0/0x108 @ 1
initcall fb_console_init+0x0/0x108 returned 0 after 66 usecs
calling  pm860x_backlight_driver_init+0x0/0x11 @ 1
initcall pm860x_backlight_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  bd6107_driver_init+0x0/0x11 @ 1
initcall bd6107_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  da903x_backlight_driver_init+0x0/0x11 @ 1
initcall da903x_backlight_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  gpio_backlight_driver_init+0x0/0x11 @ 1
initcall gpio_backlight_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  lm3533_bl_driver_init+0x0/0x11 @ 1
initcall lm3533_bl_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  lm3630_i2c_driver_init+0x0/0x11 @ 1
initcall lm3630_i2c_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  lp8788_bl_driver_init+0x0/0x11 @ 1
initcall lp8788_bl_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  ot200_backlight_driver_init+0x0/0x11 @ 1
initcall ot200_backlight_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  pandora_backlight_driver_init+0x0/0x11 @ 1
initcall pandora_backlight_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  kb3886_init+0x0/0xa @ 1
initcall kb3886_init+0x0/0xa returned -19 after 4 usecs
calling  tps65217_bl_driver_init+0x0/0x11 @ 1
initcall tps65217_bl_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  arcfb_init+0x0/0x66 @ 1
initcall arcfb_init+0x0/0x66 returned -6 after 4 usecs
calling  pm2fb_init+0x0/0x120 @ 1
initcall pm2fb_init+0x0/0x120 returned 0 after 40 usecs
calling  i740fb_init+0x0/0x99 @ 1
initcall i740fb_init+0x0/0x99 returned 0 after 34 usecs
calling  matroxfb_init+0x0/0x24b @ 1
initcall matroxfb_init+0x0/0x24b returned 0 after 49 usecs
calling  i2c_matroxfb_init+0x0/0x29 @ 1
initcall i2c_matroxfb_init+0x0/0x29 returned 0 after 4 usecs
calling  rivafb_init+0x0/0x18e @ 1
rivafb_setup START
initcall rivafb_init+0x0/0x18e returned 0 after 516 usecs
calling  nvidiafb_init+0x0/0x288 @ 1
nvidiafb_setup START
initcall nvidiafb_init+0x0/0x288 returned 0 after 842 usecs
calling  aty128fb_init+0x0/0x111 @ 1
initcall aty128fb_init+0x0/0x111 returned 0 after 38 usecs
calling  savagefb_init+0x0/0x5e @ 1
initcall savagefb_init+0x0/0x5e returned 0 after 41 usecs
calling  neofb_init+0x0/0x11d @ 1
initcall neofb_init+0x0/0x11d returned 0 after 34 usecs
calling  tdfxfb_init+0x0/0x105 @ 1
initcall tdfxfb_init+0x0/0x105 returned 0 after 33 usecs
calling  imsttfb_init+0x0/0xdd @ 1
initcall imsttfb_init+0x0/0xdd returned 0 after 34 usecs
calling  vt8623fb_init+0x0/0x6c @ 1
initcall vt8623fb_init+0x0/0x6c returned 0 after 33 usecs
calling  tridentfb_init+0x0/0x1db @ 1
initcall tridentfb_init+0x0/0x1db returned 0 after 36 usecs
calling  vmlfb_init+0x0/0x82 @ 1
vmlfb: initializing
initcall vmlfb_init+0x0/0x82 returned 0 after 679 usecs
calling  cr_pll_init+0x0/0xd3 @ 1
Could not find Carillo Ranch MCH device.
initcall cr_pll_init+0x0/0xd3 returned -19 after 1266 usecs
calling  s3fb_init+0x0/0xf1 @ 1
initcall s3fb_init+0x0/0xf1 returned 0 after 34 usecs
calling  arkfb_init+0x0/0x6c @ 1
initcall arkfb_init+0x0/0x6c returned 0 after 33 usecs
calling  hecubafb_init+0x0/0x11 @ 1
initcall hecubafb_init+0x0/0x11 returned 0 after 33 usecs
calling  n411_init+0x0/0x7f @ 1
no IO addresses supplied
initcall n411_init+0x0/0x7f returned -22 after 1501 usecs
calling  hgafb_init+0x0/0x6f @ 1
hgafb: HGA card not detected.
hgafb: probe of hgafb.0 failed with error -22
initcall hgafb_init+0x0/0x6f returned 0 after 2499 usecs
calling  sstfb_init+0x0/0x176 @ 1
initcall sstfb_init+0x0/0x176 returned 0 after 40 usecs
calling  s1d13xxxfb_init+0x0/0x28 @ 1
initcall s1d13xxxfb_init+0x0/0x28 returned 0 after 27 usecs
calling  sm501fb_driver_init+0x0/0x11 @ 1
initcall sm501fb_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  ufx_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver smscufx
initcall ufx_driver_init+0x0/0x16 returned 0 after 670 usecs
calling  carminefb_init+0x0/0x33 @ 1
initcall carminefb_init+0x0/0x33 returned 0 after 33 usecs
calling  ssd1307fb_driver_init+0x0/0x11 @ 1
initcall ssd1307fb_driver_init+0x0/0x11 returned 0 after 40 usecs
calling  ipmi_init_msghandler_mod+0x0/0xc @ 1
ipmi message handler version 39.2
initcall ipmi_init_msghandler_mod+0x0/0xc returned 0 after 1071 usecs
calling  init_ipmi_devintf+0x0/0xf3 @ 1
ipmi device interface
initcall init_ipmi_devintf+0x0/0xf3 returned 0 after 1022 usecs
calling  init_ipmi_si+0x0/0x4e6 @ 1
IPMI System Interface driver.
ipmi_si: Adding default-specified kcs state machine
ipmi_si: Trying default-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
ipmi_si: Interface detection failed
Switched to clocksource tsc
ipmi_si: Adding default-specified smic state machine
ipmi_si: Trying default-specified smic state machine at i/o address 0xca9, slave address 0x0, irq 0
ipmi_si: Interface detection failed
ipmi_si: Adding default-specified bt state machine
ipmi_si: Trying default-specified bt state machine at i/o address 0xe4, slave address 0x0, irq 0
ipmi_si: Interface detection failed
ipmi_si: Unable to find any System Interface(s)
initcall init_ipmi_si+0x0/0x4e6 returned -19 after 82793 usecs
calling  ipmi_wdog_init+0x0/0x117 @ 1
IPMI Watchdog: driver initialized
initcall ipmi_wdog_init+0x0/0x117 returned 0 after 5941 usecs
calling  ipmi_poweroff_init+0x0/0x7f @ 1
Copyright (C) 2004 MontaVista Software - IPMI Powerdown via sys_reboot.
initcall ipmi_poweroff_init+0x0/0x7f returned 0 after 12394 usecs
calling  pnpbios_thread_init+0x0/0x6c @ 1
initcall pnpbios_thread_init+0x0/0x6c returned 0 after 31 usecs
calling  isapnp_init+0x0/0x5de @ 1
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
initcall isapnp_init+0x0/0x5de returned 0 after 359039 usecs
calling  virtio_mmio_init+0x0/0x11 @ 1
initcall virtio_mmio_init+0x0/0x11 returned 0 after 23 usecs
calling  virtio_pci_driver_init+0x0/0x16 @ 1
initcall virtio_pci_driver_init+0x0/0x16 returned 0 after 30 usecs
calling  virtio_balloon_driver_init+0x0/0xf @ 1
initcall virtio_balloon_driver_init+0x0/0xf returned 0 after 21 usecs
calling  regulator_virtual_consumer_driver_init+0x0/0x11 @ 1
initcall regulator_virtual_consumer_driver_init+0x0/0x11 returned 0 after 24 usecs
calling  regulator_userspace_consumer_driver_init+0x0/0x11 @ 1
initcall regulator_userspace_consumer_driver_init+0x0/0x11 returned 0 after 23 usecs
calling  pm800_regulator_driver_init+0x0/0x11 @ 1
initcall pm800_regulator_driver_init+0x0/0x11 returned 0 after 23 usecs
calling  da9210_regulator_driver_init+0x0/0x11 @ 1
initcall da9210_regulator_driver_init+0x0/0x11 returned 0 after 22 usecs
calling  lp3971_i2c_driver_init+0x0/0x11 @ 1
initcall lp3971_i2c_driver_init+0x0/0x11 returned 0 after 21 usecs
calling  pfuze_driver_init+0x0/0x11 @ 1
initcall pfuze_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  wm8994_ldo_driver_init+0x0/0x11 @ 1
initcall wm8994_ldo_driver_init+0x0/0x11 returned 0 after 23 usecs
calling  pty_init+0x0/0x357 @ 1
initcall pty_init+0x0/0x357 returned 0 after 34834 usecs
calling  sysrq_init+0x0/0xb6 @ 1
initcall sysrq_init+0x0/0xb6 returned 0 after 20 usecs
calling  gsm_init+0x0/0x141 @ 1
initcall gsm_init+0x0/0x141 returned 0 after 44 usecs
calling  serial8250_init+0x0/0x15a @ 1
Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
initcall serial8250_init+0x0/0x15a returned 0 after 42661 usecs
calling  serial_pci_driver_init+0x0/0x16 @ 1
initcall serial_pci_driver_init+0x0/0x16 returned 0 after 77 usecs
calling  dw8250_platform_driver_init+0x0/0x11 @ 1
initcall dw8250_platform_driver_init+0x0/0x11 returned 0 after 25 usecs
calling  ulite_init+0x0/0x87 @ 1
initcall ulite_init+0x0/0x87 returned 0 after 53 usecs
calling  altera_uart_init+0x0/0x35 @ 1
initcall altera_uart_init+0x0/0x35 returned 0 after 52 usecs
calling  asc_init+0x0/0x43 @ 1
STMicroelectronics ASC driver initialized
initcall asc_init+0x0/0x43 returned 0 after 7345 usecs
calling  init_kgdboc+0x0/0x15 @ 1
initcall init_kgdboc+0x0/0x15 returned 0 after 1 usecs
calling  timbuart_platform_driver_init+0x0/0x11 @ 1
initcall timbuart_platform_driver_init+0x0/0x11 returned 0 after 24 usecs
calling  altera_jtaguart_init+0x0/0x35 @ 1
initcall altera_jtaguart_init+0x0/0x35 returned 0 after 43 usecs
calling  hsu_pci_init+0x0/0x2b1 @ 1
initcall hsu_pci_init+0x0/0x2b1 returned 0 after 210 usecs
calling  pch_uart_module_init+0x0/0x3a @ 1
initcall pch_uart_module_init+0x0/0x3a returned 0 after 68 usecs
calling  xuartps_init+0x0/0x35 @ 1
initcall xuartps_init+0x0/0x35 returned 0 after 45 usecs
calling  rp2_uart_init+0x0/0x3a @ 1
initcall rp2_uart_init+0x0/0x3a returned 0 after 106 usecs
calling  lpuart_serial_init+0x0/0x43 @ 1
serial: Freescale lpuart driver
initcall lpuart_serial_init+0x0/0x43 returned 0 after 5667 usecs
calling  nozomi_init+0x0/0x100 @ 1
Initializing Nozomi driver 2.1d
initcall nozomi_init+0x0/0x100 returned 0 after 5696 usecs
calling  rand_initialize+0x0/0x25 @ 1
initcall rand_initialize+0x0/0x25 returned 0 after 43 usecs
calling  init+0x0/0xf3 @ 1
initcall init+0x0/0xf3 returned 0 after 77 usecs
calling  raw_init+0x0/0x135 @ 1
initcall raw_init+0x0/0x135 returned 0 after 160 usecs
calling  lp_init_module+0x0/0x210 @ 1
lp: driver loaded but no devices found
initcall lp_init_module+0x0/0x210 returned 0 after 6790 usecs
calling  dtlk_init+0x0/0x1d1 @ 1
DoubleTalk PC - not found
initcall dtlk_init+0x0/0x1d1 returned -19 after 4597 usecs
calling  applicom_init+0x0/0x484 @ 1
Applicom driver: $Id: ac.c,v 1.30 2000/03/22 16:03:57 dwmw2 Exp $
ac.o: No PCI boards found.
ac.o: For an ISA board you must supply memory and irq parameters.
initcall applicom_init+0x0/0x484 returned -6 after 27475 usecs
calling  i8k_init+0x0/0x30d @ 1
initcall i8k_init+0x0/0x30d returned -19 after 0 usecs
calling  timeriomem_rng_driver_init+0x0/0x11 @ 1
initcall timeriomem_rng_driver_init+0x0/0x11 returned 0 after 24 usecs
calling  mod_init+0x0/0x1dc @ 1
initcall mod_init+0x0/0x1dc returned -19 after 111 usecs
calling  mod_init+0x0/0x11a @ 1
initcall mod_init+0x0/0x11a returned -19 after 11 usecs
calling  mod_init+0x0/0x9a @ 1
initcall mod_init+0x0/0x9a returned -19 after 8 usecs
calling  mod_init+0x0/0x48 @ 1
initcall mod_init+0x0/0x48 returned -19 after 0 usecs
calling  virtio_rng_driver_init+0x0/0xf @ 1
initcall virtio_rng_driver_init+0x0/0xf returned 0 after 21 usecs
calling  rng_init+0x0/0xf @ 1
initcall rng_init+0x0/0xf returned 0 after 136 usecs
calling  ppdev_init+0x0/0xba @ 1
ppdev: user-space parallel port driver
initcall ppdev_init+0x0/0xba returned 0 after 6789 usecs
calling  pc8736x_gpio_init+0x0/0x2dd @ 1
platform pc8736x_gpio.0: NatSemi pc8736x GPIO Driver Initializing
platform pc8736x_gpio.0: no device found
initcall pc8736x_gpio_init+0x0/0x2dd returned -19 after 18607 usecs
calling  nsc_gpio_init+0x0/0x14 @ 1
nsc_gpio initializing
initcall nsc_gpio_init+0x0/0x14 returned 0 after 3910 usecs
calling  tlclk_init+0x0/0x1d9 @ 1
telclk_interrupt = 0xf non-mcpbl0010 hw.
initcall tlclk_init+0x0/0x1d9 returned -6 after 7128 usecs
calling  mwave_init+0x0/0x1e5 @ 1
smapi::smapi_init, ERROR invalid usSmapiID
mwave: tp3780i::tp3780I_InitializeBoardData: Error: SMAPI is not available on this machine
mwave: mwavedd::mwave_init: Error: Failed to initialize board data
mwave: mwavedd::mwave_init: Error: Failed to initialize
initcall mwave_init+0x0/0x1e5 returned -5 after 44250 usecs
calling  agp_init+0x0/0x32 @ 1
Linux agpgart interface v0.103
initcall agp_init+0x0/0x32 returned 0 after 5434 usecs
calling  agp_ali_init+0x0/0x27 @ 1
initcall agp_ali_init+0x0/0x27 returned 0 after 37 usecs
calling  agp_intel_init+0x0/0x27 @ 1
initcall agp_intel_init+0x0/0x27 returned 0 after 34 usecs
calling  agp_nvidia_init+0x0/0x27 @ 1
initcall agp_nvidia_init+0x0/0x27 returned 0 after 36 usecs
calling  agp_sis_init+0x0/0x27 @ 1
initcall agp_sis_init+0x0/0x27 returned 0 after 31 usecs
calling  agp_serverworks_init+0x0/0x27 @ 1
initcall agp_serverworks_init+0x0/0x27 returned 0 after 29 usecs
calling  agp_via_init+0x0/0x27 @ 1
initcall agp_via_init+0x0/0x27 returned 0 after 33 usecs
calling  synclink_cs_init+0x0/0x103 @ 1
SyncLink PC Card driver $Revision: 4.34 $, tty major#242
initcall synclink_cs_init+0x0/0x103 returned 0 after 9837 usecs
calling  cmm_init+0x0/0xa7 @ 1
initcall cmm_init+0x0/0xa7 returned 0 after 53 usecs
calling  cm4040_init+0x0/0xa7 @ 1
initcall cm4040_init+0x0/0xa7 returned 0 after 55 usecs
calling  hangcheck_init+0x0/0xab @ 1
Hangcheck: starting hangcheck timer 0.9.1 (tick is 180 seconds, margin is 60 seconds).
Hangcheck: Using getrawmonotonic().
initcall hangcheck_init+0x0/0xab returned 0 after 21193 usecs
calling  init_tis+0x0/0x9d @ 1
initcall init_tis+0x0/0x9d returned 0 after 32 usecs
calling  tpm_tis_i2c_driver_init+0x0/0x11 @ 1
initcall tpm_tis_i2c_driver_init+0x0/0x11 returned 0 after 35 usecs
calling  init_nsc+0x0/0x570 @ 1
initcall init_nsc+0x0/0x570 returned -19 after 10 usecs
calling  init_inf+0x0/0xf @ 1
initcall init_inf+0x0/0xf returned 0 after 23 usecs
calling  i810fb_init+0x0/0x33e @ 1
initcall i810fb_init+0x0/0x33e returned 0 after 31 usecs
calling  parport_default_proc_register+0x0/0x16 @ 1
initcall parport_default_proc_register+0x0/0x16 returned 0 after 24 usecs
calling  parport_pc_init+0x0/0x336 @ 1
IT8712 SuperIO detected.
parport_pc 00:0e: reported by Plug and Play BIOS
parport0: PC-style at 0x378 (0x778), irq 7 [PCSPP,TRISTATE]
lp0: using parport0 (interrupt-driven).
initcall parport_pc_init+0x0/0x336 returned 0 after 110351 usecs
calling  parport_serial_init+0x0/0x16 @ 1
initcall parport_serial_init+0x0/0x16 returned 0 after 35 usecs
calling  parport_cs_driver_init+0x0/0xf @ 1
initcall parport_cs_driver_init+0x0/0xf returned 0 after 24 usecs
calling  axdrv_init+0x0/0x11 @ 1
initcall axdrv_init+0x0/0x11 returned 0 after 25 usecs
calling  topology_sysfs_init+0x0/0x19 @ 1
initcall topology_sysfs_init+0x0/0x19 returned 0 after 10 usecs
calling  isa_bus_init+0x0/0x33 @ 1
initcall isa_bus_init+0x0/0x33 returned 0 after 43 usecs
calling  floppy_init+0x0/0x13 @ 1
initcall floppy_init+0x0/0x13 returned 0 after 15 usecs
calling  loop_init+0x0/0x129 @ 1
loop: module loaded
initcall loop_init+0x0/0x129 returned 0 after 5479 usecs
calling  cpqarray_init+0x0/0x26d @ 1
Compaq SMART2 Driver (v 2.6.0)
initcall cpqarray_init+0x0/0x26d returned -19 after 5496 usecs
calling  cciss_init+0x0/0x9b @ 1
HP CISS Driver (v 3.6.26)
calling  1_floppy_async_init+0x0/0xa @ 6
Floppy drive(s): fd0 is 1.44M
initcall cciss_init+0x0/0x9b returned 0 after 18234 usecs
calling  pkt_init+0x0/0x169 @ 1
initcall pkt_init+0x0/0x169 returned 0 after 181 usecs
calling  osdblk_init+0x0/0x7a @ 1
initcall osdblk_init+0x0/0x7a returned 0 after 27 usecs
calling  mm_init+0x0/0x168 @ 1
MM: desc_per_page = 128
initcall mm_init+0x0/0x168 returned 0 after 4250 usecs
calling  nbd_init+0x0/0x323 @ 1
nbd: registered device at major 43
initcall nbd_init+0x0/0x323 returned 0 after 10245 usecs
calling  init+0x0/0x8b @ 1
initcall init+0x0/0x8b returned 0 after 39 usecs
calling  carm_init+0x0/0x16 @ 1
initcall carm_init+0x0/0x16 returned 0 after 37 usecs
calling  mtip_init+0x0/0x13e @ 1
mtip32xx Version 1.2.6os3
initcall mtip_init+0x0/0x13e returned 0 after 4670 usecs
calling  ibmasm_init+0x0/0x69 @ 1
ibmasm: IBM ASM Service Processor Driver version 1.0 loaded
initcall ibmasm_init+0x0/0x69 returned 0 after 10344 usecs
calling  dummy_irq_init+0x0/0x75 @ 1
dummy-irq: no IRQ given.  Use irq=N
initcall dummy_irq_init+0x0/0x75 returned -5 after 6280 usecs
calling  ics932s401_driver_init+0x0/0x11 @ 1
initcall ics932s401_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  lkdtm_module_init+0x0/0x1b4 @ 1
lkdtm: No crash points registered, enable through debugfs
initcall lkdtm_module_init+0x0/0x1b4 returned 0 after 10005 usecs
calling  tifm_7xx1_driver_init+0x0/0x16 @ 1
initcall tifm_7xx1_driver_init+0x0/0x16 returned 0 after 39 usecs
calling  phantom_init+0x0/0xed @ 1
Phantom Linux Driver, version n0.9.8, init OK
initcall phantom_init+0x0/0xed returned 0 after 7975 usecs
calling  bh1780_driver_init+0x0/0x11 @ 1
initcall bh1780_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  apds990x_driver_init+0x0/0x11 @ 1
initcall apds990x_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  ioc4_init+0x0/0x16 @ 1
initcall ioc4_init+0x0/0x16 returned 0 after 30 usecs
calling  enclosure_init+0x0/0x14 @ 1
initcall enclosure_init+0x0/0x14 returned 0 after 22 usecs
calling  init_kgdbts+0x0/0x15 @ 1
initcall init_kgdbts+0x0/0x15 returned 0 after 0 usecs
calling  cs5535_mfgpt_init+0x0/0x11 @ 1
initcall cs5535_mfgpt_init+0x0/0x11 returned 0 after 32 usecs
calling  ilo_init+0x0/0x82 @ 1
initcall ilo_init+0x0/0x82 returned 0 after 78 usecs
calling  isl29020_driver_init+0x0/0x11 @ 1
initcall isl29020_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  tsl2550_driver_init+0x0/0x11 @ 1
initcall tsl2550_driver_init+0x0/0x11 returned 0 after 22 usecs
calling  ds1682_driver_init+0x0/0x11 @ 1
initcall ds1682_driver_init+0x0/0x11 returned 0 after 21 usecs
calling  at24_init+0x0/0x43 @ 1
initcall at24_init+0x0/0x43 returned 0 after 22 usecs
calling  eeprom_driver_init+0x0/0x11 @ 1
initcall eeprom_driver_init+0x0/0x11 returned 0 after 21 usecs
calling  cb710_init_module+0x0/0x16 @ 1
initcall cb710_init_module+0x0/0x16 returned 0 after 32 usecs
calling  kim_platform_driver_init+0x0/0x11 @ 1
initcall kim_platform_driver_init+0x0/0x11 returned 0 after 24 usecs
calling  fsa9480_i2c_driver_init+0x0/0x11 @ 1
initcall fsa9480_i2c_driver_init+0x0/0x11 returned 0 after 22 usecs
calling  vmci_drv_init+0x0/0xcf @ 1
Guest personality initialized and is inactive
VMCI host device registered (name=vmci, major=10, minor=60)
Initialized host personality
initcall vmci_drv_init+0x0/0xcf returned 0 after 24080 usecs
calling  sm501_base_init+0x0/0x22 @ 1
initcall sm501_base_init+0x0/0x22 returned 0 after 58 usecs
calling  cros_ec_driver_init+0x0/0x11 @ 1
initcall cros_ec_driver_init+0x0/0x11 returned 0 after 23 usecs
calling  rtsx_pci_driver_init+0x0/0x16 @ 1
initcall rtsx_pci_driver_init+0x0/0x16 returned 0 after 39 usecs
calling  pasic3_driver_init+0x0/0x14 @ 1
initcall pasic3_driver_init+0x0/0x14 returned -19 after 44 usecs
calling  htcpld_core_init+0x0/0x24 @ 1
initcall htcpld_core_init+0x0/0x24 returned -19 after 63 usecs
calling  ti_tscadc_driver_init+0x0/0x11 @ 1
initcall ti_tscadc_driver_init+0x0/0x11 returned 0 after 24 usecs
calling  wm8994_i2c_driver_init+0x0/0x11 @ 1
initcall wm8994_i2c_driver_init+0x0/0x11 returned 0 after 57 usecs
calling  twl_driver_init+0x0/0x11 @ 1
initcall twl_driver_init+0x0/0x11 returned 0 after 41 usecs
calling  twl4030_madc_driver_init+0x0/0x11 @ 1
initcall twl4030_madc_driver_init+0x0/0x11 returned 0 after 34 usecs
calling  twl4030_audio_driver_init+0x0/0x11 @ 1
initcall twl4030_audio_driver_init+0x0/0x11 returned 0 after 24 usecs
calling  twl6040_driver_init+0x0/0x11 @ 1
initcall twl6040_driver_init+0x0/0x11 returned 0 after 22 usecs
calling  timberdale_pci_driver_init+0x0/0x16 @ 1
initcall timberdale_pci_driver_init+0x0/0x16 returned 0 after 30 usecs
calling  kempld_init+0x0/0x5f @ 1
initcall kempld_init+0x0/0x5f returned -19 after 0 usecs
calling  lpc_sch_driver_init+0x0/0x16 @ 1
initcall lpc_sch_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  lpc_ich_driver_init+0x0/0x16 @ 1
initcall lpc_ich_driver_init+0x0/0x16 returned 0 after 49 usecs
calling  rdc321x_sb_driver_init+0x0/0x16 @ 1
initcall rdc321x_sb_driver_init+0x0/0x16 returned 0 after 30 usecs
calling  cmodio_pci_driver_init+0x0/0x16 @ 1
initcall cmodio_pci_driver_init+0x0/0x16 returned 0 after 37 usecs
calling  vx855_pci_driver_init+0x0/0x16 @ 1
initcall vx855_pci_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  si476x_core_driver_init+0x0/0x11 @ 1
initcall si476x_core_driver_init+0x0/0x11 returned 0 after 22 usecs
calling  cs5535_mfd_driver_init+0x0/0x16 @ 1
initcall cs5535_mfd_driver_init+0x0/0x16 returned 0 after 37 usecs
calling  vprbrd_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver viperboard
initcall vprbrd_driver_init+0x0/0x16 returned 0 after 8989 usecs
calling  nfcwilink_driver_init+0x0/0x11 @ 1
initcall nfcwilink_driver_init+0x0/0x11 returned 0 after 23 usecs
calling  scsi_tgt_init+0x0/0x9d @ 1
FDC 0 is a post-1991 82077
initcall scsi_tgt_init+0x0/0x9d returned 0 after 5476 usecs
calling  raid_init+0x0/0xf @ 1
initcall raid_init+0x0/0xf returned 0 after 23 usecs
calling  spi_transport_init+0x0/0x79 @ 1
initcall spi_transport_init+0x0/0x79 returned 0 after 46 usecs
calling  fc_transport_init+0x0/0x71 @ 1
initcall fc_transport_init+0x0/0x71 returned 0 after 93 usecs
calling  iscsi_transport_init+0x0/0x199 @ 1
Loading iSCSI transport class v2.0-870.
initcall iscsi_transport_init+0x0/0x199 returned 0 after 7239 usecs
calling  sas_transport_init+0x0/0x9f @ 1
initcall sas_transport_init+0x0/0x9f returned 0 after 127 usecs
calling  sas_class_init+0x0/0x38 @ 1
initcall sas_class_init+0x0/0x38 returned 0 after 14 usecs
calling  srp_transport_init+0x0/0x33 @ 1
initcall srp_transport_init+0x0/0x33 returned 0 after 43 usecs
calling  scsi_dh_init+0x0/0x35 @ 1
initcall scsi_dh_init+0x0/0x35 returned 0 after 1 usecs
calling  rdac_init+0x0/0x85 @ 1
rdac: device handler registered
initcall rdac_init+0x0/0x85 returned 0 after 5661 usecs
calling  hp_sw_init+0x0/0xf @ 1
hp_sw: device handler registered
initcall hp_sw_init+0x0/0xf returned 0 after 5773 usecs
calling  clariion_init+0x0/0x32 @ 1
emc: device handler registered
initcall clariion_init+0x0/0x32 returned 0 after 5434 usecs
calling  alua_init+0x0/0x32 @ 1
alua: device handler registered
initcall alua_init+0x0/0x32 returned 0 after 5603 usecs
calling  libfc_init+0x0/0x37 @ 1
initcall libfc_init+0x0/0x37 returned 0 after 152 usecs
calling  libfcoe_init+0x0/0x23 @ 1
initcall libfcoe_init+0x0/0x23 returned 0 after 37 usecs
calling  fnic_init_module+0x0/0x25d @ 1
fnic: Cisco FCoE HBA Driver, ver 1.5.0.23
fnic: Successfully Initialized Trace Buffer
initcall fnic_init_module+0x0/0x25d returned 0 after 15159 usecs
calling  bnx2fc_mod_init+0x0/0x255 @ 1
bnx2fc: Broadcom NetXtreme II FCoE Driver bnx2fc v1.0.14 (Mar 08, 2013)
initcall 1_floppy_async_init+0x0/0xa returned 0 after 1339307 usecs
initcall bnx2fc_mod_init+0x0/0x255 returned 0 after 24511 usecs
calling  iscsi_sw_tcp_init+0x0/0x43 @ 1
iscsi: registered transport (tcp)
initcall iscsi_sw_tcp_init+0x0/0x43 returned 0 after 5942 usecs
calling  arcmsr_module_init+0x0/0x16 @ 1
initcall arcmsr_module_init+0x0/0x16 returned 0 after 35 usecs
calling  init_this_scsi_driver+0x0/0xc6 @ 1
initcall init_this_scsi_driver+0x0/0xc6 returned -19 after 389 usecs
calling  aha152x_init+0x0/0x615 @ 1
initcall aha152x_init+0x0/0x615 returned -19 after 67 usecs
calling  ahc_linux_init+0x0/0x5a @ 1
initcall ahc_linux_init+0x0/0x5a returned 0 after 52 usecs
calling  ahd_linux_init+0x0/0x6c @ 1
initcall ahd_linux_init+0x0/0x6c returned 0 after 55 usecs
calling  aac_init+0x0/0x76 @ 1
Adaptec aacraid driver 1.2-0[30200]-ms
initcall aac_init+0x0/0x76 returned 0 after 6836 usecs
calling  aic94xx_init+0x0/0x12d @ 1
aic94xx: Adaptec aic94xx SAS/SATA driver version 1.0.3 loaded
initcall aic94xx_init+0x0/0x12d returned 0 after 10762 usecs
calling  pm8001_init+0x0/0x9a @ 1
initcall pm8001_init+0x0/0x9a returned 0 after 58 usecs
calling  ips_module_init+0x0/0x2f3 @ 1
initcall ips_module_init+0x0/0x2f3 returned -19 after 65 usecs
calling  init_this_scsi_driver+0x0/0xc6 @ 1
scsi: <fdomain> Detection failed (no card)
initcall init_this_scsi_driver+0x0/0xc6 returned -19 after 7465 usecs
calling  init_this_scsi_driver+0x0/0xc6 @ 1
initcall init_this_scsi_driver+0x0/0xc6 returned -19 after 3 usecs
calling  init_this_scsi_driver+0x0/0xc6 @ 1
initcall init_this_scsi_driver+0x0/0xc6 returned -19 after 1 usecs
calling  init_this_scsi_driver+0x0/0xc6 @ 1
NCR53c406a: no available ports found
initcall init_this_scsi_driver+0x0/0xc6 returned -19 after 6450 usecs
calling  init_this_scsi_driver+0x0/0xc6 @ 1
sym53c416.c: Version 1.0.0-ac
initcall init_this_scsi_driver+0x0/0xc6 returned -19 after 5277 usecs
calling  qla1280_init+0x0/0x16 @ 1
initcall qla1280_init+0x0/0x16 returned 0 after 30 usecs
calling  qla2x00_module_init+0x0/0x242 @ 1
qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 8.06.00.08-k.
initcall qla2x00_module_init+0x0/0x242 returned 0 after 13430 usecs
calling  tcm_qla2xxx_init+0x0/0x29d @ 1
initcall tcm_qla2xxx_init+0x0/0x29d returned 0 after 94 usecs
calling  qla4xxx_module_init+0x0/0xcc @ 1
iscsi: registered transport (qla4xxx)
QLogic iSCSI HBA Driver
initcall qla4xxx_module_init+0x0/0xcc returned 0 after 10868 usecs
calling  lpfc_init+0x0/0xf0 @ 1
Emulex LightPulse Fibre Channel SCSI driver 8.3.42
Copyright(c) 2004-2013 Emulex.  All rights reserved.
initcall lpfc_init+0x0/0xf0 returned 0 after 18119 usecs
calling  bfad_init+0x0/0xab @ 1
Brocade BFA FC/FCOE SCSI driver - version: 3.2.21.1
initcall bfad_init+0x0/0xab returned 0 after 9048 usecs
calling  init_this_scsi_driver+0x0/0xc6 @ 1
initcall init_this_scsi_driver+0x0/0xc6 returned -19 after 2 usecs
calling  dmx3191d_init+0x0/0x16 @ 1
initcall dmx3191d_init+0x0/0x16 returned 0 after 38 usecs
calling  hpsa_init+0x0/0x16 @ 1
initcall hpsa_init+0x0/0x16 returned 0 after 43 usecs
calling  sym2_init+0x0/0xe0 @ 1
initcall sym2_init+0x0/0xe0 returned 0 after 39 usecs
calling  init_this_scsi_driver+0x0/0xc6 @ 1
Failed initialization of WD-7000 SCSI card!
initcall init_this_scsi_driver+0x0/0xc6 returned -19 after 28676 usecs
calling  init_this_scsi_driver+0x0/0xc6 @ 1
initcall init_this_scsi_driver+0x0/0xc6 returned -19 after 30939 usecs
calling  dc395x_module_init+0x0/0x16 @ 1
initcall dc395x_module_init+0x0/0x16 returned 0 after 31 usecs
calling  dc390_module_init+0x0/0x8f @ 1
DC390: clustering now enabled by default. If you get problems load
       with "disable_clustering=1" and report to maintainers
initcall dc390_module_init+0x0/0x8f returned 0 after 22071 usecs
calling  megaraid_init+0x0/0xae @ 1
initcall megaraid_init+0x0/0xae returned 0 after 57 usecs
calling  megasas_init+0x0/0x18e @ 1
megasas: 06.700.06.00-rc1 Sat. Aug. 31 17:00:00 PDT 2013
initcall megasas_init+0x0/0x18e returned 0 after 9887 usecs
calling  _scsih_init+0x0/0x154 @ 1
mpt2sas version 16.100.00.00 loaded
initcall _scsih_init+0x0/0x154 returned 0 after 6401 usecs
calling  _scsih_init+0x0/0x154 @ 1
mpt3sas version 02.100.00.00 loaded
initcall _scsih_init+0x0/0x154 returned 0 after 6407 usecs
calling  ufshcd_pci_driver_init+0x0/0x16 @ 1
initcall ufshcd_pci_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  ufshcd_pltfrm_driver_init+0x0/0x11 @ 1
initcall ufshcd_pltfrm_driver_init+0x0/0x11 returned 0 after 34 usecs
calling  gdth_init+0x0/0x7a5 @ 1
GDT-HA: Storage RAID Controller Driver. Version: 3.05
initcall gdth_init+0x0/0x7a5 returned 0 after 9365 usecs
calling  initio_init_driver+0x0/0x16 @ 1
initcall initio_init_driver+0x0/0x16 returned 0 after 31 usecs
calling  inia100_init+0x0/0x16 @ 1
initcall inia100_init+0x0/0x16 returned 0 after 30 usecs
calling  tw_init+0x0/0x2d @ 1
3ware Storage Controller device driver for Linux v1.26.02.003.
initcall tw_init+0x0/0x2d returned 0 after 10888 usecs
calling  twa_init+0x0/0x2d @ 1
3ware 9000 Storage Controller device driver for Linux v2.26.02.014.
initcall twa_init+0x0/0x2d returned 0 after 11728 usecs
calling  imm_driver_init+0x0/0x26 @ 1
imm: Version 2.05 (for Linux 2.4.0)
initcall imm_driver_init+0x0/0x26 returned 0 after 6646 usecs
calling  init_nsp32+0x0/0x3d @ 1
nsp32: loading...
initcall init_nsp32+0x0/0x3d returned 0 after 3265 usecs
calling  ipr_init+0x0/0x3f @ 1
ipr: IBM Power RAID SCSI Device Driver version: 2.6.0 (November 16, 2012)
initcall ipr_init+0x0/0x3f returned 0 after 12748 usecs
calling  hptiop_module_init+0x0/0x35 @ 1
RocketRAID 3xxx/4xxx Controller driver v1.8
initcall hptiop_module_init+0x0/0x35 returned 0 after 7669 usecs
calling  stex_init+0x0/0x2d @ 1
stex: Promise SuperTrak EX Driver version: 4.6.0000.4
initcall stex_init+0x0/0x2d returned 0 after 9360 usecs
calling  libcxgbi_init_module+0x0/0x17a @ 1
Clocksource tsc unstable (delta = 2830721179 ns)
libcxgbi:libcxgbi_init_module: tag itt 0x1fff, 13 bits, age 0xf, 4 bits.
libcxgbi:ddp_setup_host_page_size: system PAGE 4096, ddp idx 0.
initcall libcxgbi_init_module+0x0/0x17a returned 0 after 23559 usecs
calling  cxgb4i_init_module+0x0/0x40 @ 1
Chelsio T4/T5 iSCSI Driver cxgb4i v0.9.4
iscsi: registered transport (cxgb4i)
initcall cxgb4i_init_module+0x0/0x40 returned 0 after 13588 usecs
calling  beiscsi_module_init+0x0/0x75 @ 1
iscsi: registered transport (be2iscsi)
In beiscsi_module_init, tt=b2bf7600
initcall beiscsi_module_init+0x0/0x75 returned 0 after 13100 usecs
calling  esas2r_init+0x0/0x287 @ 1
esas2r: driver will not be loaded because no ATTO esas2r devices were found
initcall esas2r_init+0x0/0x287 returned -1 after 13052 usecs
calling  pmcraid_init+0x0/0x126 @ 1
initcall pmcraid_init+0x0/0x126 returned 0 after 87 usecs
calling  init+0x0/0xc4 @ 1
initcall init+0x0/0xc4 returned 0 after 411 usecs
calling  pvscsi_init+0x0/0x35 @ 1
VMware PVSCSI driver - version 1.0.2.0-k
initcall pvscsi_init+0x0/0x35 returned 0 after 7164 usecs
calling  init_sd+0x0/0x14b @ 1
initcall init_sd+0x0/0x14b returned 0 after 83 usecs
calling  init_sg+0x0/0xaf @ 1
initcall init_sg+0x0/0xaf returned 0 after 40 usecs
calling  init_ch_module+0x0/0xab @ 1
SCSI Media Changer driver v0.25 
initcall init_ch_module+0x0/0xab returned 0 after 5817 usecs
calling  osd_uld_init+0x0/0xbe @ 1
osd: LOADED open-osd 0.2.1
initcall osd_uld_init+0x0/0xbe returned 0 after 4758 usecs
calling  ahci_pci_driver_init+0x0/0x16 @ 1
initcall ahci_pci_driver_init+0x0/0x16 returned 0 after 54 usecs
calling  ahci_driver_init+0x0/0x11 @ 1
initcall ahci_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  sil24_pci_driver_init+0x0/0x16 @ 1
initcall sil24_pci_driver_init+0x0/0x16 returned 0 after 33 usecs
calling  ahci_highbank_driver_init+0x0/0x11 @ 1
initcall ahci_highbank_driver_init+0x0/0x11 returned 0 after 39 usecs
calling  imx_ahci_driver_init+0x0/0x11 @ 1
initcall imx_ahci_driver_init+0x0/0x11 returned 0 after 24 usecs
calling  qs_ata_pci_driver_init+0x0/0x16 @ 1
initcall qs_ata_pci_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  piix_init+0x0/0x24 @ 1
initcall piix_init+0x0/0x24 returned 0 after 40 usecs
calling  mv_init+0x0/0x3c @ 1
initcall mv_init+0x0/0x3c returned 0 after 66 usecs
calling  nv_pci_driver_init+0x0/0x16 @ 1
sata_nv 0000:00:07.0: version 3.5
sata_nv 0000:00:07.0: setting latency timer to 64
scsi0 : sata_nv
scsi1 : sata_nv
ata1: SATA max UDMA/133 cmd 0x9f0 ctl 0xbf0 bmdma 0xd800 irq 11
ata2: SATA max UDMA/133 cmd 0x970 ctl 0xb70 bmdma 0xd808 irq 11
sata_nv 0000:00:08.0: setting latency timer to 64
calling  2_async_port_probe+0x0/0x49 @ 6
calling  3_async_port_probe+0x0/0x49 @ 91
async_waiting @ 91
scsi2 : sata_nv
scsi3 : sata_nv
ata3: SATA max UDMA/133 cmd 0x9e0 ctl 0xbe0 bmdma 0xc400 irq 5
ata4: SATA max UDMA/133 cmd 0x960 ctl 0xb60 bmdma 0xc408 irq 5
initcall nv_pci_driver_init+0x0/0x16 returned 0 after 97585 usecs
calling  pdc_ata_pci_driver_init+0x0/0x16 @ 1
initcall pdc_ata_pci_driver_init+0x0/0x16 returned 0 after 41 usecs
calling  sis_pci_driver_init+0x0/0x16 @ 1
initcall sis_pci_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  k2_sata_pci_driver_init+0x0/0x16 @ 1
initcall k2_sata_pci_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  uli_pci_driver_init+0x0/0x16 @ 1
initcall uli_pci_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  svia_pci_driver_init+0x0/0x16 @ 1
initcall svia_pci_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  vsc_sata_pci_driver_init+0x0/0x16 @ 1
initcall vsc_sata_pci_driver_init+0x0/0x16 returned 0 after 38 usecs
calling  ali_init+0x0/0x40 @ 1
initcall ali_init+0x0/0x40 returned 0 after 36 usecs
calling  amd_pci_driver_init+0x0/0x16 @ 1
pata_amd 0000:00:06.0: version 0.4.1
pata_amd 0000:00:06.0: setting latency timer to 64
scsi4 : pata_amd
scsi5 : pata_amd
ata5: PATA max UDMA/133 cmd 0x1f0 ctl 0x3f6 bmdma 0xf000 irq 14
ata6: PATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0xf008 irq 15
initcall amd_pci_driver_init+0x0/0x16 returned 0 after 44162 usecs
calling  atp867x_driver_init+0x0/0x16 @ 1
initcall atp867x_driver_init+0x0/0x16 returned 0 after 40 usecs
calling  cmd64x_pci_driver_init+0x0/0x16 @ 1
initcall cmd64x_pci_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  cs5535_pci_driver_init+0x0/0x16 @ 1
initcall cs5535_pci_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  cs5536_pci_driver_init+0x0/0x16 @ 1
initcall cs5536_pci_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  efar_pci_driver_init+0x0/0x16 @ 1
initcall efar_pci_driver_init+0x0/0x16 returned 0 after 33 usecs
calling  hpt36x_pci_driver_init+0x0/0x16 @ 1
initcall hpt36x_pci_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  hpt37x_pci_driver_init+0x0/0x16 @ 1
initcall hpt37x_pci_driver_init+0x0/0x16 returned 0 after 39 usecs
calling  hpt3x2n_pci_driver_init+0x0/0x16 @ 1
initcall hpt3x2n_pci_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  it821x_pci_driver_init+0x0/0x16 @ 1
initcall it821x_pci_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  jmicron_pci_driver_init+0x0/0x16 @ 1
initcall jmicron_pci_driver_init+0x0/0x16 returned 0 after 39 usecs
calling  marvell_pci_driver_init+0x0/0x16 @ 1
initcall marvell_pci_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  ninja32_pci_driver_init+0x0/0x16 @ 1
initcall ninja32_pci_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  oldpiix_pci_driver_init+0x0/0x16 @ 1
initcall oldpiix_pci_driver_init+0x0/0x16 returned 0 after 39 usecs
calling  optidma_pci_driver_init+0x0/0x16 @ 1
initcall optidma_pci_driver_init+0x0/0x16 returned 0 after 40 usecs
calling  pdc2027x_pci_driver_init+0x0/0x16 @ 1
initcall pdc2027x_pci_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  pdc202xx_pci_driver_init+0x0/0x16 @ 1
initcall pdc202xx_pci_driver_init+0x0/0x16 returned 0 after 33 usecs
calling  radisys_pci_driver_init+0x0/0x16 @ 1
initcall radisys_pci_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  rdc_pci_driver_init+0x0/0x16 @ 1
initcall rdc_pci_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  sc1200_pci_driver_init+0x0/0x16 @ 1
initcall sc1200_pci_driver_init+0x0/0x16 returned 0 after 39 usecs
calling  sch_pci_driver_init+0x0/0x16 @ 1
initcall sch_pci_driver_init+0x0/0x16 returned 0 after 38 usecs
calling  serverworks_pci_driver_init+0x0/0x16 @ 1
initcall serverworks_pci_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  sis_pci_driver_init+0x0/0x16 @ 1
initcall sis_pci_driver_init+0x0/0x16 returned 0 after 40 usecs
calling  ata_tosh_pci_driver_init+0x0/0x16 @ 1
initcall ata_tosh_pci_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  triflex_pci_driver_init+0x0/0x16 @ 1
initcall triflex_pci_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  via_pci_driver_init+0x0/0x16 @ 1
initcall via_pci_driver_init+0x0/0x16 returned 0 after 33 usecs
calling  cmd640_pci_driver_init+0x0/0x16 @ 1
initcall cmd640_pci_driver_init+0x0/0x16 returned 0 after 38 usecs
calling  isapnp_init+0x0/0xf @ 1
initcall isapnp_init+0x0/0xf returned 0 after 29 usecs
calling  mpiix_pci_driver_init+0x0/0x16 @ 1
initcall mpiix_pci_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  ns87410_pci_driver_init+0x0/0x16 @ 1
initcall ns87410_pci_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  opti_pci_driver_init+0x0/0x16 @ 1
initcall opti_pci_driver_init+0x0/0x16 returned 0 after 33 usecs
calling  pata_platform_driver_init+0x0/0x11 @ 1
initcall pata_platform_driver_init+0x0/0x11 returned 0 after 34 usecs
calling  pata_of_platform_driver_init+0x0/0x11 @ 1
initcall pata_of_platform_driver_init+0x0/0x11 returned 0 after 25 usecs
calling  rz1000_pci_driver_init+0x0/0x16 @ 1
initcall rz1000_pci_driver_init+0x0/0x16 returned 0 after 39 usecs
calling  ata_generic_pci_driver_init+0x0/0x16 @ 1
initcall ata_generic_pci_driver_init+0x0/0x16 returned 0 after 34 usecs
calling  legacy_init+0x0/0x874 @ 1
initcall legacy_init+0x0/0x874 returned -19 after 14 usecs
calling  target_core_init_configfs+0x0/0x394 @ 1
calling  4_async_port_probe+0x0/0x49 @ 93
calling  5_async_port_probe+0x0/0x49 @ 109
async_waiting @ 109
calling  6_async_port_probe+0x0/0x49 @ 112
calling  7_async_port_probe+0x0/0x49 @ 113
async_waiting @ 113
Switched to clocksource pit
Rounding down aligned max_sectors from 4294967295 to 4294967288
initcall target_core_init_configfs+0x0/0x394 returned 0 after 39537 usecs
calling  iblock_module_init+0x0/0xf @ 1
initcall iblock_module_init+0x0/0xf returned 0 after 4 usecs
calling  fileio_module_init+0x0/0xf @ 1
initcall fileio_module_init+0x0/0xf returned 0 after 4 usecs
calling  tcm_loop_fabric_init+0x0/0x38f @ 1
initcall tcm_loop_fabric_init+0x0/0x38f returned 0 after 132 usecs
calling  iscsi_target_init_module+0x0/0x21e @ 1
initcall iscsi_target_init_module+0x0/0x21e returned 0 after 2380 usecs
calling  hsc_init+0x0/0x66 @ 1
HSI/SSI char device loaded
initcall hsc_init+0x0/0x66 returned 0 after 1828 usecs
calling  bonding_init+0x0/0x97 @ 1
bonding: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
initcall bonding_init+0x0/0x97 returned 0 after 1577 usecs
calling  eql_init_module+0x0/0x6a @ 1
eql: Equalizer2002: Simon Janes (simon@ncm.com) and David S. Miller (davem@redhat.com)
initcall eql_init_module+0x0/0x6a returned 0 after 1500 usecs
calling  macvlan_init_module+0x0/0x31 @ 1
initcall macvlan_init_module+0x0/0x31 returned 0 after 4 usecs
calling  macvtap_init+0x0/0xc7 @ 1
initcall macvtap_init+0x0/0xc7 returned 0 after 33 usecs
calling  net_olddevs_init+0x0/0x50 @ 1
LocalTalk card not found; 220 = ff, 240 = ff.
initcall net_olddevs_init+0x0/0x50 returned 0 after 1451 usecs
calling  cicada_init+0x0/0x14 @ 1
initcall cicada_init+0x0/0x14 returned 0 after 49 usecs
calling  qs6612_init+0x0/0xf @ 1
initcall qs6612_init+0x0/0xf returned 0 after 24 usecs
calling  smsc_init+0x0/0x14 @ 1
initcall smsc_init+0x0/0x14 returned 0 after 119 usecs
calling  broadcom_init+0x0/0x14 @ 1
initcall broadcom_init+0x0/0x14 returned 0 after 253 usecs
calling  bcm87xx_init+0x0/0x14 @ 1
initcall bcm87xx_init+0x0/0x14 returned 0 after 46 usecs
calling  icplus_init+0x0/0x14 @ 1
initcall icplus_init+0x0/0x14 returned 0 after 152 usecs
calling  et1011c_init+0x0/0xf @ 1
initcall et1011c_init+0x0/0xf returned 0 after 23 usecs
calling  ns_init+0x0/0xf @ 1
initcall ns_init+0x0/0xf returned 0 after 24 usecs
calling  ste10Xp_init+0x0/0x14 @ 1
initcall ste10Xp_init+0x0/0x14 returned 0 after 53 usecs
calling  ksphy_init+0x0/0x14 @ 1
initcall ksphy_init+0x0/0x14 returned 0 after 261 usecs
calling  atheros_init+0x0/0x14 @ 1
initcall atheros_init+0x0/0x14 returned 0 after 72 usecs
calling  mdio_mux_gpio_driver_init+0x0/0x11 @ 1
initcall mdio_mux_gpio_driver_init+0x0/0x11 returned 0 after 40 usecs
calling  mdio_mux_mmioreg_driver_init+0x0/0x11 @ 1
initcall mdio_mux_mmioreg_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  team_module_init+0x0/0x80 @ 1
initcall team_module_init+0x0/0x80 returned 0 after 54 usecs
calling  bc_init_module+0x0/0xf @ 1
initcall bc_init_module+0x0/0xf returned 0 after 22 usecs
calling  rr_init_module+0x0/0xf @ 1
initcall rr_init_module+0x0/0xf returned 0 after 4 usecs
calling  rnd_init_module+0x0/0xf @ 1
initcall rnd_init_module+0x0/0xf returned 0 after 4 usecs
calling  ab_init_module+0x0/0xf @ 1
initcall ab_init_module+0x0/0xf returned 0 after 4 usecs
calling  lb_init_module+0x0/0xf @ 1
initcall lb_init_module+0x0/0xf returned 0 after 4 usecs
calling  virtio_net_driver_init+0x0/0xf @ 1
initcall virtio_net_driver_init+0x0/0xf returned 0 after 27 usecs
calling  nlmon_register+0x0/0xf @ 1
initcall nlmon_register+0x0/0xf returned 0 after 4 usecs
calling  ipddp_init_module+0x0/0xed @ 1
ipddp.c:v0.01 8/28/97 Bradford W. Johnson <johns393@maroon.tc.umn.edu>
ipddp0: Appletalk-IP Encap. mode by Bradford W. Johnson <johns393@maroon.tc.umn.edu>
initcall ipddp_init_module+0x0/0xed returned 0 after 3465 usecs
calling  can_dev_init+0x0/0x2c @ 1
CAN device driver interface
initcall can_dev_init+0x0/0x2c returned 0 after 1019 usecs
calling  esd_usb2_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver esd_usb2
initcall esd_usb2_driver_init+0x0/0x16 returned 0 after 839 usecs
calling  kvaser_usb_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver kvaser_usb
initcall kvaser_usb_driver_init+0x0/0x16 returned 0 after 1178 usecs
calling  usb_8dev_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver usb_8dev
initcall usb_8dev_driver_init+0x0/0x16 returned 0 after 1813 usecs
calling  softing_driver_init+0x0/0x11 @ 1
initcall softing_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  sja1000_init+0x0/0x1e @ 1
sja1000 CAN netdevice driver
initcall sja1000_init+0x0/0x1e returned 0 after 1189 usecs
calling  sja1000_isa_init+0x0/0x14f @ 1
sja1000_isa: insufficient parameters supplied
initcall sja1000_isa_init+0x0/0x14f returned -22 after 1137 usecs
calling  sp_driver_init+0x0/0x11 @ 1
initcall sp_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  ems_pcmcia_driver_init+0x0/0xf @ 1
initcall ems_pcmcia_driver_init+0x0/0xf returned 0 after 36 usecs
calling  ems_pci_driver_init+0x0/0x16 @ 1
initcall ems_pci_driver_init+0x0/0x16 returned 0 after 53 usecs
calling  pcan_driver_init+0x0/0xf @ 1
initcall pcan_driver_init+0x0/0xf returned 0 after 26 usecs
calling  peak_pci_driver_init+0x0/0x16 @ 1
initcall peak_pci_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  plx_pci_driver_init+0x0/0x16 @ 1
initcall plx_pci_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  tscan1_init+0x0/0x14 @ 1
initcall tscan1_init+0x0/0x14 returned 0 after 263 usecs
calling  cc770_init+0x0/0x2b @ 1
cc770: CAN netdevice driver
initcall cc770_init+0x0/0x2b returned 0 after 1033 usecs
calling  cc770_isa_init+0x0/0x130 @ 1
cc770_isa: insufficient parameters supplied
initcall cc770_isa_init+0x0/0x130 returned -22 after 798 usecs
calling  cc770_platform_driver_init+0x0/0x11 @ 1
initcall cc770_platform_driver_init+0x0/0x11 returned 0 after 45 usecs
calling  pch_can_pci_driver_init+0x0/0x16 @ 1
initcall pch_can_pci_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  grcan_driver_init+0x0/0x11 @ 1
initcall grcan_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  tc589_driver_init+0x0/0xf @ 1
initcall tc589_driver_init+0x0/0xf returned 0 after 27 usecs
calling  vortex_init+0x0/0x9f @ 1
initcall vortex_init+0x0/0x9f returned 0 after 54 usecs
calling  typhoon_init+0x0/0x16 @ 1
initcall typhoon_init+0x0/0x16 returned 0 after 36 usecs
calling  ne_init+0x0/0x21 @ 1
initcall ne_init+0x0/0x21 returned -19 after 361 usecs
calling  NS8390p_init_module+0x0/0x7 @ 1
initcall NS8390p_init_module+0x0/0x7 returned 0 after 4 usecs
calling  ne2k_pci_init+0x0/0x16 @ 1
initcall ne2k_pci_init+0x0/0x16 returned 0 after 37 usecs
calling  axnet_cs_driver_init+0x0/0xf @ 1
initcall axnet_cs_driver_init+0x0/0xf returned 0 after 31 usecs
calling  pcnet_driver_init+0x0/0xf @ 1
initcall pcnet_driver_init+0x0/0xf returned 0 after 61 usecs
calling  amd8111e_driver_init+0x0/0x16 @ 1
initcall amd8111e_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  nmclan_cs_driver_init+0x0/0xf @ 1
initcall nmclan_cs_driver_init+0x0/0xf returned 0 after 27 usecs
calling  pcnet32_init_module+0x0/0x114 @ 1
pcnet32: pcnet32.c:v1.35 21.Apr.2008 tsbogend@alpha.franken.de
initcall pcnet32_init_module+0x0/0x114 returned 0 after 1118 usecs
calling  b44_init+0x0/0x3f @ 1
initcall b44_init+0x0/0x3f returned 0 after 57 usecs
calling  bnx2_pci_driver_init+0x0/0x16 @ 1
initcall bnx2_pci_driver_init+0x0/0x16 returned 0 after 37 usecs
calling  cnic_init+0x0/0x8b @ 1
cnic: Broadcom NetXtreme II CNIC Driver cnic v2.5.18 (Sept 01, 2013)
initcall cnic_init+0x0/0x8b returned 0 after 1371 usecs
calling  tg3_driver_init+0x0/0x16 @ 1
initcall tg3_driver_init+0x0/0x16 returned 0 after 54 usecs
calling  cxgb_pci_driver_init+0x0/0x16 @ 1
initcall cxgb_pci_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  cxgb4_init_module+0x0/0xa9 @ 1
initcall cxgb4_init_module+0x0/0xa9 returned 0 after 136 usecs
calling  dnet_driver_init+0x0/0x11 @ 1
initcall dnet_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  dmfe_init_module+0x0/0xea @ 1
dmfe: Davicom DM9xxx net driver, version 1.36.4 (2002-01-17)
initcall dmfe_init_module+0x0/0xea returned 0 after 1758 usecs
calling  w840_init+0x0/0x23 @ 1
v1.01-e (2.4 port) Sep-11-2006  Donald Becker <becker@scyld.com>
  http://www.scyld.com/network/drivers.html
v1.01-e (2.4 port) Sep-11-2006  Donald Becker <becker@scyld.com>
  http://www.scyld.com/network/drivers.html
initcall w840_init+0x0/0x23 returned 0 after 1263 usecs
calling  de_init+0x0/0x16 @ 1
initcall de_init+0x0/0x16 returned 0 after 36 usecs
calling  tulip_init+0x0/0x2a @ 1
initcall tulip_init+0x0/0x2a returned 0 after 45 usecs
calling  de4x5_module_init+0x0/0x16 @ 1
initcall de4x5_module_init+0x0/0x16 returned 0 after 36 usecs
calling  uli526x_init_module+0x0/0x9e @ 1
uli526x: ULi M5261/M5263 net driver, version 0.9.3 (2005-7-29)
initcall uli526x_init_module+0x0/0x9e returned 0 after 1119 usecs
calling  s2io_starter+0x0/0x16 @ 1
initcall s2io_starter+0x0/0x16 returned 0 after 36 usecs
calling  jme_init_module+0x0/0x2d @ 1
jme: JMicron JMC2XX ethernet driver version 1.0.8
initcall jme_init_module+0x0/0x2d returned 0 after 870 usecs
calling  mlx4_init+0x0/0x12e @ 1
initcall mlx4_init+0x0/0x12e returned 0 after 99 usecs
calling  mlx4_en_init+0x0/0xf @ 1
initcall mlx4_en_init+0x0/0xf returned 0 after 16 usecs
calling  init+0x0/0x5d @ 1
initcall init+0x0/0x5d returned 0 after 83 usecs
calling  ks8851_platform_driver_init+0x0/0x11 @ 1
initcall ks8851_platform_driver_init+0x0/0x11 returned 0 after 36 usecs
calling  pci_device_driver_init+0x0/0x16 @ 1
initcall pci_device_driver_init+0x0/0x16 returned 0 after 38 usecs
calling  fealnx_init+0x0/0x16 @ 1
initcall fealnx_init+0x0/0x16 returned 0 after 36 usecs
calling  natsemi_init_mod+0x0/0x16 @ 1
initcall natsemi_init_mod+0x0/0x16 returned 0 after 36 usecs
calling  forcedeth_pci_driver_init+0x0/0x16 @ 1
forcedeth: Reverse Engineered nForce ethernet driver. Version 0.64.
forcedeth 0000:00:0a.0: setting latency timer to 64
ata5.00: ATA-6: HDS722525VLAT80, V36OA60A, max UDMA/100
ata5.00: 488397168 sectors, multi 1: LBA48 
ata5: nv_mode_filter: 0x3f39f&0x3f3ff->0x3f39f, BIOS=0x3f000 (0xc60000c0) ACPI=0x0
ata5.00: configured for UDMA/100
async_waiting @ 112
ata1: SATA link down (SStatus 0 SControl 300)
async_waiting @ 6
async_continuing @ 6 after 4 usec
initcall 2_async_port_probe+0x0/0x49 returned 0 after 1220598 usecs
async_continuing @ 91 after 1210867 usec
ata3: SATA link down (SStatus 0 SControl 300)
async_waiting @ 93
ata2: SATA link down (SStatus 0 SControl 300)
async_waiting @ 91
async_continuing @ 91 after 4 usec
initcall 3_async_port_probe+0x0/0x49 returned 0 after 1514369 usecs
async_continuing @ 93 after 268746 usec
initcall 4_async_port_probe+0x0/0x49 returned 0 after 591892 usecs
async_continuing @ 109 after 581639 usec
forcedeth 0000:00:0a.0: ifname eth0, PHY OUI 0x5043 @ 1, addr 00:13:d4:dc:41:12
forcedeth 0000:00:0a.0: highdma csum gbit lnktim desc-v3
initcall forcedeth_pci_driver_init+0x0/0x16 returned 0 after 514388 usecs
calling  ethoc_driver_init+0x0/0x11 @ 1
initcall ethoc_driver_init+0x0/0x11 returned 0 after 42 usecs
calling  yellowfin_init+0x0/0x16 @ 1
initcall yellowfin_init+0x0/0x16 returned 0 after 45 usecs
calling  sh_eth_driver_init+0x0/0x11 @ 1
initcall sh_eth_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  epic_init+0x0/0x16 @ 1
initcall epic_init+0x0/0x16 returned 0 after 44 usecs
calling  smsc9420_init_module+0x0/0x36 @ 1
initcall smsc9420_init_module+0x0/0x36 returned 0 after 36 usecs
calling  gem_driver_init+0x0/0x16 @ 1
initcall gem_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  cas_init+0x0/0x36 @ 1
initcall cas_init+0x0/0x36 returned 0 after 37 usecs
calling  niu_init+0x0/0x36 @ 1
initcall niu_init+0x0/0x36 returned 0 after 45 usecs
calling  velocity_init_module+0x0/0x48 @ 1
initcall velocity_init_module+0x0/0x48 returned 0 after 62 usecs
calling  xirc2ps_cs_driver_init+0x0/0xf @ 1
initcall xirc2ps_cs_driver_init+0x0/0xf returned 0 after 39 usecs
calling  skfddi_pci_driver_init+0x0/0x16 @ 1
initcall skfddi_pci_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  dmascc_init+0x0/0x89c @ 1
dmascc: autoprobing (dangerous)
dmascc: no adapters found
initcall dmascc_init+0x0/0x89c returned -5 after 52931 usecs
calling  scc_init_driver+0x0/0xa5 @ 1
AX.25: Z8530 SCC driver version 3.0.dl1bke
initcall scc_init_driver+0x0/0xa5 returned 0 after 1870 usecs
calling  mkiss_init_driver+0x0/0x3f @ 1
mkiss: AX.25 Multikiss, Hans Albas PE1AYX
initcall mkiss_init_driver+0x0/0x3f returned 0 after 461 usecs
calling  sixpack_init_driver+0x0/0x3f @ 1
AX.25: 6pack driver, Revision: 0.3.0
initcall sixpack_init_driver+0x0/0x3f returned 0 after 1567 usecs
calling  yam_init_driver+0x0/0x120 @ 1
YAM driver version 0.8 by F1OAT/F6FBB
initcall yam_init_driver+0x0/0x120 returned 0 after 1737 usecs
calling  bpq_init_driver+0x0/0x65 @ 1
AX.25: bpqether driver version 004
initcall bpq_init_driver+0x0/0x65 returned 0 after 252 usecs
calling  init_baycomserfdx+0x0/0x100 @ 1
baycom_ser_fdx: (C) 1996-2000 Thomas Sailer, HB9JNX/AE4WA
baycom_ser_fdx: version 0.10
baycom_ser_fdx: (C) 1996-2000 Thomas Sailer, HB9JNX/AE4WA
baycom_ser_fdx: version 0.10
initcall init_baycomserfdx+0x0/0x100 returned 0 after 2460 usecs
calling  hdlcdrv_init_driver+0x0/0x20 @ 1
hdlcdrv: (C) 1996-2000 Thomas Sailer HB9JNX/AE4WA
hdlcdrv: version 0.8
initcall hdlcdrv_init_driver+0x0/0x20 returned 0 after 1618 usecs
calling  init_baycomserhdx+0x0/0xf3 @ 1
baycom_ser_hdx: (C) 1996-2000 Thomas Sailer, HB9JNX/AE4WA
baycom_ser_hdx: version 0.10
baycom_ser_hdx: (C) 1996-2000 Thomas Sailer, HB9JNX/AE4WA
baycom_ser_hdx: version 0.10
initcall init_baycomserhdx+0x0/0xf3 returned 0 after 1408 usecs
calling  init_baycompar+0x0/0xe5 @ 1
baycom_par: (C) 1996-2000 Thomas Sailer, HB9JNX/AE4WA
baycom_par: version 0.9
baycom_par: (C) 1996-2000 Thomas Sailer, HB9JNX/AE4WA
baycom_par: version 0.9
initcall init_baycompar+0x0/0xe5 returned 0 after 2824 usecs
calling  init_baycomepp+0x0/0x111 @ 1
baycom_epp: (C) 1998-2000 Thomas Sailer, HB9JNX/AE4WA
baycom_epp: version 0.7
baycom_epp: (C) 1998-2000 Thomas Sailer, HB9JNX/AE4WA
baycom_epp: version 0.7
initcall init_baycomepp+0x0/0x111 returned 0 after 1838 usecs
calling  nsc_ircc_init+0x0/0x1d8 @ 1
initcall nsc_ircc_init+0x0/0x1d8 returned -19 after 135 usecs
calling  donauboe_init+0x0/0x16 @ 1
initcall donauboe_init+0x0/0x16 returned 0 after 52 usecs
calling  smsc_ircc_init+0x0/0x4f9 @ 1
initcall smsc_ircc_init+0x0/0x4f9 returned -19 after 162 usecs
calling  vlsi_mod_init+0x0/0x110 @ 1
initcall vlsi_mod_init+0x0/0x110 returned 0 after 53 usecs
calling  via_ircc_init+0x0/0x1c @ 1
initcall via_ircc_init+0x0/0x1c returned 0 after 37 usecs
calling  mcs_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver mcs7780
initcall mcs_driver_init+0x0/0x16 returned 0 after 671 usecs
calling  irtty_sir_init+0x0/0x3c @ 1
initcall irtty_sir_init+0x0/0x3c returned 0 after 4 usecs
calling  sir_wq_init+0x0/0x49 @ 1
initcall sir_wq_init+0x0/0x49 returned 0 after 94 usecs
calling  irda_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver kingsun-sir
initcall irda_driver_init+0x0/0x16 returned 0 after 371 usecs
calling  slip_init+0x0/0xa8 @ 1
SLIP: version 0.8.4-NET3.019-NEWTTY (dynamic channels, max=256) (6 bit encapsulation enabled).
CSLIP: code copyright 1989 Regents of the University of California.
SLIP linefill/keepalive option.
initcall slip_init+0x0/0xa8 returned 0 after 2324 usecs
calling  init_x25_asy+0x0/0x6c @ 1
x25_asy: X.25 async: version 0.00 ALPHA (dynamic channels, max=256)
initcall init_x25_asy+0x0/0x6c returned 0 after 964 usecs
calling  lapbeth_init_driver+0x0/0x28 @ 1
LAPB Ethernet driver version 0.02
initcall lapbeth_init_driver+0x0/0x28 returned 0 after 1059 usecs
calling  ipw2100_init+0x0/0x78 @ 1
ipw2100: Intel(R) PRO/Wireless 2100 Network Driver, git-1.2.2
ipw2100: Copyright(c) 2003-2006 Intel Corporation
initcall ipw2100_init+0x0/0x78 returned 0 after 2832 usecs
calling  ipw_init+0x0/0x73 @ 1
ipw2200: Intel(R) PRO/Wireless 2200/2915 Network Driver, 1.2.2kdmq
ipw2200: Copyright(c) 2003-2006 Intel Corporation
initcall ipw_init+0x0/0x73 returned 0 after 1661 usecs
calling  libipw_init+0x0/0x20 @ 1
libipw: 802.11 data/management/control stack, git-1.1.13
libipw: Copyright (C) 2004-2005 Intel Corporation <jketreno@linux.intel.com>
initcall libipw_init+0x0/0x20 returned 0 after 1573 usecs
calling  init_orinoco+0x0/0x1e @ 1
orinoco 0.15 (David Gibson <hermes@gibson.dropbear.id.au>, Pavel Roskin <proski@gnu.org>, et al)
initcall init_orinoco+0x0/0x1e returned 0 after 981 usecs
calling  orinoco_driver_init+0x0/0xf @ 1
initcall orinoco_driver_init+0x0/0xf returned 0 after 41 usecs
calling  orinoco_plx_init+0x0/0x2d @ 1
orinoco_plx 0.15 (Pavel Roskin <proski@gnu.org>, David Gibson <hermes@gibson.dropbear.id.au>, Daniel Barlow <dan@telent.net>)
initcall orinoco_plx_init+0x0/0x2d returned 0 after 1052 usecs
calling  orinoco_driver_init+0x0/0xf @ 1
initcall orinoco_driver_init+0x0/0xf returned 0 after 27 usecs
calling  airo_driver_init+0x0/0xf @ 1
initcall airo_driver_init+0x0/0xf returned 0 after 26 usecs
calling  airo_init_module+0x0/0x113 @ 1
airo(): Probing for PCI adapters
airo(): Finished probing for PCI adapters
initcall airo_init_module+0x0/0x113 returned 0 after 2328 usecs
calling  prism54_module_init+0x0/0x35 @ 1
Loaded prism54 driver, version 1.2
initcall prism54_module_init+0x0/0x35 returned 0 after 1268 usecs
calling  hostap_init+0x0/0x40 @ 1
initcall hostap_init+0x0/0x40 returned 0 after 15 usecs
calling  hostap_driver_init+0x0/0xf @ 1
initcall hostap_driver_init+0x0/0xf returned 0 after 32 usecs
calling  prism2_pci_driver_init+0x0/0x16 @ 1
initcall prism2_pci_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  usb_init+0x0/0xf3 @ 1
zd1211rw usb_init()
usbcore: registered new interface driver zd1211rw
zd1211rw initialized
initcall usb_init+0x0/0xf3 returned 0 after 2291 usecs
calling  rtl8180_driver_init+0x0/0x16 @ 1
initcall rtl8180_driver_init+0x0/0x16 returned 0 after 37 usecs
calling  rtl_core_module_init+0x0/0x45 @ 1
initcall rtl_core_module_init+0x0/0x45 returned 0 after 18 usecs
calling  rtl8192cu_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver rtl8192cu
initcall rtl8192cu_driver_init+0x0/0x16 returned 0 after 1009 usecs
calling  rtl92se_driver_init+0x0/0x16 @ 1
initcall rtl92se_driver_init+0x0/0x16 returned 0 after 45 usecs
calling  rtl8723ae_driver_init+0x0/0x16 @ 1
initcall rtl8723ae_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  rtl88ee_driver_init+0x0/0x16 @ 1
initcall rtl88ee_driver_init+0x0/0x16 returned 0 after 43 usecs
calling  wl3501_driver_init+0x0/0xf @ 1
initcall wl3501_driver_init+0x0/0xf returned 0 after 28 usecs
calling  lbtf_init_module+0x0/0xde @ 1
initcall lbtf_init_module+0x0/0xde returned 0 after 58 usecs
calling  adm8211_driver_init+0x0/0x16 @ 1
initcall adm8211_driver_init+0x0/0x16 returned 0 after 84 usecs
calling  mwl8k_driver_init+0x0/0x16 @ 1
initcall mwl8k_driver_init+0x0/0x16 returned 0 after 51 usecs
calling  iwl_drv_init+0x0/0x7b @ 1
Intel(R) Wireless WiFi driver for Linux, in-tree:
Copyright(c) 2003-2013 Intel Corporation
initcall iwl_drv_init+0x0/0x7b returned 0 after 2177 usecs
calling  iwl_init+0x0/0x55 @ 1
initcall iwl_init+0x0/0x55 returned 0 after 20 usecs
calling  iwl_mvm_init+0x0/0x55 @ 1
initcall iwl_mvm_init+0x0/0x55 returned 0 after 4 usecs
calling  il4965_init+0x0/0x6b @ 1
iwl4965: Intel(R) Wireless WiFi 4965 driver for Linux, in-tree:
iwl4965: Copyright(c) 2003-2011 Intel Corporation
initcall il4965_init+0x0/0x6b returned 0 after 2136 usecs
calling  il3945_init+0x0/0x6b @ 1
iwl3945: Intel(R) PRO/Wireless 3945ABG/BG Network Connection driver for Linux, in-tree:s
iwl3945: Copyright(c) 2003-2011 Intel Corporation
initcall il3945_init+0x0/0x6b returned 0 after 2468 usecs
calling  rt2500pci_driver_init+0x0/0x16 @ 1
initcall rt2500pci_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  rt61pci_driver_init+0x0/0x16 @ 1
initcall rt61pci_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  rt2800pci_init+0x0/0x16 @ 1
initcall rt2800pci_init+0x0/0x16 returned 0 after 39 usecs
calling  rt2800usb_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver rt2800usb
initcall rt2800usb_driver_init+0x0/0x16 returned 0 after 1009 usecs
calling  ath5k_pci_driver_init+0x0/0x16 @ 1
initcall ath5k_pci_driver_init+0x0/0x16 returned 0 after 44 usecs
calling  ath9k_init+0x0/0x39 @ 1
initcall ath9k_init+0x0/0x39 returned 0 after 72 usecs
calling  ath9k_init+0x0/0x7 @ 1
initcall ath9k_init+0x0/0x7 returned 0 after 4 usecs
calling  ath9k_cmn_init+0x0/0x7 @ 1
initcall ath9k_cmn_init+0x0/0x7 returned 0 after 4 usecs
calling  ath9k_htc_init+0x0/0x24 @ 1
usbcore: registered new interface driver ath9k_htc
initcall ath9k_htc_init+0x0/0x24 returned 0 after 1009 usecs
calling  carl9170_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver carl9170
initcall carl9170_driver_init+0x0/0x16 returned 0 after 1816 usecs
calling  ath6kl_sdio_init+0x0/0x2e @ 1
initcall ath6kl_sdio_init+0x0/0x2e returned 0 after 24 usecs
calling  ath6kl_usb_init+0x0/0x36 @ 1
usbcore: registered new interface driver ath6kl_usb
initcall ath6kl_usb_init+0x0/0x36 returned 0 after 1178 usecs
calling  wil6210_driver_init+0x0/0x16 @ 1
initcall wil6210_driver_init+0x0/0x16 returned 0 after 43 usecs
calling  wl12xx_driver_init+0x0/0x11 @ 1
initcall wl12xx_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  brcmfmac_module_init+0x0/0x26 @ 1
initcall brcmfmac_module_init+0x0/0x26 returned 0 after 49 usecs
calling  cw1200_sdio_init+0x0/0x105 @ 1
initcall cw1200_sdio_init+0x0/0x105 returned 0 after 23 usecs
calling  vmxnet3_init_module+0x0/0x35 @ 1
VMware vmxnet3 virtual NIC driver - version 1.2.0.0-k-NAPI
initcall vmxnet3_init_module+0x0/0x35 returned 0 after 1420 usecs
calling  catc_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver catc
initcall catc_driver_init+0x0/0x16 returned 0 after 1138 usecs
calling  pegasus_init+0x0/0x13e @ 1
pegasus: v0.9.3 (2013/04/25), Pegasus/Pegasus II USB Ethernet driver
usbcore: registered new interface driver pegasus
initcall pegasus_init+0x0/0x13e returned 0 after 1795 usecs
calling  rtl8150_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver rtl8150
initcall rtl8150_driver_init+0x0/0x16 returned 0 after 670 usecs
calling  rtl8152_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver r8152
initcall rtl8152_driver_init+0x0/0x16 returned 0 after 1307 usecs
calling  asix_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver asix
initcall asix_driver_init+0x0/0x16 returned 0 after 1139 usecs
calling  ax88179_178a_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ax88179_178a
initcall ax88179_178a_driver_init+0x0/0x16 returned 0 after 540 usecs
calling  cdc_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver cdc_ether
initcall cdc_driver_init+0x0/0x16 returned 0 after 1009 usecs
calling  r815x_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver r815x
initcall r815x_driver_init+0x0/0x16 returned 0 after 330 usecs
calling  eem_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver cdc_eem
initcall eem_driver_init+0x0/0x16 returned 0 after 671 usecs
calling  dm9601_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver dm9601
initcall dm9601_driver_init+0x0/0x16 returned 0 after 501 usecs
calling  smsc75xx_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver smsc75xx
initcall smsc75xx_driver_init+0x0/0x16 returned 0 after 1816 usecs
calling  smsc95xx_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver smsc95xx
initcall smsc95xx_driver_init+0x0/0x16 returned 0 after 839 usecs
calling  gl620a_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver gl620a
initcall gl620a_driver_init+0x0/0x16 returned 0 after 1478 usecs
calling  net1080_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver net1080
initcall net1080_driver_init+0x0/0x16 returned 0 after 671 usecs
calling  rndis_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver rndis_host
initcall rndis_driver_init+0x0/0x16 returned 0 after 201 usecs
calling  cdc_subset_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver cdc_subset
initcall cdc_subset_driver_init+0x0/0x16 returned 0 after 1178 usecs
calling  zaurus_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver zaurus
initcall zaurus_driver_init+0x0/0x16 returned 0 after 500 usecs
calling  usbnet_init+0x0/0x26 @ 1
initcall usbnet_init+0x0/0x26 returned 0 after 11 usecs
calling  int51x1_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver int51x1
initcall int51x1_driver_init+0x0/0x16 returned 0 after 1646 usecs
calling  kalmia_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver kalmia
initcall kalmia_driver_init+0x0/0x16 returned 0 after 500 usecs
calling  ipheth_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ipheth
initcall ipheth_driver_init+0x0/0x16 returned 0 after 500 usecs
calling  cx82310_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver cx82310_eth
initcall cx82310_driver_init+0x0/0x16 returned 0 after 1347 usecs
calling  cdc_ncm_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver cdc_ncm
initcall cdc_ncm_driver_init+0x0/0x16 returned 0 after 670 usecs
calling  qmi_wwan_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver qmi_wwan
initcall qmi_wwan_driver_init+0x0/0x16 returned 0 after 840 usecs
calling  cdc_mbim_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver cdc_mbim
initcall cdc_mbim_driver_init+0x0/0x16 returned 0 after 838 usecs
calling  zatm_init_module+0x0/0x16 @ 1
initcall zatm_init_module+0x0/0x16 returned 0 after 51 usecs
calling  uPD98402_module_init+0x0/0x7 @ 1
initcall uPD98402_module_init+0x0/0x7 returned 0 after 4 usecs
calling  nicstar_init+0x0/0x67 @ 1
initcall nicstar_init+0x0/0x67 returned 0 after 36 usecs
calling  hrz_module_init+0x0/0xc6 @ 1
Madge ATM Horizon [Ultra] driver version 1.2.1
initcall hrz_module_init+0x0/0xc6 returned 0 after 1338 usecs
calling  fore200e_module_init+0x0/0x23 @ 1
fore200e: FORE Systems 200E-series ATM driver - version 0.3e
initcall fore200e_module_init+0x0/0x23 returned 0 after 788 usecs
calling  eni_init+0x0/0x16 @ 1
initcall eni_init+0x0/0x16 returned 0 after 36 usecs
calling  idt77252_init+0x0/0x35 @ 1
idt77252_init: at b2d5a532
initcall idt77252_init+0x0/0x35 returned 0 after 1873 usecs
calling  solos_pci_init+0x0/0x2d @ 1
Solos PCI Driver Version 1.04
initcall solos_pci_init+0x0/0x2d returned 0 after 1393 usecs
calling  adummy_init+0x0/0xe6 @ 1
adummy: version 1.0
initcall adummy_init+0x0/0xe6 returned 0 after 749 usecs
calling  he_driver_init+0x0/0x16 @ 1
initcall he_driver_init+0x0/0x16 returned 0 after 37 usecs
calling  lynx_pci_driver_init+0x0/0x16 @ 1
initcall lynx_pci_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  uio_init+0x0/0xd6 @ 1
initcall uio_init+0x0/0xd6 returned 0 after 32 usecs
calling  uio_pdrv_genirq_init+0x0/0x11 @ 1
initcall uio_pdrv_genirq_init+0x0/0x11 returned 0 after 31 usecs
calling  uio_dmem_genirq_init+0x0/0x11 @ 1
initcall uio_dmem_genirq_init+0x0/0x11 returned 0 after 42 usecs
calling  sercos3_pci_driver_init+0x0/0x16 @ 1
initcall sercos3_pci_driver_init+0x0/0x16 returned 0 after 43 usecs
calling  uio_pci_driver_init+0x0/0x16 @ 1
initcall uio_pci_driver_init+0x0/0x16 returned 0 after 36 usecs
calling  mf624_pci_driver_init+0x0/0x16 @ 1
initcall mf624_pci_driver_init+0x0/0x16 returned 0 after 43 usecs
calling  cdrom_init+0x0/0xc @ 1
initcall cdrom_init+0x0/0xc returned 0 after 27 usecs
calling  nonstatic_sysfs_init+0x0/0xf @ 1
initcall nonstatic_sysfs_init+0x0/0xf returned 0 after 4 usecs
calling  yenta_socket_init+0x0/0x16 @ 1
initcall yenta_socket_init+0x0/0x16 returned 0 after 39 usecs
calling  pd6729_module_init+0x0/0x16 @ 1
initcall pd6729_module_init+0x0/0x16 returned 0 after 36 usecs
calling  init_i82365+0x0/0x497 @ 1
Intel ISA PCIC probe: not found.
initcall init_i82365+0x0/0x497 returned -19 after 1954 usecs
calling  i82092aa_module_init+0x0/0x16 @ 1
initcall i82092aa_module_init+0x0/0x16 returned 0 after 37 usecs
calling  init_tcic+0x0/0x6b1 @ 1
Databook TCIC-2 PCMCIA probe: not found.
initcall init_tcic+0x0/0x6b1 returned -19 after 2282 usecs
calling  mon_init+0x0/0xec @ 1
initcall mon_init+0x0/0xec returned 0 after 217 usecs
calling  ehci_hcd_init+0x0/0xc4 @ 1
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
initcall ehci_hcd_init+0x0/0xc4 returned 0 after 1397 usecs
calling  ehci_pci_init+0x0/0x62 @ 1
ehci-pci: EHCI PCI platform driver
ehci-pci 0000:00:02.1: setting latency timer to 64
ehci-pci 0000:00:02.1: EHCI Host Controller
ehci-pci 0000:00:02.1: new USB bus registered, assigned bus number 1
ehci-pci 0000:00:02.1: debug port 1
ehci-pci 0000:00:02.1: cache line size of 32 is not supported
ehci-pci 0000:00:02.1: irq 11, io mem 0xfeb00000
ata4: SATA link down (SStatus 0 SControl 300)
async_waiting @ 109
async_continuing @ 109 after 4 usec
initcall 5_async_port_probe+0x0/0x49 returned 0 after 1146508 usecs
async_continuing @ 112 after 904807 usec
ehci-pci 0000:00:02.1: USB 2.0 started, EHCI 1.00
scsi 4:0:0:0: Direct-Access     ATA      HDS722525VLAT80  V36O PQ: 0 ANSI: 5
calling  8_sd_probe_async+0x0/0x1ba @ 91
sd 4:0:0:0: [sda] 488397168 512-byte logical blocks: (250 GB/232 GiB)
hub 1-0:1.0: USB hub found
sd 4:0:0:0: [sda] Write Protect is off
sd 4:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 4:0:0:0: Attached scsi generic sg0 type 0
hub 1-0:1.0: 10 ports detected
initcall 6_async_port_probe+0x0/0x49 returned 0 after 1149272 usecs
async_continuing @ 113 after 1139128 usec
sd 4:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
initcall ehci_pci_init+0x0/0x62 returned 0 after 26604 usecs
calling  ehci_platform_init+0x0/0x49 @ 1
ehci-platform: EHCI generic platform driver
initcall ehci_platform_init+0x0/0x49 returned 0 after 1813 usecs
calling  oxu_driver_init+0x0/0x11 @ 1
initcall oxu_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  ohci_hcd_mod_init+0x0/0x84 @ 1
ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
initcall ohci_hcd_mod_init+0x0/0x84 returned 0 after 734 usecs
calling  ohci_platform_init+0x0/0x49 @ 1
ohci-platform: OHCI generic platform driver
initcall ohci_platform_init+0x0/0x49 returned 0 after 1807 usecs
calling  uhci_hcd_init+0x0/0xbb @ 1
uhci_hcd: USB Universal Host Controller Interface driver
initcall uhci_hcd_init+0x0/0xbb returned 0 after 1132 usecs
calling  sl811h_driver_init+0x0/0x11 @ 1
initcall sl811h_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  u132_hcd_init+0x0/0xaf @ 1
driver u132_hcd
initcall u132_hcd_init+0x0/0xaf returned 0 after 2012 usecs
calling  r8a66597_driver_init+0x0/0x11 @ 1
initcall r8a66597_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  isp1760_init+0x0/0x5c @ 1
initcall isp1760_init+0x0/0x5c returned 0 after 162 usecs
calling  ssb_hcd_init+0x0/0x11 @ 1
initcall ssb_hcd_init+0x0/0x11 returned 0 after 31 usecs
calling  fusbh200_hcd_init+0x0/0xae @ 1
fusbh200_hcd: FUSBH200 Host Controller (EHCI) Driver
Warning! fusbh200_hcd should always be loaded before uhci_hcd and ohci_hcd, not after
initcall fusbh200_hcd_init+0x0/0xae returned 0 after 1469 usecs
calling  fotg210_hcd_init+0x0/0xae @ 1
fotg210_hcd: FOTG210 Host Controller (EHCI) Driver
Warning! fotg210_hcd should always be loaded before uhci_hcd and ohci_hcd, not after
initcall fotg210_hcd_init+0x0/0xae returned 0 after 2277 usecs
calling  c67x00_driver_init+0x0/0x11 @ 1
initcall c67x00_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  acm_init+0x0/0xd8 @ 1
usbcore: registered new interface driver cdc_acm
cdc_acm: USB Abstract Control Model driver for USB modems and ISDN adapters
initcall acm_init+0x0/0xd8 returned 0 after 2002 usecs
calling  usblp_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver usblp
initcall usblp_driver_init+0x0/0x16 returned 0 after 333 usecs
calling  wdm_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver cdc_wdm
initcall wdm_driver_init+0x0/0x16 returned 0 after 1646 usecs
calling  usbtmc_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver usbtmc
initcall usbtmc_driver_init+0x0/0x16 returned 0 after 1478 usecs
calling  usb_storage_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver usb-storage
initcall usb_storage_driver_init+0x0/0x16 returned 0 after 1346 usecs
calling  alauda_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ums-alauda
initcall alauda_driver_init+0x0/0x16 returned 0 after 1178 usecs
calling  datafab_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ums-datafab
initcall datafab_driver_init+0x0/0x16 returned 0 after 1348 usecs
calling  ene_ub6250_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ums_eneub6250
initcall ene_ub6250_driver_init+0x0/0x16 returned 0 after 709 usecs
calling  freecom_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ums-freecom
initcall freecom_driver_init+0x0/0x16 returned 0 after 1347 usecs
calling  isd200_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ums-isd200
initcall isd200_driver_init+0x0/0x16 returned 0 after 205 usecs
calling  jumpshot_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ums-jumpshot
initcall jumpshot_driver_init+0x0/0x16 returned 0 after 539 usecs
calling  realtek_cr_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ums-realtek
initcall realtek_cr_driver_init+0x0/0x16 returned 0 after 1348 usecs
calling  usb_serial_init+0x0/0x17f @ 1
usbcore: registered new interface driver usbserial
initcall usb_serial_init+0x0/0x17f returned 0 after 1035 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver ark3116
usbserial: USB Serial support registered for ark3116
initcall usb_serial_module_init+0x0/0x19 returned 0 after 2019 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver belkin_sa
usbserial: USB Serial support registered for Belkin / Peracom / GoHubs USB Serial Adapter
initcall usb_serial_module_init+0x0/0x19 returned 0 after 3737 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver ch341
usbserial: USB Serial support registered for ch341-uart
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1209 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver cp210x
usbserial: USB Serial support registered for cp210x
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1681 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver cyberjack
usbserial: USB Serial support registered for Reiner SCT Cyberjack USB card reader
initcall usb_serial_module_init+0x0/0x19 returned 0 after 2382 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver digi_acceleport
usbserial: USB Serial support registered for Digi 2 port USB adapter
usbserial: USB Serial support registered for Digi 4 port USB adapter
initcall usb_serial_module_init+0x0/0x19 returned 0 after 3299 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver empeg
usbserial: USB Serial support registered for empeg
initcall usb_serial_module_init+0x0/0x19 returned 0 after 2316 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver f81232
usbserial: USB Serial support registered for f81232
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1679 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver ftdi_sio
usbserial: USB Serial support registered for FTDI USB Serial Device
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1807 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver garmin_gps
usbserial: USB Serial support registered for Garmin GPS usb/tty
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1457 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver iuu_phoenix
usbserial: USB Serial support registered for iuu_phoenix
initcall usb_serial_module_init+0x0/0x19 returned 0 after 2395 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver kl5kusb105
usbserial: USB Serial support registered for KL5KUSB105D / PalmConnect
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1666 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver mct_u232
usbserial: USB Serial support registered for MCT U232
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1379 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver metro_usb
usbserial: USB Serial support registered for Metrologic USB to Serial
initcall usb_serial_module_init+0x0/0x19 returned 0 after 2304 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver mos7720
usbserial: USB Serial support registered for Moschip 2 port adapter
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1627 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver mos7840
usbserial: USB Serial support registered for Moschip 7840/7820 USB Serial Driver
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1876 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver navman
usbserial: USB Serial support registered for navman
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1680 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver opticon
usbserial: USB Serial support registered for opticon
initcall usb_serial_module_init+0x0/0x19 returned 0 after 2018 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver oti6858
usbserial: USB Serial support registered for oti6858
initcall usb_serial_module_init+0x0/0x19 returned 0 after 2018 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver pl2303
usbserial: USB Serial support registered for pl2303
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1679 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver quatech2
usbserial: USB Serial support registered for Quatech 2nd gen USB to Serial Driver
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1238 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver sierra
usbserial: USB Serial support registered for Sierra USB modem
initcall usb_serial_module_init+0x0/0x19 returned 0 after 2397 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver usb_serial_simple
usbserial: USB Serial support registered for zio
usbserial: USB Serial support registered for funsoft
usbserial: USB Serial support registered for flashloader
usbserial: USB Serial support registered for vivopay
usbserial: USB Serial support registered for moto_modem
usbserial: USB Serial support registered for hp4x
usbserial: USB Serial support registered for suunto
usbserial: USB Serial support registered for siemens_mpi
initcall usb_serial_module_init+0x0/0x19 returned 0 after 8761 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver ti_usb_3410_5052
usbserial: USB Serial support registered for TI USB 3410 1 port adapter
usbserial: USB Serial support registered for TI USB 5052 2 port adapter
initcall usb_serial_module_init+0x0/0x19 returned 0 after 2533 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver visor
usbserial: USB Serial support registered for Handspring Visor / Palm OS
usbserial: USB Serial support registered for Sony Clie 5.0
usbserial: USB Serial support registered for Sony Clie 3.5
initcall usb_serial_module_init+0x0/0x19 returned 0 after 4736 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver wishbone_serial
usbserial: USB Serial support registered for wishbone_serial
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1796 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver whiteheat
usbserial: USB Serial support registered for Connect Tech - WhiteHEAT - (prerenumeration)
usbserial: USB Serial support registered for Connect Tech - WhiteHEAT
initcall usb_serial_module_init+0x0/0x19 returned 0 after 3078 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver keyspan_pda
usbserial: USB Serial support registered for Keyspan PDA
usbserial: USB Serial support registered for Xircom / Entregra PGS - (prerenumeration)
initcall usb_serial_module_init+0x0/0x19 returned 0 after 2662 usecs
calling  usb_serial_module_init+0x0/0x19 @ 1
usbcore: registered new interface driver xsens_mt
usbserial: USB Serial support registered for xsens_mt
initcall usb_serial_module_init+0x0/0x19 returned 0 after 1380 usecs
calling  adu_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver adutux
initcall adu_driver_init+0x0/0x16 returned 0 after 1478 usecs
calling  cypress_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver cypress_cy7c63
initcall cypress_driver_init+0x0/0x16 returned 0 after 1856 usecs
calling  cytherm_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver cytherm
initcall cytherm_driver_init+0x0/0x16 returned 0 after 1646 usecs
calling  emi26_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver emi26 - firmware loader
initcall emi26_driver_init+0x0/0x16 returned 0 after 1427 usecs
calling  emi62_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver emi62 - firmware loader
initcall emi62_driver_init+0x0/0x16 returned 0 after 1425 usecs
calling  ftdi_elan_init+0x0/0x180 @ 1
driver ftdi-elan
ata6.01: ATAPI: DVDRW IDE 16X, VER A079, max UDMA/66
ata6: nv_mode_filter: 0x1f39f&0x73ff->0x739f, BIOS=0x7000 (0xc60000c0) ACPI=0x0
usbcore: registered new interface driver ftdi-elan
initcall ftdi_elan_init+0x0/0x180 returned 0 after 4651 usecs
calling  idmouse_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver idmouse
initcall idmouse_driver_init+0x0/0x16 returned 0 after 1643 usecs
calling  iowarrior_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver iowarrior
initcall iowarrior_driver_init+0x0/0x16 returned 0 after 1009 usecs
calling  lcd_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver usblcd
initcall lcd_driver_init+0x0/0x16 returned 0 after 1477 usecs
calling  ld_usb_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ldusb
initcall ld_usb_driver_init+0x0/0x16 returned 0 after 330 usecs
calling  led_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver usbled
initcall led_driver_init+0x0/0x16 returned 0 after 1478 usecs
calling  tower_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver legousbtower
initcall tower_driver_init+0x0/0x16 returned 0 after 539 usecs
calling  rio_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver rio500
initcall rio_driver_init+0x0/0x16 returned 0 after 1478 usecs
calling  usbtest_init+0x0/0x54 @ 1
usbcore: registered new interface driver usbtest
initcall usbtest_init+0x0/0x54 returned 0 after 670 usecs
calling  ehset_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver usb_ehset_test
initcall ehset_driver_init+0x0/0x16 returned 0 after 879 usecs
calling  tv_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver trancevibrator
initcall tv_driver_init+0x0/0x16 returned 0 after 878 usecs
calling  uss720_init+0x0/0x52 @ 1
usbcore: registered new interface driver uss720
uss720: v0.6:USB Parport Cable driver for Cables using the Lucent Technologies USS720 Chip
uss720: NOTE: this is a special purpose driver to allow nonstandard
uss720: protocols (eg. bitbang) over USS720 usb to parallel cables
uss720: If you just want to connect to a printer, use usblp instead
initcall uss720_init+0x0/0x52 returned 0 after 5118 usecs
calling  sevseg_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver usbsevseg
initcall sevseg_driver_init+0x0/0x16 returned 0 after 1009 usecs
calling  usb3503_init+0x0/0x4a @ 1
initcall usb3503_init+0x0/0x4a returned 0 after 76 usecs
calling  usb_sisusb_init+0x0/0x1b @ 1
usbcore: registered new interface driver sisusb
initcall usb_sisusb_init+0x0/0x1b returned 0 after 501 usecs
calling  samsung_usb2phy_driver_init+0x0/0x11 @ 1
initcall samsung_usb2phy_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  samsung_usb3phy_driver_init+0x0/0x11 @ 1
initcall samsung_usb3phy_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  gpio_vbus_driver_init+0x0/0x11 @ 1
initcall gpio_vbus_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  rcar_usb_phy_driver_init+0x0/0x11 @ 1
initcall rcar_usb_phy_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  musb_init+0x0/0x1e @ 1
initcall musb_init+0x0/0x1e returned 0 after 27 usecs
calling  tusb_driver_init+0x0/0x11 @ 1
initcall tusb_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  ci_hdrc_driver_init+0x0/0x11 @ 1
initcall ci_hdrc_driver_init+0x0/0x11 returned 0 after 35 usecs
calling  ci_hdrc_msm_driver_init+0x0/0x11 @ 1
initcall ci_hdrc_msm_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  ci_hdrc_pci_driver_init+0x0/0x16 @ 1
initcall ci_hdrc_pci_driver_init+0x0/0x16 returned 0 after 58 usecs
calling  ci_hdrc_imx_driver_init+0x0/0x11 @ 1
initcall ci_hdrc_imx_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  usbmisc_imx_driver_init+0x0/0x11 @ 1
initcall usbmisc_imx_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  renesas_usbhs_driver_init+0x0/0x11 @ 1
initcall renesas_usbhs_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  gadget_cfs_init+0x0/0x19 @ 1
initcall gadget_cfs_init+0x0/0x19 returned 0 after 25 usecs
calling  init+0x0/0x2e8 @ 1
dummy_hcd dummy_hcd.0: USB Host+Gadget Emulator, driver 02 May 2005
dummy_hcd dummy_hcd.0: Dummy host controller
ata6.01: configured for UDMA/33
dummy_hcd dummy_hcd.0: new USB bus registered, assigned bus number 2
async_waiting @ 113
async_continuing @ 113 after 4 usec
hub 2-0:1.0: USB hub found
scsi 5:0:1:0: CD-ROM            DVDRW    IDE 16X          A079 PQ: 0 ANSI: 5
scsi 5:0:1:0: Attached scsi generic sg1 type 5
hub 2-0:1.0: 1 port detected
initcall 7_async_port_probe+0x0/0x49 returned 0 after 1455987 usecs
initcall init+0x0/0x2e8 returned 0 after 12075 usecs
calling  net2272_init+0x0/0x3c @ 1
initcall net2272_init+0x0/0x3c returned 0 after 94 usecs
calling  udc_pci_driver_init+0x0/0x16 @ 1
initcall udc_pci_driver_init+0x0/0x16 returned 0 after 39 usecs
calling  udc_driver_init+0x0/0x11 @ 1
initcall udc_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  r8a66597_driver_init+0x0/0x14 @ 1
initcall r8a66597_driver_init+0x0/0x14 returned -19 after 49 usecs
calling  ncmmod_init+0x0/0xf @ 1
initcall ncmmod_init+0x0/0xf returned 0 after 18 usecs
calling  init+0x0/0xf @ 1
udc dummy_udc.0: registering UDC driver [g_ncm]
using random self ethernet address
using random host ethernet address
g_ncm gadget: adding config #1 'CDC Ethernet (NCM)'/b2c40ce0
g_ncm gadget: adding 'cdc_network'/eb816f00 to config 'CDC Ethernet (NCM)'/b2c40ce0
usb0: HOST MAC aa:99:de:d4:76:38
usb0: MAC 2a:0a:99:90:0a:c2
g_ncm gadget: CDC Network: dual speed IN/ep1in-bulk OUT/ep2out-bulk NOTIFY/ep5in-int
g_ncm gadget: cfg 1/b2c40ce0 speeds: high full
g_ncm gadget:   interface 0 = cdc_network/eb816f00
g_ncm gadget:   interface 1 = cdc_network/eb816f00
g_ncm gadget: NCM Gadget
g_ncm gadget: g_ncm ready
dummy_udc dummy_udc.0: binding gadget driver 'g_ncm'
dummy_hcd dummy_hcd.0: port status 0x00010101 has changes
initcall init+0x0/0xf returned 0 after 15500 usecs
calling  i8042_init+0x0/0x33e @ 1
i8042: PNP: PS/2 Controller [PNP0303] at 0x60,0x64 irq 1
i8042: PNP: PS/2 appears to have AUX port disabled, if this is incorrect please boot with i8042.nopnp
serio: i8042 KBD port at 0x60,0x64 irq 1
initcall i8042_init+0x0/0x33e returned 0 after 4027 usecs
calling  parkbd_init+0x0/0x177 @ 1
parport0: cannot grant exclusive access for device parkbd
initcall parkbd_init+0x0/0x177 returned -19 after 241 usecs
calling  serport_init+0x0/0x2c @ 1
initcall serport_init+0x0/0x2c returned 0 after 4 usecs
calling  ct82c710_init+0x0/0x137 @ 1
initcall ct82c710_init+0x0/0x137 returned -19 after 13 usecs
calling  pcips2_driver_init+0x0/0x16 @ 1
initcall pcips2_driver_init+0x0/0x16 returned 0 after 47 usecs
calling  ps2mult_drv_init+0x0/0x16 @ 1
initcall ps2mult_drv_init+0x0/0x16 returned 0 after 29 usecs
calling  altera_ps2_driver_init+0x0/0x11 @ 1
initcall altera_ps2_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  arc_ps2_driver_init+0x0/0x11 @ 1
initcall arc_ps2_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  apbps2_of_driver_init+0x0/0x11 @ 1
initcall apbps2_of_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  olpc_apsp_driver_init+0x0/0x11 @ 1
initcall olpc_apsp_driver_init+0x0/0x11 returned 0 after 44 usecs
calling  fm801_gp_driver_init+0x0/0x16 @ 1
initcall fm801_gp_driver_init+0x0/0x16 returned 0 after 38 usecs
calling  l4_init+0x0/0x27e @ 1
initcall l4_init+0x0/0x27e returned -19 after 15 usecs
calling  ns558_init+0x0/0x2fe @ 1
initcall ns558_init+0x0/0x2fe returned 0 after 7856 usecs
calling  mousedev_init+0x0/0x7c @ 1
mousedev: PS/2 mouse device common for all mice
initcall mousedev_init+0x0/0x7c returned 0 after 1749 usecs
calling  joydev_init+0x0/0xf @ 1
initcall joydev_init+0x0/0xf returned 0 after 4 usecs
calling  evdev_init+0x0/0xf @ 1
initcall evdev_init+0x0/0xf returned 0 after 4 usecs
calling  adp5588_driver_init+0x0/0x11 @ 1
initcall adp5588_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  adp5589_driver_init+0x0/0x11 @ 1
initcall adp5589_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  atkbd_init+0x0/0x16 @ 1
initcall atkbd_init+0x0/0x16 returned 0 after 29 usecs
calling  cros_ec_keyb_driver_init+0x0/0x11 @ 1
initcall cros_ec_keyb_driver_init+0x0/0x11 returned 0 after 40 usecs
calling  events_driver_init+0x0/0x11 @ 1
initcall events_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  gpio_keys_polled_driver_init+0x0/0x11 @ 1
initcall gpio_keys_polled_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  lm8323_i2c_driver_init+0x0/0x11 @ 1
initcall lm8323_i2c_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  lm8333_driver_init+0x0/0x11 @ 1
initcall lm8333_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  matrix_keypad_driver_init+0x0/0x11 @ 1
initcall matrix_keypad_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  max7359_i2c_driver_init+0x0/0x11 @ 1
initcall max7359_i2c_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  mcs_touchkey_driver_init+0x0/0x11 @ 1
initcall mcs_touchkey_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  mpr_touchkey_driver_init+0x0/0x11 @ 1
initcall mpr_touchkey_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  nkbd_drv_init+0x0/0x16 @ 1
initcall nkbd_drv_init+0x0/0x16 returned 0 after 36 usecs
calling  opencores_kbd_device_driver_init+0x0/0x11 @ 1
initcall opencores_kbd_device_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  qt2160_driver_init+0x0/0x11 @ 1
initcall qt2160_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  skbd_drv_init+0x0/0x16 @ 1
initcall skbd_drv_init+0x0/0x16 returned 0 after 28 usecs
calling  sunkbd_drv_init+0x0/0x16 @ 1
initcall sunkbd_drv_init+0x0/0x16 returned 0 after 27 usecs
calling  tc3589x_keypad_driver_init+0x0/0x11 @ 1
initcall tc3589x_keypad_driver_init+0x0/0x11 returned 0 after 45 usecs
calling  twl4030_kp_driver_init+0x0/0x11 @ 1
initcall twl4030_kp_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  xtkbd_drv_init+0x0/0x16 @ 1
initcall xtkbd_drv_init+0x0/0x16 returned 0 after 36 usecs
calling  auo_pixcir_driver_init+0x0/0x11 @ 1
initcall auo_pixcir_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  cy8ctmg110_driver_init+0x0/0x11 @ 1
initcall cy8ctmg110_driver_init+0x0/0x11 returned 0 after 94 usecs
calling  cyttsp_i2c_driver_init+0x0/0x11 @ 1
initcall cyttsp_i2c_driver_init+0x0/0x11 returned 0 after 49 usecs
calling  da9034_touch_driver_init+0x0/0x11 @ 1
initcall da9034_touch_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  dynapro_drv_init+0x0/0x16 @ 1
initcall dynapro_drv_init+0x0/0x16 returned 0 after 27 usecs
calling  hampshire_drv_init+0x0/0x16 @ 1
initcall hampshire_drv_init+0x0/0x16 returned 0 after 27 usecs
calling  gunze_drv_init+0x0/0x16 @ 1
initcall gunze_drv_init+0x0/0x16 returned 0 after 27 usecs
calling  eeti_ts_driver_init+0x0/0x11 @ 1
initcall eeti_ts_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  elo_drv_init+0x0/0x16 @ 1
initcall elo_drv_init+0x0/0x16 returned 0 after 27 usecs
calling  egalax_ts_driver_init+0x0/0x11 @ 1
initcall egalax_ts_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  fujitsu_drv_init+0x0/0x16 @ 1
initcall fujitsu_drv_init+0x0/0x16 returned 0 after 27 usecs
calling  max11801_ts_driver_init+0x0/0x11 @ 1
initcall max11801_ts_driver_init+0x0/0x11 returned 0 after 25 usecs
calling  mcs5000_ts_driver_init+0x0/0x11 @ 1
initcall mcs5000_ts_driver_init+0x0/0x11 returned 0 after 25 usecs
calling  mms114_driver_init+0x0/0x11 @ 1
initcall mms114_driver_init+0x0/0x11 returned 0 after 25 usecs
calling  mtouch_drv_init+0x0/0x16 @ 1
initcall mtouch_drv_init+0x0/0x16 returned 0 after 27 usecs
calling  mk712_init+0x0/0x1c1 @ 1
mk712: device not present
initcall mk712_init+0x0/0x1c1 returned -19 after 683 usecs
calling  htcpen_isa_init+0x0/0xa @ 1
initcall htcpen_isa_init+0x0/0xa returned -19 after 4 usecs
calling  usbtouch_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver usbtouchscreen
initcall usbtouch_driver_init+0x0/0x16 returned 0 after 879 usecs
calling  ti_tsc_driver_init+0x0/0x11 @ 1
initcall ti_tsc_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  touchit213_drv_init+0x0/0x16 @ 1
initcall touchit213_drv_init+0x0/0x16 returned 0 after 39 usecs
calling  tr_drv_init+0x0/0x16 @ 1
initcall tr_drv_init+0x0/0x16 returned 0 after 28 usecs
calling  tw_drv_init+0x0/0x16 @ 1
initcall tw_drv_init+0x0/0x16 returned 0 after 27 usecs
calling  tsc2007_driver_init+0x0/0x11 @ 1
initcall tsc2007_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  w8001_drv_init+0x0/0x16 @ 1
initcall w8001_drv_init+0x0/0x16 returned 0 after 34 usecs
calling  wacom_i2c_driver_init+0x0/0x11 @ 1
initcall wacom_i2c_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  wm831x_ts_driver_init+0x0/0x11 @ 1
initcall wm831x_ts_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  tps6507x_ts_driver_init+0x0/0x11 @ 1
initcall tps6507x_ts_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  pm860x_onkey_driver_init+0x0/0x11 @ 1
initcall pm860x_onkey_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  bma150_driver_init+0x0/0x11 @ 1
initcall bma150_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  gp2a_i2c_driver_init+0x0/0x11 @ 1
initcall gp2a_i2c_driver_init+0x0/0x11 returned 0 after 25 usecs
calling  gpio_tilt_polled_driver_init+0x0/0x11 @ 1
initcall gpio_tilt_polled_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  keyspan_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver keyspan_remote
initcall keyspan_driver_init+0x0/0x16 returned 0 after 879 usecs
calling  mpu3050_i2c_driver_init+0x0/0x11 @ 1
initcall mpu3050_i2c_driver_init+0x0/0x11 returned 0 after 33 usecs
calling  pcf50633_input_driver_init+0x0/0x11 @ 1
initcall pcf50633_input_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  pcf8574_kp_driver_init+0x0/0x11 @ 1
initcall pcf8574_kp_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  powermate_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver powermate
initcall powermate_driver_init+0x0/0x16 returned 0 after 1008 usecs
calling  twl4030_pwrbutton_driver_init+0x0/0x14 @ 1
initcall twl4030_pwrbutton_driver_init+0x0/0x14 returned -19 after 49 usecs
calling  twl4030_vibra_driver_init+0x0/0x11 @ 1
initcall twl4030_vibra_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  twl6040_vibra_driver_init+0x0/0x11 @ 1
initcall twl6040_vibra_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  uinput_init+0x0/0xf @ 1
gameport gameport0: NS558 PnP Gameport is pnp00:10/gameport0, io 0x201, speed 903kHz
dummy_hcd dummy_hcd.0: port status 0x00010101 has changes
initcall uinput_init+0x0/0xf returned 0 after 17685 usecs
calling  wm831x_on_driver_init+0x0/0x11 @ 1
initcall wm831x_on_driver_init+0x0/0x11 returned 0 after 36 usecs
calling  slidebar_init+0x0/0xab @ 1
ideapad_slidebar: DMI does not match
initcall slidebar_init+0x0/0xab returned -19 after 591 usecs
calling  i2o_iop_init+0x0/0x45 @ 1
I2O subsystem v1.325
i2o: max drivers = 8
 sda: sda1 sda2 sda3 < sda5 sda6 sda7 sda8 sda9 sda10 >
initcall i2o_iop_init+0x0/0x45 returned 0 after 3704 usecs
calling  i2o_config_init+0x0/0x3f @ 1
I2O Configuration OSM v1.323
initcall i2o_config_init+0x0/0x3f returned 0 after 1217 usecs
calling  i2o_bus_init+0x0/0x3e @ 1
I2O Bus Adapter OSM v1.317
initcall i2o_bus_init+0x0/0x3e returned 0 after 858 usecs
calling  i2o_scsi_init+0x0/0x3e @ 1
I2O SCSI Peripheral OSM v1.316
initcall i2o_scsi_init+0x0/0x3e returned 0 after 1550 usecs
calling  i2o_proc_init+0x0/0x169 @ 1
I2O ProcFS OSM v1.316
initcall i2o_proc_init+0x0/0x169 returned 0 after 1018 usecs
calling  pm80x_rtc_driver_init+0x0/0x11 @ 1
initcall pm80x_rtc_driver_init+0x0/0x11 returned 0 after 35 usecs
calling  bq32k_driver_init+0x0/0x11 @ 1
initcall bq32k_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  cmos_init+0x0/0x5e @ 1
rtc_cmos 00:03: rtc core: registered rtc_cmos as rtc0
rtc_cmos 00:03: alarms up to one day, 114 bytes nvram
initcall cmos_init+0x0/0x5e returned 0 after 2174 usecs
calling  ds1286_platform_driver_init+0x0/0x11 @ 1
initcall ds1286_platform_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  ds1307_driver_init+0x0/0x11 @ 1
initcall ds1307_driver_init+0x0/0x11 returned 0 after 52 usecs
calling  ds1374_driver_init+0x0/0x11 @ 1
initcall ds1374_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  ds1511_rtc_driver_init+0x0/0x11 @ 1
initcall ds1511_rtc_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  ds1672_driver_init+0x0/0x11 @ 1
initcall ds1672_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  ds1742_rtc_driver_init+0x0/0x11 @ 1
initcall ds1742_rtc_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  ds3232_driver_init+0x0/0x11 @ 1
initcall ds3232_driver_init+0x0/0x11 returned 0 after 33 usecs
calling  em3027_driver_init+0x0/0x11 @ 1
initcall em3027_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  fm3130_driver_init+0x0/0x11 @ 1
initcall fm3130_driver_init+0x0/0x11 returned 0 after 25 usecs
calling  hid_time_platform_driver_init+0x0/0x11 @ 1
initcall hid_time_platform_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  isl1208_driver_init+0x0/0x11 @ 1
initcall isl1208_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  isl12022_driver_init+0x0/0x11 @ 1
initcall isl12022_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  lp8788_rtc_driver_init+0x0/0x11 @ 1
initcall lp8788_rtc_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  m48t35_platform_driver_init+0x0/0x11 @ 1
initcall m48t35_platform_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  m48t86_rtc_platform_driver_init+0x0/0x11 @ 1
initcall m48t86_rtc_platform_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  max6900_driver_init+0x0/0x11 @ 1
initcall max6900_driver_init+0x0/0x11 returned 0 after 34 usecs
calling  msm6242_rtc_driver_init+0x0/0x14 @ 1
initcall msm6242_rtc_driver_init+0x0/0x14 returned -19 after 62 usecs
calling  pcf2127_driver_init+0x0/0x11 @ 1
initcall pcf2127_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  pcf8523_driver_init+0x0/0x11 @ 1
initcall pcf8523_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  pcf50633_rtc_driver_init+0x0/0x11 @ 1
initcall pcf50633_rtc_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  rs5c372_driver_init+0x0/0x11 @ 1
initcall rs5c372_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  rv3029c2_driver_init+0x0/0x11 @ 1
initcall rv3029c2_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  rx8581_driver_init+0x0/0x11 @ 1
initcall rx8581_driver_init+0x0/0x11 returned 0 after 26 usecs
calling  snvs_rtc_driver_init+0x0/0x11 @ 1
initcall snvs_rtc_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  test_init+0x0/0xa2 @ 1
rtc-test rtc-test.0: rtc core: registered test as rtc1
rtc-test rtc-test.1: rtc core: registered test as rtc2
initcall test_init+0x0/0xa2 returned 0 after 1551 usecs
calling  twl4030rtc_driver_init+0x0/0x11 @ 1
initcall twl4030rtc_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  tps65910_rtc_driver_init+0x0/0x11 @ 1
initcall tps65910_rtc_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  rtc_device_driver_init+0x0/0x11 @ 1
initcall rtc_device_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  wm831x_rtc_driver_init+0x0/0x11 @ 1
initcall wm831x_rtc_driver_init+0x0/0x11 returned 0 after 37 usecs
calling  x1205_driver_init+0x0/0x11 @ 1
initcall x1205_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  moxart_rtc_driver_init+0x0/0x11 @ 1
initcall moxart_rtc_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  smbalert_driver_init+0x0/0x11 @ 1
initcall smbalert_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  i2c_dev_init+0x0/0xb8 @ 1
i2c /dev entries driver
initcall i2c_dev_init+0x0/0xb8 returned 0 after 1356 usecs
calling  ali1535_driver_init+0x0/0x16 @ 1
initcall ali1535_driver_init+0x0/0x16 returned 0 after 47 usecs
calling  amd756_driver_init+0x0/0x16 @ 1
initcall amd756_driver_init+0x0/0x16 returned 0 after 45 usecs
calling  amd756_s4882_init+0x0/0x261 @ 1
initcall amd756_s4882_init+0x0/0x261 returned -19 after 4 usecs
calling  i2c_i801_init+0x0/0x16 @ 1
initcall i2c_i801_init+0x0/0x16 returned 0 after 55 usecs
calling  smbus_sch_driver_init+0x0/0x11 @ 1
initcall smbus_sch_driver_init+0x0/0x11 returned 0 after 38 usecs
calling  ismt_driver_init+0x0/0x16 @ 1
initcall ismt_driver_init+0x0/0x16 returned 0 after 38 usecs
calling  nforce2_driver_init+0x0/0x16 @ 1
sd 4:0:0:0: [sda] Attached SCSI disk
initcall 8_sd_probe_async+0x0/0x1ba returned 0 after 655420 usecs
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-0: nForce2 SMBus adapter at 0x4c00
i2c i2c-1: Transaction failed (0x10)!
g_ncm gadget: resume
dummy_hcd dummy_hcd.0: port status 0x00100503 has changes
i2c i2c-1: Transaction failed (0x10)!
i2c i2c-1: nForce2 SMBus adapter at 0x4c40
initcall nforce2_driver_init+0x0/0x16 returned 0 after 54017 usecs
calling  nforce2_s4985_init+0x0/0x26d @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-0: PCA9556 configuration failed
initcall nforce2_s4985_init+0x0/0x26d returned -5 after 4239 usecs
calling  piix4_driver_init+0x0/0x16 @ 1
initcall piix4_driver_init+0x0/0x16 returned 0 after 54 usecs
calling  i2c_sis5595_init+0x0/0x16 @ 1
initcall i2c_sis5595_init+0x0/0x16 returned 0 after 46 usecs
calling  sis630_driver_init+0x0/0x16 @ 1
initcall sis630_driver_init+0x0/0x16 returned 0 after 39 usecs
calling  vt586b_driver_init+0x0/0x16 @ 1
initcall vt586b_driver_init+0x0/0x16 returned 0 after 45 usecs
calling  i2c_vt596_init+0x0/0x16 @ 1
initcall i2c_vt596_init+0x0/0x16 returned 0 after 40 usecs
calling  dw_i2c_driver_init+0x0/0x16 @ 1
initcall dw_i2c_driver_init+0x0/0x16 returned 0 after 39 usecs
calling  pch_pcidriver_init+0x0/0x16 @ 1
initcall pch_pcidriver_init+0x0/0x16 returned 0 after 38 usecs
calling  kempld_i2c_driver_init+0x0/0x11 @ 1
initcall kempld_i2c_driver_init+0x0/0x11 returned 0 after 52 usecs
calling  i2c_pca_pf_driver_init+0x0/0x11 @ 1
initcall i2c_pca_pf_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  simtec_i2c_driver_init+0x0/0x11 @ 1
initcall simtec_i2c_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  xiic_i2c_driver_init+0x0/0x11 @ 1
initcall xiic_i2c_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  diolan_u2c_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver i2c-diolan-u2c
initcall diolan_u2c_driver_init+0x0/0x16 returned 0 after 877 usecs
calling  i2c_parport_init+0x0/0x4a @ 1
i2c-parport: adapter type unspecified
initcall i2c_parport_init+0x0/0x4a returned -19 after 759 usecs
calling  i2c_parport_init+0x0/0x15d @ 1
i2c-parport-light: adapter type unspecified
initcall i2c_parport_init+0x0/0x15d returned -19 after 800 usecs
calling  taos_init+0x0/0x16 @ 1
initcall taos_init+0x0/0x16 returned 0 after 33 usecs
calling  i2c_tiny_usb_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver i2c-tiny-usb
initcall i2c_tiny_usb_driver_init+0x0/0x16 returned 0 after 1529 usecs
calling  scx200_acb_init+0x0/0x7b @ 1
scx200_acb: NatSemi SCx200 ACCESS.bus Driver
initcall scx200_acb_init+0x0/0x7b returned 0 after 1005 usecs
calling  msp_driver_init+0x0/0x11 @ 1
initcall msp_driver_init+0x0/0x11 returned 0 after 59 usecs
calling  tda7432_driver_init+0x0/0x11 @ 1
initcall tda7432_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  tda9840_driver_init+0x0/0x11 @ 1
initcall tda9840_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  tea6415c_driver_init+0x0/0x11 @ 1
initcall tea6415c_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  saa711x_driver_init+0x0/0x11 @ 1
initcall saa711x_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  saa717x_driver_init+0x0/0x11 @ 1
initcall saa717x_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  saa7127_driver_init+0x0/0x11 @ 1
initcall saa7127_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  saa7185_driver_init+0x0/0x11 @ 1
initcall saa7185_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  saa7191_driver_init+0x0/0x11 @ 1
initcall saa7191_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  adv7343_driver_init+0x0/0x11 @ 1
initcall adv7343_driver_init+0x0/0x11 returned 0 after 36 usecs
calling  adv7393_driver_init+0x0/0x11 @ 1
initcall adv7393_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  vpx3220_driver_init+0x0/0x11 @ 1
initcall vpx3220_driver_init+0x0/0x11 returned 0 after 43 usecs
calling  bt819_driver_init+0x0/0x11 @ 1
initcall bt819_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  ks0127_driver_init+0x0/0x11 @ 1
initcall ks0127_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  ths7303_driver_init+0x0/0x11 @ 1
initcall ths7303_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  ths8200_driver_init+0x0/0x11 @ 1
initcall ths8200_driver_init+0x0/0x11 returned 0 after 46 usecs
calling  tvp5150_driver_init+0x0/0x11 @ 1
initcall tvp5150_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  tw2804_driver_init+0x0/0x11 @ 1
initcall tw2804_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  tw9903_driver_init+0x0/0x11 @ 1
initcall tw9903_driver_init+0x0/0x11 returned 0 after 107 usecs
calling  tw9906_driver_init+0x0/0x11 @ 1
initcall tw9906_driver_init+0x0/0x11 returned 0 after 45 usecs
calling  cs53l32a_driver_init+0x0/0x11 @ 1
initcall cs53l32a_driver_init+0x0/0x11 returned 0 after 36 usecs
calling  m52790_driver_init+0x0/0x11 @ 1
initcall m52790_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  tlv320aic23b_driver_init+0x0/0x11 @ 1
initcall tlv320aic23b_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  uda1342_driver_init+0x0/0x11 @ 1
initcall uda1342_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  wm8775_driver_init+0x0/0x11 @ 1
initcall wm8775_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  wm8739_driver_init+0x0/0x11 @ 1
initcall wm8739_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  vp27smpx_driver_init+0x0/0x11 @ 1
initcall vp27smpx_driver_init+0x0/0x11 returned 0 after 36 usecs
calling  upd64031a_driver_init+0x0/0x11 @ 1
initcall upd64031a_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  upd64083_driver_init+0x0/0x11 @ 1
initcall upd64083_driver_init+0x0/0x11 returned 0 after 34 usecs
calling  ml86v7667_i2c_driver_init+0x0/0x11 @ 1
initcall ml86v7667_i2c_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  au8522_driver_init+0x0/0x11 @ 1
initcall au8522_driver_init+0x0/0x11 returned 0 after 27 usecs
calling  flexcop_module_init+0x0/0x14 @ 1
b2c2-flexcop: B2C2 FlexcopII/II(b)/III digital TV receiver chip loaded successfully
initcall flexcop_module_init+0x0/0x14 returned 0 after 735 usecs
calling  saa7146_vv_init_module+0x0/0x7 @ 1
initcall saa7146_vv_init_module+0x0/0x7 returned 0 after 4 usecs
calling  smscore_module_init+0x0/0x6b @ 1
initcall smscore_module_init+0x0/0x6b returned 0 after 4 usecs
calling  smsdvb_module_init+0x0/0x5f @ 1
initcall smsdvb_module_init+0x0/0x5f returned 0 after 47 usecs
calling  budget_init+0x0/0xf @ 1
saa7146: register extension 'budget dvb'
initcall budget_init+0x0/0xf returned 0 after 329 usecs
calling  budget_av_init+0x0/0xf @ 1
saa7146: register extension 'budget_av'
initcall budget_av_init+0x0/0xf returned 0 after 1134 usecs
calling  av7110_init+0x0/0xf @ 1
saa7146: register extension 'av7110'
initcall av7110_init+0x0/0xf returned 0 after 1603 usecs
calling  flexcop_pci_driver_init+0x0/0x16 @ 1
initcall flexcop_pci_driver_init+0x0/0x16 returned 0 after 38 usecs
calling  pluto2_driver_init+0x0/0x16 @ 1
initcall pluto2_driver_init+0x0/0x16 returned 0 after 43 usecs
calling  pt1_driver_init+0x0/0x16 @ 1
initcall pt1_driver_init+0x0/0x16 returned 0 after 37 usecs
calling  module_init_ngene+0x0/0x23 @ 1
nGene PCIE bridge driver, Copyright (C) 2005-2007 Micronas
initcall module_init_ngene+0x0/0x23 returned 0 after 443 usecs
calling  module_init_ddbridge+0x0/0xa1 @ 1
Digital Devices PCIE bridge driver, Copyright (C) 2010-11 Digital Devices GmbH
initcall module_init_ddbridge+0x0/0xa1 returned 0 after 927 usecs
calling  cx25821_init+0x0/0x3d @ 1
cx25821: driver version 0.0.106 loaded
initcall cx25821_init+0x0/0x3d returned 0 after 963 usecs
calling  ttusb_dec_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ttusb-dec
initcall ttusb_dec_driver_init+0x0/0x16 returned 0 after 1010 usecs
calling  ttusb_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ttusb
initcall ttusb_driver_init+0x0/0x16 returned 0 after 331 usecs
calling  au0828_init+0x0/0xb5 @ 1
au0828 driver loaded
usbcore: registered new interface driver au0828
initcall au0828_init+0x0/0xb5 returned 0 after 2289 usecs
calling  smssdio_module_init+0x0/0x28 @ 1
smssdio: Siano SMS1xxx SDIO driver
smssdio: Copyright Pierre Ossman
initcall smssdio_module_init+0x0/0x28 returned 0 after 2140 usecs
calling  pps_ktimer_init+0x0/0x92 @ 1
usb 2-1: new high-speed USB device number 2 using dummy_hcd
pps pps0: new PPS source ktimer
pps pps0: ktimer PPS source registered
initcall pps_ktimer_init+0x0/0x92 returned 0 after 3272 usecs
calling  pps_tty_init+0x0/0x94 @ 1
pps_ldisc: PPS line discipline registered
initcall pps_tty_init+0x0/0x94 returned 0 after 460 usecs
calling  pps_gpio_driver_init+0x0/0x11 @ 1
initcall pps_gpio_driver_init+0x0/0x11 returned 0 after 63 usecs
calling  pda_power_pdrv_init+0x0/0x11 @ 1
initcall pda_power_pdrv_init+0x0/0x11 returned 0 after 38 usecs
calling  wm8350_power_driver_init+0x0/0x11 @ 1
initcall wm8350_power_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  pm860x_battery_driver_init+0x0/0x11 @ 1
initcall pm860x_battery_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  goldfish_battery_device_init+0x0/0x11 @ 1
initcall goldfish_battery_device_init+0x0/0x11 returned 0 after 31 usecs
calling  sbs_battery_driver_init+0x0/0x11 @ 1
initcall sbs_battery_driver_init+0x0/0x11 returned 0 after 34 usecs
calling  bq27x00_battery_init+0x0/0x7 @ 1
initcall bq27x00_battery_init+0x0/0x7 returned 0 after 4 usecs
calling  da903x_battery_driver_init+0x0/0x11 @ 1
initcall da903x_battery_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  max17040_i2c_driver_init+0x0/0x11 @ 1
initcall max17040_i2c_driver_init+0x0/0x11 returned 0 after 45 usecs
calling  twl4030_madc_battery_driver_init+0x0/0x11 @ 1
initcall twl4030_madc_battery_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  pm860x_charger_driver_init+0x0/0x11 @ 1
initcall pm860x_charger_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  pcf50633_mbc_driver_init+0x0/0x11 @ 1
initcall pcf50633_mbc_driver_init+0x0/0x11 returned 0 after 38 usecs
calling  rx51_battery_driver_init+0x0/0x11 @ 1
initcall rx51_battery_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  twl4030_bci_driver_init+0x0/0x14 @ 1
initcall twl4030_bci_driver_init+0x0/0x14 returned -19 after 49 usecs
calling  lp8727_driver_init+0x0/0x11 @ 1
initcall lp8727_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  lp8788_charger_driver_init+0x0/0x11 @ 1
initcall lp8788_charger_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  gpio_charger_driver_init+0x0/0x11 @ 1
initcall gpio_charger_driver_init+0x0/0x11 returned 0 after 46 usecs
calling  bq2415x_driver_init+0x0/0x11 @ 1
initcall bq2415x_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  bq24190_driver_init+0x0/0x11 @ 1
initcall bq24190_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  smb347_driver_init+0x0/0x11 @ 1
initcall smb347_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  tps65090_charger_driver_init+0x0/0x11 @ 1
initcall tps65090_charger_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  asb100_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall asb100_driver_init+0x0/0x11 returned 0 after 6270 usecs
calling  sensors_w83627hf_init+0x0/0x12e @ 1
initcall sensors_w83627hf_init+0x0/0x12e returned -19 after 20 usecs
calling  w83793_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall w83793_driver_init+0x0/0x11 returned 0 after 23535 usecs
calling  w83795_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
g_ncm gadget: resume
dummy_hcd dummy_hcd.0: port status 0x00100503 has changes
i2c i2c-1: Transaction failed (0x10)!
initcall w83795_driver_init+0x0/0x11 returned 0 after 23502 usecs
calling  w83791d_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall w83791d_driver_init+0x0/0x11 returned 0 after 23331 usecs
calling  ad7418_driver_init+0x0/0x11 @ 1
initcall ad7418_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  adm1021_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
dummy_udc dummy_udc.0: set_address = 2
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
g_ncm gadget: high-speed config #1: CDC Ethernet (NCM)
g_ncm gadget: init ncm ctrl 0
dummy_udc dummy_udc.0: enabled ep5in-int (ep5in-intr) maxpacket 16 stream disabled
g_ncm gadget: notify speed 425984000
g_ncm gadget: notify connect false
i2c i2c-1: Transaction failed (0x10)!
g_ncm gadget: ncm reqa1.80 v0000 i0000 l28
g_ncm gadget: non-CRC mode selected
g_ncm gadget: ncm req21.8a v0000 i0000 l0
i2c i2c-1: Transaction failed (0x10)!
g_ncm gadget: NCM16 selected
g_ncm gadget: ncm req21.84 v0000 i0000 l0
g_ncm gadget: init ncm
g_ncm gadget: activate ncm
dummy_udc dummy_udc.0: enabled ep1in-bulk (ep1in-bulk) maxpacket 512 stream disabled
dummy_udc dummy_udc.0: enabled ep2out-bulk (ep2out-bulk) maxpacket 512 stream disabled
usb0: qlen 10
g_ncm gadget: ncm_close
i2c i2c-1: Transaction failed (0x10)!
usb 2-1: MAC-Address: aa:99:de:d4:76:38
cdc_ncm 2-1:1.0 usb1: register 'cdc_ncm' at usb-dummy_hcd.0-1, CDC NCM, aa:99:de:d4:76:38
i2c i2c-1: Transaction failed (0x10)!
initcall adm1021_driver_init+0x0/0x11 returned 0 after 58900 usecs
calling  adm1025_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall adm1025_driver_init+0x0/0x11 returned 0 after 17304 usecs
calling  adm1026_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall adm1026_driver_init+0x0/0x11 returned 0 after 17313 usecs
calling  adm1029_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall adm1029_driver_init+0x0/0x11 returned 0 after 46601 usecs
calling  adm9240_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall adm9240_driver_init+0x0/0x11 returned 0 after 23163 usecs
calling  ads1015_driver_init+0x0/0x11 @ 1
initcall ads1015_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  ads7828_driver_init+0x0/0x11 @ 1
initcall ads7828_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  adt7410_driver_init+0x0/0x11 @ 1
initcall adt7410_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  adt7411_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall adt7411_driver_init+0x0/0x11 returned 0 after 18023 usecs
calling  adt7462_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall adt7462_driver_init+0x0/0x11 returned 0 after 11450 usecs
calling  adt7470_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall adt7470_driver_init+0x0/0x11 returned 0 after 17303 usecs
calling  adt7475_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall adt7475_driver_init+0x0/0x11 returned 0 after 17313 usecs
calling  applesmc_init+0x0/0x2d @ 1
applesmc: supported laptop not found!
applesmc: driver init failed (ret=-19)!
initcall applesmc_init+0x0/0x2d returned -19 after 1859 usecs
calling  sm_asc7621_init+0x0/0x73 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall sm_asc7621_init+0x0/0x73 returned 0 after 17720 usecs
calling  atxp1_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall atxp1_driver_init+0x0/0x11 returned 0 after 11487 usecs
calling  coretemp_init+0x0/0x183 @ 1
initcall coretemp_init+0x0/0x183 returned -19 after 4 usecs
calling  dme1737_init+0x0/0x155 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall dme1737_init+0x0/0x155 returned 0 after 17503 usecs
calling  ds620_driver_init+0x0/0x11 @ 1
initcall ds620_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  ds1621_driver_init+0x0/0x11 @ 1
initcall ds1621_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  emc2103_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall emc2103_driver_init+0x0/0x11 returned 0 after 6152 usecs
calling  f71882fg_init+0x0/0x114 @ 1
initcall f71882fg_init+0x0/0x114 returned -19 after 31 usecs
calling  f75375_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall f75375_driver_init+0x0/0x11 returned 0 after 12401 usecs
calling  fam15h_power_driver_init+0x0/0x16 @ 1
initcall fam15h_power_driver_init+0x0/0x16 returned 0 after 61 usecs
calling  fschmd_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall fschmd_driver_init+0x0/0x11 returned 0 after 6430 usecs
calling  g760a_driver_init+0x0/0x11 @ 1
initcall g760a_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  gl520_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall gl520_driver_init+0x0/0x11 returned 0 after 12063 usecs
calling  hih6130_driver_init+0x0/0x11 @ 1
initcall hih6130_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  htu21_driver_init+0x0/0x11 @ 1
initcall htu21_driver_init+0x0/0x11 returned 0 after 28 usecs
calling  i5k_amb_init+0x0/0x56 @ 1
initcall i5k_amb_init+0x0/0x56 returned 0 after 201 usecs
calling  aem_init+0x0/0x43 @ 1
initcall aem_init+0x0/0x43 returned 0 after 32 usecs
calling  iio_hwmon_driver_init+0x0/0x11 @ 1
initcall iio_hwmon_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  ina2xx_driver_init+0x0/0x11 @ 1
initcall ina2xx_driver_init+0x0/0x11 returned 0 after 37 usecs
calling  sm_it87_init+0x0/0x5ab @ 1
it87: Found IT8712F chip at 0x290, revision 7
it87: VID is disabled (pins used for GPIO)
it87 it87.656: Detected broken BIOS defaults, disabling PWM interface
initcall sm_it87_init+0x0/0x5ab returned 0 after 3399 usecs
calling  k10temp_driver_init+0x0/0x16 @ 1
initcall k10temp_driver_init+0x0/0x16 returned 0 after 47 usecs
calling  pem_driver_init+0x0/0x11 @ 1
initcall pem_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  lm63_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall lm63_driver_init+0x0/0x11 returned 0 after 17458 usecs
calling  lm73_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall lm73_driver_init+0x0/0x11 returned 0 after 34924 usecs
calling  lm77_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall lm77_driver_init+0x0/0x11 returned 0 after 23204 usecs
calling  sm_lm78_init+0x0/0x393 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall sm_lm78_init+0x0/0x393 returned 0 after 47151 usecs
calling  lm80_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall lm80_driver_init+0x0/0x11 returned 0 after 47142 usecs
calling  lm95234_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall lm95234_driver_init+0x0/0x11 returned 0 after 17820 usecs
calling  ltc4245_driver_init+0x0/0x11 @ 1
initcall ltc4245_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  max16065_driver_init+0x0/0x11 @ 1
initcall max16065_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  max1619_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall max1619_driver_init+0x0/0x11 returned 0 after 52588 usecs
calling  max1668_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall max1668_driver_init+0x0/0x11 returned 0 after 52461 usecs
calling  max197_driver_init+0x0/0x11 @ 1
initcall max197_driver_init+0x0/0x11 returned 0 after 36 usecs
calling  max6639_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall max6639_driver_init+0x0/0x11 returned 0 after 17888 usecs
calling  max6642_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall max6642_driver_init+0x0/0x11 returned 0 after 46603 usecs
calling  sensors_nct6775_init+0x0/0x323 @ 1
initcall sensors_nct6775_init+0x0/0x323 returned -19 after 90 usecs
calling  ntc_thermistor_driver_init+0x0/0x11 @ 1
initcall ntc_thermistor_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  pc87360_init+0x0/0x16f @ 1
pc87360: PC8736x not detected, module not inserted
initcall pc87360_init+0x0/0x16f returned -19 after 1006 usecs
calling  pcf8591_init+0x0/0x38 @ 1
initcall pcf8591_init+0x0/0x38 returned 0 after 31 usecs
calling  sht15_driver_init+0x0/0x11 @ 1
initcall sht15_driver_init+0x0/0x11 returned 0 after 34 usecs
calling  sm_sis5595_init+0x0/0x16 @ 1
initcall sm_sis5595_init+0x0/0x16 returned 0 after 39 usecs
calling  smm665_driver_init+0x0/0x11 @ 1
initcall smm665_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  smsc47b397_init+0x0/0x167 @ 1
initcall smsc47b397_init+0x0/0x167 returned -19 after 9 usecs
calling  sm_smsc47m1_init+0x0/0x244 @ 1
initcall sm_smsc47m1_init+0x0/0x244 returned -19 after 9 usecs
calling  thmc50_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall thmc50_driver_init+0x0/0x11 returned 0 after 17778 usecs
calling  tmp102_driver_init+0x0/0x11 @ 1
initcall tmp102_driver_init+0x0/0x11 returned 0 after 37 usecs
calling  via_cputemp_init+0x0/0x12d @ 1
initcall via_cputemp_init+0x0/0x12d returned -19 after 4 usecs
calling  vt1211_init+0x0/0x155 @ 1
initcall vt1211_init+0x0/0x155 returned -19 after 20 usecs
calling  sm_vt8231_init+0x0/0x16 @ 1
initcall sm_vt8231_init+0x0/0x16 returned 0 after 38 usecs
calling  sensors_w83627ehf_init+0x0/0x130 @ 1
initcall sensors_w83627ehf_init+0x0/0x130 returned -19 after 34 usecs
calling  w83l785ts_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall w83l785ts_driver_init+0x0/0x11 returned 0 after 5585 usecs
calling  w83l786ng_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall w83l786ng_driver_init+0x0/0x11 returned 0 after 11919 usecs
calling  wm831x_hwmon_driver_init+0x0/0x11 @ 1
initcall wm831x_hwmon_driver_init+0x0/0x11 returned 0 after 41 usecs
calling  wm8350_hwmon_driver_init+0x0/0x11 @ 1
initcall wm8350_hwmon_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  pmbus_driver_init+0x0/0x11 @ 1
initcall pmbus_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  adm1275_driver_init+0x0/0x11 @ 1
initcall adm1275_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  ltc2978_driver_init+0x0/0x11 @ 1
initcall ltc2978_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  max16064_driver_init+0x0/0x11 @ 1
initcall max16064_driver_init+0x0/0x11 returned 0 after 36 usecs
calling  max34440_driver_init+0x0/0x11 @ 1
initcall max34440_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  max8688_driver_init+0x0/0x11 @ 1
initcall max8688_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  ucd9200_driver_init+0x0/0x11 @ 1
initcall ucd9200_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  zl6100_driver_init+0x0/0x11 @ 1
initcall zl6100_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  pkg_temp_thermal_init+0x0/0x4d6 @ 1
initcall pkg_temp_thermal_init+0x0/0x4d6 returned -19 after 4 usecs
calling  vhci_init+0x0/0x26 @ 1
Bluetooth: Virtual HCI driver ver 1.3
initcall vhci_init+0x0/0x26 returned 0 after 1846 usecs
calling  bfusb_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver bfusb
initcall bfusb_driver_init+0x0/0x16 returned 0 after 1308 usecs
calling  dtl1_driver_init+0x0/0xf @ 1
initcall dtl1_driver_init+0x0/0xf returned 0 after 38 usecs
calling  bt3c_driver_init+0x0/0xf @ 1
initcall bt3c_driver_init+0x0/0xf returned 0 after 26 usecs
calling  btuart_driver_init+0x0/0xf @ 1
initcall btuart_driver_init+0x0/0xf returned 0 after 26 usecs
calling  btsdio_init+0x0/0x26 @ 1
Bluetooth: Generic Bluetooth SDIO driver ver 0.1
initcall btsdio_init+0x0/0x26 returned 0 after 690 usecs
calling  btwilink_driver_init+0x0/0x11 @ 1
initcall btwilink_driver_init+0x0/0x11 returned 0 after 35 usecs
calling  mmc_blk_init+0x0/0x6d @ 1
initcall mmc_blk_init+0x0/0x6d returned 0 after 27 usecs
calling  mmc_test_init+0x0/0xf @ 1
initcall mmc_test_init+0x0/0xf returned 0 after 40 usecs
calling  sdio_uart_init+0x0/0xc9 @ 1
initcall sdio_uart_init+0x0/0xc9 returned 0 after 43 usecs
calling  sdhci_drv_init+0x0/0x20 @ 1
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
initcall sdhci_drv_init+0x0/0x20 returned 0 after 1767 usecs
calling  sdhci_driver_init+0x0/0x16 @ 1
initcall sdhci_driver_init+0x0/0x16 returned 0 after 55 usecs
calling  goldfish_mmc_driver_init+0x0/0x11 @ 1
initcall goldfish_mmc_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  sdricoh_driver_init+0x0/0xf @ 1
initcall sdricoh_driver_init+0x0/0xf returned 0 after 27 usecs
calling  ushc_driver_init+0x0/0x16 @ 1
usbcore: registered new interface driver ushc
initcall ushc_driver_init+0x0/0x16 returned 0 after 1138 usecs
calling  rtsx_pci_sdmmc_driver_init+0x0/0x11 @ 1
initcall rtsx_pci_sdmmc_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  sdhci_pltfm_drv_init+0x0/0x14 @ 1
sdhci-pltfm: SDHCI platform and OF driver helper
initcall sdhci_pltfm_drv_init+0x0/0x14 returned 0 after 1645 usecs
calling  memstick_init+0x0/0x8c @ 1
initcall memstick_init+0x0/0x8c returned 0 after 141 usecs
calling  msb_init+0x0/0x66 @ 1
initcall msb_init+0x0/0x66 returned 0 after 25 usecs
calling  tifm_ms_init+0x0/0xf @ 1
initcall tifm_ms_init+0x0/0xf returned 0 after 25 usecs
calling  jmb38x_ms_driver_init+0x0/0x16 @ 1
initcall jmb38x_ms_driver_init+0x0/0x16 returned 0 after 46 usecs
calling  r852_pci_driver_init+0x0/0x16 @ 1
initcall r852_pci_driver_init+0x0/0x16 returned 0 after 45 usecs
calling  rtsx_pci_ms_driver_init+0x0/0x11 @ 1
initcall rtsx_pci_ms_driver_init+0x0/0x11 returned 0 after 33 usecs
calling  lm3533_led_driver_init+0x0/0x11 @ 1
initcall lm3533_led_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  lm3642_i2c_driver_init+0x0/0x11 @ 1
initcall lm3642_i2c_driver_init+0x0/0x11 returned 0 after 33 usecs
calling  pca9532_driver_init+0x0/0x11 @ 1
initcall pca9532_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  gpio_led_driver_init+0x0/0x11 @ 1
initcall gpio_led_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  lp5523_driver_init+0x0/0x11 @ 1
initcall lp5523_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  lp8501_driver_init+0x0/0x11 @ 1
initcall lp8501_driver_init+0x0/0x11 returned 0 after 36 usecs
calling  tca6507_driver_init+0x0/0x11 @ 1
initcall tca6507_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  ot200_led_driver_init+0x0/0x11 @ 1
initcall ot200_led_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  pca963x_driver_init+0x0/0x11 @ 1
initcall pca963x_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  da903x_led_driver_init+0x0/0x11 @ 1
initcall da903x_led_driver_init+0x0/0x11 returned 0 after 33 usecs
calling  wm8350_led_driver_init+0x0/0x11 @ 1
initcall wm8350_led_driver_init+0x0/0x11 returned 0 after 33 usecs
calling  led_pwm_driver_init+0x0/0x11 @ 1
initcall led_pwm_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  regulator_led_driver_init+0x0/0x11 @ 1
initcall regulator_led_driver_init+0x0/0x11 returned 0 after 91 usecs
calling  lt3593_led_driver_init+0x0/0x11 @ 1
initcall lt3593_led_driver_init+0x0/0x11 returned 0 after 50 usecs
calling  lm355x_i2c_driver_init+0x0/0x11 @ 1
initcall lm355x_i2c_driver_init+0x0/0x11 returned 0 after 46 usecs
calling  blinkm_driver_init+0x0/0x11 @ 1
i2c i2c-0: Transaction failed (0x10)!
i2c i2c-1: Transaction failed (0x10)!
initcall blinkm_driver_init+0x0/0x11 returned 0 after 5914 usecs
calling  timer_trig_init+0x0/0xf @ 1
initcall timer_trig_init+0x0/0xf returned 0 after 45 usecs
calling  heartbeat_trig_init+0x0/0x32 @ 1
initcall heartbeat_trig_init+0x0/0x32 returned 0 after 6 usecs
calling  bl_trig_init+0x0/0xf @ 1
initcall bl_trig_init+0x0/0xf returned 0 after 4 usecs
calling  gpio_trig_init+0x0/0xf @ 1
initcall gpio_trig_init+0x0/0xf returned 0 after 4 usecs
calling  ledtrig_cpu_init+0x0/0x53 @ 1
ledtrig-cpu: registered to indicate activity on CPUs
initcall ledtrig_cpu_init+0x0/0x53 returned 0 after 369 usecs
calling  transient_trig_init+0x0/0xf @ 1
initcall transient_trig_init+0x0/0xf returned 0 after 4 usecs
calling  ledtrig_camera_init+0x0/0x25 @ 1
initcall ledtrig_camera_init+0x0/0x25 returned 0 after 5 usecs
calling  ib_core_init+0x0/0xa8 @ 1
initcall ib_core_init+0x0/0xa8 returned 0 after 71 usecs
calling  ib_mad_init_module+0x0/0xca @ 1
initcall ib_mad_init_module+0x0/0xca returned 0 after 42 usecs
calling  ib_sa_init+0x0/0x59 @ 1
initcall ib_sa_init+0x0/0x59 returned 0 after 80 usecs
calling  ib_cm_init+0x0/0x152 @ 1
initcall ib_cm_init+0x0/0x152 returned 0 after 94 usecs
calling  iw_cm_init+0x0/0x49 @ 1
initcall iw_cm_init+0x0/0x49 returned 0 after 58 usecs
calling  addr_init+0x0/0x58 @ 1
initcall addr_init+0x0/0x58 returned 0 after 76 usecs
calling  cma_init+0x0/0xcf @ 1
initcall cma_init+0x0/0xcf returned 0 after 98 usecs
calling  mthca_init+0x0/0x15f @ 1
initcall mthca_init+0x0/0x15f returned 0 after 113 usecs
calling  c2_pci_driver_init+0x0/0x16 @ 1
initcall c2_pci_driver_init+0x0/0x16 returned 0 after 38 usecs
calling  mlx4_ib_init+0x0/0x7d @ 1
initcall mlx4_ib_init+0x0/0x7d returned 0 after 116 usecs
calling  mlx5_ib_init+0x0/0x16 @ 1
initcall mlx5_ib_init+0x0/0x16 returned 0 after 39 usecs
calling  nes_init_module+0x0/0x100 @ 1
initcall nes_init_module+0x0/0x100 returned 0 after 172 usecs
calling  ipoib_init_module+0x0/0x122 @ 1
initcall ipoib_init_module+0x0/0x122 returned 0 after 83 usecs
calling  srp_init_module+0x0/0x12b @ 1
initcall srp_init_module+0x0/0x12b returned 0 after 33 usecs
calling  isert_init+0x0/0xd4 @ 1
initcall isert_init+0x0/0xd4 returned 0 after 26 usecs
calling  dcdbas_init+0x0/0x57 @ 1
dcdbas dcdbas: Dell Systems Management Base Driver (version 5.6.0-3.2)
initcall dcdbas_init+0x0/0x57 returned 0 after 562 usecs
calling  cs5535_mfgpt_init+0x0/0x107 @ 1
cs5535-clockevt: Could not allocate MFGPT timer
initcall cs5535_mfgpt_init+0x0/0x107 returned -19 after 1488 usecs
calling  hid_init+0x0/0x43 @ 1
initcall hid_init+0x0/0x43 returned 0 after 56 usecs
calling  apple_driver_init+0x0/0x16 @ 1
initcall apple_driver_init+0x0/0x16 returned 0 after 32 usecs
calling  appleir_driver_init+0x0/0x16 @ 1
initcall appleir_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  aureal_driver_init+0x0/0x16 @ 1
initcall aureal_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  belkin_driver_init+0x0/0x16 @ 1
initcall belkin_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  ch_driver_init+0x0/0x16 @ 1
initcall ch_driver_init+0x0/0x16 returned 0 after 37 usecs
calling  ch_driver_init+0x0/0x16 @ 1
initcall ch_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  cp_driver_init+0x0/0x16 @ 1
initcall cp_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  dr_driver_init+0x0/0x16 @ 1
initcall dr_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  ems_driver_init+0x0/0x16 @ 1
initcall ems_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  elecom_driver_init+0x0/0x16 @ 1
initcall elecom_driver_init+0x0/0x16 returned 0 after 33 usecs
calling  elo_driver_init+0x0/0x74 @ 1
initcall elo_driver_init+0x0/0x74 returned 0 after 85 usecs
calling  ez_driver_init+0x0/0x16 @ 1
initcall ez_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  gyration_driver_init+0x0/0x16 @ 1
initcall gyration_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  holtek_kbd_driver_init+0x0/0x16 @ 1
initcall holtek_kbd_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  holtek_mouse_driver_init+0x0/0x16 @ 1
initcall holtek_mouse_driver_init+0x0/0x16 returned 0 after 26 usecs
calling  holtek_driver_init+0x0/0x16 @ 1
initcall holtek_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  huion_driver_init+0x0/0x16 @ 1
initcall huion_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  ks_driver_init+0x0/0x16 @ 1
initcall ks_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  keytouch_driver_init+0x0/0x16 @ 1
initcall keytouch_driver_init+0x0/0x16 returned 0 after 25 usecs
calling  kye_driver_init+0x0/0x16 @ 1
initcall kye_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  magicmouse_driver_init+0x0/0x16 @ 1
initcall magicmouse_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  ms_driver_init+0x0/0x16 @ 1
initcall ms_driver_init+0x0/0x16 returned 0 after 25 usecs
calling  ntrig_driver_init+0x0/0x16 @ 1
initcall ntrig_driver_init+0x0/0x16 returned 0 after 48 usecs
calling  ortek_driver_init+0x0/0x16 @ 1
initcall ortek_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  pl_driver_init+0x0/0x16 @ 1
initcall pl_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  pl_driver_init+0x0/0x16 @ 1
initcall pl_driver_init+0x0/0x16 returned 0 after 25 usecs
calling  picolcd_driver_init+0x0/0x16 @ 1
initcall picolcd_driver_init+0x0/0x16 returned 0 after 25 usecs
calling  px_driver_init+0x0/0x16 @ 1
initcall px_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  roccat_init+0x0/0x87 @ 1
initcall roccat_init+0x0/0x87 returned 0 after 7 usecs
calling  arvo_init+0x0/0x50 @ 1
initcall arvo_init+0x0/0x50 returned 0 after 47 usecs
calling  isku_init+0x0/0x50 @ 1
initcall isku_init+0x0/0x50 returned 0 after 53 usecs
calling  kone_init+0x0/0x50 @ 1
initcall kone_init+0x0/0x50 returned 0 after 45 usecs
calling  koneplus_init+0x0/0x50 @ 1
initcall koneplus_init+0x0/0x50 returned 0 after 53 usecs
calling  konepure_init+0x0/0x50 @ 1
initcall konepure_init+0x0/0x50 returned 0 after 46 usecs
calling  kovaplus_init+0x0/0x50 @ 1
initcall kovaplus_init+0x0/0x50 returned 0 after 46 usecs
calling  lua_driver_init+0x0/0x16 @ 1
initcall lua_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  pyra_init+0x0/0x50 @ 1
initcall pyra_init+0x0/0x50 returned 0 after 59 usecs
calling  savu_init+0x0/0x50 @ 1
initcall savu_init+0x0/0x50 returned 0 after 45 usecs
calling  saitek_driver_init+0x0/0x16 @ 1
initcall saitek_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  samsung_driver_init+0x0/0x16 @ 1
initcall samsung_driver_init+0x0/0x16 returned 0 after 25 usecs
calling  sony_driver_init+0x0/0x16 @ 1
initcall sony_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  speedlink_driver_init+0x0/0x16 @ 1
initcall speedlink_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  steelseries_srws1_driver_init+0x0/0x16 @ 1
initcall steelseries_srws1_driver_init+0x0/0x16 returned 0 after 25 usecs
calling  sp_driver_init+0x0/0x16 @ 1
initcall sp_driver_init+0x0/0x16 returned 0 after 25 usecs
calling  ga_driver_init+0x0/0x16 @ 1
initcall ga_driver_init+0x0/0x16 returned 0 after 31 usecs
calling  thingm_driver_init+0x0/0x16 @ 1
initcall thingm_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  tm_driver_init+0x0/0x16 @ 1
initcall tm_driver_init+0x0/0x16 returned 0 after 25 usecs
calling  tivo_driver_init+0x0/0x16 @ 1
initcall tivo_driver_init+0x0/0x16 returned 0 after 25 usecs
calling  twinhan_driver_init+0x0/0x16 @ 1
initcall twinhan_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  uclogic_driver_init+0x0/0x16 @ 1
initcall uclogic_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  xinmo_driver_init+0x0/0x16 @ 1
initcall xinmo_driver_init+0x0/0x16 returned 0 after 25 usecs
calling  zp_driver_init+0x0/0x16 @ 1
initcall zp_driver_init+0x0/0x16 returned 0 after 40 usecs
calling  zc_driver_init+0x0/0x16 @ 1
initcall zc_driver_init+0x0/0x16 returned 0 after 26 usecs
calling  waltop_driver_init+0x0/0x16 @ 1
initcall waltop_driver_init+0x0/0x16 returned 0 after 24 usecs
calling  wiimote_hid_driver_init+0x0/0x16 @ 1
initcall wiimote_hid_driver_init+0x0/0x16 returned 0 after 25 usecs
calling  sensor_hub_driver_init+0x0/0x16 @ 1
initcall sensor_hub_driver_init+0x0/0x16 returned 0 after 25 usecs
calling  hid_init+0x0/0x45 @ 1
usbcore: registered new interface driver usbhid
usbhid: USB HID core driver
initcall hid_init+0x0/0x45 returned 0 after 1520 usecs
calling  vhost_net_init+0x0/0x20 @ 1
initcall vhost_net_init+0x0/0x20 returned 0 after 99 usecs
calling  vhost_init+0x0/0x7 @ 1
initcall vhost_init+0x0/0x7 returned 0 after 4 usecs
calling  hdaps_init+0x0/0x2d @ 1
hdaps: supported laptop not found!
hdaps: driver init failed (ret=-19)!
initcall hdaps_init+0x0/0x2d returned -19 after 2795 usecs
calling  goldfish_pdev_bus_driver_init+0x0/0x11 @ 1
goldfish_pdev_bus goldfish_pdev_bus: unable to reserve Goldfish MMIO.
goldfish_pdev_bus: probe of goldfish_pdev_bus failed with error -16
initcall goldfish_pdev_bus_driver_init+0x0/0x11 returned 0 after 2272 usecs
calling  hid_accel_3d_platform_driver_init+0x0/0x11 @ 1
initcall hid_accel_3d_platform_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  st_accel_driver_init+0x0/0x11 @ 1
initcall st_accel_driver_init+0x0/0x11 returned 0 after 45 usecs
calling  exynos_adc_driver_init+0x0/0x11 @ 1
initcall exynos_adc_driver_init+0x0/0x11 returned 0 after 48 usecs
calling  lp8788_adc_driver_init+0x0/0x11 @ 1
initcall lp8788_adc_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  max1363_driver_init+0x0/0x11 @ 1
initcall max1363_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  tiadc_driver_init+0x0/0x11 @ 1
initcall tiadc_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  twl6030_gpadc_driver_init+0x0/0x11 @ 1
initcall twl6030_gpadc_driver_init+0x0/0x11 returned 0 after 33 usecs
calling  vprbrd_adc_driver_init+0x0/0x11 @ 1
initcall vprbrd_adc_driver_init+0x0/0x11 returned 0 after 45 usecs
calling  ad5380_spi_init+0x0/0x11 @ 1
initcall ad5380_spi_init+0x0/0x11 returned 0 after 31 usecs
calling  ad5446_init+0x0/0x11 @ 1
initcall ad5446_init+0x0/0x11 returned 0 after 30 usecs
calling  mcp4725_driver_init+0x0/0x11 @ 1
initcall mcp4725_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  hid_gyro_3d_platform_driver_init+0x0/0x11 @ 1
initcall hid_gyro_3d_platform_driver_init+0x0/0x11 returned 0 after 34 usecs
calling  itg3200_driver_init+0x0/0x11 @ 1
initcall itg3200_driver_init+0x0/0x11 returned 0 after 31 usecs
calling  st_gyro_driver_init+0x0/0x11 @ 1
initcall st_gyro_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  inv_mpu_driver_init+0x0/0x11 @ 1
initcall inv_mpu_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  adjd_s311_driver_init+0x0/0x11 @ 1
initcall adjd_s311_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  apds9300_driver_init+0x0/0x11 @ 1
initcall apds9300_driver_init+0x0/0x11 returned 0 after 36 usecs
calling  hid_als_platform_driver_init+0x0/0x11 @ 1
initcall hid_als_platform_driver_init+0x0/0x11 returned 0 after 33 usecs
calling  vcnl4000_driver_init+0x0/0x11 @ 1
initcall vcnl4000_driver_init+0x0/0x11 returned 0 after 30 usecs
calling  hid_magn_3d_platform_driver_init+0x0/0x11 @ 1
initcall hid_magn_3d_platform_driver_init+0x0/0x11 returned 0 after 32 usecs
calling  st_magn_driver_init+0x0/0x11 @ 1
initcall st_magn_driver_init+0x0/0x11 returned 0 after 50 usecs
calling  st_press_driver_init+0x0/0x11 @ 1
initcall st_press_driver_init+0x0/0x11 returned 0 after 29 usecs
calling  iio_sysfs_trig_init+0x0/0x30 @ 1
initcall iio_sysfs_trig_init+0x0/0x30 returned 0 after 92 usecs
calling  vme_init+0x0/0xf @ 1
initcall vme_init+0x0/0xf returned 0 after 42 usecs
calling  ca91cx42_driver_init+0x0/0x16 @ 1
initcall ca91cx42_driver_init+0x0/0x16 returned 0 after 61 usecs
calling  tsi148_driver_init+0x0/0x16 @ 1
initcall tsi148_driver_init+0x0/0x16 returned 0 after 38 usecs
calling  ipack_init+0x0/0x19 @ 1
initcall ipack_init+0x0/0x19 returned 0 after 35 usecs
calling  tpci200_pci_drv_init+0x0/0x16 @ 1
initcall tpci200_pci_drv_init+0x0/0x16 returned 0 after 36 usecs
calling  fmc_init+0x0/0xf @ 1
initcall fmc_init+0x0/0xf returned 0 after 35 usecs
calling  t_init+0x0/0x12 @ 1
initcall t_init+0x0/0x12 returned 0 after 23 usecs
calling  fwe_init+0x0/0xf @ 1
initcall fwe_init+0x0/0xf returned 0 after 31 usecs
calling  fc_init+0x0/0xf @ 1
initcall fc_init+0x0/0xf returned 0 after 23 usecs
calling  sock_diag_init+0x0/0xf @ 1
initcall sock_diag_init+0x0/0xf returned 0 after 27 usecs
calling  flow_cache_init_global+0x0/0x10f @ 1
initcall flow_cache_init_global+0x0/0x10f returned 0 after 45 usecs
calling  llc_init+0x0/0x1b @ 1
initcall llc_init+0x0/0x1b returned 0 after 4 usecs
calling  llc2_init+0x0/0xbd @ 1
NET: Registered protocol family 26
initcall llc2_init+0x0/0xbd returned 0 after 1230 usecs
calling  snap_init+0x0/0x33 @ 1
initcall snap_init+0x0/0x33 returned 0 after 28 usecs
calling  netlink_diag_init+0x0/0xf @ 1
initcall netlink_diag_init+0x0/0xf returned 0 after 18 usecs
calling  nfnetlink_init+0x0/0x49 @ 1
Netfilter messages via NETLINK v0.30.
initcall nfnetlink_init+0x0/0x49 returned 0 after 779 usecs
calling  nfnl_acct_init+0x0/0x37 @ 1
nfnl_acct: registering with nfnetlink.
initcall nfnl_acct_init+0x0/0x37 returned 0 after 943 usecs
calling  nfnetlink_queue_init+0x0/0x7b @ 1
initcall nfnetlink_queue_init+0x0/0x7b returned 0 after 27 usecs
calling  nfnetlink_log_init+0x0/0x9c @ 1
initcall nfnetlink_log_init+0x0/0x9c returned 0 after 40 usecs
calling  nf_conntrack_standalone_init+0x0/0x6e @ 1
nf_conntrack version 0.5.0 (15637 buckets, 62548 max)
initcall nf_conntrack_standalone_init+0x0/0x6e returned 0 after 745 usecs
calling  nf_conntrack_proto_sctp_init+0x0/0x51 @ 1
initcall nf_conntrack_proto_sctp_init+0x0/0x51 returned 0 after 57 usecs
calling  ctnetlink_init+0x0/0x92 @ 1
ctnetlink v0.93: registering with nfnetlink.
initcall ctnetlink_init+0x0/0x92 returned 0 after 984 usecs
calling  nf_conntrack_amanda_init+0x0/0x90 @ 1
initcall nf_conntrack_amanda_init+0x0/0x90 returned 0 after 22 usecs
calling  nf_conntrack_ftp_init+0x0/0x1bc @ 1
initcall nf_conntrack_ftp_init+0x0/0x1bc returned 0 after 87 usecs
calling  nf_conntrack_h323_init+0x0/0xe0 @ 1
initcall nf_conntrack_h323_init+0x0/0xe0 returned 0 after 86 usecs
calling  nf_conntrack_irc_init+0x0/0x155 @ 1
initcall nf_conntrack_irc_init+0x0/0x155 returned 0 after 86 usecs
calling  nf_conntrack_snmp_init+0x0/0x19 @ 1
initcall nf_conntrack_snmp_init+0x0/0x19 returned 0 after 4 usecs
calling  nf_conntrack_sane_init+0x0/0x1b6 @ 1
initcall nf_conntrack_sane_init+0x0/0x1b6 returned 0 after 85 usecs
calling  nf_conntrack_sip_init+0x0/0x1e1 @ 1
initcall nf_conntrack_sip_init+0x0/0x1e1 returned 0 after 5 usecs
calling  synproxy_core_init+0x0/0x37 @ 1
initcall synproxy_core_init+0x0/0x37 returned 0 after 31 usecs
calling  xt_init+0x0/0x8d @ 1
initcall xt_init+0x0/0x8d returned 0 after 13 usecs
calling  tcpudp_mt_init+0x0/0x14 @ 1
initcall tcpudp_mt_init+0x0/0x14 returned 0 after 28 usecs
calling  mark_mt_init+0x0/0x37 @ 1
initcall mark_mt_init+0x0/0x37 returned 0 after 4 usecs
calling  connmark_mt_init+0x0/0x37 @ 1
initcall connmark_mt_init+0x0/0x37 returned 0 after 4 usecs
calling  audit_tg_init+0x0/0x14 @ 1
initcall audit_tg_init+0x0/0x14 returned 0 after 4 usecs
calling  checksum_tg_init+0x0/0xf @ 1
initcall checksum_tg_init+0x0/0xf returned 0 after 4 usecs
calling  classify_tg_init+0x0/0x14 @ 1
initcall classify_tg_init+0x0/0x14 returned 0 after 4 usecs
calling  xt_ct_tg_init+0x0/0x3c @ 1
initcall xt_ct_tg_init+0x0/0x3c returned 0 after 4 usecs
calling  dscp_tg_init+0x0/0x14 @ 1
initcall dscp_tg_init+0x0/0x14 returned 0 after 4 usecs
calling  hl_tg_init+0x0/0x14 @ 1
initcall hl_tg_init+0x0/0x14 returned 0 after 4 usecs
calling  hmark_tg_init+0x0/0x14 @ 1
initcall hmark_tg_init+0x0/0x14 returned 0 after 4 usecs
calling  led_tg_init+0x0/0xf @ 1
initcall led_tg_init+0x0/0xf returned 0 after 4 usecs
calling  log_tg_init+0x0/0x5a @ 1
initcall log_tg_init+0x0/0x5a returned 0 after 6 usecs
calling  nflog_tg_init+0x0/0xf @ 1
initcall nflog_tg_init+0x0/0xf returned 0 after 4 usecs
calling  nfqueue_tg_init+0x0/0x14 @ 1
initcall nfqueue_tg_init+0x0/0x14 returned 0 after 4 usecs
calling  xt_rateest_tg_init+0x0/0x22 @ 1
initcall xt_rateest_tg_init+0x0/0x22 returned 0 after 4 usecs
calling  secmark_tg_init+0x0/0xf @ 1
initcall secmark_tg_init+0x0/0xf returned 0 after 4 usecs
calling  tproxy_tg_init+0x0/0x1e @ 1
initcall tproxy_tg_init+0x0/0x1e returned 0 after 4 usecs
calling  tcpmss_tg_init+0x0/0x14 @ 1
initcall tcpmss_tg_init+0x0/0x14 returned 0 after 4 usecs
calling  tcpoptstrip_tg_init+0x0/0x14 @ 1
initcall tcpoptstrip_tg_init+0x0/0x14 returned 0 after 4 usecs
calling  tee_tg_init+0x0/0x14 @ 1
initcall tee_tg_init+0x0/0x14 returned 0 after 4 usecs
calling  trace_tg_init+0x0/0xf @ 1
initcall trace_tg_init+0x0/0xf returned 0 after 4 usecs
calling  idletimer_tg_init+0x0/0xfb @ 1
initcall idletimer_tg_init+0x0/0xfb returned 0 after 111 usecs
calling  xt_cluster_mt_init+0x0/0xf @ 1
initcall xt_cluster_mt_init+0x0/0xf returned 0 after 4 usecs
calling  connbytes_mt_init+0x0/0xf @ 1
initcall connbytes_mt_init+0x0/0xf returned 0 after 4 usecs
calling  connlabel_mt_init+0x0/0xf @ 1
initcall connlabel_mt_init+0x0/0xf returned 0 after 4 usecs
calling  connlimit_mt_init+0x0/0xf @ 1
initcall connlimit_mt_init+0x0/0xf returned 0 after 4 usecs
calling  conntrack_mt_init+0x0/0x14 @ 1
initcall conntrack_mt_init+0x0/0x14 returned 0 after 4 usecs
calling  ecn_mt_init+0x0/0x14 @ 1
initcall ecn_mt_init+0x0/0x14 returned 0 after 4 usecs
calling  hashlimit_mt_init+0x0/0x88 @ 1
initcall hashlimit_mt_init+0x0/0x88 returned 0 after 76 usecs
calling  helper_mt_init+0x0/0xf @ 1
initcall helper_mt_init+0x0/0xf returned 0 after 4 usecs
calling  hl_mt_init+0x0/0x14 @ 1
initcall hl_mt_init+0x0/0x14 returned 0 after 4 usecs
calling  iprange_mt_init+0x0/0x14 @ 1
initcall iprange_mt_init+0x0/0x14 returned 0 after 4 usecs
calling  ipvs_mt_init+0x0/0xf @ 1
initcall ipvs_mt_init+0x0/0xf returned 0 after 4 usecs
calling  limit_mt_init+0x0/0xf @ 1
initcall limit_mt_init+0x0/0xf returned 0 after 4 usecs
calling  multiport_mt_init+0x0/0x14 @ 1
initcall multiport_mt_init+0x0/0x14 returned 0 after 4 usecs
calling  nfacct_mt_init+0x0/0xf @ 1
initcall nfacct_mt_init+0x0/0xf returned 0 after 4 usecs
calling  owner_mt_init+0x0/0xf @ 1
initcall owner_mt_init+0x0/0xf returned 0 after 4 usecs
calling  pkttype_mt_init+0x0/0xf @ 1
initcall pkttype_mt_init+0x0/0xf returned 0 after 4 usecs
calling  quota_mt_init+0x0/0xf @ 1
initcall quota_mt_init+0x0/0xf returned 0 after 4 usecs
calling  xt_rateest_mt_init+0x0/0xf @ 1
initcall xt_rateest_mt_init+0x0/0xf returned 0 after 4 usecs
calling  realm_mt_init+0x0/0xf @ 1
initcall realm_mt_init+0x0/0xf returned 0 after 4 usecs
calling  string_mt_init+0x0/0xf @ 1
initcall string_mt_init+0x0/0xf returned 0 after 4 usecs
calling  tcpmss_mt_init+0x0/0x14 @ 1
initcall tcpmss_mt_init+0x0/0x14 returned 0 after 4 usecs
calling  time_mt_init+0x0/0x57 @ 1
xt_time: kernel timezone is -0000
initcall time_mt_init+0x0/0x57 returned 0 after 1059 usecs
calling  u32_mt_init+0x0/0xf @ 1
initcall u32_mt_init+0x0/0xf returned 0 after 4 usecs
calling  ip_vs_init+0x0/0xe4 @ 1
IPVS: Registered protocols ()
IPVS: Connection hash table configured (size=4096, memory=32Kbytes)
IPVS: Creating netns size=1100 id=0
IPVS: ipvs loaded.
initcall ip_vs_init+0x0/0xe4 returned 0 after 5198 usecs
calling  ip_vs_wrr_init+0x0/0xf @ 1
IPVS: [wrr] scheduler registered.
initcall ip_vs_wrr_init+0x0/0xf returned 0 after 83 usecs
calling  ip_vs_lc_init+0x0/0xf @ 1
IPVS: [lc] scheduler registered.
initcall ip_vs_lc_init+0x0/0xf returned 0 after 889 usecs
calling  ip_vs_lblc_init+0x0/0x33 @ 1
IPVS: [lblc] scheduler registered.
initcall ip_vs_lblc_init+0x0/0x33 returned 0 after 1230 usecs
calling  ip_vs_lblcr_init+0x0/0x33 @ 1
IPVS: [lblcr] scheduler registered.
initcall ip_vs_lblcr_init+0x0/0x33 returned 0 after 421 usecs
calling  ip_vs_dh_init+0x0/0xf @ 1
IPVS: [dh] scheduler registered.
initcall ip_vs_dh_init+0x0/0xf returned 0 after 889 usecs
calling  ip_vs_sh_init+0x0/0xf @ 1
IPVS: [sh] scheduler registered.
initcall ip_vs_sh_init+0x0/0xf returned 0 after 891 usecs
calling  ip_vs_sed_init+0x0/0xf @ 1
IPVS: [sed] scheduler registered.
initcall ip_vs_sed_init+0x0/0xf returned 0 after 1059 usecs
calling  ip_vs_nq_init+0x0/0xf @ 1
IPVS: [nq] scheduler registered.
initcall ip_vs_nq_init+0x0/0xf returned 0 after 890 usecs
calling  sysctl_ipv4_init+0x0/0x7a @ 1
initcall sysctl_ipv4_init+0x0/0x7a returned 0 after 64 usecs
calling  ipip_init+0x0/0x7e @ 1
ipip: IPv4 over IPv4 tunneling driver
initcall ipip_init+0x0/0x7e returned 0 after 1085 usecs
calling  gre_init+0x0/0x9b @ 1
gre: GRE over IPv4 demultiplexor driver
initcall gre_init+0x0/0x9b returned 0 after 1099 usecs
calling  vti_init+0x0/0x6f @ 1
IPv4 over IPSec tunneling driver
initcall vti_init+0x0/0x6f returned 0 after 1163 usecs
calling  init_syncookies+0x0/0x16 @ 1
initcall init_syncookies+0x0/0x16 returned 0 after 27 usecs
calling  ah4_init+0x0/0x75 @ 1
initcall ah4_init+0x0/0x75 returned 0 after 17 usecs
calling  esp4_init+0x0/0x75 @ 1
initcall esp4_init+0x0/0x75 returned 0 after 4 usecs
calling  xfrm4_beet_init+0x0/0x14 @ 1
initcall xfrm4_beet_init+0x0/0x14 returned 0 after 16 usecs
calling  tunnel4_init+0x0/0x6d @ 1
initcall tunnel4_init+0x0/0x6d returned 0 after 4 usecs
calling  xfrm4_transport_init+0x0/0x14 @ 1
initcall xfrm4_transport_init+0x0/0x14 returned 0 after 4 usecs
calling  xfrm4_mode_tunnel_init+0x0/0x14 @ 1
initcall xfrm4_mode_tunnel_init+0x0/0x14 returned 0 after 4 usecs
calling  ipv4_netfilter_init+0x0/0xf @ 1
initcall ipv4_netfilter_init+0x0/0xf returned 0 after 16 usecs
calling  nf_defrag_init+0x0/0x14 @ 1
initcall nf_defrag_init+0x0/0x14 returned 0 after 10 usecs
calling  ip_tables_init+0x0/0x8d @ 1
ip_tables: (C) 2000-2006 Netfilter Core Team
initcall ip_tables_init+0x0/0x8d returned 0 after 969 usecs
calling  iptable_filter_init+0x0/0x40 @ 1
initcall iptable_filter_init+0x0/0x40 returned 0 after 30 usecs
calling  iptable_mangle_init+0x0/0x40 @ 1
initcall iptable_mangle_init+0x0/0x40 returned 0 after 48 usecs
calling  iptable_raw_init+0x0/0x40 @ 1
initcall iptable_raw_init+0x0/0x40 returned 0 after 29 usecs
calling  iptable_security_init+0x0/0x40 @ 1
initcall iptable_security_init+0x0/0x40 returned 0 after 25 usecs
calling  ecn_tg_init+0x0/0xf @ 1
initcall ecn_tg_init+0x0/0xf returned 0 after 4 usecs
calling  synproxy_tg4_init+0x0/0x41 @ 1
initcall synproxy_tg4_init+0x0/0x41 returned 0 after 4 usecs
calling  ulog_tg_init+0x0/0x96 @ 1
initcall ulog_tg_init+0x0/0x96 returned 0 after 29 usecs
calling  inet_diag_init+0x0/0x6d @ 1
initcall inet_diag_init+0x0/0x6d returned 0 after 12 usecs
calling  tcp_diag_init+0x0/0xf @ 1
initcall tcp_diag_init+0x0/0xf returned 0 after 19 usecs
calling  cubictcp_register+0x0/0x78 @ 1
TCP: cubic registered
initcall cubictcp_register+0x0/0x78 returned 0 after 982 usecs
calling  xfrm_user_init+0x0/0x41 @ 1
Initializing XFRM netlink socket
initcall xfrm_user_init+0x0/0x41 returned 0 after 932 usecs
calling  unix_diag_init+0x0/0xf @ 1
initcall unix_diag_init+0x0/0xf returned 0 after 4 usecs
calling  inet6_init+0x0/0x2f3 @ 1
NET: Registered protocol family 10
initcall inet6_init+0x0/0x2f3 returned 0 after 2596 usecs
calling  ah6_init+0x0/0x75 @ 1
initcall ah6_init+0x0/0x75 returned 0 after 4 usecs
calling  esp6_init+0x0/0x75 @ 1
initcall esp6_init+0x0/0x75 returned 0 after 4 usecs
calling  tunnel6_init+0x0/0x75 @ 1
initcall tunnel6_init+0x0/0x75 returned 0 after 4 usecs
calling  xfrm6_transport_init+0x0/0x14 @ 1
initcall xfrm6_transport_init+0x0/0x14 returned 0 after 4 usecs
calling  xfrm6_mode_tunnel_init+0x0/0x14 @ 1
initcall xfrm6_mode_tunnel_init+0x0/0x14 returned 0 after 4 usecs
calling  xfrm6_ro_init+0x0/0x14 @ 1
initcall xfrm6_ro_init+0x0/0x14 returned 0 after 4 usecs
calling  xfrm6_beet_init+0x0/0x14 @ 1
initcall xfrm6_beet_init+0x0/0x14 returned 0 after 4 usecs
calling  ip6_tables_init+0x0/0x8d @ 1
ip6_tables: (C) 2000-2006 Netfilter Core Team
initcall ip6_tables_init+0x0/0x8d returned 0 after 1139 usecs
calling  ip6table_filter_init+0x0/0x40 @ 1
initcall ip6table_filter_init+0x0/0x40 returned 0 after 33 usecs
calling  ip6table_mangle_init+0x0/0x40 @ 1
initcall ip6table_mangle_init+0x0/0x40 returned 0 after 68 usecs
calling  ip6table_raw_init+0x0/0x40 @ 1
initcall ip6table_raw_init+0x0/0x40 returned 0 after 24 usecs
calling  nf_defrag_init+0x0/0x4a @ 1
initcall nf_defrag_init+0x0/0x4a returned 0 after 50 usecs
calling  ah_mt6_init+0x0/0xf @ 1
initcall ah_mt6_init+0x0/0xf returned 0 after 4 usecs
calling  eui64_mt6_init+0x0/0xf @ 1
initcall eui64_mt6_init+0x0/0xf returned 0 after 4 usecs
calling  frag_mt6_init+0x0/0xf @ 1
initcall frag_mt6_init+0x0/0xf returned 0 after 4 usecs
calling  ipv6header_mt6_init+0x0/0xf @ 1
initcall ipv6header_mt6_init+0x0/0xf returned 0 after 4 usecs
calling  mh_mt6_init+0x0/0xf @ 1
initcall mh_mt6_init+0x0/0xf returned 0 after 4 usecs
calling  rpfilter_mt_init+0x0/0xf @ 1
initcall rpfilter_mt_init+0x0/0xf returned 0 after 4 usecs
calling  synproxy_tg6_init+0x0/0x41 @ 1
initcall synproxy_tg6_init+0x0/0x41 returned 0 after 4 usecs
calling  sit_init+0x0/0xbc @ 1
sit: IPv6 over IPv4 tunneling driver
initcall sit_init+0x0/0xbc returned 0 after 1970 usecs
calling  ip6_tunnel_init+0x0/0xb4 @ 1
initcall ip6_tunnel_init+0x0/0xb4 returned 0 after 392 usecs
calling  ip6gre_init+0x0/0x98 @ 1
ip6_gre: GRE over IPv6 tunneling driver
initcall ip6gre_init+0x0/0x98 returned 0 after 1477 usecs
calling  packet_init+0x0/0x39 @ 1
NET: Registered protocol family 17
initcall packet_init+0x0/0x39 returned 0 after 1248 usecs
calling  packet_diag_init+0x0/0xf @ 1
initcall packet_diag_init+0x0/0xf returned 0 after 4 usecs
calling  ipsec_pfkey_init+0x0/0x69 @ 1
NET: Registered protocol family 15
initcall ipsec_pfkey_init+0x0/0x69 returned 0 after 253 usecs
calling  ipx_init+0x0/0xd6 @ 1
NET: Registered protocol family 4
initcall ipx_init+0x0/0xd6 returned 0 after 1149 usecs
calling  atalk_init+0x0/0x78 @ 1
NET: Registered protocol family 5
initcall atalk_init+0x0/0x78 returned 0 after 1121 usecs
calling  x25_init+0x0/0x81 @ 1
NET: Registered protocol family 9
X.25 for Linux Version 0.2
initcall x25_init+0x0/0x81 returned 0 after 1963 usecs
calling  lapb_init+0x0/0x7 @ 1
initcall lapb_init+0x0/0x7 returned 0 after 4 usecs
calling  nr_proto_init+0x0/0x261 @ 1
NET: Registered protocol family 6
initcall nr_proto_init+0x0/0x261 returned 0 after 2061 usecs
calling  rose_proto_init+0x0/0x28f @ 1
NET: Registered protocol family 11
initcall rose_proto_init+0x0/0x28f returned 0 after 3669 usecs
calling  ax25_init+0x0/0xae @ 1
NET: Registered protocol family 3
initcall ax25_init+0x0/0xae returned 0 after 1089 usecs
calling  can_init+0x0/0xe5 @ 1
can: controller area network core (rev 20120528 abi 9)
NET: Registered protocol family 29
initcall can_init+0x0/0xe5 returned 0 after 1942 usecs
calling  bcm_module_init+0x0/0x4c @ 1
can: broadcast manager protocol (rev 20120528 t)
initcall bcm_module_init+0x0/0x4c returned 0 after 697 usecs
calling  cgw_module_init+0x0/0x107 @ 1
can: netlink gateway (rev 20130117) max_hops=1
initcall cgw_module_init+0x0/0x107 returned 0 after 350 usecs
calling  irlan_init+0x0/0x255 @ 1
initcall irlan_init+0x0/0x255 returned 0 after 426 usecs
calling  rfcomm_init+0x0/0xe3 @ 1
Bluetooth: RFCOMM TTY layer initialized
Bluetooth: RFCOMM socket layer initialized
Bluetooth: RFCOMM ver 1.11
initcall rfcomm_init+0x0/0xe3 returned 0 after 2622 usecs
calling  hidp_init+0x0/0x21 @ 1
Bluetooth: HIDP (Human Interface Emulation) ver 1.2
Bluetooth: HIDP socket layer initialized
initcall hidp_init+0x0/0x21 returned 0 after 2445 usecs
calling  init_rpcsec_gss+0x0/0x54 @ 1
initcall init_rpcsec_gss+0x0/0x54 returned 0 after 123 usecs
calling  xprt_rdma_init+0x0/0xbc @ 1
RPC: Registered rdma transport module.
initcall xprt_rdma_init+0x0/0xbc returned 0 after 954 usecs
calling  svc_rdma_init+0x0/0x193 @ 1
initcall svc_rdma_init+0x0/0x193 returned 0 after 80 usecs
calling  af_rxrpc_init+0x0/0x1a0 @ 1
NET: Registered protocol family 33
Key type rxrpc registered
Key type rxrpc_s registered
initcall af_rxrpc_init+0x0/0x1a0 returned 0 after 2949 usecs
calling  rxkad_init+0x0/0x2f @ 1
RxRPC: Registered security type 2 'rxkad'
initcall rxkad_init+0x0/0x2f returned 0 after 1650 usecs
calling  atm_clip_init+0x0/0x9c @ 1
initcall atm_clip_init+0x0/0x9c returned 0 after 35 usecs
calling  br2684_init+0x0/0x4a @ 1
initcall br2684_init+0x0/0x4a returned 0 after 27 usecs
calling  lane_module_init+0x0/0x6b @ 1
lec:lane_module_init: lec.c: initialized
initcall lane_module_init+0x0/0x6b returned 0 after 1270 usecs
calling  atm_mpoa_init+0x0/0x45 @ 1
mpoa:atm_mpoa_init: mpc.c: initialized
initcall atm_mpoa_init+0x0/0x45 returned 0 after 930 usecs
calling  decnet_init+0x0/0x8a @ 1
NET4: DECnet for Linux: V.2.5.68s (C) 1995-2003 Linux DECnet Project Team
DECnet: Routing cache hash table of 1024 buckets, 36Kbytes
NET: Registered protocol family 12
initcall decnet_init+0x0/0x8a returned 0 after 2701 usecs
calling  dn_rtmsg_init+0x0/0x74 @ 1
initcall dn_rtmsg_init+0x0/0x74 returned 0 after 25 usecs
calling  phonet_init+0x0/0x71 @ 1
NET: Registered protocol family 35
initcall phonet_init+0x0/0x71 returned 0 after 1282 usecs
calling  pep_register+0x0/0x14 @ 1
initcall pep_register+0x0/0x14 returned 0 after 18 usecs
calling  vlan_proto_init+0x0/0x88 @ 1
8021q: 802.1Q VLAN Support v1.8
initcall vlan_proto_init+0x0/0x88 returned 0 after 788 usecs
calling  dccp_init+0x0/0x2ec @ 1
DCCP: Activated CCID 2 (TCP-like)
DCCP: Activated CCID 3 (TCP-Friendly Rate Control)
initcall dccp_init+0x0/0x2ec returned 0 after 5991 usecs
calling  dccp_v4_init+0x0/0x70 @ 1
initcall dccp_v4_init+0x0/0x70 returned 0 after 87 usecs
calling  dccp_v6_init+0x0/0x70 @ 1
initcall dccp_v6_init+0x0/0x70 returned 0 after 81 usecs
calling  dccp_diag_init+0x0/0xf @ 1
initcall dccp_diag_init+0x0/0xf returned 0 after 4 usecs
calling  sctp_init+0x0/0x495 @ 1
sctp: Hash tables configured (established 32768 bind 29127)
initcall sctp_init+0x0/0x495 returned 0 after 4239 usecs
calling  lib80211_init+0x0/0x1c @ 1
lib80211: common routines for IEEE802.11 drivers
lib80211_crypt: registered algorithm 'NULL'
initcall lib80211_init+0x0/0x1c returned 0 after 2444 usecs
calling  lib80211_crypto_wep_init+0x0/0xf @ 1
lib80211_crypt: registered algorithm 'WEP'
initcall lib80211_crypto_wep_init+0x0/0xf returned 0 after 629 usecs
calling  lib80211_crypto_ccmp_init+0x0/0xf @ 1
lib80211_crypt: registered algorithm 'CCMP'
initcall lib80211_crypto_ccmp_init+0x0/0xf returned 0 after 799 usecs
calling  lib80211_crypto_tkip_init+0x0/0xf @ 1
lib80211_crypt: registered algorithm 'TKIP'
initcall lib80211_crypto_tkip_init+0x0/0xf returned 0 after 1773 usecs
calling  tipc_init+0x0/0xee @ 1
tipc: Activated (version 2.0.0)
NET: Registered protocol family 30
tipc: Started in single node mode
initcall tipc_init+0x0/0xee returned 0 after 4563 usecs
calling  init_p9+0x0/0x1e @ 1
9pnet: Installing 9P2000 support
initcall init_p9+0x0/0x1e returned 0 after 1881 usecs
calling  p9_virtio_init+0x0/0x2d @ 1
initcall p9_virtio_init+0x0/0x2d returned 0 after 38 usecs
calling  p9_trans_rdma_init+0x0/0x11 @ 1
initcall p9_trans_rdma_init+0x0/0x11 returned 0 after 4 usecs
calling  dcbnl_init+0x0/0x5e @ 1
initcall dcbnl_init+0x0/0x5e returned 0 after 4 usecs
calling  af_ieee802154_init+0x0/0x63 @ 1
NET: Registered protocol family 36
initcall af_ieee802154_init+0x0/0x63 returned 0 after 1230 usecs
calling  lowpan_init_module+0x0/0x47 @ 1
initcall lowpan_init_module+0x0/0x47 returned 0 after 18 usecs
calling  wimax_subsys_init+0x0/0x231 @ 1
initcall wimax_subsys_init+0x0/0x231 returned 0 after 62 usecs
calling  init_dns_resolver+0x0/0xce @ 1
Key type dns_resolver registered
initcall init_dns_resolver+0x0/0xce returned 0 after 891 usecs
calling  init_ceph_lib+0x0/0x6d @ 1
Key type ceph registered
libceph: loaded (mon/osd proto 15/24)
initcall init_ceph_lib+0x0/0x6d returned 0 after 2249 usecs
calling  vmci_transport_init+0x0/0x121 @ 1
NET: Registered protocol family 40
initcall vmci_transport_init+0x0/0x121 returned 0 after 1296 usecs
calling  mpls_gso_init+0x0/0x28 @ 1
mpls_gso: MPLS GSO support
initcall mpls_gso_init+0x0/0x28 returned 0 after 852 usecs
calling  mcheck_init_device+0x0/0x20d @ 1
initcall mcheck_init_device+0x0/0x20d returned -5 after 4 usecs
calling  mcheck_debugfs_init+0x0/0x3e @ 1
initcall mcheck_debugfs_init+0x0/0x3e returned 0 after 45 usecs
calling  severities_debugfs_init+0x0/0x40 @ 1
initcall severities_debugfs_init+0x0/0x40 returned 0 after 23 usecs
calling  lapic_insert_resource+0x0/0x34 @ 1
initcall lapic_insert_resource+0x0/0x34 returned -1 after 4 usecs
calling  io_apic_bug_finalize+0x0/0x1a @ 1
initcall io_apic_bug_finalize+0x0/0x1a returned 0 after 4 usecs
calling  print_ICs+0x0/0x40d @ 1

printing PIC contents

printing PIC contents
... PIC  IMR: 3618
... PIC  IRR: 0001
... PIC  ISR: 0000
... PIC ELCR: 0828
printing local APIC contents on CPU#0/0:
... APIC ID:      00000000 (0)
... APIC VERSION: 00000000
... APIC TASKPRI: 00000000 (00)
... APIC RRR: 00000000
... APIC LDR: 00000000
... APIC DFR: 00000000
... APIC SPIV: 00000000
... APIC ISR field:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

... APIC TMR field:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

... APIC IRR field:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

... APIC ICR: 00000000
... APIC ICR2: 00000000
... APIC LVTT: 00000000
... APIC LVT0: 00000000
... APIC LVT1: 00000000
... APIC TMICT: 00000000
... APIC TMCCT: 00000000
... APIC TDCR: 00000000

number of MP IRQ sources: 0.
testing the IO APIC.......................
IRQ to pin mappings:
.................................... done.
initcall print_ICs+0x0/0x40d returned 0 after 8887 usecs
calling  print_ipi_mode+0x0/0x2e @ 1
Using IPI Shortcut mode
initcall print_ipi_mode+0x0/0x2e returned 0 after 1320 usecs
calling  check_early_ioremap_leak+0x0/0x54 @ 1
initcall check_early_ioremap_leak+0x0/0x54 returned 0 after 4 usecs
calling  pat_memtype_list_init+0x0/0x3a @ 1
initcall pat_memtype_list_init+0x0/0x3a returned 0 after 16 usecs
calling  init_oops_id+0x0/0x3b @ 1
initcall init_oops_id+0x0/0x3b returned 0 after 4 usecs
calling  sched_init_debug+0x0/0x2a @ 1
initcall sched_init_debug+0x0/0x2a returned 0 after 14 usecs
calling  pm_qos_power_init+0x0/0x5b @ 1
initcall pm_qos_power_init+0x0/0x5b returned 0 after 216 usecs
calling  pm_debugfs_init+0x0/0x2a @ 1
initcall pm_debugfs_init+0x0/0x2a returned 0 after 15 usecs
calling  printk_late_init+0x0/0x4c @ 1
initcall printk_late_init+0x0/0x4c returned 0 after 4 usecs
calling  tk_debug_sleep_time_init+0x0/0x41 @ 1
initcall tk_debug_sleep_time_init+0x0/0x41 returned 0 after 14 usecs
calling  test_ringbuffer+0x0/0x448 @ 1
Running ring buffer tests...
finished
CPU 0:
              events:    5000
       dropped bytes:    0
       alloced bytes:    389428
       written bytes:    382036
       biggest event:    23
      smallest event:    0
         read events:   5000
         lost events:   0
        total events:   5000
  recorded len bytes:   389428
 recorded size bytes:   382036
Ring buffer PASSED!
initcall test_ringbuffer+0x0/0x448 returned 0 after 9780917 usecs
calling  clear_boot_tracer+0x0/0x30 @ 1
initcall clear_boot_tracer+0x0/0x30 returned 0 after 4 usecs
calling  set_recommended_min_free_kbytes+0x0/0x6d @ 1
initcall set_recommended_min_free_kbytes+0x0/0x6d returned 0 after 72 usecs
calling  afs_init+0x0/0x169 @ 1
kAFS: Red Hat AFS client v0.1 registering.
initcall afs_init+0x0/0x169 returned 0 after 2198 usecs
calling  init_trusted+0x0/0xa8 @ 1
Key type trusted registered
initcall init_trusted+0x0/0xa8 returned 0 after 1302 usecs
calling  init_encrypted+0x0/0xfa @ 1
Key type encrypted registered
initcall init_encrypted+0x0/0xfa returned 0 after 1961 usecs
calling  init_ima+0x0/0x18 @ 1
IMA: No TPM chip found, activating TPM-bypass!
initcall init_ima+0x0/0x18 returned 0 after 1403 usecs
calling  prandom_reseed+0x0/0x55 @ 1
initcall prandom_reseed+0x0/0x55 returned 0 after 9 usecs
calling  pci_resource_alignment_sysfs_init+0x0/0x19 @ 1
initcall pci_resource_alignment_sysfs_init+0x0/0x19 returned 0 after 17 usecs
calling  pci_sysfs_init+0x0/0x48 @ 1
initcall pci_sysfs_init+0x0/0x48 returned 0 after 237 usecs
calling  regulator_init_complete+0x0/0x14c @ 1
initcall regulator_init_complete+0x0/0x14c returned 0 after 4 usecs
calling  random_int_secret_init+0x0/0x16 @ 1
initcall random_int_secret_init+0x0/0x16 returned 0 after 16 usecs
calling  deferred_probe_initcall+0x0/0x73 @ 1
initcall deferred_probe_initcall+0x0/0x73 returned 0 after 99 usecs
calling  late_resume_init+0x0/0x1bb @ 1
  Magic number: 1:448:497
initcall late_resume_init+0x0/0x1bb returned 0 after 1923 usecs
calling  wl1273_core_init+0x0/0x29 @ 1
initcall wl1273_core_init+0x0/0x29 returned 0 after 59 usecs
calling  init_netconsole+0x0/0x1a3 @ 1
console [netcon0] enabled
netconsole: network logging started
initcall init_netconsole+0x0/0x1a3 returned 0 after 2078 usecs
calling  vxlan_init_module+0x0/0x8e @ 1
initcall vxlan_init_module+0x0/0x8e returned 0 after 38 usecs
calling  gpio_keys_init+0x0/0x11 @ 1
initcall gpio_keys_init+0x0/0x11 returned 0 after 77 usecs
calling  edd_init+0x0/0x2bd @ 1
BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
EDD information not available.
initcall edd_init+0x0/0x2bd returned -19 after 1897 usecs
calling  firmware_memmap_init+0x0/0x29 @ 1
initcall firmware_memmap_init+0x0/0x29 returned 0 after 56 usecs
calling  tcp_congestion_default+0x0/0xf @ 1
initcall tcp_congestion_default+0x0/0xf returned 0 after 4 usecs
calling  tcp_fastopen_init+0x0/0x40 @ 1
initcall tcp_fastopen_init+0x0/0x40 returned 0 after 37 usecs
calling  ip_auto_config+0x0/0xe11 @ 1
initcall ip_auto_config+0x0/0xe11 returned 0 after 24 usecs
calling  initialize_hashrnd+0x0/0x16 @ 1
initcall initialize_hashrnd+0x0/0x16 returned 0 after 6 usecs
async_waiting @ 1
async_continuing @ 1 after 4 usec
EXT3-fs (sda1): recovery required on readonly filesystem
EXT3-fs (sda1): write access will be enabled during recovery
kjournald starting.  Commit interval 5 seconds
EXT3-fs (sda1): recovery complete
EXT3-fs (sda1): mounted filesystem with writeback data mode
VFS: Mounted root (ext3 filesystem) readonly on device 8:1.
async_waiting @ 1
async_continuing @ 1 after 4 usec
debug: unmapping init [mem 0xb2d05000-0xb2dbbfff]
Write protecting the kernel text: 19924k
Testing CPA: Reverting b1000000-b2375000
Testing CPA: write protecting again
Write protecting the kernel read-only data: 8024k
Testing CPA: undo b2375000-b2b4b000
Testing CPA: write protecting again
Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
INIT: version 2.86 booting
BUG: unable to handle kernel BUG: unable to handle kernel paging requestpaging request at eaf10f40
 at eaf10f40
IP:IP: [<b103e0ef>] task_work_run+0x52/0x87
 [<b103e0ef>] task_work_run+0x52/0x87
*pde = 3fbf9067 *pde = 3fbf9067 *pte = 3af10060 *pte = 3af10060 

Oops: 0000 [#1] Oops: 0000 [#1] DEBUG_PAGEALLOCDEBUG_PAGEALLOC

CPU: 0 PID: 171 Comm: hostname Tainted: G        W    3.12.0-rc4-01668-gfd71a04-dirty #229484
task: eaf157a0 ti: eacf2000 task.ti: eacf2000
EIP: 0060:[<b103e0ef>] EFLAGS: 00010282 CPU: 0
EIP is at task_work_run+0x52/0x87
EAX: eaf10f40 EBX: eaf13f40 ECX: 00000000 EDX: eaf10f40
ESI: eaf15a38 EDI: eaf157a0 EBP: eacf3f3c ESP: eacf3f30
 DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068
CR0: 8005003b CR2: eaf10f40 CR3: 3acf6000 CR4: 00000690
Stack:
 eaf15a50 eaf15a50 eacf5dc0 eacf5dc0 eaf157a0 eaf157a0 eacf3f8c eacf3f8c b1029d1d b1029d1d 00000000 00000000 b01137a0 b01137a0 eaf157a0 eaf157a0

 eaf157a0 eaf157a0 eaf13f40 eaf13f40 00000001 00000001 00000007 00000007 eacf3f88 eacf3f88 b10cdebb b10cdebb 00000001 00000001 eacf5e10 eacf5e10

 00000000 00000000 eaf13f48 eaf13f48 00000002 00000002 eaf13f40 eaf13f40 eace1d80 eace1d80 00000000 00000000 eaf157a0 eaf157a0 eacf3fa4 eacf3fa4

Call Trace:
 [<b1029d1d>] do_exit+0x291/0x753
 [<b10cdebb>] ? vfs_write+0x11f/0x15a
 [<b102a262>] do_group_exit+0x59/0x86
 [<b102a29f>] SyS_exit_group+0x10/0x10
 [<b237365b>] sysenter_do_call+0x12/0x2d
Code:Code: ed ed dc dc b2 b2 0f 0f 45 45 c8 c8 eb eb 02 02 31 31 c9 c9 89 89 d0 d0 0f 0f b1 b1 0e 0e 39 39 c2 c2 75 75 d8 d8 85 85 d2 d2 74 74 41 41 f3 f3 90 90 8b 8b 87 87 d0 d0 02 02 00 00 00 00 85 85 c0 c0 74 74 f4 f4 31 31 db db eb eb 04 04 89 89 d3 d3 89 89 c2 c2 <8b> <8b> 02 02 85 85 c0 c0 89 89 1a 1a 75 75 f4 f4 89 89 d0 d0 ff ff 52 52 04 04 31 31 c9 c9 ba ba 7d 7d 00 00 00 00 00 00 b8 b8

EIP: [<b103e0ef>] EIP: [<b103e0ef>] task_work_run+0x52/0x87task_work_run+0x52/0x87 SS:ESP 0068:eacf3f30
 SS:ESP 0068:eacf3f30
CR2: 00000000eaf10f40
---[ end trace a7919e7f17c0a729 ]---
Fixing recursive fault but reboot is needed!
CPA self-test:
 4k 262128 large 0 gb 0 x 262128[b0000000-effef000] miss 0
ok.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-09 16:29   ` Ingo Molnar
@ 2013-10-09 16:57       ` Ingo Molnar
  0 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-10-09 16:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML


An interesting aspect is that this is a 32-bit UP kernel.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-09 16:28 ` Ingo Molnar
@ 2013-10-09 17:08     ` Peter Zijlstra
  2013-10-09 17:08     ` Peter Zijlstra
  1 sibling, 0 replies; 340+ messages in thread
From: Peter Zijlstra @ 2013-10-09 17:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Oct 09, 2013 at 06:28:01PM +0200, Ingo Molnar wrote:
> 
> Hm, so I'm seeing boot crashes with the config attached:
> 
>  INIT: version 2.86 booting 
>  BUG: unable to handle kernel BUG: unable to handle kernel paging 
>  requestpaging request at eaf10f40 
>   at eaf10f40 
>  IP:IP: [<b103e0ef>] task_work_run+0x52/0x87 
>   [<b103e0ef>] task_work_run+0x52/0x87 
>  *pde = 3fbf9067 *pde = 3fbf9067 *pte = 3af10060 *pte = 3af10060  
>  
>  Oops: 0000 [#1] Oops: 0000 [#1] DEBUG_PAGEALLOCDEBUG_PAGEALLOC 
>  
>  CPU: 0 PID: 171 Comm: hostname Tainted: G        W    
>  3.12.0-rc4-01668-gfd71a04-dirty #229484 
>  CPU: 0 PID: 171 Comm: hostname Tainted: G        W    
>  3.12.0-rc4-01668-gfd71a04-dirty #229484 
>  task: eaf157a0 ti: eacf2000 task.ti: eacf2000 
> 
> Note that the config does not have NUMA_BALANCING enabled. With another 
> config I also had a failed bootup due to the OOM killer kicking in. That 
> didn't have NUMA_BALANCING enabled either.
> 
> Yet this all started today, after merging the NUMA patches.
> 
> Any ideas?

> CONFIG_MGEODE_LX=y

It looks like -march=geode generates similar borkage to the
-march=winchip2 like we found earlier today.

Must be randconfig luck to only hit it now.

Very easy to see if you build kernel/task_work.s: the bitops jc label
path fails to initialize the return value.

^ permalink raw reply	[flat|nested] 340+ messages in thread
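
For context, the "jc label path" refers to the asm-goto style read-modify-write
bit operations, in which the result of the bit test is conveyed by a conditional
jump to a C label instead of being written into a register; the assembly for the
affected file can be regenerated with the kbuild single-file target
"make kernel/task_work.s". The sketch below is illustrative only, assuming an
x86 target and a GCC with asm goto support; it shows the general shape under
discussion, not the kernel's actual RMW bitop implementation.

/*
 * Illustrative sketch only, not the kernel's real code: an asm-goto
 * based test_and_set_bit() where the "bit was already set" result is
 * the choice of fall-through versus taken jc branch.  If the compiler
 * mishandles the jump path for a given -march, callers can end up
 * consuming an uninitialized value.
 */
static inline int sketch_test_and_set_bit(long nr, volatile unsigned long *addr)
{
	asm goto ("lock; bts %1, %0\n\t"
		  "jc %l[was_set]"
		  : /* asm goto allows no outputs */
		  : "m" (*addr), "r" (nr)
		  : "memory", "cc"
		  : was_set);
	return 0;	/* fall-through: bit was previously clear */
was_set:
	return 1;	/* jc taken: bit was previously set */
}

With this construct the return value exists only as the branch choice, which is
consistent with a miscompiled jump path producing the kind of crash shown above.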

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-09 16:57       ` Ingo Molnar
@ 2013-10-09 17:09         ` Ingo Molnar
  -1 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-10-09 17:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML


I started bisecting the crash, and the good news is that it's bisectable 
and it's not the NUMA bits that are causing the crash.

(the bad news is that I now face a boring, possibly very long bisection, 
but hey ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-09 17:09         ` Ingo Molnar
@ 2013-10-09 17:11           ` Peter Zijlstra
  -1 siblings, 0 replies; 340+ messages in thread
From: Peter Zijlstra @ 2013-10-09 17:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Oct 09, 2013 at 07:09:34PM +0200, Ingo Molnar wrote:
> 
> I started bisecting the crash, and the good news is that it's bisectable 
> and it's not the NUMA bits that are causing the crash.
> 
> (the bad news is that I now face a boring, possibly very long bisection, 
> but hey ;-)

It's the RMW bits..

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-09 17:08     ` Peter Zijlstra
@ 2013-10-09 17:15       ` Ingo Molnar
  -1 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-10-09 17:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Oct 09, 2013 at 06:28:01PM +0200, Ingo Molnar wrote:
> > 
> > Hm, so I'm seeing boot crashes with the config attached:
> > 
> >  INIT: version 2.86 booting 
> >  BUG: unable to handle kernel BUG: unable to handle kernel paging 
> >  requestpaging request at eaf10f40 
> >   at eaf10f40 
> >  IP:IP: [<b103e0ef>] task_work_run+0x52/0x87 
> >   [<b103e0ef>] task_work_run+0x52/0x87 
> >  *pde = 3fbf9067 *pde = 3fbf9067 *pte = 3af10060 *pte = 3af10060  
> >  
> >  Oops: 0000 [#1] Oops: 0000 [#1] DEBUG_PAGEALLOCDEBUG_PAGEALLOC 
> >  
> >  CPU: 0 PID: 171 Comm: hostname Tainted: G        W    
> >  3.12.0-rc4-01668-gfd71a04-dirty #229484 
> >  CPU: 0 PID: 171 Comm: hostname Tainted: G        W    
> >  3.12.0-rc4-01668-gfd71a04-dirty #229484 
> >  task: eaf157a0 ti: eacf2000 task.ti: eacf2000 
> > 
> > Note that the config does not have NUMA_BALANCING enabled. With another 
> > config I also had a failed bootup due to the OOM killer kicking in. That 
> > didn't have NUMA_BALANCING enabled either.
> > 
> > Yet this all started today, after merging the NUMA patches.
> > 
> > Any ideas?
> 
> > CONFIG_MGEODE_LX=y
> 
> It looks like -march=geode generates similar borkage to the
> -march=winchip2 like we found earlier today.
> 
> Must be randconfig luck to only hit it now.

Yes, very weird but such is life :-)

Also note that this reproduces with GCC 4.7 ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-09 17:15       ` Ingo Molnar
@ 2013-10-09 17:18         ` Peter Zijlstra
  -1 siblings, 0 replies; 340+ messages in thread
From: Peter Zijlstra @ 2013-10-09 17:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Wed, Oct 09, 2013 at 07:15:37PM +0200, Ingo Molnar wrote:
> > It looks like -march=geode generates similar borkage to the
> > -march=winchip2 like we found earlier today.
> > 
> > Must be randconfig luck to only hit it now.
> 
> Yes, very weird but such is life :-)
> 
> Also note that this reproduces with GCC 4.7 ...

Yes, it does so for me too; I tried both 4.7 and 4.8, and they generate
different but similarly broken code.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: numa: Document automatic NUMA balancing sysctls
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:24   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  10fc05d0e551146ad6feb0ab8902d28a2d3c5624
Gitweb:     http://git.kernel.org/tip/10fc05d0e551146ad6feb0ab8902d28a2d3c5624
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:40 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:20 +0200

mm: numa: Document automatic NUMA balancing sysctls

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-3-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 9d4c1d1..1428c66 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -355,6 +355,72 @@ utilize.
 
 ==============================================================
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running.  Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases.  The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases.  The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a tasks scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
 osrelease, ostype & version:
 
 # cat osrelease

^ permalink raw reply related	[flat|nested] 340+ messages in thread
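
The sysctls documented in the patch above are exposed, like other kernel.*
sysctls, as files under /proc/sys/kernel/. As a convenience, here is a
minimal sketch (a hypothetical helper, not part of the series) that dumps
the current values and can disable the feature by writing 0 to
numa_balancing:

/* numa-sysctl.c: illustrative only; assumes the /proc/sys/kernel/ layout
 * implied by the documentation patch above. */
#include <stdio.h>
#include <string.h>

static const char *knobs[] = {
	"numa_balancing",
	"numa_balancing_scan_delay_ms",
	"numa_balancing_scan_period_min_ms",
	"numa_balancing_scan_period_max_ms",
	"numa_balancing_scan_period_reset",
	"numa_balancing_scan_size_mb",
};

int main(int argc, char **argv)
{
	char path[128], val[64];
	size_t i;
	FILE *f;

	for (i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++) {
		snprintf(path, sizeof(path), "/proc/sys/kernel/%s", knobs[i]);
		f = fopen(path, "r");
		if (!f)
			continue;	/* kernel without this tunable */
		if (fgets(val, sizeof(val), f))
			printf("%-36s %s", knobs[i], val);
		fclose(f);
	}

	/* "numa-sysctl off" writes 0 to kernel.numa_balancing (needs root) */
	if (argc > 1 && !strcmp(argv[1], "off")) {
		f = fopen("/proc/sys/kernel/numa_balancing", "w");
		if (f) {
			fputs("0\n", f);
			fclose(f);
		}
	}
	return 0;
}

The same can of course be done with sysctl(8) or a plain echo; the point is
only to make the documented knob names concrete.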

* [tip:sched/core] sched/numa: Fix comments
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:24   ` tip-bot for Peter Zijlstra
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, srikar, aarcange,
	mgorman, tglx

Commit-ID:  c69307d533d7aa7cc8894dbbb8a274599f8630d7
Gitweb:     http://git.kernel.org/tip/c69307d533d7aa7cc8894dbbb8a274599f8630d7
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:28:41 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:30 +0200

sched/numa: Fix comments

Fix an 80-column violation and a PTE vs PMD reference.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-4-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 8 ++++----
 mm/huge_memory.c    | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b89cd2..817cd7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,10 +988,10 @@ void task_numa_work(struct callback_head *work)
 
 out:
 	/*
-	 * It is possible to reach the end of the VMA list but the last few VMAs are
-	 * not guaranteed to the vma_migratable. If they are not, we would find the
-	 * !migratable VMA on the next scan but not reset the scanner to the start
-	 * so check it now.
+	 * It is possible to reach the end of the VMA list but the last few
+	 * VMAs are not guaranteed to the vma_migratable. If they are not, we
+	 * would find the !migratable VMA on the next scan but not reset the
+	 * scanner to the start so check it now.
 	 */
 	if (vma)
 		mm->numa_scan_offset = start;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7489884..19dbb08 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1305,7 +1305,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(&mm->page_table_lock);
 	lock_page(page);
 
-	/* Confirm the PTE did not while locked */
+	/* Confirm the PMD did not change while page_table_lock was released */
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		unlock_page(page);

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: numa: Do not account for a hinting fault if we raced
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:24   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  0c3a775e1e0b069bf765f8355b723ce0d18dcc6c
Gitweb:     http://git.kernel.org/tip/0c3a775e1e0b069bf765f8355b723ce0d18dcc6c
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:42 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:40 +0200

mm: numa: Do not account for a hinting fault if we raced

If another task handled a hinting fault in parallel then do not double
account for it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-5-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 19dbb08..dab2bab 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1325,8 +1325,11 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 check_same:
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(pmd, *pmdp)))
+	if (unlikely(!pmd_same(pmd, *pmdp))) {
+		/* Someone else took our fault */
+		current_nid = -1;
 		goto out_unlock;
+	}
 clear_pmdnuma:
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: Wait for THP migrations to complete during NUMA hinting faults
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:24   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  ff9042b11a71c81238c70af168cd36b98a6d5a3c
Gitweb:     http://git.kernel.org/tip/ff9042b11a71c81238c70af168cd36b98a6d5a3c
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:43 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:41 +0200

mm: Wait for THP migrations to complete during NUMA hinting faults

The locking for migrating THP is unusual. While normal page migration
prevents parallel accesses using a migration PTE, THP migration relies on
a combination of the page_table_lock, the page lock and the existence of
the NUMA hinting PTE to guarantee safety, but there is a bug in the scheme.

If a THP page is currently being migrated and another thread traps a
fault on the same page, it checks if the page is misplaced. If it is not,
then pmd_numa is cleared. The problem is that the check is made without
holding the page lock, meaning that the racing thread can be migrating the
THP while the second thread clears the NUMA bit and faults a stale page.

This patch checks if the page is potentially being migrated and, if so,
stalls on the page lock before checking whether the page is misplaced.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-6-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dab2bab..f362363 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1295,13 +1295,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
-	target_nid = mpol_misplaced(page, vma, haddr);
-	if (target_nid == -1) {
-		put_page(page);
-		goto clear_pmdnuma;
-	}
+	/*
+	 * Acquire the page lock to serialise THP migrations but avoid dropping
+	 * page_table_lock if at all possible
+	 */
+	if (trylock_page(page))
+		goto got_lock;
 
-	/* Acquire the page lock to serialise THP migrations */
+	/* Serialise against migrationa and check placement check placement */
 	spin_unlock(&mm->page_table_lock);
 	lock_page(page);
 
@@ -1312,9 +1313,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		put_page(page);
 		goto out_unlock;
 	}
-	spin_unlock(&mm->page_table_lock);
+
+got_lock:
+	target_nid = mpol_misplaced(page, vma, haddr);
+	if (target_nid == -1) {
+		unlock_page(page);
+		put_page(page);
+		goto clear_pmdnuma;
+	}
 
 	/* Migrate the THP to the requested node */
+	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
 	if (!migrated)

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: Prevent parallel splits during THP migration
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:24   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  b8916634b77bffb233d8f2f45703c80343457cc1
Gitweb:     http://git.kernel.org/tip/b8916634b77bffb233d8f2f45703c80343457cc1
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:44 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:43 +0200

mm: Prevent parallel splits during THP migration

THP migrations are serialised by the page lock but on its own that does
not prevent THP splits. If the page is split during THP migration then
the pmd_same checks will prevent page table corruption but the unlock_page
and other fix-ups can still cause corruption. This patch takes the
anon_vma lock to prevent parallel splits during migration.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-7-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 44 ++++++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f362363..1d6334f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1278,18 +1278,18 @@ out:
 int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
+	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int target_nid;
 	int current_nid = -1;
-	bool migrated;
+	bool migrated, page_locked;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
-	get_page(page);
 	current_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (current_nid == numa_node_id())
@@ -1299,12 +1299,29 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Acquire the page lock to serialise THP migrations but avoid dropping
 	 * page_table_lock if at all possible
 	 */
-	if (trylock_page(page))
-		goto got_lock;
+	page_locked = trylock_page(page);
+	target_nid = mpol_misplaced(page, vma, haddr);
+	if (target_nid == -1) {
+		/* If the page was locked, there are no parallel migrations */
+		if (page_locked) {
+			unlock_page(page);
+			goto clear_pmdnuma;
+		}
 
-	/* Serialise against migrationa and check placement check placement */
+		/* Otherwise wait for potential migrations and retry fault */
+		spin_unlock(&mm->page_table_lock);
+		wait_on_page_locked(page);
+		goto out;
+	}
+
+	/* Page is misplaced, serialise migrations and parallel THP splits */
+	get_page(page);
 	spin_unlock(&mm->page_table_lock);
-	lock_page(page);
+	if (!page_locked) {
+		lock_page(page);
+		page_locked = true;
+	}
+	anon_vma = page_lock_anon_vma_read(page);
 
 	/* Confirm the PMD did not change while page_table_lock was released */
 	spin_lock(&mm->page_table_lock);
@@ -1314,14 +1331,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_unlock;
 	}
 
-got_lock:
-	target_nid = mpol_misplaced(page, vma, haddr);
-	if (target_nid == -1) {
-		unlock_page(page);
-		put_page(page);
-		goto clear_pmdnuma;
-	}
-
 	/* Migrate the THP to the requested node */
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1330,6 +1339,8 @@ got_lock:
 		goto check_same;
 
 	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	if (anon_vma)
+		page_unlock_anon_vma_read(anon_vma);
 	return 0;
 
 check_same:
@@ -1346,6 +1357,11 @@ clear_pmdnuma:
 	update_mmu_cache_pmd(vma, addr, pmdp);
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+
+out:
+	if (anon_vma)
+		page_unlock_anon_vma_read(anon_vma);
+
 	if (current_nid != -1)
 		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
 	return 0;

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: numa: Sanitize task_numa_fault() callsites
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:25   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  8191acbd30c73e45c24ad16c372e0b42cc7ac8f8
Gitweb:     http://git.kernel.org/tip/8191acbd30c73e45c24ad16c372e0b42cc7ac8f8
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:45 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:44 +0200

mm: numa: Sanitize task_numa_fault() callsites

There are three callers of task_numa_fault():

 - do_huge_pmd_numa_page():
     Accounts against the current node, not the node where the
     page resides, unless we migrated, in which case it accounts
     against the node we migrated to.

 - do_numa_page():
     Accounts against the current node, not the node where the
     page resides, unless we migrated, in which case it accounts
     against the node we migrated to.

 - do_pmd_numa_page():
     Accounts not at all when the page isn't migrated, otherwise
     accounts against the node we migrated towards.

This seems wrong to me; all three sites should have the same semantics.
Furthermore, we should account against where the page really is; we
already know where the task is.

So modify all three sites to always account; we did, after all, receive
the fault. And always account to where the page is after migration,
regardless of success.

They all still differ on when they clear the PTE/PMD; ideally that
would get sorted too.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 25 +++++++++++++------------
 mm/memory.c      | 53 +++++++++++++++++++++--------------------------------
 2 files changed, 34 insertions(+), 44 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1d6334f..c3bb65f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1281,18 +1281,19 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	int page_nid = -1, this_nid = numa_node_id();
 	int target_nid;
-	int current_nid = -1;
-	bool migrated, page_locked;
+	bool page_locked;
+	bool migrated = false;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
-	current_nid = page_to_nid(page);
+	page_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	/*
@@ -1335,19 +1336,18 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
-	if (!migrated)
+	if (migrated)
+		page_nid = target_nid;
+	else
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
-	if (anon_vma)
-		page_unlock_anon_vma_read(anon_vma);
-	return 0;
+	goto out;
 
 check_same:
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		/* Someone else took our fault */
-		current_nid = -1;
+		page_nid = -1;
 		goto out_unlock;
 	}
 clear_pmdnuma:
@@ -1362,8 +1362,9 @@ out:
 	if (anon_vma)
 		page_unlock_anon_vma_read(anon_vma);
 
-	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+	if (page_nid != -1)
+		task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index ca00039..42ae82e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3519,12 +3519,12 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-				unsigned long addr, int current_nid)
+				unsigned long addr, int page_nid)
 {
 	get_page(page);
 
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (page_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	return mpol_misplaced(page, vma, addr);
@@ -3535,7 +3535,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int page_nid = -1;
 	int target_nid;
 	bool migrated = false;
 
@@ -3565,15 +3565,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
-	current_nid = page_to_nid(page);
-	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
+	page_nid = page_to_nid(page);
+	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
-		/*
-		 * Account for the fault against the current node if it not
-		 * being replaced regardless of where the page is located.
-		 */
-		current_nid = numa_node_id();
 		put_page(page);
 		goto out;
 	}
@@ -3581,11 +3576,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Migrate to the requested node */
 	migrated = migrate_misplaced_page(page, target_nid);
 	if (migrated)
-		current_nid = target_nid;
+		page_nid = target_nid;
 
 out:
-	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+	if (page_nid != -1)
+		task_numa_fault(page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3600,7 +3595,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
-	int local_nid = numa_node_id();
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3623,9 +3617,10 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page *page;
-		int curr_nid = local_nid;
+		int page_nid = -1;
 		int target_nid;
-		bool migrated;
+		bool migrated = false;
+
 		if (!pte_present(pteval))
 			continue;
 		if (!pte_numa(pteval))
@@ -3647,25 +3642,19 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(page_mapcount(page) != 1))
 			continue;
 
-		/*
-		 * Note that the NUMA fault is later accounted to either
-		 * the node that is currently running or where the page is
-		 * migrated to.
-		 */
-		curr_nid = local_nid;
-		target_nid = numa_migrate_prep(page, vma, addr,
-					       page_to_nid(page));
-		if (target_nid == -1) {
+		page_nid = page_to_nid(page);
+		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+		pte_unmap_unlock(pte, ptl);
+		if (target_nid != -1) {
+			migrated = migrate_misplaced_page(page, target_nid);
+			if (migrated)
+				page_nid = target_nid;
+		} else {
 			put_page(page);
-			continue;
 		}
 
-		/* Migrate to the requested node */
-		pte_unmap_unlock(pte, ptl);
-		migrated = migrate_misplaced_page(page, target_nid);
-		if (migrated)
-			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		if (page_nid != -1)
+			task_numa_fault(page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}

^ permalink raw reply related	[flat|nested] 340+ messages in thread
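
For readers skimming the thread, the rule all three callsites converge on
after the patch above can be modelled in a few lines of ordinary C. This is
an editorial sketch of the accounting decision only; the stub
task_numa_fault() below is hypothetical and just prints what the kernel
would account, it is not kernel code:

#include <stdio.h>
#include <stdbool.h>

/* stand-in for the kernel's task_numa_fault(): record a fault against
 * the node the page is on, noting whether it was migrated */
static void task_numa_fault(int nid, int pages, bool migrated)
{
	printf("account %d page(s) to node %d (migrated=%d)\n",
	       pages, nid, migrated);
}

/* the unified pattern: always account, and account to wherever the page
 * ends up after any migration attempt, successful or not */
static void handle_fault(int page_nid, int target_nid, bool migration_ok)
{
	bool migrated = false;

	if (target_nid != -1) {			/* page is misplaced */
		migrated = migration_ok;	/* try to migrate it */
		if (migrated)
			page_nid = target_nid;
	}
	if (page_nid != -1)
		task_numa_fault(page_nid, 1, migrated);
}

int main(void)
{
	handle_fault(0, -1, false);	/* well placed:      node 0 */
	handle_fault(0, 1, true);	/* migration worked: node 1 */
	handle_fault(0, 1, false);	/* migration failed: node 0 */
	return 0;
}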

* [tip:sched/core] mm: Close races between THP migration and PMD numa clearing
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:25   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  a54a407fbf7735fd8f7841375574f5d9b0375f93
Gitweb:     http://git.kernel.org/tip/a54a407fbf7735fd8f7841375574f5d9b0375f93
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:46 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:45 +0200

mm: Close races between THP migration and PMD numa clearing

THP migration uses the page lock to guard against parallel allocations
but there are cases like this still open

  Task A					Task B
  ---------------------				---------------------
  do_huge_pmd_numa_page				do_huge_pmd_numa_page
  lock_page
  mpol_misplaced == -1
  unlock_page
  goto clear_pmdnuma
						lock_page
						mpol_misplaced == 2
						migrate_misplaced_transhuge
  pmd = pmd_mknonnuma
  set_pmd_at

During hours of testing, one crashed with weird errors and, while I have
no direct evidence, I suspect something like the race above happened.
This patch extends the page lock to be held until pmd_numa is cleared, to
prevent a migration starting in parallel while pmd_numa is being cleared.
It also flushes the old pmd entry and orders pagetable insertion before
rmap insertion.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 33 +++++++++++++++------------------
 mm/migrate.c     | 19 +++++++++++--------
 2 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bb65f..d4928769 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1304,24 +1304,25 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		/* If the page was locked, there are no parallel migrations */
-		if (page_locked) {
-			unlock_page(page);
+		if (page_locked)
 			goto clear_pmdnuma;
-		}
 
-		/* Otherwise wait for potential migrations and retry fault */
+		/*
+		 * Otherwise wait for potential migrations and retry. We do
+		 * relock and check_same as the page may no longer be mapped.
+		 * As the fault is being retried, do not account for it.
+		 */
 		spin_unlock(&mm->page_table_lock);
 		wait_on_page_locked(page);
+		page_nid = -1;
 		goto out;
 	}
 
 	/* Page is misplaced, serialise migrations and parallel THP splits */
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
-	if (!page_locked) {
+	if (!page_locked)
 		lock_page(page);
-		page_locked = true;
-	}
 	anon_vma = page_lock_anon_vma_read(page);
 
 	/* Confirm the PMD did not change while page_table_lock was released */
@@ -1329,32 +1330,28 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		unlock_page(page);
 		put_page(page);
+		page_nid = -1;
 		goto out_unlock;
 	}
 
-	/* Migrate the THP to the requested node */
+	/*
+	 * Migrate the THP to the requested node, returns with page unlocked
+	 * and pmd_numa cleared.
+	 */
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
 	if (migrated)
 		page_nid = target_nid;
-	else
-		goto check_same;
 
 	goto out;
-
-check_same:
-	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(pmd, *pmdp))) {
-		/* Someone else took our fault */
-		page_nid = -1;
-		goto out_unlock;
-	}
 clear_pmdnuma:
+	BUG_ON(!PageLocked(page));
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
 	VM_BUG_ON(pmd_numa(*pmdp));
 	update_mmu_cache_pmd(vma, addr, pmdp);
+	unlock_page(page);
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index a26bccd..7bd90d3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1713,12 +1713,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		unlock_page(new_page);
 		put_page(new_page);		/* Free it */
 
-		unlock_page(page);
+		/* Retake the callers reference and putback on LRU */
+		get_page(page);
 		putback_lru_page(page);
-
-		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
-		isolated = 0;
-		goto out;
+		mod_zone_page_state(page_zone(page),
+			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
+		goto out_fail;
 	}
 
 	/*
@@ -1735,9 +1735,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 	entry = pmd_mkhuge(entry);
 
-	page_add_new_anon_rmap(new_page, vma, haddr);
-
+	pmdp_clear_flush(vma, haddr, pmd);
 	set_pmd_at(mm, haddr, pmd, entry);
+	page_add_new_anon_rmap(new_page, vma, haddr);
 	update_mmu_cache_pmd(vma, address, &entry);
 	page_remove_rmap(page);
 	/*
@@ -1756,7 +1756,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
 	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
 
-out:
 	mod_zone_page_state(page_zone(page),
 			NR_ISOLATED_ANON + page_lru,
 			-HPAGE_PMD_NR);
@@ -1765,6 +1764,10 @@ out:
 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 out_dropref:
+	entry = pmd_mknonnuma(entry);
+	set_pmd_at(mm, haddr, pmd, entry);
+	update_mmu_cache_pmd(vma, address, &entry);
+
 	unlock_page(page);
 	put_page(page);
 	return 0;

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: Account for a THP NUMA hinting update as one PTE update
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:25   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  afcae2655b0ab67e65f161b1bb214efcfa1db415
Gitweb:     http://git.kernel.org/tip/afcae2655b0ab67e65f161b1bb214efcfa1db415
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:47 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:46 +0200

mm: Account for a THP NUMA hinting update as one PTE update

A THP PMD update is accounted for as 512 pages updated in vmstat. This is
a large difference when estimating the cost of automatic NUMA balancing and
can be misleading when comparing results that had collapsed versus split
THP. This patch addresses the accounting issue.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/mprotect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..2bbb648 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				split_huge_page_pmd(vma, addr, pmd);
 			else if (change_huge_pmd(vma, pmd, addr, newprot,
 						 prot_numa)) {
-				pages += HPAGE_PMD_NR;
+				pages++;
 				continue;
 			}
 			/* fall through */

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: Do not flush TLB during protection change if !pte_present && !migration_entry
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:25   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  e920e14ca29b0b2a981cfc90e4e20edd6f078d19
Gitweb:     http://git.kernel.org/tip/e920e14ca29b0b2a981cfc90e4e20edd6f078d19
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:48 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:48 +0200

mm: Do not flush TLB during protection change if !pte_present && !migration_entry

NUMA PTE scanning is expensive both in terms of the scanning itself and
the TLB flush if there are any updates. Currently non-present PTEs are
accounted for as an update, incurring a TLB flush even though one is only
necessary for anonymous migration entries. This patch addresses the
problem and should reduce TLB flushes.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-11-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/mprotect.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2bbb648..7bdbd4b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -101,8 +101,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				make_migration_entry_read(&entry);
 				set_pte_at(mm, addr, pte,
 					swp_entry_to_pte(entry));
+
+				pages++;
 			}
-			pages++;
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning
  2013-10-07 10:28   ` Mel Gorman
  (?)
@ 2013-10-09 17:25   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  f123d74abf91574837d14e5ea58f6a779a387bf5
Gitweb:     http://git.kernel.org/tip/f123d74abf91574837d14e5ea58f6a779a387bf5
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:49 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:49 +0200

mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning

NUMA PTE scanning is expensive both in terms of the scanning itself and
the TLB flush if there are any updates. The TLB flush is avoided if no
PTEs are updated but there is a bug where transhuge PMDs are considered
to be updated even if they were already pmd_numa. This patch addresses
the problem and TLB flushes should be reduced.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-12-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 19 ++++++++++++++++---
 mm/mprotect.c    | 14 ++++++++++----
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d4928769..de8d5cf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1458,6 +1458,12 @@ out:
 	return ret;
 }
 
+/*
+ * Returns
+ *  - 0 if PMD could not be locked
+ *  - 1 if PMD was locked but protections unchange and TLB flush unnecessary
+ *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
+ */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, pgprot_t newprot, int prot_numa)
 {
@@ -1466,9 +1472,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
 		pmd_t entry;
-		entry = pmdp_get_and_clear(mm, addr, pmd);
+		ret = 1;
 		if (!prot_numa) {
+			entry = pmdp_get_and_clear(mm, addr, pmd);
 			entry = pmd_modify(entry, newprot);
+			ret = HPAGE_PMD_NR;
 			BUG_ON(pmd_write(entry));
 		} else {
 			struct page *page = pmd_page(*pmd);
@@ -1476,12 +1484,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			/* only check non-shared pages */
 			if (page_mapcount(page) == 1 &&
 			    !pmd_numa(*pmd)) {
+				entry = pmdp_get_and_clear(mm, addr, pmd);
 				entry = pmd_mknuma(entry);
+				ret = HPAGE_PMD_NR;
 			}
 		}
-		set_pmd_at(mm, addr, pmd, entry);
+
+		/* Set PMD if cleared earlier */
+		if (ret == HPAGE_PMD_NR)
+			set_pmd_at(mm, addr, pmd, entry);
+
 		spin_unlock(&vma->vm_mm->page_table_lock);
-		ret = 1;
 	}
 
 	return ret;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7bdbd4b..2da33dc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -144,10 +144,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				split_huge_page_pmd(vma, addr, pmd);
-			else if (change_huge_pmd(vma, pmd, addr, newprot,
-						 prot_numa)) {
-				pages++;
-				continue;
+			else {
+				int nr_ptes = change_huge_pmd(vma, pmd, addr,
+						newprot, prot_numa);
+
+				if (nr_ptes) {
+					if (nr_ptes == HPAGE_PMD_NR)
+						pages++;
+
+					continue;
+				}
 			}
 			/* fall through */
 		}

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: numa: Do not migrate or account for hinting faults on the zero page
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:25   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  a1a46184e34cfd0764f06a54870defa052b0a094
Gitweb:     http://git.kernel.org/tip/a1a46184e34cfd0764f06a54870defa052b0a094
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:50 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:50 +0200

mm: numa: Do not migrate or account for hinting faults on the zero page

The zero page is not replicated between nodes and is often shared between
processes. The data is read-only and likely to be cached in local CPUs
if heavily accessed, meaning that the remote memory access cost is less
of a concern. This patch prevents trapping faults on the zero page. For
tasks using the zero page this will reduce the number of PTE updates,
TLB flushes and hinting faults.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ Correct use of is_huge_zero_page]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-13-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 10 +++++++++-
 mm/memory.c      |  1 +
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de8d5cf..8677dbf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,6 +1291,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_unlock;
 
 	page = pmd_page(pmd);
+	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
@@ -1481,8 +1482,15 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		} else {
 			struct page *page = pmd_page(*pmd);
 
-			/* only check non-shared pages */
+			/*
+			 * Only check non-shared pages. Do not trap faults
+			 * against the zero page. The read-only data is likely
+			 * to be read-cached on the local CPU cache and it is
+			 * less useful to know about local vs remote hits on
+			 * the zero page.
+			 */
 			if (page_mapcount(page) == 1 &&
+			    !is_huge_zero_page(page) &&
 			    !pmd_numa(*pmd)) {
 				entry = pmdp_get_and_clear(mm, addr, pmd);
 				entry = pmd_mknuma(entry);
diff --git a/mm/memory.c b/mm/memory.c
index 42ae82e..ed51f15 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3564,6 +3564,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(ptep, ptl);
 		return 0;
 	}
+	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Mitigate chance that same task always updates PTEs
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:26   ` tip-bot for Peter Zijlstra
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, srikar, aarcange,
	mgorman, tglx

Commit-ID:  19a78d110d7a8045aeb90d38ee8fe9743ce88c2d
Gitweb:     http://git.kernel.org/tip/19a78d110d7a8045aeb90d38ee8fe9743ce88c2d
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:28:51 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:56 +0200

sched/numa: Mitigate chance that same task always updates PTEs

With a trace_printk("working\n"); right after the cmpxchg in
task_numa_work() we can see that, for a 4 thread process, it's always the
same task winning the race and doing the protection change.

This is a problem since the task doing the protection change has a
penalty for taking faults -- it is busy when marking the PTEs. If it's
always the same task, the ->numa_faults[] statistics get severely skewed.

Avoid this by delaying the task doing the protection change such that
it is unlikely to win the privilege again.

Before:

root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
      thread 0/0-3232  [022] ....   212.787402: task_numa_work: working
      thread 0/0-3232  [022] ....   212.888473: task_numa_work: working
      thread 0/0-3232  [022] ....   212.989538: task_numa_work: working
      thread 0/0-3232  [022] ....   213.090602: task_numa_work: working
      thread 0/0-3232  [022] ....   213.191667: task_numa_work: working
      thread 0/0-3232  [022] ....   213.292734: task_numa_work: working
      thread 0/0-3232  [022] ....   213.393804: task_numa_work: working
      thread 0/0-3232  [022] ....   213.494869: task_numa_work: working
      thread 0/0-3232  [022] ....   213.596937: task_numa_work: working
      thread 0/0-3232  [022] ....   213.699000: task_numa_work: working
      thread 0/0-3232  [022] ....   213.801067: task_numa_work: working
      thread 0/0-3232  [022] ....   213.903155: task_numa_work: working
      thread 0/0-3232  [022] ....   214.005201: task_numa_work: working
      thread 0/0-3232  [022] ....   214.107266: task_numa_work: working
      thread 0/0-3232  [022] ....   214.209342: task_numa_work: working

After:

root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
      thread 0/0-3253  [005] ....   136.865051: task_numa_work: working
      thread 0/2-3255  [026] ....   136.965134: task_numa_work: working
      thread 0/3-3256  [024] ....   137.065217: task_numa_work: working
      thread 0/3-3256  [024] ....   137.165302: task_numa_work: working
      thread 0/3-3256  [024] ....   137.265382: task_numa_work: working
      thread 0/0-3253  [004] ....   137.366465: task_numa_work: working
      thread 0/2-3255  [026] ....   137.466549: task_numa_work: working
      thread 0/0-3253  [004] ....   137.566629: task_numa_work: working
      thread 0/0-3253  [004] ....   137.666711: task_numa_work: working
      thread 0/1-3254  [028] ....   137.766799: task_numa_work: working
      thread 0/0-3253  [004] ....   137.866876: task_numa_work: working
      thread 0/2-3255  [026] ....   137.966960: task_numa_work: working
      thread 0/1-3254  [028] ....   138.067041: task_numa_work: working
      thread 0/2-3255  [026] ....   138.167123: task_numa_work: working
      thread 0/3-3256  [024] ....   138.267207: task_numa_work: working

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-14-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 817cd7b..573d815e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -946,6 +946,12 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	/*
+	 * Delay this task enough that another task of this mm will likely win
+	 * the next time around.
+	 */
+	p->node_stamp += 2 * TICK_NSEC;
+
+	/*
 	 * Do not set pte_numa if the current running node is rate-limited.
 	 * This loses statistics on the fault but if we are unwilling to
 	 * migrate to this node, it is less likely we can do useful work
@@ -1026,7 +1032,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	if (now - curr->node_stamp > period) {
 		if (!curr->node_stamp)
 			curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
-		curr->node_stamp = now;
+		curr->node_stamp += period;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
 			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */

^ permalink raw reply related	[flat|nested] 340+ messages in thread
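
The instrumentation mentioned in the changelog above is nothing more than a
temporary trace_printk() placed after the point where a thread wins the
right to do the scan. A paraphrased fragment (not part of the patch, and
the surrounding variable names are assumed rather than taken from the
posted code) would look like:

	/* in task_numa_work(): only the thread that wins the cmpxchg on
	 * mm->numa_next_scan goes on to mark PTEs, so tracing here shows
	 * which thread keeps paying for the protection change */
	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
		return;
	trace_printk("working\n");	/* temporary debug aid only */

The traces quoted in the changelog were then pulled out with the grep over
/debug/tracing/trace shown above.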

* [tip:sched/core] sched/numa: Continue PTE scanning even if migrate rate limited
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:26   ` tip-bot for Peter Zijlstra
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, srikar, aarcange,
	mgorman, tglx

Commit-ID:  9e645ab6d089f5822479a833c6977c785bcfffe3
Gitweb:     http://git.kernel.org/tip/9e645ab6d089f5822479a833c6977c785bcfffe3
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:28:52 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:09 +0200

sched/numa: Continue PTE scanning even if migrate rate limited

Avoiding marking PTEs pte_numa because a particular NUMA node is migrate
rate limited seems like a bad idea. Even if this node can't migrate any
more, other nodes might, and we want up-to-date information to make
balancing decisions. We already rate limit the actual migrations; this
should leave enough bandwidth to allow the non-migrating scanning. I think
it's important we keep up-to-date information if we're going to do
placement based on it.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-15-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 573d815e..464207f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -951,14 +951,6 @@ void task_numa_work(struct callback_head *work)
 	 */
 	p->node_stamp += 2 * TICK_NSEC;
 
-	/*
-	 * Do not set pte_numa if the current running node is rate-limited.
-	 * This loses statistics on the fault but if we are unwilling to
-	 * migrate to this node, it is less likely we can do useful work
-	 */
-	if (migrate_ratelimited(numa_node_id()))
-		return;
-
 	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node"
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:26   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  b726b7dfb400c937546fa91cf8523dcb1aa2fc6e
Gitweb:     http://git.kernel.org/tip/b726b7dfb400c937546fa91cf8523dcb1aa2fc6e
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:53 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:17 +0200

Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node"

PTE scanning and NUMA hinting fault handling is expensive so commit
5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
on a new node") deferred the PTE scan until a task had been scheduled on
another node. The problem is that in the purely shared memory case this
may never happen and no NUMA hinting fault information will be captured.
We are not ruling out the possibility that something better can be done
here but for now, this patch needs to be reverted and we depend entirely
on the scan_delay to avoid punishing short-lived processes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm_types.h | 10 ----------
 kernel/fork.c            |  3 ---
 kernel/sched/fair.c      | 18 ------------------
 kernel/sched/features.h  |  4 +---
 4 files changed, 1 insertion(+), 34 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d9851ee..b7adf1d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -428,20 +428,10 @@ struct mm_struct {
 
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
-
-	/*
-	 * The first node a task was scheduled on. If a task runs on
-	 * a different node than Make PTE Scan Go Now.
-	 */
-	int first_nid;
 #endif
 	struct uprobes_state uprobes_state;
 };
 
-/* first nid will either be a valid NID or one of these values */
-#define NUMA_PTE_SCAN_INIT	-1
-#define NUMA_PTE_SCAN_ACTIVE	-2
-
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
 #ifdef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/fork.c b/kernel/fork.c
index 086fe73..7192d91 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -817,9 +817,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
-#ifdef CONFIG_NUMA_BALANCING
-	mm->first_nid = NUMA_PTE_SCAN_INIT;
-#endif
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 464207f..49b11fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,24 +901,6 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	/*
-	 * We do not care about task placement until a task runs on a node
-	 * other than the first one used by the address space. This is
-	 * largely because migrations are driven by what CPU the task
-	 * is running on. If it's never scheduled on another node, it'll
-	 * not migrate so why bother trapping the fault.
-	 */
-	if (mm->first_nid == NUMA_PTE_SCAN_INIT)
-		mm->first_nid = numa_node_id();
-	if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
-		/* Are we running on a new node yet? */
-		if (numa_node_id() == mm->first_nid &&
-		    !sched_feat_numa(NUMA_FORCE))
-			return;
-
-		mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
-	}
-
-	/*
 	 * Reset the scan period if enough time has gone by. Objective is that
 	 * scanning will be reduced if pages are properly placed. As tasks
 	 * can enter different phases this needs to be re-examined. Lacking
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 99399f8..cba5c61 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,10 +63,8 @@ SCHED_FEAT(LB_MIN, false)
 /*
  * Apply the automatic NUMA scheduling policy. Enabled automatically
  * at runtime if running on a NUMA machine. Can be controlled via
- * numa_balancing=. Allow PTE scanning to be forced on UMA machines
- * for debugging the core machinery.
+ * numa_balancing=
  */
 #ifdef CONFIG_NUMA_BALANCING
 SCHED_FEAT(NUMA,	false)
-SCHED_FEAT(NUMA_FORCE,	false)
 #endif

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Initialise numa_next_scan properly
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:26   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  7e8d16b6cbccb2f5da579f5085479fb82ba851b8
Gitweb:     http://git.kernel.org/tip/7e8d16b6cbccb2f5da579f5085479fb82ba851b8
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:54 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:19 +0200

sched/numa: Initialise numa_next_scan properly

Scan delay logic and resets are currently initialised to start scanning
immediately instead of delaying properly. Initialise them properly at
fork time and catch when a new mm has been allocated.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-17-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 4 ++--
 kernel/sched/fair.c | 7 +++++++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f575d5b..aee7e4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1624,8 +1624,8 @@ static void __sched_fork(struct task_struct *p)
 
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
-		p->mm->numa_next_scan = jiffies;
-		p->mm->numa_next_reset = jiffies;
+		p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		p->mm->numa_next_reset = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49b11fa..0966f0c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -900,6 +900,13 @@ void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	if (!mm->numa_next_reset || !mm->numa_next_scan) {
+		mm->numa_next_scan = now +
+			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		mm->numa_next_reset = now +
+			msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
+	}
+
 	/*
 	 * Reset the scan period if enough time has gone by. Objective is that
 	 * scanning will be reduced if pages are properly placed. As tasks

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Set the scan rate proportional to the memory usage of the task being scanned
  2013-10-07 10:28   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:26   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  598f0ec0bc996e90a806ee9564af919ea5aad401
Gitweb:     http://git.kernel.org/tip/598f0ec0bc996e90a806ee9564af919ea5aad401
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:55 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:20 +0200

sched/numa: Set the scan rate proportional to the memory usage of the task being scanned

The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size. This scan rate is independent of the size
of the task and as an aside it is further complicated by the fact that
numa_balancing_scan_size controls how many pages are marked pte_numa and
not how much virtual memory is scanned.

In combination, it is almost impossible to meaningfully tune the min and
max scan periods, and reasoning about performance is complex when the time
to complete a full scan is partially a function of the task's memory
size. This patch alters the semantics of the min and max tunables so that they
tune the length of time it takes to complete a scan of a task's occupied
virtual address space. Conceptually this is a lot easier to understand. There
is a "sanity" check, based on the amount of virtual memory that should be
scanned in a second, to ensure the scan rate is never extremely fast. The
default of 2.5G seems arbitrary but it was chosen so that the maximum scan
rate after the patch roughly matches the maximum scan rate before the patch.

On a similar note, numa_scan_period is in milliseconds and not
jiffies. Properly placed pages slow the scanning rate, but adding 10 jiffies
to numa_scan_period means that the rate at which scanning slows depends on HZ,
which is confusing. Get rid of the jiffies_to_msecs conversion and treat it as ms.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/sysctl/kernel.txt | 11 +++---
 include/linux/sched.h           |  1 +
 kernel/sched/fair.c             | 88 +++++++++++++++++++++++++++++++++++------
 3 files changed, 83 insertions(+), 17 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 1428c66..8cd7e5f 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -403,15 +403,16 @@ workload pattern changes and minimises performance impact due to remote
 memory accesses. These sysctls control the thresholds for scan delays and
 the number of pages scanned.
 
-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
 
 numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
 when it initially forks.
 
-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
 
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2ac5285..fdcb4c8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1339,6 +1339,7 @@ struct task_struct {
 	int numa_scan_seq;
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
+	unsigned int numa_scan_period_max;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0966f0c..e08d757 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -818,11 +818,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the tasks virtual memory size and
+ * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
-unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_max = 60000;
+unsigned int sysctl_numa_balancing_scan_period_reset = 60000;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -830,6 +832,51 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+	unsigned long rss = 0;
+	unsigned long nr_scan_pages;
+
+	/*
+	 * Calculations based on RSS as non-present and empty pages are skipped
+	 * by the PTE scanner and NUMA hinting faults should be trapped based
+	 * on resident pages
+	 */
+	nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+	rss = get_mm_rss(p->mm);
+	if (!rss)
+		rss = nr_scan_pages;
+
+	rss = round_up(rss, nr_scan_pages);
+	return rss / nr_scan_pages;
+}
+
+/* For sanitys sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */
+#define MAX_SCAN_WINDOW 2560
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+	unsigned int scan, floor;
+	unsigned int windows = 1;
+
+	if (sysctl_numa_balancing_scan_size < MAX_SCAN_WINDOW)
+		windows = MAX_SCAN_WINDOW / sysctl_numa_balancing_scan_size;
+	floor = 1000 / windows;
+
+	scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+	return max_t(unsigned int, floor, scan);
+}
+
+static unsigned int task_scan_max(struct task_struct *p)
+{
+	unsigned int smin = task_scan_min(p);
+	unsigned int smax;
+
+	/* Watch for min being lower than max due to floor calculations */
+	smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+	return max(smin, smax);
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq;
@@ -840,6 +887,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_scan_period_max = task_scan_max(p);
 
 	/* FIXME: Scheduling placement policy hints go here */
 }
@@ -860,9 +908,14 @@ void task_numa_fault(int node, int pages, bool migrated)
 	 * If pages are properly placed (did not migrate) then scan slower.
 	 * This is reset periodically in case of phase changes
 	 */
-        if (!migrated)
-		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
-			p->numa_scan_period + jiffies_to_msecs(10));
+	if (!migrated) {
+		/* Initialise if necessary */
+		if (!p->numa_scan_period_max)
+			p->numa_scan_period_max = task_scan_max(p);
+
+		p->numa_scan_period = min(p->numa_scan_period_max,
+			p->numa_scan_period + 10);
+	}
 
 	task_numa_placement(p);
 }
@@ -884,6 +937,7 @@ void task_numa_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
+	unsigned long nr_pte_updates = 0;
 	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -915,7 +969,7 @@ void task_numa_work(struct callback_head *work)
 	 */
 	migrate = mm->numa_next_reset;
 	if (time_after(now, migrate)) {
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+		p->numa_scan_period = task_scan_min(p);
 		next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
 		xchg(&mm->numa_next_reset, next_scan);
 	}
@@ -927,8 +981,10 @@ void task_numa_work(struct callback_head *work)
 	if (time_before(now, migrate))
 		return;
 
-	if (p->numa_scan_period == 0)
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+	if (p->numa_scan_period == 0) {
+		p->numa_scan_period_max = task_scan_max(p);
+		p->numa_scan_period = task_scan_min(p);
+	}
 
 	next_scan = now + msecs_to_jiffies(p->numa_scan_period);
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
@@ -965,7 +1021,15 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			pages -= change_prot_numa(vma, start, end);
+			nr_pte_updates += change_prot_numa(vma, start, end);
+
+			/*
+			 * Scan sysctl_numa_balancing_scan_size but ensure that
+			 * at least one PTE is updated so that unused virtual
+			 * address space is quickly skipped.
+			 */
+			if (nr_pte_updates)
+				pages -= (end - start) >> PAGE_SHIFT;
 
 			start = end;
 			if (pages <= 0)
@@ -1012,7 +1076,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 
 	if (now - curr->node_stamp > period) {
 		if (!curr->node_stamp)
-			curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+			curr->numa_scan_period = task_scan_min(curr);
 		curr->node_stamp += period;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {

^ permalink raw reply related	[flat|nested] 340+ messages in thread
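
As a rough illustration of the new scan-period semantics, the following
standalone C sketch reimplements the task_nr_scan_windows()/task_scan_min()
arithmetic in userspace. The constants mirror the defaults introduced by the
patch above; the 4K page size and the RSS figures in main() are assumptions
chosen for the example, not values taken from the series.

/*
 * Standalone sketch of the scan-period arithmetic above. Illustrative only.
 */
#include <stdio.h>

#define PAGE_SHIFT		12	/* assume 4K pages */
#define SCAN_SIZE_MB		256	/* numa_balancing_scan_size_mb */
#define SCAN_PERIOD_MIN_MS	1000	/* numa_balancing_scan_period_min_ms */
#define MAX_SCAN_WINDOW		2560	/* MB/sec sanity cap */

static unsigned long nr_scan_windows(unsigned long rss_pages)
{
	unsigned long nr_scan_pages = (unsigned long)SCAN_SIZE_MB << (20 - PAGE_SHIFT);

	if (!rss_pages)
		rss_pages = nr_scan_pages;
	/* round up so a partial window still counts as one */
	return (rss_pages + nr_scan_pages - 1) / nr_scan_pages;
}

static unsigned long scan_min_ms(unsigned long rss_pages)
{
	unsigned long windows = MAX_SCAN_WINDOW / SCAN_SIZE_MB;	/* 10 */
	unsigned long floor = 1000 / windows;			/* 100ms floor */
	unsigned long scan = SCAN_PERIOD_MIN_MS / nr_scan_windows(rss_pages);

	return scan > floor ? scan : floor;
}

int main(void)
{
	/* 512MB resident: 2 windows, so 500ms between 256MB scan windows */
	printf("512MB RSS -> %lums\n", scan_min_ms(512UL << (20 - PAGE_SHIFT)));
	/* 64GB resident: 256 windows, clamped by the 2.5GB/sec floor */
	printf("64GB RSS  -> %lums\n", scan_min_ms(64UL << (30 - PAGE_SHIFT)));
	return 0;
}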

* [tip:sched/core] sched/numa: Slow scan rate if no NUMA hinting faults are being recorded
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-09 17:26   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  f307cd1a32fab53012b01749a1f5ba10b0a7243f
Gitweb:     http://git.kernel.org/tip/f307cd1a32fab53012b01749a1f5ba10b0a7243f
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:56 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:21 +0200

sched/numa: Slow scan rate if no NUMA hinting faults are being recorded

NUMA PTE scanning is slowed if a NUMA hinting fault was trapped and no page
was migrated. For long-lived but idle processes there may be no faults,
but the scan rate will remain high and just waste CPU. This patch slows
the scan rate for processes that are not trapping faults.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-19-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e08d757..c6c3302 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1039,6 +1039,18 @@ void task_numa_work(struct callback_head *work)
 
 out:
 	/*
+	 * If the whole process was scanned without updates then no NUMA
+	 * hinting faults are being recorded and scan rate should be lower.
+	 */
+	if (mm->numa_scan_offset == 0 && !nr_pte_updates) {
+		p->numa_scan_period = min(p->numa_scan_period_max,
+			p->numa_scan_period << 1);
+
+		next_scan = now + msecs_to_jiffies(p->numa_scan_period);
+		mm->numa_next_scan = next_scan;
+	}
+
+	/*
 	 * It is possible to reach the end of the VMA list but the last few
 	 * VMAs are not guaranteed to the vma_migratable. If they are not, we
 	 * would find the !migratable VMA on the next scan but not reset the

^ permalink raw reply related	[flat|nested] 340+ messages in thread
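
A minimal sketch of the backoff behaviour added above, assuming the default
1000ms/60000ms period bounds: every full pass that records no PTE updates
doubles the scan period until it hits the per-task maximum. This is an
illustrative userspace program, not kernel code.

/* Sketch of the no-PTE-updates backoff above. Illustrative only. */
#include <stdio.h>

int main(void)
{
	unsigned int period = 1000;		/* e.g. task_scan_min() */
	unsigned int period_max = 60000;	/* e.g. task_scan_max() */
	int pass;

	for (pass = 1; pass <= 7; pass++) {
		/* pretend this pass recorded zero PTE updates */
		period = period << 1;
		if (period > period_max)
			period = period_max;
		printf("pass %d: next scan in %ums\n", pass, period);
	}
	return 0;
}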

* [tip:sched/core] sched/numa: Track NUMA hinting faults on per-node basis
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-09 17:27   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  f809ca9a554dda49fb264c79e31c722e0b063ff8
Gitweb:     http://git.kernel.org/tip/f809ca9a554dda49fb264c79e31c722e0b063ff8
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:57 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:22 +0200

sched/numa: Track NUMA hinting faults on per-node basis

This patch tracks which nodes NUMA hinting faults were incurred on.
The information is later used to schedule a task on the node storing
the pages most frequently faulted on by the task.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 11 ++++++++++-
 kernel/sched/sched.h  | 12 ++++++++++++
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fdcb4c8..a810e95 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1342,6 +1342,8 @@ struct task_struct {
 	unsigned int numa_scan_period_max;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+
+	unsigned long *numa_faults;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aee7e4d..6808d35 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1634,6 +1634,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
+	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
@@ -1892,6 +1893,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		task_numa_free(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6c3302..0bb3e0a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -902,7 +902,14 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (!numabalancing_enabled)
 		return;
 
-	/* FIXME: Allocate task-specific structure for placement policy here */
+	/* Allocate buffer to track faults on a per-node basis */
+	if (unlikely(!p->numa_faults)) {
+		int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+		p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
+		if (!p->numa_faults)
+			return;
+	}
 
 	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
@@ -918,6 +925,8 @@ void task_numa_fault(int node, int pages, bool migrated)
 	}
 
 	task_numa_placement(p);
+
+	p->numa_faults[node] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e82484d..199099c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
 #include <linux/tick.h>
+#include <linux/slab.h>
 
 #include "cpupri.h"
 #include "cpuacct.h"
@@ -555,6 +556,17 @@ static inline u64 rq_clock_task(struct rq *rq)
 	return rq->clock_task;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline void task_numa_free(struct task_struct *p)
+{
+	kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Select a preferred node with the most numa hinting faults
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-09 17:27   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  688b7585d16ab57a17aa4422a3b290b3a55fa679
Gitweb:     http://git.kernel.org/tip/688b7585d16ab57a17aa4422a3b290b3a55fa679
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:58 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:23 +0200

sched/numa: Select a preferred node with the most numa hinting faults

This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-21-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 17 +++++++++++++++--
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a810e95..b1fc75e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1344,6 +1344,7 @@ struct task_struct {
 	struct callback_head numa_work;
 
 	unsigned long *numa_faults;
+	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6808d35..d15cd70 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1633,6 +1633,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0bb3e0a..9efd34f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -879,7 +879,8 @@ static unsigned int task_scan_max(struct task_struct *p)
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq;
+	int seq, nid, max_nid = -1;
+	unsigned long max_faults = 0;
 
 	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
 		return;
@@ -889,7 +890,19 @@ static void task_numa_placement(struct task_struct *p)
 	p->numa_scan_seq = seq;
 	p->numa_scan_period_max = task_scan_max(p);
 
-	/* FIXME: Scheduling placement policy hints go here */
+	/* Find the node with the highest number of faults */
+	for_each_online_node(nid) {
+		unsigned long faults = p->numa_faults[nid];
+		p->numa_faults[nid] >>= 1;
+		if (faults > max_faults) {
+			max_faults = faults;
+			max_nid = nid;
+		}
+	}
+
+	/* Update the tasks preferred node if necessary */
+	if (max_faults && max_nid != p->numa_preferred_nid)
+		p->numa_preferred_nid = max_nid;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Update NUMA hinting faults once per scan
  2013-10-07 10:28   ` Mel Gorman
@ 2013-10-09 17:27   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  745d61476ddb737aad3495fa6d9a8f8c2ee59f86
Gitweb:     http://git.kernel.org/tip/745d61476ddb737aad3495fa6d9a8f8c2ee59f86
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:59 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:25 +0200

sched/numa: Update NUMA hinting faults once per scan

NUMA hinting fault counts and placement decisions are both recorded in the
same array, which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay, creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artificially high
count due to very recent faulting activity, and may create a bouncing effect.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-22-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h | 13 +++++++++++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 16 +++++++++++++---
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b1fc75e..a463bc3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1343,7 +1343,20 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	/*
+	 * Exponential decaying average of faults on a per-node basis.
+	 * Scheduling placement decisions are made based on the these counts.
+	 * The values remain static for the duration of a PTE scan
+	 */
 	unsigned long *numa_faults;
+
+	/*
+	 * numa_faults_buffer records faults per node during the current
+	 * scan window. When the scan completes, the counts in numa_faults
+	 * decay and these values are copied.
+	 */
+	unsigned long *numa_faults_buffer;
+
 	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d15cd70..064a0af 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1636,6 +1636,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
+	p->numa_faults_buffer = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9efd34f..3abc651 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -892,8 +892,14 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
-		unsigned long faults = p->numa_faults[nid];
+		unsigned long faults;
+
+		/* Decay existing window and copy faults since last scan */
 		p->numa_faults[nid] >>= 1;
+		p->numa_faults[nid] += p->numa_faults_buffer[nid];
+		p->numa_faults_buffer[nid] = 0;
+
+		faults = p->numa_faults[nid];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -919,9 +925,13 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * nr_node_ids;
 
-		p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
+		/* numa_faults and numa_faults_buffer share the allocation */
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
 		if (!p->numa_faults)
 			return;
+
+		BUG_ON(p->numa_faults_buffer);
+		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
 	}
 
 	/*
@@ -939,7 +949,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 
 	task_numa_placement(p);
 
-	p->numa_faults[node] += pages;
+	p->numa_faults_buffer[node] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)

^ permalink raw reply related	[flat|nested] 340+ messages in thread
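
The decay-and-copy step can be illustrated with a small standalone sketch.
The two-node system and the per-window fault counts below are assumptions
chosen to show the decaying average gradually following a working set that
moves from node 0 to node 1; only the shift-and-add in the inner loop mirrors
the patch.

/* Sketch of the numa_faults decay and buffer fold-in. Illustrative only. */
#include <stdio.h>

#define NR_NODES 2

int main(void)
{
	unsigned long numa_faults[NR_NODES] = { 0, 0 };
	unsigned long buffer[][NR_NODES] = {		/* faults per scan window */
		{ 400, 0 }, { 400, 0 }, { 0, 400 }, { 0, 400 },
	};
	int scan, nid;

	for (scan = 0; scan < 4; scan++) {
		for (nid = 0; nid < NR_NODES; nid++) {
			numa_faults[nid] >>= 1;			/* decay */
			numa_faults[nid] += buffer[scan][nid];	/* fold window in */
		}
		printf("scan %d: node0=%lu node1=%lu\n",
		       scan, numa_faults[0], numa_faults[1]);
	}
	return 0;
}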

* [tip:sched/core] sched/numa: Favour moving tasks towards the preferred node
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:27   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  3a7053b3224f4a8b0e8184166190076593621617
Gitweb:     http://git.kernel.org/tip/3a7053b3224f4a8b0e8184166190076593621617
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:00 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:26 +0200

sched/numa: Favour moving tasks towards the preferred node

This patch favours moving tasks towards the NUMA node that recorded a higher
number of NUMA faults during active load balancing. Ideally this is
self-reinforcing: the longer the task runs on that node, the more faults
it should incur, causing task_numa_placement to keep the task running on that
node. In reality a big weakness is that the node's CPUs can be overloaded
and it would be more efficient to queue tasks on an idle node and migrate
to the new node. This would require additional smarts in the balancer, so
for now the balancer will simply prefer to place the task on the preferred
node for a number of PTE scans controlled by the numa_balancing_settle_count
sysctl. Once settle_count scans have completed, the scheduler
is free to place the task on an alternative node if the load is imbalanced.

[srikar@linux.vnet.ibm.com: Fixed statistics]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ Tunable and use higher faults instead of preferred. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/sysctl/kernel.txt |  8 +++++-
 include/linux/sched.h           |  1 +
 kernel/sched/core.c             |  3 +-
 kernel/sched/fair.c             | 63 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h         |  7 +++++
 kernel/sysctl.c                 |  7 +++++
 6 files changed, 83 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 8cd7e5f..d48bca4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -375,7 +375,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
 
 ==============================================================
 
@@ -420,6 +421,11 @@ scanned for a given scan.
 numa_balancing_scan_period_reset is a blunt instrument that controls how
 often a tasks scan delay is reset to detect sudden changes in task behaviour.
 
+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a463bc3..aecdc5a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -777,6 +777,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 064a0af..b7e6b6f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1631,7 +1631,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_migrate_seq = 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
@@ -5656,6 +5656,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3abc651..6ffddca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -877,6 +877,15 @@ static unsigned int task_scan_max(struct task_struct *p)
 	return max(smin, smax);
 }
 
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -888,6 +897,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_migrate_seq++;
 	p->numa_scan_period_max = task_scan_max(p);
 
 	/* Find the node with the highest number of faults */
@@ -907,8 +917,10 @@ static void task_numa_placement(struct task_struct *p)
 	}
 
 	/* Update the tasks preferred node if necessary */
-	if (max_faults && max_nid != p->numa_preferred_nid)
+	if (max_faults && max_nid != p->numa_preferred_nid) {
 		p->numa_preferred_nid = max_nid;
+		p->numa_migrate_seq = 0;
+	}
 }
 
 /*
@@ -4071,6 +4083,38 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults ||
+	    !(env->sd->flags & SD_NUMA)) {
+		return false;
+	}
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (dst_nid == p->numa_preferred_nid ||
+	    p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+		return true;
+
+	return false;
+}
+#else
+static inline bool migrate_improves_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -4128,11 +4172,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) task is cache cold, or
-	 * 2) too many balance attempts have failed.
+	 * 1) destination numa is preferred
+	 * 2) task is cache cold, or
+	 * 3) too many balance attempts have failed.
 	 */
-
 	tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+
+	if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+		if (tsk_cache_hot) {
+			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+			schedstat_inc(p, se.statistics.nr_forced_migrations);
+		}
+#endif
+		return 1;
+	}
+
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index cba5c61..d9278ce 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -67,4 +67,11 @@ SCHED_FEAT(LB_MIN, false)
  */
 #ifdef CONFIG_NUMA_BALANCING
 SCHED_FEAT(NUMA,	false)
+
+/*
+ * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a
+ * higher number of hinting faults are recorded during active load
+ * balancing.
+ */
+SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
 #endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2f06f3..42f616a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -391,6 +391,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_settle_count",
+		.data           = &sysctl_numa_balancing_settle_count,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{

^ permalink raw reply related	[flat|nested] 340+ messages in thread
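
A simplified standalone sketch of the locality check introduced above: a
migration is considered to improve locality if the destination is the
preferred node or has recorded more hinting faults, but only until
settle_count scans have completed. The struct and the fault numbers are
assumptions for illustration and do not match the kernel types.

/* Sketch of migrate_improves_locality(). Illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define SETTLE_COUNT 3		/* numa_balancing_settle_count default */

struct task {
	int preferred_nid;
	int migrate_seq;		/* scans since the preferred nid changed */
	unsigned long faults[4];	/* per-node hinting fault counts */
};

static bool migrate_improves_locality(struct task *p, int src_nid, int dst_nid)
{
	if (src_nid == dst_nid || p->migrate_seq >= SETTLE_COUNT)
		return false;

	return dst_nid == p->preferred_nid ||
	       p->faults[dst_nid] > p->faults[src_nid];
}

int main(void)
{
	struct task t = {
		.preferred_nid = 1,
		.migrate_seq = 0,
		.faults = { 10, 50, 5, 0 },
	};

	/* freshly selected preferred node: pull the task towards node 1 */
	printf("move 0->1: %d\n", migrate_improves_locality(&t, 0, 1));
	/* after settle_count scans the balancer stops pushing */
	t.migrate_seq = SETTLE_COUNT;
	printf("move 0->1: %d\n", migrate_improves_locality(&t, 0, 1));
	return 0;
}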

* [tip:sched/core] sched/numa: Resist moving tasks towards nodes with fewer hinting faults
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:27   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  7a0f308337d11fd5caa9f845c6d08cc5d6067988
Gitweb:     http://git.kernel.org/tip/7a0f308337d11fd5caa9f845c6d08cc5d6067988
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:01 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:27 +0200

sched/numa: Resist moving tasks towards nodes with fewer hinting faults

Just as "sched: Favour moving tasks towards the preferred node" favours
moving tasks towards nodes with a higher number of recorded NUMA hinting
faults, this patch resists moving tasks towards nodes with lower faults.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-24-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c     | 33 +++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  8 ++++++++
 2 files changed, 41 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ffddca..8943124 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4107,12 +4107,43 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 
 	return false;
 }
+
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
+		return false;
+
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+		return true;
+
+	return false;
+}
+
 #else
 static inline bool migrate_improves_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
 	return false;
 }
+
+static inline bool migrate_degrades_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
 #endif
 
 /*
@@ -4177,6 +4208,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) too many balance attempts have failed.
 	 */
 	tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+	if (!tsk_cache_hot)
+		tsk_cache_hot = migrate_degrades_locality(p, env);
 
 	if (migrate_improves_locality(p, env)) {
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d9278ce..5716929 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -74,4 +74,12 @@ SCHED_FEAT(NUMA,	false)
  * balancing.
  */
 SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
+
+/*
+ * NUMA_RESIST_LOWER will resist moving tasks towards nodes where a
+ * lower number of hinting faults have been recorded. As this has
+ * the potential to prevent a task ever migrating to a new node
+ * due to CPU overload it is disabled by default.
+ */
+SCHED_FEAT(NUMA_RESIST_LOWER, false)
 #endif

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Reschedule task on preferred NUMA node once selected
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:27   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  e6628d5b0a2979f3e0ee6f7783ede5df50cb9ede
Gitweb:     http://git.kernel.org/tip/e6628d5b0a2979f3e0ee6f7783ede5df50cb9ede
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:02 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:28 +0200

sched/numa: Reschedule task on preferred NUMA node once selected

A preferred node is selected based on the node the most NUMA hinting
faults were incurred on. There is no guarantee that the task is running
on that node at the time, so this patch reschedules the task to run on
the most idle CPU of the preferred node as soon as it is selected. This avoids
waiting for the balancer to make a decision.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-25-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 19 +++++++++++++++++++
 kernel/sched/fair.c  | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  1 +
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7e6b6f..66b878e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4348,6 +4348,25 @@ fail:
 	return ret;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Migrate current task p to target_cpu */
+int migrate_task_to(struct task_struct *p, int target_cpu)
+{
+	struct migration_arg arg = { p, target_cpu };
+	int curr_cpu = task_cpu(p);
+
+	if (curr_cpu == target_cpu)
+		return 0;
+
+	if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
+		return -EINVAL;
+
+	/* TODO: This is not properly updating schedstats */
+
+	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
 /*
  * migration_cpu_stop - this will be executed by a highprio stopper thread
  * and performs thread migration by bumping thread off CPU then
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8943124..8b15e9e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,31 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static unsigned long weighted_cpuload(const int cpu);
+
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int i, idlest_cpu = this_cpu;
+
+	BUG_ON(cpu_to_node(this_cpu) == nid);
+
+	rcu_read_lock();
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		load = weighted_cpuload(i);
+
+		if (load < min_load) {
+			min_load = load;
+			idlest_cpu = i;
+		}
+	}
+	rcu_read_unlock();
+
+	return idlest_cpu;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -916,10 +941,29 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/* Update the tasks preferred node if necessary */
+	/*
+	 * Record the preferred node as the node with the most faults,
+	 * requeue the task to be running on the idlest CPU on the
+	 * preferred node and reset the scanning rate to recheck
+	 * the working set placement.
+	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
+		int preferred_cpu;
+
+		/*
+		 * If the task is not on the preferred node then find the most
+		 * idle CPU to migrate to.
+		 */
+		preferred_cpu = task_cpu(p);
+		if (cpu_to_node(preferred_cpu) != max_nid) {
+			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+							     max_nid);
+		}
+
+		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 0;
+		migrate_task_to(p, preferred_cpu);
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 199099c..66458c9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -557,6 +557,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+extern int migrate_task_to(struct task_struct *p, int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Add infrastructure for split shared/private accounting of NUMA hinting faults
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:28   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  ac8e895bd260cb8bb19ade6a3abd44e7abe9a01d
Gitweb:     http://git.kernel.org/tip/ac8e895bd260cb8bb19ade6a3abd44e7abe9a01d
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:03 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:30 +0200

sched/numa: Add infrastructure for split shared/private accounting of NUMA hinting faults

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared.  This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-26-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  5 +++--
 kernel/sched/fair.c   | 46 +++++++++++++++++++++++++++++++++++-----------
 mm/huge_memory.c      |  5 +++--
 mm/memory.c           |  8 ++++++--
 4 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index aecdc5a..d946195 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1445,10 +1445,11 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages,
+				   bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b15e9e..89eeb89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,20 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
+static inline unsigned long task_faults(struct task_struct *p, int nid)
+{
+	if (!p->numa_faults)
+		return 0;
+
+	return p->numa_faults[task_faults_idx(nid, 0)] +
+		p->numa_faults[task_faults_idx(nid, 1)];
+}
+
 static unsigned long weighted_cpuload(const int cpu);
 
 
@@ -928,13 +942,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
 
-		faults = p->numa_faults[nid];
+			/* Decay existing window, copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
+
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -970,16 +990,20 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv;
 
 	if (!numabalancing_enabled)
 		return;
 
+	/* For now, do not attempt to detect private/shared accesses */
+	priv = 1;
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
 		p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -987,7 +1011,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -1005,7 +1029,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 
 	task_numa_placement(p);
 
-	p->numa_faults_buffer[node] += pages;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
@@ -4146,7 +4170,7 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 		return false;
 
 	if (dst_nid == p->numa_preferred_nid ||
-	    p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+	    task_faults(p, dst_nid) > task_faults(p, src_nid))
 		return true;
 
 	return false;
@@ -4170,7 +4194,7 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
 		return false;
 
-	if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+	if (task_faults(p, dst_nid) < task_faults(p, src_nid))
 		return true;
 
 	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8677dbf..9142167 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1282,7 +1282,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
-	int target_nid;
+	int target_nid, last_nid = -1;
 	bool page_locked;
 	bool migrated = false;
 
@@ -1293,6 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
+	last_nid = page_nid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1361,7 +1362,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index ed51f15..24bc9b8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,6 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	int page_nid = -1;
+	int last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3566,6 +3567,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
+	last_nid = page_nid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3581,7 +3583,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(page_nid, 1, migrated);
+		task_numa_fault(last_nid, page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3596,6 +3598,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3643,6 +3646,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(page_mapcount(page) != 1))
 			continue;
 
+		last_nid = page_nid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
@@ -3655,7 +3659,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(page_nid, 1, migrated);
+			task_numa_fault(last_nid, page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}

^ permalink raw reply related	[flat|nested] 340+ messages in thread
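
The flattened fault-array layout can be illustrated with a short standalone
sketch: one allocation holds numa_faults followed by numa_faults_buffer, each
with two slots per node indexed by 2 * nid + priv. The node count and the
fault values below are assumptions for the example.

/* Sketch of the [node][priv] fault array layout. Illustrative only. */
#include <stdio.h>

#define NR_NODE_IDS 4

static int task_faults_idx(int nid, int priv)
{
	return 2 * nid + priv;
}

int main(void)
{
	/* single allocation: first half numa_faults, second half the buffer */
	unsigned long mem[2 * 2 * NR_NODE_IDS] = { 0 };
	unsigned long *numa_faults = mem;
	unsigned long *numa_faults_buffer = mem + 2 * NR_NODE_IDS;

	/* record 8 private-fault pages against node 2 in the current window */
	numa_faults_buffer[task_faults_idx(2, 1)] += 8;

	printf("buffer[nid=2,priv=1] at slot %d holds %lu\n",
	       task_faults_idx(2, 1), numa_faults_buffer[task_faults_idx(2, 1)]);
	printf("decayed numa_faults[nid=2,priv=1] still %lu\n",
	       numa_faults[task_faults_idx(2, 1)]);
	return 0;
}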

* [tip:sched/core] sched/numa: Check current->mm before allocating NUMA faults
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:28   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  9ff1d9ff3c2c8ab3feaeb2e8056a07ca293f7bde
Gitweb:     http://git.kernel.org/tip/9ff1d9ff3c2c8ab3feaeb2e8056a07ca293f7bde
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:04 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:31 +0200

sched/numa: Check current->mm before allocating NUMA faults

task_numa_placement checks current->mm, but only after the buffers for
faults have already been uselessly allocated. Move the check earlier.

[peterz@infradead.org: Identified the problem]

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-27-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89eeb89..3383079 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -930,8 +930,6 @@ static void task_numa_placement(struct task_struct *p)
 	int seq, nid, max_nid = -1;
 	unsigned long max_faults = 0;
 
-	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
-		return;
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
 		return;
@@ -998,6 +996,10 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!numabalancing_enabled)
 		return;
 
+	/* for example, ksmd faulting in a user's mm */
+	if (!p->mm)
+		return;
+
 	/* For now, do not attempt to detect private/shared accesses */
 	priv = 1;
 

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: numa: Scan pages with elevated page_mapcount
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:28   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  1bc115d87dffd1c43bdc3c9c9d1e3a51c195d18e
Gitweb:     http://git.kernel.org/tip/1bc115d87dffd1c43bdc3c9c9d1e3a51c195d18e
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:05 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:32 +0200

mm: numa: Scan pages with elevated page_mapcount

Currently automatic NUMA balancing is unable to distinguish between falsely
shared and private pages except by ignoring pages with an elevated
page_mapcount entirely. This avoids shared pages bouncing between the
nodes whose tasks are using them, but it means quite a lot of data is ignored.

This patch kicks away the training wheels in preparation for adding support
for identifying shared/private pages, now that the necessary infrastructure is
in place. The ordering is so
that the impact of the shared/private detection can be easily measured. Note
that the patch does not migrate shared, file-backed pages within VMAs marked
VM_EXEC as these are generally shared library pages. Migrating such pages
is not beneficial as there is an expectation they are read-shared between
caches, and iTLB and iCache pressure is generally low.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-28-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/migrate.h |  7 ++++---
 mm/huge_memory.c        | 12 +++++-------
 mm/memory.c             |  7 ++-----
 mm/migrate.c            | 17 ++++++-----------
 mm/mprotect.c           |  4 +---
 5 files changed, 18 insertions(+), 29 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 8d3c57f..f5096b5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -90,11 +90,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+				  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+					 struct vm_area_struct *vma, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9142167..2a28c2c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1484,14 +1484,12 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			struct page *page = pmd_page(*pmd);
 
 			/*
-			 * Only check non-shared pages. Do not trap faults
-			 * against the zero page. The read-only data is likely
-			 * to be read-cached on the local CPU cache and it is
-			 * less useful to know about local vs remote hits on
-			 * the zero page.
+			 * Do not trap faults against the zero page. The
+			 * read-only data is likely to be read-cached on the
+			 * local CPU cache and it is less useful to know about
+			 * local vs remote hits on the zero page.
 			 */
-			if (page_mapcount(page) == 1 &&
-			    !is_huge_zero_page(page) &&
+			if (!is_huge_zero_page(page) &&
 			    !pmd_numa(*pmd)) {
 				entry = pmdp_get_and_clear(mm, addr, pmd);
 				entry = pmd_mknuma(entry);
diff --git a/mm/memory.c b/mm/memory.c
index 24bc9b8..3e3b4b8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3577,7 +3577,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, target_nid);
+	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated)
 		page_nid = target_nid;
 
@@ -3642,16 +3642,13 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pteval);
 		if (unlikely(!page))
 			continue;
-		/* only check non-shared pages */
-		if (unlikely(page_mapcount(page) != 1))
-			continue;
 
 		last_nid = page_nid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
 		if (target_nid != -1) {
-			migrated = migrate_misplaced_page(page, target_nid);
+			migrated = migrate_misplaced_page(page, vma, target_nid);
 			if (migrated)
 				page_nid = target_nid;
 		} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index 7bd90d3..fcba2f4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1599,7 +1599,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+			   int node)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
@@ -1607,10 +1608,11 @@ int migrate_misplaced_page(struct page *page, int node)
 	LIST_HEAD(migratepages);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
+	 * Don't migrate file pages that are mapped in multiple processes
+	 * with execute permissions as they are probably shared libraries.
 	 */
-	if (page_mapcount(page) != 1)
+	if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+	    (vma->vm_flags & VM_EXEC))
 		goto out;
 
 	/*
@@ -1661,13 +1663,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int page_lru = page_is_file_cache(page);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
-	 */
-	if (page_mapcount(page) != 1)
-		goto out_dropref;
-
-	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2da33dc..41e0292 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					if (last_nid != this_nid)
 						all_same_node = false;
 
-					/* only check non-shared pages */
-					if (!pte_numa(oldpte) &&
-					    page_mapcount(page) == 1) {
+					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
 					}


* [tip:sched/core] sched/numa: Remove check that skips small VMAs
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:28   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  073b5beea735c7e1970686c94ff1f3aaac790a2a
Gitweb:     http://git.kernel.org/tip/073b5beea735c7e1970686c94ff1f3aaac790a2a
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:06 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:33 +0200

sched/numa: Remove check that skips small VMAs

task_numa_work skips small VMAs. At the time, the logic was to reduce the
scanning overhead, which was considerable. It is a dubious hack at best.
It would make much more sense to cache where faults have been observed
and only rescan those regions during subsequent PTE scans. Remove this
hack as motivation to do it properly in the future.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-29-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3383079..862d20d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1127,10 +1127,6 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma))
 			continue;
 
-		/* Skip small VMAs. They are not likely to be of relevance */
-		if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
-			continue;
-
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);


* [tip:sched/core] sched/numa: Set preferred NUMA node based on number of private faults
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:28   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  b795854b1fa70f6aee923ae5df74ff7afeaddcaa
Gitweb:     http://git.kernel.org/tip/b795854b1fa70f6aee923ae5df74ff7afeaddcaa
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:07 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:35 +0200

sched/numa: Set preferred NUMA node based on number of private faults

Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.

To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that, in the general case, the private
accesses are more important for cpu->memory locality. Also, no new
infrastructure is required to treat private pages properly, but
interleaving for shared pages requires additional infrastructure.

To detect private accesses the pid of the last accessing task is required,
but the storage requirements are high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults were measured. Shared
faults are not taken into consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. That node will become compute overloaded and tasks
will be scheduled away later, only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way accounting for shared faults the same as
private faults can result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and its node
would be selected as the preferred node even though that is the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page, making the fault information unreliable.
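
The encoding is compact enough to model in userspace. A sketch using the
macros from the patch (NODES_SHIFT is assumed to be 10 here; the real value
depends on the kernel configuration):

#include <stdbool.h>
#include <stdio.h>

#define LAST__PID_SHIFT 8
#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT) - 1)
#define NODES_SHIFT     10                      /* assumed for illustration */
#define LAST__NID_SHIFT NODES_SHIFT
#define LAST__NID_MASK  ((1 << LAST__NID_SHIFT) - 1)

static int nid_pid_to_nidpid(int nid, int pid)
{
        return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
}

static int nidpid_to_pid(int nidpid) { return nidpid & LAST__PID_MASK; }
static int nidpid_to_nid(int nidpid) { return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK; }

int main(void)
{
        int last_nidpid = nid_pid_to_nidpid(3, 4242);   /* node 3, pid 4242 */
        int faulting_pid = 4242;

        /* Only the low 8 bits of the pid are stored, so collisions are
         * possible, but a match is a good hint that the access is private. */
        bool priv = (faulting_pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid);

        printf("last node %d, private fault %d\n", nidpid_to_nid(last_nidpid), priv);
        return 0;
}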

Signed-off-by: Mel Gorman <mgorman@suse.de>
[ Fix compilation error when !NUMA_BALANCING. ]
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h                | 89 +++++++++++++++++++++++++++++----------
 include/linux/mm_types.h          |  4 +-
 include/linux/page-flags-layout.h | 28 +++++++-----
 kernel/sched/fair.c               | 12 ++++--
 mm/huge_memory.c                  |  8 ++--
 mm/memory.c                       | 16 +++----
 mm/mempolicy.c                    |  8 ++--
 mm/migrate.c                      |  4 +-
 mm/mm_init.c                      | 18 ++++----
 mm/mmzone.c                       | 14 +++---
 mm/mprotect.c                     | 26 ++++++++----
 mm/page_alloc.c                   |  4 +-
 12 files changed, 149 insertions(+), 82 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..bb412ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -581,11 +581,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
  * sets it, so none of the operations on it need to be atomic.
  */
 
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NID_PGOFF		(ZONES_PGOFF - LAST_NID_WIDTH)
+#define LAST_NIDPID_PGOFF	(ZONES_PGOFF - LAST_NIDPID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -595,7 +595,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NID_PGSHIFT	(LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
+#define LAST_NIDPID_PGSHIFT	(LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -617,7 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NID_MASK		((1UL << LAST_NID_WIDTH) - 1)
+#define LAST_NIDPID_MASK	((1UL << LAST_NIDPID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -661,48 +661,93 @@ static inline int page_to_nid(const struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int nid_pid_to_nidpid(int nid, int pid)
 {
-	return xchg(&page->_last_nid, nid);
+	return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
 }
 
-static inline int page_nid_last(struct page *page)
+static inline int nidpid_to_pid(int nidpid)
 {
-	return page->_last_nid;
+	return nidpid & LAST__PID_MASK;
 }
-static inline void page_nid_reset_last(struct page *page)
+
+static inline int nidpid_to_nid(int nidpid)
+{
+	return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+	return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+}
+
+static inline bool nidpid_nid_unset(int nidpid)
 {
-	page->_last_nid = -1;
+	return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+}
+
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+static inline int page_nidpid_xchg_last(struct page *page, int nid)
+{
+	return xchg(&page->_last_nidpid, nid);
+}
+
+static inline int page_nidpid_last(struct page *page)
+{
+	return page->_last_nidpid;
+}
+static inline void page_nidpid_reset_last(struct page *page)
+{
+	page->_last_nidpid = -1;
 }
 #else
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
 {
-	return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+	return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
 }
 
-extern int page_nid_xchg_last(struct page *page, int nid);
+extern int page_nidpid_xchg_last(struct page *page, int nidpid);
 
-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
 {
-	int nid = (1 << LAST_NID_SHIFT) - 1;
+	int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
 
-	page->flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
-	page->flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+	page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+	page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
 }
-#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
 #else
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
 {
 	return page_to_nid(page);
 }
 
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
 {
 	return page_to_nid(page);
 }
 
-static inline void page_nid_reset_last(struct page *page)
+static inline int nidpid_to_nid(int nidpid)
+{
+	return -1;
+}
+
+static inline int nidpid_to_pid(int nidpid)
+{
+	return -1;
+}
+
+static inline int nid_pid_to_nidpid(int nid, int pid)
+{
+	return -1;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+	return 1;
+}
+
+static inline void page_nidpid_reset_last(struct page *page)
 {
 }
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b7adf1d..38a902a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
 	void *shadow;
 #endif
 
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-	int _last_nid;
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+	int _last_nidpid;
 #endif
 }
 /*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 93506a1..02bc918 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -38,10 +38,10 @@
  * The last is when there is insufficient space in page->flags and a separate
  * lookup is necessary.
  *
- * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |          ... | FLAGS |
- *         " plus space for last_nid: |       NODE     | ZONE | LAST_NID ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE |          ... | FLAGS |
- *         " plus space for last_nid: | SECTION | NODE | ZONE | LAST_NID ... | FLAGS |
+ * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |             ... | FLAGS |
+ *      " plus space for last_nidpid: |       NODE     | ZONE | LAST_NIDPID ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
+ *      " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
  * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
  */
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -62,15 +62,21 @@
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#define LAST_NID_SHIFT NODES_SHIFT
+#define LAST__PID_SHIFT 8
+#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)
+
+#define LAST__NID_SHIFT NODES_SHIFT
+#define LAST__NID_MASK  ((1 << LAST__NID_SHIFT)-1)
+
+#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
 #else
-#define LAST_NID_SHIFT 0
+#define LAST_NIDPID_SHIFT 0
 #endif
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NID_WIDTH LAST_NID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
 #else
-#define LAST_NID_WIDTH 0
+#define LAST_NIDPID_WIDTH 0
 #endif
 
 /*
@@ -81,8 +87,8 @@
 #define NODE_NOT_IN_PAGE_FLAGS
 #endif
 
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NID_WIDTH == 0
-#define LAST_NID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
+#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
 #endif
 
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 862d20d..b1de7c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,7 +988,7 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_nid, int node, int pages, bool migrated)
+void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
 	int priv;
@@ -1000,8 +1000,14 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!p->mm)
 		return;
 
-	/* For now, do not attempt to detect private/shared accesses */
-	priv = 1;
+	/*
+	 * First accesses are treated as private, otherwise consider accesses
+	 * to be private if the accessing pid has not changed
+	 */
+	if (!nidpid_pid_unset(last_nidpid))
+		priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+	else
+		priv = 1;
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a28c2c..0baf0e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1282,7 +1282,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
-	int target_nid, last_nid = -1;
+	int target_nid, last_nidpid = -1;
 	bool page_locked;
 	bool migrated = false;
 
@@ -1293,7 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
-	last_nid = page_nid_last(page);
+	last_nidpid = page_nidpid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1362,7 +1362,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
 
 	return 0;
 }
@@ -1682,7 +1682,7 @@ static void __split_huge_page_refcount(struct page *page,
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
-		page_nid_xchg_last(page_tail, page_nid_last(page));
+		page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index 3e3b4b8..cc7f206 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@
 
 #include "internal.h"
 
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nid.
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	int page_nid = -1;
-	int last_nid;
+	int last_nidpid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3567,7 +3567,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
-	last_nid = page_nid_last(page);
+	last_nidpid = page_nidpid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3583,7 +3583,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(last_nid, page_nid, 1, migrated);
+		task_numa_fault(last_nidpid, page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3598,7 +3598,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
-	int last_nid;
+	int last_nidpid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3643,7 +3643,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(!page))
 			continue;
 
-		last_nid = page_nid_last(page);
+		last_nidpid = page_nidpid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
@@ -3656,7 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(last_nid, page_nid, 1, migrated);
+			task_numa_fault(last_nidpid, page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0472964..aff1f1e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2348,9 +2348,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int last_nid;
+		int last_nidpid;
+		int this_nidpid;
 
 		polnid = numa_node_id();
+		this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
 
 		/*
 		 * Multi-stage node selection is used in conjunction
@@ -2373,8 +2375,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * it less likely we act on an unlikely task<->page
 		 * relation.
 		 */
-		last_nid = page_nid_xchg_last(page, polnid);
-		if (last_nid != polnid)
+		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
+		if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
 			goto out;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index fcba2f4..025d1e3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1498,7 +1498,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
 	if (newpage)
-		page_nid_xchg_last(newpage, page_nid_last(page));
+		page_nidpid_xchg_last(newpage, page_nidpid_last(page));
 
 	return newpage;
 }
@@ -1675,7 +1675,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	if (!new_page)
 		goto out_fail;
 
-	page_nid_xchg_last(new_page, page_nid_last(page));
+	page_nidpid_xchg_last(new_page, page_nidpid_last(page));
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 633c088..467de57 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,26 +71,26 @@ void __init mminit_verify_pageflags_layout(void)
 	unsigned long or_mask, add_mask;
 
 	shift = 8 * sizeof(unsigned long);
-	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NID_SHIFT;
+	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastnid %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
-		LAST_NID_WIDTH,
+		LAST_NIDPID_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
-		"Section %d Node %d Zone %d Lastnid %d\n",
+		"Section %d Node %d Zone %d Lastnidpid %d\n",
 		SECTIONS_SHIFT,
 		NODES_SHIFT,
 		ZONES_SHIFT,
-		LAST_NID_SHIFT);
+		LAST_NIDPID_SHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
-		"Section %lu Node %lu Zone %lu Lastnid %lu\n",
+		"Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
 		(unsigned long)SECTIONS_PGSHIFT,
 		(unsigned long)NODES_PGSHIFT,
 		(unsigned long)ZONES_PGSHIFT,
-		(unsigned long)LAST_NID_PGSHIFT);
+		(unsigned long)LAST_NIDPID_PGSHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
 		"Node/Zone ID: %lu -> %lu\n",
 		(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -102,9 +102,9 @@ void __init mminit_verify_pageflags_layout(void)
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
 		"Node not in page flags");
 #endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
-		"Last nid not in page flags");
+		"Last nidpid not in page flags");
 #endif
 
 	if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 2ac0afb..25bb477 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -97,20 +97,20 @@ void lruvec_init(struct lruvec *lruvec)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
 }
 
-#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NID_NOT_IN_PAGE_FLAGS)
-int page_nid_xchg_last(struct page *page, int nid)
+#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NIDPID_NOT_IN_PAGE_FLAGS)
+int page_nidpid_xchg_last(struct page *page, int nidpid)
 {
 	unsigned long old_flags, flags;
-	int last_nid;
+	int last_nidpid;
 
 	do {
 		old_flags = flags = page->flags;
-		last_nid = page_nid_last(page);
+		last_nidpid = page_nidpid_last(page);
 
-		flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
-		flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+		flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+		flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
 	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
 
-	return last_nid;
+	return last_nidpid;
 }
 #endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 41e0292..f0b087d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,15 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa, bool *ret_all_same_node)
+		int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
-	bool all_same_node = true;
+	bool all_same_nidpid = true;
 	int last_nid = -1;
+	int last_pid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -63,11 +64,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
-					int this_nid = page_to_nid(page);
+					int nidpid = page_nidpid_last(page);
+					int this_nid = nidpid_to_nid(nidpid);
+					int this_pid = nidpid_to_pid(nidpid);
+
 					if (last_nid == -1)
 						last_nid = this_nid;
-					if (last_nid != this_nid)
-						all_same_node = false;
+					if (last_pid == -1)
+						last_pid = this_pid;
+					if (last_nid != this_nid ||
+					    last_pid != this_pid) {
+						all_same_nidpid = false;
+					}
 
 					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
@@ -107,7 +115,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-	*ret_all_same_node = all_same_node;
+	*ret_all_same_nidpid = all_same_nidpid;
 	return pages;
 }
 
@@ -134,7 +142,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
-	bool all_same_node;
+	bool all_same_nidpid;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -158,7 +166,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		pages += change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa, &all_same_node);
+				 dirty_accountable, prot_numa, &all_same_nidpid);
 
 		/*
 		 * If we are changing protections for NUMA hinting faults then
@@ -166,7 +174,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && all_same_node)
+		if (prot_numa && all_same_nidpid)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fa..89bedd0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -626,7 +626,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	page_nid_reset_last(page);
+	page_nidpid_reset_last(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -4015,7 +4015,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		page_mapcount_reset(page);
-		page_nid_reset_last(page);
+		page_nidpid_reset_last(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for


* [tip:sched/core] sched/numa: Do not migrate memory immediately after switching node
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:28   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  6fe6b2d6dabf392aceb3ad3a5e859b46a04465c6
Gitweb:     http://git.kernel.org/tip/6fe6b2d6dabf392aceb3ad3a5e859b46a04465c6
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:08 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:36 +0200

sched/numa: Do not migrate memory immediately after switching node

The load balancer can move tasks between nodes and does not take NUMA
locality into account. With automatic NUMA balancing this may result in the
task's working set being migrated to the new node. However, as the fault
buffer will still store faults from the old node, the scheduler may decide to
reset the preferred node and migrate the task back, resulting in more
migrations.

The ideal would be that the scheduler did not migrate tasks with a heavy
memory footprint, but this may result in nodes being overloaded. We could
also discard the fault information on task migration, but this would still
cause the task's entire working set to be migrated. This patch simply avoids
migrating the memory for a short time after a task is migrated.
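
A compressed userspace model of the throttle (the struct and helpers are
stand-ins; in the kernel the state lives in task_struct and the checks sit
in move_task() and mpol_misplaced()):

#include <stdbool.h>
#include <stdio.h>

struct task_model {
        int numa_preferred_nid;
        int numa_migrate_seq;   /* 0 == memory migration temporarily off */
};

/* Load balancer moved the task from src_nid to dst_nid. */
static void on_load_balance_move(struct task_model *p, int src_nid, int dst_nid)
{
        if (src_nid != dst_nid && dst_nid != p->numa_preferred_nid)
                p->numa_migrate_seq = 0;
}

/* Would a NUMA hinting fault migrate memory towards fault_nid right now? */
static bool may_migrate_memory(const struct task_model *p, int fault_nid)
{
        if (fault_nid != p->numa_preferred_nid && !p->numa_migrate_seq)
                return false;   /* recently moved off-node: hold off */
        return true;
}

int main(void)
{
        struct task_model p = { .numa_preferred_nid = 0, .numa_migrate_seq = 1 };

        on_load_balance_move(&p, 0, 1);                 /* balanced away from node 0 */
        printf("%d\n", may_migrate_memory(&p, 1));      /* 0: migration throttled */
        return 0;
}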

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-31-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c |  2 +-
 kernel/sched/fair.c | 18 ++++++++++++++++--
 mm/mempolicy.c      | 12 ++++++++++++
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66b878e..9060a7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1631,7 +1631,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = 0;
+	p->numa_migrate_seq = 1;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1de7c5..61ec0d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -884,7 +884,7 @@ static unsigned int task_scan_max(struct task_struct *p)
  * the preferred node but still allow the scheduler to move the task again if
  * the nodes CPUs are overloaded.
  */
-unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
 
 static inline int task_faults_idx(int nid, int priv)
 {
@@ -980,7 +980,7 @@ static void task_numa_placement(struct task_struct *p)
 
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
-		p->numa_migrate_seq = 0;
+		p->numa_migrate_seq = 1;
 		migrate_task_to(p, preferred_cpu);
 	}
 }
@@ -4121,6 +4121,20 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 	set_task_cpu(p, env->dst_cpu);
 	activate_task(env->dst_rq, p, 0);
 	check_preempt_curr(env->dst_rq, p, 0);
+#ifdef CONFIG_NUMA_BALANCING
+	if (p->numa_preferred_nid != -1) {
+		int src_nid = cpu_to_node(env->src_cpu);
+		int dst_nid = cpu_to_node(env->dst_cpu);
+
+		/*
+		 * If the load balancer has moved the task then limit
+		 * migrations from taking place in the short term in
+		 * case this is a short-lived migration.
+		 */
+		if (src_nid != dst_nid && dst_nid != p->numa_preferred_nid)
+			p->numa_migrate_seq = 0;
+	}
+#endif
 }
 
 /*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index aff1f1e..196d8da 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2378,6 +2378,18 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
 		if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
 			goto out;
+
+#ifdef CONFIG_NUMA_BALANCING
+		/*
+		 * If the scheduler has just moved us away from our
+		 * preferred node, do not bother migrating pages yet.
+		 * This way a short and temporary process migration will
+		 * not cause excessive memory migration.
+		 */
+		if (polnid != current->numa_preferred_nid &&
+				!current->numa_migrate_seq)
+			goto out;
+#endif
 	}
 
 	if (curnid != polnid)


* [tip:sched/core] mm: numa: Limit NUMA scanning to migrate-on-fault VMAs
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:29   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, jmario, riel, srikar,
	aarcange, mgorman, tglx

Commit-ID:  fc3147245d193bd0f57307859c698fa28a20b0fe
Gitweb:     http://git.kernel.org/tip/fc3147245d193bd0f57307859c698fa28a20b0fe
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:09 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:38 +0200

mm: numa: Limit NUMA scanning to migrate-on-fault VMAs

There is a 90% regression observed with a large Oracle performance test
on a 4 node system. Profiles indicated that the overhead was due to
contention on sp_lock when looking up shared memory policies. These
policies do not have the appropriate flags to allow them to be
automatically balanced so trapping faults on them is pointless. This
patch skips VMAs that do not have MPOL_F_MOF set.
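
In essence the new per-VMA gate is a single flag test on the effective
memory policy; a simplified userspace model (the MPOL_F_MOF bit value is
assumed for illustration):

#include <stdbool.h>
#include <stdio.h>

#define MPOL_F_MOF      (1 << 3)        /* assumed bit value for illustration */

struct policy_model {
        unsigned short flags;
};

/* Only VMAs whose effective policy allows migrate-on-fault are worth
 * trapping NUMA hinting faults on; everything else is skipped. */
static bool worth_scanning(const struct policy_model *pol)
{
        return pol && (pol->flags & MPOL_F_MOF);
}

int main(void)
{
        struct policy_model shm_policy = { .flags = 0 };        /* shared policy without MOF */

        printf("%d\n", worth_scanning(&shm_policy));    /* 0: skip this VMA */
        return 0;
}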

[riel@redhat.com: Initial patch]

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-and-tested-by: Joe Mario <jmario@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-32-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mempolicy.h |  1 +
 kernel/sched/fair.c       |  2 +-
 mm/mempolicy.c            | 24 ++++++++++++++++++++++++
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index da6716b..ea4d249 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -136,6 +136,7 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 
 struct mempolicy *get_vma_policy(struct task_struct *tsk,
 		struct vm_area_struct *vma, unsigned long addr);
+bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61ec0d4..d98175d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1130,7 +1130,7 @@ void task_numa_work(struct callback_head *work)
 		vma = mm->mmap;
 	}
 	for (; vma; vma = vma->vm_next) {
-		if (!vma_migratable(vma))
+		if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
 			continue;
 
 		do {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 196d8da..0e895a2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1679,6 +1679,30 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
 	return pol;
 }
 
+bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma)
+{
+	struct mempolicy *pol = get_task_policy(task);
+	if (vma) {
+		if (vma->vm_ops && vma->vm_ops->get_policy) {
+			bool ret = false;
+
+			pol = vma->vm_ops->get_policy(vma, vma->vm_start);
+			if (pol && (pol->flags & MPOL_F_MOF))
+				ret = true;
+			mpol_cond_put(pol);
+
+			return ret;
+		} else if (vma->vm_policy) {
+			pol = vma->vm_policy;
+		}
+	}
+
+	if (!pol)
+		return default_policy.flags & MPOL_F_MOF;
+
+	return pol->flags & MPOL_F_MOF;
+}
+
 static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 {
 	enum zone_type dynamic_policy_zone = policy_zone;


* [tip:sched/core] sched/numa: Avoid overloading CPUs on a preferred NUMA node
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:29   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  58d081b5082dd85e02ac9a1fb151d97395340a09
Gitweb:     http://git.kernel.org/tip/58d081b5082dd85e02ac9a1fb151d97395340a09
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:10 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:39 +0200

sched/numa: Avoid overloading CPUs on a preferred NUMA node

This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
find_idlest_cpu_node has two critical limitations. It does not take the
scheduling class into account when calculating the load and it is unsuitable
for use when comparing loads between NUMA nodes.

task_numa_find_cpu uses similar load calculations to wake_affine() when
selecting the least loaded CPU within a scheduling domain common to the
source and destination nodes. It avoids causing CPU load imbalances in
the machine by refusing to migrate if the relative load on the target
CPU is higher than the source CPU.
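
Stripped of the cgroup weighting, the acceptance test amounts to the
comparison below; a simplified userspace model where load and cpu power are
plain numbers, not the kernel's types:

#include <stdbool.h>
#include <stdio.h>

/* Allow the migration only if the destination CPU would end up no more
 * loaded than the source, with the source side relaxed by half of the
 * scheduling domain's imbalance_pct. */
static bool migration_balanced(long src_load, long src_power,
                               long dst_load, long dst_power,
                               long task_weight, int imbalance_pct)
{
        long long src_eff = 100 + (imbalance_pct - 100) / 2;
        long long dst_eff = 100;

        src_eff *= src_power;
        src_eff *= src_load - task_weight;      /* load if the task left */
        dst_eff *= dst_power;
        dst_eff *= dst_load + task_weight;      /* load if the task arrived */

        return dst_eff <= src_eff;
}

int main(void)
{
        /* Equal-power CPUs, the default imbalance_pct of 125 */
        printf("%d\n", migration_balanced(2048, 1024, 0, 1024, 1024, 125));     /* 1 */
        printf("%d\n", migration_balanced(2048, 1024, 1024, 1024, 1024, 125));  /* 0 */
        return 0;
}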

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-33-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 131 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 102 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d98175d..51a7600 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,28 +901,114 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
 }
 
 static unsigned long weighted_cpuload(const int cpu);
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
+struct numa_stats {
+	unsigned long load;
+	s64 eff_load;
+	unsigned long faults;
+};
 
-static int
-find_idlest_cpu_node(int this_cpu, int nid)
-{
-	unsigned long load, min_load = ULONG_MAX;
-	int i, idlest_cpu = this_cpu;
+struct task_numa_env {
+	struct task_struct *p;
 
-	BUG_ON(cpu_to_node(this_cpu) == nid);
+	int src_cpu, src_nid;
+	int dst_cpu, dst_nid;
 
-	rcu_read_lock();
-	for_each_cpu(i, cpumask_of_node(nid)) {
-		load = weighted_cpuload(i);
+	struct numa_stats src_stats, dst_stats;
 
-		if (load < min_load) {
-			min_load = load;
-			idlest_cpu = i;
+	unsigned long best_load;
+	int best_cpu;
+};
+
+static int task_numa_migrate(struct task_struct *p)
+{
+	int node_cpu = cpumask_first(cpumask_of_node(p->numa_preferred_nid));
+	struct task_numa_env env = {
+		.p = p,
+		.src_cpu = task_cpu(p),
+		.src_nid = cpu_to_node(task_cpu(p)),
+		.dst_cpu = node_cpu,
+		.dst_nid = p->numa_preferred_nid,
+		.best_load = ULONG_MAX,
+		.best_cpu = task_cpu(p),
+	};
+	struct sched_domain *sd;
+	int cpu;
+	struct task_group *tg = task_group(p);
+	unsigned long weight;
+	bool balanced;
+	int imbalance_pct, idx = -1;
+
+	/*
+	 * Find the lowest common scheduling domain covering the nodes of both
+	 * the CPU the task is currently running on and the target NUMA node.
+	 */
+	rcu_read_lock();
+	for_each_domain(env.src_cpu, sd) {
+		if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
+			/*
+			 * busy_idx is used for the load decision as it is the
+			 * same index used by the regular load balancer for an
+			 * active cpu.
+			 */
+			idx = sd->busy_idx;
+			imbalance_pct = sd->imbalance_pct;
+			break;
 		}
 	}
 	rcu_read_unlock();
 
-	return idlest_cpu;
+	if (WARN_ON_ONCE(idx == -1))
+		return 0;
+
+	/*
+	 * XXX the below is mostly nicked from wake_affine(); we should
+	 * see about sharing a bit if at all possible; also it might want
+	 * some per entity weight love.
+	 */
+	weight = p->se.load.weight;
+	env.src_stats.load = source_load(env.src_cpu, idx);
+	env.src_stats.eff_load = 100 + (imbalance_pct - 100) / 2;
+	env.src_stats.eff_load *= power_of(env.src_cpu);
+	env.src_stats.eff_load *= env.src_stats.load + effective_load(tg, env.src_cpu, -weight, -weight);
+
+	for_each_cpu(cpu, cpumask_of_node(env.dst_nid)) {
+		env.dst_cpu = cpu;
+		env.dst_stats.load = target_load(cpu, idx);
+
+		/* If the CPU is idle, use it */
+		if (!env.dst_stats.load) {
+			env.best_cpu = cpu;
+			goto migrate;
+		}
+
+		/* Otherwise check the target CPU load */
+		env.dst_stats.eff_load = 100;
+		env.dst_stats.eff_load *= power_of(cpu);
+		env.dst_stats.eff_load *= env.dst_stats.load + effective_load(tg, cpu, weight, weight);
+
+		/*
+		 * Destination is considered balanced if the destination CPU is
+		 * less loaded than the source CPU. Unfortunately there is a
+		 * risk that a task running on a lightly loaded CPU will not
+		 * migrate to its preferred node due to load imbalances.
+		 */
+		balanced = (env.dst_stats.eff_load <= env.src_stats.eff_load);
+		if (!balanced)
+			continue;
+
+		if (env.dst_stats.eff_load < env.best_load) {
+			env.best_load = env.dst_stats.eff_load;
+			env.best_cpu = cpu;
+		}
+	}
+
+migrate:
+	return migrate_task_to(p, env.best_cpu);
 }
 
 static void task_numa_placement(struct task_struct *p)
@@ -966,22 +1052,10 @@ static void task_numa_placement(struct task_struct *p)
 	 * the working set placement.
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
-		int preferred_cpu;
-
-		/*
-		 * If the task is not on the preferred node then find the most
-		 * idle CPU to migrate to.
-		 */
-		preferred_cpu = task_cpu(p);
-		if (cpu_to_node(preferred_cpu) != max_nid) {
-			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
-							     max_nid);
-		}
-
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 1;
-		migrate_task_to(p, preferred_cpu);
+		task_numa_migrate(p);
 	}
 }
 
@@ -3292,7 +3366,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	struct sched_entity *se = tg->se[cpu];
 
-	if (!tg->parent)	/* the trivial, non-cgroup case */
+	if (!tg->parent || !wl)	/* the trivial, non-cgroup case */
 		return wl;
 
 	for_each_sched_entity(se) {
@@ -3345,8 +3419,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 }
 #else
 
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
-		unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	return wl;
 }


* [tip:sched/core] sched/numa: Retry migration of tasks to CPU on a preferred node
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:29   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  6b9a7460b6baf6c77fc3d23d927ddfc3f3f05bf3
Gitweb:     http://git.kernel.org/tip/6b9a7460b6baf6c77fc3d23d927ddfc3f3f05bf3
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:11 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:40 +0200

sched/numa: Retry migration of tasks to CPU on a preferred node

When a preferred node is selected for a task there is an attempt to migrate
the task to a CPU there. This may fail, in which case the task will only
migrate if the active load balancer takes action. This may never happen if
the conditions are not right. This patch will check at NUMA hinting fault
time if another attempt should be made to migrate the task. It will only
make an attempt once every five seconds.
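
The retry throttle is a timestamp check on the hinting-fault path; a minimal
userspace model using wall-clock seconds in place of jiffies (names and flow
simplified):

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct task_model {
        int preferred_nid;
        int current_nid;
        time_t migrate_retry;   /* 0 == no retry pending */
};

static bool try_migrate(struct task_model *p)
{
        (void)p;
        return false;           /* pretend the target node's CPUs are busy */
}

/* Called from the NUMA hinting fault path: retry a failed task migration
 * at most once every five seconds (HZ * 5 in the kernel). */
static void numa_migrate_preferred_model(struct task_model *p)
{
        time_t now = time(NULL);

        if (p->current_nid == p->preferred_nid) {
                p->migrate_retry = 0;           /* already where we want to be */
                return;
        }
        if (p->migrate_retry && now <= p->migrate_retry)
                return;                         /* throttled */
        if (!try_migrate(p))
                p->migrate_retry = now + 5;     /* try again in five seconds */
}

int main(void)
{
        struct task_model p = { .preferred_nid = 1, .current_nid = 0 };

        numa_migrate_preferred_model(&p);
        printf("next retry not before %ld\n", (long)p.migrate_retry);
        return 0;
}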

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-34-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 30 +++++++++++++++++++++++-------
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d946195..14251a8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1341,6 +1341,7 @@ struct task_struct {
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
 	unsigned int numa_scan_period_max;
+	unsigned long numa_migrate_retry;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 51a7600..f84ac3f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1011,6 +1011,23 @@ migrate:
 	return migrate_task_to(p, env.best_cpu);
 }
 
+/* Attempt to migrate a task to a CPU on the preferred node. */
+static void numa_migrate_preferred(struct task_struct *p)
+{
+	/* Success if task is already running on preferred CPU */
+	p->numa_migrate_retry = 0;
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+		return;
+
+	/* This task has no NUMA fault statistics yet */
+	if (unlikely(p->numa_preferred_nid == -1))
+		return;
+
+	/* Otherwise, try migrate to a CPU on the preferred node */
+	if (task_numa_migrate(p) != 0)
+		p->numa_migrate_retry = jiffies + HZ*5;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -1045,17 +1062,12 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/*
-	 * Record the preferred node as the node with the most faults,
-	 * requeue the task to be running on the idlest CPU on the
-	 * preferred node and reset the scanning rate to recheck
-	 * the working set placement.
-	 */
+	/* Preferred node as the node with the most faults */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 1;
-		task_numa_migrate(p);
+		numa_migrate_preferred(p);
 	}
 }
 
@@ -1111,6 +1123,10 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 
 	task_numa_placement(p);
 
+	/* Retry task to preferred node migration if it previously failed */
+	if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+		numa_migrate_preferred(p);
+
 	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
 }
 


* [tip:sched/core] sched/numa: Increment numa_migrate_seq when task runs in correct location
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:29   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  06ea5e035b4e66cc77790457a89fc7e368060c4b
Gitweb:     http://git.kernel.org/tip/06ea5e035b4e66cc77790457a89fc7e368060c4b
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:12 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:41 +0200

sched/numa: Increment numa_migrate_seq when task runs in correct location

When a task is already running on its preferred node and memory migration
was temporarily disabled, increment numa_migrate_seq to indicate that the
task is settled and that memory should migrate towards it.

Signed-off-by: Rik van Riel <riel@redhat.com>
[ Only increment migrate_seq if migration temporarily disabled. ]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-35-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f84ac3f..de9b4d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1016,8 +1016,16 @@ static void numa_migrate_preferred(struct task_struct *p)
 {
 	/* Success if task is already running on preferred CPU */
 	p->numa_migrate_retry = 0;
-	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid) {
+		/*
+		 * If migration is temporarily disabled due to a task migration
+		 * then re-enable it now as the task is running on its
+		 * preferred node and memory should migrate locally
+		 */
+		if (!p->numa_migrate_seq)
+			p->numa_migrate_seq++;
 		return;
+	}
 
 	/* This task has no NUMA fault statistics yet */
 	if (unlikely(p->numa_preferred_nid == -1))


* [tip:sched/core] sched/numa: Do not trap hinting faults for shared libraries
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:29   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  4591ce4f2d22dc9de7a6719161ce409b5fd1caac
Gitweb:     http://git.kernel.org/tip/4591ce4f2d22dc9de7a6719161ce409b5fd1caac
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:13 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:42 +0200

sched/numa: Do not trap hinting faults for shared libraries

NUMA hinting faults will not migrate a shared executable page mapped by
multiple processes on the grounds that the data is probably in the CPU
cache already and the page may just bounce between tasks running on multiple
nodes. Even if the migration is avoided, there is still the overhead of
trapping the fault, updating the statistics, making scheduler placement
decisions based on the information, etc. If we are never going to migrate
the page, it is overhead for no gain and, worse, a process may be placed on
a sub-optimal node for shared executable pages. This patch avoids trapping
faults for shared libraries entirely.
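
Modelled in userspace (the flag values and struct are stand-ins for the
kernel's vm_flags and vm_area_struct), the filter skips a VMA when it has no
mm, such as the vdso, or is a read-only file mapping:

#include <stdbool.h>
#include <stdio.h>

#define VM_READ         0x00000001UL    /* stand-in values for illustration */
#define VM_WRITE        0x00000002UL

struct vma_model {
        bool has_mm;            /* false for special mappings such as the vdso */
        bool file_backed;
        unsigned long vm_flags;
};

/* Skip NUMA hinting faults for mappings that would never be migrated:
 * read-only file mappings (shared library text) and mm-less mappings. */
static bool skip_hinting_faults(const struct vma_model *vma)
{
        if (!vma->has_mm)
                return true;
        return vma->file_backed &&
               (vma->vm_flags & (VM_READ | VM_WRITE)) == VM_READ;
}

int main(void)
{
        struct vma_model lib_text = { true, true, VM_READ };
        struct vma_model heap     = { true, false, VM_READ | VM_WRITE };

        printf("%d %d\n", skip_hinting_faults(&lib_text), skip_hinting_faults(&heap));
        return 0;
}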

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-36-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de9b4d8..fbc0c84 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1231,6 +1231,16 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
 			continue;
 
+		/*
+		 * Shared library pages mapped by multiple processes are not
+		 * migrated as it is expected they are cache replicated. Avoid
+		 * hinting faults in read-only file-backed mappings or the vdso
+		 * as migrating the pages will be of marginal benefit.
+		 */
+		if (!vma->vm_mm ||
+		    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);


* [tip:sched/core] mm: numa: Trap pmd hinting faults only if we would otherwise trap PTE faults
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:29   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  25cbbef1924299249756bc4030fcb2436c019813
Gitweb:     http://git.kernel.org/tip/25cbbef1924299249756bc4030fcb2436c019813
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:14 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:44 +0200

mm: numa: Trap pmd hinting faults only if we would otherwise trap PTE faults

Base page PMD faulting is meant to batch handle NUMA hinting faults from
PTEs. However, even if no PTE faults would ever be handled within a
range, the kernel still traps PMD hinting faults. This patch avoids the
overhead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-37-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/mprotect.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index f0b087d..5aae390 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -146,6 +146,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 	pmd = pmd_offset(pud, addr);
 	do {
+		unsigned long this_pages;
+
 		next = pmd_addr_end(addr, end);
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
@@ -165,8 +167,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		}
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		pages += change_pte_range(vma, pmd, addr, next, newprot,
+		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
 				 dirty_accountable, prot_numa, &all_same_nidpid);
+		pages += this_pages;
 
 		/*
 		 * If we are changing protections for NUMA hinting faults then
@@ -174,7 +177,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && all_same_nidpid)
+		if (prot_numa && this_pages && all_same_nidpid)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 


* [tip:sched/core] stop_machine: Introduce stop_two_cpus()
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:30   ` tip-bot for Peter Zijlstra
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  1be0bd77c5dd7c903f46abf52f9a3650face3c1d
Gitweb:     http://git.kernel.org/tip/1be0bd77c5dd7c903f46abf52f9a3650face3c1d
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:29:15 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:45 +0200

stop_machine: Introduce stop_two_cpus()

Introduce stop_two_cpus() in order to allow controlled swapping of two
tasks. It repurposes the stop_machine() state machine but only stops
the two cpus, which we can do with on-stack structures, avoiding
machine-wide synchronization issues.

The ordering of CPUs is important to avoid deadlocks. If unordered, two
cpus calling stop_two_cpus() on each other simultaneously would attempt
to queue works in the opposite order on each CPU, causing an AB-BA style
deadlock.
By always having the lowest number CPU doing the queueing of works, we can
guarantee that works are always queued in the same order, and deadlocks
are avoided.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[ Implemented deadlock avoidance. ]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/1381141781-10992-38-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/stop_machine.h |   1 +
 kernel/stop_machine.c        | 272 +++++++++++++++++++++++++++----------------
 2 files changed, 175 insertions(+), 98 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 3b5e910..d2abbdb 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -28,6 +28,7 @@ struct cpu_stop_work {
 };
 
 int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg);
 void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
 			 struct cpu_stop_work *work_buf);
 int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..32a6c44 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -115,6 +115,166 @@ int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
 	return done.executed ? done.ret : -ENOENT;
 }
 
+/* This controls the threads on each CPU. */
+enum multi_stop_state {
+	/* Dummy starting state for thread. */
+	MULTI_STOP_NONE,
+	/* Awaiting everyone to be scheduled. */
+	MULTI_STOP_PREPARE,
+	/* Disable interrupts. */
+	MULTI_STOP_DISABLE_IRQ,
+	/* Run the function */
+	MULTI_STOP_RUN,
+	/* Exit */
+	MULTI_STOP_EXIT,
+};
+
+struct multi_stop_data {
+	int			(*fn)(void *);
+	void			*data;
+	/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
+	unsigned int		num_threads;
+	const struct cpumask	*active_cpus;
+
+	enum multi_stop_state	state;
+	atomic_t		thread_ack;
+};
+
+static void set_state(struct multi_stop_data *msdata,
+		      enum multi_stop_state newstate)
+{
+	/* Reset ack counter. */
+	atomic_set(&msdata->thread_ack, msdata->num_threads);
+	smp_wmb();
+	msdata->state = newstate;
+}
+
+/* Last one to ack a state moves to the next state. */
+static void ack_state(struct multi_stop_data *msdata)
+{
+	if (atomic_dec_and_test(&msdata->thread_ack))
+		set_state(msdata, msdata->state + 1);
+}
+
+/* This is the cpu_stop function which stops the CPU. */
+static int multi_cpu_stop(void *data)
+{
+	struct multi_stop_data *msdata = data;
+	enum multi_stop_state curstate = MULTI_STOP_NONE;
+	int cpu = smp_processor_id(), err = 0;
+	unsigned long flags;
+	bool is_active;
+
+	/*
+	 * When called from stop_machine_from_inactive_cpu(), irq might
+	 * already be disabled.  Save the state and restore it on exit.
+	 */
+	local_save_flags(flags);
+
+	if (!msdata->active_cpus)
+		is_active = cpu == cpumask_first(cpu_online_mask);
+	else
+		is_active = cpumask_test_cpu(cpu, msdata->active_cpus);
+
+	/* Simple state machine */
+	do {
+		/* Chill out and ensure we re-read multi_stop_state. */
+		cpu_relax();
+		if (msdata->state != curstate) {
+			curstate = msdata->state;
+			switch (curstate) {
+			case MULTI_STOP_DISABLE_IRQ:
+				local_irq_disable();
+				hard_irq_disable();
+				break;
+			case MULTI_STOP_RUN:
+				if (is_active)
+					err = msdata->fn(msdata->data);
+				break;
+			default:
+				break;
+			}
+			ack_state(msdata);
+		}
+	} while (curstate != MULTI_STOP_EXIT);
+
+	local_irq_restore(flags);
+	return err;
+}
+
+struct irq_cpu_stop_queue_work_info {
+	int cpu1;
+	int cpu2;
+	struct cpu_stop_work *work1;
+	struct cpu_stop_work *work2;
+};
+
+/*
+ * This function is always run with irqs and preemption disabled.
+ * This guarantees that both work1 and work2 get queued, before
+ * our local migrate thread gets the chance to preempt us.
+ */
+static void irq_cpu_stop_queue_work(void *arg)
+{
+	struct irq_cpu_stop_queue_work_info *info = arg;
+	cpu_stop_queue_work(info->cpu1, info->work1);
+	cpu_stop_queue_work(info->cpu2, info->work2);
+}
+
+/**
+ * stop_two_cpus - stops two cpus
+ * @cpu1: the cpu to stop
+ * @cpu2: the other cpu to stop
+ * @fn: function to execute
+ * @arg: argument to @fn
+ *
+ * Stops both the current and specified CPU and runs @fn on one of them.
+ *
+ * returns when both are completed.
+ */
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg)
+{
+	int call_cpu;
+	struct cpu_stop_done done;
+	struct cpu_stop_work work1, work2;
+	struct irq_cpu_stop_queue_work_info call_args;
+	struct multi_stop_data msdata = {
+		.fn = fn,
+		.data = arg,
+		.num_threads = 2,
+		.active_cpus = cpumask_of(cpu1),
+	};
+
+	work1 = work2 = (struct cpu_stop_work){
+		.fn = multi_cpu_stop,
+		.arg = &msdata,
+		.done = &done
+	};
+
+	call_args = (struct irq_cpu_stop_queue_work_info){
+		.cpu1 = cpu1,
+		.cpu2 = cpu2,
+		.work1 = &work1,
+		.work2 = &work2,
+	};
+
+	cpu_stop_init_done(&done, 2);
+	set_state(&msdata, MULTI_STOP_PREPARE);
+
+	/*
+	 * Queuing needs to be done by the lowest numbered CPU, to ensure
+	 * that works are always queued in the same order on every CPU.
+	 * This prevents deadlocks.
+	 */
+	call_cpu = min(cpu1, cpu2);
+
+	smp_call_function_single(call_cpu, &irq_cpu_stop_queue_work,
+				 &call_args, 0);
+
+	wait_for_completion(&done.completion);
+	return done.executed ? done.ret : -ENOENT;
+}
+
 /**
  * stop_one_cpu_nowait - stop a cpu but don't wait for completion
  * @cpu: cpu to stop
@@ -359,98 +519,14 @@ early_initcall(cpu_stop_init);
 
 #ifdef CONFIG_STOP_MACHINE
 
-/* This controls the threads on each CPU. */
-enum stopmachine_state {
-	/* Dummy starting state for thread. */
-	STOPMACHINE_NONE,
-	/* Awaiting everyone to be scheduled. */
-	STOPMACHINE_PREPARE,
-	/* Disable interrupts. */
-	STOPMACHINE_DISABLE_IRQ,
-	/* Run the function */
-	STOPMACHINE_RUN,
-	/* Exit */
-	STOPMACHINE_EXIT,
-};
-
-struct stop_machine_data {
-	int			(*fn)(void *);
-	void			*data;
-	/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
-	unsigned int		num_threads;
-	const struct cpumask	*active_cpus;
-
-	enum stopmachine_state	state;
-	atomic_t		thread_ack;
-};
-
-static void set_state(struct stop_machine_data *smdata,
-		      enum stopmachine_state newstate)
-{
-	/* Reset ack counter. */
-	atomic_set(&smdata->thread_ack, smdata->num_threads);
-	smp_wmb();
-	smdata->state = newstate;
-}
-
-/* Last one to ack a state moves to the next state. */
-static void ack_state(struct stop_machine_data *smdata)
-{
-	if (atomic_dec_and_test(&smdata->thread_ack))
-		set_state(smdata, smdata->state + 1);
-}
-
-/* This is the cpu_stop function which stops the CPU. */
-static int stop_machine_cpu_stop(void *data)
-{
-	struct stop_machine_data *smdata = data;
-	enum stopmachine_state curstate = STOPMACHINE_NONE;
-	int cpu = smp_processor_id(), err = 0;
-	unsigned long flags;
-	bool is_active;
-
-	/*
-	 * When called from stop_machine_from_inactive_cpu(), irq might
-	 * already be disabled.  Save the state and restore it on exit.
-	 */
-	local_save_flags(flags);
-
-	if (!smdata->active_cpus)
-		is_active = cpu == cpumask_first(cpu_online_mask);
-	else
-		is_active = cpumask_test_cpu(cpu, smdata->active_cpus);
-
-	/* Simple state machine */
-	do {
-		/* Chill out and ensure we re-read stopmachine_state. */
-		cpu_relax();
-		if (smdata->state != curstate) {
-			curstate = smdata->state;
-			switch (curstate) {
-			case STOPMACHINE_DISABLE_IRQ:
-				local_irq_disable();
-				hard_irq_disable();
-				break;
-			case STOPMACHINE_RUN:
-				if (is_active)
-					err = smdata->fn(smdata->data);
-				break;
-			default:
-				break;
-			}
-			ack_state(smdata);
-		}
-	} while (curstate != STOPMACHINE_EXIT);
-
-	local_irq_restore(flags);
-	return err;
-}
-
 int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 {
-	struct stop_machine_data smdata = { .fn = fn, .data = data,
-					    .num_threads = num_online_cpus(),
-					    .active_cpus = cpus };
+	struct multi_stop_data msdata = {
+		.fn = fn,
+		.data = data,
+		.num_threads = num_online_cpus(),
+		.active_cpus = cpus,
+	};
 
 	if (!stop_machine_initialized) {
 		/*
@@ -461,7 +537,7 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 		unsigned long flags;
 		int ret;
 
-		WARN_ON_ONCE(smdata.num_threads != 1);
+		WARN_ON_ONCE(msdata.num_threads != 1);
 
 		local_irq_save(flags);
 		hard_irq_disable();
@@ -472,8 +548,8 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 	}
 
 	/* Set the initial state and stop all online cpus. */
-	set_state(&smdata, STOPMACHINE_PREPARE);
-	return stop_cpus(cpu_online_mask, stop_machine_cpu_stop, &smdata);
+	set_state(&msdata, MULTI_STOP_PREPARE);
+	return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);
 }
 
 int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
@@ -513,25 +589,25 @@ EXPORT_SYMBOL_GPL(stop_machine);
 int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
 				  const struct cpumask *cpus)
 {
-	struct stop_machine_data smdata = { .fn = fn, .data = data,
+	struct multi_stop_data msdata = { .fn = fn, .data = data,
 					    .active_cpus = cpus };
 	struct cpu_stop_done done;
 	int ret;
 
 	/* Local CPU must be inactive and CPU hotplug in progress. */
 	BUG_ON(cpu_active(raw_smp_processor_id()));
-	smdata.num_threads = num_active_cpus() + 1;	/* +1 for local */
+	msdata.num_threads = num_active_cpus() + 1;	/* +1 for local */
 
 	/* No proper task established and can't sleep - busy wait for lock. */
 	while (!mutex_trylock(&stop_cpus_mutex))
 		cpu_relax();
 
 	/* Schedule work on other CPUs and execute directly for local CPU */
-	set_state(&smdata, STOPMACHINE_PREPARE);
+	set_state(&msdata, MULTI_STOP_PREPARE);
 	cpu_stop_init_done(&done, num_active_cpus());
-	queue_stop_cpus_work(cpu_active_mask, stop_machine_cpu_stop, &smdata,
+	queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
 			     &done);
-	ret = stop_machine_cpu_stop(&smdata);
+	ret = multi_cpu_stop(&msdata);
 
 	/* Busy wait for completion. */
 	while (!completion_done(&done.completion))

^ permalink raw reply related	[flat|nested] 340+ messages in thread
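
The deadlock-avoidance rule described above is the classic "acquire shared
resources in a single global order" pattern; choosing call_cpu = min(cpu1,
cpu2) makes every pair of CPUs queue the stopper works in the same order.
A small pthreads sketch of the same idea, entirely separate from the kernel
code, where both threads always take the lower-numbered lock first:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock[2] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Take both locks in a global (index) order, regardless of which "cpu"
 * the caller nominally is.  This is what call_cpu = min(cpu1, cpu2)
 * achieves for the queueing of the stopper works. */
static void lock_pair(int a, int b)
{
	int first = a < b ? a : b;
	int second = a < b ? b : a;

	pthread_mutex_lock(&lock[first]);
	pthread_mutex_lock(&lock[second]);
}

static void unlock_pair(int a, int b)
{
	pthread_mutex_unlock(&lock[a]);
	pthread_mutex_unlock(&lock[b]);
}

static void *worker(void *arg)
{
	int me = (int)(long)arg;

	/* Thread 0 "wants" 0 then 1, thread 1 "wants" 1 then 0; the
	 * ordering inside lock_pair() prevents the AB-BA deadlock. */
	lock_pair(me, 1 - me);
	printf("thread %d holds both locks\n", me);
	unlock_pair(me, 1 - me);
	return NULL;
}

int main(void)
{
	pthread_t t[2];
	int i;

	for (i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, worker, (void *)(long)i);
	for (i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	return 0;
}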

* [tip:sched/core] sched/numa: Introduce migrate_swap()
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:30   ` tip-bot for Peter Zijlstra
  2013-10-10 18:17     ` Peter Zijlstra
  -1 siblings, 1 reply; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  ac66f5477239ebd3c4e2cbf2f591ef387aa09884
Gitweb:     http://git.kernel.org/tip/ac66f5477239ebd3c4e2cbf2f591ef387aa09884
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:29:16 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:46 +0200

sched/numa: Introduce migrate_swap()

Use the new stop_two_cpus() to implement migrate_swap(), a function that
flips two tasks between their respective cpus.

I'm fairly sure there's a less crude way than employing the stop_two_cpus()
method, but everything I tried either got horribly fragile and/or complex. So
keep it simple for now.

The notable detail is how we 'migrate' tasks that aren't runnable
anymore. We'll make it appear like we migrated them before they went to
sleep. The sole difference is the previous cpu in the wakeup path, so we
override this.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/1381141781-10992-39-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h    |   2 +
 kernel/sched/core.c      | 106 ++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/fair.c      |   3 +-
 kernel/sched/idle_task.c |   2 +-
 kernel/sched/rt.c        |   5 +--
 kernel/sched/sched.h     |   4 +-
 kernel/sched/stop_task.c |   2 +-
 7 files changed, 110 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 14251a8..b661979 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1043,6 +1043,8 @@ struct task_struct {
 	struct task_struct *last_wakee;
 	unsigned long wakee_flips;
 	unsigned long wakee_flip_decay_ts;
+
+	int wake_cpu;
 #endif
 	int on_rq;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9060a7f..32a2b29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1013,6 +1013,102 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	__set_task_cpu(p, new_cpu);
 }
 
+static void __migrate_swap_task(struct task_struct *p, int cpu)
+{
+	if (p->on_rq) {
+		struct rq *src_rq, *dst_rq;
+
+		src_rq = task_rq(p);
+		dst_rq = cpu_rq(cpu);
+
+		deactivate_task(src_rq, p, 0);
+		set_task_cpu(p, cpu);
+		activate_task(dst_rq, p, 0);
+		check_preempt_curr(dst_rq, p, 0);
+	} else {
+		/*
+		 * Task isn't running anymore; make it appear like we migrated
+		 * it before it went to sleep. This means on wakeup we make the
+		 * previous cpu our target instead of where it really is.
+		 */
+		p->wake_cpu = cpu;
+	}
+}
+
+struct migration_swap_arg {
+	struct task_struct *src_task, *dst_task;
+	int src_cpu, dst_cpu;
+};
+
+static int migrate_swap_stop(void *data)
+{
+	struct migration_swap_arg *arg = data;
+	struct rq *src_rq, *dst_rq;
+	int ret = -EAGAIN;
+
+	src_rq = cpu_rq(arg->src_cpu);
+	dst_rq = cpu_rq(arg->dst_cpu);
+
+	double_rq_lock(src_rq, dst_rq);
+	if (task_cpu(arg->dst_task) != arg->dst_cpu)
+		goto unlock;
+
+	if (task_cpu(arg->src_task) != arg->src_cpu)
+		goto unlock;
+
+	if (!cpumask_test_cpu(arg->dst_cpu, tsk_cpus_allowed(arg->src_task)))
+		goto unlock;
+
+	if (!cpumask_test_cpu(arg->src_cpu, tsk_cpus_allowed(arg->dst_task)))
+		goto unlock;
+
+	__migrate_swap_task(arg->src_task, arg->dst_cpu);
+	__migrate_swap_task(arg->dst_task, arg->src_cpu);
+
+	ret = 0;
+
+unlock:
+	double_rq_unlock(src_rq, dst_rq);
+
+	return ret;
+}
+
+/*
+ * Cross migrate two tasks
+ */
+int migrate_swap(struct task_struct *cur, struct task_struct *p)
+{
+	struct migration_swap_arg arg;
+	int ret = -EINVAL;
+
+	get_online_cpus();
+
+	arg = (struct migration_swap_arg){
+		.src_task = cur,
+		.src_cpu = task_cpu(cur),
+		.dst_task = p,
+		.dst_cpu = task_cpu(p),
+	};
+
+	if (arg.src_cpu == arg.dst_cpu)
+		goto out;
+
+	if (!cpu_active(arg.src_cpu) || !cpu_active(arg.dst_cpu))
+		goto out;
+
+	if (!cpumask_test_cpu(arg.dst_cpu, tsk_cpus_allowed(arg.src_task)))
+		goto out;
+
+	if (!cpumask_test_cpu(arg.src_cpu, tsk_cpus_allowed(arg.dst_task)))
+		goto out;
+
+	ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
+
+out:
+	put_online_cpus();
+	return ret;
+}
+
 struct migration_arg {
 	struct task_struct *task;
 	int dest_cpu;
@@ -1232,9 +1328,9 @@ out:
  * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
  */
 static inline
-int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
 {
-	int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
+	cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
 
 	/*
 	 * In order not to call set_task_cpu() on a blocking task we need
@@ -1518,7 +1614,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	if (p->sched_class->task_waking)
 		p->sched_class->task_waking(p);
 
-	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
@@ -1752,7 +1848,7 @@ void wake_up_new_task(struct task_struct *p)
 	 *  - cpus_allowed can change in the fork path
 	 *  - any previously selected cpu might disappear through hotplug
 	 */
-	set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
+	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 
 	/* Initialize new task's runnable average */
@@ -2080,7 +2176,7 @@ void sched_exec(void)
 	int dest_cpu;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
+	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
 		goto unlock;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbc0c84..b1e5061 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3706,11 +3706,10 @@ done:
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d8da010..516c3d9 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -9,7 +9,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ceebfba..e9304cd 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1169,13 +1169,10 @@ static void yield_task_rt(struct rq *rq)
 static int find_lowest_rq(struct task_struct *task);
 
 static int
-select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	struct task_struct *curr;
 	struct rq *rq;
-	int cpu;
-
-	cpu = task_cpu(p);
 
 	if (p->nr_cpus_allowed == 1)
 		goto out;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 66458c9..4dc92d0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -558,6 +558,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 
 #ifdef CONFIG_NUMA_BALANCING
 extern int migrate_task_to(struct task_struct *p, int cpu);
+extern int migrate_swap(struct task_struct *, struct task_struct *);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
@@ -736,6 +737,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 	 */
 	smp_wmb();
 	task_thread_info(p)->cpu = cpu;
+	p->wake_cpu = cpu;
 #endif
 }
 
@@ -991,7 +993,7 @@ struct sched_class {
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
-	int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
+	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index e08fbee..47197de 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -11,7 +11,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_stop(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	return task_cpu(p); /* stop tasks as never migrate */
 }

^ permalink raw reply related	[flat|nested] 340+ messages in thread
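
migrate_swap() checks the affinity masks optimistically and then
migrate_swap_stop() re-validates everything under both runqueue locks,
backing off with -EAGAIN if either task has moved or may no longer run on
the other CPU. A stripped-down model of that validation step, with plain
structs standing in for task_struct and the cpumask helpers:

#include <errno.h>
#include <stdio.h>

struct toy_task {
	int cpu;                /* where the task currently runs */
	unsigned long allowed;  /* bitmask of cpus it may run on */
};

static int allowed_on(const struct toy_task *t, int cpu)
{
	return (t->allowed >> cpu) & 1;
}

/* Mirrors the re-checks done under double_rq_lock() in migrate_swap_stop(). */
static int try_swap(struct toy_task *src, int src_cpu,
		    struct toy_task *dst, int dst_cpu)
{
	if (src->cpu != src_cpu || dst->cpu != dst_cpu)
		return -EAGAIN;         /* something moved since the snapshot */
	if (!allowed_on(src, dst_cpu) || !allowed_on(dst, src_cpu))
		return -EAGAIN;         /* affinity forbids the swap */

	src->cpu = dst_cpu;             /* __migrate_swap_task() on each side */
	dst->cpu = src_cpu;
	return 0;
}

int main(void)
{
	struct toy_task a = { .cpu = 0, .allowed = 0x3 };  /* cpus 0 and 1 */
	struct toy_task b = { .cpu = 1, .allowed = 0x3 };
	int ret;

	ret = try_swap(&a, 0, &b, 1);
	printf("swap: %d  (a now on %d, b now on %d)\n", ret, a.cpu, b.cpu);

	/* A stale snapshot fails cleanly instead of corrupting state. */
	ret = try_swap(&a, 0, &b, 1);
	printf("stale swap attempt: %d\n", ret);
	return 0;
}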

* [tip:sched/core] sched/numa: Use a system-wide search to find swap/migration candidates
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:30   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  fb13c7ee0ed387bd6bec4b4024a4d49b1bd504f1
Gitweb:     http://git.kernel.org/tip/fb13c7ee0ed387bd6bec4b4024a4d49b1bd504f1
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:17 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:47:25 +0200

sched/numa: Use a system-wide search to find swap/migration candidates

This patch implements a system-wide search for swap/migration candidates
based on total NUMA hinting faults. It has a balance limit, but it does
not yet properly consider total node balance.

In the old scheme a task selected a preferred node based on the highest
number of private faults recorded on the node. In this scheme, the preferred
node is based on the total number of faults. If the preferred node for a
task changes then task_numa_migrate will search the whole system looking
for tasks to swap with that would improve both the overall compute
balance and minimise the expected number of remote NUMA hinting faults.

Note that there is no guarantee that the node the source task is placed
on by task_numa_migrate() has any relationship to the newly selected
task->numa_preferred_nid due to compute overloading.

Signed-off-by: Mel Gorman <mgorman@suse.de>
[ Do not swap with tasks that cannot run on the source cpu. ]
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ Fixed compiler warning on UP. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-40-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  |   4 +
 kernel/sched/fair.c  | 253 ++++++++++++++++++++++++++++++++++++---------------
 kernel/sched/sched.h |  13 +++
 3 files changed, 199 insertions(+), 71 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 32a2b29..1fe59da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5236,6 +5236,7 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -5252,6 +5253,9 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
+
+	sd = lowest_flag_domain(cpu, SD_NUMA);
+	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1e5061..1422765 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -681,6 +681,8 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_SMP
+static unsigned long task_h_load(struct task_struct *p);
+
 static inline void __update_task_entity_contrib(struct sched_entity *se);
 
 /* Give new task start runnable values to heavy its load in infant time */
@@ -906,12 +908,40 @@ static unsigned long target_load(int cpu, int type);
 static unsigned long power_of(int cpu);
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
+/* Cached statistics for all CPUs within a node */
 struct numa_stats {
+	unsigned long nr_running;
 	unsigned long load;
-	s64 eff_load;
-	unsigned long faults;
+
+	/* Total compute capacity of CPUs on a node */
+	unsigned long power;
+
+	/* Approximate capacity in terms of runnable tasks on a node */
+	unsigned long capacity;
+	int has_capacity;
 };
 
+/*
+ * XXX borrowed from update_sg_lb_stats
+ */
+static void update_numa_stats(struct numa_stats *ns, int nid)
+{
+	int cpu;
+
+	memset(ns, 0, sizeof(*ns));
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		ns->nr_running += rq->nr_running;
+		ns->load += weighted_cpuload(cpu);
+		ns->power += power_of(cpu);
+	}
+
+	ns->load = (ns->load * SCHED_POWER_SCALE) / ns->power;
+	ns->capacity = DIV_ROUND_CLOSEST(ns->power, SCHED_POWER_SCALE);
+	ns->has_capacity = (ns->nr_running < ns->capacity);
+}
+
 struct task_numa_env {
 	struct task_struct *p;
 
@@ -920,95 +950,178 @@ struct task_numa_env {
 
 	struct numa_stats src_stats, dst_stats;
 
-	unsigned long best_load;
+	int imbalance_pct, idx;
+
+	struct task_struct *best_task;
+	long best_imp;
 	int best_cpu;
 };
 
+static void task_numa_assign(struct task_numa_env *env,
+			     struct task_struct *p, long imp)
+{
+	if (env->best_task)
+		put_task_struct(env->best_task);
+	if (p)
+		get_task_struct(p);
+
+	env->best_task = p;
+	env->best_imp = imp;
+	env->best_cpu = env->dst_cpu;
+}
+
+/*
+ * This checks if the overall compute and NUMA accesses of the system would
+ * be improved if the source task was migrated to the target dst_cpu, taking
+ * into account that it might be best if the task running on the dst_cpu should
+ * be exchanged with the source task
+ */
+static void task_numa_compare(struct task_numa_env *env, long imp)
+{
+	struct rq *src_rq = cpu_rq(env->src_cpu);
+	struct rq *dst_rq = cpu_rq(env->dst_cpu);
+	struct task_struct *cur;
+	long dst_load, src_load;
+	long load;
+
+	rcu_read_lock();
+	cur = ACCESS_ONCE(dst_rq->curr);
+	if (cur->pid == 0) /* idle */
+		cur = NULL;
+
+	/*
+	 * "imp" is the fault differential for the source task between the
+	 * source and destination node. Calculate the total differential for
+	 * the source task and potential destination task. The more negative
+	 * the value is, the more remote accesses that would be expected to
+	 * be incurred if the tasks were swapped.
+	 */
+	if (cur) {
+		/* Skip this swap candidate if cannot move to the source cpu */
+		if (!cpumask_test_cpu(env->src_cpu, tsk_cpus_allowed(cur)))
+			goto unlock;
+
+		imp += task_faults(cur, env->src_nid) -
+		       task_faults(cur, env->dst_nid);
+	}
+
+	if (imp < env->best_imp)
+		goto unlock;
+
+	if (!cur) {
+		/* Is there capacity at our destination? */
+		if (env->src_stats.has_capacity &&
+		    !env->dst_stats.has_capacity)
+			goto unlock;
+
+		goto balance;
+	}
+
+	/* Balance doesn't matter much if we're running a task per cpu */
+	if (src_rq->nr_running == 1 && dst_rq->nr_running == 1)
+		goto assign;
+
+	/*
+	 * In the overloaded case, try and keep the load balanced.
+	 */
+balance:
+	dst_load = env->dst_stats.load;
+	src_load = env->src_stats.load;
+
+	/* XXX missing power terms */
+	load = task_h_load(env->p);
+	dst_load += load;
+	src_load -= load;
+
+	if (cur) {
+		load = task_h_load(cur);
+		dst_load -= load;
+		src_load += load;
+	}
+
+	/* make src_load the smaller */
+	if (dst_load < src_load)
+		swap(dst_load, src_load);
+
+	if (src_load * env->imbalance_pct < dst_load * 100)
+		goto unlock;
+
+assign:
+	task_numa_assign(env, cur, imp);
+unlock:
+	rcu_read_unlock();
+}
+
 static int task_numa_migrate(struct task_struct *p)
 {
-	int node_cpu = cpumask_first(cpumask_of_node(p->numa_preferred_nid));
 	struct task_numa_env env = {
 		.p = p,
+
 		.src_cpu = task_cpu(p),
 		.src_nid = cpu_to_node(task_cpu(p)),
-		.dst_cpu = node_cpu,
-		.dst_nid = p->numa_preferred_nid,
-		.best_load = ULONG_MAX,
-		.best_cpu = task_cpu(p),
+
+		.imbalance_pct = 112,
+
+		.best_task = NULL,
+		.best_imp = 0,
+		.best_cpu = -1
 	};
 	struct sched_domain *sd;
-	int cpu;
-	struct task_group *tg = task_group(p);
-	unsigned long weight;
-	bool balanced;
-	int imbalance_pct, idx = -1;
+	unsigned long faults;
+	int nid, cpu, ret;
 
 	/*
-	 * Find the lowest common scheduling domain covering the nodes of both
-	 * the CPU the task is currently running on and the target NUMA node.
+	 * Pick the lowest SD_NUMA domain, as that would have the smallest
+	 * imbalance and would be the first to start moving tasks about.
+	 *
+	 * And we want to avoid any moving of tasks about, as that would create
+	 * random movement of tasks -- counter to the numa conditions we're trying
+	 * to satisfy here.
 	 */
 	rcu_read_lock();
-	for_each_domain(env.src_cpu, sd) {
-		if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
-			/*
-			 * busy_idx is used for the load decision as it is the
-			 * same index used by the regular load balancer for an
-			 * active cpu.
-			 */
-			idx = sd->busy_idx;
-			imbalance_pct = sd->imbalance_pct;
-			break;
-		}
-	}
+	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
+	env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
 	rcu_read_unlock();
 
-	if (WARN_ON_ONCE(idx == -1))
-		return 0;
+	faults = task_faults(p, env.src_nid);
+	update_numa_stats(&env.src_stats, env.src_nid);
 
-	/*
-	 * XXX the below is mostly nicked from wake_affine(); we should
-	 * see about sharing a bit if at all possible; also it might want
-	 * some per entity weight love.
-	 */
-	weight = p->se.load.weight;
-	env.src_stats.load = source_load(env.src_cpu, idx);
-	env.src_stats.eff_load = 100 + (imbalance_pct - 100) / 2;
-	env.src_stats.eff_load *= power_of(env.src_cpu);
-	env.src_stats.eff_load *= env.src_stats.load + effective_load(tg, env.src_cpu, -weight, -weight);
-
-	for_each_cpu(cpu, cpumask_of_node(env.dst_nid)) {
-		env.dst_cpu = cpu;
-		env.dst_stats.load = target_load(cpu, idx);
-
-		/* If the CPU is idle, use it */
-		if (!env.dst_stats.load) {
-			env.best_cpu = cpu;
-			goto migrate;
-		}
+	/* Find an alternative node with relatively better statistics */
+	for_each_online_node(nid) {
+		long imp;
 
-		/* Otherwise check the target CPU load */
-		env.dst_stats.eff_load = 100;
-		env.dst_stats.eff_load *= power_of(cpu);
-		env.dst_stats.eff_load *= env.dst_stats.load + effective_load(tg, cpu, weight, weight);
+		if (nid == env.src_nid)
+			continue;
 
-		/*
-		 * Destination is considered balanced if the destination CPU is
-		 * less loaded than the source CPU. Unfortunately there is a
-		 * risk that a task running on a lightly loaded CPU will not
-		 * migrate to its preferred node due to load imbalances.
-		 */
-		balanced = (env.dst_stats.eff_load <= env.src_stats.eff_load);
-		if (!balanced)
+		/* Only consider nodes that recorded more faults */
+		imp = task_faults(p, nid) - faults;
+		if (imp < 0)
 			continue;
 
-		if (env.dst_stats.eff_load < env.best_load) {
-			env.best_load = env.dst_stats.eff_load;
-			env.best_cpu = cpu;
+		env.dst_nid = nid;
+		update_numa_stats(&env.dst_stats, env.dst_nid);
+		for_each_cpu(cpu, cpumask_of_node(nid)) {
+			/* Skip this CPU if the source task cannot migrate */
+			if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+				continue;
+
+			env.dst_cpu = cpu;
+			task_numa_compare(&env, imp);
 		}
 	}
 
-migrate:
-	return migrate_task_to(p, env.best_cpu);
+	/* No better CPU than the current one was found. */
+	if (env.best_cpu == -1)
+		return -EAGAIN;
+
+	if (env.best_task == NULL) {
+		int ret = migrate_task_to(p, env.best_cpu);
+		return ret;
+	}
+
+	ret = migrate_swap(p, env.best_task);
+	put_task_struct(env.best_task);
+	return ret;
 }
 
 /* Attempt to migrate a task to a CPU on the preferred node. */
@@ -1050,7 +1163,7 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
-		unsigned long faults;
+		unsigned long faults = 0;
 		int priv, i;
 
 		for (priv = 0; priv < 2; priv++) {
@@ -1060,10 +1173,10 @@ static void task_numa_placement(struct task_struct *p)
 			p->numa_faults[i] >>= 1;
 			p->numa_faults[i] += p->numa_faults_buffer[i];
 			p->numa_faults_buffer[i] = 0;
+
+			faults += p->numa_faults[i];
 		}
 
-		/* Find maximum private faults */
-		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -4455,8 +4568,6 @@ static int move_one_task(struct lb_env *env)
 	return 0;
 }
 
-static unsigned long task_h_load(struct task_struct *p);
-
 static const unsigned int sched_nr_migrate_break = 32;
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4dc92d0..691e969 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -610,9 +610,22 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 	return hsd;
 }
 
+static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd;
+
+	for_each_domain(cpu, sd) {
+		if (sd->flags & flag)
+			break;
+	}
+
+	return sd;
+}
+
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 
 struct sched_group_power {
 	atomic_t ref;

^ permalink raw reply related	[flat|nested] 340+ messages in thread
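
The heart of task_numa_compare() above is a pair of arithmetic checks: the
combined fault improvement for the two tasks, and an imbalance_pct test on
what the loads would look like after the swap. A standalone sketch of both
calculations with invented fault and load numbers:

#include <stdio.h>

/*
 * All numbers are invented for illustration.  imp > 0 means the swap is
 * expected to reduce remote NUMA hinting faults overall; the load check
 * then rejects swaps that would leave the busier side more than
 * imbalance_pct percent above the other (112 in task_numa_migrate()).
 */
static long pair_improvement(long p_src_faults, long p_dst_faults,
			     long cur_src_faults, long cur_dst_faults)
{
	return (p_dst_faults - p_src_faults) +
	       (cur_src_faults - cur_dst_faults);
}

static int load_acceptable(long src_load, long dst_load, int imbalance_pct)
{
	/* make src_load the smaller, as task_numa_compare() does */
	if (dst_load < src_load) {
		long tmp = src_load;
		src_load = dst_load;
		dst_load = tmp;
	}
	return src_load * imbalance_pct >= dst_load * 100;
}

int main(void)
{
	/* p: 40 faults on its own node, 120 on the candidate node;
	 * cur: 90 on its own node, 10 on p's node. */
	printf("pair improvement: %ld\n", pair_improvement(40, 120, 90, 10));

	/* hypothetical post-swap loads, imbalance_pct = 112 */
	printf("1000 vs 1100 acceptable? %d\n", load_acceptable(1000, 1100, 112));
	printf("1000 vs 1200 acceptable? %d\n", load_acceptable(1000, 1200, 112));
	return 0;
}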

* [tip:sched/core] sched/numa: Favor placing a task on the preferred node
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:30   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  2c8a50aa873a7e1d6cc0913362051ff9912dc6ca
Gitweb:     http://git.kernel.org/tip/2c8a50aa873a7e1d6cc0913362051ff9912dc6ca
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:18 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:47:41 +0200

sched/numa: Favor placing a task on the preferred node

A task's preferred node is selected based on the number of faults
recorded for a node, but task_numa_migrate() conducts a global
search regardless of the preferred nid. This patch checks if the
preferred nid has capacity and if so, searches for a CPU within that
node. This avoids a global search when the preferred node is not
overloaded.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-41-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1422765..09aac90 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1052,6 +1052,20 @@ unlock:
 	rcu_read_unlock();
 }
 
+static void task_numa_find_cpu(struct task_numa_env *env, long imp)
+{
+	int cpu;
+
+	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
+		/* Skip this CPU if the source task cannot migrate */
+		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
+			continue;
+
+		env->dst_cpu = cpu;
+		task_numa_compare(env, imp);
+	}
+}
+
 static int task_numa_migrate(struct task_struct *p)
 {
 	struct task_numa_env env = {
@@ -1068,7 +1082,8 @@ static int task_numa_migrate(struct task_struct *p)
 	};
 	struct sched_domain *sd;
 	unsigned long faults;
-	int nid, cpu, ret;
+	int nid, ret;
+	long imp;
 
 	/*
 	 * Pick the lowest SD_NUMA domain, as that would have the smallest
@@ -1085,28 +1100,29 @@ static int task_numa_migrate(struct task_struct *p)
 
 	faults = task_faults(p, env.src_nid);
 	update_numa_stats(&env.src_stats, env.src_nid);
+	env.dst_nid = p->numa_preferred_nid;
+	imp = task_faults(env.p, env.dst_nid) - faults;
+	update_numa_stats(&env.dst_stats, env.dst_nid);
 
-	/* Find an alternative node with relatively better statistics */
-	for_each_online_node(nid) {
-		long imp;
-
-		if (nid == env.src_nid)
-			continue;
-
-		/* Only consider nodes that recorded more faults */
-		imp = task_faults(p, nid) - faults;
-		if (imp < 0)
-			continue;
+	/*
+	 * If the preferred nid has capacity then use it. Otherwise find an
+	 * alternative node with relatively better statistics.
+	 */
+	if (env.dst_stats.has_capacity) {
+		task_numa_find_cpu(&env, imp);
+	} else {
+		for_each_online_node(nid) {
+			if (nid == env.src_nid || nid == p->numa_preferred_nid)
+				continue;
 
-		env.dst_nid = nid;
-		update_numa_stats(&env.dst_stats, env.dst_nid);
-		for_each_cpu(cpu, cpumask_of_node(nid)) {
-			/* Skip this CPU if the source task cannot migrate */
-			if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+			/* Only consider nodes that recorded more faults */
+			imp = task_faults(env.p, nid) - faults;
+			if (imp < 0)
 				continue;
 
-			env.dst_cpu = cpu;
-			task_numa_compare(&env, imp);
+			env.dst_nid = nid;
+			update_numa_stats(&env.dst_stats, env.dst_nid);
+			task_numa_find_cpu(&env, imp);
 		}
 	}
 

^ permalink raw reply related	[flat|nested] 340+ messages in thread
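
The has_capacity test that gates the new preferred-node fast path comes
from update_numa_stats(): the summed cpu power of the node is scaled down
to an approximate task capacity and compared with the number of running
tasks. A toy version of that estimate with assumed per-cpu figures:

#include <stdio.h>

#define SCHED_POWER_SCALE 1024

struct toy_numa_stats {
	unsigned long nr_running;
	unsigned long power;
	unsigned long capacity;
	int has_capacity;
};

/* Roughly what update_numa_stats() computes per node. */
static void toy_update_numa_stats(struct toy_numa_stats *ns,
				  const unsigned long *cpu_power,
				  const unsigned long *cpu_nr_running,
				  int nr_cpus)
{
	int cpu;

	ns->nr_running = ns->power = 0;
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		ns->nr_running += cpu_nr_running[cpu];
		ns->power += cpu_power[cpu];
	}
	/* DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE) */
	ns->capacity = (ns->power + SCHED_POWER_SCALE / 2) / SCHED_POWER_SCALE;
	ns->has_capacity = ns->nr_running < ns->capacity;
}

int main(void)
{
	/* 4 cpus at nominal power: room for one more task, or none. */
	unsigned long power[4] = { 1024, 1024, 1024, 1024 };
	unsigned long running_light[4] = { 1, 1, 1, 0 };
	unsigned long running_full[4]  = { 1, 1, 1, 1 };
	struct toy_numa_stats ns;

	toy_update_numa_stats(&ns, power, running_light, 4);
	printf("capacity=%lu running=%lu has_capacity=%d\n",
	       ns.capacity, ns.nr_running, ns.has_capacity);

	toy_update_numa_stats(&ns, power, running_full, 4);
	printf("capacity=%lu running=%lu has_capacity=%d\n",
	       ns.capacity, ns.nr_running, ns.has_capacity);
	return 0;
}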

* [tip:sched/core] sched/numa: Fix placement of workloads spread across multiple nodes
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:30   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  e1dda8a797b59d7ec4b17e393152ec3273a552d5
Gitweb:     http://git.kernel.org/tip/e1dda8a797b59d7ec4b17e393152ec3273a552d5
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:19 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:47:43 +0200

sched/numa: Fix placement of workloads spread across multiple nodes

The load balancer will spread workloads across multiple NUMA nodes,
in order to balance the load on the system. This means that sometimes
a task's preferred node has available capacity, but moving the task
there will not succeed, because that would create too large an imbalance.

In that case, other NUMA nodes need to be considered.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-42-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09aac90..aa561c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1104,13 +1104,12 @@ static int task_numa_migrate(struct task_struct *p)
 	imp = task_faults(env.p, env.dst_nid) - faults;
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
-	/*
-	 * If the preferred nid has capacity then use it. Otherwise find an
-	 * alternative node with relatively better statistics.
-	 */
-	if (env.dst_stats.has_capacity) {
+	/* If the preferred nid has capacity, try to use it. */
+	if (env.dst_stats.has_capacity)
 		task_numa_find_cpu(&env, imp);
-	} else {
+
+	/* No space available on the preferred nid. Look elsewhere. */
+	if (env.best_cpu == -1) {
 		for_each_online_node(nid) {
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;

^ permalink raw reply related	[flat|nested] 340+ messages in thread
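
After this fix the placement policy is a two-step search: look for a
candidate on the preferred node first, and only widen to the remaining
nodes when best_cpu is still -1. A compact sketch of that decision flow
over fabricated per-node data; it deliberately skips the
task_numa_compare() scoring and simply takes any candidate it finds.

#include <stdio.h>

#define NR_NODES 3

struct node_info {
	int has_capacity;
	long faults;        /* this task's faults recorded on the node */
	int candidate_cpu;  /* -1 if no suitable cpu/swap was found there */
};

/* Mimics task_numa_migrate(): preferred node first, other nodes as fallback.
 * The real code scores candidates via task_numa_compare(); this toy does not. */
static int pick_cpu(const struct node_info *node, int src_nid, int pref_nid)
{
	long src_faults = node[src_nid].faults;
	int best_cpu = -1;
	int nid;

	if (node[pref_nid].has_capacity)
		best_cpu = node[pref_nid].candidate_cpu;

	if (best_cpu == -1) {
		for (nid = 0; nid < NR_NODES; nid++) {
			if (nid == src_nid || nid == pref_nid)
				continue;
			/* only consider nodes that recorded more faults */
			if (node[nid].faults - src_faults < 0)
				continue;
			if (node[nid].candidate_cpu != -1)
				best_cpu = node[nid].candidate_cpu;
		}
	}
	return best_cpu;
}

int main(void)
{
	struct node_info node[NR_NODES] = {
		{ .has_capacity = 1, .faults = 10, .candidate_cpu = -1 }, /* source */
		{ .has_capacity = 0, .faults = 90, .candidate_cpu = -1 }, /* preferred, full */
		{ .has_capacity = 1, .faults = 40, .candidate_cpu =  8 }, /* fallback */
	};

	printf("chosen cpu: %d\n", pick_cpu(node, 0, 1)); /* falls back to cpu 8 */
	return 0;
}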

* [tip:sched/core] mm: numa: Change page last {nid,pid} into {cpu, pid}
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:30   ` tip-bot for Peter Zijlstra
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, srikar, aarcange,
	mgorman, tglx

Commit-ID:  90572890d202527c366aa9489b32404e88a7c020
Gitweb:     http://git.kernel.org/tip/90572890d202527c366aa9489b32404e88a7c020
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:29:20 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:47:45 +0200

mm: numa: Change page last {nid,pid} into {cpu,pid}

Change the per page last fault tracking to use cpu,pid instead of
nid,pid. This will allow us to look up the alternate task more
easily. Note that even though it is the cpu that is stored in the page
flags, the mpol_misplaced decision is still based on the node.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
[ Fixed build failure on 32-bit systems. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h                | 90 ++++++++++++++++++++++-----------------
 include/linux/mm_types.h          |  4 +-
 include/linux/page-flags-layout.h | 22 +++++-----
 kernel/bounds.c                   |  4 ++
 kernel/sched/fair.c               |  6 +--
 mm/huge_memory.c                  |  8 ++--
 mm/memory.c                       | 16 +++----
 mm/mempolicy.c                    | 16 ++++---
 mm/migrate.c                      |  4 +-
 mm/mm_init.c                      | 18 ++++----
 mm/mmzone.c                       | 14 +++---
 mm/mprotect.c                     | 28 ++++++------
 mm/page_alloc.c                   |  4 +-
 13 files changed, 125 insertions(+), 109 deletions(-)
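
Before reading the diff, it may help to see the cpupid packing in
isolation: the cpu now occupies the bits above the 8-bit pid slice,
exactly where the nid used to live. A standalone sketch using an assumed
10-bit cpu field; the real width is derived from NR_CPUS_BITS =
ilog2(CONFIG_NR_CPUS) in kernel/bounds.c below.

#include <stdio.h>

/* Assumed widths for illustration; the kernel fixes LAST__PID_SHIFT at 8
 * and derives LAST__CPU_SHIFT from NR_CPUS_BITS. */
#define LAST__PID_SHIFT 8
#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT) - 1)
#define LAST__CPU_SHIFT 10
#define LAST__CPU_MASK  ((1 << LAST__CPU_SHIFT) - 1)

static int cpu_pid_to_cpupid(int cpu, int pid)
{
	return ((cpu & LAST__CPU_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
}

static int cpupid_to_cpu(int cpupid)
{
	return (cpupid >> LAST__PID_SHIFT) & LAST__CPU_MASK;
}

static int cpupid_to_pid(int cpupid)
{
	return cpupid & LAST__PID_MASK;
}

int main(void)
{
	int cpupid = cpu_pid_to_cpupid(37, 4242);

	/* Only the low 8 bits of the pid survive, which is all the
	 * private/shared heuristic in task_numa_fault() needs. */
	printf("cpupid=0x%x cpu=%d pid8=%d (4242 & 0xff = %d)\n",
	       cpupid, cpupid_to_cpu(cpupid), cpupid_to_pid(cpupid), 4242 & 0xff);
	return 0;
}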

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bb412ce..ce464cd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -581,11 +581,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
  * sets it, so none of the operations on it need to be atomic.
  */
 
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | ... | FLAGS | */
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NIDPID_PGOFF	(ZONES_PGOFF - LAST_NIDPID_WIDTH)
+#define LAST_CPUPID_PGOFF	(ZONES_PGOFF - LAST_CPUPID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -595,7 +595,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NIDPID_PGSHIFT	(LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
+#define LAST_CPUPID_PGSHIFT	(LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -617,7 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NIDPID_MASK	((1UL << LAST_NIDPID_WIDTH) - 1)
+#define LAST_CPUPID_MASK	((1UL << LAST_CPUPID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -661,96 +661,106 @@ static inline int page_to_nid(const struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-static inline int nid_pid_to_nidpid(int nid, int pid)
+static inline int cpu_pid_to_cpupid(int cpu, int pid)
 {
-	return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
+	return ((cpu & LAST__CPU_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
 }
 
-static inline int nidpid_to_pid(int nidpid)
+static inline int cpupid_to_pid(int cpupid)
 {
-	return nidpid & LAST__PID_MASK;
+	return cpupid & LAST__PID_MASK;
 }
 
-static inline int nidpid_to_nid(int nidpid)
+static inline int cpupid_to_cpu(int cpupid)
 {
-	return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+	return (cpupid >> LAST__PID_SHIFT) & LAST__CPU_MASK;
 }
 
-static inline bool nidpid_pid_unset(int nidpid)
+static inline int cpupid_to_nid(int cpupid)
 {
-	return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+	return cpu_to_node(cpupid_to_cpu(cpupid));
 }
 
-static inline bool nidpid_nid_unset(int nidpid)
+static inline bool cpupid_pid_unset(int cpupid)
 {
-	return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+	return cpupid_to_pid(cpupid) == (-1 & LAST__PID_MASK);
 }
 
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-static inline int page_nidpid_xchg_last(struct page *page, int nid)
+static inline bool cpupid_cpu_unset(int cpupid)
 {
-	return xchg(&page->_last_nidpid, nid);
+	return cpupid_to_cpu(cpupid) == (-1 & LAST__CPU_MASK);
 }
 
-static inline int page_nidpid_last(struct page *page)
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
-	return page->_last_nidpid;
+	return xchg(&page->_last_cpupid, cpupid);
 }
-static inline void page_nidpid_reset_last(struct page *page)
+
+static inline int page_cpupid_last(struct page *page)
+{
+	return page->_last_cpupid;
+}
+static inline void page_cpupid_reset_last(struct page *page)
 {
-	page->_last_nidpid = -1;
+	page->_last_cpupid = -1;
 }
 #else
-static inline int page_nidpid_last(struct page *page)
+static inline int page_cpupid_last(struct page *page)
 {
-	return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
+	return (page->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
 }
 
-extern int page_nidpid_xchg_last(struct page *page, int nidpid);
+extern int page_cpupid_xchg_last(struct page *page, int cpupid);
 
-static inline void page_nidpid_reset_last(struct page *page)
+static inline void page_cpupid_reset_last(struct page *page)
 {
-	int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
+	int cpupid = (1 << LAST_CPUPID_SHIFT) - 1;
 
-	page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
-	page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
+	page->flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+	page->flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
 }
-#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
-#else
-static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
+#endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
+#else /* !CONFIG_NUMA_BALANCING */
+static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
-	return page_to_nid(page);
+	return page_to_nid(page); /* XXX */
 }
 
-static inline int page_nidpid_last(struct page *page)
+static inline int page_cpupid_last(struct page *page)
 {
-	return page_to_nid(page);
+	return page_to_nid(page); /* XXX */
 }
 
-static inline int nidpid_to_nid(int nidpid)
+static inline int cpupid_to_nid(int cpupid)
 {
 	return -1;
 }
 
-static inline int nidpid_to_pid(int nidpid)
+static inline int cpupid_to_pid(int cpupid)
 {
 	return -1;
 }
 
-static inline int nid_pid_to_nidpid(int nid, int pid)
+static inline int cpupid_to_cpu(int cpupid)
 {
 	return -1;
 }
 
-static inline bool nidpid_pid_unset(int nidpid)
+static inline int cpu_pid_to_cpupid(int nid, int pid)
+{
+	return -1;
+}
+
+static inline bool cpupid_pid_unset(int cpupid)
 {
 	return 1;
 }
 
-static inline void page_nidpid_reset_last(struct page *page)
+static inline void page_cpupid_reset_last(struct page *page)
 {
 }
-#endif
+#endif /* CONFIG_NUMA_BALANCING */
 
 static inline struct zone *page_zone(const struct page *page)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 38a902a..a30f9ca 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
 	void *shadow;
 #endif
 
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-	int _last_nidpid;
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+	int _last_cpupid;
 #endif
 }
 /*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 02bc918..da52366 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -39,9 +39,9 @@
  * lookup is necessary.
  *
  * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |             ... | FLAGS |
- *      " plus space for last_nidpid: |       NODE     | ZONE | LAST_NIDPID ... | FLAGS |
+ *      " plus space for last_cpupid: |       NODE     | ZONE | LAST_CPUPID ... | FLAGS |
  * classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
- *      " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
+ *      " plus space for last_cpupid: | SECTION | NODE | ZONE | LAST_CPUPID ... | FLAGS |
  * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
  */
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -65,18 +65,18 @@
 #define LAST__PID_SHIFT 8
 #define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)
 
-#define LAST__NID_SHIFT NODES_SHIFT
-#define LAST__NID_MASK  ((1 << LAST__NID_SHIFT)-1)
+#define LAST__CPU_SHIFT NR_CPUS_BITS
+#define LAST__CPU_MASK  ((1 << LAST__CPU_SHIFT)-1)
 
-#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
+#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT+LAST__CPU_SHIFT)
 #else
-#define LAST_NIDPID_SHIFT 0
+#define LAST_CPUPID_SHIFT 0
 #endif
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
 #else
-#define LAST_NIDPID_WIDTH 0
+#define LAST_CPUPID_WIDTH 0
 #endif
 
 /*
@@ -87,8 +87,8 @@
 #define NODE_NOT_IN_PAGE_FLAGS
 #endif
 
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
-#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
+#define LAST_CPUPID_NOT_IN_PAGE_FLAGS
 #endif
 
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 0c9b862..e8ca97b 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -10,6 +10,7 @@
 #include <linux/mmzone.h>
 #include <linux/kbuild.h>
 #include <linux/page_cgroup.h>
+#include <linux/log2.h>
 
 void foo(void)
 {
@@ -17,5 +18,8 @@ void foo(void)
 	DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
 	DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
 	DEFINE(NR_PCG_FLAGS, __NR_PCG_FLAGS);
+#ifdef CONFIG_SMP
+	DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
+#endif
 	/* End of constants */
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa561c8..dbe0f62 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1210,7 +1210,7 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
+void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
 	int priv;
@@ -1226,8 +1226,8 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 	 * First accesses are treated as private, otherwise consider accesses
 	 * to be private if the accessing pid has not changed
 	 */
-	if (!nidpid_pid_unset(last_nidpid))
-		priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+	if (!cpupid_pid_unset(last_cpupid))
+		priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
 	else
 		priv = 1;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0baf0e4..becf92c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1282,7 +1282,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
-	int target_nid, last_nidpid = -1;
+	int target_nid, last_cpupid = -1;
 	bool page_locked;
 	bool migrated = false;
 
@@ -1293,7 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
-	last_nidpid = page_nidpid_last(page);
+	last_cpupid = page_cpupid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1362,7 +1362,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, migrated);
 
 	return 0;
 }
@@ -1682,7 +1682,7 @@ static void __split_huge_page_refcount(struct page *page,
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
-		page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
+		page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index cc7f206..5162e6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@
 
 #include "internal.h"
 
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	int page_nid = -1;
-	int last_nidpid;
+	int last_cpupid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3567,7 +3567,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
-	last_nidpid = page_nidpid_last(page);
+	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3583,7 +3583,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(last_nidpid, page_nid, 1, migrated);
+		task_numa_fault(last_cpupid, page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3598,7 +3598,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
-	int last_nidpid;
+	int last_cpupid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3643,7 +3643,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(!page))
 			continue;
 
-		last_nidpid = page_nidpid_last(page);
+		last_cpupid = page_cpupid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
@@ -3656,7 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(last_nidpid, page_nid, 1, migrated);
+			task_numa_fault(last_cpupid, page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0e895a2..a5867ef 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2324,6 +2324,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	struct zone *zone;
 	int curnid = page_to_nid(page);
 	unsigned long pgoff;
+	int thiscpu = raw_smp_processor_id();
+	int thisnid = cpu_to_node(thiscpu);
 	int polnid = -1;
 	int ret = -1;
 
@@ -2372,11 +2374,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int last_nidpid;
-		int this_nidpid;
+		int last_cpupid;
+		int this_cpupid;
 
-		polnid = numa_node_id();
-		this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
+		polnid = thisnid;
+		this_cpupid = cpu_pid_to_cpupid(thiscpu, current->pid);
 
 		/*
 		 * Multi-stage node selection is used in conjunction
@@ -2399,8 +2401,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * it less likely we act on an unlikely task<->page
 		 * relation.
 		 */
-		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
-		if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
+		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid)
 			goto out;
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -2410,7 +2412,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * This way a short and temporary process migration will
 		 * not cause excessive memory migration.
 		 */
-		if (polnid != current->numa_preferred_nid &&
+		if (thisnid != current->numa_preferred_nid &&
 				!current->numa_migrate_seq)
 			goto out;
 #endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 025d1e3..ff53774 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1498,7 +1498,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
 	if (newpage)
-		page_nidpid_xchg_last(newpage, page_nidpid_last(page));
+		page_cpupid_xchg_last(newpage, page_cpupid_last(page));
 
 	return newpage;
 }
@@ -1675,7 +1675,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	if (!new_page)
 		goto out_fail;
 
-	page_nidpid_xchg_last(new_page, page_nidpid_last(page));
+	page_cpupid_xchg_last(new_page, page_cpupid_last(page));
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 467de57..68562e9 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,26 +71,26 @@ void __init mminit_verify_pageflags_layout(void)
 	unsigned long or_mask, add_mask;
 
 	shift = 8 * sizeof(unsigned long);
-	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
+	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_CPUPID_SHIFT;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastcpupid %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
-		LAST_NIDPID_WIDTH,
+		LAST_CPUPID_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
-		"Section %d Node %d Zone %d Lastnidpid %d\n",
+		"Section %d Node %d Zone %d Lastcpupid %d\n",
 		SECTIONS_SHIFT,
 		NODES_SHIFT,
 		ZONES_SHIFT,
-		LAST_NIDPID_SHIFT);
+		LAST_CPUPID_SHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
-		"Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
+		"Section %lu Node %lu Zone %lu Lastcpupid %lu\n",
 		(unsigned long)SECTIONS_PGSHIFT,
 		(unsigned long)NODES_PGSHIFT,
 		(unsigned long)ZONES_PGSHIFT,
-		(unsigned long)LAST_NIDPID_PGSHIFT);
+		(unsigned long)LAST_CPUPID_PGSHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
 		"Node/Zone ID: %lu -> %lu\n",
 		(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -102,9 +102,9 @@ void __init mminit_verify_pageflags_layout(void)
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
 		"Node not in page flags");
 #endif
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
-		"Last nidpid not in page flags");
+		"Last cpupid not in page flags");
 #endif
 
 	if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 25bb477..bf34fb8 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -97,20 +97,20 @@ void lruvec_init(struct lruvec *lruvec)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
 }
 
-#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NIDPID_NOT_IN_PAGE_FLAGS)
-int page_nidpid_xchg_last(struct page *page, int nidpid)
+#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
+int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
 	unsigned long old_flags, flags;
-	int last_nidpid;
+	int last_cpupid;
 
 	do {
 		old_flags = flags = page->flags;
-		last_nidpid = page_nidpid_last(page);
+		last_cpupid = page_cpupid_last(page);
 
-		flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
-		flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
+		flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+		flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
 	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
 
-	return last_nidpid;
+	return last_cpupid;
 }
 #endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 5aae390..9a74855 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,14 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
+		int dirty_accountable, int prot_numa, bool *ret_all_same_cpupid)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
-	bool all_same_nidpid = true;
-	int last_nid = -1;
+	bool all_same_cpupid = true;
+	int last_cpu = -1;
 	int last_pid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
@@ -64,17 +64,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
-					int nidpid = page_nidpid_last(page);
-					int this_nid = nidpid_to_nid(nidpid);
-					int this_pid = nidpid_to_pid(nidpid);
+					int cpupid = page_cpupid_last(page);
+					int this_cpu = cpupid_to_cpu(cpupid);
+					int this_pid = cpupid_to_pid(cpupid);
 
-					if (last_nid == -1)
-						last_nid = this_nid;
+					if (last_cpu == -1)
+						last_cpu = this_cpu;
 					if (last_pid == -1)
 						last_pid = this_pid;
-					if (last_nid != this_nid ||
+					if (last_cpu != this_cpu ||
 					    last_pid != this_pid) {
-						all_same_nidpid = false;
+						all_same_cpupid = false;
 					}
 
 					if (!pte_numa(oldpte)) {
@@ -115,7 +115,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-	*ret_all_same_nidpid = all_same_nidpid;
+	*ret_all_same_cpupid = all_same_cpupid;
 	return pages;
 }
 
@@ -142,7 +142,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
-	bool all_same_nidpid;
+	bool all_same_cpupid;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -168,7 +168,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa, &all_same_nidpid);
+				 dirty_accountable, prot_numa, &all_same_cpupid);
 		pages += this_pages;
 
 		/*
@@ -177,7 +177,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && this_pages && all_same_nidpid)
+		if (prot_numa && this_pages && all_same_cpupid)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89bedd0..73d812f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -626,7 +626,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	page_nidpid_reset_last(page);
+	page_cpupid_reset_last(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -4015,7 +4015,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		page_mapcount_reset(page);
-		page_nidpid_reset_last(page);
+		page_cpupid_reset_last(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Use {cpu, pid} to create task groups for shared faults
  2013-10-07 10:29   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:31   ` tip-bot for Peter Zijlstra
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, srikar, aarcange,
	mgorman, tglx

Commit-ID:  8c8a743c5087bac9caac8155b8f3b367e75cdd0b
Gitweb:     http://git.kernel.org/tip/8c8a743c5087bac9caac8155b8f3b367e75cdd0b
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:29:21 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:47:47 +0200

sched/numa: Use {cpu, pid} to create task groups for shared faults

While parallel applications tend to align their data on the cache
boundary, they tend not to align on the page or THP boundary.
Consequently, tasks that partition their data can still "false-share"
pages, presenting a problem for optimal NUMA placement.

This patch uses NUMA hinting faults to chain tasks together into
numa_groups. As well as storing the NID a task was running on when
accessing a page, a truncated representation of the faulting PID is
stored. If subsequent faults are from different PIDs, it is reasonable
to assume that those two tasks share a page and are candidates for
being grouped together. Note that this patch makes no scheduling
decisions based on the grouping information.
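
To illustrate the private/shared classification that drives the grouping,
here is a minimal userspace sketch (not kernel code). The helper names and
the idea of keeping only the low PID bits mirror the cpupid_match_pid()
and cpupid_to_pid() helpers added below, but the 8-bit PID width and the
standalone program itself are illustrative assumptions only:

/*
 * Sketch: classify a hinting fault as private or shared by comparing
 * the truncated PID stored in the page's cpupid with the faulting
 * task's PID. An 8-bit PID field is assumed for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

#define PID_BITS 8
#define PID_MASK ((1 << PID_BITS) - 1)

static int cpupid_to_pid(int cpupid)
{
	return cpupid & PID_MASK;	/* low bits hold the truncated PID */
}

static bool cpupid_match_pid(int task_pid, int cpupid)
{
	return (task_pid & PID_MASK) == cpupid_to_pid(cpupid);
}

int main(void)
{
	int last_cpupid = 4242 & PID_MASK;	/* previous fault by PID 4242 */
	int this_pid = 5000;			/* current faulting task */

	if (cpupid_match_pid(this_pid, last_cpupid))
		printf("private fault: same task touched the page last\n");
	else
		printf("shared fault: candidate for joining a numa_group\n");
	return 0;
}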

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-44-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h    |  11 ++++
 include/linux/sched.h |   3 +
 kernel/sched/core.c   |   3 +
 kernel/sched/fair.c   | 165 +++++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h  |   5 +-
 mm/memory.c           |   8 +++
 6 files changed, 182 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ce464cd..81443d5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -691,6 +691,12 @@ static inline bool cpupid_cpu_unset(int cpupid)
 	return cpupid_to_cpu(cpupid) == (-1 & LAST__CPU_MASK);
 }
 
+static inline bool __cpupid_match_pid(pid_t task_pid, int cpupid)
+{
+	return (task_pid & LAST__PID_MASK) == cpupid_to_pid(cpupid);
+}
+
+#define cpupid_match_pid(task, cpupid) __cpupid_match_pid(task->pid, cpupid)
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
@@ -760,6 +766,11 @@ static inline bool cpupid_pid_unset(int cpupid)
 static inline void page_cpupid_reset_last(struct page *page)
 {
 }
+
+static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
+{
+	return false;
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static inline struct zone *page_zone(const struct page *page)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b661979..f587ded 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1347,6 +1347,9 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	struct list_head numa_entry;
+	struct numa_group *numa_group;
+
 	/*
 	 * Exponential decaying average of faults on a per-node basis.
 	 * Scheduling placement decisions are made based on the these counts.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1fe59da..51092d5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1733,6 +1733,9 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 	p->numa_faults_buffer = NULL;
+
+	INIT_LIST_HEAD(&p->numa_entry);
+	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dbe0f62..8556505 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -888,6 +888,17 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
 
+struct numa_group {
+	atomic_t refcount;
+
+	spinlock_t lock; /* nr_tasks, tasks */
+	int nr_tasks;
+	struct list_head task_list;
+
+	struct rcu_head rcu;
+	atomic_long_t faults[0];
+};
+
 static inline int task_faults_idx(int nid, int priv)
 {
 	return 2 * nid + priv;
@@ -1182,7 +1193,10 @@ static void task_numa_placement(struct task_struct *p)
 		int priv, i;
 
 		for (priv = 0; priv < 2; priv++) {
+			long diff;
+
 			i = task_faults_idx(nid, priv);
+			diff = -p->numa_faults[i];
 
 			/* Decay existing window, copy faults since last scan */
 			p->numa_faults[i] >>= 1;
@@ -1190,6 +1204,11 @@ static void task_numa_placement(struct task_struct *p)
 			p->numa_faults_buffer[i] = 0;
 
 			faults += p->numa_faults[i];
+			diff += p->numa_faults[i];
+			if (p->numa_group) {
+				/* safe because we can only change our own group */
+				atomic_long_add(diff, &p->numa_group->faults[i]);
+			}
 		}
 
 		if (faults > max_faults) {
@@ -1207,6 +1226,131 @@ static void task_numa_placement(struct task_struct *p)
 	}
 }
 
+static inline int get_numa_group(struct numa_group *grp)
+{
+	return atomic_inc_not_zero(&grp->refcount);
+}
+
+static inline void put_numa_group(struct numa_group *grp)
+{
+	if (atomic_dec_and_test(&grp->refcount))
+		kfree_rcu(grp, rcu);
+}
+
+static void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	spin_lock(l1);
+	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
+static void task_numa_group(struct task_struct *p, int cpupid)
+{
+	struct numa_group *grp, *my_grp;
+	struct task_struct *tsk;
+	bool join = false;
+	int cpu = cpupid_to_cpu(cpupid);
+	int i;
+
+	if (unlikely(!p->numa_group)) {
+		unsigned int size = sizeof(struct numa_group) +
+				    2*nr_node_ids*sizeof(atomic_long_t);
+
+		grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+		if (!grp)
+			return;
+
+		atomic_set(&grp->refcount, 1);
+		spin_lock_init(&grp->lock);
+		INIT_LIST_HEAD(&grp->task_list);
+
+		for (i = 0; i < 2*nr_node_ids; i++)
+			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
+
+		list_add(&p->numa_entry, &grp->task_list);
+		grp->nr_tasks++;
+		rcu_assign_pointer(p->numa_group, grp);
+	}
+
+	rcu_read_lock();
+	tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
+
+	if (!cpupid_match_pid(tsk, cpupid))
+		goto unlock;
+
+	grp = rcu_dereference(tsk->numa_group);
+	if (!grp)
+		goto unlock;
+
+	my_grp = p->numa_group;
+	if (grp == my_grp)
+		goto unlock;
+
+	/*
+	 * Only join the other group if its bigger; if we're the bigger group,
+	 * the other task will join us.
+	 */
+	if (my_grp->nr_tasks > grp->nr_tasks)
+		goto unlock;
+
+	/*
+	 * Tie-break on the grp address.
+	 */
+	if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp)
+		goto unlock;
+
+	if (!get_numa_group(grp))
+		goto unlock;
+
+	join = true;
+
+unlock:
+	rcu_read_unlock();
+
+	if (!join)
+		return;
+
+	for (i = 0; i < 2*nr_node_ids; i++) {
+		atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
+		atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+	}
+
+	double_lock(&my_grp->lock, &grp->lock);
+
+	list_move(&p->numa_entry, &grp->task_list);
+	my_grp->nr_tasks--;
+	grp->nr_tasks++;
+
+	spin_unlock(&my_grp->lock);
+	spin_unlock(&grp->lock);
+
+	rcu_assign_pointer(p->numa_group, grp);
+
+	put_numa_group(my_grp);
+}
+
+void task_numa_free(struct task_struct *p)
+{
+	struct numa_group *grp = p->numa_group;
+	int i;
+
+	if (grp) {
+		for (i = 0; i < 2*nr_node_ids; i++)
+			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
+
+		spin_lock(&grp->lock);
+		list_del(&p->numa_entry);
+		grp->nr_tasks--;
+		spin_unlock(&grp->lock);
+		rcu_assign_pointer(p->numa_group, NULL);
+		put_numa_group(grp);
+	}
+
+	kfree(p->numa_faults);
+}
+
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
@@ -1222,15 +1366,6 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 	if (!p->mm)
 		return;
 
-	/*
-	 * First accesses are treated as private, otherwise consider accesses
-	 * to be private if the accessing pid has not changed
-	 */
-	if (!cpupid_pid_unset(last_cpupid))
-		priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
-	else
-		priv = 1;
-
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
@@ -1245,6 +1380,18 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 	}
 
 	/*
+	 * First accesses are treated as private, otherwise consider accesses
+	 * to be private if the accessing pid has not changed
+	 */
+	if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) {
+		priv = 1;
+	} else {
+		priv = cpupid_match_pid(p, last_cpupid);
+		if (!priv)
+			task_numa_group(p, last_cpupid);
+	}
+
+	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
 	 * This is reset periodically in case of phase changes
 	 */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 691e969..8037b10 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -559,10 +559,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 #ifdef CONFIG_NUMA_BALANCING
 extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *, struct task_struct *);
-static inline void task_numa_free(struct task_struct *p)
-{
-	kfree(p->numa_faults);
-}
+extern void task_numa_free(struct task_struct *p);
 #else /* CONFIG_NUMA_BALANCING */
 static inline void task_numa_free(struct task_struct *p)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 5162e6d..c57efa2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2719,6 +2719,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		get_page(dirty_page);
 
 reuse:
+		/*
+		 * Clear the pages cpupid information as the existing
+		 * information potentially belongs to a now completely
+		 * unrelated process.
+		 */
+		if (old_page)
+			page_cpupid_xchg_last(old_page, (1 << LAST_CPUPID_SHIFT) - 1);
+
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = pte_mkyoung(orig_pte);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Report a NUMA task group ID
  2013-10-07 10:29   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:31   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  e29cf08b05dc0b8151d65704d96d525a9e179a6b
Gitweb:     http://git.kernel.org/tip/e29cf08b05dc0b8151d65704d96d525a9e179a6b
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:22 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:47:49 +0200

sched/numa: Report a NUMA task group ID

It is desirable to model from userspace how the scheduler groups tasks
over time. This patch adds an ID to the numa_group and reports it via
/proc/PID/status.
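
As a usage sketch, the ID can be read from the new "Ngid:" line (a value
of 0 means the task is not in a numa_group). This small program is only
an example consumer, assuming a kernel with this patch applied:

/* Sketch: print the Ngid: field this patch adds to /proc/self/status. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "Ngid:", 5)) {
			fputs(line, stdout);	/* e.g. "Ngid:\t0" for an ungrouped task */
			break;
		}
	}
	fclose(f);
	return 0;
}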

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-45-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 fs/proc/array.c       | 2 ++
 include/linux/sched.h | 5 +++++
 kernel/sched/fair.c   | 7 +++++++
 3 files changed, 14 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index cbd0f1b..1bd2077 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -183,6 +183,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 	seq_printf(m,
 		"State:\t%s\n"
 		"Tgid:\t%d\n"
+		"Ngid:\t%d\n"
 		"Pid:\t%d\n"
 		"PPid:\t%d\n"
 		"TracerPid:\t%d\n"
@@ -190,6 +191,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
 		task_tgid_nr_ns(p, ns),
+		task_numa_group_id(p),
 		pid_nr_ns(pid, ns),
 		ppid, tpid,
 		from_kuid_munged(user_ns, cred->uid),
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f587ded..b0b343b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1452,12 +1452,17 @@ struct task_struct {
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
+extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   bool migrated)
 {
 }
+static inline pid_t task_numa_group_id(struct task_struct *p)
+{
+	return 0;
+}
 static inline void set_numabalancing_state(bool enabled)
 {
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8556505..5bd309c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -893,12 +893,18 @@ struct numa_group {
 
 	spinlock_t lock; /* nr_tasks, tasks */
 	int nr_tasks;
+	pid_t gid;
 	struct list_head task_list;
 
 	struct rcu_head rcu;
 	atomic_long_t faults[0];
 };
 
+pid_t task_numa_group_id(struct task_struct *p)
+{
+	return p->numa_group ? p->numa_group->gid : 0;
+}
+
 static inline int task_faults_idx(int nid, int priv)
 {
 	return 2 * nid + priv;
@@ -1265,6 +1271,7 @@ static void task_numa_group(struct task_struct *p, int cpupid)
 		atomic_set(&grp->refcount, 1);
 		spin_lock_init(&grp->lock);
 		INIT_LIST_HEAD(&grp->task_list);
+		grp->gid = p->pid;
 
 		for (i = 0; i < 2*nr_node_ids; i++)
 			atomic_long_set(&grp->faults[i], p->numa_faults[i]);

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: numa: Copy cpupid on page migration
  2013-10-07 10:29   ` Mel Gorman
  (?)
@ 2013-10-09 17:31   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  7851a45cd3f6198bf542c30e27b330e8eeb3736c
Gitweb:     http://git.kernel.org/tip/7851a45cd3f6198bf542c30e27b330e8eeb3736c
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:23 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:47:51 +0200

mm: numa: Copy cpupid on page migration

After page migration, the new page has the cpupid unset. This makes
every fault on a recently migrated page look like a first NUMA fault,
leading to another page migration.

Copying over the cpupid at page migration time should prevent erroneous
migrations of recently migrated pages.
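
A minimal standalone model of the copy (the xchg-style helper mirrors
page_cpupid_xchg_last() in the hunk below; the page structure here is
purely illustrative):

/* Sketch: carry the cpupid from the old page to the new one on migration. */
#include <stdio.h>

struct page_model { int cpupid; };

static int page_cpupid_xchg_last(struct page_model *page, int cpupid)
{
	int last = page->cpupid;

	page->cpupid = cpupid;
	return last;
}

int main(void)
{
	struct page_model oldpage = { .cpupid = 146 };
	struct page_model newpage = { .cpupid = -1 };

	/* without this copy, newpage looks like a first fault and migrates again */
	int cpupid = page_cpupid_xchg_last(&oldpage, -1);
	page_cpupid_xchg_last(&newpage, cpupid);

	printf("new page cpupid: %d\n", newpage.cpupid);
	return 0;
}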

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-46-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/migrate.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index ff53774..44c1fa9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -443,6 +443,8 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
  */
 void migrate_page_copy(struct page *newpage, struct page *page)
 {
+	int cpupid;
+
 	if (PageHuge(page) || PageTransHuge(page))
 		copy_huge_page(newpage, page);
 	else
@@ -479,6 +481,13 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 			__set_page_dirty_nobuffers(newpage);
  	}
 
+	/*
+	 * Copy NUMA information to the new page, to prevent over-eager
+	 * future migrations of this same page.
+	 */
+	cpupid = page_cpupid_xchg_last(page, -1);
+	page_cpupid_xchg_last(newpage, cpupid);
+
 	mlock_migrate_page(newpage, page);
 	ksm_migrate_page(newpage, page);
 	/*

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: numa: Do not group on RO pages
  2013-10-07 10:29   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:31   ` tip-bot for Peter Zijlstra
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, srikar, aarcange,
	mgorman, tglx

Commit-ID:  6688cc05473b36a0a3d3971e1adf1712919b32eb
Gitweb:     http://git.kernel.org/tip/6688cc05473b36a0a3d3971e1adf1712919b32eb
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:29:24 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:47:53 +0200

mm: numa: Do not group on RO pages

And here's a little something to make sure the whole world does not end
up in a single group.

While we don't migrate shared executable pages, we do scan/fault on
them. And since everybody links to libc, everybody ends up in the same
group.
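
A standalone sketch of the flag logic the hunks below add: the TNF_*
values are the ones introduced by the patch, while the writability test
is stubbed out here for illustration:

/* Sketch: RO mappings set TNF_NO_GROUP so they never pull tasks into a group. */
#include <stdbool.h>
#include <stdio.h>

#define TNF_MIGRATED	0x01
#define TNF_NO_GROUP	0x02

static int classify_fault(bool writable, bool migrated)
{
	int flags = 0;

	/* DSO/COW and other RO pages must not group the whole world together */
	if (!writable)
		flags |= TNF_NO_GROUP;
	if (migrated)
		flags |= TNF_MIGRATED;
	return flags;
}

int main(void)
{
	printf("libc text page:              flags=%#x\n", classify_fault(false, false));
	printf("private data page, migrated: flags=%#x\n", classify_fault(true, true));
	return 0;
}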

Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-47-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  7 +++++--
 kernel/sched/fair.c   |  5 +++--
 mm/huge_memory.c      | 15 +++++++++++++--
 mm/memory.c           | 30 ++++++++++++++++++++++++++----
 4 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b0b343b..ff54385 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1450,13 +1450,16 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+#define TNF_MIGRATED	0x01
+#define TNF_NO_GROUP	0x02
+
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
-				   bool migrated)
+				   int flags)
 {
 }
 static inline pid_t task_numa_group_id(struct task_struct *p)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5bd309c..35661b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1361,9 +1361,10 @@ void task_numa_free(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
+void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 {
 	struct task_struct *p = current;
+	bool migrated = flags & TNF_MIGRATED;
 	int priv;
 
 	if (!numabalancing_enabled)
@@ -1394,7 +1395,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 		priv = 1;
 	} else {
 		priv = cpupid_match_pid(p, last_cpupid);
-		if (!priv)
+		if (!priv && !(flags & TNF_NO_GROUP))
 			task_numa_group(p, last_cpupid);
 	}
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index becf92c..7ab4e32 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1285,6 +1285,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int target_nid, last_cpupid = -1;
 	bool page_locked;
 	bool migrated = false;
+	int flags = 0;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1299,6 +1300,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	/*
+	 * Avoid grouping on DSO/COW pages in specific and RO pages
+	 * in general, RO pages shouldn't hurt as much anyway since
+	 * they can be in shared cache state.
+	 */
+	if (!pmd_write(pmd))
+		flags |= TNF_NO_GROUP;
+
+	/*
 	 * Acquire the page lock to serialise THP migrations but avoid dropping
 	 * page_table_lock if at all possible
 	 */
@@ -1343,8 +1352,10 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
-	if (migrated)
+	if (migrated) {
+		flags |= TNF_MIGRATED;
 		page_nid = target_nid;
+	}
 
 	goto out;
 clear_pmdnuma:
@@ -1362,7 +1373,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index c57efa2..eba846b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3547,6 +3547,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int last_cpupid;
 	int target_nid;
 	bool migrated = false;
+	int flags = 0;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3575,6 +3576,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
+	/*
+	 * Avoid grouping on DSO/COW pages in specific and RO pages
+	 * in general, RO pages shouldn't hurt as much anyway since
+	 * they can be in shared cache state.
+	 */
+	if (!pte_write(pte))
+		flags |= TNF_NO_GROUP;
+
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
@@ -3586,12 +3595,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/* Migrate to the requested node */
 	migrated = migrate_misplaced_page(page, vma, target_nid);
-	if (migrated)
+	if (migrated) {
 		page_nid = target_nid;
+		flags |= TNF_MIGRATED;
+	}
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(last_cpupid, page_nid, 1, migrated);
+		task_numa_fault(last_cpupid, page_nid, 1, flags);
 	return 0;
 }
 
@@ -3632,6 +3643,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		int page_nid = -1;
 		int target_nid;
 		bool migrated = false;
+		int flags = 0;
 
 		if (!pte_present(pteval))
 			continue;
@@ -3651,20 +3663,30 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(!page))
 			continue;
 
+		/*
+		 * Avoid grouping on DSO/COW pages in specific and RO pages
+		 * in general, RO pages shouldn't hurt as much anyway since
+		 * they can be in shared cache state.
+		 */
+		if (!pte_write(pteval))
+			flags |= TNF_NO_GROUP;
+
 		last_cpupid = page_cpupid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
 		if (target_nid != -1) {
 			migrated = migrate_misplaced_page(page, vma, target_nid);
-			if (migrated)
+			if (migrated) {
 				page_nid = target_nid;
+				flags |= TNF_MIGRATED;
+			}
 		} else {
 			put_page(page);
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(last_cpupid, page_nid, 1, migrated);
+			task_numa_fault(last_cpupid, page_nid, 1, flags);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] mm: numa: Do not batch handle PMD pages
  2013-10-07 10:29   ` Mel Gorman
  (?)
  (?)
@ 2013-10-09 17:31   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  0f19c17929c952c6f0966d93ab05558e7bf814cc
Gitweb:     http://git.kernel.org/tip/0f19c17929c952c6f0966d93ab05558e7bf814cc
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:25 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:47:55 +0200

mm: numa: Do not batch handle PMD pages

With the THP migration races closed, it is still possible to
occasionally see corruption. The problem is related to handling PMD
pages in batch. When a page fault is handled, it can be assumed that the
page being faulted will also be flushed from the TLB. The same flushing
does not happen when handling PMD pages in batch. Fixing this is
straightforward, but there are a number of reasons not to:

1. Multiple TLB flushes may have to be sent depending on what pages get
   migrated
2. The handling of PMDs in batch means that faults get accounted to
   the task that is handling the fault. While care is taken to only
   mark PMDs where the last CPU and PID match, it can still have
   problems due to PID truncation when matching PIDs.
3. Batching on the PMD level may reduce faults, but setting pmd_numa
   requires taking a heavy lock that can contend with THP migration,
   and handling the fault requires the release/acquisition of the PTL
   for every page migrated. It's still pretty heavy.

PMD batch handling is not something that people have ever been happy
with. This patch removes it and later patches will deal with the
additional fault overhead using more intelligent migrate rate
adaptation.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-48-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/memory.c   | 101 ++--------------------------------------------------------
 mm/mprotect.c |  47 ++-------------------------
 2 files changed, 4 insertions(+), 144 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eba846b..9898eeb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3606,103 +3606,6 @@ out:
 	return 0;
 }
 
-/* NUMA hinting page fault entry point for regular pmds */
-#ifdef CONFIG_NUMA_BALANCING
-static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		     unsigned long addr, pmd_t *pmdp)
-{
-	pmd_t pmd;
-	pte_t *pte, *orig_pte;
-	unsigned long _addr = addr & PMD_MASK;
-	unsigned long offset;
-	spinlock_t *ptl;
-	bool numa = false;
-	int last_cpupid;
-
-	spin_lock(&mm->page_table_lock);
-	pmd = *pmdp;
-	if (pmd_numa(pmd)) {
-		set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
-		numa = true;
-	}
-	spin_unlock(&mm->page_table_lock);
-
-	if (!numa)
-		return 0;
-
-	/* we're in a page fault so some vma must be in the range */
-	BUG_ON(!vma);
-	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
-	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
-	VM_BUG_ON(offset >= PMD_SIZE);
-	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
-	pte += offset >> PAGE_SHIFT;
-	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
-		pte_t pteval = *pte;
-		struct page *page;
-		int page_nid = -1;
-		int target_nid;
-		bool migrated = false;
-		int flags = 0;
-
-		if (!pte_present(pteval))
-			continue;
-		if (!pte_numa(pteval))
-			continue;
-		if (addr >= vma->vm_end) {
-			vma = find_vma(mm, addr);
-			/* there's a pte present so there must be a vma */
-			BUG_ON(!vma);
-			BUG_ON(addr < vma->vm_start);
-		}
-		if (pte_numa(pteval)) {
-			pteval = pte_mknonnuma(pteval);
-			set_pte_at(mm, addr, pte, pteval);
-		}
-		page = vm_normal_page(vma, addr, pteval);
-		if (unlikely(!page))
-			continue;
-
-		/*
-		 * Avoid grouping on DSO/COW pages in specific and RO pages
-		 * in general, RO pages shouldn't hurt as much anyway since
-		 * they can be in shared cache state.
-		 */
-		if (!pte_write(pteval))
-			flags |= TNF_NO_GROUP;
-
-		last_cpupid = page_cpupid_last(page);
-		page_nid = page_to_nid(page);
-		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
-		pte_unmap_unlock(pte, ptl);
-		if (target_nid != -1) {
-			migrated = migrate_misplaced_page(page, vma, target_nid);
-			if (migrated) {
-				page_nid = target_nid;
-				flags |= TNF_MIGRATED;
-			}
-		} else {
-			put_page(page);
-		}
-
-		if (page_nid != -1)
-			task_numa_fault(last_cpupid, page_nid, 1, flags);
-
-		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
-	}
-	pte_unmap_unlock(orig_pte, ptl);
-
-	return 0;
-}
-#else
-static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		     unsigned long addr, pmd_t *pmdp)
-{
-	BUG();
-	return 0;
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3841,8 +3744,8 @@ retry:
 		}
 	}
 
-	if (pmd_numa(*pmd))
-		return do_pmd_numa_page(mm, vma, address, pmd);
+	/* THP should already have been handled */
+	BUG_ON(pmd_numa(*pmd));
 
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9a74855..a0302ac 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,15 +37,12 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa, bool *ret_all_same_cpupid)
+		int dirty_accountable, int prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
-	bool all_same_cpupid = true;
-	int last_cpu = -1;
-	int last_pid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -64,19 +61,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (page) {
-					int cpupid = page_cpupid_last(page);
-					int this_cpu = cpupid_to_cpu(cpupid);
-					int this_pid = cpupid_to_pid(cpupid);
-
-					if (last_cpu == -1)
-						last_cpu = this_cpu;
-					if (last_pid == -1)
-						last_pid = this_pid;
-					if (last_cpu != this_cpu ||
-					    last_pid != this_pid) {
-						all_same_cpupid = false;
-					}
-
 					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
@@ -115,26 +99,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-	*ret_all_same_cpupid = all_same_cpupid;
 	return pages;
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmd)
-{
-	spin_lock(&mm->page_table_lock);
-	set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
-	spin_unlock(&mm->page_table_lock);
-}
-#else
-static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmd)
-{
-	BUG();
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, int dirty_accountable, int prot_numa)
@@ -142,7 +109,6 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
-	bool all_same_cpupid;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -168,17 +134,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa, &all_same_cpupid);
+				 dirty_accountable, prot_numa);
 		pages += this_pages;
-
-		/*
-		 * If we are changing protections for NUMA hinting faults then
-		 * set pmd_numa if the examined pages were all on the same
-		 * node. This allows a regular PMD to be handled as one fault
-		 * and effectively batches the taking of the PTL
-		 */
-		if (prot_numa && this_pages && all_same_cpupid)
-			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
 	return pages;

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Stay on the same node if CLONE_VM
  2013-10-07 10:29   ` Mel Gorman
  (?)
@ 2013-10-09 17:31   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  5e1576ed0e54d419286a8096133029062b6ad456
Gitweb:     http://git.kernel.org/tip/5e1576ed0e54d419286a8096133029062b6ad456
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:26 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:47:57 +0200

sched/numa: Stay on the same node if CLONE_VM

A newly spawned thread inside a process should stay on the same
NUMA node as its parent. This prevents processes from being "torn"
across multiple NUMA nodes every time they spawn a new thread.
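
A hedged sketch of the fork-time decision the patch adds to
__sched_fork(); the CLONE_VM value matches the UAPI flag, the rest is an
illustrative model rather than kernel code:

/* Sketch: threads inherit the parent's preferred node, processes start fresh. */
#include <stdio.h>

#define CLONE_VM 0x00000100	/* same value as the clone(2) flag */

static int initial_preferred_nid(unsigned long clone_flags, int parent_nid)
{
	/* a thread shares the address space, so keep it near the parent's memory */
	if (clone_flags & CLONE_VM)
		return parent_nid;
	/* a new process has no placement history yet */
	return -1;
}

int main(void)
{
	printf("thread:  preferred_nid=%d\n", initial_preferred_nid(CLONE_VM, 2));
	printf("process: preferred_nid=%d\n", initial_preferred_nid(0, 2));
	return 0;
}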

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-49-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  2 +-
 kernel/fork.c         |  2 +-
 kernel/sched/core.c   | 14 +++++++++-----
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ff54385..8563e3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2021,7 +2021,7 @@ extern void wake_up_new_task(struct task_struct *tsk);
 #else
  static inline void kick_process(struct task_struct *tsk) { }
 #endif
-extern void sched_fork(struct task_struct *p);
+extern void sched_fork(unsigned long clone_flags, struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 extern void proc_caches_init(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index 7192d91..c93be06 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1310,7 +1310,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
-	sched_fork(p);
+	sched_fork(clone_flags, p);
 
 	retval = perf_event_init_task(p);
 	if (retval)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51092d5..3e2c893 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1696,7 +1696,7 @@ int wake_up_state(struct task_struct *p, unsigned int state)
  *
  * __sched_fork() is basic setup used by init_idle() too:
  */
-static void __sched_fork(struct task_struct *p)
+static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	p->on_rq			= 0;
 
@@ -1725,11 +1725,15 @@ static void __sched_fork(struct task_struct *p)
 		p->mm->numa_scan_seq = 0;
 	}
 
+	if (clone_flags & CLONE_VM)
+		p->numa_preferred_nid = current->numa_preferred_nid;
+	else
+		p->numa_preferred_nid = -1;
+
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = 1;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
-	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 	p->numa_faults_buffer = NULL;
@@ -1761,12 +1765,12 @@ void set_numabalancing_state(bool enabled)
 /*
  * fork()/clone()-time setup:
  */
-void sched_fork(struct task_struct *p)
+void sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	unsigned long flags;
 	int cpu = get_cpu();
 
-	__sched_fork(p);
+	__sched_fork(clone_flags, p);
 	/*
 	 * We mark the process as running here. This guarantees that
 	 * nobody will actually run it, and a signal or other external
@@ -4287,7 +4291,7 @@ void init_idle(struct task_struct *idle, int cpu)
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
-	__sched_fork(idle);
+	__sched_fork(0, idle);
 	idle->state = TASK_RUNNING;
 	idle->se.exec_start = sched_clock();
 

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Use group fault statistics in numa placement
  2013-10-07 10:29   ` Mel Gorman
  (?)
@ 2013-10-09 17:32   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  83e1d2cd9eabec5164afea295ff06b941ae8e4a9
Gitweb:     http://git.kernel.org/tip/83e1d2cd9eabec5164afea295ff06b941ae8e4a9
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:27 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:47:58 +0200

sched/numa: Use group fault statistics in numa placement

This patch uses the fraction of faults on a particular node, for both
the task and its group, to figure out the best node on which to place a
task. If the task and group statistics disagree on what the preferred
node should be, then a full rescan will select the node with the best
combined weight.
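
As a standalone sketch of the weighting below: the 1000/1200 multipliers
are the ones used by task_weight()/group_weight() in the patch, and the
fault counts in main() are made-up numbers that show how the group share
can override the task's own preference:

/* Sketch: combine per-node task and group fault fractions into one weight. */
#include <stdio.h>

static unsigned long task_weight(unsigned long node_faults, unsigned long total)
{
	return total ? 1000 * node_faults / total : 0;
}

static unsigned long group_weight(unsigned long node_faults, unsigned long total)
{
	/* the group fraction gets a larger multiplier to keep groups together */
	return total ? 1200 * node_faults / total : 0;
}

int main(void)
{
	/* task: 40/60 split across nodes 0/1; group: 70/30 split */
	unsigned long node0 = task_weight(40, 100) + group_weight(70, 100);
	unsigned long node1 = task_weight(60, 100) + group_weight(30, 100);

	printf("node0 combined weight: %lu\n", node0);	/* 400 + 840 = 1240 */
	printf("node1 combined weight: %lu\n", node1);	/* 600 + 360 = 960 */
	return 0;
}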

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-50-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |   1 +
 kernel/sched/fair.c   | 124 +++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 108 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8563e3d..7244822 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1356,6 +1356,7 @@ struct task_struct {
 	 * The values remain static for the duration of a PTE scan
 	 */
 	unsigned long *numa_faults;
+	unsigned long total_numa_faults;
 
 	/*
 	 * numa_faults_buffer records faults per node during the current
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 35661b8..4c40e13 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -897,6 +897,7 @@ struct numa_group {
 	struct list_head task_list;
 
 	struct rcu_head rcu;
+	atomic_long_t total_faults;
 	atomic_long_t faults[0];
 };
 
@@ -919,6 +920,51 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
 		p->numa_faults[task_faults_idx(nid, 1)];
 }
 
+static inline unsigned long group_faults(struct task_struct *p, int nid)
+{
+	if (!p->numa_group)
+		return 0;
+
+	return atomic_long_read(&p->numa_group->faults[2*nid]) +
+	       atomic_long_read(&p->numa_group->faults[2*nid+1]);
+}
+
+/*
+ * These return the fraction of accesses done by a particular task, or
+ * task group, on a particular numa node.  The group weight is given a
+ * larger multiplier, in order to group tasks together that are almost
+ * evenly spread out between numa nodes.
+ */
+static inline unsigned long task_weight(struct task_struct *p, int nid)
+{
+	unsigned long total_faults;
+
+	if (!p->numa_faults)
+		return 0;
+
+	total_faults = p->total_numa_faults;
+
+	if (!total_faults)
+		return 0;
+
+	return 1000 * task_faults(p, nid) / total_faults;
+}
+
+static inline unsigned long group_weight(struct task_struct *p, int nid)
+{
+	unsigned long total_faults;
+
+	if (!p->numa_group)
+		return 0;
+
+	total_faults = atomic_long_read(&p->numa_group->total_faults);
+
+	if (!total_faults)
+		return 0;
+
+	return 1200 * group_faults(p, nid) / total_faults;
+}
+
 static unsigned long weighted_cpuload(const int cpu);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
@@ -1018,8 +1064,10 @@ static void task_numa_compare(struct task_numa_env *env, long imp)
 		if (!cpumask_test_cpu(env->src_cpu, tsk_cpus_allowed(cur)))
 			goto unlock;
 
-		imp += task_faults(cur, env->src_nid) -
-		       task_faults(cur, env->dst_nid);
+		imp += task_weight(cur, env->src_nid) +
+		       group_weight(cur, env->src_nid) -
+		       task_weight(cur, env->dst_nid) -
+		       group_weight(cur, env->dst_nid);
 	}
 
 	if (imp < env->best_imp)
@@ -1098,7 +1146,7 @@ static int task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1
 	};
 	struct sched_domain *sd;
-	unsigned long faults;
+	unsigned long weight;
 	int nid, ret;
 	long imp;
 
@@ -1115,10 +1163,10 @@ static int task_numa_migrate(struct task_struct *p)
 	env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
 	rcu_read_unlock();
 
-	faults = task_faults(p, env.src_nid);
+	weight = task_weight(p, env.src_nid) + group_weight(p, env.src_nid);
 	update_numa_stats(&env.src_stats, env.src_nid);
 	env.dst_nid = p->numa_preferred_nid;
-	imp = task_faults(env.p, env.dst_nid) - faults;
+	imp = task_weight(p, env.dst_nid) + group_weight(p, env.dst_nid) - weight;
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
 	/* If the preferred nid has capacity, try to use it. */
@@ -1131,8 +1179,8 @@ static int task_numa_migrate(struct task_struct *p)
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
-			/* Only consider nodes that recorded more faults */
-			imp = task_faults(env.p, nid) - faults;
+			/* Only consider nodes where both task and groups benefit */
+			imp = task_weight(p, nid) + group_weight(p, nid) - weight;
 			if (imp < 0)
 				continue;
 
@@ -1183,8 +1231,8 @@ static void numa_migrate_preferred(struct task_struct *p)
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq, nid, max_nid = -1;
-	unsigned long max_faults = 0;
+	int seq, nid, max_nid = -1, max_group_nid = -1;
+	unsigned long max_faults = 0, max_group_faults = 0;
 
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
@@ -1195,7 +1243,7 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
-		unsigned long faults = 0;
+		unsigned long faults = 0, group_faults = 0;
 		int priv, i;
 
 		for (priv = 0; priv < 2; priv++) {
@@ -1211,9 +1259,12 @@ static void task_numa_placement(struct task_struct *p)
 
 			faults += p->numa_faults[i];
 			diff += p->numa_faults[i];
+			p->total_numa_faults += diff;
 			if (p->numa_group) {
 				/* safe because we can only change our own group */
 				atomic_long_add(diff, &p->numa_group->faults[i]);
+				atomic_long_add(diff, &p->numa_group->total_faults);
+				group_faults += atomic_long_read(&p->numa_group->faults[i]);
 			}
 		}
 
@@ -1221,6 +1272,27 @@ static void task_numa_placement(struct task_struct *p)
 			max_faults = faults;
 			max_nid = nid;
 		}
+
+		if (group_faults > max_group_faults) {
+			max_group_faults = group_faults;
+			max_group_nid = nid;
+		}
+	}
+
+	/*
+	 * If the preferred task and group nids are different,
+	 * iterate over the nodes again to find the best place.
+	 */
+	if (p->numa_group && max_nid != max_group_nid) {
+		unsigned long weight, max_weight = 0;
+
+		for_each_online_node(nid) {
+			weight = task_weight(p, nid) + group_weight(p, nid);
+			if (weight > max_weight) {
+				max_weight = weight;
+				max_nid = nid;
+			}
+		}
 	}
 
 	/* Preferred node as the node with the most faults */
@@ -1276,6 +1348,8 @@ static void task_numa_group(struct task_struct *p, int cpupid)
 		for (i = 0; i < 2*nr_node_ids; i++)
 			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
 
+		atomic_long_set(&grp->total_faults, p->total_numa_faults);
+
 		list_add(&p->numa_entry, &grp->task_list);
 		grp->nr_tasks++;
 		rcu_assign_pointer(p->numa_group, grp);
@@ -1323,6 +1397,8 @@ unlock:
 		atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
 		atomic_long_add(p->numa_faults[i], &grp->faults[i]);
 	}
+	atomic_long_sub(p->total_numa_faults, &my_grp->total_faults);
+	atomic_long_add(p->total_numa_faults, &grp->total_faults);
 
 	double_lock(&my_grp->lock, &grp->lock);
 
@@ -1347,6 +1423,8 @@ void task_numa_free(struct task_struct *p)
 		for (i = 0; i < 2*nr_node_ids; i++)
 			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
 
+		atomic_long_sub(p->total_numa_faults, &grp->total_faults);
+
 		spin_lock(&grp->lock);
 		list_del(&p->numa_entry);
 		grp->nr_tasks--;
@@ -1385,6 +1463,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 
 		BUG_ON(p->numa_faults_buffer);
 		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+		p->total_numa_faults = 0;
 	}
 
 	/*
@@ -4572,12 +4651,17 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 	src_nid = cpu_to_node(env->src_cpu);
 	dst_nid = cpu_to_node(env->dst_cpu);
 
-	if (src_nid == dst_nid ||
-	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+	if (src_nid == dst_nid)
 		return false;
 
-	if (dst_nid == p->numa_preferred_nid ||
-	    task_faults(p, dst_nid) > task_faults(p, src_nid))
+	/* Always encourage migration to the preferred node. */
+	if (dst_nid == p->numa_preferred_nid)
+		return true;
+
+	/* After the task has settled, check if the new node is better. */
+	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
+			task_weight(p, dst_nid) + group_weight(p, dst_nid) >
+			task_weight(p, src_nid) + group_weight(p, src_nid))
 		return true;
 
 	return false;
@@ -4597,11 +4681,17 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	src_nid = cpu_to_node(env->src_cpu);
 	dst_nid = cpu_to_node(env->dst_cpu);
 
-	if (src_nid == dst_nid ||
-	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+	if (src_nid == dst_nid)
 		return false;
 
-	if (task_faults(p, dst_nid) < task_faults(p, src_nid))
+	/* Migrating away from the preferred node is always bad. */
+	if (src_nid == p->numa_preferred_nid)
+		return true;
+
+	/* After the task has settled, check if the new node is worse. */
+	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
+			task_weight(p, dst_nid) + group_weight(p, dst_nid) <
+			task_weight(p, src_nid) + group_weight(p, src_nid))
 		return true;
 
 	return false;

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Call task_numa_free() from do_execve ()
  2013-10-07 10:29   ` Mel Gorman
  (?)
@ 2013-10-09 17:32   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  82727018b0d33d188e9916bcf76f18387484cb04
Gitweb:     http://git.kernel.org/tip/82727018b0d33d188e9916bcf76f18387484cb04
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:28 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:00 +0200

sched/numa: Call task_numa_free() from do_execve()

It is possible for a task in a numa group to call exec, and
have the new (unrelated) executable inherit the numa group
association from its former self.

This has the potential to break numa grouping, and is trivial
to fix.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-51-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 fs/exec.c             | 1 +
 include/linux/sched.h | 4 ++++
 kernel/sched/fair.c   | 9 ++++++++-
 kernel/sched/sched.h  | 5 -----
 4 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 8875dd1..2ea437e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1547,6 +1547,7 @@ static int do_execve_common(const char *filename,
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	acct_update_integrals(current);
+	task_numa_free(current);
 	free_bprm(bprm);
 	if (displaced)
 		put_files_struct(displaced);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7244822..f638510 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1458,6 +1458,7 @@ struct task_struct {
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
+extern void task_numa_free(struct task_struct *p);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
@@ -1470,6 +1471,9 @@ static inline pid_t task_numa_group_id(struct task_struct *p)
 static inline void set_numabalancing_state(bool enabled)
 {
 }
+static inline void task_numa_free(struct task_struct *p)
+{
+}
 #endif
 
 static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4c40e13..c4df2de 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1418,6 +1418,7 @@ void task_numa_free(struct task_struct *p)
 {
 	struct numa_group *grp = p->numa_group;
 	int i;
+	void *numa_faults = p->numa_faults;
 
 	if (grp) {
 		for (i = 0; i < 2*nr_node_ids; i++)
@@ -1433,7 +1434,9 @@ void task_numa_free(struct task_struct *p)
 		put_numa_group(grp);
 	}
 
-	kfree(p->numa_faults);
+	p->numa_faults = NULL;
+	p->numa_faults_buffer = NULL;
+	kfree(numa_faults);
 }
 
 /*
@@ -1452,6 +1455,10 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 	if (!p->mm)
 		return;
 
+	/* Do not worry about placement if exiting */
+	if (p->state == TASK_DEAD)
+		return;
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8037b10..eeb1923 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -559,11 +559,6 @@ static inline u64 rq_clock_task(struct rq *rq)
 #ifdef CONFIG_NUMA_BALANCING
 extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *, struct task_struct *);
-extern void task_numa_free(struct task_struct *p);
-#else /* CONFIG_NUMA_BALANCING */
-static inline void task_numa_free(struct task_struct *p)
-{
-}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_SMP

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Prevent parallel updates to group stats during placement
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:32   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  7dbd13ed06513b047216a7ffc718bad9df0660f1
Gitweb:     http://git.kernel.org/tip/7dbd13ed06513b047216a7ffc718bad9df0660f1
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:29 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:02 +0200

sched/numa: Prevent parallel updates to group stats during placement

Having multiple tasks in a group go through task_numa_placement
simultaneously can lead to a task picking a wrong node to run on, because
the group stats may be in the middle of an update. This patch avoids
parallel updates by holding the numa_group lock during placement
decisions.
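
As a rough standalone illustration (not part of the patch; the struct and
function names below are made up), the change boils down to serialising the
read-modify-write of the shared per-group counters:

#include <pthread.h>

/* Userspace stand-in for numa_group: a lock plus per-node fault counts.
 * The lock is assumed to be initialised with PTHREAD_MUTEX_INITIALIZER. */
struct group_stats {
	pthread_mutex_t lock;
	unsigned long faults[2];
};

/* Fold one task's new faults into the group and pick a node, all under
 * the group lock so two tasks cannot interleave their updates. */
int update_and_place(struct group_stats *g, const unsigned long delta[2])
{
	int nid, best = 0;

	pthread_mutex_lock(&g->lock);
	for (nid = 0; nid < 2; nid++) {
		g->faults[nid] += delta[nid];
		if (g->faults[nid] > g->faults[best])
			best = nid;
	}
	pthread_mutex_unlock(&g->lock);
	return best;
}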

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-52-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c4df2de..1473499 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1233,6 +1233,7 @@ static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1, max_group_nid = -1;
 	unsigned long max_faults = 0, max_group_faults = 0;
+	spinlock_t *group_lock = NULL;
 
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
@@ -1241,6 +1242,12 @@ static void task_numa_placement(struct task_struct *p)
 	p->numa_migrate_seq++;
 	p->numa_scan_period_max = task_scan_max(p);
 
+	/* If the task is part of a group prevent parallel updates to group stats */
+	if (p->numa_group) {
+		group_lock = &p->numa_group->lock;
+		spin_lock(group_lock);
+	}
+
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
 		unsigned long faults = 0, group_faults = 0;
@@ -1279,20 +1286,24 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/*
-	 * If the preferred task and group nids are different,
-	 * iterate over the nodes again to find the best place.
-	 */
-	if (p->numa_group && max_nid != max_group_nid) {
-		unsigned long weight, max_weight = 0;
-
-		for_each_online_node(nid) {
-			weight = task_weight(p, nid) + group_weight(p, nid);
-			if (weight > max_weight) {
-				max_weight = weight;
-				max_nid = nid;
+	if (p->numa_group) {
+		/*
+		 * If the preferred task and group nids are different,
+		 * iterate over the nodes again to find the best place.
+		 */
+		if (max_nid != max_group_nid) {
+			unsigned long weight, max_weight = 0;
+
+			for_each_online_node(nid) {
+				weight = task_weight(p, nid) + group_weight(p, nid);
+				if (weight > max_weight) {
+					max_weight = weight;
+					max_nid = nid;
+				}
 			}
 		}
+
+		spin_unlock(group_lock);
 	}
 
 	/* Preferred node as the node with the most faults */

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Add debugging
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:32   ` tip-bot for Ingo Molnar
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Ingo Molnar @ 2013-10-09 17:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, srikar, aarcange,
	mgorman, tglx

Commit-ID:  b32e86b4301e345611f0446265f782a229faadf6
Gitweb:     http://git.kernel.org/tip/b32e86b4301e345611f0446265f782a229faadf6
Author:     Ingo Molnar <mingo@kernel.org>
AuthorDate: Mon, 7 Oct 2013 11:29:30 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:04 +0200

sched/numa: Add debugging

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1381141781-10992-53-git-send-email-mgorman@suse.de
---
 include/linux/sched.h |  6 ++++++
 kernel/sched/debug.c  | 60 +++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c   |  5 ++++-
 3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f638510..1127a46 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1366,6 +1366,7 @@ struct task_struct {
 	unsigned long *numa_faults_buffer;
 
 	int numa_preferred_nid;
+	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
@@ -2661,6 +2662,11 @@ static inline unsigned int task_cpu(const struct task_struct *p)
 	return task_thread_info(p)->cpu;
 }
 
+static inline int task_node(const struct task_struct *p)
+{
+	return cpu_to_node(task_cpu(p));
+}
+
 extern void set_task_cpu(struct task_struct *p, unsigned int cpu);
 
 #else
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1965599..e6ba5e3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -15,6 +15,7 @@
 #include <linux/seq_file.h>
 #include <linux/kallsyms.h>
 #include <linux/utsname.h>
+#include <linux/mempolicy.h>
 
 #include "sched.h"
 
@@ -137,6 +138,9 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 	SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld",
 		0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
 #endif
+#ifdef CONFIG_NUMA_BALANCING
+	SEQ_printf(m, " %d", cpu_to_node(task_cpu(p)));
+#endif
 #ifdef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, " %s", task_group_path(task_group(p)));
 #endif
@@ -159,7 +163,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 	read_lock_irqsave(&tasklist_lock, flags);
 
 	do_each_thread(g, p) {
-		if (!p->on_rq || task_cpu(p) != rq_cpu)
+		if (task_cpu(p) != rq_cpu)
 			continue;
 
 		print_task(m, rq, p);
@@ -345,7 +349,7 @@ static void sched_debug_header(struct seq_file *m)
 	cpu_clk = local_clock();
 	local_irq_restore(flags);
 
-	SEQ_printf(m, "Sched Debug Version: v0.10, %s %.*s\n",
+	SEQ_printf(m, "Sched Debug Version: v0.11, %s %.*s\n",
 		init_utsname()->release,
 		(int)strcspn(init_utsname()->version, " "),
 		init_utsname()->version);
@@ -488,6 +492,56 @@ static int __init init_sched_debug_procfs(void)
 
 __initcall(init_sched_debug_procfs);
 
+#define __P(F) \
+	SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)F)
+#define P(F) \
+	SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)p->F)
+#define __PN(F) \
+	SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)F))
+#define PN(F) \
+	SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)p->F))
+
+
+static void sched_show_numa(struct task_struct *p, struct seq_file *m)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	struct mempolicy *pol;
+	int node, i;
+
+	if (p->mm)
+		P(mm->numa_scan_seq);
+
+	task_lock(p);
+	pol = p->mempolicy;
+	if (pol && !(pol->flags & MPOL_F_MORON))
+		pol = NULL;
+	mpol_get(pol);
+	task_unlock(p);
+
+	SEQ_printf(m, "numa_migrations, %ld\n", xchg(&p->numa_pages_migrated, 0));
+
+	for_each_online_node(node) {
+		for (i = 0; i < 2; i++) {
+			unsigned long nr_faults = -1;
+			int cpu_current, home_node;
+
+			if (p->numa_faults)
+				nr_faults = p->numa_faults[2*node + i];
+
+			cpu_current = !i ? (task_node(p) == node) :
+				(pol && node_isset(node, pol->v.nodes));
+
+			home_node = (p->numa_preferred_nid == node);
+
+			SEQ_printf(m, "numa_faults, %d, %d, %d, %d, %ld\n",
+				i, node, cpu_current, home_node, nr_faults);
+		}
+	}
+
+	mpol_put(pol);
+#endif
+}
+
 void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 {
 	unsigned long nr_switches;
@@ -591,6 +645,8 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 		SEQ_printf(m, "%-45s:%21Ld\n",
 			   "clock-delta", (long long)(t1-t0));
 	}
+
+	sched_show_numa(p, m);
 }
 
 void proc_sched_set_task(struct task_struct *p)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1473499..2876a37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1137,7 +1137,7 @@ static int task_numa_migrate(struct task_struct *p)
 		.p = p,
 
 		.src_cpu = task_cpu(p),
-		.src_nid = cpu_to_node(task_cpu(p)),
+		.src_nid = task_node(p),
 
 		.imbalance_pct = 112,
 
@@ -1515,6 +1515,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 	if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
 		numa_migrate_preferred(p);
 
+	if (migrated)
+		p->numa_pages_migrated += pages;
+
 	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
 }
 

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Decide whether to favour task or group weights based on swap candidate relationships
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:32   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  887c290e82e8950d854730c084904c115fc367ac
Gitweb:     http://git.kernel.org/tip/887c290e82e8950d854730c084904c115fc367ac
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:31 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:06 +0200

sched/numa: Decide whether to favour task or group weights based on swap candidate relationships

This patch separately considers task and group affinities when searching
for swap candidates during task NUMA placement. If the tasks are not part
of any group, or are in the same group, then the task weights are
compared. Otherwise the group weights are compared.
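
Reduced to the selection rule alone (a sketch with made-up names, ignoring
the candidate task's own weight delta), the choice looks like this, where a
group id of 0 means "no group":

/* Pick which improvement metric to use for a prospective swap. */
long pick_improvement(long p_group, long cur_group,
		      long taskimp, long groupimp)
{
	/* Ungrouped on either side, or the same group: task weights decide. */
	if (!p_group || !cur_group || p_group == cur_group)
		return taskimp;
	/* Two different groups: the group weights take priority. */
	return groupimp;
}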

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-54-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 59 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 36 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2876a37..6f45461 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1039,13 +1039,15 @@ static void task_numa_assign(struct task_numa_env *env,
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
-static void task_numa_compare(struct task_numa_env *env, long imp)
+static void task_numa_compare(struct task_numa_env *env,
+			      long taskimp, long groupimp)
 {
 	struct rq *src_rq = cpu_rq(env->src_cpu);
 	struct rq *dst_rq = cpu_rq(env->dst_cpu);
 	struct task_struct *cur;
 	long dst_load, src_load;
 	long load;
+	long imp = (groupimp > 0) ? groupimp : taskimp;
 
 	rcu_read_lock();
 	cur = ACCESS_ONCE(dst_rq->curr);
@@ -1064,10 +1066,19 @@ static void task_numa_compare(struct task_numa_env *env, long imp)
 		if (!cpumask_test_cpu(env->src_cpu, tsk_cpus_allowed(cur)))
 			goto unlock;
 
-		imp += task_weight(cur, env->src_nid) +
-		       group_weight(cur, env->src_nid) -
-		       task_weight(cur, env->dst_nid) -
-		       group_weight(cur, env->dst_nid);
+		/*
+		 * If dst and source tasks are in the same NUMA group, or not
+		 * in any group then look only at task weights otherwise give
+		 * priority to the group weights.
+		 */
+		if (!cur->numa_group || !env->p->numa_group ||
+		    cur->numa_group == env->p->numa_group) {
+			imp = taskimp + task_weight(cur, env->src_nid) -
+			      task_weight(cur, env->dst_nid);
+		} else {
+			imp = groupimp + group_weight(cur, env->src_nid) -
+			       group_weight(cur, env->dst_nid);
+		}
 	}
 
 	if (imp < env->best_imp)
@@ -1117,7 +1128,8 @@ unlock:
 	rcu_read_unlock();
 }
 
-static void task_numa_find_cpu(struct task_numa_env *env, long imp)
+static void task_numa_find_cpu(struct task_numa_env *env,
+				long taskimp, long groupimp)
 {
 	int cpu;
 
@@ -1127,7 +1139,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, long imp)
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, imp);
+		task_numa_compare(env, taskimp, groupimp);
 	}
 }
 
@@ -1146,9 +1158,9 @@ static int task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1
 	};
 	struct sched_domain *sd;
-	unsigned long weight;
+	unsigned long taskweight, groupweight;
 	int nid, ret;
-	long imp;
+	long taskimp, groupimp;
 
 	/*
 	 * Pick the lowest SD_NUMA domain, as that would have the smallest
@@ -1163,15 +1175,17 @@ static int task_numa_migrate(struct task_struct *p)
 	env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
 	rcu_read_unlock();
 
-	weight = task_weight(p, env.src_nid) + group_weight(p, env.src_nid);
+	taskweight = task_weight(p, env.src_nid);
+	groupweight = group_weight(p, env.src_nid);
 	update_numa_stats(&env.src_stats, env.src_nid);
 	env.dst_nid = p->numa_preferred_nid;
-	imp = task_weight(p, env.dst_nid) + group_weight(p, env.dst_nid) - weight;
+	taskimp = task_weight(p, env.dst_nid) - taskweight;
+	groupimp = group_weight(p, env.dst_nid) - groupweight;
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
 	/* If the preferred nid has capacity, try to use it. */
 	if (env.dst_stats.has_capacity)
-		task_numa_find_cpu(&env, imp);
+		task_numa_find_cpu(&env, taskimp, groupimp);
 
 	/* No space available on the preferred nid. Look elsewhere. */
 	if (env.best_cpu == -1) {
@@ -1180,13 +1194,14 @@ static int task_numa_migrate(struct task_struct *p)
 				continue;
 
 			/* Only consider nodes where both task and groups benefit */
-			imp = task_weight(p, nid) + group_weight(p, nid) - weight;
-			if (imp < 0)
+			taskimp = task_weight(p, nid) - taskweight;
+			groupimp = group_weight(p, nid) - groupweight;
+			if (taskimp < 0 && groupimp < 0)
 				continue;
 
 			env.dst_nid = nid;
 			update_numa_stats(&env.dst_stats, env.dst_nid);
-			task_numa_find_cpu(&env, imp);
+			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
 
@@ -4679,10 +4694,9 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 	if (dst_nid == p->numa_preferred_nid)
 		return true;
 
-	/* After the task has settled, check if the new node is better. */
-	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
-			task_weight(p, dst_nid) + group_weight(p, dst_nid) >
-			task_weight(p, src_nid) + group_weight(p, src_nid))
+	/* If both task and group weight improve, this move is a winner. */
+	if (task_weight(p, dst_nid) > task_weight(p, src_nid) &&
+	    group_weight(p, dst_nid) > group_weight(p, src_nid))
 		return true;
 
 	return false;
@@ -4709,10 +4723,9 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	if (src_nid == p->numa_preferred_nid)
 		return true;
 
-	/* After the task has settled, check if the new node is worse. */
-	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
-			task_weight(p, dst_nid) + group_weight(p, dst_nid) <
-			task_weight(p, src_nid) + group_weight(p, src_nid))
+	/* If either task or group weight get worse, don't do it. */
+	if (task_weight(p, dst_nid) < task_weight(p, src_nid) ||
+	    group_weight(p, dst_nid) < group_weight(p, src_nid))
 		return true;
 
 	return false;

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Fix task or group comparison
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:32   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  ca28aa53dd95868c9e38917b9881c09dacfacf1a
Gitweb:     http://git.kernel.org/tip/ca28aa53dd95868c9e38917b9881c09dacfacf1a
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:32 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:08 +0200

sched/numa: Fix task or group comparison

This patch separately considers task and group affinities when
searching for swap candidates during NUMA placement. If tasks
are part of the same group, or no group at all, the task weights
are considered.

Some hysteresis is added to prevent tasks within one group from
getting bounced between NUMA nodes due to tiny differences.

If the tasks are part of different groups, the code compares group
weights, in order to favour keeping the tasks of each group together.

The patch also changes the group weight multiplier to be the
same as the task weight multiplier, since the two are no longer
added up like before.
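
A sketch of the resulting comparison (illustrative names, and again leaving
out the candidate's own weight delta):

/* Choose the improvement metric for a swap candidate, damping intra-group
 * swaps slightly so tiny differences do not bounce tasks between nodes. */
long swap_improvement(long p_group, long cur_group,
		      long taskimp, long groupimp)
{
	long imp;

	if (cur_group == p_group) {
		/* Same group, or both ungrouped: compare task weights. */
		imp = taskimp;
		/* Hysteresis only applies inside a real group. */
		if (cur_group)
			imp -= imp / 16;
	} else {
		/* Different groups: compare group weights; a task that is
		 * all by itself falls back to its task weight. */
		imp = p_group ? groupimp : taskimp;
	}
	return imp;
}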

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-55-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6f45461..423316c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -962,7 +962,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
 	if (!total_faults)
 		return 0;
 
-	return 1200 * group_faults(p, nid) / total_faults;
+	return 1000 * group_faults(p, nid) / total_faults;
 }
 
 static unsigned long weighted_cpuload(const int cpu);
@@ -1068,16 +1068,34 @@ static void task_numa_compare(struct task_numa_env *env,
 
 		/*
 		 * If dst and source tasks are in the same NUMA group, or not
-		 * in any group then look only at task weights otherwise give
-		 * priority to the group weights.
+		 * in any group then look only at task weights.
 		 */
-		if (!cur->numa_group || !env->p->numa_group ||
-		    cur->numa_group == env->p->numa_group) {
+		if (cur->numa_group == env->p->numa_group) {
 			imp = taskimp + task_weight(cur, env->src_nid) -
 			      task_weight(cur, env->dst_nid);
+			/*
+			 * Add some hysteresis to prevent swapping the
+			 * tasks within a group over tiny differences.
+			 */
+			if (cur->numa_group)
+				imp -= imp/16;
 		} else {
-			imp = groupimp + group_weight(cur, env->src_nid) -
-			       group_weight(cur, env->dst_nid);
+			/*
+			 * Compare the group weights. If a task is all by
+			 * itself (not part of a group), use the task weight
+			 * instead.
+			 */
+			if (env->p->numa_group)
+				imp = groupimp;
+			else
+				imp = taskimp;
+
+			if (cur->numa_group)
+				imp += group_weight(cur, env->src_nid) -
+				       group_weight(cur, env->dst_nid);
+			else
+				imp += task_weight(cur, env->src_nid) -
+				       task_weight(cur, env->dst_nid);
 		}
 	}
 

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Avoid migrating tasks that are placed on their preferred node
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:33   ` tip-bot for Peter Zijlstra
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, srikar, aarcange,
	mgorman, tglx

Commit-ID:  0ec8aa00f2b4dc457836ef4e2662b02483e94fb7
Gitweb:     http://git.kernel.org/tip/0ec8aa00f2b4dc457836ef4e2662b02483e94fb7
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:29:33 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:10 +0200

sched/numa: Avoid migrating tasks that are placed on their preferred node

This patch classifies scheduler domains and runqueues into types depending
on the number of tasks that care about their NUMA placement and the number
that are currently running on their preferred node. The types are

regular: There are tasks running that do not care about their NUMA
	placement.

remote: There are tasks running that care about their placement but are
	currently running on a node remote to their ideal placement

all: No distinction

To implement this the patch tracks the number of tasks that are optimally
NUMA placed (rq->nr_preferred_running) and the number of tasks running
that care about their placement (nr_numa_running). The load balancer
uses this information to avoid migrating ideally placed NUMA tasks as long
as better options for load balancing exist. For example, it will not
consider balancing between a group whose tasks are all perfectly placed
and a group with remote tasks.
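
The classification itself is tiny; this mirrors fbq_classify_rq() from the
hunk below:

enum fbq_type { regular, remote, all };

enum fbq_type classify_rq(unsigned int nr_running,
			  unsigned int nr_numa_running,
			  unsigned int nr_preferred_running)
{
	/* Some tasks do not care about NUMA placement at all. */
	if (nr_running > nr_numa_running)
		return regular;
	/* Every task cares, but some run away from their preferred node. */
	if (nr_running > nr_preferred_running)
		return remote;
	/* Everything is already where it wants to be. */
	return all;
}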

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-56-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  |  29 +++++++++++++
 kernel/sched/fair.c  | 120 +++++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |   5 +++
 3 files changed, 142 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e2c893..8cfd51f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4468,6 +4468,35 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
 
 	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
+
+/*
+ * Requeue a task on a given node and accurately track the number of NUMA
+ * tasks on the runqueues
+ */
+void sched_setnuma(struct task_struct *p, int nid)
+{
+	struct rq *rq;
+	unsigned long flags;
+	bool on_rq, running;
+
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	p->numa_preferred_nid = nid;
+	p->numa_migrate_seq = 1;
+
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
 #endif
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 423316c..5166b9b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -888,6 +888,18 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
 
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_numa_running += (p->numa_preferred_nid != -1);
+	rq->nr_preferred_running += (p->numa_preferred_nid == task_node(p));
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_numa_running -= (p->numa_preferred_nid != -1);
+	rq->nr_preferred_running -= (p->numa_preferred_nid == task_node(p));
+}
+
 struct numa_group {
 	atomic_t refcount;
 
@@ -1227,6 +1239,8 @@ static int task_numa_migrate(struct task_struct *p)
 	if (env.best_cpu == -1)
 		return -EAGAIN;
 
+	sched_setnuma(p, env.dst_nid);
+
 	if (env.best_task == NULL) {
 		int ret = migrate_task_to(p, env.best_cpu);
 		return ret;
@@ -1342,8 +1356,7 @@ static void task_numa_placement(struct task_struct *p)
 	/* Preferred node as the node with the most faults */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		/* Update the preferred nid and migrate task if possible */
-		p->numa_preferred_nid = max_nid;
-		p->numa_migrate_seq = 1;
+		sched_setnuma(p, max_nid);
 		numa_migrate_preferred(p);
 	}
 }
@@ -1741,6 +1754,14 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1750,8 +1771,12 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
-		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	if (entity_is_task(se)) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		account_numa_enqueue(rq, task_of(se));
+		list_add(&se->group_node, &rq->cfs_tasks);
+	}
 #endif
 	cfs_rq->nr_running++;
 }
@@ -1762,8 +1787,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
+	}
 	cfs_rq->nr_running--;
 }
 
@@ -4605,6 +4632,8 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
+enum fbq_type { regular, remote, all };
+
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
 #define LBF_DST_PINNED  0x04
@@ -4631,6 +4660,8 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+
+	enum fbq_type		fbq_type;
 };
 
 /*
@@ -5092,6 +5123,10 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned int nr_numa_running;
+	unsigned int nr_preferred_running;
+#endif
 };
 
 /*
@@ -5409,6 +5444,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
+#ifdef CONFIG_NUMA_BALANCING
+		sgs->nr_numa_running += rq->nr_numa_running;
+		sgs->nr_preferred_running += rq->nr_preferred_running;
+#endif
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
@@ -5474,14 +5513,43 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	return false;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
+{
+	if (sgs->sum_nr_running > sgs->nr_numa_running)
+		return regular;
+	if (sgs->sum_nr_running > sgs->nr_preferred_running)
+		return remote;
+	return all;
+}
+
+static inline enum fbq_type fbq_classify_rq(struct rq *rq)
+{
+	if (rq->nr_running > rq->nr_numa_running)
+		return regular;
+	if (rq->nr_running > rq->nr_preferred_running)
+		return remote;
+	return all;
+}
+#else
+static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
+{
+	return all;
+}
+
+static inline enum fbq_type fbq_classify_rq(struct rq *rq)
+{
+	return regular;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
  * @balance: Should we balance.
  * @sds: variable to hold the statistics for this sched_domain.
  */
-static inline void update_sd_lb_stats(struct lb_env *env,
-					struct sd_lb_stats *sds)
+static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
@@ -5538,6 +5606,9 @@ next_group:
 
 		sg = sg->next;
 	} while (sg != env->sd->groups);
+
+	if (env->sd->flags & SD_NUMA)
+		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
 }
 
 /**
@@ -5841,15 +5912,39 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long power = power_of(i);
-		unsigned long capacity = DIV_ROUND_CLOSEST(power,
-							   SCHED_POWER_SCALE);
-		unsigned long wl;
+		unsigned long power, capacity, wl;
+		enum fbq_type rt;
+
+		rq = cpu_rq(i);
+		rt = fbq_classify_rq(rq);
 
+		/*
+		 * We classify groups/runqueues into three groups:
+		 *  - regular: there are !numa tasks
+		 *  - remote:  there are numa tasks that run on the 'wrong' node
+		 *  - all:     there is no distinction
+		 *
+		 * In order to avoid migrating ideally placed numa tasks,
+		 * ignore those when there's better options.
+		 *
+		 * If we ignore the actual busiest queue to migrate another
+		 * task, the next balance pass can still reduce the busiest
+		 * queue by moving tasks around inside the node.
+		 *
+		 * If we cannot move enough load due to this classification
+		 * the next pass will adjust the group classification and
+		 * allow migration of more tasks.
+		 *
+		 * Both cases only affect the total convergence complexity.
+		 */
+		if (rt > env->fbq_type)
+			continue;
+
+		power = power_of(i);
+		capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
 		if (!capacity)
 			capacity = fix_small_capacity(env->sd, group);
 
-		rq = cpu_rq(i);
 		wl = weighted_cpuload(i);
 
 		/*
@@ -5966,6 +6061,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+		.fbq_type	= all,
 	};
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eeb1923..d69cb32 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -409,6 +409,10 @@ struct rq {
 	 * remote CPUs use both these fields when doing load calculation.
 	 */
 	unsigned int nr_running;
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned int nr_numa_running;
+	unsigned int nr_preferred_running;
+#endif
 	#define CPU_LOAD_IDX_MAX 5
 	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
 	unsigned long last_load_update_tick;
@@ -557,6 +561,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int node);
 extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *, struct task_struct *);
 #endif /* CONFIG_NUMA_BALANCING */

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Be more careful about joining numa groups
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:33   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  dabe1d992414a6456e60e41f1d1ad8affc6d444d
Gitweb:     http://git.kernel.org/tip/dabe1d992414a6456e60e41f1d1ad8affc6d444d
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:34 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:12 +0200

sched/numa: Be more careful about joining numa groups

Due to the way the pid is truncated, and tasks are moved between
CPUs by the scheduler, it is possible for the current task_numa_fault
to group together tasks that do not actually share any memory.

This patch adds a few easy sanity checks to task_numa_fault, joining
tasks together if they share the same tsk->mm, or if the fault was on
a page with an elevated mapcount, in a shared VMA.
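
The two checks amount to a simple predicate; a standalone sketch (the
helper name is made up, TNF_SHARED is the flag added by this patch):

#include <stdbool.h>

#define TNF_SHARED	0x04	/* fault hit a page mapped by several mms */

/* Only merge numa groups when there is real evidence of sharing; this
 * filters out groupings caused purely by truncated-pid collisions. */
bool should_join(const void *my_mm, const void *their_mm, int tnf_flags)
{
	/* Threads of the same process always share memory. */
	if (my_mm == their_mm)
		return true;
	/* Otherwise require a fault on a genuinely shared page. */
	return (tnf_flags & TNF_SHARED) != 0;
}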

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-57-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 16 +++++++++++-----
 mm/memory.c           |  7 +++++++
 3 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1127a46..59f953b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1454,6 +1454,7 @@ struct task_struct {
 
 #define TNF_MIGRATED	0x01
 #define TNF_NO_GROUP	0x02
+#define TNF_SHARED	0x04
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5166b9b..222c2d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1381,7 +1381,7 @@ static void double_lock(spinlock_t *l1, spinlock_t *l2)
 	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
 }
 
-static void task_numa_group(struct task_struct *p, int cpupid)
+static void task_numa_group(struct task_struct *p, int cpupid, int flags)
 {
 	struct numa_group *grp, *my_grp;
 	struct task_struct *tsk;
@@ -1439,10 +1439,16 @@ static void task_numa_group(struct task_struct *p, int cpupid)
 	if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp)
 		goto unlock;
 
-	if (!get_numa_group(grp))
-		goto unlock;
+	/* Always join threads in the same process. */
+	if (tsk->mm == current->mm)
+		join = true;
+
+	/* Simple filter to avoid false positives due to PID collisions */
+	if (flags & TNF_SHARED)
+		join = true;
 
-	join = true;
+	if (join && !get_numa_group(grp))
+		join = false;
 
 unlock:
 	rcu_read_unlock();
@@ -1539,7 +1545,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 	} else {
 		priv = cpupid_match_pid(p, last_cpupid);
 		if (!priv && !(flags & TNF_NO_GROUP))
-			task_numa_group(p, last_cpupid);
+			task_numa_group(p, last_cpupid, flags);
 	}
 
 	/*
diff --git a/mm/memory.c b/mm/memory.c
index 9898eeb..823720c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3584,6 +3584,13 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte_write(pte))
 		flags |= TNF_NO_GROUP;
 
+	/*
+	 * Flag if the page is shared between multiple address spaces. This
+	 * is later used when determining whether to group tasks together
+	 */
+	if (page_mapcount(page) > 1 && (vma->vm_flags & VM_SHARED))
+		flags |= TNF_SHARED;
+
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Take false sharing into account when adapting scan rate
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:33   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  3e6a9418cf05638b103e34f5d13be0321872e623
Gitweb:     http://git.kernel.org/tip/3e6a9418cf05638b103e34f5d13be0321872e623
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:35 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:14 +0200

sched/numa: Take false sharing into account when adapting scan rate

Scan rate is altered based on whether shared/private faults dominated.
task_numa_group() may detect false sharing but that information is not
taken into account when adapting the scan rate. Take it into account.
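
The plumbing is minimal; roughly (illustrative helper name), the grouping
decision writes back whether the access should still count as private:

#include <stdbool.h>

/* If the group code declined to merge the tasks, the apparent sharing was
 * false sharing, so the fault is reported back as private (priv = 1). */
void note_group_decision(bool joined, int *priv)
{
	*priv = !joined;
}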

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-58-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 222c2d0..d26a16e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1381,7 +1381,8 @@ static void double_lock(spinlock_t *l1, spinlock_t *l2)
 	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
 }
 
-static void task_numa_group(struct task_struct *p, int cpupid, int flags)
+static void task_numa_group(struct task_struct *p, int cpupid, int flags,
+			int *priv)
 {
 	struct numa_group *grp, *my_grp;
 	struct task_struct *tsk;
@@ -1447,6 +1448,9 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags)
 	if (flags & TNF_SHARED)
 		join = true;
 
+	/* Update priv based on whether false sharing was detected */
+	*priv = !join;
+
 	if (join && !get_numa_group(grp))
 		join = false;
 
@@ -1545,7 +1549,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 	} else {
 		priv = cpupid_match_pid(p, last_cpupid);
 		if (!priv && !(flags & TNF_NO_GROUP))
-			task_numa_group(p, last_cpupid, flags);
+			task_numa_group(p, last_cpupid, flags, &priv);
 	}
 
 	/*

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Adjust scan rate in task_numa_placement
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:33   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  04bb2f9475054298f0c67a89ca92cade42d3fe5e
Gitweb:     http://git.kernel.org/tip/04bb2f9475054298f0c67a89ca92cade42d3fe5e
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:36 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:16 +0200

sched/numa: Adjust scan rate in task_numa_placement

Adjust numa_scan_period in task_numa_placement, depending on how much
useful work the numa code can do. The more local faults there are in a
given scan window the longer the period (and hence the slower the scan rate)
during the next window. If there are excessive shared faults then the scan
period will decrease, with the amount of scaling depending on the ratio of
shared to private faults. If the preferred node changes then the scan rate
is reset to recheck whether the task is properly placed.
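
The adjustment is compact enough to model standalone. A sketch (the two
constants match the hunk below; the function name and the early-return
guard are illustrative):

#define NUMA_PERIOD_SLOTS	10
#define NUMA_PERIOD_THRESHOLD	3
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Delta to apply to the scan period after one window, given that window's
 * fault mix.  In the real code a window with no recorded faults instead
 * doubles the period outright. */
int scan_period_delta(unsigned int period,
		      unsigned long local, unsigned long remote,
		      unsigned long priv, unsigned long shared)
{
	int period_slot = DIV_ROUND_UP(period, NUMA_PERIOD_SLOTS);
	int ratio, diff;

	if (local + remote == 0 || priv + shared == 0)
		return 0;

	ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
	if (ratio >= NUMA_PERIOD_THRESHOLD) {
		/* Mostly local: lengthen the period (scan slower). */
		int slot = ratio - NUMA_PERIOD_THRESHOLD;
		diff = (slot ? slot : 1) * period_slot;
	} else {
		/* Mostly remote: shorten the period, but damp the speed-up
		 * in proportion to how much of the traffic was shared. */
		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
		ratio = DIV_ROUND_UP(priv * NUMA_PERIOD_SLOTS, priv + shared);
		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
	}
	return diff;
}

For example, with a 1000ms scan period and 90% local faults the ratio is 9,
so the period grows by (9 - 3) * 100 = 600ms; with 90% remote faults and
purely private accesses it shrinks by (3 - 1) * 100 = 200ms.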

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-59-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |   9 ++++
 kernel/sched/fair.c   | 112 +++++++++++++++++++++++++++++++++++++++-----------
 mm/huge_memory.c      |   4 +-
 mm/memory.c           |   9 ++--
 4 files changed, 105 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 59f953b..2292f6c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1365,6 +1365,14 @@ struct task_struct {
 	 */
 	unsigned long *numa_faults_buffer;
 
+	/*
+	 * numa_faults_locality tracks if faults recorded during the last
+	 * scan window were remote/local. The task scan period is adapted
+	 * based on the locality of the faults with different weights
+	 * depending on whether they were shared or private faults
+	 */
+	unsigned long numa_faults_locality[2];
+
 	int numa_preferred_nid;
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -1455,6 +1463,7 @@ struct task_struct {
 #define TNF_MIGRATED	0x01
 #define TNF_NO_GROUP	0x02
 #define TNF_SHARED	0x04
+#define TNF_FAULT_LOCAL	0x08
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d26a16e..66237ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1241,6 +1241,12 @@ static int task_numa_migrate(struct task_struct *p)
 
 	sched_setnuma(p, env.dst_nid);
 
+	/*
+	 * Reset the scan period if the task is being rescheduled on an
+	 * alternative node to recheck if the tasks is now properly placed.
+	 */
+	p->numa_scan_period = task_scan_min(p);
+
 	if (env.best_task == NULL) {
 		int ret = migrate_task_to(p, env.best_cpu);
 		return ret;
@@ -1276,10 +1282,86 @@ static void numa_migrate_preferred(struct task_struct *p)
 		p->numa_migrate_retry = jiffies + HZ*5;
 }
 
+/*
+ * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
+ * increments. The more local the fault statistics are, the higher the scan
+ * period will be for the next scan window. If local/remote ratio is below
+ * NUMA_PERIOD_THRESHOLD (where range of ratio is 1..NUMA_PERIOD_SLOTS) the
+ * scan period will decrease
+ */
+#define NUMA_PERIOD_SLOTS 10
+#define NUMA_PERIOD_THRESHOLD 3
+
+/*
+ * Increase the scan period (slow down scanning) if the majority of
+ * our memory is already on our local node, or if the majority of
+ * the page accesses are shared with other processes.
+ * Otherwise, decrease the scan period.
+ */
+static void update_task_scan_period(struct task_struct *p,
+			unsigned long shared, unsigned long private)
+{
+	unsigned int period_slot;
+	int ratio;
+	int diff;
+
+	unsigned long remote = p->numa_faults_locality[0];
+	unsigned long local = p->numa_faults_locality[1];
+
+	/*
+	 * If there were no record hinting faults then either the task is
+	 * completely idle or all activity is areas that are not of interest
+	 * to automatic numa balancing. Scan slower
+	 */
+	if (local + shared == 0) {
+		p->numa_scan_period = min(p->numa_scan_period_max,
+			p->numa_scan_period << 1);
+
+		p->mm->numa_next_scan = jiffies +
+			msecs_to_jiffies(p->numa_scan_period);
+
+		return;
+	}
+
+	/*
+	 * Prepare to scale scan period relative to the current period.
+	 *	 == NUMA_PERIOD_THRESHOLD scan period stays the same
+	 *       <  NUMA_PERIOD_THRESHOLD scan period decreases (scan faster)
+	 *	 >= NUMA_PERIOD_THRESHOLD scan period increases (scan slower)
+	 */
+	period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS);
+	ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
+	if (ratio >= NUMA_PERIOD_THRESHOLD) {
+		int slot = ratio - NUMA_PERIOD_THRESHOLD;
+		if (!slot)
+			slot = 1;
+		diff = slot * period_slot;
+	} else {
+		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
+
+		/*
+		 * Scale scan rate increases based on sharing. There is an
+		 * inverse relationship between the degree of sharing and
+		 * the adjustment made to the scanning period. Broadly
+		 * speaking the intent is that there is little point
+		 * scanning faster if shared accesses dominate as it may
+		 * simply bounce migrations uselessly
+		 */
+		period_slot = DIV_ROUND_UP(diff, NUMA_PERIOD_SLOTS);
+		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
+		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
+	}
+
+	p->numa_scan_period = clamp(p->numa_scan_period + diff,
+			task_scan_min(p), task_scan_max(p));
+	memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1, max_group_nid = -1;
 	unsigned long max_faults = 0, max_group_faults = 0;
+	unsigned long fault_types[2] = { 0, 0 };
 	spinlock_t *group_lock = NULL;
 
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -1309,6 +1391,7 @@ static void task_numa_placement(struct task_struct *p)
 			/* Decay existing window, copy faults since last scan */
 			p->numa_faults[i] >>= 1;
 			p->numa_faults[i] += p->numa_faults_buffer[i];
+			fault_types[priv] += p->numa_faults_buffer[i];
 			p->numa_faults_buffer[i] = 0;
 
 			faults += p->numa_faults[i];
@@ -1333,6 +1416,8 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
+	update_task_scan_period(p, fault_types[0], fault_types[1]);
+
 	if (p->numa_group) {
 		/*
 		 * If the preferred task and group nids are different,
@@ -1538,6 +1623,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 		BUG_ON(p->numa_faults_buffer);
 		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 		p->total_numa_faults = 0;
+		memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
 	}
 
 	/*
@@ -1552,19 +1638,6 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 			task_numa_group(p, last_cpupid, flags, &priv);
 	}
 
-	/*
-	 * If pages are properly placed (did not migrate) then scan slower.
-	 * This is reset periodically in case of phase changes
-	 */
-	if (!migrated) {
-		/* Initialise if necessary */
-		if (!p->numa_scan_period_max)
-			p->numa_scan_period_max = task_scan_max(p);
-
-		p->numa_scan_period = min(p->numa_scan_period_max,
-			p->numa_scan_period + 10);
-	}
-
 	task_numa_placement(p);
 
 	/* Retry task to preferred node migration if it previously failed */
@@ -1575,6 +1648,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 		p->numa_pages_migrated += pages;
 
 	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
+	p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
@@ -1702,18 +1776,6 @@ void task_numa_work(struct callback_head *work)
 
 out:
 	/*
-	 * If the whole process was scanned without updates then no NUMA
-	 * hinting faults are being recorded and scan rate should be lower.
-	 */
-	if (mm->numa_scan_offset == 0 && !nr_pte_updates) {
-		p->numa_scan_period = min(p->numa_scan_period_max,
-			p->numa_scan_period << 1);
-
-		next_scan = now + msecs_to_jiffies(p->numa_scan_period);
-		mm->numa_next_scan = next_scan;
-	}
-
-	/*
 	 * It is possible to reach the end of the VMA list but the last few
 	 * VMAs are not guaranteed to the vma_migratable. If they are not, we
 	 * would find the !migratable VMA on the next scan but not reset the
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7ab4e32..1be2a1f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1296,8 +1296,10 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page_nid = page_to_nid(page);
 	last_cpupid = page_cpupid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (page_nid == this_nid)
+	if (page_nid == this_nid) {
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+		flags |= TNF_FAULT_LOCAL;
+	}
 
 	/*
 	 * Avoid grouping on DSO/COW pages in specific and RO pages
diff --git a/mm/memory.c b/mm/memory.c
index 823720c..1c7501f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3527,13 +3527,16 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-				unsigned long addr, int page_nid)
+				unsigned long addr, int page_nid,
+				int *flags)
 {
 	get_page(page);
 
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (page_nid == numa_node_id())
+	if (page_nid == numa_node_id()) {
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+		*flags |= TNF_FAULT_LOCAL;
+	}
 
 	return mpol_misplaced(page, vma, addr);
 }
@@ -3593,7 +3596,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
-	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+	target_nid = numa_migrate_prep(page, vma, addr, page_nid, &flags);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
 		put_page(page);

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:sched/core] sched/numa: Remove the numa_balancing_scan_period_reset sysctl
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:33   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  930aa174fcc8b0efaad102fd80f677b92f35eaa2
Gitweb:     http://git.kernel.org/tip/930aa174fcc8b0efaad102fd80f677b92f35eaa2
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:37 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:18 +0200

sched/numa: Remove the numa_balancing_scan_period_reset sysctl

With scan rate adaptations based on whether the workload has properly
converged or not, there should be no need for the scan period reset
hammer. Get rid of it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/sysctl/kernel.txt | 11 +++--------
 include/linux/mm_types.h        |  3 ---
 include/linux/sched/sysctl.h    |  1 -
 kernel/sched/core.c             |  1 -
 kernel/sched/fair.c             | 18 +-----------------
 kernel/sysctl.c                 |  7 -------
 6 files changed, 4 insertions(+), 37 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index d48bca4..84f1780 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,15 +374,13 @@ guarantee. If the target workload is already bound to NUMA nodes then this
 feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
-numa_balancing_settle_count sysctls.
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
 
 ==============================================================
 
 numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_size_mb
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
 
 Automatic NUMA balancing scans tasks address space and unmaps pages to
 detect if pages are properly placed or if the data should be migrated to a
@@ -418,9 +416,6 @@ rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
 
-numa_balancing_scan_period_reset is a blunt instrument that controls how
-often a tasks scan delay is reset to detect sudden changes in task behaviour.
-
 numa_balancing_settle_count is how many scan periods must complete before
 the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a30f9ca..a3198e5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -420,9 +420,6 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
-	/* numa_next_reset is when the PTE scanner period will be reset */
-	unsigned long numa_next_reset;
-
 	/* Restart point for scanning and setting pte_numa */
 	unsigned long numa_scan_offset;
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..10d16c4f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -47,7 +47,6 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
-extern unsigned int sysctl_numa_balancing_scan_period_reset;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_settle_count;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8cfd51f..89c5ae8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1721,7 +1721,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
 		p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-		p->mm->numa_next_reset = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 66237ff..da6fa22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -826,7 +826,6 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  */
 unsigned int sysctl_numa_balancing_scan_period_min = 1000;
 unsigned int sysctl_numa_balancing_scan_period_max = 60000;
-unsigned int sysctl_numa_balancing_scan_period_reset = 60000;
 
 /* Portion of address space to scan in MB */
 unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -1685,24 +1684,9 @@ void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
-	if (!mm->numa_next_reset || !mm->numa_next_scan) {
+	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
 			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-		mm->numa_next_reset = now +
-			msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
-	}
-
-	/*
-	 * Reset the scan period if enough time has gone by. Objective is that
-	 * scanning will be reduced if pages are properly placed. As tasks
-	 * can enter different phases this needs to be re-examined. Lacking
-	 * proper tracking of reference behaviour, this blunt hammer is used.
-	 */
-	migrate = mm->numa_next_reset;
-	if (time_after(now, migrate)) {
-		p->numa_scan_period = task_scan_min(p);
-		next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
-		xchg(&mm->numa_next_reset, next_scan);
 	}
 
 	/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 42f616a..e509b90 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -371,13 +371,6 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname	= "numa_balancing_scan_period_reset",
-		.data		= &sysctl_numa_balancing_scan_period_reset,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
-	{
 		.procname	= "numa_balancing_scan_period_max_ms",
 		.data		= &sysctl_numa_balancing_scan_period_max,
 		.maxlen		= sizeof(unsigned int),


* [tip:sched/core] mm: numa: Revert temporarily disabling of NUMA migration
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:33   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  1e3646ffc64b232cb14a5ef01d7b98997c1b73f9
Gitweb:     http://git.kernel.org/tip/1e3646ffc64b232cb14a5ef01d7b98997c1b73f9
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:20 +0200

mm: numa: Revert temporarily disabling of NUMA migration

With the scan rate code working (at least for multi-instance specjbb),
the large hammer that is "sched: Do not migrate memory immediately after
switching node" can be replaced with something smarter. Revert temporarily
migration disabling and all traces of numa_migrate_seq.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-61-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  1 -
 kernel/sched/core.c   |  2 --
 kernel/sched/fair.c   | 25 +------------------------
 mm/mempolicy.c        | 12 ------------
 4 files changed, 1 insertion(+), 39 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2292f6c..d24f70f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1340,7 +1340,6 @@ struct task_struct {
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	int numa_scan_seq;
-	int numa_migrate_seq;
 	unsigned int numa_scan_period;
 	unsigned int numa_scan_period_max;
 	unsigned long numa_migrate_retry;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 89c5ae8..0c3feeb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1731,7 +1731,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = 1;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
@@ -4488,7 +4487,6 @@ void sched_setnuma(struct task_struct *p, int nid)
 		p->sched_class->put_prev_task(rq, p);
 
 	p->numa_preferred_nid = nid;
-	p->numa_migrate_seq = 1;
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da6fa22..8454c38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1261,16 +1261,8 @@ static void numa_migrate_preferred(struct task_struct *p)
 {
 	/* Success if task is already running on preferred CPU */
 	p->numa_migrate_retry = 0;
-	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid) {
-		/*
-		 * If migration is temporarily disabled due to a task migration
-		 * then re-enable it now as the task is running on its
-		 * preferred node and memory should migrate locally
-		 */
-		if (!p->numa_migrate_seq)
-			p->numa_migrate_seq++;
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
 		return;
-	}
 
 	/* This task has no NUMA fault statistics yet */
 	if (unlikely(p->numa_preferred_nid == -1))
@@ -1367,7 +1359,6 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
-	p->numa_migrate_seq++;
 	p->numa_scan_period_max = task_scan_max(p);
 
 	/* If the task is part of a group prevent parallel updates to group stats */
@@ -4730,20 +4721,6 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 	set_task_cpu(p, env->dst_cpu);
 	activate_task(env->dst_rq, p, 0);
 	check_preempt_curr(env->dst_rq, p, 0);
-#ifdef CONFIG_NUMA_BALANCING
-	if (p->numa_preferred_nid != -1) {
-		int src_nid = cpu_to_node(env->src_cpu);
-		int dst_nid = cpu_to_node(env->dst_cpu);
-
-		/*
-		 * If the load balancer has moved the task then limit
-		 * migrations from taking place in the short term in
-		 * case this is a short-lived migration.
-		 */
-		if (src_nid != dst_nid && dst_nid != p->numa_preferred_nid)
-			p->numa_migrate_seq = 0;
-	}
-#endif
 }
 
 /*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a5867ef..2929c24 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2404,18 +2404,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid)
 			goto out;
-
-#ifdef CONFIG_NUMA_BALANCING
-		/*
-		 * If the scheduler has just moved us away from our
-		 * preferred node, do not bother migrating pages yet.
-		 * This way a short and temporary process migration will
-		 * not cause excessive memory migration.
-		 */
-		if (thisnid != current->numa_preferred_nid &&
-				!current->numa_migrate_seq)
-			goto out;
-#endif
 	}
 
 	if (curnid != polnid)


* [tip:sched/core] sched/numa: Skip some page migrations after a shared fault
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:34   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  de1c9ce6f07fec0381a39a9d0b379ea35aa1167f
Gitweb:     http://git.kernel.org/tip/de1c9ce6f07fec0381a39a9d0b379ea35aa1167f
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:39 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:21 +0200

sched/numa: Skip some page migrations after a shared fault

Shared faults can lead to lots of unnecessary page migrations,
slowing down the system, and causing private faults to hit the
per-pgdat migration ratelimit.

This patch adds sysctl numa_balancing_migrate_deferred, which specifies
how many shared page migrations to skip unconditionally, after each page
migration that is skipped because it is a shared fault.

This reduces the number of page migrations back and forth in
shared fault situations. It also gives a strong preference to
the tasks that are already running where most of the memory is,
and to moving the other tasks closer to that memory.

Testing this with a much higher scan rate than the default
still seems to result in fewer page migrations than before.

Memory seems to be somewhat better consolidated than previously,
with multi-instance specjbb runs on a 4 node system.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/sysctl/kernel.txt | 10 ++++++++-
 include/linux/sched.h           |  5 ++++-
 kernel/sched/fair.c             |  8 +++++++
 kernel/sysctl.c                 |  7 ++++++
 mm/mempolicy.c                  | 48 ++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 84f1780..4273b2d 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -375,7 +375,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
-numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
+numa_balancing_scan_size_mb, numa_balancing_settle_count sysctls and
+numa_balancing_migrate_deferred.
 
 ==============================================================
 
@@ -421,6 +422,13 @@ the schedule balancer stops pushing the task towards a preferred node. This
 gives the scheduler a chance to place the task on an alternative node if the
 preferred node is overloaded.
 
+numa_balancing_migrate_deferred is how many page migrations get skipped
+unconditionally, after a page migration is skipped because a page is shared
+with other tasks. This reduces page migration overhead, and determines
+how much stronger the "move task near its memory" policy scheduler becomes,
+versus the "move memory near its task" memory management policy, for workloads
+with shared memory.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d24f70f..833eed5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1342,6 +1342,8 @@ struct task_struct {
 	int numa_scan_seq;
 	unsigned int numa_scan_period;
 	unsigned int numa_scan_period_max;
+	int numa_preferred_nid;
+	int numa_migrate_deferred;
 	unsigned long numa_migrate_retry;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
@@ -1372,7 +1374,6 @@ struct task_struct {
 	 */
 	unsigned long numa_faults_locality[2];
 
-	int numa_preferred_nid;
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
@@ -1469,6 +1470,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 extern void task_numa_free(struct task_struct *p);
+
+extern unsigned int sysctl_numa_balancing_migrate_deferred;
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8454c38..e7884dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -833,6 +833,14 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * After skipping a page migration on a shared page, skip N more numa page
+ * migrations unconditionally. This reduces the number of NUMA migrations
+ * in shared memory workloads, and has the effect of pulling tasks towards
+ * where their memory lives, over pulling the memory towards the task.
+ */
+unsigned int sysctl_numa_balancing_migrate_deferred = 16;
+
 static unsigned int task_nr_scan_windows(struct task_struct *p)
 {
 	unsigned long rss = 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e509b90..a159e1f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -391,6 +391,13 @@ static struct ctl_table kern_table[] = {
 		.mode           = 0644,
 		.proc_handler   = proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_migrate_deferred",
+		.data           = &sysctl_numa_balancing_migrate_deferred,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2929c24..71cb253 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2301,6 +2301,35 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
+{
+	/* Never defer a private fault */
+	if (cpupid_match_pid(p, last_cpupid))
+		return false;
+
+	if (p->numa_migrate_deferred) {
+		p->numa_migrate_deferred--;
+		return true;
+	}
+	return false;
+}
+
+static inline void defer_numa_migrate(struct task_struct *p)
+{
+	p->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred;
+}
+#else
+static inline bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
+{
+	return false;
+}
+
+static inline void defer_numa_migrate(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 /**
  * mpol_misplaced - check whether current page node is valid in policy
  *
@@ -2402,7 +2431,24 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * relation.
 		 */
 		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
-		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid)
+		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
+
+			/* See sysctl_numa_balancing_migrate_deferred comment */
+			if (!cpupid_match_pid(current, last_cpupid))
+				defer_numa_migrate(current);
+
+			goto out;
+		}
+
+		/*
+		 * The quadratic filter above reduces extraneous migration
+		 * of shared pages somewhat. This code reduces it even more,
+		 * reducing the overhead of page migrations of shared pages.
+		 * This makes workloads with shared pages rely more on
+		 * "move task near its memory", and less on "move memory
+		 * towards its task", which is exactly what we want.
+		 */
+		if (numa_migrate_deferred(current, last_cpupid))
 			goto out;
 	}
 

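For illustration, a minimal userspace sketch of the defer-counter behaviour
introduced above; struct task, migrate_deferred_max and main() are simplified
stand-ins rather than the kernel's task_struct or sysctl plumbing:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins; not the kernel's structures or sysctls. */
struct task {
	int numa_migrate_deferred;
};

static int migrate_deferred_max = 16;	/* analogue of the new sysctl */

/* Called when a migration is skipped because the page looks shared. */
static void defer_numa_migrate(struct task *t)
{
	t->numa_migrate_deferred = migrate_deferred_max;
}

/* Returns true if this shared fault should skip migration outright. */
static bool numa_migrate_deferred(struct task *t, bool private_fault)
{
	if (private_fault)		/* never defer a private fault */
		return false;
	if (t->numa_migrate_deferred) {
		t->numa_migrate_deferred--;
		return true;
	}
	return false;
}

int main(void)
{
	struct task t = { .numa_migrate_deferred = 0 };
	int skipped = 0;

	/* One shared fault trips the filter ... */
	defer_numa_migrate(&t);

	/* ... and the next N shared faults are skipped unconditionally. */
	for (int i = 0; i < 20; i++)
		if (numa_migrate_deferred(&t, false))
			skipped++;

	printf("skipped %d of 20 subsequent shared faults\n", skipped);
	return 0;
}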

* [tip:sched/core] sched/numa: Use unsigned longs for numa group fault stats
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:34   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  989348b5fc2367d6880d23a1c779a90bbb6f9baf
Gitweb:     http://git.kernel.org/tip/989348b5fc2367d6880d23a1c779a90bbb6f9baf
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:40 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:23 +0200

sched/numa: Use unsigned longs for numa group fault stats

As Peter says "If you're going to hold locks you can also do away with all
that atomic_long_*() nonsense". Lock acquisition is moved slightly to protect
the updates.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-63-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 49 ++++++++++++++++++++-----------------------------
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7884dc..5b2208e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -916,8 +916,8 @@ struct numa_group {
 	struct list_head task_list;
 
 	struct rcu_head rcu;
-	atomic_long_t total_faults;
-	atomic_long_t faults[0];
+	unsigned long total_faults;
+	unsigned long faults[0];
 };
 
 pid_t task_numa_group_id(struct task_struct *p)
@@ -944,8 +944,7 @@ static inline unsigned long group_faults(struct task_struct *p, int nid)
 	if (!p->numa_group)
 		return 0;
 
-	return atomic_long_read(&p->numa_group->faults[2*nid]) +
-	       atomic_long_read(&p->numa_group->faults[2*nid+1]);
+	return p->numa_group->faults[2*nid] + p->numa_group->faults[2*nid+1];
 }
 
 /*
@@ -971,17 +970,10 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
 
 static inline unsigned long group_weight(struct task_struct *p, int nid)
 {
-	unsigned long total_faults;
-
-	if (!p->numa_group)
-		return 0;
-
-	total_faults = atomic_long_read(&p->numa_group->total_faults);
-
-	if (!total_faults)
+	if (!p->numa_group || !p->numa_group->total_faults)
 		return 0;
 
-	return 1000 * group_faults(p, nid) / total_faults;
+	return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
 }
 
 static unsigned long weighted_cpuload(const int cpu);
@@ -1397,9 +1389,9 @@ static void task_numa_placement(struct task_struct *p)
 			p->total_numa_faults += diff;
 			if (p->numa_group) {
 				/* safe because we can only change our own group */
-				atomic_long_add(diff, &p->numa_group->faults[i]);
-				atomic_long_add(diff, &p->numa_group->total_faults);
-				group_faults += atomic_long_read(&p->numa_group->faults[i]);
+				p->numa_group->faults[i] += diff;
+				p->numa_group->total_faults += diff;
+				group_faults += p->numa_group->faults[i];
 			}
 		}
 
@@ -1475,7 +1467,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 
 	if (unlikely(!p->numa_group)) {
 		unsigned int size = sizeof(struct numa_group) +
-				    2*nr_node_ids*sizeof(atomic_long_t);
+				    2*nr_node_ids*sizeof(unsigned long);
 
 		grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 		if (!grp)
@@ -1487,9 +1479,9 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 		grp->gid = p->pid;
 
 		for (i = 0; i < 2*nr_node_ids; i++)
-			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
+			grp->faults[i] = p->numa_faults[i];
 
-		atomic_long_set(&grp->total_faults, p->total_numa_faults);
+		grp->total_faults = p->total_numa_faults;
 
 		list_add(&p->numa_entry, &grp->task_list);
 		grp->nr_tasks++;
@@ -1543,14 +1535,14 @@ unlock:
 	if (!join)
 		return;
 
+	double_lock(&my_grp->lock, &grp->lock);
+
 	for (i = 0; i < 2*nr_node_ids; i++) {
-		atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
-		atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+		my_grp->faults[i] -= p->numa_faults[i];
+		grp->faults[i] += p->numa_faults[i];
 	}
-	atomic_long_sub(p->total_numa_faults, &my_grp->total_faults);
-	atomic_long_add(p->total_numa_faults, &grp->total_faults);
-
-	double_lock(&my_grp->lock, &grp->lock);
+	my_grp->total_faults -= p->total_numa_faults;
+	grp->total_faults += p->total_numa_faults;
 
 	list_move(&p->numa_entry, &grp->task_list);
 	my_grp->nr_tasks--;
@@ -1571,12 +1563,11 @@ void task_numa_free(struct task_struct *p)
 	void *numa_faults = p->numa_faults;
 
 	if (grp) {
+		spin_lock(&grp->lock);
 		for (i = 0; i < 2*nr_node_ids; i++)
-			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
-
-		atomic_long_sub(p->total_numa_faults, &grp->total_faults);
+			grp->faults[i] -= p->numa_faults[i];
+		grp->total_faults -= p->total_numa_faults;
 
-		spin_lock(&grp->lock);
 		list_del(&p->numa_entry);
 		grp->nr_tasks--;
 		spin_unlock(&grp->lock);

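As a rough userspace analogue of the conversion above (simplified structures,
pthread mutexes instead of the scheduler's spinlocks): plain unsigned long
counters are only safe because every update runs with the group lock(s) held,
which is why double_lock() is taken before the fault counts are transferred:

#include <pthread.h>
#include <stdio.h>

#define NR_STATS 4

/* Simplified stand-in for struct numa_group. */
struct group {
	pthread_mutex_t lock;
	unsigned long faults[NR_STATS];
	unsigned long total_faults;
};

/* Lock two groups in address order to avoid ABBA deadlock. */
static void double_lock(struct group *a, struct group *b)
{
	if (a > b) {
		struct group *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(&a->lock);
	pthread_mutex_lock(&b->lock);
}

/*
 * Move one task's fault counts from its old group to its new group.
 * Both locks are taken before any counter is touched, mirroring the
 * reordering in the patch.
 */
static void transfer_faults(struct group *from, struct group *to,
			    const unsigned long task_faults[NR_STATS])
{
	double_lock(from, to);
	for (int i = 0; i < NR_STATS; i++) {
		from->faults[i] -= task_faults[i];
		to->faults[i] += task_faults[i];
		from->total_faults -= task_faults[i];
		to->total_faults += task_faults[i];
	}
	pthread_mutex_unlock(&from->lock);
	pthread_mutex_unlock(&to->lock);
}

int main(void)
{
	struct group g1 = { .lock = PTHREAD_MUTEX_INITIALIZER,
			    .faults = { 5, 5, 5, 5 }, .total_faults = 20 };
	struct group g2 = { .lock = PTHREAD_MUTEX_INITIALIZER };
	unsigned long task_faults[NR_STATS] = { 1, 2, 3, 4 };

	transfer_faults(&g1, &g2, task_faults);
	printf("g1 total=%lu g2 total=%lu\n", g1.total_faults, g2.total_faults);
	return 0;
}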

* [tip:sched/core] sched/numa: Retry task_numa_migrate() periodically
  2013-10-07 10:29   ` Mel Gorman
@ 2013-10-09 17:34   ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
	mgorman, tglx

Commit-ID:  2739d3eef3a93a92c366a3a0bb85a0afe09e8b8c
Gitweb:     http://git.kernel.org/tip/2739d3eef3a93a92c366a3a0bb85a0afe09e8b8c
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:41 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 14:48:25 +0200

sched/numa: Retry task_numa_migrate() periodically

Short spikes of CPU load can lead to a task being migrated
away from its preferred node for temporary reasons.

It is important that the task is migrated back to where it
belongs, in order to avoid migrating too much memory to its
new location, and generally disturbing a task's NUMA location.

This patch fixes NUMA placement for 4 specjbb instances on
a 4 node system. Without this patch, things take longer to
converge, and processes are not always completely on their
own node.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-64-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b2208e..e914930 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1259,18 +1259,19 @@ static int task_numa_migrate(struct task_struct *p)
 /* Attempt to migrate a task to a CPU on the preferred node. */
 static void numa_migrate_preferred(struct task_struct *p)
 {
-	/* Success if task is already running on preferred CPU */
-	p->numa_migrate_retry = 0;
-	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+	/* This task has no NUMA fault statistics yet */
+	if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults))
 		return;
 
-	/* This task has no NUMA fault statistics yet */
-	if (unlikely(p->numa_preferred_nid == -1))
+	/* Periodically retry migrating the task to the preferred node */
+	p->numa_migrate_retry = jiffies + HZ;
+
+	/* Success if task is already running on preferred CPU */
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
 		return;
 
 	/* Otherwise, try migrate to a CPU on the preferred node */
-	if (task_numa_migrate(p) != 0)
-		p->numa_migrate_retry = jiffies + HZ*5;
+	task_numa_migrate(p);
 }
 
 /*
@@ -1629,8 +1630,11 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 
 	task_numa_placement(p);
 
-	/* Retry task to preferred node migration if it previously failed */
-	if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+	/*
+	 * Retry task to preferred node migration periodically, in case it
+	 * previously failed, or the scheduler moved us.
+	 */
+	if (time_after(jiffies, p->numa_migrate_retry))
 		numa_migrate_preferred(p);
 
 	if (migrated)

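The retry above leans on a wraparound-safe time comparison; a small standalone
sketch of the same idiom (HZ, the loop and the counters are illustrative;
time_after mirrors the kernel's definition minus the type checks):

#include <stdio.h>

/* Wraparound-safe "a is after b" for unsigned tick counters. */
#define time_after(a, b)	((long)((b) - (a)) < 0)

#define HZ 100UL

int main(void)
{
	unsigned long jiffies;
	unsigned long numa_migrate_retry = 0;
	int attempts = 0;

	/* Simulate ~5 seconds of ticks; the retry is re-armed once per HZ. */
	for (jiffies = 1; jiffies <= 5 * HZ; jiffies++) {
		if (time_after(jiffies, numa_migrate_retry)) {
			numa_migrate_retry = jiffies + HZ;
			attempts++;	/* stands in for task_numa_migrate() */
		}
	}

	printf("attempted migration %d times over 5 simulated seconds\n",
	       attempts);
	return 0;
}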

* Re: [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
  2013-10-09 11:03   ` Ingo Molnar
@ 2013-10-10  7:05     ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-10  7:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Oct 09, 2013 at 01:03:54PM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > This series has roughly the same goals as previous versions despite the
> > size. It reduces overhead of automatic balancing through scan rate reduction
> > and the avoidance of TLB flushes. It selects a preferred node and moves tasks
> > towards their memory as well as moving memory toward their task. It handles
> > shared pages and groups related tasks together. Some problems such as shared
> > page interleaving and properly dealing with processes that are larger than
> > a node are being deferred. This version should be ready for wider testing
> > in -tip.
> 
> Thanks Mel - the series looks really nice. I've applied the patches to 
> tip:sched/core and will push them out later today if they pass testing 
> here.
> 

Thanks very much!

> > Note that with kernel 3.12-rc3 that numa balancing will fail to boot if 
> > CONFIG_JUMP_LABEL is configured. This is a separate bug that is 
> > currently being dealt with.
> 
> Okay, this is about:
> 
>   https://lkml.org/lkml/2013/9/30/308
> 
> Note that Peter and me saw no crashes so far, and we boot with 
> CONFIG_JUMP_LABEL=y and CONFIG_NUMA_BALANCING=y. It seems like an 
> unrelated bug in any case, perhaps related to specific details in your 
> kernel image?
> 

Possibly, or it has been fixed since and I missed it. I'll test latest
tip and see what falls out.

> 2)
> 
> I also noticed a small Kconfig annoyance:
> 
> config NUMA_BALANCING_DEFAULT_ENABLED
>         bool "Automatically enable NUMA aware memory/task placement"
>         default y
>         depends on NUMA_BALANCING
>         help
>           If set, autonumic NUMA balancing will be enabled if running on a NUMA
>           machine.
> 
> config NUMA_BALANCING
>         bool "Memory placement aware NUMA scheduler"
>         depends on ARCH_SUPPORTS_NUMA_BALANCING
>         depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
>         depends on SMP && NUMA && MIGRATION
>         help
>           This option adds support for automatic NUM
> 
> the NUMA_BALANCING_DEFAULT_ENABLED option should come after the 
> NUMA_BALANCING entries - things like 'make oldconfig' produce weird output 
> otherwise.
> 

Ok, I did not realise that would be a problem. Thanks for fixing it up
as well as the build errors on UP.

-- 
Mel Gorman
SUSE Labs


* Re: [tip:sched/core] sched/numa: Introduce migrate_swap()
  2013-10-09 17:30   ` [tip:sched/core] sched/numa: " tip-bot for Peter Zijlstra
@ 2013-10-10 18:17     ` Peter Zijlstra
  2013-10-10 19:04       ` Rik van Riel
                         ` (2 more replies)
  0 siblings, 3 replies; 340+ messages in thread
From: Peter Zijlstra @ 2013-10-10 18:17 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, hannes, riel, aarcange, srikar, mgorman, tglx
  Cc: linux-tip-commits

On Wed, Oct 09, 2013 at 10:30:13AM -0700, tip-bot for Peter Zijlstra wrote:
> sched/numa: Introduce migrate_swap()

Thanks to Rik for writing the Changelog!

---
From: Peter Zijlstra <peterz@infradead.org>
Subject: sched: Fix race in migrate_swap_stop

There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap
places with task T, on CPU B.

Task P:
  - call migrate_swap
Task T:
  - go to sleep, removing itself from the runqueue
Task P:
  - double lock the runqueues on CPU A & B
Task T:
  - get woken up, place itself on the runqueue of CPU C
Task P:
  - see that task T is on a runqueue, and pretend to remove it
    from the runqueue on CPU B

Now CPUs B & C both have corrupted scheduler data structures.

This patch fixes it, by holding the pi_lock for both of the tasks
involved in the migrate swap. This prevents task T from waking up,
and placing itself onto another runqueue, until after migrate_swap
has released all locks.

This means that, when migrate_swap checks, task T will be either
on the runqueue where it was originally seen, or not on any
runqueue at all. Migrate_swap deals correctly with both of those cases.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Joe Mario <jmario@redhat.com>
---
 kernel/sched/core.c  |  4 ++++
 kernel/sched/fair.c  |  9 ---------
 kernel/sched/sched.h | 18 ++++++++++++++++++
 3 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c3feebcf112..a972acd468b0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1049,6 +1049,8 @@ static int migrate_swap_stop(void *data)
 	src_rq = cpu_rq(arg->src_cpu);
 	dst_rq = cpu_rq(arg->dst_cpu);
 
+	double_raw_lock(&arg->src_task->pi_lock,
+			&arg->dst_task->pi_lock);
 	double_rq_lock(src_rq, dst_rq);
 	if (task_cpu(arg->dst_task) != arg->dst_cpu)
 		goto unlock;
@@ -1069,6 +1071,8 @@ static int migrate_swap_stop(void *data)
 
 unlock:
 	double_rq_unlock(src_rq, dst_rq);
+	raw_spin_unlock(&arg->dst_task->pi_lock);
+	raw_spin_unlock(&arg->src_task->pi_lock);
 
 	return ret;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 803e343d7c89..a60d57c5379a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1448,15 +1448,6 @@ static inline void put_numa_group(struct numa_group *grp)
 		kfree_rcu(grp, rcu);
 }
 
-static void double_lock(spinlock_t *l1, spinlock_t *l2)
-{
-	if (l1 > l2)
-		swap(l1, l2);
-
-	spin_lock(l1);
-	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
-}
-
 static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 			int *priv)
 {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d69cb325c27e..ffc708717b70 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1249,6 +1249,24 @@ static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
 	lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
 }
 
+static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	spin_lock(l1);
+	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
+static inline void double_raw_lock(raw_spinlock_t *l1, raw_spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	raw_spin_lock(l1);
+	raw_spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
 /*
  * double_rq_lock - safely lock two runqueues
  *

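A toy userspace rendering of the fix, assuming simplified stand-ins (pthread
mutexes for pi_lock, an int for the runqueue a task sits on): both tasks'
wakeup locks are held across the check and the swap, so a concurrent wakeup
cannot move either task in between:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy task: the pi_lock-like mutex guards where the task may be enqueued. */
struct task {
	pthread_mutex_t pi_lock;
	int cpu;
};

/* Take both locks in address order to avoid ABBA deadlock. */
static void double_task_lock(struct task *a, struct task *b)
{
	if (a > b) {
		struct task *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(&a->pi_lock);
	pthread_mutex_lock(&b->pi_lock);
}

/*
 * Swap the two tasks' CPUs, but only if both are still where the caller
 * last saw them.  Because both locks are held, a concurrent wakeup cannot
 * move either task between the check and the swap.
 */
static bool swap_tasks(struct task *p, struct task *t, int src_cpu, int dst_cpu)
{
	bool ok = false;

	double_task_lock(p, t);
	if (p->cpu == src_cpu && t->cpu == dst_cpu) {
		p->cpu = dst_cpu;
		t->cpu = src_cpu;
		ok = true;
	}
	pthread_mutex_unlock(&p->pi_lock);
	pthread_mutex_unlock(&t->pi_lock);
	return ok;
}

int main(void)
{
	struct task p = { .pi_lock = PTHREAD_MUTEX_INITIALIZER, .cpu = 0 };
	struct task t = { .pi_lock = PTHREAD_MUTEX_INITIALIZER, .cpu = 1 };
	bool ok = swap_tasks(&p, &t, 0, 1);

	printf("swap %s; p on cpu %d, t on cpu %d\n",
	       ok ? "succeeded" : "aborted", p.cpu, t.cpu);
	return 0;
}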

* Re: [tip:sched/core] sched/numa: Introduce migrate_swap()
  2013-10-10 18:17     ` Peter Zijlstra
@ 2013-10-10 19:04       ` Rik van Riel
  2013-10-15  9:55       ` Mel Gorman
  2013-10-17 16:49       ` [tip:sched/core] sched: Fix race in migrate_swap_stop() tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-10 19:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, hpa, linux-kernel, hannes, aarcange, srikar, mgorman,
	tglx, linux-tip-commits

On 10/10/2013 02:17 PM, Peter Zijlstra wrote:
> On Wed, Oct 09, 2013 at 10:30:13AM -0700, tip-bot for Peter Zijlstra wrote:
>> sched/numa: Introduce migrate_swap()
>
> Thanks to Rik for writing the Changelog!
>
> ---
> From: Peter Zijlstra <peterz@infradead.org>
> Subject: sched: Fix race in migrate_swap_stop
>
> There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap
> places with task T, on CPU B.
>
> Task P:
>    - call migrate_swap
> Task T:
>    - go to sleep, removing itself from the runqueue
> Task P:
>    - double lock the runqueues on CPU A & B
> Task T:
>    - get woken up, place itself on the runqueue of CPU C
> Task P:
>    - see that task T is on a runqueue, and pretend to remove it
>      from the runqueue on CPU B
>
> Now CPUs B & C both have corrupted scheduler data structures.
>
> This patch fixes it, by holding the pi_lock for both of the tasks
> involved in the migrate swap. This prevents task T from waking up,
> and placing itself onto another runqueue, until after migrate_swap
> has released all locks.
>
> This means that, when migrate_swap checks, task T will be either
> on the runqueue where it was originally seen, or not on any
> runqueue at all. Migrate_swap deals correctly with of those cases.
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Tested-by: Joe Mario <jmario@redhat.com>

Reviewed-by: Rik van Riel <riel@redhat.com>



* Re: [tip:sched/core] sched/numa: Introduce migrate_swap()
  2013-10-10 18:17     ` Peter Zijlstra
  2013-10-10 19:04       ` Rik van Riel
@ 2013-10-15  9:55       ` Mel Gorman
  2013-10-17 16:49       ` [tip:sched/core] sched: Fix race in migrate_swap_stop() tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-15  9:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, hpa, linux-kernel, hannes, riel, aarcange, srikar, tglx,
	linux-tip-commits

On Thu, Oct 10, 2013 at 08:17:22PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 09, 2013 at 10:30:13AM -0700, tip-bot for Peter Zijlstra wrote:
> > sched/numa: Introduce migrate_swap()
> 
> Thanks to Rik for writing the Changelog!
> 
> ---
> From: Peter Zijlstra <peterz@infradead.org>
> Subject: sched: Fix race in migrate_swap_stop
> 
> There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap
> places with task T, on CPU B.
> 
> Task P:
>   - call migrate_swap
> Task T:
>   - go to sleep, removing itself from the runqueue
> Task P:
>   - double lock the runqueues on CPU A & B
> Task T:
>   - get woken up, place itself on the runqueue of CPU C
> Task P:
>   - see that task T is on a runqueue, and pretend to remove it
>     from the runqueue on CPU B
> 
> Now CPUs B & C both have corrupted scheduler data structures.
> 
> This patch fixes it, by holding the pi_lock for both of the tasks
> involved in the migrate swap. This prevents task T from waking up,
> and placing itself onto another runqueue, until after migrate_swap
> has released all locks.
> 
> This means that, when migrate_swap checks, task T will be either
> on the runqueue where it was originally seen, or not on any
> runqueue at all. Migrate_swap deals correctly with of those cases.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Tested-by: Joe Mario <jmario@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* [tip:sched/core] sched: Fix race in migrate_swap_stop()
  2013-10-10 18:17     ` Peter Zijlstra
  2013-10-10 19:04       ` Rik van Riel
  2013-10-15  9:55       ` Mel Gorman
@ 2013-10-17 16:49       ` tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-17 16:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, jmario, riel, mgorman, tglx

Commit-ID:  746023159c40c523b08a3bc3d213dac212385895
Gitweb:     http://git.kernel.org/tip/746023159c40c523b08a3bc3d213dac212385895
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 10 Oct 2013 20:17:22 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 16 Oct 2013 14:22:14 +0200

sched: Fix race in migrate_swap_stop()

There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap
places with task T, on CPU B.

Task P:
  - call migrate_swap
Task T:
  - go to sleep, removing itself from the runqueue
Task P:
  - double lock the runqueues on CPU A & B
Task T:
  - get woken up, place itself on the runqueue of CPU C
Task P:
  - see that task T is on a runqueue, and pretend to remove it
    from the runqueue on CPU B

Now CPUs B & C both have corrupted scheduler data structures.

This patch fixes it, by holding the pi_lock for both of the tasks
involved in the migrate swap. This prevents task T from waking up,
and placing itself onto another runqueue, until after migrate_swap
has released all locks.

This means that, when migrate_swap checks, task T will be either
on the runqueue where it was originally seen, or not on any
runqueue at all. Migrate_swap deals correctly with both of those cases.

Tested-by: Joe Mario <jmario@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: hannes@cmpxchg.org
Cc: aarcange@redhat.com
Cc: srikar@linux.vnet.ibm.com
Cc: tglx@linutronix.de
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/20131010181722.GO13848@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  |  4 ++++
 kernel/sched/fair.c  |  9 ---------
 kernel/sched/sched.h | 18 ++++++++++++++++++
 3 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c3feeb..a972acd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1049,6 +1049,8 @@ static int migrate_swap_stop(void *data)
 	src_rq = cpu_rq(arg->src_cpu);
 	dst_rq = cpu_rq(arg->dst_cpu);
 
+	double_raw_lock(&arg->src_task->pi_lock,
+			&arg->dst_task->pi_lock);
 	double_rq_lock(src_rq, dst_rq);
 	if (task_cpu(arg->dst_task) != arg->dst_cpu)
 		goto unlock;
@@ -1069,6 +1071,8 @@ static int migrate_swap_stop(void *data)
 
 unlock:
 	double_rq_unlock(src_rq, dst_rq);
+	raw_spin_unlock(&arg->dst_task->pi_lock);
+	raw_spin_unlock(&arg->src_task->pi_lock);
 
 	return ret;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4aa0b10..813dd61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1448,15 +1448,6 @@ static inline void put_numa_group(struct numa_group *grp)
 		kfree_rcu(grp, rcu);
 }
 
-static void double_lock(spinlock_t *l1, spinlock_t *l2)
-{
-	if (l1 > l2)
-		swap(l1, l2);
-
-	spin_lock(l1);
-	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
-}
-
 static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 			int *priv)
 {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d69cb32..ffc7087 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1249,6 +1249,24 @@ static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
 	lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
 }
 
+static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	spin_lock(l1);
+	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
+static inline void double_raw_lock(raw_spinlock_t *l1, raw_spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	raw_spin_lock(l1);
+	raw_spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
 /*
  * double_rq_lock - safely lock two runqueues
  *


* Automatic NUMA balancing patches for tip-urgent/stable
  2013-10-07 10:28 ` Mel Gorman
@ 2013-10-24 12:26   ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-24 12:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Srikar Dronamraju, Peter Zijlstra, Rik van Riel, Tom Weber,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Mon, Oct 07, 2013 at 11:28:38AM +0100, Mel Gorman wrote:
> This series has roughly the same goals as previous versions despite the
> size. It reduces overhead of automatic balancing through scan rate reduction
> and the avoidance of TLB flushes. It selects a preferred node and moves tasks
> towards their memory as well as moving memory toward their task. It handles
> shared pages and groups related tasks together. Some problems such as shared
> page interleaving and properly dealing with processes that are larger than
> a node are being deferred. This version should be ready for wider testing
> in -tip.
> 

Hi Ingo,

Off-list we talked with Peter about the fact that automatic NUMA
balancing as merged in 3.10, 3.11 and 3.12 shortly may corrupt
userspace memory. There is one LKML report on this that I'm aware of --
https://lkml.org/lkml/2013/7/31/647 which I promptly forgot to follow up
on properly. The user-visible effect is that pages get filled with zeros
with results such as null pointer exceptions in JVMs. It is fairly difficult
to trigger but it became much easier to trigger during the development of
the series "Basic scheduler support for automatic NUMA balancing" which
is how it was discovered and finally fixed.

In that series I tagged patches 2-9 for -stable as these patches addressed
the problem for me. I did not call it out as clearly as I should have
and did not realise the cc: stable tags were stripped. Worse, as it was
close to the release and the bug is relatively old I was ok with waiting
until 3.12 came out and then treating it as a -stable backport. It has been
highlighted that this is the wrong attitude and we should consider merging
the fixes now and backporting to -stable sooner rather than later.

The most important patches are 

mm: Wait for THP migrations to complete during NUMA hinting fault
mm: Prevent parallel splits during THP migration
mm: Close races between THP migration and PMD numa clearing

but on their own they will cause conflicts with tricky fixups and -stable
would differ from mainline in annoying ways. Patches 2-9 have been heavily
tested in isolation so I'm reasonably confident they fix the problem and are
-stable material. While strictly speaking not all the patches are required
for the fix, the -stable kernels would then be directly comparable with
3.13 when the full NUMA balancing series is applied. If I rework them at
this point then I'll also have to retest, delaying things until next week.

Please consider queueing patches 2-9 for 3.12 via -urgent if it is not
too late and preserve the cc: stable tags so Greg will pick them up
automatically.

Thanks

-- 
Mel Gorman
SUSE Labs


* Re: Automatic NUMA balancing patches for tip-urgent/stable
  2013-10-24 12:26   ` Mel Gorman
@ 2013-10-26 12:11     ` Ingo Molnar
  -1 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-10-26 12:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Peter Zijlstra, Rik van Riel, Tom Weber,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> On Mon, Oct 07, 2013 at 11:28:38AM +0100, Mel Gorman wrote:
> > This series has roughly the same goals as previous versions despite the
> > size. It reduces overhead of automatic balancing through scan rate reduction
> > and the avoidance of TLB flushes. It selects a preferred node and moves tasks
> > towards their memory as well as moving memory toward their task. It handles
> > shared pages and groups related tasks together. Some problems such as shared
> > page interleaving and properly dealing with processes that are larger than
> > a node are being deferred. This version should be ready for wider testing
> > in -tip.
> > 
> 
> Hi Ingo,
> 
> Off-list we talked with Peter about the fact that automatic NUMA
> balancing as merged in 3.10, 3.11 and 3.12 shortly may corrupt
> userspace memory. There is one LKML report on this that I'm aware of --
> https://lkml.org/lkml/2013/7/31/647 which I prompt forgot to follow up
> properly on . The user-visible effect is that pages get filled with zeros
> with results such as null pointer exceptions in JVMs. It is fairly difficult
> to trigger but it became much easier to trigger during the development of
> the series "Basic scheduler support for automatic NUMA balancing" which
> is how it was discovered and finally fixed.
> 
> In that series I tagged patches 2-9 for -stable as these patches addressed
> the problem for me. I did not call it out as clearly as I should have
> and did not realise the cc: stable tags were stripped. Worse, as it was
> close to the release and the bug is relatively old I was ok with waiting
> until 3.12 came out and then treat it as a -stable backport. It has been
> highlighted that this is the wrong attitude and we should consider merging
> the fixes now and backporting to -stable sooner rather than later.
> 
> The most important patches are 
> 
> mm: Wait for THP migrations to complete during NUMA hinting fault
> mm: Prevent parallel splits during THP migration
> mm: Close races between THP migration and PMD numa clearing
> 
> but on their own they will cause conflicts with tricky fixups and -stable
> would differ from mainline in annoying ways. Patches 2-9 have been heavily
> tested in isolation so I'm reasonably confident they fix the problem and are
> -stable material. While strictly speaking not all the patches are required
> for the fix, the -stable kernels would then be directly comparable with
> 3.13 when the full NUMA balancing series is applied. If I rework them at
> this point then I'll also have to retest delaying things until next week.
> 
> Please consider queueing patches 2-9 for 3.12 via -urgent if it is 
> not too late and preserve the cc: stable tags so Greg will pick 
> them up automatically.

Would be nice if you gave me all the specific SHA1 tags of 
sched/core that are required for the fix. We can certainly
use a range to make it all safer to apply.

Thanks,

	Ingo


* Re: Automatic NUMA balancing patches for tip-urgent/stable
  2013-10-26 12:11     ` Ingo Molnar
@ 2013-10-29  9:42       ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-29  9:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Srikar Dronamraju, Peter Zijlstra, Rik van Riel, Tom Weber,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Sat, Oct 26, 2013 at 02:11:48PM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Mon, Oct 07, 2013 at 11:28:38AM +0100, Mel Gorman wrote:
> > > This series has roughly the same goals as previous versions despite the
> > > size. It reduces overhead of automatic balancing through scan rate reduction
> > > and the avoidance of TLB flushes. It selects a preferred node and moves tasks
> > > towards their memory as well as moving memory toward their task. It handles
> > > shared pages and groups related tasks together. Some problems such as shared
> > > page interleaving and properly dealing with processes that are larger than
> > > a node are being deferred. This version should be ready for wider testing
> > > in -tip.
> > > 
> > 
> > Hi Ingo,
> > 
> > Off-list we talked with Peter about the fact that automatic NUMA
> > balancing as merged in 3.10, 3.11 and 3.12 shortly may corrupt
> > userspace memory. There is one LKML report on this that I'm aware of --
> > https://lkml.org/lkml/2013/7/31/647 which I prompt forgot to follow up
> > properly on . The user-visible effect is that pages get filled with zeros
> > with results such as null pointer exceptions in JVMs. It is fairly difficult
> > to trigger but it became much easier to trigger during the development of
> > the series "Basic scheduler support for automatic NUMA balancing" which
> > is how it was discovered and finally fixed.
> > 
> > In that series I tagged patches 2-9 for -stable as these patches addressed
> > the problem for me. I did not call it out as clearly as I should have
> > and did not realise the cc: stable tags were stripped. Worse, as it was
> > close to the release and the bug is relatively old I was ok with waiting
> > until 3.12 came out and then treat it as a -stable backport. It has been
> > highlighted that this is the wrong attitude and we should consider merging
> > the fixes now and backporting to -stable sooner rather than later.
> > 
> > The most important patches are 
> > 
> > mm: Wait for THP migrations to complete during NUMA hinting fault
> > mm: Prevent parallel splits during THP migration
> > mm: Close races between THP migration and PMD numa clearing
> > 
> > but on their own they will cause conflicts with tricky fixups and -stable
> > would differ from mainline in annoying ways. Patches 2-9 have been heavily
> > tested in isolation so I'm reasonably confident they fix the problem and are
> > -stable material. While strictly speaking not all the patches are required
> > for the fix, the -stable kernels would then be directly comparable with
> > 3.13 when the full NUMA balancing series is applied. If I rework them at
> > this point then I'll also have to retest delaying things until next week.
> > 
> > Please consider queueing patches 2-9 for 3.12 via -urgent if it is 
> > not too late and preserve the cc: stable tags so Greg will pick 
> > them up automatically.
> 
> Would be nice if you gave me all the specific SHA1 tags of 
> sched/core that are required for the fix. We can certainly
> use a range to make it all safer to apply.
> 

Of course. The range of the relevant commits in tip/sched/core is
ca4be374c5c0ab3d8b84fb2861d663216281e6ac..778ec5247bb79815af12434980164334fb94cc9e

904f64a376e663cd459fb7aec4f12e14c39c24b6 mm: numa: Document automatic NUMA balancing sysctls
1d649bccc8c1370e402b85e1d345ad24f3f0d1b5 sched, numa: Comment fixlets
f961cab8d55d55d6abc0df08ce2abec8ab56f2c8 mm: numa: Do not account for a hinting fault if we raced
6f2a15fc1df62af3ba3be327877b7e53cb16e878 mm: Wait for THP migrations to complete during NUMA hinting faults
4ee547f994c633f2607d222e2c6385b6fe5f07d8 mm: Prevent parallel splits during THP migration
dd83227f0d93fb37d7621a24e8465b13b437faa6 mm: numa: Sanitize task_numa_fault() callsites
efeeacf7b94babff85da7e468fc5450fdfab0900 mm: Close races between THP migration and PMD numa clearing
778ec5247bb79815af12434980164334fb94cc9e mm: Account for a THP NUMA hinting update as one PTE update

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Automatic NUMA balancing patches for tip-urgent/stable
  2013-10-29  9:42       ` Mel Gorman
@ 2013-10-29  9:48         ` Ingo Molnar
  -1 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-10-29  9:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Peter Zijlstra, Rik van Riel, Tom Weber,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> > Would be nice if you gave me all the specific SHA1 tags of 
> > sched/core that are required for the fix. We can certainly use a 
> > range to make it all safer to apply.
> 
> Of course. The range of the relevant commits in tip/sched/core is
> ca4be374c5c0ab3d8b84fb2861d663216281e6ac..778ec5247bb79815af12434980164334fb94cc9e
> 
> 904f64a376e663cd459fb7aec4f12e14c39c24b6 mm: numa: Document automatic NUMA balancing sysctls
> 1d649bccc8c1370e402b85e1d345ad24f3f0d1b5 sched, numa: Comment fixlets
> f961cab8d55d55d6abc0df08ce2abec8ab56f2c8 mm: numa: Do not account for a hinting fault if we raced
> 6f2a15fc1df62af3ba3be327877b7e53cb16e878 mm: Wait for THP migrations to complete during NUMA hinting faults
> 4ee547f994c633f2607d222e2c6385b6fe5f07d8 mm: Prevent parallel splits during THP migration
> dd83227f0d93fb37d7621a24e8465b13b437faa6 mm: numa: Sanitize task_numa_fault() callsites
> efeeacf7b94babff85da7e468fc5450fdfab0900 mm: Close races between THP migration and PMD numa clearing
> 778ec5247bb79815af12434980164334fb94cc9e mm: Account for a THP NUMA hinting update as one PTE update

These commits don't exist in -tip :-/

Some of these don't even exist as patch titles under different 
sha1's - such as "sched, numa: Comment fixlets".

So I'm really confused about what to pick up. What tree are you 
looking at?

-tip is at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Automatic NUMA balancing patches for tip-urgent/stable
  2013-10-29  9:48         ` Ingo Molnar
@ 2013-10-29 10:24           ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-29 10:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Srikar Dronamraju, Peter Zijlstra, Rik van Riel, Tom Weber,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Tue, Oct 29, 2013 at 10:48:56AM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > > Would be nice if you gave me all the specific SHA1 tags of 
> > > sched/core that are required for the fix. We can certainly use a 
> > > range to make it all safer to apply.
> > 
> > Of course. The range of the relevant commits in tip/sched/core is
> > ca4be374c5c0ab3d8b84fb2861d663216281e6ac..778ec5247bb79815af12434980164334fb94cc9e
> > 
> > 904f64a376e663cd459fb7aec4f12e14c39c24b6 mm: numa: Document automatic NUMA balancing sysctls
> > 1d649bccc8c1370e402b85e1d345ad24f3f0d1b5 sched, numa: Comment fixlets
> > f961cab8d55d55d6abc0df08ce2abec8ab56f2c8 mm: numa: Do not account for a hinting fault if we raced
> > 6f2a15fc1df62af3ba3be327877b7e53cb16e878 mm: Wait for THP migrations to complete during NUMA hinting faults
> > 4ee547f994c633f2607d222e2c6385b6fe5f07d8 mm: Prevent parallel splits during THP migration
> > dd83227f0d93fb37d7621a24e8465b13b437faa6 mm: numa: Sanitize task_numa_fault() callsites
> > efeeacf7b94babff85da7e468fc5450fdfab0900 mm: Close races between THP migration and PMD numa clearing
> > 778ec5247bb79815af12434980164334fb94cc9e mm: Account for a THP NUMA hinting update as one PTE update
> 
> These commits don't exist in -tip :-/
> 

Bah, I have tip as a remote tree but looked at my local copy of the
commits in the incorrect branch. Let's try this again

37bf06375c90a42fe07b9bebdb07bc316ae5a0ce..afcae2655b0ab67e65f161b1bb214efcfa1db415

10fc05d0e551146ad6feb0ab8902d28a2d3c5624 mm: numa: Document automatic NUMA balancing sysctls
c69307d533d7aa7cc8894dbbb8a274599f8630d7 sched/numa: Fix comments
0c3a775e1e0b069bf765f8355b723ce0d18dcc6c mm: numa: Do not account for a hinting fault if we raced
ff9042b11a71c81238c70af168cd36b98a6d5a3c mm: Wait for THP migrations to complete during NUMA hinting faults
b8916634b77bffb233d8f2f45703c80343457cc1 mm: Prevent parallel splits during THP migration
8191acbd30c73e45c24ad16c372e0b42cc7ac8f8 mm: numa: Sanitize task_numa_fault() callsites
a54a407fbf7735fd8f7841375574f5d9b0375f93 mm: Close races between THP migration and PMD numa clearing
afcae2655b0ab67e65f161b1bb214efcfa1db415 mm: Account for a THP NUMA hinting update as one PTE update
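
FWIW, a list like the above can be regenerated from that range with
something along these lines (assuming the tip tree has been added as a
remote named "tip" locally; adjust the remote name to taste):

  git fetch tip
  git log --oneline --reverse --no-merges \
      37bf06375c90a42fe07b9bebdb07bc316ae5a0ce..afcae2655b0ab67e65f161b1bb214efcfa1db415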

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Automatic NUMA balancing patches for tip-urgent/stable
  2013-10-29 10:24           ` Mel Gorman
@ 2013-10-29 10:41             ` Ingo Molnar
  -1 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-10-29 10:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Peter Zijlstra, Rik van Riel, Tom Weber,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML


* Mel Gorman <mgorman@suse.de> wrote:

> Bah, I have tip as a remote tree but looked at my local copy of 
> the commits in the incorrect branch. Let's try this again
> 
> 37bf06375c90a42fe07b9bebdb07bc316ae5a0ce..afcae2655b0ab67e65f161b1bb214efcfa1db415

Ok, these work a lot better and cherry-pick cleanly on top of -rc7.

> 10fc05d0e551146ad6feb0ab8902d28a2d3c5624 mm: numa: Document automatic NUMA balancing sysctls

We can certainly leave out this one - the rest still cherry-picks 
cleanly.

> c69307d533d7aa7cc8894dbbb8a274599f8630d7 sched/numa: Fix comments

I was able to leave out this one as well.

> 0c3a775e1e0b069bf765f8355b723ce0d18dcc6c mm: numa: Do not account for a hinting fault if we raced
> ff9042b11a71c81238c70af168cd36b98a6d5a3c mm: Wait for THP migrations to complete during NUMA hinting faults
> b8916634b77bffb233d8f2f45703c80343457cc1 mm: Prevent parallel splits during THP migration
> 8191acbd30c73e45c24ad16c372e0b42cc7ac8f8 mm: numa: Sanitize task_numa_fault() callsites
> a54a407fbf7735fd8f7841375574f5d9b0375f93 mm: Close races between THP migration and PMD numa clearing
> afcae2655b0ab67e65f161b1bb214efcfa1db415 mm: Account for a THP NUMA hinting update as one PTE update

Ok, these seem essential and cherry-pick cleanly.

Would be nice to avoid the 'Sanitize task_numa_fault() callsites' 
change, but the remaining fixes rely on it and are well tested 
together.

I've stuck these into tip:core/urgent with a -stable tag and will 
send them to Linus if he cuts an -rc8 (which seems unlikely at this 
point though).

If there's no -rc8 then please forward the above list of 6 commits 
to Greg so that it can be applied to -stable.
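
As a rough sketch, assuming the six commits cherry-pick as cleanly onto
the relevant stable branch as they did onto -rc7, applying them there
boils down to something like this (the base tag below is only an example,
not from this thread):

  git checkout -b numa-fixes v3.11.6      # example -stable base
  git cherry-pick 0c3a775e1e0b ff9042b11a71 b8916634b77b \
                  8191acbd30c7 a54a407fbf77 afcae2655b0a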

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [tip:core/urgent] mm: numa: Do not account for a hinting fault if we raced
  2013-10-07 10:28   ` Mel Gorman
                     ` (2 preceding siblings ...)
  (?)
@ 2013-10-29 10:42   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-29 10:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, stable, aarcange,
	srikar, mgorman, tglx

Commit-ID:  1dd49bfa3465756b3ce72214b58a33e4afb67aa3
Gitweb:     http://git.kernel.org/tip/1dd49bfa3465756b3ce72214b58a33e4afb67aa3
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:42 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 11:37:05 +0100

mm: numa: Do not account for a hinting fault if we raced

If another task handled a hinting fault in parallel then do not double
account for it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-5-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 610e3df..33ee637 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1325,8 +1325,11 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 check_same:
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(pmd, *pmdp)))
+	if (unlikely(!pmd_same(pmd, *pmdp))) {
+		/* Someone else took our fault */
+		current_nid = -1;
 		goto out_unlock;
+	}
 clear_pmdnuma:
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:core/urgent] mm: Wait for THP migrations to complete during NUMA hinting faults
  2013-10-07 10:28   ` Mel Gorman
                     ` (2 preceding siblings ...)
  (?)
@ 2013-10-29 10:42   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-29 10:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, stable, aarcange,
	srikar, mgorman, tglx

Commit-ID:  42836f5f8baa33085f547098b74aa98991ee9216
Gitweb:     http://git.kernel.org/tip/42836f5f8baa33085f547098b74aa98991ee9216
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:43 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 11:37:19 +0100

mm: Wait for THP migrations to complete during NUMA hinting faults

The locking for migrating THP is unusual. While normal page migration
prevents parallel accesses using a migration PTE, THP migration relies on
a combination of the page_table_lock, the page lock and the existence of
the NUMA hinting PTE to guarantee safety but there is a bug in the scheme.

If a THP page is currently being migrated and another thread traps a
fault on the same page it checks if the page is misplaced. If it is not,
then pmd_numa is cleared. The problem is that it checks if the page is
misplaced without holding the page lock meaning that the racing thread
can be migrating the THP when the second thread clears the NUMA bit
and faults a stale page.

This patch checks if the page is potentially being migrated and, if so,
stalls using lock_page before checking if the page is misplaced or not.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-6-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 33ee637..e10d780 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1295,13 +1295,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
-	target_nid = mpol_misplaced(page, vma, haddr);
-	if (target_nid == -1) {
-		put_page(page);
-		goto clear_pmdnuma;
-	}
+	/*
+	 * Acquire the page lock to serialise THP migrations but avoid dropping
+	 * page_table_lock if at all possible
+	 */
+	if (trylock_page(page))
+		goto got_lock;
 
-	/* Acquire the page lock to serialise THP migrations */
+	/* Serialise against migrationa and check placement check placement */
 	spin_unlock(&mm->page_table_lock);
 	lock_page(page);
 
@@ -1312,9 +1313,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		put_page(page);
 		goto out_unlock;
 	}
-	spin_unlock(&mm->page_table_lock);
+
+got_lock:
+	target_nid = mpol_misplaced(page, vma, haddr);
+	if (target_nid == -1) {
+		unlock_page(page);
+		put_page(page);
+		goto clear_pmdnuma;
+	}
 
 	/* Migrate the THP to the requested node */
+	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
 	if (!migrated)

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:core/urgent] mm: Prevent parallel splits during THP migration
  2013-10-07 10:28   ` Mel Gorman
                     ` (2 preceding siblings ...)
  (?)
@ 2013-10-29 10:42   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-29 10:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, stable, aarcange,
	srikar, mgorman, tglx

Commit-ID:  587fe586f44a48f9691001ba6c45b86c8e4ba21f
Gitweb:     http://git.kernel.org/tip/587fe586f44a48f9691001ba6c45b86c8e4ba21f
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:44 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 11:37:39 +0100

mm: Prevent parallel splits during THP migration

THP migrations are serialised by the page lock but on its own that does
not prevent THP splits. If the page is split during THP migration then
the pmd_same checks will prevent page table corruption but the page unlock
and other fix-ups can potentially cause corruption. This patch takes the
anon_vma lock to prevent parallel splits during migration.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-7-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 44 ++++++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e10d780..d8534b3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1278,18 +1278,18 @@ out:
 int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
+	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int target_nid;
 	int current_nid = -1;
-	bool migrated;
+	bool migrated, page_locked;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
-	get_page(page);
 	current_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (current_nid == numa_node_id())
@@ -1299,12 +1299,29 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Acquire the page lock to serialise THP migrations but avoid dropping
 	 * page_table_lock if at all possible
 	 */
-	if (trylock_page(page))
-		goto got_lock;
+	page_locked = trylock_page(page);
+	target_nid = mpol_misplaced(page, vma, haddr);
+	if (target_nid == -1) {
+		/* If the page was locked, there are no parallel migrations */
+		if (page_locked) {
+			unlock_page(page);
+			goto clear_pmdnuma;
+		}
 
-	/* Serialise against migrationa and check placement check placement */
+		/* Otherwise wait for potential migrations and retry fault */
+		spin_unlock(&mm->page_table_lock);
+		wait_on_page_locked(page);
+		goto out;
+	}
+
+	/* Page is misplaced, serialise migrations and parallel THP splits */
+	get_page(page);
 	spin_unlock(&mm->page_table_lock);
-	lock_page(page);
+	if (!page_locked) {
+		lock_page(page);
+		page_locked = true;
+	}
+	anon_vma = page_lock_anon_vma_read(page);
 
 	/* Confirm the PTE did not while locked */
 	spin_lock(&mm->page_table_lock);
@@ -1314,14 +1331,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_unlock;
 	}
 
-got_lock:
-	target_nid = mpol_misplaced(page, vma, haddr);
-	if (target_nid == -1) {
-		unlock_page(page);
-		put_page(page);
-		goto clear_pmdnuma;
-	}
-
 	/* Migrate the THP to the requested node */
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1330,6 +1339,8 @@ got_lock:
 		goto check_same;
 
 	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	if (anon_vma)
+		page_unlock_anon_vma_read(anon_vma);
 	return 0;
 
 check_same:
@@ -1346,6 +1357,11 @@ clear_pmdnuma:
 	update_mmu_cache_pmd(vma, addr, pmdp);
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+
+out:
+	if (anon_vma)
+		page_unlock_anon_vma_read(anon_vma);
+
 	if (current_nid != -1)
 		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
 	return 0;

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:core/urgent] mm: numa: Sanitize task_numa_fault() callsites
  2013-10-07 10:28   ` Mel Gorman
                     ` (2 preceding siblings ...)
  (?)
@ 2013-10-29 10:42   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-29 10:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, stable, aarcange,
	srikar, mgorman, tglx

Commit-ID:  c61109e34f60f6e85bb43c5a1cd51c0e3db40847
Gitweb:     http://git.kernel.org/tip/c61109e34f60f6e85bb43c5a1cd51c0e3db40847
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:45 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 11:37:52 +0100

mm: numa: Sanitize task_numa_fault() callsites

There are three callers of task_numa_fault():

 - do_huge_pmd_numa_page():
     Accounts against the current node, not the node where the
     page resides, unless we migrated, in which case it accounts
     against the node we migrated to.

 - do_numa_page():
     Accounts against the current node, not the node where the
     page resides, unless we migrated, in which case it accounts
     against the node we migrated to.

 - do_pmd_numa_page():
     Accounts not at all when the page isn't migrated, otherwise
     accounts against the node we migrated towards.

This seems wrong to me; all three sites should have the same
semantics. Furthermore, we should account against where the page
really is; we already know where the task is.

So modify all three sites to always account; we did after all receive
the fault; and always account to where the page is after migration,
regardless of success.

They all still differ on when they clear the PTE/PMD; ideally that
would get sorted too.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 25 +++++++++++++------------
 mm/memory.c      | 53 +++++++++++++++++++++--------------------------------
 2 files changed, 34 insertions(+), 44 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8534b3..00ddfcd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1281,18 +1281,19 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	int page_nid = -1, this_nid = numa_node_id();
 	int target_nid;
-	int current_nid = -1;
-	bool migrated, page_locked;
+	bool page_locked;
+	bool migrated = false;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
-	current_nid = page_to_nid(page);
+	page_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	/*
@@ -1335,19 +1336,18 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
-	if (!migrated)
+	if (migrated)
+		page_nid = target_nid;
+	else
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
-	if (anon_vma)
-		page_unlock_anon_vma_read(anon_vma);
-	return 0;
+	goto out;
 
 check_same:
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		/* Someone else took our fault */
-		current_nid = -1;
+		page_nid = -1;
 		goto out_unlock;
 	}
 clear_pmdnuma:
@@ -1362,8 +1362,9 @@ out:
 	if (anon_vma)
 		page_unlock_anon_vma_read(anon_vma);
 
-	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+	if (page_nid != -1)
+		task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 1311f26..d176154 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3521,12 +3521,12 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-				unsigned long addr, int current_nid)
+				unsigned long addr, int page_nid)
 {
 	get_page(page);
 
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (page_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	return mpol_misplaced(page, vma, addr);
@@ -3537,7 +3537,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int page_nid = -1;
 	int target_nid;
 	bool migrated = false;
 
@@ -3567,15 +3567,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
-	current_nid = page_to_nid(page);
-	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
+	page_nid = page_to_nid(page);
+	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
-		/*
-		 * Account for the fault against the current node if it not
-		 * being replaced regardless of where the page is located.
-		 */
-		current_nid = numa_node_id();
 		put_page(page);
 		goto out;
 	}
@@ -3583,11 +3578,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Migrate to the requested node */
 	migrated = migrate_misplaced_page(page, target_nid);
 	if (migrated)
-		current_nid = target_nid;
+		page_nid = target_nid;
 
 out:
-	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+	if (page_nid != -1)
+		task_numa_fault(page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3602,7 +3597,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
-	int local_nid = numa_node_id();
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3625,9 +3619,10 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page *page;
-		int curr_nid = local_nid;
+		int page_nid = -1;
 		int target_nid;
-		bool migrated;
+		bool migrated = false;
+
 		if (!pte_present(pteval))
 			continue;
 		if (!pte_numa(pteval))
@@ -3649,25 +3644,19 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(page_mapcount(page) != 1))
 			continue;
 
-		/*
-		 * Note that the NUMA fault is later accounted to either
-		 * the node that is currently running or where the page is
-		 * migrated to.
-		 */
-		curr_nid = local_nid;
-		target_nid = numa_migrate_prep(page, vma, addr,
-					       page_to_nid(page));
-		if (target_nid == -1) {
+		page_nid = page_to_nid(page);
+		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+		pte_unmap_unlock(pte, ptl);
+		if (target_nid != -1) {
+			migrated = migrate_misplaced_page(page, target_nid);
+			if (migrated)
+				page_nid = target_nid;
+		} else {
 			put_page(page);
-			continue;
 		}
 
-		/* Migrate to the requested node */
-		pte_unmap_unlock(pte, ptl);
-		migrated = migrate_misplaced_page(page, target_nid);
-		if (migrated)
-			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		if (page_nid != -1)
+			task_numa_fault(page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:core/urgent] mm: Close races between THP migration and PMD numa clearing
  2013-10-07 10:28   ` Mel Gorman
                     ` (2 preceding siblings ...)
  (?)
@ 2013-10-29 10:42   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-29 10:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, stable, aarcange,
	srikar, mgorman, tglx

Commit-ID:  3f926ab945b60a5824369d21add7710622a2eac0
Gitweb:     http://git.kernel.org/tip/3f926ab945b60a5824369d21add7710622a2eac0
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:46 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 11:38:05 +0100

mm: Close races between THP migration and PMD numa clearing

THP migration uses the page lock to guard against parallel allocations
but there are cases like this still open

  Task A					Task B
  ---------------------				---------------------
  do_huge_pmd_numa_page				do_huge_pmd_numa_page
  lock_page
  mpol_misplaced == -1
  unlock_page
  goto clear_pmdnuma
						lock_page
						mpol_misplaced == 2
						migrate_misplaced_transhuge
  pmd = pmd_mknonnuma
  set_pmd_at

During hours of testing, one crashed with weird errors and while I have
no direct evidence, I suspect something like the race above happened.
This patch extends the page lock to being held until the pmd_numa is
cleared to prevent migration starting in parallel while the pmd_numa is
being cleared. It also flushes the old pmd entry and orders pagetable
insertion before rmap insertion.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 33 +++++++++++++++------------------
 mm/migrate.c     | 19 +++++++++++--------
 2 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 00ddfcd..cca80d9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1304,24 +1304,25 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
 		/* If the page was locked, there are no parallel migrations */
-		if (page_locked) {
-			unlock_page(page);
+		if (page_locked)
 			goto clear_pmdnuma;
-		}
 
-		/* Otherwise wait for potential migrations and retry fault */
+		/*
+		 * Otherwise wait for potential migrations and retry. We do
+		 * relock and check_same as the page may no longer be mapped.
+		 * As the fault is being retried, do not account for it.
+		 */
 		spin_unlock(&mm->page_table_lock);
 		wait_on_page_locked(page);
+		page_nid = -1;
 		goto out;
 	}
 
 	/* Page is misplaced, serialise migrations and parallel THP splits */
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
-	if (!page_locked) {
+	if (!page_locked)
 		lock_page(page);
-		page_locked = true;
-	}
 	anon_vma = page_lock_anon_vma_read(page);
 
 	/* Confirm the PTE did not while locked */
@@ -1329,32 +1330,28 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		unlock_page(page);
 		put_page(page);
+		page_nid = -1;
 		goto out_unlock;
 	}
 
-	/* Migrate the THP to the requested node */
+	/*
+	 * Migrate the THP to the requested node, returns with page unlocked
+	 * and pmd_numa cleared.
+	 */
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
 	if (migrated)
 		page_nid = target_nid;
-	else
-		goto check_same;
 
 	goto out;
-
-check_same:
-	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(pmd, *pmdp))) {
-		/* Someone else took our fault */
-		page_nid = -1;
-		goto out_unlock;
-	}
 clear_pmdnuma:
+	BUG_ON(!PageLocked(page));
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
 	VM_BUG_ON(pmd_numa(*pmdp));
 	update_mmu_cache_pmd(vma, addr, pmdp);
+	unlock_page(page);
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 7a7325e..c046927 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1715,12 +1715,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		unlock_page(new_page);
 		put_page(new_page);		/* Free it */
 
-		unlock_page(page);
+		/* Retake the callers reference and putback on LRU */
+		get_page(page);
 		putback_lru_page(page);
-
-		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
-		isolated = 0;
-		goto out;
+		mod_zone_page_state(page_zone(page),
+			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
+		goto out_fail;
 	}
 
 	/*
@@ -1737,9 +1737,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 	entry = pmd_mkhuge(entry);
 
-	page_add_new_anon_rmap(new_page, vma, haddr);
-
+	pmdp_clear_flush(vma, haddr, pmd);
 	set_pmd_at(mm, haddr, pmd, entry);
+	page_add_new_anon_rmap(new_page, vma, haddr);
 	update_mmu_cache_pmd(vma, address, &entry);
 	page_remove_rmap(page);
 	/*
@@ -1758,7 +1758,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
 	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
 
-out:
 	mod_zone_page_state(page_zone(page),
 			NR_ISOLATED_ANON + page_lru,
 			-HPAGE_PMD_NR);
@@ -1767,6 +1766,10 @@ out:
 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 out_dropref:
+	entry = pmd_mknonnuma(entry);
+	set_pmd_at(mm, haddr, pmd, entry);
+	update_mmu_cache_pmd(vma, address, &entry);
+
 	unlock_page(page);
 	put_page(page);
 	return 0;

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* [tip:core/urgent] mm: Account for a THP NUMA hinting update as one PTE update
  2013-10-07 10:28   ` Mel Gorman
                     ` (2 preceding siblings ...)
  (?)
@ 2013-10-29 10:43   ` tip-bot for Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-29 10:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, stable, aarcange,
	srikar, mgorman, tglx

Commit-ID:  0255d491848032f6c601b6410c3b8ebded3a37b1
Gitweb:     http://git.kernel.org/tip/0255d491848032f6c601b6410c3b8ebded3a37b1
Author:     Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:47 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 11:38:17 +0100

mm: Account for a THP NUMA hinting update as one PTE update

A THP PMD update is accounted for as 512 pages updated in vmstat. This is a
large difference when estimating the cost of automatic NUMA balancing and
can be misleading when comparing results that had collapsed versus split
THP. This patch addresses the accounting issue.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/mprotect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index a3af058..412ba2b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -148,7 +148,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				split_huge_page_pmd(vma, addr, pmd);
 			else if (change_huge_pmd(vma, pmd, addr, newprot,
 						 prot_numa)) {
-				pages += HPAGE_PMD_NR;
+				pages++;
 				continue;
 			}
 			/* fall through */

^ permalink raw reply related	[flat|nested] 340+ messages in thread

* Re: Automatic NUMA balancing patches for tip-urgent/stable
  2013-10-29 10:41             ` Ingo Molnar
@ 2013-10-29 12:48               ` Mel Gorman
  -1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-29 12:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Srikar Dronamraju, Peter Zijlstra, Rik van Riel, Tom Weber,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Tue, Oct 29, 2013 at 11:41:02AM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > Bah, I have tip as a remote tree but looked at my local copy of 
> > the commits in the incorrect branch. Let's try this again
> > 
> > 37bf06375c90a42fe07b9bebdb07bc316ae5a0ce..afcae2655b0ab67e65f161b1bb214efcfa1db415
> 
> Ok, these work a lot better and cherry-pick cleanly on top of -rc7.
> 
> > 10fc05d0e551146ad6feb0ab8902d28a2d3c5624 mm: numa: Document automatic NUMA balancing sysctls
> 
> We can certainly leave out this one - the rest still cherry-picks 
> cleanly.
> 
> > c69307d533d7aa7cc8894dbbb8a274599f8630d7 sched/numa: Fix comments
> 
> I was able to leave out this one as well.
> 

Yes, both of those can be left out. They are outside the stable rules and
including them to have comparable documentation in -stable is unnecessary.

> > 0c3a775e1e0b069bf765f8355b723ce0d18dcc6c mm: numa: Do not account for a hinting fault if we raced
> > ff9042b11a71c81238c70af168cd36b98a6d5a3c mm: Wait for THP migrations to complete during NUMA hinting faults
> > b8916634b77bffb233d8f2f45703c80343457cc1 mm: Prevent parallel splits during THP migration
> > 8191acbd30c73e45c24ad16c372e0b42cc7ac8f8 mm: numa: Sanitize task_numa_fault() callsites
> > a54a407fbf7735fd8f7841375574f5d9b0375f93 mm: Close races between THP migration and PMD numa clearing
> > afcae2655b0ab67e65f161b1bb214efcfa1db415 mm: Account for a THP NUMA hinting update as one PTE update
> 
> Ok, these seem essential and cherry-pick cleanly.
> 
> Would be nice to avoid the 'Sanitize task_numa_fault() callsites' 
> change, but the remaining fixes rely on it and are well tested 
> together.
> 

It would have been nice but yes, the combination of patches would not have
been well tested.

> I've stuck these into tip:core/urgent with a -stable tag and will 
> send them to Linus if he cuts an -rc8 (which seems unlikely at this 
> point though).
> 
> If there's no -rc8 then please forward the above list of 6 commits 
> to Greg so that it can be applied to -stable.
> 

Thanks, I'll make sure they get sent to stable when the relevant patches
show up in a mainline tree of some description.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [RFC GIT PULL] NUMA-balancing memory corruption fixes
  2013-10-24 12:26   ` Mel Gorman
@ 2013-10-31  9:51     ` Ingo Molnar
  -1 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-10-31  9:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srikar Dronamraju, Peter Zijlstra, Rik van Riel, Tom Weber,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman


Linus,

* Mel Gorman <mgorman@suse.de> wrote:

> Hi Ingo,
> 
> Off-list we talked with Peter about the fact that automatic NUMA 
> balancing as merged in 3.10, 3.11 and 3.12 shortly may corrupt 
> userspace memory. [...]

So these fixes are definitely not something I'd like to sit on, but 
as I said to you at the KS the timing is quite tight, with Linus 
planning v3.12-final within a week.

Fedora-19 is affected:

 comet:~> grep NUMA_BALANCING /boot/config-3.11.3-201.fc19.x86_64 

 CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
 CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
 CONFIG_NUMA_BALANCING=y

AFAICS Ubuntu will be affected as well, once it updates the kernel:

 hubble:~> grep NUMA_BALANCING /boot/config-3.8.0-32-generic 

 CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
 CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
 CONFIG_NUMA_BALANCING=y
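
Whether balancing is actually active on a given running kernel can also be
checked at runtime; as a rough sketch (assuming the usual sysctl knob is
present on these kernels):

 cat /proc/sys/kernel/numa_balancing    # 1 = enabled, 0 = disabled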

Linus, please consider pulling the latest core-urgent-for-linus git 
tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-urgent-for-linus

   # HEAD: 0255d491848032f6c601b6410c3b8ebded3a37b1 mm: Account for a THP NUMA hinting update as one PTE update
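
   For reference, pulling that amounts to roughly:

      git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-urgent-for-linus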

These 6 commits are a minimal set of cherry-picks needed to fix 
the memory corruption bugs. All commits are fixes, except "mm: numa: 
Sanitize task_numa_fault() callsites" which is a cleanup that made 
two followup fixes simpler.

I've done targeted testing with just this SHA1 to try to make sure 
there are no cherry-picking artifacts. The original 
non-cherry-picked set of fixes were exposed to linux-next for a 
couple of weeks.

( If you think this is too dangerous for too little benefit then
  I'll drop this separate tree and will send the original commits in 
  the merge window. )

 Thanks,

	Ingo

------------------>
Mel Gorman (6):
      mm: numa: Do not account for a hinting fault if we raced
      mm: Wait for THP migrations to complete during NUMA hinting faults
      mm: Prevent parallel splits during THP migration
      mm: numa: Sanitize task_numa_fault() callsites
      mm: Close races between THP migration and PMD numa clearing
      mm: Account for a THP NUMA hinting update as one PTE update


 mm/huge_memory.c | 70 ++++++++++++++++++++++++++++++++++++++------------------
 mm/memory.c      | 53 +++++++++++++++++-------------------------
 mm/migrate.c     | 19 ++++++++-------
 mm/mprotect.c    |  2 +-
 4 files changed, 81 insertions(+), 63 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 610e3df..cca80d9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1278,64 +1278,90 @@ out:
 int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
+	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	int page_nid = -1, this_nid = numa_node_id();
 	int target_nid;
-	int current_nid = -1;
-	bool migrated;
+	bool page_locked;
+	bool migrated = false;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
-	get_page(page);
-	current_nid = page_to_nid(page);
+	page_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
+	/*
+	 * Acquire the page lock to serialise THP migrations but avoid dropping
+	 * page_table_lock if at all possible
+	 */
+	page_locked = trylock_page(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
 	if (target_nid == -1) {
-		put_page(page);
-		goto clear_pmdnuma;
+		/* If the page was locked, there are no parallel migrations */
+		if (page_locked)
+			goto clear_pmdnuma;
+
+		/*
+		 * Otherwise wait for potential migrations and retry. We do
+		 * relock and check_same as the page may no longer be mapped.
+		 * As the fault is being retried, do not account for it.
+		 */
+		spin_unlock(&mm->page_table_lock);
+		wait_on_page_locked(page);
+		page_nid = -1;
+		goto out;
 	}
 
-	/* Acquire the page lock to serialise THP migrations */
+	/* Page is misplaced, serialise migrations and parallel THP splits */
+	get_page(page);
 	spin_unlock(&mm->page_table_lock);
-	lock_page(page);
+	if (!page_locked)
+		lock_page(page);
+	anon_vma = page_lock_anon_vma_read(page);
 
 	/* Confirm the PTE did not while locked */
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		unlock_page(page);
 		put_page(page);
+		page_nid = -1;
 		goto out_unlock;
 	}
-	spin_unlock(&mm->page_table_lock);
 
-	/* Migrate the THP to the requested node */
+	/*
+	 * Migrate the THP to the requested node, returns with page unlocked
+	 * and pmd_numa cleared.
+	 */
+	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
-	if (!migrated)
-		goto check_same;
-
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
-	return 0;
+	if (migrated)
+		page_nid = target_nid;
 
-check_same:
-	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(pmd, *pmdp)))
-		goto out_unlock;
+	goto out;
 clear_pmdnuma:
+	BUG_ON(!PageLocked(page));
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
 	VM_BUG_ON(pmd_numa(*pmdp));
 	update_mmu_cache_pmd(vma, addr, pmdp);
+	unlock_page(page);
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
-	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+
+out:
+	if (anon_vma)
+		page_unlock_anon_vma_read(anon_vma);
+
+	if (page_nid != -1)
+		task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 1311f26..d176154 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3521,12 +3521,12 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-				unsigned long addr, int current_nid)
+				unsigned long addr, int page_nid)
 {
 	get_page(page);
 
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (page_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	return mpol_misplaced(page, vma, addr);
@@ -3537,7 +3537,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int page_nid = -1;
 	int target_nid;
 	bool migrated = false;
 
@@ -3567,15 +3567,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
-	current_nid = page_to_nid(page);
-	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
+	page_nid = page_to_nid(page);
+	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
-		/*
-		 * Account for the fault against the current node if it not
-		 * being replaced regardless of where the page is located.
-		 */
-		current_nid = numa_node_id();
 		put_page(page);
 		goto out;
 	}
@@ -3583,11 +3578,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Migrate to the requested node */
 	migrated = migrate_misplaced_page(page, target_nid);
 	if (migrated)
-		current_nid = target_nid;
+		page_nid = target_nid;
 
 out:
-	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+	if (page_nid != -1)
+		task_numa_fault(page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3602,7 +3597,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
-	int local_nid = numa_node_id();
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3625,9 +3619,10 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page *page;
-		int curr_nid = local_nid;
+		int page_nid = -1;
 		int target_nid;
-		bool migrated;
+		bool migrated = false;
+
 		if (!pte_present(pteval))
 			continue;
 		if (!pte_numa(pteval))
@@ -3649,25 +3644,19 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(page_mapcount(page) != 1))
 			continue;
 
-		/*
-		 * Note that the NUMA fault is later accounted to either
-		 * the node that is currently running or where the page is
-		 * migrated to.
-		 */
-		curr_nid = local_nid;
-		target_nid = numa_migrate_prep(page, vma, addr,
-					       page_to_nid(page));
-		if (target_nid == -1) {
+		page_nid = page_to_nid(page);
+		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+		pte_unmap_unlock(pte, ptl);
+		if (target_nid != -1) {
+			migrated = migrate_misplaced_page(page, target_nid);
+			if (migrated)
+				page_nid = target_nid;
+		} else {
 			put_page(page);
-			continue;
 		}
 
-		/* Migrate to the requested node */
-		pte_unmap_unlock(pte, ptl);
-		migrated = migrate_misplaced_page(page, target_nid);
-		if (migrated)
-			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		if (page_nid != -1)
+			task_numa_fault(page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index 7a7325e..c046927 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1715,12 +1715,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		unlock_page(new_page);
 		put_page(new_page);		/* Free it */
 
-		unlock_page(page);
+		/* Retake the callers reference and putback on LRU */
+		get_page(page);
 		putback_lru_page(page);
-
-		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
-		isolated = 0;
-		goto out;
+		mod_zone_page_state(page_zone(page),
+			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
+		goto out_fail;
 	}
 
 	/*
@@ -1737,9 +1737,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 	entry = pmd_mkhuge(entry);
 
-	page_add_new_anon_rmap(new_page, vma, haddr);
-
+	pmdp_clear_flush(vma, haddr, pmd);
 	set_pmd_at(mm, haddr, pmd, entry);
+	page_add_new_anon_rmap(new_page, vma, haddr);
 	update_mmu_cache_pmd(vma, address, &entry);
 	page_remove_rmap(page);
 	/*
@@ -1758,7 +1758,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
 	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
 
-out:
 	mod_zone_page_state(page_zone(page),
 			NR_ISOLATED_ANON + page_lru,
 			-HPAGE_PMD_NR);
@@ -1767,6 +1766,10 @@ out:
 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 out_dropref:
+	entry = pmd_mknonnuma(entry);
+	set_pmd_at(mm, haddr, pmd, entry);
+	update_mmu_cache_pmd(vma, address, &entry);
+
 	unlock_page(page);
 	put_page(page);
 	return 0;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a3af058..412ba2b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -148,7 +148,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				split_huge_page_pmd(vma, addr, pmd);
 			else if (change_huge_pmd(vma, pmd, addr, newprot,
 						 prot_numa)) {
-				pages += HPAGE_PMD_NR;
+				pages++;
 				continue;
 			}
 			/* fall through */

^ permalink raw reply related	[flat|nested] 340+ messages in thread
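
A minimal userspace model of the serialisation pattern in the
do_huge_pmd_numa_page() hunk above: try to take the page lock without
dropping the page table lock; if someone else already holds it (a THP
migration in flight), drop everything, wait for the migration to finish
and retry the fault instead of accounting it. Two pthread mutexes stand
in for page_table_lock and the page lock; all names are illustrative
and none of this is kernel API.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t page_table_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page_lock       = PTHREAD_MUTEX_INITIALIZER;

/* Returns true if the "fault" was handled, false if it must be retried. */
static bool handle_hinting_fault(bool misplaced)
{
	pthread_mutex_lock(&page_table_lock);

	/* Grab the page lock opportunistically while still holding the PTL. */
	bool page_locked = (pthread_mutex_trylock(&page_lock) == 0);

	if (!misplaced) {
		if (page_locked) {
			/* No parallel migration: clear the NUMA state and finish. */
			pthread_mutex_unlock(&page_lock);
			pthread_mutex_unlock(&page_table_lock);
			return true;
		}
		/* Possible migration in flight: drop the PTL, wait, retry. */
		pthread_mutex_unlock(&page_table_lock);
		pthread_mutex_lock(&page_lock);  /* stands in for wait_on_page_locked() */
		pthread_mutex_unlock(&page_lock);
		return false;
	}

	/* Misplaced page: the page lock must be held across the migration. */
	pthread_mutex_unlock(&page_table_lock);
	if (!page_locked)
		pthread_mutex_lock(&page_lock);
	/* ... recheck the PMD under the PTL and migrate here ... */
	pthread_mutex_unlock(&page_lock);
	return true;
}

int main(void)
{
	printf("misplaced fault handled: %d\n", handle_hinting_fault(true));
	printf("well-placed fault handled: %d\n", handle_hinting_fault(false));
	return 0;
}

Build with "gcc -pthread model.c"; it only demonstrates the locking
order, not the page table or rmap handling that the real fix also has
to get right.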

* Re: [RFC GIT PULL] NUMA-balancing memory corruption fixes
  2013-10-31  9:51     ` Ingo Molnar
@ 2013-10-31 22:25       ` Linus Torvalds
  -1 siblings, 0 replies; 340+ messages in thread
From: Linus Torvalds @ 2013-10-31 22:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Srikar Dronamraju, Peter Zijlstra, Rik van Riel, Tom Weber,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman

On Thu, Oct 31, 2013 at 2:51 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> ( If you think this is too dangerous for too little benefit then
>   I'll drop this separate tree and will send the original commits in
>   the merge window. )

Ugh. I hate hate hate the timing, and this is much larger and scarier
than what I'd like at this point, but I don't see the point to
delaying this either.

So I'm pulling them. And then I may end up doing an rc8 after all.

                 Linus

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC GIT PULL] NUMA-balancing memory corruption fixes
  2013-10-31 22:25       ` Linus Torvalds
@ 2013-11-01  7:36         ` Ingo Molnar
  -1 siblings, 0 replies; 340+ messages in thread
From: Ingo Molnar @ 2013-11-01  7:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srikar Dronamraju, Peter Zijlstra, Rik van Riel, Tom Weber,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Mel Gorman


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Oct 31, 2013 at 2:51 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > ( If you think this is too dangerous for too little benefit then
> >   I'll drop this separate tree and will send the original commits in
> >   the merge window. )
> 
> Ugh. I hate hate hate the timing, and this is much larger and 
> scarier than what I'd like at this point, but I don't see the 
> point to delaying this either.

Yeah, it's pretty close to worst-case timing. I tried to accelerate 
the fixes as much as I dared; I wasn't even back from the KS yet but 
at another conference, doing all the preparation and testing remotely 
:-/ Still, the timing sucks.

> So I'm pulling them. And then I may end up doing an rc8 after all.

Thanks for pulling them!

	Ingo

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 19/63] sched: Track NUMA hinting faults on per-node basis
  2013-10-07 10:28   ` Mel Gorman
                     ` (2 preceding siblings ...)
  (?)
@ 2013-12-04  5:32   ` Wanpeng Li
  2013-12-04  5:37     ` Wanpeng Li
  -1 siblings, 1 reply; 340+ messages in thread
From: Wanpeng Li @ 2013-12-04  5:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Mon, Oct 07, 2013 at 11:28:57AM +0100, Mel Gorman wrote:
>This patch tracks what nodes numa hinting faults were incurred on.
>This information is later used to schedule a task on the node storing
>the pages most frequently faulted by the task.
>
>Signed-off-by: Mel Gorman <mgorman@suse.de>
>---
> include/linux/sched.h |  2 ++
> kernel/sched/core.c   |  3 +++
> kernel/sched/fair.c   | 11 ++++++++++-
> kernel/sched/sched.h  | 12 ++++++++++++
> 4 files changed, 27 insertions(+), 1 deletion(-)
>
>diff --git a/include/linux/sched.h b/include/linux/sched.h
>index a8095ad..8828e40 100644
>--- a/include/linux/sched.h
>+++ b/include/linux/sched.h
>@@ -1332,6 +1332,8 @@ struct task_struct {
> 	unsigned int numa_scan_period_max;
> 	u64 node_stamp;			/* migration stamp  */
> 	struct callback_head numa_work;
>+
>+	unsigned long *numa_faults;
> #endif /* CONFIG_NUMA_BALANCING */
>
> 	struct rcu_head rcu;
>diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>index 681945e..aad2e02 100644
>--- a/kernel/sched/core.c
>+++ b/kernel/sched/core.c
>@@ -1629,6 +1629,7 @@ static void __sched_fork(struct task_struct *p)
> 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> 	p->numa_work.next = &p->numa_work;
>+	p->numa_faults = NULL;
> #endif /* CONFIG_NUMA_BALANCING */
>
> 	cpu_hotplug_init_task(p);
>@@ -1892,6 +1893,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> 	if (mm)
> 		mmdrop(mm);
> 	if (unlikely(prev_state == TASK_DEAD)) {
>+		task_numa_free(prev);

Function task_numa_free() depends on patch 43/63.

Regards,
Wanpeng Li 

>+
> 		/*
> 		 * Remove function-return probe instances associated with this
> 		 * task and put them back on the free list.
>diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>index 8cea7a2..df300d9 100644
>--- a/kernel/sched/fair.c
>+++ b/kernel/sched/fair.c
>@@ -902,7 +902,14 @@ void task_numa_fault(int node, int pages, bool migrated)
> 	if (!numabalancing_enabled)
> 		return;
>
>-	/* FIXME: Allocate task-specific structure for placement policy here */
>+	/* Allocate buffer to track faults on a per-node basis */
>+	if (unlikely(!p->numa_faults)) {
>+		int size = sizeof(*p->numa_faults) * nr_node_ids;
>+
>+		p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
>+		if (!p->numa_faults)
>+			return;
>+	}
>
> 	/*
> 	 * If pages are properly placed (did not migrate) then scan slower.
>@@ -918,6 +925,8 @@ void task_numa_fault(int node, int pages, bool migrated)
> 	}
>
> 	task_numa_placement(p);
>+
>+	p->numa_faults[node] += pages;
> }
>
> static void reset_ptenuma_scan(struct task_struct *p)
>diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>index b3c5653..6a955f4 100644
>--- a/kernel/sched/sched.h
>+++ b/kernel/sched/sched.h
>@@ -6,6 +6,7 @@
> #include <linux/spinlock.h>
> #include <linux/stop_machine.h>
> #include <linux/tick.h>
>+#include <linux/slab.h>
>
> #include "cpupri.h"
> #include "cpuacct.h"
>@@ -552,6 +553,17 @@ static inline u64 rq_clock_task(struct rq *rq)
> 	return rq->clock_task;
> }
>
>+#ifdef CONFIG_NUMA_BALANCING
>+static inline void task_numa_free(struct task_struct *p)
>+{
>+	kfree(p->numa_faults);
>+}
>+#else /* CONFIG_NUMA_BALANCING */
>+static inline void task_numa_free(struct task_struct *p)
>+{
>+}
>+#endif /* CONFIG_NUMA_BALANCING */
>+
> #ifdef CONFIG_SMP
>
> #define rcu_dereference_check_sched_domain(p) \
>-- 
>1.8.4
>

^ permalink raw reply	[flat|nested] 340+ messages in thread
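
The bookkeeping being reviewed above is small enough to model in
userspace. The sketch below mirrors the idea of the patch: lazily
allocate one counter per node, bump it on every hinting fault and, as
later patches in the series do, prefer the node that has accumulated
the most faults. The struct and helper names are made up for
illustration; the real policy in task_numa_placement() adds decay,
shared/private splits and group statistics on top of this.

#include <stdio.h>
#include <stdlib.h>

#define NR_NODES 4	/* stand-in for nr_node_ids */

struct task_model {
	unsigned long *numa_faults;	/* one slot per node, like p->numa_faults */
	int preferred_nid;
};

static void record_fault(struct task_model *t, int node, int pages)
{
	if (!t->numa_faults) {	/* lazily allocated, as in task_numa_fault() */
		t->numa_faults = calloc(NR_NODES, sizeof(*t->numa_faults));
		if (!t->numa_faults)
			return;
	}
	t->numa_faults[node] += pages;
}

static void place_task(struct task_model *t)
{
	unsigned long best = 0;

	t->preferred_nid = -1;
	for (int nid = 0; nid < NR_NODES; nid++) {
		if (t->numa_faults && t->numa_faults[nid] > best) {
			best = t->numa_faults[nid];
			t->preferred_nid = nid;
		}
	}
}

int main(void)
{
	struct task_model t = { .numa_faults = NULL, .preferred_nid = -1 };

	record_fault(&t, 1, 512);	/* e.g. a THP fault worth HPAGE_PMD_NR pages */
	record_fault(&t, 3, 64);
	place_task(&t);
	printf("preferred node: %d\n", t.preferred_nid);	/* prints 1 */
	free(t.numa_faults);
	return 0;
}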

* Re: [PATCH 19/63] sched: Track NUMA hinting faults on per-node basis
  2013-12-04  5:32   ` [PATCH 19/63] sched: " Wanpeng Li
@ 2013-12-04  5:37     ` Wanpeng Li
  0 siblings, 0 replies; 340+ messages in thread
From: Wanpeng Li @ 2013-12-04  5:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Dec 04, 2013 at 01:32:42PM +0800, Wanpeng Li wrote:
>On Mon, Oct 07, 2013 at 11:28:57AM +0100, Mel Gorman wrote:
>>This patch tracks what nodes numa hinting faults were incurred on.
>>This information is later used to schedule a task on the node storing
>>the pages most frequently faulted by the task.
>>
>>Signed-off-by: Mel Gorman <mgorman@suse.de>
>>---
>> include/linux/sched.h |  2 ++
>> kernel/sched/core.c   |  3 +++
>> kernel/sched/fair.c   | 11 ++++++++++-
>> kernel/sched/sched.h  | 12 ++++++++++++
>> 4 files changed, 27 insertions(+), 1 deletion(-)
>>
>>diff --git a/include/linux/sched.h b/include/linux/sched.h
>>index a8095ad..8828e40 100644
>>--- a/include/linux/sched.h
>>+++ b/include/linux/sched.h
>>@@ -1332,6 +1332,8 @@ struct task_struct {
>> 	unsigned int numa_scan_period_max;
>> 	u64 node_stamp;			/* migration stamp  */
>> 	struct callback_head numa_work;
>>+
>>+	unsigned long *numa_faults;
>> #endif /* CONFIG_NUMA_BALANCING */
>>
>> 	struct rcu_head rcu;
>>diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>index 681945e..aad2e02 100644
>>--- a/kernel/sched/core.c
>>+++ b/kernel/sched/core.c
>>@@ -1629,6 +1629,7 @@ static void __sched_fork(struct task_struct *p)
>> 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
>> 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
>> 	p->numa_work.next = &p->numa_work;
>>+	p->numa_faults = NULL;
>> #endif /* CONFIG_NUMA_BALANCING */
>>
>> 	cpu_hotplug_init_task(p);
>>@@ -1892,6 +1893,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>> 	if (mm)
>> 		mmdrop(mm);
>> 	if (unlikely(prev_state == TASK_DEAD)) {
>>+		task_numa_free(prev);
>
>Function task_numa_free() depends on patch 43/63.

Sorry, I missed it.

>
>Regards,
>Wanpeng Li 
>
>>+
>> 		/*
>> 		 * Remove function-return probe instances associated with this
>> 		 * task and put them back on the free list.
>>diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>index 8cea7a2..df300d9 100644
>>--- a/kernel/sched/fair.c
>>+++ b/kernel/sched/fair.c
>>@@ -902,7 +902,14 @@ void task_numa_fault(int node, int pages, bool migrated)
>> 	if (!numabalancing_enabled)
>> 		return;
>>
>>-	/* FIXME: Allocate task-specific structure for placement policy here */
>>+	/* Allocate buffer to track faults on a per-node basis */
>>+	if (unlikely(!p->numa_faults)) {
>>+		int size = sizeof(*p->numa_faults) * nr_node_ids;
>>+
>>+		p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
>>+		if (!p->numa_faults)
>>+			return;
>>+	}
>>
>> 	/*
>> 	 * If pages are properly placed (did not migrate) then scan slower.
>>@@ -918,6 +925,8 @@ void task_numa_fault(int node, int pages, bool migrated)
>> 	}
>>
>> 	task_numa_placement(p);
>>+
>>+	p->numa_faults[node] += pages;
>> }
>>
>> static void reset_ptenuma_scan(struct task_struct *p)
>>diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>index b3c5653..6a955f4 100644
>>--- a/kernel/sched/sched.h
>>+++ b/kernel/sched/sched.h
>>@@ -6,6 +6,7 @@
>> #include <linux/spinlock.h>
>> #include <linux/stop_machine.h>
>> #include <linux/tick.h>
>>+#include <linux/slab.h>
>>
>> #include "cpupri.h"
>> #include "cpuacct.h"
>>@@ -552,6 +553,17 @@ static inline u64 rq_clock_task(struct rq *rq)
>> 	return rq->clock_task;
>> }
>>
>>+#ifdef CONFIG_NUMA_BALANCING
>>+static inline void task_numa_free(struct task_struct *p)
>>+{
>>+	kfree(p->numa_faults);
>>+}
>>+#else /* CONFIG_NUMA_BALANCING */
>>+static inline void task_numa_free(struct task_struct *p)
>>+{
>>+}
>>+#endif /* CONFIG_NUMA_BALANCING */
>>+
>> #ifdef CONFIG_SMP
>>
>> #define rcu_dereference_check_sched_domain(p) \
>>-- 
>>1.8.4
>>

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 48/63] sched: numa: stay on the same node if CLONE_VM
  2013-09-27 13:26 [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V8 Mel Gorman
@ 2013-09-27 13:27   ` Mel Gorman
  0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-09-27 13:27 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

A newly spawned thread inside a process should stay on the same
NUMA node as its parent. This prevents processes from being "torn"
across multiple NUMA nodes every time they spawn a new thread.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  2 +-
 kernel/fork.c         |  2 +-
 kernel/sched/core.c   | 14 +++++++++-----
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 15888f5..4f51ceb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2005,7 +2005,7 @@ extern void wake_up_new_task(struct task_struct *tsk);
 #else
  static inline void kick_process(struct task_struct *tsk) { }
 #endif
-extern void sched_fork(struct task_struct *p);
+extern void sched_fork(unsigned long clone_flags, struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 extern void proc_caches_init(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b1f6af..51f6c4b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1310,7 +1310,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
-	sched_fork(p);
+	sched_fork(clone_flags, p);
 
 	retval = perf_event_init_task(p);
 	if (retval)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 123ac92..336a8ba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1699,7 +1699,7 @@ int wake_up_state(struct task_struct *p, unsigned int state)
  *
  * __sched_fork() is basic setup used by init_idle() too:
  */
-static void __sched_fork(struct task_struct *p)
+static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	p->on_rq			= 0;
 
@@ -1732,11 +1732,15 @@ static void __sched_fork(struct task_struct *p)
 		p->mm->numa_scan_seq = 0;
 	}
 
+	if (clone_flags & CLONE_VM)
+		p->numa_preferred_nid = current->numa_preferred_nid;
+	else
+		p->numa_preferred_nid = -1;
+
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = 1;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
-	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 	p->numa_faults_buffer = NULL;
@@ -1768,12 +1772,12 @@ void set_numabalancing_state(bool enabled)
 /*
  * fork()/clone()-time setup:
  */
-void sched_fork(struct task_struct *p)
+void sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	unsigned long flags;
 	int cpu = get_cpu();
 
-	__sched_fork(p);
+	__sched_fork(clone_flags, p);
 	/*
 	 * We mark the process as running here. This guarantees that
 	 * nobody will actually run it, and a signal or other external
@@ -4304,7 +4308,7 @@ void init_idle(struct task_struct *idle, int cpu)
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
-	__sched_fork(idle);
+	__sched_fork(0, idle);
 	idle->state = TASK_RUNNING;
 	idle->se.exec_start = sched_clock();
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 340+ messages in thread
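
A quick way to observe the behaviour this patch aims for from
userspace, assuming libnuma is installed (link with -lnuma): threads
created with pthread_create() are CLONE_VM children, so with this
change they should normally start out on the same NUMA node as the
thread that spawned them. The program below only reports where the
threads happen to run; the inherited numa_preferred_nid itself is not
visible from userspace.

#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *report(void *name)
{
	int cpu = sched_getcpu();

	/* Print the CPU and NUMA node this thread is currently running on. */
	printf("%-6s: cpu %d, node %d\n", (const char *)name, cpu,
	       numa_node_of_cpu(cpu));
	return NULL;
}

int main(void)
{
	pthread_t tid;

	if (numa_available() < 0) {
		fprintf(stderr, "libnuma: NUMA not available on this system\n");
		return 1;
	}

	report("parent");
	if (pthread_create(&tid, NULL, report, "child") != 0) {
		perror("pthread_create");
		return 1;
	}
	pthread_join(tid, NULL);
	return 0;
}

Build with "gcc -pthread observe.c -lnuma". On a multi-node machine the
child will usually report the parent's node right after creation,
although the scheduler is of course free to move it later.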

end of thread, other threads:[~2013-12-04  5:37 UTC | newest]

Thread overview: 340+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-07 10:28 [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9 Mel Gorman
2013-10-07 10:28 ` Mel Gorman
2013-10-07 10:28 ` [PATCH 01/63] hotplug: Optimize {get,put}_online_cpus() Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 10:28 ` [PATCH 02/63] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 12:46   ` Rik van Riel
2013-10-07 12:46     ` Rik van Riel
2013-10-09 17:24   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 03/63] sched, numa: Comment fixlets Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 12:46   ` Rik van Riel
2013-10-07 12:46     ` Rik van Riel
2013-10-09 17:24   ` [tip:sched/core] sched/numa: Fix comments tip-bot for Peter Zijlstra
2013-10-07 10:28 ` [PATCH 04/63] mm: numa: Do not account for a hinting fault if we raced Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 12:47   ` Rik van Riel
2013-10-07 12:47     ` Rik van Riel
2013-10-09 17:24   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-29 10:42   ` [tip:core/urgent] " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 05/63] mm: Wait for THP migrations to complete during NUMA hinting faults Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 13:55   ` Rik van Riel
2013-10-07 13:55     ` Rik van Riel
2013-10-09 17:24   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-29 10:42   ` [tip:core/urgent] " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 06/63] mm: Prevent parallel splits during THP migration Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 14:01   ` Rik van Riel
2013-10-07 14:01     ` Rik van Riel
2013-10-09 17:24   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-29 10:42   ` [tip:core/urgent] " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 07/63] mm: numa: Sanitize task_numa_fault() callsites Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 14:02   ` Rik van Riel
2013-10-07 14:02     ` Rik van Riel
2013-10-09 17:25   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-29 10:42   ` [tip:core/urgent] " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 08/63] mm: Close races between THP migration and PMD numa clearing Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 14:02   ` Rik van Riel
2013-10-07 14:02     ` Rik van Riel
2013-10-09 17:25   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-29 10:42   ` [tip:core/urgent] " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 09/63] mm: Account for a THP NUMA hinting update as one PTE update Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 14:02   ` Rik van Riel
2013-10-07 14:02     ` Rik van Riel
2013-10-09 17:25   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-29 10:43   ` [tip:core/urgent] " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 10/63] mm: Do not flush TLB during protection change if !pte_present && !migration_entry Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 15:12   ` Rik van Riel
2013-10-07 15:12     ` Rik van Riel
2013-10-09 17:25   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 11/63] mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-09 17:25   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 12/63] mm: numa: Do not migrate or account for hinting faults on the zero page Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 17:10   ` Rik van Riel
2013-10-07 17:10     ` Rik van Riel
2013-10-09 17:25   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 13/63] sched: numa: Mitigate chance that same task always updates PTEs Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 17:24   ` Rik van Riel
2013-10-07 17:24     ` Rik van Riel
2013-10-09 17:26   ` [tip:sched/core] sched/numa: " tip-bot for Peter Zijlstra
2013-10-07 10:28 ` [PATCH 14/63] sched: numa: Continue PTE scanning even if migrate rate limited Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 17:24   ` Rik van Riel
2013-10-07 17:24     ` Rik van Riel
2013-10-09 17:26   ` [tip:sched/core] sched/numa: " tip-bot for Peter Zijlstra
2013-10-07 10:28 ` [PATCH 15/63] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node" Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 17:42   ` Rik van Riel
2013-10-07 17:42     ` Rik van Riel
2013-10-09 17:26   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 16/63] sched: numa: Initialise numa_next_scan properly Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 17:44   ` Rik van Riel
2013-10-07 17:44     ` Rik van Riel
2013-10-09 17:26   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 17/63] sched: Set the scan rate proportional to the memory usage of the task being scanned Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 17:44   ` Rik van Riel
2013-10-07 17:44     ` Rik van Riel
2013-10-09 17:26   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 18/63] sched: numa: Slow scan rate if no NUMA hinting faults are being recorded Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 18:02   ` Rik van Riel
2013-10-07 18:02     ` Rik van Riel
2013-10-09 17:26   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 19/63] sched: Track NUMA hinting faults on per-node basis Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 18:02   ` Rik van Riel
2013-10-07 18:02     ` Rik van Riel
2013-10-09 17:27   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-12-04  5:32   ` [PATCH 19/63] sched: " Wanpeng Li
2013-12-04  5:37     ` Wanpeng Li
2013-10-07 10:28 ` [PATCH 20/63] sched: Select a preferred node with the most numa hinting faults Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 18:04   ` Rik van Riel
2013-10-07 18:04     ` Rik van Riel
2013-10-09 17:27   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:28 ` [PATCH 21/63] sched: Update NUMA hinting faults once per scan Mel Gorman
2013-10-07 10:28   ` Mel Gorman
2013-10-07 18:39   ` Rik van Riel
2013-10-07 18:39     ` Rik van Riel
2013-10-09 17:27   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 22/63] sched: Favour moving tasks towards the preferred node Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 18:39   ` Rik van Riel
2013-10-07 18:39     ` Rik van Riel
2013-10-09 17:27   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 23/63] sched: Resist moving tasks towards nodes with fewer hinting faults Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 18:40   ` Rik van Riel
2013-10-07 18:40     ` Rik van Riel
2013-10-09 17:27   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 24/63] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 18:40   ` Rik van Riel
2013-10-07 18:40     ` Rik van Riel
2013-10-09 17:27   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 25/63] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 18:41   ` Rik van Riel
2013-10-07 18:41     ` Rik van Riel
2013-10-09 17:28   ` [tip:sched/core] sched/numa: Add infrastructure for split shared/ private " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 26/63] sched: Check current->mm before allocating NUMA faults Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 18:41   ` Rik van Riel
2013-10-07 18:41     ` Rik van Riel
2013-10-09 17:28   ` [tip:sched/core] sched/numa: Check current-> mm " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 27/63] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 18:43   ` Rik van Riel
2013-10-07 18:43     ` Rik van Riel
2013-10-09 17:28   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 28/63] sched: Remove check that skips small VMAs Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 18:44   ` Rik van Riel
2013-10-07 18:44     ` Rik van Riel
2013-10-09 17:28   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 29/63] sched: Set preferred NUMA node based on number of private faults Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 18:45   ` Rik van Riel
2013-10-07 18:45     ` Rik van Riel
2013-10-09 17:28   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 30/63] sched: Do not migrate memory immediately after switching node Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:28   ` [tip:sched/core] sched/numa: " tip-bot for Rik van Riel
2013-10-07 10:29 ` [PATCH 31/63] mm: numa: only unmap migrate-on-fault VMAs Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:29   ` [tip:sched/core] mm: numa: Limit NUMA scanning to " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 32/63] sched: Avoid overloading CPUs on a preferred NUMA node Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 18:58   ` Rik van Riel
2013-10-07 18:58     ` Rik van Riel
2013-10-09 17:29   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 33/63] sched: Retry migration of tasks to CPU on a preferred node Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 18:58   ` Rik van Riel
2013-10-07 18:58     ` Rik van Riel
2013-10-09 17:29   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 34/63] sched: numa: increment numa_migrate_seq when task runs in correct location Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:29   ` [tip:sched/core] sched/numa: Increment " tip-bot for Rik van Riel
2013-10-07 10:29 ` [PATCH 35/63] sched: numa: Do not trap hinting faults for shared libraries Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:04   ` Rik van Riel
2013-10-07 19:04     ` Rik van Riel
2013-10-09 17:29   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 36/63] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:06   ` Rik van Riel
2013-10-07 19:06     ` Rik van Riel
2013-10-09 17:29   ` [tip:sched/core] mm: numa: Trap pmd hinting faults only " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 37/63] stop_machine: Introduce stop_two_cpus() Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:30   ` [tip:sched/core] " tip-bot for Peter Zijlstra
2013-10-07 10:29 ` [PATCH 38/63] sched: Introduce migrate_swap() Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:06   ` Rik van Riel
2013-10-07 19:06     ` Rik van Riel
2013-10-09 17:30   ` [tip:sched/core] sched/numa: " tip-bot for Peter Zijlstra
2013-10-10 18:17     ` Peter Zijlstra
2013-10-10 19:04       ` Rik van Riel
2013-10-15  9:55       ` Mel Gorman
2013-10-17 16:49       ` [tip:sched/core] sched: Fix race in migrate_swap_stop() tip-bot for Peter Zijlstra
2013-10-07 10:29 ` [PATCH 39/63] sched: numa: Use a system-wide search to find swap/migration candidates Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:07   ` Rik van Riel
2013-10-07 19:07     ` Rik van Riel
2013-10-09 17:30   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 40/63] sched: numa: Favor placing a task on the preferred node Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:07   ` Rik van Riel
2013-10-07 19:07     ` Rik van Riel
2013-10-09 17:30   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 41/63] sched: numa: fix placement of workloads spread across multiple nodes Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:30   ` [tip:sched/core] sched/numa: Fix " tip-bot for Rik van Riel
2013-10-07 10:29 ` [PATCH 42/63] mm: numa: Change page last {nid,pid} into {cpu,pid} Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:08   ` Rik van Riel
2013-10-07 19:08     ` Rik van Riel
2013-10-09 17:30   ` [tip:sched/core] mm: numa: Change page last {nid,pid} into {cpu, pid} tip-bot for Peter Zijlstra
2013-10-07 10:29 ` [PATCH 43/63] sched: numa: Use {cpu, pid} to create task groups for shared faults Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:09   ` Rik van Riel
2013-10-07 19:09     ` Rik van Riel
2013-10-09 17:31   ` [tip:sched/core] sched/numa: " tip-bot for Peter Zijlstra
2013-10-07 10:29 ` [PATCH 44/63] sched: numa: Report a NUMA task group ID Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:09   ` Rik van Riel
2013-10-07 19:09     ` Rik van Riel
2013-10-09 17:31   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 45/63] mm: numa: copy cpupid on page migration Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:31   ` [tip:sched/core] mm: numa: Copy " tip-bot for Rik van Riel
2013-10-07 10:29 ` [PATCH 46/63] mm: numa: Do not group on RO pages Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:10   ` Rik van Riel
2013-10-07 19:10     ` Rik van Riel
2013-10-09 17:31   ` [tip:sched/core] " tip-bot for Peter Zijlstra
2013-10-07 10:29 ` [PATCH 47/63] mm: numa: Do not batch handle PMD pages Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:11   ` Rik van Riel
2013-10-07 19:11     ` Rik van Riel
2013-10-09 17:31   ` [tip:sched/core] " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 48/63] sched: numa: stay on the same node if CLONE_VM Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:31   ` [tip:sched/core] sched/numa: Stay " tip-bot for Rik van Riel
2013-10-07 10:29 ` [PATCH 49/63] sched: numa: use group fault statistics in numa placement Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:32   ` [tip:sched/core] sched/numa: Use " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 50/63] sched: numa: call task_numa_free from do_execve Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:32   ` [tip:sched/core] sched/numa: Call task_numa_free() from do_execve () tip-bot for Rik van Riel
2013-10-07 10:29 ` [PATCH 51/63] sched: numa: Prevent parallel updates to group stats during placement Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:13   ` Rik van Riel
2013-10-07 19:13     ` Rik van Riel
2013-10-09 17:32   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 52/63] sched: numa: add debugging Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:13   ` Rik van Riel
2013-10-07 19:13     ` Rik van Riel
2013-10-09 17:32   ` [tip:sched/core] sched/numa: Add debugging tip-bot for Ingo Molnar
2013-10-07 10:29 ` [PATCH 53/63] sched: numa: Decide whether to favour task or group weights based on swap candidate relationships Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:32   ` [tip:sched/core] sched/numa: " tip-bot for Rik van Riel
2013-10-07 10:29 ` [PATCH 54/63] sched: numa: fix task or group comparison Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:32   ` [tip:sched/core] sched/numa: Fix " tip-bot for Rik van Riel
2013-10-07 10:29 ` [PATCH 55/63] sched: numa: Avoid migrating tasks that are placed on their preferred node Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:14   ` Rik van Riel
2013-10-07 19:14     ` Rik van Riel
2013-10-09 17:33   ` [tip:sched/core] sched/numa: " tip-bot for Peter Zijlstra
2013-10-07 10:29 ` [PATCH 56/63] sched: numa: be more careful about joining numa groups Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:33   ` [tip:sched/core] sched/numa: Be " tip-bot for Rik van Riel
2013-10-07 10:29 ` [PATCH 57/63] sched: numa: Take false sharing into account when adapting scan rate Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:14   ` Rik van Riel
2013-10-07 19:14     ` Rik van Riel
2013-10-09 17:33   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 58/63] sched: numa: adjust scan rate in task_numa_placement Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:33   ` [tip:sched/core] sched/numa: Adjust " tip-bot for Rik van Riel
2013-10-07 10:29 ` [PATCH 59/63] sched: numa: Remove the numa_balancing_scan_period_reset sysctl Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:14   ` Rik van Riel
2013-10-07 19:14     ` Rik van Riel
2013-10-09 17:33   ` [tip:sched/core] sched/numa: " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 60/63] mm: numa: revert temporarily disabling of NUMA migration Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:33   ` [tip:sched/core] mm: numa: Revert " tip-bot for Rik van Riel
2013-10-07 10:29 ` [PATCH 61/63] sched: numa: skip some page migrations after a shared fault Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:34   ` [tip:sched/core] sched/numa: Skip " tip-bot for Rik van Riel
2013-10-07 10:29 ` [PATCH 62/63] sched: numa: use unsigned longs for numa group fault stats Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-07 19:15   ` Rik van Riel
2013-10-07 19:15     ` Rik van Riel
2013-10-09 17:34   ` [tip:sched/core] sched/numa: Use " tip-bot for Mel Gorman
2013-10-07 10:29 ` [PATCH 63/63] sched: numa: periodically retry task_numa_migrate Mel Gorman
2013-10-07 10:29   ` Mel Gorman
2013-10-09 17:34   ` [tip:sched/core] sched/numa: Retry task_numa_migrate() periodically tip-bot for Rik van Riel
2013-10-09 11:03 ` [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9 Ingo Molnar
2013-10-09 11:03   ` Ingo Molnar
2013-10-09 11:11   ` Ingo Molnar
2013-10-09 11:11     ` Ingo Molnar
2013-10-09 11:13     ` Ingo Molnar
2013-10-09 11:13       ` Ingo Molnar
2013-10-09 12:05   ` Peter Zijlstra
2013-10-09 12:05     ` Peter Zijlstra
2013-10-09 12:48     ` Ingo Molnar
2013-10-09 12:48       ` Ingo Molnar
2013-10-10  7:05   ` Mel Gorman
2013-10-10  7:05     ` Mel Gorman
2013-10-09 16:28 ` Ingo Molnar
2013-10-09 16:29   ` Ingo Molnar
2013-10-09 16:57     ` Ingo Molnar
2013-10-09 16:57       ` Ingo Molnar
2013-10-09 17:09       ` Ingo Molnar
2013-10-09 17:09         ` Ingo Molnar
2013-10-09 17:11         ` Peter Zijlstra
2013-10-09 17:11           ` Peter Zijlstra
2013-10-09 17:08   ` Peter Zijlstra
2013-10-09 17:08     ` Peter Zijlstra
2013-10-09 17:15     ` Ingo Molnar
2013-10-09 17:15       ` Ingo Molnar
2013-10-09 17:18       ` Peter Zijlstra
2013-10-09 17:18         ` Peter Zijlstra
2013-10-24 12:26 ` Automatic NUMA balancing patches for tip-urgent/stable Mel Gorman
2013-10-24 12:26   ` Mel Gorman
2013-10-26 12:11   ` Ingo Molnar
2013-10-26 12:11     ` Ingo Molnar
2013-10-29  9:42     ` Mel Gorman
2013-10-29  9:42       ` Mel Gorman
2013-10-29  9:48       ` Ingo Molnar
2013-10-29  9:48         ` Ingo Molnar
2013-10-29 10:24         ` Mel Gorman
2013-10-29 10:24           ` Mel Gorman
2013-10-29 10:41           ` Ingo Molnar
2013-10-29 10:41             ` Ingo Molnar
2013-10-29 12:48             ` Mel Gorman
2013-10-29 12:48               ` Mel Gorman
2013-10-31  9:51   ` [RFC GIT PULL] NUMA-balancing memory corruption fixes Ingo Molnar
2013-10-31  9:51     ` Ingo Molnar
2013-10-31 22:25     ` Linus Torvalds
2013-10-31 22:25       ` Linus Torvalds
2013-11-01  7:36       ` Ingo Molnar
2013-11-01  7:36         ` Ingo Molnar
  -- strict thread matches above, loose matches on Subject: below --
2013-09-27 13:26 [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V8 Mel Gorman
2013-09-27 13:27 ` [PATCH 48/63] sched: numa: stay on the same node if CLONE_VM Mel Gorman
2013-09-27 13:27   ` Mel Gorman
