* [PATCH 0/63] Basic scheduler support for automatic NUMA balancing V9
@ 2013-10-07 10:28 ` Mel Gorman
0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
This series has roughly the same goals as previous versions despite the
size. It reduces the overhead of automatic balancing through scan rate
reduction and the avoidance of TLB flushes. It selects a preferred node and
moves tasks towards their memory as well as moving memory towards their
tasks. It handles shared pages and groups related tasks together. Some
problems, such as shared page interleaving and properly dealing with
processes that are larger than a node, are being deferred. This version
should be ready for wider testing in -tip.
Note that with kernel 3.12-rc3, a kernel with automatic NUMA balancing
enabled will fail to boot if CONFIG_JUMP_LABEL is configured. This is a
separate bug that is currently being dealt with.
Changelog since V8
o Rebased to v3.12-rc3
o Handle races against hotplug
Changelog since V7
o THP migration race and pagetable insertion fixes
o Do not handle PMDs in batch
o Shared page migration throttling
o Various changes to how last nid/pid information is recorded
o False pid match sanity checks when joining NUMA task groups
o Adapt scan rate based on local/remote fault statistics
o Periodic retry of migration to the preferred node
o Limit scope of system-wide search
o Schedule threads on the same node as process that created them
o Cleanup numa_group on exec
Changelog since V6
o Group tasks that share pages together
o More scan avoidance of VMAs mapping pages that are not likely to migrate
o cpupid conversion, system-wide searching of tasks to balance with
Changelog since V6
o Various TLB flush optimisations
o Comment updates
o Sanitise task_numa_fault callsites for consistent semantics
o Revert some of the scanning adaption stuff
o Revert patch that defers scanning until task schedules on another node
o Start delayed scanning properly
o Avoid the same task always performing the PTE scan
o Continue PTE scanning even if migration is rate limited
Changelog since V5
o Add __GFP_NOWARN for numa hinting fault count
o Use is_huge_zero_page
o Favour moving tasks towards nodes with higher faults
o Optionally resist moving tasks towards nodes with lower faults
o Scan shared THP pages
Changelog since V4
o Added code that avoids overloading preferred nodes
o Swap tasks if nodes are overloaded and the swap does not impair locality
Changelog since V3
o Correct detection of unset last nid/pid information
o Dropped nr_preferred_running and replaced it with Peter's load balancing
o Pass in correct node information for THP hinting faults
o Pressure tasks sharing a THP page to move towards same node
o Do not set pmd_numa if false sharing is detected
Changelog since V2
o Reshuffle to match Peter's implied preference for layout
o Reshuffle to move private/shared split towards end of series to make it
easier to evaluate the impact
o Use PID information to identify private accesses
o Set the floor for PTE scanning based on virtual address space scan rates
instead of time
o Some locking improvements
o Do not preempt pinned tasks unless they are kernel threads
Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
preferred node
o Laughably basic accounting of a compute overloaded node when selecting
the preferred node.
o Applied review comments
This series integrates basic scheduler support for automatic NUMA balancing.
It was initially based on Peter Zijlstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).
There has been a tonne of additional work from both Peter and Rik van Riel.
Some reports indicate that the performance is getting close to manual
bindings for some workloads but your mileage will vary.
Patch 1 is a monolithic dump of patches that are destined for upstream and that
this series indirectly depends upon.
Patches 2-3 add sysctl documentation and comment fixlets
Patch 4 avoids accounting for a hinting fault if another thread handled the
fault in parallel
Patches 5-6 avoid races with parallel THP migration and THP splits.
Patch 7 corrects a THP NUMA hint fault accounting bug
Patches 8-9 avoid TLB flushes during the PTE scan if no updates are made
Patch 10 sanitizes task_numa_fault callsites to have consistent semantics and
always record the fault based on the correct location of the page.
Patch 11 closes races between THP migration and PMD clearing.
Patch 12 avoids trying to migrate the THP zero page
Patch 13 avoids the same task being selected to perform the PTE scan within
a shared address space.
Patch 14 continues PTE scanning even if migration is rate limited
Patch 15 notes that delaying the PTE scan until a task is scheduled on an
alternative node misses the case where the task is only accessing
shared memory on a partially loaded machine and reverts a patch.
Patch 16 initialises numa_next_scan properly so that PTE scanning is delayed
when a process starts.
Patch 17 sets the scan rate proportional to the size of the task being
scanned.
Patch 18 slows the scan rate if no hinting faults were trapped by an idle task.
Patch 19 tracks NUMA hinting faults per-task and per-node
Patches 20-24 select a preferred node at the end of a PTE scan based on which
node incurred the highest number of NUMA faults. When the balancer
is comparing two CPUs it will prefer to locate tasks on their
preferred node. When the preferred node is initially selected, the task is
rescheduled on it if it is not running on that node already. This
avoids waiting for the scheduler to move the task slowly. (An illustrative
sketch of this selection follows the patch summaries.)
Patch 25 adds infrastructure to allow separate tracking of shared/private
pages but treats all faults as if they are private accesses. Laying
it out this way reduces churn later in the series when private
fault detection is introduced
Patch 26 avoids some unnecessary allocation
Patches 27-28 kick away some training wheels and scan shared pages and
small VMAs.
Patch 29 introduces private fault detection based on the PID of the faulting
process and accounts for shared/private accesses differently.
Patch 30 avoids migrating memory immediately after the load balancer moves
a task to another node in case it's a transient migration.
Patch 31 avoids scanning VMAs that do not migrate-on-fault which addresses
a serious regression on a database performance test.
Patch 32 picks the least loaded CPU on the preferred node based on
a scheduling domain common to both the source and destination
NUMA node.
Patch 33 retries task migration if an earlier attempt failed
Patch 34 will begin task migration immediately if running on its preferred
node
Patch 35 will avoid trapping hinting faults for shared read-only library
pages as these never migrate anyway
Patch 36 avoids handling pmd hinting faults if none of the ptes below it were
marked pte numa
Patches 37-38 introduce a mechanism for swapping tasks
Patch 39 uses a system-wide search to find tasks that can be swapped
to improve the overall locality of the system.
Patch 40 notes that the system-wide search may ignore the preferred node and
will use the preferred node placement if it has spare compute
capacity.
Patch 41 will perform a global search if a node that should have had capacity
cannot have a task migrated to it
Patches 42-43 use cpupid to track pages so potential sharing tasks can
be quickly found
Patch 44 reports the ID of the numa group a task belongs to.
Patch 45 copies the cpupid on page migration
Patch 46 avoids grouping based on read-only pages
Patch 47 stops handling pages within a PMD in batch as it distorted fault
statistics and failed to flush TLBs correctly.
Patch 48 schedules new threads on the same node as the parent.
Patch 49 schedules tasks based on their numa group
Patch 50 cleans up a task's numa_group on exec
Patch 51 avoids parallel updates to group stats
Patch 52 adds some debugging aids
Patches 53-54 separately consider task and group weights when selecting the node to
schedule a task on
Patch 56 checks if PID truncation may have caused false matches before joining tasks
to a NUMA group
Patch 57 uses the false sharing detection information for scan rate adaptation later
Patch 58 adapts the scan rate based on local/remote faults
Patch 59 removes the periodic scan rate reset
Patches 60-61 throttle shared page migrations
Patch 62 avoids the use of atomics and protects the values with a spinlock
Patch 63 periodically retries migrating a task back to its preferred node
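As an aside, below is a minimal userspace-style sketch of the preferred node
selection described for patches 19-24. It is illustrative only and not code
from this series; the names (NR_NODES, struct task_faults,
pick_preferred_node) are hypothetical. The idea is simply that per-task,
per-node hinting fault counts accumulated during the PTE scan determine the
preferred node.

#include <stdio.h>

#define NR_NODES	4

struct task_faults {
	unsigned long faults[NR_NODES];	/* NUMA hinting faults recorded per node */
};

/* Return the node that incurred the most faults, or -1 if none were recorded. */
static int pick_preferred_node(const struct task_faults *tf)
{
	unsigned long max_faults = 0;
	int nid, preferred = -1;

	for (nid = 0; nid < NR_NODES; nid++) {
		if (tf->faults[nid] > max_faults) {
			max_faults = tf->faults[nid];
			preferred = nid;
		}
	}
	return preferred;
}

int main(void)
{
	struct task_faults tf = { .faults = { 12, 480, 3, 55 } };

	printf("preferred node: %d\n", pick_preferred_node(&tf));
	return 0;
}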
Kernel 3.12-rc3 is the testing baseline.
o account-v9 Patches 1-8
o periodretry-v9 Patches 1-63
This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system.
specjbb
3.12.0-rc3 3.12.0-rc3
account-v9 periodretry-v9
TPut 1 26187.00 ( 0.00%) 25922.00 ( -1.01%)
TPut 2 55752.00 ( 0.00%) 53928.00 ( -3.27%)
TPut 3 88878.00 ( 0.00%) 84689.00 ( -4.71%)
TPut 4 111226.00 ( 0.00%) 111843.00 ( 0.55%)
TPut 5 138700.00 ( 0.00%) 139712.00 ( 0.73%)
TPut 6 173467.00 ( 0.00%) 161226.00 ( -7.06%)
TPut 7 197609.00 ( 0.00%) 194035.00 ( -1.81%)
TPut 8 220501.00 ( 0.00%) 218853.00 ( -0.75%)
TPut 9 247997.00 ( 0.00%) 244480.00 ( -1.42%)
TPut 10 275616.00 ( 0.00%) 269962.00 ( -2.05%)
TPut 11 301610.00 ( 0.00%) 301051.00 ( -0.19%)
TPut 12 326151.00 ( 0.00%) 318040.00 ( -2.49%)
TPut 13 341671.00 ( 0.00%) 346890.00 ( 1.53%)
TPut 14 372805.00 ( 0.00%) 367204.00 ( -1.50%)
TPut 15 390175.00 ( 0.00%) 371538.00 ( -4.78%)
TPut 16 406716.00 ( 0.00%) 409835.00 ( 0.77%)
TPut 17 429094.00 ( 0.00%) 436172.00 ( 1.65%)
TPut 18 457167.00 ( 0.00%) 456528.00 ( -0.14%)
TPut 19 476963.00 ( 0.00%) 479680.00 ( 0.57%)
TPut 20 492751.00 ( 0.00%) 480019.00 ( -2.58%)
TPut 21 514952.00 ( 0.00%) 511950.00 ( -0.58%)
TPut 22 521962.00 ( 0.00%) 516450.00 ( -1.06%)
TPut 23 537268.00 ( 0.00%) 532825.00 ( -0.83%)
TPut 24 541231.00 ( 0.00%) 539425.00 ( -0.33%)
TPut 25 530459.00 ( 0.00%) 538714.00 ( 1.56%)
TPut 26 538837.00 ( 0.00%) 524894.00 ( -2.59%)
TPut 27 534132.00 ( 0.00%) 519628.00 ( -2.72%)
TPut 28 529470.00 ( 0.00%) 519044.00 ( -1.97%)
TPut 29 504426.00 ( 0.00%) 514158.00 ( 1.93%)
TPut 30 514785.00 ( 0.00%) 513080.00 ( -0.33%)
TPut 31 501018.00 ( 0.00%) 492377.00 ( -1.72%)
TPut 32 488377.00 ( 0.00%) 492108.00 ( 0.76%)
TPut 33 484809.00 ( 0.00%) 493612.00 ( 1.82%)
TPut 34 473015.00 ( 0.00%) 477716.00 ( 0.99%)
TPut 35 451833.00 ( 0.00%) 455368.00 ( 0.78%)
TPut 36 445787.00 ( 0.00%) 460138.00 ( 3.22%)
TPut 37 446034.00 ( 0.00%) 453011.00 ( 1.56%)
TPut 38 433305.00 ( 0.00%) 441966.00 ( 2.00%)
TPut 39 431202.00 ( 0.00%) 443747.00 ( 2.91%)
TPut 40 420040.00 ( 0.00%) 432818.00 ( 3.04%)
TPut 41 416519.00 ( 0.00%) 424105.00 ( 1.82%)
TPut 42 426047.00 ( 0.00%) 430164.00 ( 0.97%)
TPut 43 421725.00 ( 0.00%) 419106.00 ( -0.62%)
TPut 44 414340.00 ( 0.00%) 425471.00 ( 2.69%)
TPut 45 413836.00 ( 0.00%) 418506.00 ( 1.13%)
TPut 46 403636.00 ( 0.00%) 421177.00 ( 4.35%)
TPut 47 387726.00 ( 0.00%) 388190.00 ( 0.12%)
TPut 48 405375.00 ( 0.00%) 418321.00 ( 3.19%)
Mostly flat. Profiles were interesting because they showed heavy contention
on the mm->page_table_lock due to THP faults and migration. It is expected
that Kirill's page table lock splitting work will help here. At the time
of writing that series has been rebased on top of this one for testing.
specjbb Peaks
3.12.0-rc3 3.12.0-rc3
account-v9 periodretry-v9
Expctd Warehouse 48.00 ( 0.00%) 48.00 ( 0.00%)
Expctd Peak Bops 387726.00 ( 0.00%) 388190.00 ( 0.12%)
Actual Warehouse 25.00 ( 0.00%) 25.00 ( 0.00%)
Actual Peak Bops 541231.00 ( 0.00%) 539425.00 ( -0.33%)
SpecJBB Bops 8273.00 ( 0.00%) 8537.00 ( 3.19%)
SpecJBB Bops/JVM 8273.00 ( 0.00%) 8537.00 ( 3.19%)
Minor gain in the overall specjbb score but the peak performance is
slightly lower.
3.12.0-rc3 3.12.0-rc3
account-v9 periodretry-v9
User 44731.08 44820.18
System 189.53 124.16
Elapsed 1665.71 1666.42
3.12.0-rc3 3.12.0-rc3
account-v9 periodretry-v9
Minor Faults 3815276 4471086
Major Faults 108 131
Compaction cost 12002 3214
NUMA PTE updates 17955537 3849428
NUMA hint faults 3950201 3822150
NUMA hint local faults 1032610 1029273
NUMA hint local percent 26 26
NUMA pages migrated 11562658 3096443
AutoNUMA cost 20096 19196
As with previous releases, system CPU usage is generally lower due to fewer
scans.
autonumabench
3.12.0-rc3 3.12.0-rc3
account-v9 periodretry-v9
User NUMA01 43871.21 ( 0.00%) 53162.55 (-21.18%)
User NUMA01_THEADLOCAL 25270.59 ( 0.00%) 28868.37 (-14.24%)
User NUMA02 2196.67 ( 0.00%) 2110.35 ( 3.93%)
User NUMA02_SMT 1039.18 ( 0.00%) 1035.41 ( 0.36%)
System NUMA01 187.11 ( 0.00%) 154.69 ( 17.33%)
System NUMA01_THEADLOCAL 216.47 ( 0.00%) 95.47 ( 55.90%)
System NUMA02 3.52 ( 0.00%) 3.26 ( 7.39%)
System NUMA02_SMT 2.42 ( 0.00%) 2.03 ( 16.12%)
Elapsed NUMA01 970.59 ( 0.00%) 1199.46 (-23.58%)
Elapsed NUMA01_THEADLOCAL 569.11 ( 0.00%) 643.37 (-13.05%)
Elapsed NUMA02 51.59 ( 0.00%) 49.94 ( 3.20%)
Elapsed NUMA02_SMT 49.73 ( 0.00%) 50.29 ( -1.13%)
CPU NUMA01 4539.00 ( 0.00%) 4445.00 ( 2.07%)
CPU NUMA01_THEADLOCAL 4478.00 ( 0.00%) 4501.00 ( -0.51%)
CPU NUMA02 4264.00 ( 0.00%) 4231.00 ( 0.77%)
CPU NUMA02_SMT 2094.00 ( 0.00%) 2062.00 ( 1.53%)
The numa01 (adverse) workload is hit quite badly but it often is. The
numa01-threadlocal regression is of greater concern and will be examined
further. It is interesting to note that monitoring the workload affects
the results quite severely; these results are based on no monitoring.
This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running per node on the system.
specjbb
3.12.0-rc3 3.12.0-rc3
account-v9 periodretry-v9
Mean 1 30900.00 ( 0.00%) 29541.50 ( -4.40%)
Mean 2 62820.50 ( 0.00%) 63330.25 ( 0.81%)
Mean 3 92803.00 ( 0.00%) 92629.75 ( -0.19%)
Mean 4 119122.25 ( 0.00%) 121981.75 ( 2.40%)
Mean 5 142391.00 ( 0.00%) 148290.50 ( 4.14%)
Mean 6 151073.00 ( 0.00%) 169823.75 ( 12.41%)
Mean 7 152618.50 ( 0.00%) 166411.00 ( 9.04%)
Mean 8 141284.25 ( 0.00%) 153222.00 ( 8.45%)
Mean 9 136055.25 ( 0.00%) 139262.50 ( 2.36%)
Mean 10 124290.50 ( 0.00%) 133464.50 ( 7.38%)
Mean 11 139939.25 ( 0.00%) 159681.25 ( 14.11%)
Mean 12 137545.75 ( 0.00%) 159829.50 ( 16.20%)
Mean 13 133607.25 ( 0.00%) 157809.00 ( 18.11%)
Mean 14 135512.00 ( 0.00%) 153510.50 ( 13.28%)
Mean 15 132730.75 ( 0.00%) 151627.25 ( 14.24%)
Mean 16 129924.25 ( 0.00%) 148248.00 ( 14.10%)
Mean 17 130339.00 ( 0.00%) 149250.00 ( 14.51%)
Mean 18 124314.25 ( 0.00%) 146486.50 ( 17.84%)
Mean 19 120331.25 ( 0.00%) 143616.75 ( 19.35%)
Mean 20 118827.25 ( 0.00%) 141381.50 ( 18.98%)
Mean 21 120938.25 ( 0.00%) 138196.75 ( 14.27%)
Mean 22 118660.75 ( 0.00%) 136879.50 ( 15.35%)
Mean 23 117005.75 ( 0.00%) 134200.50 ( 14.70%)
Mean 24 112711.50 ( 0.00%) 131302.50 ( 16.49%)
Mean 25 115458.50 ( 0.00%) 129939.25 ( 12.54%)
Mean 26 114008.50 ( 0.00%) 128834.50 ( 13.00%)
Mean 27 115063.50 ( 0.00%) 128394.00 ( 11.59%)
Mean 28 114359.50 ( 0.00%) 124072.50 ( 8.49%)
Mean 29 113637.50 ( 0.00%) 124954.50 ( 9.96%)
Mean 30 113392.75 ( 0.00%) 123941.75 ( 9.30%)
Mean 31 115131.25 ( 0.00%) 121477.75 ( 5.51%)
Mean 32 112004.00 ( 0.00%) 122235.00 ( 9.13%)
Mean 33 111287.50 ( 0.00%) 120992.50 ( 8.72%)
Mean 34 111206.75 ( 0.00%) 118769.75 ( 6.80%)
Mean 35 108469.50 ( 0.00%) 120061.50 ( 10.69%)
Mean 36 105932.00 ( 0.00%) 118039.75 ( 11.43%)
Mean 37 107428.00 ( 0.00%) 118295.75 ( 10.12%)
Mean 38 102804.75 ( 0.00%) 120519.50 ( 17.23%)
Mean 39 104095.00 ( 0.00%) 121461.50 ( 16.68%)
Mean 40 103460.00 ( 0.00%) 122506.50 ( 18.41%)
Mean 41 100417.00 ( 0.00%) 118570.50 ( 18.08%)
Mean 42 101025.75 ( 0.00%) 120612.00 ( 19.39%)
Mean 43 100311.75 ( 0.00%) 120743.50 ( 20.37%)
Mean 44 101769.00 ( 0.00%) 120410.25 ( 18.32%)
Mean 45 99649.25 ( 0.00%) 121260.50 ( 21.69%)
Mean 46 101178.50 ( 0.00%) 121210.75 ( 19.80%)
Mean 47 101148.75 ( 0.00%) 119994.25 ( 18.63%)
Mean 48 103446.00 ( 0.00%) 120204.50 ( 16.20%)
Stddev 1 940.15 ( 0.00%) 1277.19 (-35.85%)
Stddev 2 292.47 ( 0.00%) 1851.80 (-533.15%)
Stddev 3 1750.78 ( 0.00%) 1808.61 ( -3.30%)
Stddev 4 859.01 ( 0.00%) 2790.10 (-224.80%)
Stddev 5 3236.13 ( 0.00%) 1892.19 ( 41.53%)
Stddev 6 2489.07 ( 0.00%) 2157.76 ( 13.31%)
Stddev 7 1981.85 ( 0.00%) 4299.27 (-116.93%)
Stddev 8 2586.24 ( 0.00%) 3090.27 (-19.49%)
Stddev 9 7250.82 ( 0.00%) 4762.66 ( 34.32%)
Stddev 10 1242.89 ( 0.00%) 1448.14 (-16.51%)
Stddev 11 1631.31 ( 0.00%) 9758.25 (-498.19%)
Stddev 12 1964.66 ( 0.00%) 17425.60 (-786.95%)
Stddev 13 2080.24 ( 0.00%) 17824.45 (-756.84%)
Stddev 14 1362.07 ( 0.00%) 18551.85 (-1262.03%)
Stddev 15 3142.86 ( 0.00%) 20410.21 (-549.42%)
Stddev 16 2026.28 ( 0.00%) 19767.72 (-875.57%)
Stddev 17 2059.98 ( 0.00%) 19358.07 (-839.72%)
Stddev 18 2832.80 ( 0.00%) 19434.41 (-586.05%)
Stddev 19 4248.17 ( 0.00%) 19590.94 (-361.16%)
Stddev 20 3163.70 ( 0.00%) 18608.43 (-488.19%)
Stddev 21 1046.22 ( 0.00%) 17766.10 (-1598.13%)
Stddev 22 1458.72 ( 0.00%) 16295.25 (-1017.09%)
Stddev 23 1453.80 ( 0.00%) 16933.28 (-1064.76%)
Stddev 24 3387.76 ( 0.00%) 17276.97 (-409.98%)
Stddev 25 467.26 ( 0.00%) 17228.85 (-3587.21%)
Stddev 26 269.10 ( 0.00%) 17614.19 (-6445.71%)
Stddev 27 1024.92 ( 0.00%) 16197.85 (-1480.40%)
Stddev 28 2547.19 ( 0.00%) 22532.91 (-784.62%)
Stddev 29 2496.51 ( 0.00%) 21734.79 (-770.61%)
Stddev 30 1777.21 ( 0.00%) 22407.22 (-1160.81%)
Stddev 31 2948.17 ( 0.00%) 22046.59 (-647.81%)
Stddev 32 3045.75 ( 0.00%) 21317.50 (-599.91%)
Stddev 33 3088.42 ( 0.00%) 24073.34 (-679.47%)
Stddev 34 1695.86 ( 0.00%) 25483.66 (-1402.69%)
Stddev 35 2392.89 ( 0.00%) 22319.81 (-832.76%)
Stddev 36 1002.99 ( 0.00%) 24788.30 (-2371.43%)
Stddev 37 1246.07 ( 0.00%) 22969.98 (-1743.39%)
Stddev 38 3340.47 ( 0.00%) 17764.75 (-431.80%)
Stddev 39 951.45 ( 0.00%) 17467.43 (-1735.88%)
Stddev 40 1861.87 ( 0.00%) 16746.88 (-799.47%)
Stddev 41 3019.63 ( 0.00%) 22203.85 (-635.32%)
Stddev 42 3305.80 ( 0.00%) 19226.07 (-481.59%)
Stddev 43 2149.96 ( 0.00%) 19788.85 (-820.43%)
Stddev 44 4743.81 ( 0.00%) 20232.47 (-326.50%)
Stddev 45 3701.87 ( 0.00%) 19876.40 (-436.93%)
Stddev 46 3742.49 ( 0.00%) 17963.46 (-379.99%)
Stddev 47 1637.98 ( 0.00%) 20138.13 (-1129.45%)
Stddev 48 2192.84 ( 0.00%) 16729.79 (-662.93%)
TPut 1 123600.00 ( 0.00%) 118166.00 ( -4.40%)
TPut 2 251282.00 ( 0.00%) 253321.00 ( 0.81%)
TPut 3 371212.00 ( 0.00%) 370519.00 ( -0.19%)
TPut 4 476489.00 ( 0.00%) 487927.00 ( 2.40%)
TPut 5 569564.00 ( 0.00%) 593162.00 ( 4.14%)
TPut 6 604292.00 ( 0.00%) 679295.00 ( 12.41%)
TPut 7 610474.00 ( 0.00%) 665644.00 ( 9.04%)
TPut 8 565137.00 ( 0.00%) 612888.00 ( 8.45%)
TPut 9 544221.00 ( 0.00%) 557050.00 ( 2.36%)
TPut 10 497162.00 ( 0.00%) 533858.00 ( 7.38%)
TPut 11 559757.00 ( 0.00%) 638725.00 ( 14.11%)
TPut 12 550183.00 ( 0.00%) 639318.00 ( 16.20%)
TPut 13 534429.00 ( 0.00%) 631236.00 ( 18.11%)
TPut 14 542048.00 ( 0.00%) 614042.00 ( 13.28%)
TPut 15 530923.00 ( 0.00%) 606509.00 ( 14.24%)
TPut 16 519697.00 ( 0.00%) 592992.00 ( 14.10%)
TPut 17 521356.00 ( 0.00%) 597000.00 ( 14.51%)
TPut 18 497257.00 ( 0.00%) 585946.00 ( 17.84%)
TPut 19 481325.00 ( 0.00%) 574467.00 ( 19.35%)
TPut 20 475309.00 ( 0.00%) 565526.00 ( 18.98%)
TPut 21 483753.00 ( 0.00%) 552787.00 ( 14.27%)
TPut 22 474643.00 ( 0.00%) 547518.00 ( 15.35%)
TPut 23 468023.00 ( 0.00%) 536802.00 ( 14.70%)
TPut 24 450846.00 ( 0.00%) 525210.00 ( 16.49%)
TPut 25 461834.00 ( 0.00%) 519757.00 ( 12.54%)
TPut 26 456034.00 ( 0.00%) 515338.00 ( 13.00%)
TPut 27 460254.00 ( 0.00%) 513576.00 ( 11.59%)
TPut 28 457438.00 ( 0.00%) 496290.00 ( 8.49%)
TPut 29 454550.00 ( 0.00%) 499818.00 ( 9.96%)
TPut 30 453571.00 ( 0.00%) 495767.00 ( 9.30%)
TPut 31 460525.00 ( 0.00%) 485911.00 ( 5.51%)
TPut 32 448016.00 ( 0.00%) 488940.00 ( 9.13%)
TPut 33 445150.00 ( 0.00%) 483970.00 ( 8.72%)
TPut 34 444827.00 ( 0.00%) 475079.00 ( 6.80%)
TPut 35 433878.00 ( 0.00%) 480246.00 ( 10.69%)
TPut 36 423728.00 ( 0.00%) 472159.00 ( 11.43%)
TPut 37 429712.00 ( 0.00%) 473183.00 ( 10.12%)
TPut 38 411219.00 ( 0.00%) 482078.00 ( 17.23%)
TPut 39 416380.00 ( 0.00%) 485846.00 ( 16.68%)
TPut 40 413840.00 ( 0.00%) 490026.00 ( 18.41%)
TPut 41 401668.00 ( 0.00%) 474282.00 ( 18.08%)
TPut 42 404103.00 ( 0.00%) 482448.00 ( 19.39%)
TPut 43 401247.00 ( 0.00%) 482974.00 ( 20.37%)
TPut 44 407076.00 ( 0.00%) 481641.00 ( 18.32%)
TPut 45 398597.00 ( 0.00%) 485042.00 ( 21.69%)
TPut 46 404714.00 ( 0.00%) 484843.00 ( 19.80%)
TPut 47 404595.00 ( 0.00%) 479977.00 ( 18.63%)
TPut 48 413784.00 ( 0.00%) 480818.00 ( 16.20%)
This is looking much better overall although I am concerned about the
increased variability between JVMs.
specjbb Peaks
3.12.0-rc3 3.12.0-rc3
account-v9 periodretry-v9
Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%)
Expctd Peak Bops 559757.00 ( 0.00%) 638725.00 ( 14.11%)
Actual Warehouse 8.00 ( 0.00%) 7.00 (-12.50%)
Actual Peak Bops 610474.00 ( 0.00%) 679295.00 ( 11.27%)
SpecJBB Bops 502292.00 ( 0.00%) 582258.00 ( 15.92%)
SpecJBB Bops/JVM 125573.00 ( 0.00%) 145565.00 ( 15.92%)
Looking fine.
3.12.0-rc3 3.12.0-rc3
account-v9 periodretry-v9
User 481412.08 481942.54
System 1301.91 578.20
Elapsed 10402.09 10404.47
3.12.0-rc3 3.12.0-rc3
account-v9 periodretry-v9
Compaction cost 105928 13748
NUMA PTE updates 457567880 45890118
NUMA hint faults 69831880 45725506
NUMA hint local faults 19303679 28637898
NUMA hint local percent 27 62
NUMA pages migrated 102050548 13244738
AutoNUMA cost 354301 229200
and system CPU usage is still way down, so now we are seeing large
improvements for less work. Previous tests had indicated that periodic
retrying of task migration was necessary for a good "local percent"
of local/remote faults. This implies that the load balancer and NUMA
scheduling may be making conflicting decisions.
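Working from the table above, the local percent is simply NUMA hint local
faults divided by NUMA hint faults: 19303679 / 69831880 is roughly 27% for
account-v9 versus 28637898 / 45725506, or roughly 62%, for periodretry-v9,
while migrating roughly an eighth as many pages.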
While there is still plenty of future work it looks like this is ready
for wider testing.
Documentation/sysctl/kernel.txt | 76 +++
fs/exec.c | 1 +
fs/proc/array.c | 2 +
include/linux/cpu.h | 67 ++-
include/linux/mempolicy.h | 1 +
include/linux/migrate.h | 7 +-
include/linux/mm.h | 118 +++-
include/linux/mm_types.h | 17 +-
include/linux/page-flags-layout.h | 28 +-
include/linux/sched.h | 67 ++-
include/linux/sched/sysctl.h | 1 -
include/linux/stop_machine.h | 1 +
kernel/bounds.c | 4 +
kernel/cpu.c | 227 ++++++--
kernel/fork.c | 5 +-
kernel/sched/core.c | 184 ++++++-
kernel/sched/debug.c | 60 +-
kernel/sched/fair.c | 1092 ++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 19 +-
kernel/sched/idle_task.c | 2 +-
kernel/sched/rt.c | 5 +-
kernel/sched/sched.h | 27 +-
kernel/sched/stop_task.c | 2 +-
kernel/stop_machine.c | 272 +++++----
kernel/sysctl.c | 21 +-
mm/huge_memory.c | 119 +++-
mm/memory.c | 158 ++----
mm/mempolicy.c | 82 ++-
mm/migrate.c | 49 +-
mm/mm_init.c | 18 +-
mm/mmzone.c | 14 +-
mm/mprotect.c | 65 +--
mm/page_alloc.c | 4 +-
33 files changed, 2248 insertions(+), 567 deletions(-)
--
1.8.4
* [PATCH 01/63] hotplug: Optimize {get,put}_online_cpus()
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
NOTE: This is a placeholder only. A more comprehensive series is in
progress but this patch on its own mitigates most of the
overhead the migrate_swap patch is concerned with. It's
expected that the CPU hotplug locking series will go in before
this series.
The current implementation of get_online_cpus() is global in nature
and thus not suited for any kind of common usage.
Re-implement the current recursive r/w cpu hotplug lock such that the
read side locks are as light as possible.
The current cpu hotplug lock is entirely reader biased; but since
readers are expensive there aren't a lot of them about and writer
starvation isn't a particular problem.
However by making the reader side more usable there is a fair chance
it will get used more and thus the starvation issue becomes a real
possibility.
Therefore this new implementation is fair, alternating readers and
writers; this however requires per-task state to allow the reader
recursion -- this new task_struct member is placed in a 4 byte hole on
64bit builds.
Many comments are contributed by Paul McKenney, and many previous
attempts were shown to be inadequate by both Paul and Oleg; many
thanks to them for persisting to poke holes in my attempts.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
include/linux/cpu.h | 67 ++++++++++++++-
include/linux/sched.h | 3 +
kernel/cpu.c | 227 +++++++++++++++++++++++++++++++++++++-------------
kernel/sched/core.c | 2 +
4 files changed, 237 insertions(+), 62 deletions(-)
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 801ff9e..e520c76 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,8 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
struct device;
@@ -173,10 +175,69 @@ extern struct bus_type cpu_subsys;
#ifdef CONFIG_HOTPLUG_CPU
/* Stop CPUs going up and down. */
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
extern void cpu_hotplug_begin(void);
extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern int __cpuhp_state;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+ might_sleep();
+
+ /* Support reader recursion */
+ /* The value was >= 1 and remains so, reordering causes no harm. */
+ if (current->cpuhp_ref++)
+ return;
+
+ preempt_disable();
+ /*
+ * We are in an RCU-sched read-side critical section, so the writer
+ * cannot both change __cpuhp_state from readers_fast and start
+ * checking counters while we are here. So if we see !__cpuhp_state,
+ * we know that the writer won't be checking until we past the
+ * preempt_enable() and that once the synchronize_sched() is done, the
+ * writer will see anything we did within this RCU-sched read-side
+ * critical section.
+ */
+ if (likely(!__cpuhp_state))
+ __this_cpu_inc(__cpuhp_refcount);
+ else
+ __get_online_cpus(); /* Unconditional memory barrier. */
+ preempt_enable();
+ /*
+ * The barrier() from preempt_enable() prevents the compiler from
+ * bleeding the critical section out.
+ */
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+ /* The value was >= 1 and remains so, reordering causes no harm. */
+ if (--current->cpuhp_ref)
+ return;
+
+ /*
+ * The barrier() in preempt_disable() prevents the compiler from
+ * bleeding the critical section out.
+ */
+ preempt_disable();
+ /*
+ * Same as in get_online_cpus().
+ */
+ if (likely(!__cpuhp_state))
+ __this_cpu_dec(__cpuhp_refcount);
+ else
+ __put_online_cpus(); /* Unconditional memory barrier. */
+ preempt_enable();
+}
+
extern void cpu_hotplug_disable(void);
extern void cpu_hotplug_enable(void);
#define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
@@ -200,6 +261,8 @@ static inline void cpu_hotplug_driver_unlock(void)
#else /* CONFIG_HOTPLUG_CPU */
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
static inline void cpu_hotplug_begin(void) {}
static inline void cpu_hotplug_done(void) {}
#define get_online_cpus() do { } while (0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6682da3..5308d89 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1026,6 +1026,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+#ifdef CONFIG_HOTPLUG_CPU
+ int cpuhp_ref;
+#endif
struct task_struct *last_wakee;
unsigned long wakee_flips;
unsigned long wakee_flip_decay_ts;
diff --git a/kernel/cpu.c b/kernel/cpu.c
index d7f07a2..dccf605 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,195 @@ static int cpu_hotplug_disabled;
#ifdef CONFIG_HOTPLUG_CPU
-static struct {
- struct task_struct *active_writer;
- struct mutex lock; /* Synchronizes accesses to refcount, */
+enum { readers_fast = 0, readers_slow, readers_block };
+
+int __cpuhp_state;
+EXPORT_SYMBOL_GPL(__cpuhp_state);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static atomic_t cpuhp_waitcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+ p->cpuhp_ref = 0;
+}
+
+void __get_online_cpus(void)
+{
+again:
+ __this_cpu_inc(__cpuhp_refcount);
+
/*
- * Also blocks the new readers during
- * an ongoing cpu hotplug operation.
+ * Due to having preemption disabled the decrement happens on
+ * the same CPU as the increment, avoiding the
+ * increment-on-one-CPU-and-decrement-on-another problem.
+ *
+ * And yes, if the reader misses the writer's assignment of
+ * readers_block to __cpuhp_state, then the writer is
+ * guaranteed to see the reader's increment. Conversely, any
+ * readers that increment their __cpuhp_refcount after the
+ * writer looks are guaranteed to see the readers_block value,
+ * which in turn means that they are guaranteed to immediately
+ * decrement their __cpuhp_refcount, so that it doesn't matter
+ * that the writer missed them.
*/
- int refcount;
-} cpu_hotplug = {
- .active_writer = NULL,
- .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
- .refcount = 0,
-};
-void get_online_cpus(void)
-{
- might_sleep();
- if (cpu_hotplug.active_writer == current)
+ smp_mb(); /* A matches D */
+
+ if (likely(__cpuhp_state != readers_block))
return;
- mutex_lock(&cpu_hotplug.lock);
- cpu_hotplug.refcount++;
- mutex_unlock(&cpu_hotplug.lock);
+ /*
+ * Make sure an outgoing writer sees the waitcount to ensure we
+ * make progress.
+ */
+ atomic_inc(&cpuhp_waitcount);
+
+ /*
+ * Per the above comment; we still have preemption disabled and
+ * will thus decrement on the same CPU as we incremented.
+ */
+ __put_online_cpus();
+
+ /*
+ * We either call schedule() in the wait, or we'll fall through
+ * and reschedule on the preempt_enable() in get_online_cpus().
+ */
+ preempt_enable_no_resched();
+ __wait_event(cpuhp_readers, __cpuhp_state != readers_block);
+ preempt_disable();
+
+ /*
+ * Given we've still got preempt_disabled and new cpu_hotplug_begin()
+ * must do a synchronize_sched() we're guaranteed a successfull
+ * acquisition this time -- even if we wake the current
+ * cpu_hotplug_end() now.
+ */
+ if (atomic_dec_and_test(&cpuhp_waitcount))
+ wake_up(&cpuhp_writer);
+
+ goto again;
}
-EXPORT_SYMBOL_GPL(get_online_cpus);
+EXPORT_SYMBOL_GPL(__get_online_cpus);
-void put_online_cpus(void)
+void __put_online_cpus(void)
{
- if (cpu_hotplug.active_writer == current)
- return;
- mutex_lock(&cpu_hotplug.lock);
+ smp_mb(); /* B matches C */
+ /*
+ * In other words, if they see our decrement (presumably to aggregate
+ * zero, as that is the only time it matters) they will also see our
+ * critical section.
+ */
+ this_cpu_dec(__cpuhp_refcount);
+
+ /* Prod writer to recheck readers_active */
+ wake_up(&cpuhp_writer);
+}
+EXPORT_SYMBOL_GPL(__put_online_cpus);
- if (WARN_ON(!cpu_hotplug.refcount))
- cpu_hotplug.refcount++; /* try to fix things up */
+#define per_cpu_sum(var) \
+({ \
+ typeof(var) __sum = 0; \
+ int cpu; \
+ for_each_possible_cpu(cpu) \
+ __sum += per_cpu(var, cpu); \
+ __sum; \
+})
- if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
- wake_up_process(cpu_hotplug.active_writer);
- mutex_unlock(&cpu_hotplug.lock);
+/*
+ * Return true if the modular sum of the __cpuhp_refcount per-CPU variables
+ * is zero. If this sum is zero, then it is stable due to the fact that if
+ * any newly arriving readers increment a given counter, they will
+ * immediately decrement that same counter.
+ */
+static bool cpuhp_readers_active_check(void)
+{
+ if (per_cpu_sum(__cpuhp_refcount) != 0)
+ return false;
+ /*
+ * If we observed the decrement; ensure we see the entire critical
+ * section.
+ */
+
+ smp_mb(); /* C matches B */
+
+ return true;
}
-EXPORT_SYMBOL_GPL(put_online_cpus);
/*
- * This ensures that the hotplug operation can begin only when the
- * refcount goes to zero.
- *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
- * Since cpu_hotplug_begin() is always called after invoking
- * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- * writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- * non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
+ * This will notify new readers to block and wait for all active readers to
+ * complete.
*/
void cpu_hotplug_begin(void)
{
- cpu_hotplug.active_writer = current;
+ /*
+ * Since cpu_hotplug_begin() is always called after invoking
+ * cpu_maps_update_begin(), we can be sure that only one writer is
+ * active.
+ */
+ lockdep_assert_held(&cpu_add_remove_lock);
- for (;;) {
- mutex_lock(&cpu_hotplug.lock);
- if (likely(!cpu_hotplug.refcount))
- break;
- __set_current_state(TASK_UNINTERRUPTIBLE);
- mutex_unlock(&cpu_hotplug.lock);
- schedule();
- }
+ /* Allow reader-in-writer recursion. */
+ current->cpuhp_ref++;
+
+ /* Notify readers to take the slow path. */
+ __cpuhp_state = readers_slow;
+
+ /* See percpu_down_write(); guarantees all readers take the slow path */
+ synchronize_sched();
+
+ /*
+ * Notify new readers to block; up until now, and thus throughout the
+ * longish synchronize_sched() above, new readers could still come in.
+ */
+ __cpuhp_state = readers_block;
+
+ smp_mb(); /* D matches A */
+
+ /*
+ * If they don't see our writer of readers_block to __cpuhp_state,
+ * then we are guaranteed to see their __cpuhp_refcount increment, and
+ * therefore will wait for them.
+ */
+
+ /* Wait for all now active readers to complete. */
+ wait_event(cpuhp_writer, cpuhp_readers_active_check());
}
void cpu_hotplug_done(void)
{
- cpu_hotplug.active_writer = NULL;
- mutex_unlock(&cpu_hotplug.lock);
+ /*
+ * Signal the writer is done, no fast path yet.
+ *
+ * One reason that we cannot just immediately flip to readers_fast is
+ * that new readers might fail to see the results of this writer's
+ * critical section.
+ */
+ __cpuhp_state = readers_slow;
+ wake_up_all(&cpuhp_readers);
+
+ /*
+ * The wait_event()/wake_up_all() prevents the race where the readers
+ * are delayed between fetching __cpuhp_state and blocking.
+ */
+
+ /* See percpu_up_write(); readers will no longer attempt to block. */
+ synchronize_sched();
+
+ /* Let 'em rip */
+ __cpuhp_state = readers_fast;
+ current->cpuhp_ref--;
+
+ /*
+ * Wait for any pending readers to be running. This ensures readers
+ * after writer and avoids writers starving readers.
+ */
+ wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
}
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ac63c9..2f3420c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1630,6 +1630,8 @@ static void __sched_fork(struct task_struct *p)
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_work.next = &p->numa_work;
#endif /* CONFIG_NUMA_BALANCING */
+
+ cpu_hotplug_init_task(p);
}
#ifdef CONFIG_NUMA_BALANCING
--
1.8.4
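As a usage illustration only (it is not part of the patch and the function
name example_walk_online_cpus is made up), the reader-side pattern that the
rework above keeps cheap looks something like the sketch below; the common
case becomes a per-cpu reference count increment instead of a global mutex
acquisition.

#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/printk.h>

/*
 * Pin the set of online CPUs so none can be hot-unplugged while the
 * loop runs, then drop the reference again.
 */
static void example_walk_online_cpus(void)
{
	int cpu;

	get_online_cpus();
	for_each_online_cpu(cpu)
		pr_info("cpu %d is online\n", cpu);
	put_online_cpus();
}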
* [PATCH 02/63] mm: numa: Document automatic NUMA balancing sysctls
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 9d4c1d1..1428c66 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -355,6 +355,72 @@ utilize.
==============================================================
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running. Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases. The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases. The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a tasks scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
osrelease, ostype & version:
# cat osrelease
--
1.8.4
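As a worked example of the relationship described in the documentation above
(the numbers are illustrative, not defaults): if a task's current scan delay
is 1000ms and numa_balancing_scan_size_mb is 256, then at most 256MB of the
task's address space is unmapped for NUMA hinting per second. Halving the
scan delay to 500ms doubles that worst-case rate to 512MB per second.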
* Re: [PATCH 02/63] mm: numa: Document automatic NUMA balancing sysctls
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 12:46 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 12:46 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: numa: Document automatic NUMA balancing sysctls
2013-10-07 10:28 ` Mel Gorman
(?)
(?)
@ 2013-10-09 17:24 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 10fc05d0e551146ad6feb0ab8902d28a2d3c5624
Gitweb: http://git.kernel.org/tip/10fc05d0e551146ad6feb0ab8902d28a2d3c5624
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:40 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:20 +0200
mm: numa: Document automatic NUMA balancing sysctls
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-3-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 9d4c1d1..1428c66 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -355,6 +355,72 @@ utilize.
==============================================================
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans a task's address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running. Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases. The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases. The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a task's memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a task's scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
osrelease, ostype & version:
# cat osrelease
^ permalink raw reply [flat|nested] 340+ messages in thread
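As a rough companion to the sysctl description above, the following user-space sketch works through the scan-rate arithmetic it describes: one "scan size" chunk is scanned at most once per "scan delay". The values below are assumptions chosen for illustration, not necessarily the kernel defaults, and nothing is read from a live system.

#include <stdio.h>

int main(void)
{
        /* Assumed example values; the real settings live under /proc/sys/kernel/. */
        unsigned int scan_period_min_ms = 1000;  /* numa_balancing_scan_period_min_ms */
        unsigned int scan_period_max_ms = 60000; /* numa_balancing_scan_period_max_ms */
        unsigned int scan_size_mb = 256;         /* numa_balancing_scan_size_mb */

        /* One "scan size" chunk per "scan delay", so the delay bounds map
         * directly onto per-task scan rate bounds. */
        double max_rate = scan_size_mb * (1000.0 / scan_period_min_ms);
        double min_rate = scan_size_mb * (1000.0 / scan_period_max_ms);

        printf("per-task scan rate: %.1f MB/s (min) to %.1f MB/s (max)\n",
               min_rate, max_rate);
        return 0;
}

With these numbers the scanner covers at most 256MB per second per task and at least roughly 4MB per second, which is the sense in which scan_period_min_ms bounds the maximum scan rate and scan_period_max_ms the minimum.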
* [PATCH 03/63] sched, numa: Comment fixlets
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
Fix an 80 column violation and a PTE vs PMD reference.
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/fair.c | 8 ++++----
mm/huge_memory.c | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c70201..b22f52a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,10 +988,10 @@ void task_numa_work(struct callback_head *work)
out:
/*
- * It is possible to reach the end of the VMA list but the last few VMAs are
- * not guaranteed to the vma_migratable. If they are not, we would find the
- * !migratable VMA on the next scan but not reset the scanner to the start
- * so check it now.
+ * It is possible to reach the end of the VMA list but the last few
+ * VMAs are not guaranteed to the vma_migratable. If they are not, we
+ * would find the !migratable VMA on the next scan but not reset the
+ * scanner to the start so check it now.
*/
if (vma)
mm->numa_scan_offset = start;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7489884..19dbb08 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1305,7 +1305,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(&mm->page_table_lock);
lock_page(page);
- /* Confirm the PTE did not while locked */
+ /* Confirm the PMD did not change while page_table_lock was released */
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp))) {
unlock_page(page);
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 03/63] sched, numa: Comment fixlets
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 12:46 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 12:46 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Fix an 80 column violation and a PTE vs PMD reference.
>
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Fix comments
2013-10-07 10:28 ` Mel Gorman
(?)
(?)
@ 2013-10-09 17:24 ` tip-bot for Peter Zijlstra
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, srikar, aarcange,
mgorman, tglx
Commit-ID: c69307d533d7aa7cc8894dbbb8a274599f8630d7
Gitweb: http://git.kernel.org/tip/c69307d533d7aa7cc8894dbbb8a274599f8630d7
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:28:41 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:30 +0200
sched/numa: Fix comments
Fix an 80 column violation and a PTE vs PMD reference.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-4-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 8 ++++----
mm/huge_memory.c | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b89cd2..817cd7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,10 +988,10 @@ void task_numa_work(struct callback_head *work)
out:
/*
- * It is possible to reach the end of the VMA list but the last few VMAs are
- * not guaranteed to the vma_migratable. If they are not, we would find the
- * !migratable VMA on the next scan but not reset the scanner to the start
- * so check it now.
+ * It is possible to reach the end of the VMA list but the last few
+ * VMAs are not guaranteed to the vma_migratable. If they are not, we
+ * would find the !migratable VMA on the next scan but not reset the
+ * scanner to the start so check it now.
*/
if (vma)
mm->numa_scan_offset = start;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7489884..19dbb08 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1305,7 +1305,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(&mm->page_table_lock);
lock_page(page);
- /* Confirm the PTE did not while locked */
+ /* Confirm the PMD did not change while page_table_lock was released */
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp))) {
unlock_page(page);
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 04/63] mm: numa: Do not account for a hinting fault if we raced
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
If another task handled a hinting fault in parallel then do not double
account for it.
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/huge_memory.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 19dbb08..dab2bab 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1325,8 +1325,11 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
check_same:
spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(pmd, *pmdp)))
+ if (unlikely(!pmd_same(pmd, *pmdp))) {
+ /* Someone else took our fault */
+ current_nid = -1;
goto out_unlock;
+ }
clear_pmdnuma:
pmd = pmd_mknonnuma(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
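The fix above hinges on a simple pattern: after re-taking the lock, re-check the entry sampled earlier and skip the accounting if it changed, because a change means another thread already handled the fault. A stripped-down user-space analogue of that pattern, with invented names and pthreads standing in for the kernel's locking, might look like this.

#include <pthread.h>
#include <stdbool.h>

struct fault_state {
        pthread_mutex_t lock;           /* stands in for page_table_lock */
        unsigned long entry;            /* stands in for the pmd value */
        unsigned long faults_accounted;
};

/* Returns true if this thread accounted the fault, false if another
 * thread raced with us and already handled it. */
static bool account_fault_if_unchanged(struct fault_state *st,
                                       unsigned long snapshot)
{
        bool accounted = false;

        pthread_mutex_lock(&st->lock);
        if (st->entry == snapshot) {    /* analogue of pmd_same(pmd, *pmdp) */
                st->faults_accounted++; /* nobody else handled the fault */
                accounted = true;
        }                               /* else: someone else took our fault */
        pthread_mutex_unlock(&st->lock);
        return accounted;
}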
* Re: [PATCH 04/63] mm: numa: Do not account for a hinting fault if we raced
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 12:47 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 12:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> If another task handled a hinting fault in parallel then do not double
> account for it.
>
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: numa: Do not account for a hinting fault if we raced
2013-10-07 10:28 ` Mel Gorman
(?)
(?)
@ 2013-10-09 17:24 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 0c3a775e1e0b069bf765f8355b723ce0d18dcc6c
Gitweb: http://git.kernel.org/tip/0c3a775e1e0b069bf765f8355b723ce0d18dcc6c
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:42 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:40 +0200
mm: numa: Do not account for a hinting fault if we raced
If another task handled a hinting fault in parallel then do not double
account for it.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-5-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/huge_memory.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 19dbb08..dab2bab 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1325,8 +1325,11 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
check_same:
spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(pmd, *pmdp)))
+ if (unlikely(!pmd_same(pmd, *pmdp))) {
+ /* Someone else took our fault */
+ current_nid = -1;
goto out_unlock;
+ }
clear_pmdnuma:
pmd = pmd_mknonnuma(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:core/urgent] mm: numa: Do not account for a hinting fault if we raced
2013-10-07 10:28 ` Mel Gorman
` (2 preceding siblings ...)
(?)
@ 2013-10-29 10:42 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-29 10:42 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, stable, aarcange,
srikar, mgorman, tglx
Commit-ID: 1dd49bfa3465756b3ce72214b58a33e4afb67aa3
Gitweb: http://git.kernel.org/tip/1dd49bfa3465756b3ce72214b58a33e4afb67aa3
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:42 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 11:37:05 +0100
mm: numa: Do not account for a hinting fault if we raced
If another task handled a hinting fault in parallel then do not double
account for it.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-5-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/huge_memory.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 610e3df..33ee637 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1325,8 +1325,11 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
check_same:
spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(pmd, *pmdp)))
+ if (unlikely(!pmd_same(pmd, *pmdp))) {
+ /* Someone else took our fault */
+ current_nid = -1;
goto out_unlock;
+ }
clear_pmdnuma:
pmd = pmd_mknonnuma(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 05/63] mm: Wait for THP migrations to complete during NUMA hinting faults
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
The locking for migrating THP is unusual. While normal page migration
prevents parallel accesses using a migration PTE, THP migration relies on
a combination of the page_table_lock, the page lock and the existence of
the NUMA hinting PTE to guarantee safety but there is a bug in the scheme.
If a THP page is currently being migrated and another thread traps a
fault on the same page it checks if the page is misplaced. If it is not,
then pmd_numa is cleared. The problem is that it checks if the page is
misplaced without holding the page lock meaning that the racing thread
can be migrating the THP when the second thread clears the NUMA bit
and faults a stale page.
This patch checks if the page is potentially being migrated and, if so,
stalls using lock_page before checking if the page is misplaced or not.
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/huge_memory.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dab2bab..f362363 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1295,13 +1295,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (current_nid == numa_node_id())
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
- target_nid = mpol_misplaced(page, vma, haddr);
- if (target_nid == -1) {
- put_page(page);
- goto clear_pmdnuma;
- }
+ /*
+ * Acquire the page lock to serialise THP migrations but avoid dropping
+ * page_table_lock if at all possible
+ */
+ if (trylock_page(page))
+ goto got_lock;
- /* Acquire the page lock to serialise THP migrations */
+ /* Serialise against migrationa and check placement check placement */
spin_unlock(&mm->page_table_lock);
lock_page(page);
@@ -1312,9 +1313,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
put_page(page);
goto out_unlock;
}
- spin_unlock(&mm->page_table_lock);
+
+got_lock:
+ target_nid = mpol_misplaced(page, vma, haddr);
+ if (target_nid == -1) {
+ unlock_page(page);
+ put_page(page);
+ goto clear_pmdnuma;
+ }
/* Migrate the THP to the requested node */
+ spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
if (!migrated)
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
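The locking described above is easier to follow as a stripped-down sketch: try the sleeping lock while still holding the spinlock, and only if that fails drop the spinlock, sleep for the lock and then re-validate the entry the spinlock was protecting. This is an illustrative user-space analogue with invented names, not the kernel implementation.

#include <pthread.h>
#include <stdbool.h>

struct hinting_ctx {
        pthread_mutex_t table_lock;     /* analogue of page_table_lock */
        pthread_mutex_t page_lock;      /* analogue of the page lock */
        unsigned long entry;            /* analogue of the pmd value */
};

/* Called with table_lock held. Returns true with both locks held, or
 * false with only table_lock held if a parallel fault/migration won. */
static bool serialise_against_migration(struct hinting_ctx *c,
                                        unsigned long snapshot)
{
        /* Fast path: take the sleeping lock without dropping table_lock. */
        if (pthread_mutex_trylock(&c->page_lock) == 0)
                return true;

        /* Slow path: drop table_lock, sleep for page_lock, then re-take
         * table_lock and re-check the entry it was protecting. */
        pthread_mutex_unlock(&c->table_lock);
        pthread_mutex_lock(&c->page_lock);
        pthread_mutex_lock(&c->table_lock);

        if (c->entry != snapshot) {     /* like !pmd_same(pmd, *pmdp) */
                pthread_mutex_unlock(&c->page_lock);
                return false;
        }
        return true;
}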
* Re: [PATCH 05/63] mm: Wait for THP migrations to complete during NUMA hinting faults
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 13:55 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 13:55 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> The locking for migrating THP is unusual. While normal page migration
> prevents parallel accesses using a migration PTE, THP migration relies on
> a combination of the page_table_lock, the page lock and the existence of
> the NUMA hinting PTE to guarantee safety but there is a bug in the scheme.
>
> If a THP page is currently being migrated and another thread traps a
> fault on the same page it checks if the page is misplaced. If it is not,
> then pmd_numa is cleared. The problem is that it checks if the page is
> misplaced without holding the page lock meaning that the racing thread
> can be migrating the THP when the second thread clears the NUMA bit
> and faults a stale page.
>
> This patch checks if the page is potentially being migrated and, if so,
> stalls using lock_page before checking if the page is misplaced or not.
>
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: Wait for THP migrations to complete during NUMA hinting faults
2013-10-07 10:28 ` Mel Gorman
(?)
(?)
@ 2013-10-09 17:24 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: ff9042b11a71c81238c70af168cd36b98a6d5a3c
Gitweb: http://git.kernel.org/tip/ff9042b11a71c81238c70af168cd36b98a6d5a3c
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:43 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:41 +0200
mm: Wait for THP migrations to complete during NUMA hinting faults
The locking for migrating THP is unusual. While normal page migration
prevents parallel accesses using a migration PTE, THP migration relies on
a combination of the page_table_lock, the page lock and the existence of
the NUMA hinting PTE to guarantee safety but there is a bug in the scheme.
If a THP page is currently being migrated and another thread traps a
fault on the same page it checks if the page is misplaced. If it is not,
then pmd_numa is cleared. The problem is that it checks if the page is
misplaced without holding the page lock meaning that the racing thread
can be migrating the THP when the second thread clears the NUMA bit
and faults a stale page.
This patch checks if the page is potentially being migrated and, if so,
stalls using lock_page before checking if the page is misplaced or not.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-6-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/huge_memory.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dab2bab..f362363 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1295,13 +1295,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (current_nid == numa_node_id())
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
- target_nid = mpol_misplaced(page, vma, haddr);
- if (target_nid == -1) {
- put_page(page);
- goto clear_pmdnuma;
- }
+ /*
+ * Acquire the page lock to serialise THP migrations but avoid dropping
+ * page_table_lock if at all possible
+ */
+ if (trylock_page(page))
+ goto got_lock;
- /* Acquire the page lock to serialise THP migrations */
+ /* Serialise against migrationa and check placement check placement */
spin_unlock(&mm->page_table_lock);
lock_page(page);
@@ -1312,9 +1313,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
put_page(page);
goto out_unlock;
}
- spin_unlock(&mm->page_table_lock);
+
+got_lock:
+ target_nid = mpol_misplaced(page, vma, haddr);
+ if (target_nid == -1) {
+ unlock_page(page);
+ put_page(page);
+ goto clear_pmdnuma;
+ }
/* Migrate the THP to the requested node */
+ spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
if (!migrated)
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:core/urgent] mm: Wait for THP migrations to complete during NUMA hinting faults
2013-10-07 10:28 ` Mel Gorman
` (2 preceding siblings ...)
(?)
@ 2013-10-29 10:42 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-29 10:42 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, stable, aarcange,
srikar, mgorman, tglx
Commit-ID: 42836f5f8baa33085f547098b74aa98991ee9216
Gitweb: http://git.kernel.org/tip/42836f5f8baa33085f547098b74aa98991ee9216
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:43 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 11:37:19 +0100
mm: Wait for THP migrations to complete during NUMA hinting faults
The locking for migrating THP is unusual. While normal page migration
prevents parallel accesses using a migration PTE, THP migration relies on
a combination of the page_table_lock, the page lock and the existence of
the NUMA hinting PTE to guarantee safety but there is a bug in the scheme.
If a THP page is currently being migrated and another thread traps a
fault on the same page it checks if the page is misplaced. If it is not,
then pmd_numa is cleared. The problem is that it checks if the page is
misplaced without holding the page lock meaning that the racing thread
can be migrating the THP when the second thread clears the NUMA bit
and faults a stale page.
This patch checks if the page is potentially being migrated and, if so,
stalls using lock_page before checking if the page is misplaced or not.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-6-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/huge_memory.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 33ee637..e10d780 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1295,13 +1295,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (current_nid == numa_node_id())
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
- target_nid = mpol_misplaced(page, vma, haddr);
- if (target_nid == -1) {
- put_page(page);
- goto clear_pmdnuma;
- }
+ /*
+ * Acquire the page lock to serialise THP migrations but avoid dropping
+ * page_table_lock if at all possible
+ */
+ if (trylock_page(page))
+ goto got_lock;
- /* Acquire the page lock to serialise THP migrations */
+ /* Serialise against migrationa and check placement check placement */
spin_unlock(&mm->page_table_lock);
lock_page(page);
@@ -1312,9 +1313,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
put_page(page);
goto out_unlock;
}
- spin_unlock(&mm->page_table_lock);
+
+got_lock:
+ target_nid = mpol_misplaced(page, vma, haddr);
+ if (target_nid == -1) {
+ unlock_page(page);
+ put_page(page);
+ goto clear_pmdnuma;
+ }
/* Migrate the THP to the requested node */
+ spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
if (!migrated)
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 06/63] mm: Prevent parallel splits during THP migration
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
THP migrations are serialised by the page lock but on its own that does
not prevent THP splits. If the page is split during THP migration then
the pmd_same checks will prevent page table corruption but the unlock page
and other fix-ups potentially will cause corruption. This patch takes the
anon_vma lock to prevent parallel splits during migration.
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/huge_memory.c | 44 ++++++++++++++++++++++++++++++--------------
1 file changed, 30 insertions(+), 14 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f362363..1d6334f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1278,18 +1278,18 @@ out:
int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp)
{
+ struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int target_nid;
int current_nid = -1;
- bool migrated;
+ bool migrated, page_locked;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
page = pmd_page(pmd);
- get_page(page);
current_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (current_nid == numa_node_id())
@@ -1299,12 +1299,29 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* Acquire the page lock to serialise THP migrations but avoid dropping
* page_table_lock if at all possible
*/
- if (trylock_page(page))
- goto got_lock;
+ page_locked = trylock_page(page);
+ target_nid = mpol_misplaced(page, vma, haddr);
+ if (target_nid == -1) {
+ /* If the page was locked, there are no parallel migrations */
+ if (page_locked) {
+ unlock_page(page);
+ goto clear_pmdnuma;
+ }
- /* Serialise against migrationa and check placement check placement */
+ /* Otherwise wait for potential migrations and retry fault */
+ spin_unlock(&mm->page_table_lock);
+ wait_on_page_locked(page);
+ goto out;
+ }
+
+ /* Page is misplaced, serialise migrations and parallel THP splits */
+ get_page(page);
spin_unlock(&mm->page_table_lock);
- lock_page(page);
+ if (!page_locked) {
+ lock_page(page);
+ page_locked = true;
+ }
+ anon_vma = page_lock_anon_vma_read(page);
/* Confirm the PMD did not change while page_table_lock was released */
spin_lock(&mm->page_table_lock);
@@ -1314,14 +1331,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_unlock;
}
-got_lock:
- target_nid = mpol_misplaced(page, vma, haddr);
- if (target_nid == -1) {
- unlock_page(page);
- put_page(page);
- goto clear_pmdnuma;
- }
-
/* Migrate the THP to the requested node */
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1330,6 +1339,8 @@ got_lock:
goto check_same;
task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+ if (anon_vma)
+ page_unlock_anon_vma_read(anon_vma);
return 0;
check_same:
@@ -1346,6 +1357,11 @@ clear_pmdnuma:
update_mmu_cache_pmd(vma, addr, pmdp);
out_unlock:
spin_unlock(&mm->page_table_lock);
+
+out:
+ if (anon_vma)
+ page_unlock_anon_vma_read(anon_vma);
+
if (current_nid != -1)
task_numa_fault(current_nid, HPAGE_PMD_NR, false);
return 0;
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
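One detail the patch above introduces is the path where the page is correctly placed but could not be locked: instead of sleeping for the page lock with page_table_lock still held, the fault drops its locks, waits for whoever holds the page lock (a potential migration) to finish, and retries. A minimal pthread sketch of that shape, with invented names and a lock/unlock pair standing in for wait_on_page_locked(), is shown below.

#include <pthread.h>

/* table_lock and page_lock are analogues of page_table_lock and the page
 * lock; this is an illustrative user-space sketch, not kernel code. */
static void wait_for_parallel_migration(pthread_mutex_t *table_lock,
                                        pthread_mutex_t *page_lock)
{
        /* Never sleep for the page lock while holding the table lock. */
        pthread_mutex_unlock(table_lock);

        /* Block until the current holder releases the page lock, then let
         * it go again immediately: the caller retries the whole fault, so
         * there is no need to keep ownership here. */
        pthread_mutex_lock(page_lock);
        pthread_mutex_unlock(page_lock);
}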
* Re: [PATCH 06/63] mm: Prevent parallel splits during THP migration
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 14:01 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 14:01 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> THP migrations are serialised by the page lock but on its own that does
> not prevent THP splits. If the page is split during THP migration then
> the pmd_same checks will prevent page table corruption but the unlock page
> and other fix-ups potentially will cause corruption. This patch takes the
> anon_vma lock to prevent parallel splits during migration.
>
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: Prevent parallel splits during THP migration
2013-10-07 10:28 ` Mel Gorman
(?)
(?)
@ 2013-10-09 17:24 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: b8916634b77bffb233d8f2f45703c80343457cc1
Gitweb: http://git.kernel.org/tip/b8916634b77bffb233d8f2f45703c80343457cc1
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:44 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:43 +0200
mm: Prevent parallel splits during THP migration
THP migrations are serialised by the page lock but on its own that does
not prevent THP splits. If the page is split during THP migration then
the pmd_same checks will prevent page table corruption but the unlock page
and other fix-ups potentially will cause corruption. This patch takes the
anon_vma lock to prevent parallel splits during migration.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-7-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/huge_memory.c | 44 ++++++++++++++++++++++++++++++--------------
1 file changed, 30 insertions(+), 14 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f362363..1d6334f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1278,18 +1278,18 @@ out:
int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp)
{
+ struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int target_nid;
int current_nid = -1;
- bool migrated;
+ bool migrated, page_locked;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
page = pmd_page(pmd);
- get_page(page);
current_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (current_nid == numa_node_id())
@@ -1299,12 +1299,29 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* Acquire the page lock to serialise THP migrations but avoid dropping
* page_table_lock if at all possible
*/
- if (trylock_page(page))
- goto got_lock;
+ page_locked = trylock_page(page);
+ target_nid = mpol_misplaced(page, vma, haddr);
+ if (target_nid == -1) {
+ /* If the page was locked, there are no parallel migrations */
+ if (page_locked) {
+ unlock_page(page);
+ goto clear_pmdnuma;
+ }
- /* Serialise against migrationa and check placement check placement */
+ /* Otherwise wait for potential migrations and retry fault */
+ spin_unlock(&mm->page_table_lock);
+ wait_on_page_locked(page);
+ goto out;
+ }
+
+ /* Page is misplaced, serialise migrations and parallel THP splits */
+ get_page(page);
spin_unlock(&mm->page_table_lock);
- lock_page(page);
+ if (!page_locked) {
+ lock_page(page);
+ page_locked = true;
+ }
+ anon_vma = page_lock_anon_vma_read(page);
/* Confirm the PMD did not change while page_table_lock was released */
spin_lock(&mm->page_table_lock);
@@ -1314,14 +1331,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_unlock;
}
-got_lock:
- target_nid = mpol_misplaced(page, vma, haddr);
- if (target_nid == -1) {
- unlock_page(page);
- put_page(page);
- goto clear_pmdnuma;
- }
-
/* Migrate the THP to the requested node */
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1330,6 +1339,8 @@ got_lock:
goto check_same;
task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+ if (anon_vma)
+ page_unlock_anon_vma_read(anon_vma);
return 0;
check_same:
@@ -1346,6 +1357,11 @@ clear_pmdnuma:
update_mmu_cache_pmd(vma, addr, pmdp);
out_unlock:
spin_unlock(&mm->page_table_lock);
+
+out:
+ if (anon_vma)
+ page_unlock_anon_vma_read(anon_vma);
+
if (current_nid != -1)
task_numa_fault(current_nid, HPAGE_PMD_NR, false);
return 0;
^ permalink raw reply [flat|nested] 340+ messages in thread
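The other ingredient of the change above is taking the anon_vma lock for read, so that a parallel THP split, which needs that lock exclusively, cannot run while the migration decision is being made. A rough reader/writer sketch of that relationship, using invented names and a pthread rwlock rather than the kernel's anon_vma API, might look like this.

#include <pthread.h>

static pthread_rwlock_t anon_vma_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Migration path: holding the lock for read keeps splits out while the
 * placement decision and migration are in flight. */
static void migrate_huge_page_sketch(void)
{
        pthread_rwlock_rdlock(&anon_vma_lock);
        /* ... decide placement and migrate the THP ... */
        pthread_rwlock_unlock(&anon_vma_lock);
}

/* Split path: needs the lock for write, so it waits for any migration
 * currently holding it for read, and vice versa. */
static void split_huge_page_sketch(void)
{
        pthread_rwlock_wrlock(&anon_vma_lock);
        /* ... split the huge page into base pages ... */
        pthread_rwlock_unlock(&anon_vma_lock);
}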
* [tip:core/urgent] mm: Prevent parallel splits during THP migration
2013-10-07 10:28 ` Mel Gorman
` (2 preceding siblings ...)
(?)
@ 2013-10-29 10:42 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-29 10:42 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, stable, aarcange,
srikar, mgorman, tglx
Commit-ID: 587fe586f44a48f9691001ba6c45b86c8e4ba21f
Gitweb: http://git.kernel.org/tip/587fe586f44a48f9691001ba6c45b86c8e4ba21f
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:44 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 11:37:39 +0100
mm: Prevent parallel splits during THP migration
THP migrations are serialised by the page lock but on its own that does
not prevent THP splits. If the page is split during THP migration then
the pmd_same checks will prevent page table corruption but the unlock page
and other fix-ups potentially will cause corruption. This patch takes the
anon_vma lock to prevent parallel splits during migration.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-7-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/huge_memory.c | 44 ++++++++++++++++++++++++++++++--------------
1 file changed, 30 insertions(+), 14 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e10d780..d8534b3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1278,18 +1278,18 @@ out:
int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp)
{
+ struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int target_nid;
int current_nid = -1;
- bool migrated;
+ bool migrated, page_locked;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
page = pmd_page(pmd);
- get_page(page);
current_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (current_nid == numa_node_id())
@@ -1299,12 +1299,29 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* Acquire the page lock to serialise THP migrations but avoid dropping
* page_table_lock if at all possible
*/
- if (trylock_page(page))
- goto got_lock;
+ page_locked = trylock_page(page);
+ target_nid = mpol_misplaced(page, vma, haddr);
+ if (target_nid == -1) {
+ /* If the page was locked, there are no parallel migrations */
+ if (page_locked) {
+ unlock_page(page);
+ goto clear_pmdnuma;
+ }
- /* Serialise against migrationa and check placement check placement */
+ /* Otherwise wait for potential migrations and retry fault */
+ spin_unlock(&mm->page_table_lock);
+ wait_on_page_locked(page);
+ goto out;
+ }
+
+ /* Page is misplaced, serialise migrations and parallel THP splits */
+ get_page(page);
spin_unlock(&mm->page_table_lock);
- lock_page(page);
+ if (!page_locked) {
+ lock_page(page);
+ page_locked = true;
+ }
+ anon_vma = page_lock_anon_vma_read(page);
/* Confirm the PTE did not while locked */
spin_lock(&mm->page_table_lock);
@@ -1314,14 +1331,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_unlock;
}
-got_lock:
- target_nid = mpol_misplaced(page, vma, haddr);
- if (target_nid == -1) {
- unlock_page(page);
- put_page(page);
- goto clear_pmdnuma;
- }
-
/* Migrate the THP to the requested node */
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1330,6 +1339,8 @@ got_lock:
goto check_same;
task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+ if (anon_vma)
+ page_unlock_anon_vma_read(anon_vma);
return 0;
check_same:
@@ -1346,6 +1357,11 @@ clear_pmdnuma:
update_mmu_cache_pmd(vma, addr, pmdp);
out_unlock:
spin_unlock(&mm->page_table_lock);
+
+out:
+ if (anon_vma)
+ page_unlock_anon_vma_read(anon_vma);
+
if (current_nid != -1)
task_numa_fault(current_nid, HPAGE_PMD_NR, false);
return 0;
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 07/63] mm: numa: Sanitize task_numa_fault() callsites
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
There are three callers of task_numa_fault():
- do_huge_pmd_numa_page():
Accounts against the current node, not the node where the
page resides, unless we migrated, in which case it accounts
against the node we migrated to.
- do_numa_page():
Accounts against the current node, not the node where the
page resides, unless we migrated, in which case it accounts
against the node we migrated to.
- do_pmd_numa_page():
Accounts not at all when the page isn't migrated, otherwise
accounts against the node we migrated towards.
This seems wrong to me; all three sites should have the same
semantics. Furthermore, we should account against where the page
really is; we already know where the task is.
So modify all three sites to always account (we did, after all,
receive the fault) and always account against where the page is after
migration, regardless of success.
They all still differ on when they clear the PTE/PMD; ideally that
would get sorted too.
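Expressed as a rule, the accounting all three callsites converge on is:
always account the fault, account it against the node the page ends up on
(the target node only if the migration actually happened), and skip
accounting only when the fault will be retried. A toy C sketch of that rule
follows; the function name is invented for illustration and this is not
kernel code:
/*
 * Model of the unified accounting rule.  Returns the node to account the
 * fault to; -1 (propagated from page_nid) means skip, e.g. when the fault
 * will be retried by someone else.
 */
#include <stdbool.h>
#include <stdio.h>

static int fault_account_nid(int page_nid, int target_nid, bool migrated)
{
        if (migrated)
                return target_nid;   /* page now lives on the target node */
        return page_nid;             /* migration skipped or failed: account where it is */
}

int main(void)
{
        /* page on node 0, no migration wanted            */
        printf("%d\n", fault_account_nid(0, -1, false));   /* -> 0 */
        /* page on node 0, migration to node 1 succeeded  */
        printf("%d\n", fault_account_nid(0, 1, true));     /* -> 1 */
        /* page on node 0, migration to node 1 failed     */
        printf("%d\n", fault_account_nid(0, 1, false));    /* -> 0 */
        return 0;
}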
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/huge_memory.c | 25 +++++++++++++------------
mm/memory.c | 53 +++++++++++++++++++++--------------------------------
2 files changed, 34 insertions(+), 44 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1d6334f..c3bb65f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1281,18 +1281,19 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
+ int page_nid = -1, this_nid = numa_node_id();
int target_nid;
- int current_nid = -1;
- bool migrated, page_locked;
+ bool page_locked;
+ bool migrated = false;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
page = pmd_page(pmd);
- current_nid = page_to_nid(page);
+ page_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
/*
@@ -1335,19 +1336,18 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
- if (!migrated)
+ if (migrated)
+ page_nid = target_nid;
+ else
goto check_same;
- task_numa_fault(target_nid, HPAGE_PMD_NR, true);
- if (anon_vma)
- page_unlock_anon_vma_read(anon_vma);
- return 0;
+ goto out;
check_same:
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp))) {
/* Someone else took our fault */
- current_nid = -1;
+ page_nid = -1;
goto out_unlock;
}
clear_pmdnuma:
@@ -1362,8 +1362,9 @@ out:
if (anon_vma)
page_unlock_anon_vma_read(anon_vma);
- if (current_nid != -1)
- task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index ca00039..42ae82e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3519,12 +3519,12 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
- unsigned long addr, int current_nid)
+ unsigned long addr, int page_nid)
{
get_page(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (page_nid == numa_node_id())
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
return mpol_misplaced(page, vma, addr);
@@ -3535,7 +3535,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page = NULL;
spinlock_t *ptl;
- int current_nid = -1;
+ int page_nid = -1;
int target_nid;
bool migrated = false;
@@ -3565,15 +3565,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}
- current_nid = page_to_nid(page);
- target_nid = numa_migrate_prep(page, vma, addr, current_nid);
+ page_nid = page_to_nid(page);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
if (target_nid == -1) {
- /*
- * Account for the fault against the current node if it not
- * being replaced regardless of where the page is located.
- */
- current_nid = numa_node_id();
put_page(page);
goto out;
}
@@ -3581,11 +3576,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Migrate to the requested node */
migrated = migrate_misplaced_page(page, target_nid);
if (migrated)
- current_nid = target_nid;
+ page_nid = target_nid;
out:
- if (current_nid != -1)
- task_numa_fault(current_nid, 1, migrated);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, 1, migrated);
return 0;
}
@@ -3600,7 +3595,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
- int local_nid = numa_node_id();
spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3623,9 +3617,10 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
pte_t pteval = *pte;
struct page *page;
- int curr_nid = local_nid;
+ int page_nid = -1;
int target_nid;
- bool migrated;
+ bool migrated = false;
+
if (!pte_present(pteval))
continue;
if (!pte_numa(pteval))
@@ -3647,25 +3642,19 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(page_mapcount(page) != 1))
continue;
- /*
- * Note that the NUMA fault is later accounted to either
- * the node that is currently running or where the page is
- * migrated to.
- */
- curr_nid = local_nid;
- target_nid = numa_migrate_prep(page, vma, addr,
- page_to_nid(page));
- if (target_nid == -1) {
+ page_nid = page_to_nid(page);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+ pte_unmap_unlock(pte, ptl);
+ if (target_nid != -1) {
+ migrated = migrate_misplaced_page(page, target_nid);
+ if (migrated)
+ page_nid = target_nid;
+ } else {
put_page(page);
- continue;
}
- /* Migrate to the requested node */
- pte_unmap_unlock(pte, ptl);
- migrated = migrate_misplaced_page(page, target_nid);
- if (migrated)
- curr_nid = target_nid;
- task_numa_fault(curr_nid, 1, migrated);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, 1, migrated);
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 07/63] mm: numa: Sanitize task_numa_fault() callsites
@ 2013-10-07 10:28 ` Mel Gorman
0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
There are three callers of task_numa_fault():
- do_huge_pmd_numa_page():
Accounts against the current node, not the node where the
page resides, unless we migrated, in which case it accounts
against the node we migrated to.
- do_numa_page():
Accounts against the current node, not the node where the
page resides, unless we migrated, in which case it accounts
against the node we migrated to.
- do_pmd_numa_page():
Accounts not at all when the page isn't migrated, otherwise
accounts against the node we migrated towards.
This seems wrong to me; all three sites should have the same
semantics. Furthermore, we should account against where the page
really is; we already know where the task is.
So modify all three sites to always account (we did, after all,
receive the fault) and always account against where the page is after
migration, regardless of success.
They all still differ on when they clear the PTE/PMD; ideally that
would get sorted too.
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/huge_memory.c | 25 +++++++++++++------------
mm/memory.c | 53 +++++++++++++++++++++--------------------------------
2 files changed, 34 insertions(+), 44 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1d6334f..c3bb65f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1281,18 +1281,19 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
+ int page_nid = -1, this_nid = numa_node_id();
int target_nid;
- int current_nid = -1;
- bool migrated, page_locked;
+ bool page_locked;
+ bool migrated = false;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
page = pmd_page(pmd);
- current_nid = page_to_nid(page);
+ page_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
/*
@@ -1335,19 +1336,18 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
- if (!migrated)
+ if (migrated)
+ page_nid = target_nid;
+ else
goto check_same;
- task_numa_fault(target_nid, HPAGE_PMD_NR, true);
- if (anon_vma)
- page_unlock_anon_vma_read(anon_vma);
- return 0;
+ goto out;
check_same:
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp))) {
/* Someone else took our fault */
- current_nid = -1;
+ page_nid = -1;
goto out_unlock;
}
clear_pmdnuma:
@@ -1362,8 +1362,9 @@ out:
if (anon_vma)
page_unlock_anon_vma_read(anon_vma);
- if (current_nid != -1)
- task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index ca00039..42ae82e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3519,12 +3519,12 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
- unsigned long addr, int current_nid)
+ unsigned long addr, int page_nid)
{
get_page(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (page_nid == numa_node_id())
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
return mpol_misplaced(page, vma, addr);
@@ -3535,7 +3535,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page = NULL;
spinlock_t *ptl;
- int current_nid = -1;
+ int page_nid = -1;
int target_nid;
bool migrated = false;
@@ -3565,15 +3565,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}
- current_nid = page_to_nid(page);
- target_nid = numa_migrate_prep(page, vma, addr, current_nid);
+ page_nid = page_to_nid(page);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
if (target_nid == -1) {
- /*
- * Account for the fault against the current node if it not
- * being replaced regardless of where the page is located.
- */
- current_nid = numa_node_id();
put_page(page);
goto out;
}
@@ -3581,11 +3576,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Migrate to the requested node */
migrated = migrate_misplaced_page(page, target_nid);
if (migrated)
- current_nid = target_nid;
+ page_nid = target_nid;
out:
- if (current_nid != -1)
- task_numa_fault(current_nid, 1, migrated);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, 1, migrated);
return 0;
}
@@ -3600,7 +3595,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
- int local_nid = numa_node_id();
spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3623,9 +3617,10 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
pte_t pteval = *pte;
struct page *page;
- int curr_nid = local_nid;
+ int page_nid = -1;
int target_nid;
- bool migrated;
+ bool migrated = false;
+
if (!pte_present(pteval))
continue;
if (!pte_numa(pteval))
@@ -3647,25 +3642,19 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(page_mapcount(page) != 1))
continue;
- /*
- * Note that the NUMA fault is later accounted to either
- * the node that is currently running or where the page is
- * migrated to.
- */
- curr_nid = local_nid;
- target_nid = numa_migrate_prep(page, vma, addr,
- page_to_nid(page));
- if (target_nid == -1) {
+ page_nid = page_to_nid(page);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+ pte_unmap_unlock(pte, ptl);
+ if (target_nid != -1) {
+ migrated = migrate_misplaced_page(page, target_nid);
+ if (migrated)
+ page_nid = target_nid;
+ } else {
put_page(page);
- continue;
}
- /* Migrate to the requested node */
- pte_unmap_unlock(pte, ptl);
- migrated = migrate_misplaced_page(page, target_nid);
- if (migrated)
- curr_nid = target_nid;
- task_numa_fault(curr_nid, 1, migrated);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, 1, migrated);
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 07/63] mm: numa: Sanitize task_numa_fault() callsites
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 14:02 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 14:02 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> There are three callers of task_numa_fault():
>
> - do_huge_pmd_numa_page():
> Accounts against the current node, not the node where the
> page resides, unless we migrated, in which case it accounts
> against the node we migrated to.
>
> - do_numa_page():
> Accounts against the current node, not the node where the
> page resides, unless we migrated, in which case it accounts
> against the node we migrated to.
>
> - do_pmd_numa_page():
> Accounts not at all when the page isn't migrated, otherwise
> accounts against the node we migrated towards.
>
> This seems wrong to me; all three sites should have the same
> semantics. Furthermore, we should account against where the page
> really is; we already know where the task is.
>
> So modify all three sites to always account (we did, after all,
> receive the fault) and always account against where the page is after
> migration, regardless of success.
>
> They all still differ on when they clear the PTE/PMD; ideally that
> would get sorted too.
>
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 07/63] mm: numa: Sanitize task_numa_fault() callsites
@ 2013-10-07 14:02 ` Rik van Riel
0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 14:02 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> There are three callers of task_numa_fault():
>
> - do_huge_pmd_numa_page():
> Accounts against the current node, not the node where the
> page resides, unless we migrated, in which case it accounts
> against the node we migrated to.
>
> - do_numa_page():
> Accounts against the current node, not the node where the
> page resides, unless we migrated, in which case it accounts
> against the node we migrated to.
>
> - do_pmd_numa_page():
> Accounts not at all when the page isn't migrated, otherwise
> accounts against the node we migrated towards.
>
> This seems wrong to me; all three sites should have the same
> semantics. Furthermore, we should account against where the page
> really is; we already know where the task is.
>
> So modify all three sites to always account (we did, after all,
> receive the fault) and always account against where the page is after
> migration, regardless of success.
>
> They all still differ on when they clear the PTE/PMD; ideally that
> would get sorted too.
>
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: numa: Sanitize task_numa_fault() callsites
2013-10-07 10:28 ` Mel Gorman
(?)
(?)
@ 2013-10-09 17:25 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:25 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 8191acbd30c73e45c24ad16c372e0b42cc7ac8f8
Gitweb: http://git.kernel.org/tip/8191acbd30c73e45c24ad16c372e0b42cc7ac8f8
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:45 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:44 +0200
mm: numa: Sanitize task_numa_fault() callsites
There are three callers of task_numa_fault():
- do_huge_pmd_numa_page():
Accounts against the current node, not the node where the
page resides, unless we migrated, in which case it accounts
against the node we migrated to.
- do_numa_page():
Accounts against the current node, not the node where the
page resides, unless we migrated, in which case it accounts
against the node we migrated to.
- do_pmd_numa_page():
Accounts not at all when the page isn't migrated, otherwise
accounts against the node we migrated towards.
This seems wrong to me; all three sites should have the same
semantics. Furthermore, we should account against where the page
really is; we already know where the task is.
So modify all three sites to always account (we did, after all,
receive the fault) and always account against where the page is after
migration, regardless of success.
They all still differ on when they clear the PTE/PMD; ideally that
would get sorted too.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/huge_memory.c | 25 +++++++++++++------------
mm/memory.c | 53 +++++++++++++++++++++--------------------------------
2 files changed, 34 insertions(+), 44 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1d6334f..c3bb65f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1281,18 +1281,19 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
+ int page_nid = -1, this_nid = numa_node_id();
int target_nid;
- int current_nid = -1;
- bool migrated, page_locked;
+ bool page_locked;
+ bool migrated = false;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
page = pmd_page(pmd);
- current_nid = page_to_nid(page);
+ page_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
/*
@@ -1335,19 +1336,18 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
- if (!migrated)
+ if (migrated)
+ page_nid = target_nid;
+ else
goto check_same;
- task_numa_fault(target_nid, HPAGE_PMD_NR, true);
- if (anon_vma)
- page_unlock_anon_vma_read(anon_vma);
- return 0;
+ goto out;
check_same:
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp))) {
/* Someone else took our fault */
- current_nid = -1;
+ page_nid = -1;
goto out_unlock;
}
clear_pmdnuma:
@@ -1362,8 +1362,9 @@ out:
if (anon_vma)
page_unlock_anon_vma_read(anon_vma);
- if (current_nid != -1)
- task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index ca00039..42ae82e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3519,12 +3519,12 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
- unsigned long addr, int current_nid)
+ unsigned long addr, int page_nid)
{
get_page(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (page_nid == numa_node_id())
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
return mpol_misplaced(page, vma, addr);
@@ -3535,7 +3535,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page = NULL;
spinlock_t *ptl;
- int current_nid = -1;
+ int page_nid = -1;
int target_nid;
bool migrated = false;
@@ -3565,15 +3565,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}
- current_nid = page_to_nid(page);
- target_nid = numa_migrate_prep(page, vma, addr, current_nid);
+ page_nid = page_to_nid(page);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
if (target_nid == -1) {
- /*
- * Account for the fault against the current node if it not
- * being replaced regardless of where the page is located.
- */
- current_nid = numa_node_id();
put_page(page);
goto out;
}
@@ -3581,11 +3576,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Migrate to the requested node */
migrated = migrate_misplaced_page(page, target_nid);
if (migrated)
- current_nid = target_nid;
+ page_nid = target_nid;
out:
- if (current_nid != -1)
- task_numa_fault(current_nid, 1, migrated);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, 1, migrated);
return 0;
}
@@ -3600,7 +3595,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
- int local_nid = numa_node_id();
spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3623,9 +3617,10 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
pte_t pteval = *pte;
struct page *page;
- int curr_nid = local_nid;
+ int page_nid = -1;
int target_nid;
- bool migrated;
+ bool migrated = false;
+
if (!pte_present(pteval))
continue;
if (!pte_numa(pteval))
@@ -3647,25 +3642,19 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(page_mapcount(page) != 1))
continue;
- /*
- * Note that the NUMA fault is later accounted to either
- * the node that is currently running or where the page is
- * migrated to.
- */
- curr_nid = local_nid;
- target_nid = numa_migrate_prep(page, vma, addr,
- page_to_nid(page));
- if (target_nid == -1) {
+ page_nid = page_to_nid(page);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+ pte_unmap_unlock(pte, ptl);
+ if (target_nid != -1) {
+ migrated = migrate_misplaced_page(page, target_nid);
+ if (migrated)
+ page_nid = target_nid;
+ } else {
put_page(page);
- continue;
}
- /* Migrate to the requested node */
- pte_unmap_unlock(pte, ptl);
- migrated = migrate_misplaced_page(page, target_nid);
- if (migrated)
- curr_nid = target_nid;
- task_numa_fault(curr_nid, 1, migrated);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, 1, migrated);
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:core/urgent] mm: numa: Sanitize task_numa_fault() callsites
2013-10-07 10:28 ` Mel Gorman
` (2 preceding siblings ...)
(?)
@ 2013-10-29 10:42 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-29 10:42 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, stable, aarcange,
srikar, mgorman, tglx
Commit-ID: c61109e34f60f6e85bb43c5a1cd51c0e3db40847
Gitweb: http://git.kernel.org/tip/c61109e34f60f6e85bb43c5a1cd51c0e3db40847
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:45 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 11:37:52 +0100
mm: numa: Sanitize task_numa_fault() callsites
There are three callers of task_numa_fault():
- do_huge_pmd_numa_page():
Accounts against the current node, not the node where the
page resides, unless we migrated, in which case it accounts
against the node we migrated to.
- do_numa_page():
Accounts against the current node, not the node where the
page resides, unless we migrated, in which case it accounts
against the node we migrated to.
- do_pmd_numa_page():
Accounts not at all when the page isn't migrated, otherwise
accounts against the node we migrated towards.
This seems wrong to me; all three sites should have the same
semantics. Furthermore, we should account against where the page
really is; we already know where the task is.
So modify all three sites to always account (we did, after all,
receive the fault) and always account against where the page is after
migration, regardless of success.
They all still differ on when they clear the PTE/PMD; ideally that
would get sorted too.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/huge_memory.c | 25 +++++++++++++------------
mm/memory.c | 53 +++++++++++++++++++++--------------------------------
2 files changed, 34 insertions(+), 44 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8534b3..00ddfcd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1281,18 +1281,19 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
+ int page_nid = -1, this_nid = numa_node_id();
int target_nid;
- int current_nid = -1;
- bool migrated, page_locked;
+ bool page_locked;
+ bool migrated = false;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
page = pmd_page(pmd);
- current_nid = page_to_nid(page);
+ page_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
/*
@@ -1335,19 +1336,18 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
- if (!migrated)
+ if (migrated)
+ page_nid = target_nid;
+ else
goto check_same;
- task_numa_fault(target_nid, HPAGE_PMD_NR, true);
- if (anon_vma)
- page_unlock_anon_vma_read(anon_vma);
- return 0;
+ goto out;
check_same:
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp))) {
/* Someone else took our fault */
- current_nid = -1;
+ page_nid = -1;
goto out_unlock;
}
clear_pmdnuma:
@@ -1362,8 +1362,9 @@ out:
if (anon_vma)
page_unlock_anon_vma_read(anon_vma);
- if (current_nid != -1)
- task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index 1311f26..d176154 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3521,12 +3521,12 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
- unsigned long addr, int current_nid)
+ unsigned long addr, int page_nid)
{
get_page(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (page_nid == numa_node_id())
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
return mpol_misplaced(page, vma, addr);
@@ -3537,7 +3537,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page = NULL;
spinlock_t *ptl;
- int current_nid = -1;
+ int page_nid = -1;
int target_nid;
bool migrated = false;
@@ -3567,15 +3567,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}
- current_nid = page_to_nid(page);
- target_nid = numa_migrate_prep(page, vma, addr, current_nid);
+ page_nid = page_to_nid(page);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
if (target_nid == -1) {
- /*
- * Account for the fault against the current node if it not
- * being replaced regardless of where the page is located.
- */
- current_nid = numa_node_id();
put_page(page);
goto out;
}
@@ -3583,11 +3578,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Migrate to the requested node */
migrated = migrate_misplaced_page(page, target_nid);
if (migrated)
- current_nid = target_nid;
+ page_nid = target_nid;
out:
- if (current_nid != -1)
- task_numa_fault(current_nid, 1, migrated);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, 1, migrated);
return 0;
}
@@ -3602,7 +3597,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
- int local_nid = numa_node_id();
spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3625,9 +3619,10 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
pte_t pteval = *pte;
struct page *page;
- int curr_nid = local_nid;
+ int page_nid = -1;
int target_nid;
- bool migrated;
+ bool migrated = false;
+
if (!pte_present(pteval))
continue;
if (!pte_numa(pteval))
@@ -3649,25 +3644,19 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(page_mapcount(page) != 1))
continue;
- /*
- * Note that the NUMA fault is later accounted to either
- * the node that is currently running or where the page is
- * migrated to.
- */
- curr_nid = local_nid;
- target_nid = numa_migrate_prep(page, vma, addr,
- page_to_nid(page));
- if (target_nid == -1) {
+ page_nid = page_to_nid(page);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+ pte_unmap_unlock(pte, ptl);
+ if (target_nid != -1) {
+ migrated = migrate_misplaced_page(page, target_nid);
+ if (migrated)
+ page_nid = target_nid;
+ } else {
put_page(page);
- continue;
}
- /* Migrate to the requested node */
- pte_unmap_unlock(pte, ptl);
- migrated = migrate_misplaced_page(page, target_nid);
- if (migrated)
- curr_nid = target_nid;
- task_numa_fault(curr_nid, 1, migrated);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, 1, migrated);
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 08/63] mm: Close races between THP migration and PMD numa clearing
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
THP migration uses the page lock to guard against parallel allocations
but there are cases like this still open
Task A                              Task B
do_huge_pmd_numa_page               do_huge_pmd_numa_page
lock_page
mpol_misplaced == -1
unlock_page
goto clear_pmdnuma
                                    lock_page
                                    mpol_misplaced == 2
                                    migrate_misplaced_transhuge
pmd = pmd_mknonnuma
set_pmd_at
During hours of testing, one crashed with weird errors and while I have
no direct evidence, I suspect something like the race above happened.
This patch extends the page lock to being held until the pmd_numa is
cleared to prevent migration starting in parallel while the pmd_numa is
being cleared. It also flushes the old pmd entry and orders pagetable
insertion before rmap insertion.
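The migrate.c half of the change boils down to an ordering requirement:
flush the old pmd, install the pmd for the new page, and only then hook the
new page into the rmap. A printf-level sketch of that order follows; the
names mirror the kernel functions but the bodies are stubs, so this only
illustrates the sequence, nothing more:
/* Sketch of the insertion ordering enforced in
 * migrate_misplaced_transhuge_page() by this patch.  Stubs only. */
#include <stdio.h>

static void pmdp_clear_flush(void)       { puts("1. flush the old pmd (and TLB)"); }
static void set_pmd_at(void)             { puts("2. install the pmd for the new page"); }
static void page_add_new_anon_rmap(void) { puts("3. add the rmap for the new page"); }
static void update_mmu_cache_pmd(void)   { puts("4. update the MMU cache"); }

int main(void)
{
        /* Before the patch the rmap was added ahead of set_pmd_at(); the
         * fix adds the flush of the old entry and moves the rmap insertion
         * after the page table insertion. */
        pmdp_clear_flush();
        set_pmd_at();
        page_add_new_anon_rmap();
        update_mmu_cache_pmd();
        return 0;
}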
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/huge_memory.c | 33 +++++++++++++++------------------
mm/migrate.c | 19 +++++++++++--------
2 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bb65f..d4928769 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1304,24 +1304,25 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
target_nid = mpol_misplaced(page, vma, haddr);
if (target_nid == -1) {
/* If the page was locked, there are no parallel migrations */
- if (page_locked) {
- unlock_page(page);
+ if (page_locked)
goto clear_pmdnuma;
- }
- /* Otherwise wait for potential migrations and retry fault */
+ /*
+ * Otherwise wait for potential migrations and retry. We do
+ * relock and check_same as the page may no longer be mapped.
+ * As the fault is being retried, do not account for it.
+ */
spin_unlock(&mm->page_table_lock);
wait_on_page_locked(page);
+ page_nid = -1;
goto out;
}
/* Page is misplaced, serialise migrations and parallel THP splits */
get_page(page);
spin_unlock(&mm->page_table_lock);
- if (!page_locked) {
+ if (!page_locked)
lock_page(page);
- page_locked = true;
- }
anon_vma = page_lock_anon_vma_read(page);
/* Confirm the PMD did not change while page_table_lock was released */
@@ -1329,32 +1330,28 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!pmd_same(pmd, *pmdp))) {
unlock_page(page);
put_page(page);
+ page_nid = -1;
goto out_unlock;
}
- /* Migrate the THP to the requested node */
+ /*
+ * Migrate the THP to the requested node, returns with page unlocked
+ * and pmd_numa cleared.
+ */
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
if (migrated)
page_nid = target_nid;
- else
- goto check_same;
goto out;
-
-check_same:
- spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(pmd, *pmdp))) {
- /* Someone else took our fault */
- page_nid = -1;
- goto out_unlock;
- }
clear_pmdnuma:
+ BUG_ON(!PageLocked(page));
pmd = pmd_mknonnuma(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
VM_BUG_ON(pmd_numa(*pmdp));
update_mmu_cache_pmd(vma, addr, pmdp);
+ unlock_page(page);
out_unlock:
spin_unlock(&mm->page_table_lock);
diff --git a/mm/migrate.c b/mm/migrate.c
index 9c8d5f5..ce8c3a0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1713,12 +1713,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
unlock_page(new_page);
put_page(new_page); /* Free it */
- unlock_page(page);
+ /* Retake the callers reference and putback on LRU */
+ get_page(page);
putback_lru_page(page);
-
- count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
- isolated = 0;
- goto out;
+ mod_zone_page_state(page_zone(page),
+ NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
+ goto out_fail;
}
/*
@@ -1735,9 +1735,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
- page_add_new_anon_rmap(new_page, vma, haddr);
-
+ pmdp_clear_flush(vma, haddr, pmd);
set_pmd_at(mm, haddr, pmd, entry);
+ page_add_new_anon_rmap(new_page, vma, haddr);
update_mmu_cache_pmd(vma, address, &entry);
page_remove_rmap(page);
/*
@@ -1756,7 +1756,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
-out:
mod_zone_page_state(page_zone(page),
NR_ISOLATED_ANON + page_lru,
-HPAGE_PMD_NR);
@@ -1765,6 +1764,10 @@ out:
out_fail:
count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
out_dropref:
+ entry = pmd_mknonnuma(entry);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, &entry);
+
unlock_page(page);
put_page(page);
return 0;
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 08/63] mm: Close races between THP migration and PMD numa clearing
@ 2013-10-07 10:28 ` Mel Gorman
0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
THP migration uses the page lock to guard against parallel allocations
but there are cases like this still open
Task A                              Task B
do_huge_pmd_numa_page               do_huge_pmd_numa_page
lock_page
mpol_misplaced == -1
unlock_page
goto clear_pmdnuma
                                    lock_page
                                    mpol_misplaced == 2
                                    migrate_misplaced_transhuge
pmd = pmd_mknonnuma
set_pmd_at
During hours of testing, one crashed with weird errors and while I have
no direct evidence, I suspect something like the race above happened.
This patch extends the page lock to being held until the pmd_numa is
cleared to prevent migration starting in parallel while the pmd_numa is
being cleared. It also flushes the old pmd entry and orders pagetable
insertion before rmap insertion.
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/huge_memory.c | 33 +++++++++++++++------------------
mm/migrate.c | 19 +++++++++++--------
2 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bb65f..d4928769 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1304,24 +1304,25 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
target_nid = mpol_misplaced(page, vma, haddr);
if (target_nid == -1) {
/* If the page was locked, there are no parallel migrations */
- if (page_locked) {
- unlock_page(page);
+ if (page_locked)
goto clear_pmdnuma;
- }
- /* Otherwise wait for potential migrations and retry fault */
+ /*
+ * Otherwise wait for potential migrations and retry. We do
+ * relock and check_same as the page may no longer be mapped.
+ * As the fault is being retried, do not account for it.
+ */
spin_unlock(&mm->page_table_lock);
wait_on_page_locked(page);
+ page_nid = -1;
goto out;
}
/* Page is misplaced, serialise migrations and parallel THP splits */
get_page(page);
spin_unlock(&mm->page_table_lock);
- if (!page_locked) {
+ if (!page_locked)
lock_page(page);
- page_locked = true;
- }
anon_vma = page_lock_anon_vma_read(page);
/* Confirm the PMD did not change while page_table_lock was released */
@@ -1329,32 +1330,28 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!pmd_same(pmd, *pmdp))) {
unlock_page(page);
put_page(page);
+ page_nid = -1;
goto out_unlock;
}
- /* Migrate the THP to the requested node */
+ /*
+ * Migrate the THP to the requested node, returns with page unlocked
+ * and pmd_numa cleared.
+ */
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
if (migrated)
page_nid = target_nid;
- else
- goto check_same;
goto out;
-
-check_same:
- spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(pmd, *pmdp))) {
- /* Someone else took our fault */
- page_nid = -1;
- goto out_unlock;
- }
clear_pmdnuma:
+ BUG_ON(!PageLocked(page));
pmd = pmd_mknonnuma(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
VM_BUG_ON(pmd_numa(*pmdp));
update_mmu_cache_pmd(vma, addr, pmdp);
+ unlock_page(page);
out_unlock:
spin_unlock(&mm->page_table_lock);
diff --git a/mm/migrate.c b/mm/migrate.c
index 9c8d5f5..ce8c3a0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1713,12 +1713,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
unlock_page(new_page);
put_page(new_page); /* Free it */
- unlock_page(page);
+ /* Retake the callers reference and putback on LRU */
+ get_page(page);
putback_lru_page(page);
-
- count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
- isolated = 0;
- goto out;
+ mod_zone_page_state(page_zone(page),
+ NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
+ goto out_fail;
}
/*
@@ -1735,9 +1735,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
- page_add_new_anon_rmap(new_page, vma, haddr);
-
+ pmdp_clear_flush(vma, haddr, pmd);
set_pmd_at(mm, haddr, pmd, entry);
+ page_add_new_anon_rmap(new_page, vma, haddr);
update_mmu_cache_pmd(vma, address, &entry);
page_remove_rmap(page);
/*
@@ -1756,7 +1756,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
-out:
mod_zone_page_state(page_zone(page),
NR_ISOLATED_ANON + page_lru,
-HPAGE_PMD_NR);
@@ -1765,6 +1764,10 @@ out:
out_fail:
count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
out_dropref:
+ entry = pmd_mknonnuma(entry);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, &entry);
+
unlock_page(page);
put_page(page);
return 0;
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 08/63] mm: Close races between THP migration and PMD numa clearing
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 14:02 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 14:02 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> THP migration uses the page lock to guard against parallel allocations
> but there are cases like this still open
>
> Task A                              Task B
> do_huge_pmd_numa_page               do_huge_pmd_numa_page
> lock_page
> mpol_misplaced == -1
> unlock_page
> goto clear_pmdnuma
>                                     lock_page
>                                     mpol_misplaced == 2
>                                     migrate_misplaced_transhuge
> pmd = pmd_mknonnuma
> set_pmd_at
>
> During hours of testing, one crashed with weird errors and while I have
> no direct evidence, I suspect something like the race above happened.
> This patch extends the page lock to being held until the pmd_numa is
> cleared to prevent migration starting in parallel while the pmd_numa is
> being cleared. It also flushes the old pmd entry and orders pagetable
> insertion before rmap insertion.
>
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 08/63] mm: Close races between THP migration and PMD numa clearing
@ 2013-10-07 14:02 ` Rik van Riel
0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 14:02 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> THP migration uses the page lock to guard against parallel allocations
> but there are cases like this still open
>
> Task A                              Task B
> do_huge_pmd_numa_page               do_huge_pmd_numa_page
> lock_page
> mpol_misplaced == -1
> unlock_page
> goto clear_pmdnuma
>                                     lock_page
>                                     mpol_misplaced == 2
>                                     migrate_misplaced_transhuge
> pmd = pmd_mknonnuma
> set_pmd_at
>
> During hours of testing, one crashed with weird errors and while I have
> no direct evidence, I suspect something like the race above happened.
> This patch extends the page lock to being held until the pmd_numa is
> cleared to prevent migration starting in parallel while the pmd_numa is
> being cleared. It also flushes the old pmd entry and orders pagetable
> insertion before rmap insertion.
>
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: Close races between THP migration and PMD numa clearing
2013-10-07 10:28 ` Mel Gorman
(?)
(?)
@ 2013-10-09 17:25 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:25 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: a54a407fbf7735fd8f7841375574f5d9b0375f93
Gitweb: http://git.kernel.org/tip/a54a407fbf7735fd8f7841375574f5d9b0375f93
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:46 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:45 +0200
mm: Close races between THP migration and PMD numa clearing
THP migration uses the page lock to guard against parallel allocations
but there are cases like this still open
Task A                              Task B
---------------------               ---------------------
do_huge_pmd_numa_page               do_huge_pmd_numa_page
lock_page
mpol_misplaced == -1
unlock_page
goto clear_pmdnuma
                                    lock_page
                                    mpol_misplaced == 2
                                    migrate_misplaced_transhuge
pmd = pmd_mknonnuma
set_pmd_at
During hours of testing, one crashed with weird errors and while I have
no direct evidence, I suspect something like the race above happened.
This patch extends the page lock to being held until the pmd_numa is
cleared to prevent migration starting in parallel while the pmd_numa is
being cleared. It also flushes the old pmd entry and orders pagetable
insertion before rmap insertion.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/huge_memory.c | 33 +++++++++++++++------------------
mm/migrate.c | 19 +++++++++++--------
2 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bb65f..d4928769 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1304,24 +1304,25 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
target_nid = mpol_misplaced(page, vma, haddr);
if (target_nid == -1) {
/* If the page was locked, there are no parallel migrations */
- if (page_locked) {
- unlock_page(page);
+ if (page_locked)
goto clear_pmdnuma;
- }
- /* Otherwise wait for potential migrations and retry fault */
+ /*
+ * Otherwise wait for potential migrations and retry. We do
+ * relock and check_same as the page may no longer be mapped.
+ * As the fault is being retried, do not account for it.
+ */
spin_unlock(&mm->page_table_lock);
wait_on_page_locked(page);
+ page_nid = -1;
goto out;
}
/* Page is misplaced, serialise migrations and parallel THP splits */
get_page(page);
spin_unlock(&mm->page_table_lock);
- if (!page_locked) {
+ if (!page_locked)
lock_page(page);
- page_locked = true;
- }
anon_vma = page_lock_anon_vma_read(page);
/* Confirm the PMD did not change while page_table_lock was released */
@@ -1329,32 +1330,28 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!pmd_same(pmd, *pmdp))) {
unlock_page(page);
put_page(page);
+ page_nid = -1;
goto out_unlock;
}
- /* Migrate the THP to the requested node */
+ /*
+ * Migrate the THP to the requested node, returns with page unlocked
+ * and pmd_numa cleared.
+ */
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
if (migrated)
page_nid = target_nid;
- else
- goto check_same;
goto out;
-
-check_same:
- spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(pmd, *pmdp))) {
- /* Someone else took our fault */
- page_nid = -1;
- goto out_unlock;
- }
clear_pmdnuma:
+ BUG_ON(!PageLocked(page));
pmd = pmd_mknonnuma(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
VM_BUG_ON(pmd_numa(*pmdp));
update_mmu_cache_pmd(vma, addr, pmdp);
+ unlock_page(page);
out_unlock:
spin_unlock(&mm->page_table_lock);
diff --git a/mm/migrate.c b/mm/migrate.c
index a26bccd..7bd90d3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1713,12 +1713,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
unlock_page(new_page);
put_page(new_page); /* Free it */
- unlock_page(page);
+ /* Retake the callers reference and putback on LRU */
+ get_page(page);
putback_lru_page(page);
-
- count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
- isolated = 0;
- goto out;
+ mod_zone_page_state(page_zone(page),
+ NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
+ goto out_fail;
}
/*
@@ -1735,9 +1735,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
- page_add_new_anon_rmap(new_page, vma, haddr);
-
+ pmdp_clear_flush(vma, haddr, pmd);
set_pmd_at(mm, haddr, pmd, entry);
+ page_add_new_anon_rmap(new_page, vma, haddr);
update_mmu_cache_pmd(vma, address, &entry);
page_remove_rmap(page);
/*
@@ -1756,7 +1756,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
-out:
mod_zone_page_state(page_zone(page),
NR_ISOLATED_ANON + page_lru,
-HPAGE_PMD_NR);
@@ -1765,6 +1764,10 @@ out:
out_fail:
count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
out_dropref:
+ entry = pmd_mknonnuma(entry);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, &entry);
+
unlock_page(page);
put_page(page);
return 0;
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:core/urgent] mm: Close races between THP migration and PMD numa clearing
2013-10-07 10:28 ` Mel Gorman
` (2 preceding siblings ...)
(?)
@ 2013-10-29 10:42 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-29 10:42 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, stable, aarcange,
srikar, mgorman, tglx
Commit-ID: 3f926ab945b60a5824369d21add7710622a2eac0
Gitweb: http://git.kernel.org/tip/3f926ab945b60a5824369d21add7710622a2eac0
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:46 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 11:38:05 +0100
mm: Close races between THP migration and PMD numa clearing
THP migration uses the page lock to guard against parallel allocations
but there are cases like this still open
Task A                              Task B
---------------------               ---------------------
do_huge_pmd_numa_page               do_huge_pmd_numa_page
lock_page
mpol_misplaced == -1
unlock_page
goto clear_pmdnuma
                                    lock_page
                                    mpol_misplaced == 2
                                    migrate_misplaced_transhuge
pmd = pmd_mknonnuma
set_pmd_at
During hours of testing, one crashed with weird errors and while I have
no direct evidence, I suspect something like the race above happened.
This patch extends the page lock to being held until the pmd_numa is
cleared to prevent migration starting in parallel while the pmd_numa is
being cleared. It also flushes the old pmd entry and orders pagetable
insertion before rmap insertion.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/huge_memory.c | 33 +++++++++++++++------------------
mm/migrate.c | 19 +++++++++++--------
2 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 00ddfcd..cca80d9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1304,24 +1304,25 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
target_nid = mpol_misplaced(page, vma, haddr);
if (target_nid == -1) {
/* If the page was locked, there are no parallel migrations */
- if (page_locked) {
- unlock_page(page);
+ if (page_locked)
goto clear_pmdnuma;
- }
- /* Otherwise wait for potential migrations and retry fault */
+ /*
+ * Otherwise wait for potential migrations and retry. We do
+ * relock and check_same as the page may no longer be mapped.
+ * As the fault is being retried, do not account for it.
+ */
spin_unlock(&mm->page_table_lock);
wait_on_page_locked(page);
+ page_nid = -1;
goto out;
}
/* Page is misplaced, serialise migrations and parallel THP splits */
get_page(page);
spin_unlock(&mm->page_table_lock);
- if (!page_locked) {
+ if (!page_locked)
lock_page(page);
- page_locked = true;
- }
anon_vma = page_lock_anon_vma_read(page);
/* Confirm the PTE did not while locked */
@@ -1329,32 +1330,28 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!pmd_same(pmd, *pmdp))) {
unlock_page(page);
put_page(page);
+ page_nid = -1;
goto out_unlock;
}
- /* Migrate the THP to the requested node */
+ /*
+ * Migrate the THP to the requested node, returns with page unlocked
+ * and pmd_numa cleared.
+ */
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
if (migrated)
page_nid = target_nid;
- else
- goto check_same;
goto out;
-
-check_same:
- spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(pmd, *pmdp))) {
- /* Someone else took our fault */
- page_nid = -1;
- goto out_unlock;
- }
clear_pmdnuma:
+ BUG_ON(!PageLocked(page));
pmd = pmd_mknonnuma(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
VM_BUG_ON(pmd_numa(*pmdp));
update_mmu_cache_pmd(vma, addr, pmdp);
+ unlock_page(page);
out_unlock:
spin_unlock(&mm->page_table_lock);
diff --git a/mm/migrate.c b/mm/migrate.c
index 7a7325e..c046927 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1715,12 +1715,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
unlock_page(new_page);
put_page(new_page); /* Free it */
- unlock_page(page);
+ /* Retake the callers reference and putback on LRU */
+ get_page(page);
putback_lru_page(page);
-
- count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
- isolated = 0;
- goto out;
+ mod_zone_page_state(page_zone(page),
+ NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
+ goto out_fail;
}
/*
@@ -1737,9 +1737,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
- page_add_new_anon_rmap(new_page, vma, haddr);
-
+ pmdp_clear_flush(vma, haddr, pmd);
set_pmd_at(mm, haddr, pmd, entry);
+ page_add_new_anon_rmap(new_page, vma, haddr);
update_mmu_cache_pmd(vma, address, &entry);
page_remove_rmap(page);
/*
@@ -1758,7 +1758,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
-out:
mod_zone_page_state(page_zone(page),
NR_ISOLATED_ANON + page_lru,
-HPAGE_PMD_NR);
@@ -1767,6 +1766,10 @@ out:
out_fail:
count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
out_dropref:
+ entry = pmd_mknonnuma(entry);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, &entry);
+
unlock_page(page);
put_page(page);
return 0;
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 09/63] mm: Account for a THP NUMA hinting update as one PTE update
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
A THP PMD update is accounted for as 512 pages updated in vmstat. This is
a large difference when estimating the cost of automatic NUMA balancing and
can be misleading when comparing results that had collapsed versus split
THP. This patch addresses the accounting issue.
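As an illustrative aside, the factor of 512 is just the number of base pages
covered by one PMD; a standalone sketch assuming the common x86-64
configuration of 4KiB base pages and 2MiB transparent huge pages (not kernel
code):
#include <stdio.h>

int main(void)
{
	const long base_page = 4096;             /* assumed 4KiB base page */
	const long thp_size  = 2L * 1024 * 1024; /* assumed 2MiB transparent huge page */

	/* HPAGE_PMD_NR on such a system: 2MiB / 4KiB = 512 */
	printf("one THP hinting update used to be counted as %ld PTE updates\n",
	       thp_size / base_page);
	printf("after this patch it is counted as 1\n");
	return 0;
}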
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/mprotect.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..2bbb648 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
split_huge_page_pmd(vma, addr, pmd);
else if (change_huge_pmd(vma, pmd, addr, newprot,
prot_numa)) {
- pages += HPAGE_PMD_NR;
+ pages++;
continue;
}
/* fall through */
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 09/63] mm: Account for a THP NUMA hinting update as one PTE update
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 14:02 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 14:02 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> A THP PMD update is accounted for as 512 pages updated in vmstat. This is
> a large difference when estimating the cost of automatic NUMA balancing and
> can be misleading when comparing results that had collapsed versus split
> THP. This patch addresses the accounting issue.
>
> Cc: stable <stable@vger.kernel.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: Account for a THP NUMA hinting update as one PTE update
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:25 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:25 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: afcae2655b0ab67e65f161b1bb214efcfa1db415
Gitweb: http://git.kernel.org/tip/afcae2655b0ab67e65f161b1bb214efcfa1db415
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:47 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:46 +0200
mm: Account for a THP NUMA hinting update as one PTE update
A THP PMD update is accounted for as 512 pages updated in vmstat. This is
a large difference when estimating the cost of automatic NUMA balancing and
can be misleading when comparing results that had collapsed versus split
THP. This patch addresses the accounting issue.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/mprotect.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..2bbb648 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
split_huge_page_pmd(vma, addr, pmd);
else if (change_huge_pmd(vma, pmd, addr, newprot,
prot_numa)) {
- pages += HPAGE_PMD_NR;
+ pages++;
continue;
}
/* fall through */
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:core/urgent] mm: Account for a THP NUMA hinting update as one PTE update
2013-10-07 10:28 ` Mel Gorman
` (2 preceding siblings ...)
@ 2013-10-29 10:43 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-29 10:43 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, stable, aarcange,
srikar, mgorman, tglx
Commit-ID: 0255d491848032f6c601b6410c3b8ebded3a37b1
Gitweb: http://git.kernel.org/tip/0255d491848032f6c601b6410c3b8ebded3a37b1
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:47 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 11:38:17 +0100
mm: Account for a THP NUMA hinting update as one PTE update
A THP PMD update is accounted for as 512 pages updated in vmstat. This is
a large difference when estimating the cost of automatic NUMA balancing and
can be misleading when comparing results that had collapsed versus split
THP. This patch addresses the accounting issue.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/mprotect.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a3af058..412ba2b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -148,7 +148,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
split_huge_page_pmd(vma, addr, pmd);
else if (change_huge_pmd(vma, pmd, addr, newprot,
prot_numa)) {
- pages += HPAGE_PMD_NR;
+ pages++;
continue;
}
/* fall through */
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 10/63] mm: Do not flush TLB during protection change if !pte_present && !migration_entry
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
NUMA PTE scanning is expensive both in terms of the scanning itself and
the TLB flush if there are any updates. Currently non-present PTEs are
accounted for as an update and incur a TLB flush even though one is only
necessary for anonymous migration entries. This patch addresses the
problem and should reduce TLB flushes.
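To illustrate the intended accounting (a minimal userspace model under
assumed semantics, not the kernel code): only entries that are actually
rewritten bump the pages count that later decides whether a TLB flush is
issued.
#include <stdbool.h>
#include <stdio.h>

struct pte_model {
	bool present;         /* stands in for pte_present() */
	bool anon_migration;  /* migration entry that must be made read-only */
};

/* returns how many entries were really modified */
static int scan_range(const struct pte_model *pte, int n)
{
	int pages = 0;

	for (int i = 0; i < n; i++) {
		if (pte[i].present) {
			pages++;                 /* protection actually changed */
		} else if (pte[i].anon_migration) {
			pages++;                 /* swap entry rewritten */
		}
		/* any other non-present entry: nothing written, nothing counted */
	}
	return pages;
}

int main(void)
{
	struct pte_model range[] = {
		{ true,  false },   /* normal mapped page */
		{ false, false },   /* plain hole / swapped out: no update */
		{ false, true  },   /* anon migration entry: update */
	};
	int pages = scan_range(range, 3);

	printf("%d updates -> %s TLB flush\n", pages, pages ? "issue" : "skip");
	return 0;
}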
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/mprotect.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2bbb648..7bdbd4b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -101,8 +101,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
make_migration_entry_read(&entry);
set_pte_at(mm, addr, pte,
swp_entry_to_pte(entry));
+
+ pages++;
}
- pages++;
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 10/63] mm: Do not flush TLB during protection change if !pte_present && !migration_entry
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 15:12 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 15:12 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> NUMA PTE scanning is expensive both in terms of the scanning itself and
> the TLB flush if there are any updates. Currently non-present PTEs are
> accounted for as an update and incur a TLB flush even though one is only
> necessary for anonymous migration entries. This patch addresses the
> problem and should reduce TLB flushes.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: Do not flush TLB during protection change if !pte_present && !migration_entry
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:25 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:25 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: e920e14ca29b0b2a981cfc90e4e20edd6f078d19
Gitweb: http://git.kernel.org/tip/e920e14ca29b0b2a981cfc90e4e20edd6f078d19
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:48 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:48 +0200
mm: Do not flush TLB during protection change if !pte_present && !migration_entry
NUMA PTE scanning is expensive both in terms of the scanning itself and
the TLB flush if there are any updates. Currently non-present PTEs are
accounted for as an update and incur a TLB flush even though one is only
necessary for anonymous migration entries. This patch addresses the
problem and should reduce TLB flushes.
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-11-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/mprotect.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2bbb648..7bdbd4b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -101,8 +101,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
make_migration_entry_read(&entry);
set_pte_at(mm, addr, pte,
swp_entry_to_pte(entry));
+
+ pages++;
}
- pages++;
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 11/63] mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
NUMA PTE scanning is expensive both in terms of the scanning itself and
the TLB flush if there are any updates. The TLB flush is avoided if no
PTEs are updated but there is a bug where transhuge PMDs are considered
to be updated even if they were already pmd_numa. This patch addresses
the problem and TLB flushes should be reduced.
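A rough userspace sketch of the return-value contract described above and of
how the caller in change_pmd_range() consumes it (the helper below is a
stand-in with made-up inputs, not the kernel function):
#include <stdio.h>

#define HPAGE_PMD_NR 512   /* assumed: 2MiB THP / 4KiB base pages */

/*
 * Stand-in for change_huge_pmd():
 *   0            - PMD lock not taken, caller falls through to the PTE path
 *   1            - locked but nothing changed, no TLB flush needed
 *   HPAGE_PMD_NR - protections changed, TLB flush needed
 */
static int change_huge_pmd_model(int locked, int already_numa)
{
	if (!locked)
		return 0;
	if (already_numa)
		return 1;
	return HPAGE_PMD_NR;
}

int main(void)
{
	long pages = 0;
	int cases[][2] = { {1, 0}, {1, 1}, {0, 0} };

	for (int i = 0; i < 3; i++) {
		int nr_ptes = change_huge_pmd_model(cases[i][0], cases[i][1]);

		if (nr_ptes) {
			if (nr_ptes == HPAGE_PMD_NR)
				pages++;    /* counted once, flushed later */
			continue;           /* THP handled, skip the PTE walk */
		}
		/* nr_ptes == 0: the real code falls through to the per-PTE path */
	}
	printf("pages counted: %ld\n", pages);
	return 0;
}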
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/huge_memory.c | 19 ++++++++++++++++---
mm/mprotect.c | 14 ++++++++++----
2 files changed, 26 insertions(+), 7 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d4928769..de8d5cf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1458,6 +1458,12 @@ out:
return ret;
}
+/*
+ * Returns
+ * - 0 if PMD could not be locked
+ * - 1 if PMD was locked but protections unchanged and TLB flush unnecessary
+ * - HPAGE_PMD_NR if protections changed and TLB flush necessary
+ */
int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot, int prot_numa)
{
@@ -1466,9 +1472,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
if (__pmd_trans_huge_lock(pmd, vma) == 1) {
pmd_t entry;
- entry = pmdp_get_and_clear(mm, addr, pmd);
+ ret = 1;
if (!prot_numa) {
+ entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
+ ret = HPAGE_PMD_NR;
BUG_ON(pmd_write(entry));
} else {
struct page *page = pmd_page(*pmd);
@@ -1476,12 +1484,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
/* only check non-shared pages */
if (page_mapcount(page) == 1 &&
!pmd_numa(*pmd)) {
+ entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_mknuma(entry);
+ ret = HPAGE_PMD_NR;
}
}
- set_pmd_at(mm, addr, pmd, entry);
+
+ /* Set PMD if cleared earlier */
+ if (ret == HPAGE_PMD_NR)
+ set_pmd_at(mm, addr, pmd, entry);
+
spin_unlock(&vma->vm_mm->page_table_lock);
- ret = 1;
}
return ret;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7bdbd4b..2da33dc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -144,10 +144,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
split_huge_page_pmd(vma, addr, pmd);
- else if (change_huge_pmd(vma, pmd, addr, newprot,
- prot_numa)) {
- pages++;
- continue;
+ else {
+ int nr_ptes = change_huge_pmd(vma, pmd, addr,
+ newprot, prot_numa);
+
+ if (nr_ptes) {
+ if (nr_ptes == HPAGE_PMD_NR)
+ pages++;
+
+ continue;
+ }
}
/* fall through */
}
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:25 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:25 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: f123d74abf91574837d14e5ea58f6a779a387bf5
Gitweb: http://git.kernel.org/tip/f123d74abf91574837d14e5ea58f6a779a387bf5
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:49 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:49 +0200
mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning
NUMA PTE scanning is expensive both in terms of the scanning itself and
the TLB flush if there are any updates. The TLB flush is avoided if no
PTEs are updated but there is a bug where transhuge PMDs are considered
to be updated even if they were already pmd_numa. This patch addresses
the problem and TLB flushes should be reduced.
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-12-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/huge_memory.c | 19 ++++++++++++++++---
mm/mprotect.c | 14 ++++++++++----
2 files changed, 26 insertions(+), 7 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d4928769..de8d5cf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1458,6 +1458,12 @@ out:
return ret;
}
+/*
+ * Returns
+ * - 0 if PMD could not be locked
+ * - 1 if PMD was locked but protections unchanged and TLB flush unnecessary
+ * - HPAGE_PMD_NR if protections changed and TLB flush necessary
+ */
int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot, int prot_numa)
{
@@ -1466,9 +1472,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
if (__pmd_trans_huge_lock(pmd, vma) == 1) {
pmd_t entry;
- entry = pmdp_get_and_clear(mm, addr, pmd);
+ ret = 1;
if (!prot_numa) {
+ entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
+ ret = HPAGE_PMD_NR;
BUG_ON(pmd_write(entry));
} else {
struct page *page = pmd_page(*pmd);
@@ -1476,12 +1484,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
/* only check non-shared pages */
if (page_mapcount(page) == 1 &&
!pmd_numa(*pmd)) {
+ entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_mknuma(entry);
+ ret = HPAGE_PMD_NR;
}
}
- set_pmd_at(mm, addr, pmd, entry);
+
+ /* Set PMD if cleared earlier */
+ if (ret == HPAGE_PMD_NR)
+ set_pmd_at(mm, addr, pmd, entry);
+
spin_unlock(&vma->vm_mm->page_table_lock);
- ret = 1;
}
return ret;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7bdbd4b..2da33dc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -144,10 +144,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
split_huge_page_pmd(vma, addr, pmd);
- else if (change_huge_pmd(vma, pmd, addr, newprot,
- prot_numa)) {
- pages++;
- continue;
+ else {
+ int nr_ptes = change_huge_pmd(vma, pmd, addr,
+ newprot, prot_numa);
+
+ if (nr_ptes) {
+ if (nr_ptes == HPAGE_PMD_NR)
+ pages++;
+
+ continue;
+ }
}
/* fall through */
}
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 12/63] mm: numa: Do not migrate or account for hinting faults on the zero page
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
The zero page is not replicated between nodes and is often shared between
processes. The data is read-only and likely to be cached in local CPUs
if heavily accessed, meaning that the remote memory access cost is less
of a concern. This patch prevents trapping faults on the zero pages. For
tasks using the zero page this will reduce the number of PTE updates,
TLB flushes and hinting faults.
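A hedged userspace sketch of the extra filter this adds to the scanner's
decision (the fields and helper below are stand-ins for
is_huge_zero_page()/is_zero_pfn() and the mapcount check, not the real API):
#include <stdbool.h>
#include <stdio.h>

struct page_model {
	bool zero_page;   /* shared zero page, base or huge */
	int  mapcount;    /* number of mappings */
};

/* should this page be marked for NUMA hinting faults? */
static bool trap_hinting_fault(const struct page_model *page)
{
	if (page->zero_page)
		return false;        /* never worth migrating or accounting */
	if (page->mapcount != 1)
		return false;        /* shared pages handled separately */
	return true;
}

int main(void)
{
	struct page_model zero    = { true,  100 };
	struct page_model private = { false, 1 };

	printf("zero page: %s\n", trap_hinting_fault(&zero) ? "trap" : "skip");
	printf("private page: %s\n", trap_hinting_fault(&private) ? "trap" : "skip");
	return 0;
}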
[peterz@infradead.org: Correct use of is_huge_zero_page]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/huge_memory.c | 10 +++++++++-
mm/memory.c | 1 +
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de8d5cf..8677dbf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,6 +1291,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_unlock;
page = pmd_page(pmd);
+ BUG_ON(is_huge_zero_page(page));
page_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (page_nid == this_nid)
@@ -1481,8 +1482,15 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
} else {
struct page *page = pmd_page(*pmd);
- /* only check non-shared pages */
+ /*
+ * Only check non-shared pages. Do not trap faults
+ * against the zero page. The read-only data is likely
+ * to be read-cached on the local CPU cache and it is
+ * less useful to know about local vs remote hits on
+ * the zero page.
+ */
if (page_mapcount(page) == 1 &&
+ !is_huge_zero_page(page) &&
!pmd_numa(*pmd)) {
entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_mknuma(entry);
diff --git a/mm/memory.c b/mm/memory.c
index 42ae82e..ed51f15 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3564,6 +3564,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_unmap_unlock(ptep, ptl);
return 0;
}
+ BUG_ON(is_zero_pfn(page_to_pfn(page)));
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 12/63] mm: numa: Do not migrate or account for hinting faults on the zero page
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 17:10 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 17:10 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> The zero page is not replicated between nodes and is often shared between
> processes. The data is read-only and likely to be cached in local CPUs
> if heavily accessed, meaning that the remote memory access cost is less
> of a concern. This patch prevents trapping faults on the zero pages. For
> tasks using the zero page this will reduce the number of PTE updates,
> TLB flushes and hinting faults.
>
> [peterz@infradead.org: Correct use of is_huge_zero_page]
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: numa: Do not migrate or account for hinting faults on the zero page
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:25 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:25 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: a1a46184e34cfd0764f06a54870defa052b0a094
Gitweb: http://git.kernel.org/tip/a1a46184e34cfd0764f06a54870defa052b0a094
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:50 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:50 +0200
mm: numa: Do not migrate or account for hinting faults on the zero page
The zero page is not replicated between nodes and is often shared between
processes. The data is read-only and likely to be cached in local CPUs
if heavily accessed, meaning that the remote memory access cost is less
of a concern. This patch prevents trapping faults on the zero pages. For
tasks using the zero page this will reduce the number of PTE updates,
TLB flushes and hinting faults.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ Correct use of is_huge_zero_page]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-13-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/huge_memory.c | 10 +++++++++-
mm/memory.c | 1 +
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de8d5cf..8677dbf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,6 +1291,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_unlock;
page = pmd_page(pmd);
+ BUG_ON(is_huge_zero_page(page));
page_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (page_nid == this_nid)
@@ -1481,8 +1482,15 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
} else {
struct page *page = pmd_page(*pmd);
- /* only check non-shared pages */
+ /*
+ * Only check non-shared pages. Do not trap faults
+ * against the zero page. The read-only data is likely
+ * to be read-cached on the local CPU cache and it is
+ * less useful to know about local vs remote hits on
+ * the zero page.
+ */
if (page_mapcount(page) == 1 &&
+ !is_huge_zero_page(page) &&
!pmd_numa(*pmd)) {
entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_mknuma(entry);
diff --git a/mm/memory.c b/mm/memory.c
index 42ae82e..ed51f15 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3564,6 +3564,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_unmap_unlock(ptep, ptl);
return 0;
}
+ BUG_ON(is_zero_pfn(page_to_pfn(page)));
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 13/63] sched: numa: Mitigate chance that same task always updates PTEs
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
With a trace_printk("working\n"); right after the cmpxchg in
task_numa_work() we can see that in a 4-thread process, it's always the
same task winning the race and doing the protection change.
This is a problem since the task doing the protection change has a
penalty for taking faults -- it is busy when marking the PTEs. If it's
always the same task, the ->numa_faults[] statistics get severely skewed.
Avoid this by delaying the task doing the protection change such that
it is unlikely to win the privilege again.
Before:
root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
thread 0/0-3232 [022] .... 212.787402: task_numa_work: working
thread 0/0-3232 [022] .... 212.888473: task_numa_work: working
thread 0/0-3232 [022] .... 212.989538: task_numa_work: working
thread 0/0-3232 [022] .... 213.090602: task_numa_work: working
thread 0/0-3232 [022] .... 213.191667: task_numa_work: working
thread 0/0-3232 [022] .... 213.292734: task_numa_work: working
thread 0/0-3232 [022] .... 213.393804: task_numa_work: working
thread 0/0-3232 [022] .... 213.494869: task_numa_work: working
thread 0/0-3232 [022] .... 213.596937: task_numa_work: working
thread 0/0-3232 [022] .... 213.699000: task_numa_work: working
thread 0/0-3232 [022] .... 213.801067: task_numa_work: working
thread 0/0-3232 [022] .... 213.903155: task_numa_work: working
thread 0/0-3232 [022] .... 214.005201: task_numa_work: working
thread 0/0-3232 [022] .... 214.107266: task_numa_work: working
thread 0/0-3232 [022] .... 214.209342: task_numa_work: working
After:
root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
thread 0/0-3253 [005] .... 136.865051: task_numa_work: working
thread 0/2-3255 [026] .... 136.965134: task_numa_work: working
thread 0/3-3256 [024] .... 137.065217: task_numa_work: working
thread 0/3-3256 [024] .... 137.165302: task_numa_work: working
thread 0/3-3256 [024] .... 137.265382: task_numa_work: working
thread 0/0-3253 [004] .... 137.366465: task_numa_work: working
thread 0/2-3255 [026] .... 137.466549: task_numa_work: working
thread 0/0-3253 [004] .... 137.566629: task_numa_work: working
thread 0/0-3253 [004] .... 137.666711: task_numa_work: working
thread 0/1-3254 [028] .... 137.766799: task_numa_work: working
thread 0/0-3253 [004] .... 137.866876: task_numa_work: working
thread 0/2-3255 [026] .... 137.966960: task_numa_work: working
thread 0/1-3254 [028] .... 138.067041: task_numa_work: working
thread 0/2-3255 [026] .... 138.167123: task_numa_work: working
thread 0/3-3256 [024] .... 138.267207: task_numa_work: working
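The rotation visible in the "After" trace can be reproduced with a small
standalone simulation (period and tick values are made up, and the shared
gate is a simplified stand-in for the node_stamp/numa_next_scan interplay;
this is a model, not the scheduler code):
#include <stdio.h>

#define NTASKS 4
#define PERIOD 100   /* ticks between scans, arbitrary for the example */

int main(void)
{
	long node_stamp[NTASKS] = { 0 };
	long mm_next_scan = PERIOD;          /* shared per-mm gate */

	for (long now = 1; now <= 10 * PERIOD; now++) {
		for (int t = 0; t < NTASKS; t++) {
			if (now - node_stamp[t] <= PERIOD)
				continue;            /* this task's period has not expired */
			node_stamp[t] += PERIOD;     /* as task_tick_numa() does */
			if (now < mm_next_scan)
				continue;            /* lost the race, someone else scans */
			mm_next_scan = now + PERIOD; /* won the race */
			node_stamp[t] += 2;          /* the patch: defer the winner */
			printf("t=%4ld: task %d does the PTE scan\n", now, t);
		}
	}
	return 0;
}
Running it prints the tasks taking turns (0, 1, 2, 3, 0, 1, ...) instead of
task 0 winning every period, which is the behaviour the patch aims for.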
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b22f52a..8b9ff79 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -946,6 +946,12 @@ void task_numa_work(struct callback_head *work)
return;
/*
+ * Delay this task enough that another task of this mm will likely win
+ * the next time around.
+ */
+ p->node_stamp += 2 * TICK_NSEC;
+
+ /*
* Do not set pte_numa if the current running node is rate-limited.
* This loses statistics on the fault but if we are unwilling to
* migrate to this node, it is less likely we can do useful work
@@ -1026,7 +1032,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
if (now - curr->node_stamp > period) {
if (!curr->node_stamp)
curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
- curr->node_stamp = now;
+ curr->node_stamp += period;
if (!time_before(jiffies, curr->mm->numa_next_scan)) {
init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 13/63] sched: numa: Mitigate chance that same task always updates PTEs
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 17:24 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 17:24 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> With a trace_printk("working\n"); right after the cmpxchg in
> task_numa_work() we can see that in a 4-thread process, it's always the
> same task winning the race and doing the protection change.
>
> This is a problem since the task doing the protection change has a
> penalty for taking faults -- it is busy when marking the PTEs. If it's
> always the same task, the ->numa_faults[] statistics get severely skewed.
>
> Avoid this by delaying the task doing the protection change such that
> it is unlikely to win the privilege again.
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Mitigate chance that same task always updates PTEs
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:26 ` tip-bot for Peter Zijlstra
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:26 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, srikar, aarcange,
mgorman, tglx
Commit-ID: 19a78d110d7a8045aeb90d38ee8fe9743ce88c2d
Gitweb: http://git.kernel.org/tip/19a78d110d7a8045aeb90d38ee8fe9743ce88c2d
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:28:51 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:39:56 +0200
sched/numa: Mitigate chance that same task always updates PTEs
With a trace_printk("working\n"); right after the cmpxchg in
task_numa_work() we can see that in a 4-thread process, it's always the
same task winning the race and doing the protection change.
This is a problem since the task doing the protection change has a
penalty for taking faults -- it is busy when marking the PTEs. If it's
always the same task, the ->numa_faults[] statistics get severely skewed.
Avoid this by delaying the task doing the protection change such that
it is unlikely to win the privilege again.
Before:
root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
thread 0/0-3232 [022] .... 212.787402: task_numa_work: working
thread 0/0-3232 [022] .... 212.888473: task_numa_work: working
thread 0/0-3232 [022] .... 212.989538: task_numa_work: working
thread 0/0-3232 [022] .... 213.090602: task_numa_work: working
thread 0/0-3232 [022] .... 213.191667: task_numa_work: working
thread 0/0-3232 [022] .... 213.292734: task_numa_work: working
thread 0/0-3232 [022] .... 213.393804: task_numa_work: working
thread 0/0-3232 [022] .... 213.494869: task_numa_work: working
thread 0/0-3232 [022] .... 213.596937: task_numa_work: working
thread 0/0-3232 [022] .... 213.699000: task_numa_work: working
thread 0/0-3232 [022] .... 213.801067: task_numa_work: working
thread 0/0-3232 [022] .... 213.903155: task_numa_work: working
thread 0/0-3232 [022] .... 214.005201: task_numa_work: working
thread 0/0-3232 [022] .... 214.107266: task_numa_work: working
thread 0/0-3232 [022] .... 214.209342: task_numa_work: working
After:
root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
thread 0/0-3253 [005] .... 136.865051: task_numa_work: working
thread 0/2-3255 [026] .... 136.965134: task_numa_work: working
thread 0/3-3256 [024] .... 137.065217: task_numa_work: working
thread 0/3-3256 [024] .... 137.165302: task_numa_work: working
thread 0/3-3256 [024] .... 137.265382: task_numa_work: working
thread 0/0-3253 [004] .... 137.366465: task_numa_work: working
thread 0/2-3255 [026] .... 137.466549: task_numa_work: working
thread 0/0-3253 [004] .... 137.566629: task_numa_work: working
thread 0/0-3253 [004] .... 137.666711: task_numa_work: working
thread 0/1-3254 [028] .... 137.766799: task_numa_work: working
thread 0/0-3253 [004] .... 137.866876: task_numa_work: working
thread 0/2-3255 [026] .... 137.966960: task_numa_work: working
thread 0/1-3254 [028] .... 138.067041: task_numa_work: working
thread 0/2-3255 [026] .... 138.167123: task_numa_work: working
thread 0/3-3256 [024] .... 138.267207: task_numa_work: working
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-14-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 817cd7b..573d815e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -946,6 +946,12 @@ void task_numa_work(struct callback_head *work)
return;
/*
+ * Delay this task enough that another task of this mm will likely win
+ * the next time around.
+ */
+ p->node_stamp += 2 * TICK_NSEC;
+
+ /*
* Do not set pte_numa if the current running node is rate-limited.
* This loses statistics on the fault but if we are unwilling to
* migrate to this node, it is less likely we can do useful work
@@ -1026,7 +1032,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
if (now - curr->node_stamp > period) {
if (!curr->node_stamp)
curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
- curr->node_stamp = now;
+ curr->node_stamp += period;
if (!time_before(jiffies, curr->mm->numa_next_scan)) {
init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 14/63] sched: numa: Continue PTE scanning even if migrate rate limited
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
Avoiding marking PTEs pte_numa because a particular NUMA node is migrate
rate limited seems like a bad idea. Even if this node can't migrate any more,
other nodes might, and we want up-to-date information to make balancing
decisions. We already rate limit the actual migrations; this should leave
enough bandwidth to allow the non-migrating scanning. I think it's important
we keep up-to-date information if we're going to do placement based on it.
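Put differently, the rate limit should gate only the migration step, not the
marking and statistics gathering; a minimal sketch of that separation, with
migrate_ratelimited() modelled by a trivial stub (not the kernel code):
#include <stdbool.h>
#include <stdio.h>

/* stub: pretend node 0 is currently over its migration budget */
static bool migrate_ratelimited_model(int node)
{
	return node == 0;
}

int main(void)
{
	int node = 0;

	/* Before the patch the scan itself bailed out here:
	 *   if (migrate_ratelimited_model(node))
	 *           return 0;
	 * so no PTEs were marked and no fault statistics were gathered.
	 */

	/* After the patch: always mark PTEs so hinting faults keep flowing */
	printf("marking PTEs for NUMA hinting on node %d\n", node);

	/* ...and only the migration path consults the rate limit */
	if (migrate_ratelimited_model(node))
		printf("migration to node %d deferred by rate limit\n", node);
	else
		printf("migrating page to node %d\n", node);

	return 0;
}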
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/fair.c | 8 --------
1 file changed, 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b9ff79..39be6af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -951,14 +951,6 @@ void task_numa_work(struct callback_head *work)
*/
p->node_stamp += 2 * TICK_NSEC;
- /*
- * Do not set pte_numa if the current running node is rate-limited.
- * This loses statistics on the fault but if we are unwilling to
- * migrate to this node, it is less likely we can do useful work
- */
- if (migrate_ratelimited(numa_node_id()))
- return;
-
start = mm->numa_scan_offset;
pages = sysctl_numa_balancing_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 14/63] sched: numa: Continue PTE scanning even if migrate rate limited
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 17:24 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 17:24 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Avoiding marking PTEs pte_numa because a particular NUMA node is migrate
> rate limited seems like a bad idea. Even if this node can't migrate any more,
> other nodes might, and we want up-to-date information to make balancing
> decisions. We already rate limit the actual migrations; this should leave
> enough bandwidth to allow the non-migrating scanning. I think it's important
> we keep up-to-date information if we're going to do placement based on it.
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Continue PTE scanning even if migrate rate limited
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:26 ` tip-bot for Peter Zijlstra
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:26 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, srikar, aarcange,
mgorman, tglx
Commit-ID: 9e645ab6d089f5822479a833c6977c785bcfffe3
Gitweb: http://git.kernel.org/tip/9e645ab6d089f5822479a833c6977c785bcfffe3
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:28:52 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:09 +0200
sched/numa: Continue PTE scanning even if migrate rate limited
Avoiding marking PTEs pte_numa because a particular NUMA node is migrate rate
limited seems like a bad idea. Even if this node can't migrate any more, other
nodes might, and we want up-to-date information to make balancing decisions.
We already rate limit the actual migrations; this should leave enough
bandwidth for the non-migrating scanning. I think it's important that we
keep up-to-date information if we're going to do placement based on it.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-15-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 8 --------
1 file changed, 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 573d815e..464207f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -951,14 +951,6 @@ void task_numa_work(struct callback_head *work)
*/
p->node_stamp += 2 * TICK_NSEC;
- /*
- * Do not set pte_numa if the current running node is rate-limited.
- * This loses statistics on the fault but if we are unwilling to
- * migrate to this node, it is less likely we can do useful work
- */
- if (migrate_ratelimited(numa_node_id()))
- return;
-
start = mm->numa_scan_offset;
pages = sysctl_numa_balancing_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 15/63] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node"
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
PTE scanning and NUMA hinting fault handling are expensive, so commit
5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
on a new node") deferred the PTE scan until a task had been scheduled on
another node. The problem is that in the purely shared memory case this
may never happen and no NUMA hinting fault information will be captured.
We are not ruling out the possibility that something better can be done
here but, for now, the commit needs to be reverted so that we depend
entirely on the scan_delay to avoid punishing short-lived processes.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mm_types.h | 10 ----------
kernel/fork.c | 3 ---
kernel/sched/fair.c | 18 ------------------
kernel/sched/features.h | 4 +---
4 files changed, 1 insertion(+), 34 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d9851ee..b7adf1d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -428,20 +428,10 @@ struct mm_struct {
/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
-
- /*
- * The first node a task was scheduled on. If a task runs on
- * a different node than Make PTE Scan Go Now.
- */
- int first_nid;
#endif
struct uprobes_state uprobes_state;
};
-/* first nid will either be a valid NID or one of these values */
-#define NUMA_PTE_SCAN_INIT -1
-#define NUMA_PTE_SCAN_ACTIVE -2
-
static inline void mm_init_cpumask(struct mm_struct *mm)
{
#ifdef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/fork.c b/kernel/fork.c
index 086fe73..7192d91 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -817,9 +817,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
mm->pmd_huge_pte = NULL;
#endif
-#ifdef CONFIG_NUMA_BALANCING
- mm->first_nid = NUMA_PTE_SCAN_INIT;
-#endif
if (!mm_init(mm, tsk))
goto fail_nomem;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39be6af..148838c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,24 +901,6 @@ void task_numa_work(struct callback_head *work)
return;
/*
- * We do not care about task placement until a task runs on a node
- * other than the first one used by the address space. This is
- * largely because migrations are driven by what CPU the task
- * is running on. If it's never scheduled on another node, it'll
- * not migrate so why bother trapping the fault.
- */
- if (mm->first_nid == NUMA_PTE_SCAN_INIT)
- mm->first_nid = numa_node_id();
- if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
- /* Are we running on a new node yet? */
- if (numa_node_id() == mm->first_nid &&
- !sched_feat_numa(NUMA_FORCE))
- return;
-
- mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
- }
-
- /*
* Reset the scan period if enough time has gone by. Objective is that
* scanning will be reduced if pages are properly placed. As tasks
* can enter different phases this needs to be re-examined. Lacking
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 99399f8..cba5c61 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,10 +63,8 @@ SCHED_FEAT(LB_MIN, false)
/*
* Apply the automatic NUMA scheduling policy. Enabled automatically
* at runtime if running on a NUMA machine. Can be controlled via
- * numa_balancing=. Allow PTE scanning to be forced on UMA machines
- * for debugging the core machinery.
+ * numa_balancing=
*/
#ifdef CONFIG_NUMA_BALANCING
SCHED_FEAT(NUMA, false)
-SCHED_FEAT(NUMA_FORCE, false)
#endif
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 15/63] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node"
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 17:42 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 17:42 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> PTE scanning and NUMA hinting fault handling are expensive, so commit
> 5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
> on a new node") deferred the PTE scan until a task had been scheduled on
> another node. The problem is that in the purely shared memory case this
> may never happen and no NUMA hinting fault information will be captured.
> We are not ruling out the possibility that something better can be done
> here but, for now, the commit needs to be reverted so that we depend
> entirely on the scan_delay to avoid punishing short-lived processes.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node"
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:26 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:26 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: b726b7dfb400c937546fa91cf8523dcb1aa2fc6e
Gitweb: http://git.kernel.org/tip/b726b7dfb400c937546fa91cf8523dcb1aa2fc6e
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:53 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:17 +0200
Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node"
PTE scanning and NUMA hinting fault handling are expensive, so commit
5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
on a new node") deferred the PTE scan until a task had been scheduled on
another node. The problem is that in the purely shared memory case this
may never happen and no NUMA hinting fault information will be captured.
We are not ruling out the possibility that something better can be done
here but, for now, the commit needs to be reverted so that we depend
entirely on the scan_delay to avoid punishing short-lived processes.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
include/linux/mm_types.h | 10 ----------
kernel/fork.c | 3 ---
kernel/sched/fair.c | 18 ------------------
kernel/sched/features.h | 4 +---
4 files changed, 1 insertion(+), 34 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d9851ee..b7adf1d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -428,20 +428,10 @@ struct mm_struct {
/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
-
- /*
- * The first node a task was scheduled on. If a task runs on
- * a different node than Make PTE Scan Go Now.
- */
- int first_nid;
#endif
struct uprobes_state uprobes_state;
};
-/* first nid will either be a valid NID or one of these values */
-#define NUMA_PTE_SCAN_INIT -1
-#define NUMA_PTE_SCAN_ACTIVE -2
-
static inline void mm_init_cpumask(struct mm_struct *mm)
{
#ifdef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/fork.c b/kernel/fork.c
index 086fe73..7192d91 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -817,9 +817,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
mm->pmd_huge_pte = NULL;
#endif
-#ifdef CONFIG_NUMA_BALANCING
- mm->first_nid = NUMA_PTE_SCAN_INIT;
-#endif
if (!mm_init(mm, tsk))
goto fail_nomem;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 464207f..49b11fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,24 +901,6 @@ void task_numa_work(struct callback_head *work)
return;
/*
- * We do not care about task placement until a task runs on a node
- * other than the first one used by the address space. This is
- * largely because migrations are driven by what CPU the task
- * is running on. If it's never scheduled on another node, it'll
- * not migrate so why bother trapping the fault.
- */
- if (mm->first_nid == NUMA_PTE_SCAN_INIT)
- mm->first_nid = numa_node_id();
- if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
- /* Are we running on a new node yet? */
- if (numa_node_id() == mm->first_nid &&
- !sched_feat_numa(NUMA_FORCE))
- return;
-
- mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
- }
-
- /*
* Reset the scan period if enough time has gone by. Objective is that
* scanning will be reduced if pages are properly placed. As tasks
* can enter different phases this needs to be re-examined. Lacking
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 99399f8..cba5c61 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,10 +63,8 @@ SCHED_FEAT(LB_MIN, false)
/*
* Apply the automatic NUMA scheduling policy. Enabled automatically
* at runtime if running on a NUMA machine. Can be controlled via
- * numa_balancing=. Allow PTE scanning to be forced on UMA machines
- * for debugging the core machinery.
+ * numa_balancing=
*/
#ifdef CONFIG_NUMA_BALANCING
SCHED_FEAT(NUMA, false)
-SCHED_FEAT(NUMA_FORCE, false)
#endif
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 16/63] sched: numa: Initialise numa_next_scan properly
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
Scan delay logic and resets are currently initialised to start scanning
immediately instead of delaying properly. Initialise them properly at
fork time and catch when a new mm has been allocated.
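As a minimal userspace sketch (not part of the patch) of why the old
initialisation started scanning straight away: seeding the next-scan stamp
with the current time means the very first "not before" check passes, while
seeding it with now + scan_delay defers the first scan. jiffies_t,
msecs_to_ticks() and the HZ=1000 assumption below are stand-ins for the
kernel types, not kernel API:
#include <stdio.h>
#include <stdbool.h>
typedef unsigned long jiffies_t;	/* stand-in for kernel jiffies */
static jiffies_t msecs_to_ticks(unsigned int ms)
{
	return ms;			/* pretend HZ=1000 */
}
/* same wrap-safe comparison idea as the kernel's time_before() */
static bool time_before(jiffies_t a, jiffies_t b)
{
	return (long)(a - b) < 0;
}
int main(void)
{
	jiffies_t now = 100000;
	unsigned int scan_delay_ms = 1000;	/* sysctl_numa_balancing_scan_delay */
	jiffies_t old_init = now;				/* previous behaviour */
	jiffies_t new_init = now + msecs_to_ticks(scan_delay_ms);	/* patched behaviour */
	printf("old init: scan immediately? %s\n", time_before(now, old_init) ? "no" : "yes");
	printf("new init: scan immediately? %s\n", time_before(now, new_init) ? "no" : "yes");
	return 0;
}
With the old seeding the check passes immediately ("yes"); with the patched
seeding the first scan is deferred by the configured scan delay ("no").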
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/core.c | 4 ++--
kernel/sched/fair.c | 7 +++++++
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f3420c..681945e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1619,8 +1619,8 @@ static void __sched_fork(struct task_struct *p)
#ifdef CONFIG_NUMA_BALANCING
if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
- p->mm->numa_next_scan = jiffies;
- p->mm->numa_next_reset = jiffies;
+ p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+ p->mm->numa_next_reset = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
p->mm->numa_scan_seq = 0;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 148838c..22c0c7c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -900,6 +900,13 @@ void task_numa_work(struct callback_head *work)
if (p->flags & PF_EXITING)
return;
+ if (!mm->numa_next_reset || !mm->numa_next_scan) {
+ mm->numa_next_scan = now +
+ msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+ mm->numa_next_reset = now +
+ msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
+ }
+
/*
* Reset the scan period if enough time has gone by. Objective is that
* scanning will be reduced if pages are properly placed. As tasks
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 16/63] sched: numa: Initialise numa_next_scan properly
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 17:44 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 17:44 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> Scan delay logic and resets are currently initialised to start scanning
> immediately instead of delaying properly. Initialise them properly at
> fork time and catch when a new mm has been allocated.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Initialise numa_next_scan properly
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:26 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:26 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 7e8d16b6cbccb2f5da579f5085479fb82ba851b8
Gitweb: http://git.kernel.org/tip/7e8d16b6cbccb2f5da579f5085479fb82ba851b8
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:54 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:19 +0200
sched/numa: Initialise numa_next_scan properly
Scan delay logic and resets are currently initialised to start scanning
immediately instead of delaying properly. Initialise them properly at
fork time and catch when a new mm has been allocated.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-17-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/core.c | 4 ++--
kernel/sched/fair.c | 7 +++++++
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f575d5b..aee7e4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1624,8 +1624,8 @@ static void __sched_fork(struct task_struct *p)
#ifdef CONFIG_NUMA_BALANCING
if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
- p->mm->numa_next_scan = jiffies;
- p->mm->numa_next_reset = jiffies;
+ p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+ p->mm->numa_next_reset = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
p->mm->numa_scan_seq = 0;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49b11fa..0966f0c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -900,6 +900,13 @@ void task_numa_work(struct callback_head *work)
if (p->flags & PF_EXITING)
return;
+ if (!mm->numa_next_reset || !mm->numa_next_scan) {
+ mm->numa_next_scan = now +
+ msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+ mm->numa_next_reset = now +
+ msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
+ }
+
/*
* Reset the scan period if enough time has gone by. Objective is that
* scanning will be reduced if pages are properly placed. As tasks
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 17/63] sched: Set the scan rate proportional to the memory usage of the task being scanned
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
The NUMA PTE scan rate is controlled by a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size sysctls. This scan rate is independent of the
size of the task and, as an aside, it is further complicated by the fact
that numa_balancing_scan_size controls how many pages are marked pte_numa
and not how much virtual memory is scanned.
In combination, it is almost impossible to meaningfully tune the min and
max scan periods, and reasoning about performance is complex when the time
to complete a full scan is partially a function of the task's memory
size. This patch alters the semantics of the min and max tunables so that
they tune the length of time it takes to complete a scan of a task's
occupied virtual address space. Conceptually this is a lot easier to
understand. There is a "sanity" check to ensure the scan rate is never
extremely fast, based on the amount of virtual memory that should be
scanned in a second. The default of 2.5G seems arbitrary but it is chosen
so that the maximum scan rate after the patch roughly matches the maximum
scan rate before the patch was applied.
On a similar note, numa_scan_period is in milliseconds and not
jiffies. Properly placed pages slow the scanning rate, but adding 10 jiffies
to numa_scan_period means that the rate at which scanning slows depends on
HZ, which is confusing. Get rid of the jiffies_to_msecs conversion and
treat the value as milliseconds.
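As a rough illustration of the new arithmetic (not part of the patch), the
userspace sketch below mirrors what task_nr_scan_windows() and task_scan_min()
in the diff compute with the defaults introduced here; the helper names, the
4K page size and the hard-coded RSS are assumptions for the example only:
#include <stdio.h>
#define PAGE_SHIFT	12	/* assume 4K pages */
#define MAX_SCAN_WINDOW	2560	/* MB/sec ceiling, as in the patch */
static unsigned int scan_period_min = 1000;	/* ms to cover the whole task */
static unsigned int scan_size_mb = 256;		/* MB marked per scan window */
/* Number of scan_size_mb windows needed to cover the task's resident pages */
static unsigned long nr_scan_windows(unsigned long rss_pages)
{
	unsigned long nr_scan_pages = (unsigned long)scan_size_mb << (20 - PAGE_SHIFT);
	if (!rss_pages)
		rss_pages = nr_scan_pages;
	return (rss_pages + nr_scan_pages - 1) / nr_scan_pages;	/* round up */
}
/* Effective minimum delay between scan windows, in ms */
static unsigned long scan_min_ms(unsigned long rss_pages)
{
	unsigned long windows = 1, floor, scan;
	if (scan_size_mb < MAX_SCAN_WINDOW)
		windows = MAX_SCAN_WINDOW / scan_size_mb;
	floor = 1000 / windows;		/* never scan faster than 2.5GB/sec */
	scan = scan_period_min / nr_scan_windows(rss_pages);
	return scan > floor ? scan : floor;
}
int main(void)
{
	/* e.g. a task with 2.5G resident */
	unsigned long rss_pages = 2560UL << (20 - PAGE_SHIFT);
	printf("windows=%lu min_window_period=%lums\n",
	       nr_scan_windows(rss_pages), scan_min_ms(rss_pages));
	return 0;
}
With these defaults, a task with 2.5G of resident memory is covered in ten
256MB windows, so the minimum period between windows works out to 100ms,
which matches the 2.5GB/sec ceiling and roughly the pre-patch maximum
scan rate.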
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
Documentation/sysctl/kernel.txt | 11 +++---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 88 +++++++++++++++++++++++++++++++++++------
3 files changed, 83 insertions(+), 17 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 1428c66..8cd7e5f 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -403,15 +403,16 @@ workload pattern changes and minimises performance impact due to remote
memory accesses. These sysctls control the thresholds for scan delays and
the number of pages scanned.
-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
when it initially forks.
-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
numa_balancing_scan_size_mb is how many megabytes worth of pages are
scanned for a given scan.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5308d89..a8095ad 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1329,6 +1329,7 @@ struct task_struct {
int numa_scan_seq;
int numa_migrate_seq;
unsigned int numa_scan_period;
+ unsigned int numa_scan_period_max;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22c0c7c..c0092e5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -818,11 +818,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
#ifdef CONFIG_NUMA_BALANCING
/*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the tasks virtual memory size and
+ * numa_balancing_scan_size.
*/
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
-unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_max = 60000;
+unsigned int sysctl_numa_balancing_scan_period_reset = 60000;
/* Portion of address space to scan in MB */
unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -830,6 +832,51 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;
+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+ unsigned long rss = 0;
+ unsigned long nr_scan_pages;
+
+ /*
+ * Calculations based on RSS as non-present and empty pages are skipped
+ * by the PTE scanner and NUMA hinting faults should be trapped based
+ * on resident pages
+ */
+ nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+ rss = get_mm_rss(p->mm);
+ if (!rss)
+ rss = nr_scan_pages;
+
+ rss = round_up(rss, nr_scan_pages);
+ return rss / nr_scan_pages;
+}
+
+/* For sanitys sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */
+#define MAX_SCAN_WINDOW 2560
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+ unsigned int scan, floor;
+ unsigned int windows = 1;
+
+ if (sysctl_numa_balancing_scan_size < MAX_SCAN_WINDOW)
+ windows = MAX_SCAN_WINDOW / sysctl_numa_balancing_scan_size;
+ floor = 1000 / windows;
+
+ scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+ return max_t(unsigned int, floor, scan);
+}
+
+static unsigned int task_scan_max(struct task_struct *p)
+{
+ unsigned int smin = task_scan_min(p);
+ unsigned int smax;
+
+ /* Watch for min being lower than max due to floor calculations */
+ smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+ return max(smin, smax);
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq;
@@ -840,6 +887,7 @@ static void task_numa_placement(struct task_struct *p)
if (p->numa_scan_seq == seq)
return;
p->numa_scan_seq = seq;
+ p->numa_scan_period_max = task_scan_max(p);
/* FIXME: Scheduling placement policy hints go here */
}
@@ -860,9 +908,14 @@ void task_numa_fault(int node, int pages, bool migrated)
* If pages are properly placed (did not migrate) then scan slower.
* This is reset periodically in case of phase changes
*/
- if (!migrated)
- p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
- p->numa_scan_period + jiffies_to_msecs(10));
+ if (!migrated) {
+ /* Initialise if necessary */
+ if (!p->numa_scan_period_max)
+ p->numa_scan_period_max = task_scan_max(p);
+
+ p->numa_scan_period = min(p->numa_scan_period_max,
+ p->numa_scan_period + 10);
+ }
task_numa_placement(p);
}
@@ -884,6 +937,7 @@ void task_numa_work(struct callback_head *work)
struct mm_struct *mm = p->mm;
struct vm_area_struct *vma;
unsigned long start, end;
+ unsigned long nr_pte_updates = 0;
long pages;
WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -915,7 +969,7 @@ void task_numa_work(struct callback_head *work)
*/
migrate = mm->numa_next_reset;
if (time_after(now, migrate)) {
- p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+ p->numa_scan_period = task_scan_min(p);
next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
xchg(&mm->numa_next_reset, next_scan);
}
@@ -927,8 +981,10 @@ void task_numa_work(struct callback_head *work)
if (time_before(now, migrate))
return;
- if (p->numa_scan_period == 0)
- p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+ if (p->numa_scan_period == 0) {
+ p->numa_scan_period_max = task_scan_max(p);
+ p->numa_scan_period = task_scan_min(p);
+ }
next_scan = now + msecs_to_jiffies(p->numa_scan_period);
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
@@ -965,7 +1021,15 @@ void task_numa_work(struct callback_head *work)
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
end = min(end, vma->vm_end);
- pages -= change_prot_numa(vma, start, end);
+ nr_pte_updates += change_prot_numa(vma, start, end);
+
+ /*
+ * Scan sysctl_numa_balancing_scan_size but ensure that
+ * at least one PTE is updated so that unused virtual
+ * address space is quickly skipped.
+ */
+ if (nr_pte_updates)
+ pages -= (end - start) >> PAGE_SHIFT;
start = end;
if (pages <= 0)
@@ -1012,7 +1076,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
if (now - curr->node_stamp > period) {
if (!curr->node_stamp)
- curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+ curr->numa_scan_period = task_scan_min(curr);
curr->node_stamp += period;
if (!time_before(jiffies, curr->mm->numa_next_scan)) {
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 17/63] sched: Set the scan rate proportional to the memory usage of the task being scanned
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 17:44 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 17:44 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> The NUMA PTE scan rate is controlled by a combination of the
> numa_balancing_scan_period_min, numa_balancing_scan_period_max and
> numa_balancing_scan_size sysctls. This scan rate is independent of the
> size of the task and, as an aside, it is further complicated by the fact
> that numa_balancing_scan_size controls how many pages are marked pte_numa
> and not how much virtual memory is scanned.
>
> In combination, it is almost impossible to meaningfully tune the min and
> max scan periods, and reasoning about performance is complex when the time
> to complete a full scan is partially a function of the task's memory
> size. This patch alters the semantics of the min and max tunables so that
> they tune the length of time it takes to complete a scan of a task's
> occupied virtual address space. Conceptually this is a lot easier to
> understand. There is a "sanity" check to ensure the scan rate is never
> extremely fast, based on the amount of virtual memory that should be
> scanned in a second. The default of 2.5G seems arbitrary but it is chosen
> so that the maximum scan rate after the patch roughly matches the maximum
> scan rate before the patch was applied.
>
> On a similar note, numa_scan_period is in milliseconds and not
> jiffies. Properly placed pages slow the scanning rate, but adding 10 jiffies
> to numa_scan_period means that the rate at which scanning slows depends on
> HZ, which is confusing. Get rid of the jiffies_to_msecs conversion and
> treat the value as milliseconds.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Set the scan rate proportional to the memory usage of the task being scanned
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:26 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:26 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 598f0ec0bc996e90a806ee9564af919ea5aad401
Gitweb: http://git.kernel.org/tip/598f0ec0bc996e90a806ee9564af919ea5aad401
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:55 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:20 +0200
sched/numa: Set the scan rate proportional to the memory usage of the task being scanned
The NUMA PTE scan rate is controlled by a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size sysctls. This scan rate is independent of the
size of the task and, as an aside, it is further complicated by the fact
that numa_balancing_scan_size controls how many pages are marked pte_numa
and not how much virtual memory is scanned.
In combination, it is almost impossible to meaningfully tune the min and
max scan periods, and reasoning about performance is complex when the time
to complete a full scan is partially a function of the task's memory
size. This patch alters the semantics of the min and max tunables so that
they tune the length of time it takes to complete a scan of a task's
occupied virtual address space. Conceptually this is a lot easier to
understand. There is a "sanity" check to ensure the scan rate is never
extremely fast, based on the amount of virtual memory that should be
scanned in a second. The default of 2.5G seems arbitrary but it is chosen
so that the maximum scan rate after the patch roughly matches the maximum
scan rate before the patch was applied.
On a similar note, numa_scan_period is in milliseconds and not
jiffies. Properly placed pages slow the scanning rate, but adding 10 jiffies
to numa_scan_period means that the rate at which scanning slows depends on
HZ, which is confusing. Get rid of the jiffies_to_msecs conversion and
treat the value as milliseconds.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
Documentation/sysctl/kernel.txt | 11 +++---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 88 +++++++++++++++++++++++++++++++++++------
3 files changed, 83 insertions(+), 17 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 1428c66..8cd7e5f 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -403,15 +403,16 @@ workload pattern changes and minimises performance impact due to remote
memory accesses. These sysctls control the thresholds for scan delays and
the number of pages scanned.
-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
when it initially forks.
-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
numa_balancing_scan_size_mb is how many megabytes worth of pages are
scanned for a given scan.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2ac5285..fdcb4c8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1339,6 +1339,7 @@ struct task_struct {
int numa_scan_seq;
int numa_migrate_seq;
unsigned int numa_scan_period;
+ unsigned int numa_scan_period_max;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0966f0c..e08d757 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -818,11 +818,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
#ifdef CONFIG_NUMA_BALANCING
/*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the tasks virtual memory size and
+ * numa_balancing_scan_size.
*/
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
-unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_max = 60000;
+unsigned int sysctl_numa_balancing_scan_period_reset = 60000;
/* Portion of address space to scan in MB */
unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -830,6 +832,51 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;
+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+ unsigned long rss = 0;
+ unsigned long nr_scan_pages;
+
+ /*
+ * Calculations based on RSS as non-present and empty pages are skipped
+ * by the PTE scanner and NUMA hinting faults should be trapped based
+ * on resident pages
+ */
+ nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+ rss = get_mm_rss(p->mm);
+ if (!rss)
+ rss = nr_scan_pages;
+
+ rss = round_up(rss, nr_scan_pages);
+ return rss / nr_scan_pages;
+}
+
+/* For sanitys sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */
+#define MAX_SCAN_WINDOW 2560
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+ unsigned int scan, floor;
+ unsigned int windows = 1;
+
+ if (sysctl_numa_balancing_scan_size < MAX_SCAN_WINDOW)
+ windows = MAX_SCAN_WINDOW / sysctl_numa_balancing_scan_size;
+ floor = 1000 / windows;
+
+ scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+ return max_t(unsigned int, floor, scan);
+}
+
+static unsigned int task_scan_max(struct task_struct *p)
+{
+ unsigned int smin = task_scan_min(p);
+ unsigned int smax;
+
+ /* Watch for min being lower than max due to floor calculations */
+ smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+ return max(smin, smax);
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq;
@@ -840,6 +887,7 @@ static void task_numa_placement(struct task_struct *p)
if (p->numa_scan_seq == seq)
return;
p->numa_scan_seq = seq;
+ p->numa_scan_period_max = task_scan_max(p);
/* FIXME: Scheduling placement policy hints go here */
}
@@ -860,9 +908,14 @@ void task_numa_fault(int node, int pages, bool migrated)
* If pages are properly placed (did not migrate) then scan slower.
* This is reset periodically in case of phase changes
*/
- if (!migrated)
- p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
- p->numa_scan_period + jiffies_to_msecs(10));
+ if (!migrated) {
+ /* Initialise if necessary */
+ if (!p->numa_scan_period_max)
+ p->numa_scan_period_max = task_scan_max(p);
+
+ p->numa_scan_period = min(p->numa_scan_period_max,
+ p->numa_scan_period + 10);
+ }
task_numa_placement(p);
}
@@ -884,6 +937,7 @@ void task_numa_work(struct callback_head *work)
struct mm_struct *mm = p->mm;
struct vm_area_struct *vma;
unsigned long start, end;
+ unsigned long nr_pte_updates = 0;
long pages;
WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -915,7 +969,7 @@ void task_numa_work(struct callback_head *work)
*/
migrate = mm->numa_next_reset;
if (time_after(now, migrate)) {
- p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+ p->numa_scan_period = task_scan_min(p);
next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
xchg(&mm->numa_next_reset, next_scan);
}
@@ -927,8 +981,10 @@ void task_numa_work(struct callback_head *work)
if (time_before(now, migrate))
return;
- if (p->numa_scan_period == 0)
- p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+ if (p->numa_scan_period == 0) {
+ p->numa_scan_period_max = task_scan_max(p);
+ p->numa_scan_period = task_scan_min(p);
+ }
next_scan = now + msecs_to_jiffies(p->numa_scan_period);
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
@@ -965,7 +1021,15 @@ void task_numa_work(struct callback_head *work)
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
end = min(end, vma->vm_end);
- pages -= change_prot_numa(vma, start, end);
+ nr_pte_updates += change_prot_numa(vma, start, end);
+
+ /*
+ * Scan sysctl_numa_balancing_scan_size but ensure that
+ * at least one PTE is updated so that unused virtual
+ * address space is quickly skipped.
+ */
+ if (nr_pte_updates)
+ pages -= (end - start) >> PAGE_SHIFT;
start = end;
if (pages <= 0)
@@ -1012,7 +1076,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
if (now - curr->node_stamp > period) {
if (!curr->node_stamp)
- curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+ curr->numa_scan_period = task_scan_min(curr);
curr->node_stamp += period;
if (!time_before(jiffies, curr->mm->numa_next_scan)) {
^ permalink raw reply [flat|nested] 340+ messages in thread
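The patch above ties the scan period to task size rather than wall time: the resident set is divided into sysctl_numa_balancing_scan_size windows, the min/max sysctls describe how long a full pass over those windows should take, and the floor keeps any single task below roughly MAX_SCAN_WINDOW MB of PTE updates per second. The following standalone sketch works through the numbers for a task with 4GB resident, assuming 4K pages and the default values shown in the patch; it is an illustration of the arithmetic, not kernel code.

#include <stdio.h>

#define PAGE_SHIFT		12	/* assume 4K pages */
#define SCAN_SIZE_MB		256	/* sysctl_numa_balancing_scan_size */
#define SCAN_PERIOD_MIN_MS	1000	/* sysctl_numa_balancing_scan_period_min */
#define SCAN_PERIOD_MAX_MS	60000	/* sysctl_numa_balancing_scan_period_max */
#define MAX_SCAN_WINDOW_MB	2560	/* cap on MB of PTEs updated per second */

static unsigned int nr_scan_windows(unsigned long rss_pages)
{
	unsigned long scan_pages = (unsigned long)SCAN_SIZE_MB << (20 - PAGE_SHIFT);

	if (!rss_pages)
		rss_pages = scan_pages;
	return (rss_pages + scan_pages - 1) / scan_pages;	/* round up */
}

int main(void)
{
	unsigned long rss_pages = 4UL << (30 - PAGE_SHIFT);	/* 4GB resident */
	unsigned int windows = nr_scan_windows(rss_pages);
	unsigned int floor = SCAN_PERIOD_MIN_MS / (MAX_SCAN_WINDOW_MB / SCAN_SIZE_MB);
	unsigned int smin = SCAN_PERIOD_MIN_MS / windows;
	unsigned int smax = SCAN_PERIOD_MAX_MS / windows;

	if (smin < floor)
		smin = floor;
	if (smax < smin)
		smax = smin;

	/* 4GB / 256MB = 16 windows: min of 62ms floored to 100ms, max 3750ms */
	printf("windows=%u min=%ums max=%ums\n", windows, smin, smax);
	return 0;
}

For that example task the effective bounds come out at 100ms and 3750ms between 256MB scan windows.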
* [PATCH 18/63] sched: numa: Slow scan rate if no NUMA hinting faults are being recorded
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
NUMA PTE scanning slows if a NUMA hinting fault was trapped and no page
was migrated. For long-lived but idle processes there may be no faults
but the scan rate will be high and just waste CPU. This patch will slow
the scan rate for processes that are not trapping faults.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0092e5..8cea7a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1039,6 +1039,18 @@ void task_numa_work(struct callback_head *work)
out:
/*
+ * If the whole process was scanned without updates then no NUMA
+ * hinting faults are being recorded and scan rate should be lower.
+ */
+ if (mm->numa_scan_offset == 0 && !nr_pte_updates) {
+ p->numa_scan_period = min(p->numa_scan_period_max,
+ p->numa_scan_period << 1);
+
+ next_scan = now + msecs_to_jiffies(p->numa_scan_period);
+ mm->numa_next_scan = next_scan;
+ }
+
+ /*
* It is possible to reach the end of the VMA list but the last few
* VMAs are not guaranteed to the vma_migratable. If they are not, we
* would find the !migratable VMA on the next scan but not reset the
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
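Combined with the per-task period limits of the previous patch, the effect is an exponential backoff: each time a full pass over the address space updates no PTEs, the scan period doubles until it reaches the per-task maximum. A small sketch of that progression, reusing the 4GB example above (period_max of 3750ms) and an assumed 1000ms starting period; illustrative only, not kernel code.

#include <stdio.h>

int main(void)
{
	unsigned int period = 1000;	/* assumed starting scan period, ms */
	unsigned int period_max = 3750;	/* task_scan_max() for the 4GB example */
	int pass;

	for (pass = 1; pass <= 4; pass++) {
		/* mirrors: numa_scan_period = min(period_max, period << 1) */
		period = (period << 1) < period_max ? (period << 1) : period_max;
		printf("after idle pass %d: scan period %ums\n", pass, period);
	}
	return 0;
}

The period settles at the maximum after two idle passes, so a long-lived but idle process pays only the slowest scan rate.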
* Re: [PATCH 18/63] sched: numa: Slow scan rate if no NUMA hinting faults are being recorded
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 18:02 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:02 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> NUMA PTE scanning slows if a NUMA hinting fault was trapped and no page
> was migrated. For long-lived but idle processes there may be no faults
> but the scan rate will be high and just waste CPU. This patch will slow
> the scan rate for processes that are not trapping faults.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Slow scan rate if no NUMA hinting faults are being recorded
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:26 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:26 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: f307cd1a32fab53012b01749a1f5ba10b0a7243f
Gitweb: http://git.kernel.org/tip/f307cd1a32fab53012b01749a1f5ba10b0a7243f
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:56 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:21 +0200
sched/numa: Slow scan rate if no NUMA hinting faults are being recorded
NUMA PTE scanning slows if a NUMA hinting fault was trapped and no page
was migrated. For long-lived but idle processes there may be no faults
but the scan rate will be high and just waste CPU. This patch will slow
the scan rate for processes that are not trapping faults.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-19-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e08d757..c6c3302 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1039,6 +1039,18 @@ void task_numa_work(struct callback_head *work)
out:
/*
+ * If the whole process was scanned without updates then no NUMA
+ * hinting faults are being recorded and scan rate should be lower.
+ */
+ if (mm->numa_scan_offset == 0 && !nr_pte_updates) {
+ p->numa_scan_period = min(p->numa_scan_period_max,
+ p->numa_scan_period << 1);
+
+ next_scan = now + msecs_to_jiffies(p->numa_scan_period);
+ mm->numa_next_scan = next_scan;
+ }
+
+ /*
* It is possible to reach the end of the VMA list but the last few
* VMAs are not guaranteed to the vma_migratable. If they are not, we
* would find the !migratable VMA on the next scan but not reset the
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 19/63] sched: Track NUMA hinting faults on per-node basis
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
This patch tracks what nodes numa hinting faults were incurred on.
This information is later used to schedule a task on the node storing
the pages most frequently faulted by the task.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 3 +++
kernel/sched/fair.c | 11 ++++++++++-
kernel/sched/sched.h | 12 ++++++++++++
4 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a8095ad..8828e40 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1332,6 +1332,8 @@ struct task_struct {
unsigned int numa_scan_period_max;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
+
+ unsigned long *numa_faults;
#endif /* CONFIG_NUMA_BALANCING */
struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 681945e..aad2e02 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1629,6 +1629,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_work.next = &p->numa_work;
+ p->numa_faults = NULL;
#endif /* CONFIG_NUMA_BALANCING */
cpu_hotplug_init_task(p);
@@ -1892,6 +1893,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+ task_numa_free(prev);
+
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8cea7a2..df300d9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -902,7 +902,14 @@ void task_numa_fault(int node, int pages, bool migrated)
if (!numabalancing_enabled)
return;
- /* FIXME: Allocate task-specific structure for placement policy here */
+ /* Allocate buffer to track faults on a per-node basis */
+ if (unlikely(!p->numa_faults)) {
+ int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+ p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
+ if (!p->numa_faults)
+ return;
+ }
/*
* If pages are properly placed (did not migrate) then scan slower.
@@ -918,6 +925,8 @@ void task_numa_fault(int node, int pages, bool migrated)
}
task_numa_placement(p);
+
+ p->numa_faults[node] += pages;
}
static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b3c5653..6a955f4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
#include <linux/tick.h>
+#include <linux/slab.h>
#include "cpupri.h"
#include "cpuacct.h"
@@ -552,6 +553,17 @@ static inline u64 rq_clock_task(struct rq *rq)
return rq->clock_task;
}
+#ifdef CONFIG_NUMA_BALANCING
+static inline void task_numa_free(struct task_struct *p)
+{
+ kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
#ifdef CONFIG_SMP
#define rcu_dereference_check_sched_domain(p) \
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
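The bookkeeping itself is simple: one counter per node, allocated lazily on the first hinting fault and bumped by the number of pages that faulted. A minimal userspace sketch of that idea follows; the names (struct numa_stats, record_numa_fault) are invented for illustration, and in the kernel the array hangs off task_struct and is freed from finish_task_switch() via task_numa_free().

#include <stdlib.h>

struct numa_stats {
	int nr_node_ids;
	unsigned long *faults;		/* faults[nid] = pages faulted on nid */
};

int record_numa_fault(struct numa_stats *ns, int nid, int pages)
{
	if (!ns->faults) {
		/* lazy, best-effort allocation, as with __GFP_NOWARN in the patch */
		ns->faults = calloc(ns->nr_node_ids, sizeof(*ns->faults));
		if (!ns->faults)
			return -1;	/* no statistics rather than a hard error */
	}
	ns->faults[nid] += pages;
	return 0;
}

Allocation failure simply means no placement statistics are gathered for the task, which mirrors the silent early return in task_numa_fault().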
* Re: [PATCH 19/63] sched: Track NUMA hinting faults on per-node basis
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 18:02 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:02 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> This patch tracks what nodes numa hinting faults were incurred on.
> This information is later used to schedule a task on the node storing
> the pages most frequently faulted by the task.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Track NUMA hinting faults on per-node basis
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:27 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:27 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: f809ca9a554dda49fb264c79e31c722e0b063ff8
Gitweb: http://git.kernel.org/tip/f809ca9a554dda49fb264c79e31c722e0b063ff8
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:57 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:22 +0200
sched/numa: Track NUMA hinting faults on per-node basis
This patch tracks what nodes numa hinting faults were incurred on.
This information is later used to schedule a task on the node storing
the pages most frequently faulted by the task.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 3 +++
kernel/sched/fair.c | 11 ++++++++++-
kernel/sched/sched.h | 12 ++++++++++++
4 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fdcb4c8..a810e95 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1342,6 +1342,8 @@ struct task_struct {
unsigned int numa_scan_period_max;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
+
+ unsigned long *numa_faults;
#endif /* CONFIG_NUMA_BALANCING */
struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aee7e4d..6808d35 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1634,6 +1634,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_work.next = &p->numa_work;
+ p->numa_faults = NULL;
#endif /* CONFIG_NUMA_BALANCING */
}
@@ -1892,6 +1893,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+ task_numa_free(prev);
+
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6c3302..0bb3e0a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -902,7 +902,14 @@ void task_numa_fault(int node, int pages, bool migrated)
if (!numabalancing_enabled)
return;
- /* FIXME: Allocate task-specific structure for placement policy here */
+ /* Allocate buffer to track faults on a per-node basis */
+ if (unlikely(!p->numa_faults)) {
+ int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+ p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
+ if (!p->numa_faults)
+ return;
+ }
/*
* If pages are properly placed (did not migrate) then scan slower.
@@ -918,6 +925,8 @@ void task_numa_fault(int node, int pages, bool migrated)
}
task_numa_placement(p);
+
+ p->numa_faults[node] += pages;
}
static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e82484d..199099c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
#include <linux/tick.h>
+#include <linux/slab.h>
#include "cpupri.h"
#include "cpuacct.h"
@@ -555,6 +556,17 @@ static inline u64 rq_clock_task(struct rq *rq)
return rq->clock_task;
}
+#ifdef CONFIG_NUMA_BALANCING
+static inline void task_numa_free(struct task_struct *p)
+{
+ kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
#ifdef CONFIG_SMP
#define rcu_dereference_check_sched_domain(p) \
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 19/63] sched: Track NUMA hinting faults on per-node basis
2013-10-07 10:28 ` Mel Gorman
` (2 preceding siblings ...)
@ 2013-12-04 5:32 ` Wanpeng Li
2013-12-04 5:37 ` Wanpeng Li
-1 siblings, 1 reply; 340+ messages in thread
From: Wanpeng Li @ 2013-12-04 5:32 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML
On Mon, Oct 07, 2013 at 11:28:57AM +0100, Mel Gorman wrote:
>This patch tracks what nodes numa hinting faults were incurred on.
>This information is later used to schedule a task on the node storing
>the pages most frequently faulted by the task.
>
>Signed-off-by: Mel Gorman <mgorman@suse.de>
>---
> include/linux/sched.h | 2 ++
> kernel/sched/core.c | 3 +++
> kernel/sched/fair.c | 11 ++++++++++-
> kernel/sched/sched.h | 12 ++++++++++++
> 4 files changed, 27 insertions(+), 1 deletion(-)
>
>diff --git a/include/linux/sched.h b/include/linux/sched.h
>index a8095ad..8828e40 100644
>--- a/include/linux/sched.h
>+++ b/include/linux/sched.h
>@@ -1332,6 +1332,8 @@ struct task_struct {
> unsigned int numa_scan_period_max;
> u64 node_stamp; /* migration stamp */
> struct callback_head numa_work;
>+
>+ unsigned long *numa_faults;
> #endif /* CONFIG_NUMA_BALANCING */
>
> struct rcu_head rcu;
>diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>index 681945e..aad2e02 100644
>--- a/kernel/sched/core.c
>+++ b/kernel/sched/core.c
>@@ -1629,6 +1629,7 @@ static void __sched_fork(struct task_struct *p)
> p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> p->numa_work.next = &p->numa_work;
>+ p->numa_faults = NULL;
> #endif /* CONFIG_NUMA_BALANCING */
>
> cpu_hotplug_init_task(p);
>@@ -1892,6 +1893,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> if (mm)
> mmdrop(mm);
> if (unlikely(prev_state == TASK_DEAD)) {
>+ task_numa_free(prev);
Function task_numa_free() depends on patch 43/64.
Regards,
Wanpeng Li
>+
> /*
> * Remove function-return probe instances associated with this
> * task and put them back on the free list.
>diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>index 8cea7a2..df300d9 100644
>--- a/kernel/sched/fair.c
>+++ b/kernel/sched/fair.c
>@@ -902,7 +902,14 @@ void task_numa_fault(int node, int pages, bool migrated)
> if (!numabalancing_enabled)
> return;
>
>- /* FIXME: Allocate task-specific structure for placement policy here */
>+ /* Allocate buffer to track faults on a per-node basis */
>+ if (unlikely(!p->numa_faults)) {
>+ int size = sizeof(*p->numa_faults) * nr_node_ids;
>+
>+ p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
>+ if (!p->numa_faults)
>+ return;
>+ }
>
> /*
> * If pages are properly placed (did not migrate) then scan slower.
>@@ -918,6 +925,8 @@ void task_numa_fault(int node, int pages, bool migrated)
> }
>
> task_numa_placement(p);
>+
>+ p->numa_faults[node] += pages;
> }
>
> static void reset_ptenuma_scan(struct task_struct *p)
>diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>index b3c5653..6a955f4 100644
>--- a/kernel/sched/sched.h
>+++ b/kernel/sched/sched.h
>@@ -6,6 +6,7 @@
> #include <linux/spinlock.h>
> #include <linux/stop_machine.h>
> #include <linux/tick.h>
>+#include <linux/slab.h>
>
> #include "cpupri.h"
> #include "cpuacct.h"
>@@ -552,6 +553,17 @@ static inline u64 rq_clock_task(struct rq *rq)
> return rq->clock_task;
> }
>
>+#ifdef CONFIG_NUMA_BALANCING
>+static inline void task_numa_free(struct task_struct *p)
>+{
>+ kfree(p->numa_faults);
>+}
>+#else /* CONFIG_NUMA_BALANCING */
>+static inline void task_numa_free(struct task_struct *p)
>+{
>+}
>+#endif /* CONFIG_NUMA_BALANCING */
>+
> #ifdef CONFIG_SMP
>
> #define rcu_dereference_check_sched_domain(p) \
>--
>1.8.4
>
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 19/63] sched: Track NUMA hinting faults on per-node basis
2013-12-04 5:32 ` [PATCH 19/63] sched: " Wanpeng Li
@ 2013-12-04 5:37 ` Wanpeng Li
0 siblings, 0 replies; 340+ messages in thread
From: Wanpeng Li @ 2013-12-04 5:37 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML
On Wed, Dec 04, 2013 at 01:32:42PM +0800, Wanpeng Li wrote:
>On Mon, Oct 07, 2013 at 11:28:57AM +0100, Mel Gorman wrote:
>>This patch tracks what nodes numa hinting faults were incurred on.
>>This information is later used to schedule a task on the node storing
>>the pages most frequently faulted by the task.
>>
>>Signed-off-by: Mel Gorman <mgorman@suse.de>
>>---
>> include/linux/sched.h | 2 ++
>> kernel/sched/core.c | 3 +++
>> kernel/sched/fair.c | 11 ++++++++++-
>> kernel/sched/sched.h | 12 ++++++++++++
>> 4 files changed, 27 insertions(+), 1 deletion(-)
>>
>>diff --git a/include/linux/sched.h b/include/linux/sched.h
>>index a8095ad..8828e40 100644
>>--- a/include/linux/sched.h
>>+++ b/include/linux/sched.h
>>@@ -1332,6 +1332,8 @@ struct task_struct {
>> unsigned int numa_scan_period_max;
>> u64 node_stamp; /* migration stamp */
>> struct callback_head numa_work;
>>+
>>+ unsigned long *numa_faults;
>> #endif /* CONFIG_NUMA_BALANCING */
>>
>> struct rcu_head rcu;
>>diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>index 681945e..aad2e02 100644
>>--- a/kernel/sched/core.c
>>+++ b/kernel/sched/core.c
>>@@ -1629,6 +1629,7 @@ static void __sched_fork(struct task_struct *p)
>> p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
>> p->numa_scan_period = sysctl_numa_balancing_scan_delay;
>> p->numa_work.next = &p->numa_work;
>>+ p->numa_faults = NULL;
>> #endif /* CONFIG_NUMA_BALANCING */
>>
>> cpu_hotplug_init_task(p);
>>@@ -1892,6 +1893,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>> if (mm)
>> mmdrop(mm);
>> if (unlikely(prev_state == TASK_DEAD)) {
>>+ task_numa_free(prev);
>
>Function task_numa_free() depends on patch 43/64.
Sorry, I miss it.
>
>Regards,
>Wanpeng Li
>
>>+
>> /*
>> * Remove function-return probe instances associated with this
>> * task and put them back on the free list.
>>diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>index 8cea7a2..df300d9 100644
>>--- a/kernel/sched/fair.c
>>+++ b/kernel/sched/fair.c
>>@@ -902,7 +902,14 @@ void task_numa_fault(int node, int pages, bool migrated)
>> if (!numabalancing_enabled)
>> return;
>>
>>- /* FIXME: Allocate task-specific structure for placement policy here */
>>+ /* Allocate buffer to track faults on a per-node basis */
>>+ if (unlikely(!p->numa_faults)) {
>>+ int size = sizeof(*p->numa_faults) * nr_node_ids;
>>+
>>+ p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
>>+ if (!p->numa_faults)
>>+ return;
>>+ }
>>
>> /*
>> * If pages are properly placed (did not migrate) then scan slower.
>>@@ -918,6 +925,8 @@ void task_numa_fault(int node, int pages, bool migrated)
>> }
>>
>> task_numa_placement(p);
>>+
>>+ p->numa_faults[node] += pages;
>> }
>>
>> static void reset_ptenuma_scan(struct task_struct *p)
>>diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>index b3c5653..6a955f4 100644
>>--- a/kernel/sched/sched.h
>>+++ b/kernel/sched/sched.h
>>@@ -6,6 +6,7 @@
>> #include <linux/spinlock.h>
>> #include <linux/stop_machine.h>
>> #include <linux/tick.h>
>>+#include <linux/slab.h>
>>
>> #include "cpupri.h"
>> #include "cpuacct.h"
>>@@ -552,6 +553,17 @@ static inline u64 rq_clock_task(struct rq *rq)
>> return rq->clock_task;
>> }
>>
>>+#ifdef CONFIG_NUMA_BALANCING
>>+static inline void task_numa_free(struct task_struct *p)
>>+{
>>+ kfree(p->numa_faults);
>>+}
>>+#else /* CONFIG_NUMA_BALANCING */
>>+static inline void task_numa_free(struct task_struct *p)
>>+{
>>+}
>>+#endif /* CONFIG_NUMA_BALANCING */
>>+
>> #ifdef CONFIG_SMP
>>
>> #define rcu_dereference_check_sched_domain(p) \
>>--
>>1.8.4
>>
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 20/63] sched: Select a preferred node with the most numa hinting faults
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 17 +++++++++++++++--
3 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8828e40..83bc1f5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1334,6 +1334,7 @@ struct task_struct {
struct callback_head numa_work;
unsigned long *numa_faults;
+ int numa_preferred_nid;
#endif /* CONFIG_NUMA_BALANCING */
struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aad2e02..cecbbed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1628,6 +1628,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+ p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df300d9..5fdab8c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -879,7 +879,8 @@ static unsigned int task_scan_max(struct task_struct *p)
static void task_numa_placement(struct task_struct *p)
{
- int seq;
+ int seq, nid, max_nid = -1;
+ unsigned long max_faults = 0;
if (!p->mm) /* for example, ksmd faulting in a user's mm */
return;
@@ -889,7 +890,19 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);
- /* FIXME: Scheduling placement policy hints go here */
+ /* Find the node with the highest number of faults */
+ for_each_online_node(nid) {
+ unsigned long faults = p->numa_faults[nid];
+ p->numa_faults[nid] >>= 1;
+ if (faults > max_faults) {
+ max_faults = faults;
+ max_nid = nid;
+ }
+ }
+
+ /* Update the tasks preferred node if necessary */
+ if (max_faults && max_nid != p->numa_preferred_nid)
+ p->numa_preferred_nid = max_nid;
}
/*
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
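Selection is a plain arg-max over the per-node counters, with each counter halved every scan window so that stale faults decay away. A sketch of that loop follows, with nr_nodes and faults[] standing in for the task's real state; it mirrors the logic above but is not the kernel code itself.

int pick_preferred_nid(unsigned long *faults, int nr_nodes)
{
	unsigned long max_faults = 0;
	int nid, max_nid = -1;

	for (nid = 0; nid < nr_nodes; nid++) {
		/* compare before decaying, as the patch does */
		if (faults[nid] > max_faults) {
			max_faults = faults[nid];
			max_nid = nid;
		}
		faults[nid] >>= 1;	/* halve the history each scan window */
	}
	return max_nid;			/* -1: no faults recorded yet */
}

Because the history halves every window, a node has to keep generating faults to stay preferred; a node that stops faulting is overtaken by a steadily faulting one within a few windows.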
* Re: [PATCH 20/63] sched: Select a preferred node with the most numa hinting faults
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 18:04 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:04 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> This patch selects a preferred node for a task to run on based on the
> NUMA hinting faults. This information is later used to migrate tasks
> towards the node during balancing.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Select a preferred node with the most numa hinting faults
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:27 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:27 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 688b7585d16ab57a17aa4422a3b290b3a55fa679
Gitweb: http://git.kernel.org/tip/688b7585d16ab57a17aa4422a3b290b3a55fa679
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:58 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:23 +0200
sched/numa: Select a preferred node with the most numa hinting faults
This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-21-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 17 +++++++++++++++--
3 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a810e95..b1fc75e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1344,6 +1344,7 @@ struct task_struct {
struct callback_head numa_work;
unsigned long *numa_faults;
+ int numa_preferred_nid;
#endif /* CONFIG_NUMA_BALANCING */
struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6808d35..d15cd70 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1633,6 +1633,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+ p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0bb3e0a..9efd34f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -879,7 +879,8 @@ static unsigned int task_scan_max(struct task_struct *p)
static void task_numa_placement(struct task_struct *p)
{
- int seq;
+ int seq, nid, max_nid = -1;
+ unsigned long max_faults = 0;
if (!p->mm) /* for example, ksmd faulting in a user's mm */
return;
@@ -889,7 +890,19 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);
- /* FIXME: Scheduling placement policy hints go here */
+ /* Find the node with the highest number of faults */
+ for_each_online_node(nid) {
+ unsigned long faults = p->numa_faults[nid];
+ p->numa_faults[nid] >>= 1;
+ if (faults > max_faults) {
+ max_faults = faults;
+ max_nid = nid;
+ }
+ }
+
+ /* Update the tasks preferred node if necessary */
+ if (max_faults && max_nid != p->numa_preferred_nid)
+ p->numa_preferred_nid = max_nid;
}
/*
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 21/63] sched: Update NUMA hinting faults once per scan
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:28 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:28 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
NUMA hinting fault counts and placement decisions are both recorded in the
same array which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artificially high
count due to very recent faulting activity and may create a bouncing effect.
This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/sched.h | 13 +++++++++++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 16 +++++++++++++---
3 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 83bc1f5..2e02757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1333,7 +1333,20 @@ struct task_struct {
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
+ /*
+ * Exponential decaying average of faults on a per-node basis.
+ * Scheduling placement decisions are made based on these counts.
+ * The values remain static for the duration of a PTE scan
+ */
unsigned long *numa_faults;
+
+ /*
+ * numa_faults_buffer records faults per node during the current
+ * scan window. When the scan completes, the counts in numa_faults
+ * decay and these values are copied.
+ */
+ unsigned long *numa_faults_buffer;
+
int numa_preferred_nid;
#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cecbbed..201c953 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1631,6 +1631,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
+ p->numa_faults_buffer = NULL;
#endif /* CONFIG_NUMA_BALANCING */
cpu_hotplug_init_task(p);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5fdab8c..6227fb4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -892,8 +892,14 @@ static void task_numa_placement(struct task_struct *p)
/* Find the node with the highest number of faults */
for_each_online_node(nid) {
- unsigned long faults = p->numa_faults[nid];
+ unsigned long faults;
+
+ /* Decay existing window and copy faults since last scan */
p->numa_faults[nid] >>= 1;
+ p->numa_faults[nid] += p->numa_faults_buffer[nid];
+ p->numa_faults_buffer[nid] = 0;
+
+ faults = p->numa_faults[nid];
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
@@ -919,9 +925,13 @@ void task_numa_fault(int node, int pages, bool migrated)
if (unlikely(!p->numa_faults)) {
int size = sizeof(*p->numa_faults) * nr_node_ids;
- p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
+ /* numa_faults and numa_faults_buffer share the allocation */
+ p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
if (!p->numa_faults)
return;
+
+ BUG_ON(p->numa_faults_buffer);
+ p->numa_faults_buffer = p->numa_faults + nr_node_ids;
}
/*
@@ -939,7 +949,7 @@ void task_numa_fault(int node, int pages, bool migrated)
task_numa_placement(p);
- p->numa_faults[node] += pages;
+ p->numa_faults_buffer[node] += pages;
}
static void reset_ptenuma_scan(struct task_struct *p)
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
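The per-window update then becomes a decay-and-fold step: the long-term array is halved, the raw counts gathered in the buffer during the last window are added in, and the buffer is cleared. A sketch of just that step, with faults and faults_buffer standing in for the task's two arrays:

void fold_fault_buffer(unsigned long *faults, unsigned long *faults_buffer,
		       int nr_nodes)
{
	int nid;

	for (nid = 0; nid < nr_nodes; nid++) {
		/* decay the history, then add this window's raw counts */
		faults[nid] = (faults[nid] >> 1) + faults_buffer[nid];
		faults_buffer[nid] = 0;
	}
}

Since placement only ever reads faults[], the values it sees stay fixed for the length of a scan window instead of ramping up and decaying mid-window, which is the sawtooth the changelog describes.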
* Re: [PATCH 21/63] sched: Update NUMA hinting faults once per scan
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 18:39 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:39 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:28 AM, Mel Gorman wrote:
> NUMA hinting fault counts and placement decisions are both recorded in the
> same array which distorts the samples in an unpredictable fashion. The values
> linearly accumulate during the scan and then decay creating a sawtooth-like
> pattern in the per-node counts. It also means that placement decisions are
> time sensitive. At best it means that it is very difficult to state that
> the buffer holds a decaying average of past faulting behaviour. At worst,
> it can confuse the load balancer if it sees one node with an artifically high
> count due to very recent faulting activity and may create a bouncing effect.
>
> This patch adds a second array. numa_faults stores the historical data
> which is used for placement decisions. numa_faults_buffer holds the
> fault activity during the current scan window. When the scan completes,
> numa_faults decays and the values from numa_faults_buffer are copied
> across.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Update NUMA hinting faults once per scan
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-09 17:27 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:27 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 745d61476ddb737aad3495fa6d9a8f8c2ee59f86
Gitweb: http://git.kernel.org/tip/745d61476ddb737aad3495fa6d9a8f8c2ee59f86
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:28:59 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:25 +0200
sched/numa: Update NUMA hinting faults once per scan
NUMA hinting fault counts and placement decisions are both recorded in the
same array which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artificially high
count due to very recent faulting activity and may create a bouncing effect.
This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-22-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
include/linux/sched.h | 13 +++++++++++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 16 +++++++++++++---
3 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b1fc75e..a463bc3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1343,7 +1343,20 @@ struct task_struct {
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
+ /*
+ * Exponential decaying average of faults on a per-node basis.
+ * Scheduling placement decisions are made based on these counts.
+ * The values remain static for the duration of a PTE scan
+ */
unsigned long *numa_faults;
+
+ /*
+ * numa_faults_buffer records faults per node during the current
+ * scan window. When the scan completes, the counts in numa_faults
+ * decay and these values are copied.
+ */
+ unsigned long *numa_faults_buffer;
+
int numa_preferred_nid;
#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d15cd70..064a0af 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1636,6 +1636,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
+ p->numa_faults_buffer = NULL;
#endif /* CONFIG_NUMA_BALANCING */
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9efd34f..3abc651 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -892,8 +892,14 @@ static void task_numa_placement(struct task_struct *p)
/* Find the node with the highest number of faults */
for_each_online_node(nid) {
- unsigned long faults = p->numa_faults[nid];
+ unsigned long faults;
+
+ /* Decay existing window and copy faults since last scan */
p->numa_faults[nid] >>= 1;
+ p->numa_faults[nid] += p->numa_faults_buffer[nid];
+ p->numa_faults_buffer[nid] = 0;
+
+ faults = p->numa_faults[nid];
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
@@ -919,9 +925,13 @@ void task_numa_fault(int node, int pages, bool migrated)
if (unlikely(!p->numa_faults)) {
int size = sizeof(*p->numa_faults) * nr_node_ids;
- p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
+ /* numa_faults and numa_faults_buffer share the allocation */
+ p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
if (!p->numa_faults)
return;
+
+ BUG_ON(p->numa_faults_buffer);
+ p->numa_faults_buffer = p->numa_faults + nr_node_ids;
}
/*
@@ -939,7 +949,7 @@ void task_numa_fault(int node, int pages, bool migrated)
task_numa_placement(p);
- p->numa_faults[node] += pages;
+ p->numa_faults_buffer[node] += pages;
}
static void reset_ptenuma_scan(struct task_struct *p)
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 22/63] sched: Favour moving tasks towards the preferred node
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
This patch favours moving tasks towards the NUMA node that recorded a higher
number of NUMA faults during active load balancing. Ideally this is
self-reinforcing as the longer the task runs on that node, the more faults
it should incur, causing task_numa_placement to keep the task running on that
node. In reality a big weakness is that the node's CPUs can be overloaded
and it would be more efficient to queue tasks on an idle node and migrate
to the new node. This would require additional smarts in the balancer so
for now the balancer will simply prefer to place the task on the preferred
node for a number of PTE scans, which is controlled by the
numa_balancing_settle_count sysctl. Once the settle_count number of scans
has completed the scheduler is free to place the task on an alternative
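The placement preference described above reduces to roughly the following
standalone sketch. The struct, the hard-coded settle count and the fault
numbers are invented for illustration; in the kernel the check runs inside the
load balancer and the threshold comes from the numa_balancing_settle_count
sysctl:
#include <stdbool.h>
#include <stdio.h>

#define SETTLE_COUNT 3  /* stand-in for the numa_balancing_settle_count sysctl */

/* Invented, simplified view of a task's NUMA state */
struct task_numa {
        int preferred_nid;        /* node chosen by placement */
        int migrate_seq;          /* scans completed since the preferred node changed */
        unsigned long faults[4];  /* decayed per-node fault history */
};

/* Should the balancer favour pulling the task from src_nid to dst_nid? */
static bool favour_move(const struct task_numa *t, int src_nid, int dst_nid)
{
        if (src_nid == dst_nid)
                return false;

        /* Once the task has settled, normal load balancing takes over */
        if (t->migrate_seq >= SETTLE_COUNT)
                return false;

        /* Favour the preferred node, or any node with more recorded faults */
        return dst_nid == t->preferred_nid ||
               t->faults[dst_nid] > t->faults[src_nid];
}

int main(void)
{
        struct task_numa t = {
                .preferred_nid = 1,
                .migrate_seq = 0,
                .faults = { 5, 40, 3, 0 },
        };

        printf("0 -> 1: %d\n", favour_move(&t, 0, 1)); /* 1: destination is preferred */
        printf("1 -> 2: %d\n", favour_move(&t, 1, 2)); /* 0: destination has fewer faults */
        return 0;
}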
[srikar@linux.vnet.ibm.com: Fixed statistics]
[peterz@infradead.org: Tunable and use higher faults instead of preferred]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
Documentation/sysctl/kernel.txt | 8 +++++-
include/linux/sched.h | 1 +
kernel/sched/core.c | 3 +-
kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 7 +++++
kernel/sysctl.c | 7 +++++
6 files changed, 83 insertions(+), 6 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 8cd7e5f..d48bca4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -375,7 +375,8 @@ feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the numa_balancing_scan_period_min_ms,
numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
==============================================================
@@ -420,6 +421,11 @@ scanned for a given scan.
numa_balancing_scan_period_reset is a blunt instrument that controls how
often a tasks scan delay is reset to detect sudden changes in task behaviour.
+numa_balancing_settle_count is how many scan periods must complete before
+the scheduler's load balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
==============================================================
osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2e02757..d5ae4bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -768,6 +768,7 @@ enum cpu_idle_type {
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
+#define SD_NUMA 0x4000 /* cross-node balancing */
extern int __weak arch_sd_sibiling_asym_packing(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 201c953..3515c41 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1626,7 +1626,7 @@ static void __sched_fork(struct task_struct *p)
p->node_stamp = 0ULL;
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
- p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+ p->numa_migrate_seq = 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
@@ -5661,6 +5661,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
| 0*SD_SHARE_PKG_RESOURCES
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
+ | 1*SD_NUMA
| sd_local_flags(level)
,
.last_balance = jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6227fb4..8c2b779 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -877,6 +877,15 @@ static unsigned int task_scan_max(struct task_struct *p)
return max(smin, smax);
}
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -888,6 +897,7 @@ static void task_numa_placement(struct task_struct *p)
if (p->numa_scan_seq == seq)
return;
p->numa_scan_seq = seq;
+ p->numa_migrate_seq++;
p->numa_scan_period_max = task_scan_max(p);
/* Find the node with the highest number of faults */
@@ -907,8 +917,10 @@ static void task_numa_placement(struct task_struct *p)
}
/* Update the tasks preferred node if necessary */
- if (max_faults && max_nid != p->numa_preferred_nid)
+ if (max_faults && max_nid != p->numa_preferred_nid) {
p->numa_preferred_nid = max_nid;
+ p->numa_migrate_seq = 0;
+ }
}
/*
@@ -4070,6 +4082,38 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
return delta < (s64)sysctl_sched_migration_cost;
}
+#ifdef CONFIG_NUMA_BALANCING
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+ int src_nid, dst_nid;
+
+ if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults ||
+ !(env->sd->flags & SD_NUMA)) {
+ return false;
+ }
+
+ src_nid = cpu_to_node(env->src_cpu);
+ dst_nid = cpu_to_node(env->dst_cpu);
+
+ if (src_nid == dst_nid ||
+ p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+ return false;
+
+ if (dst_nid == p->numa_preferred_nid ||
+ p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+ return true;
+
+ return false;
+}
+#else
+static inline bool migrate_improves_locality(struct task_struct *p,
+ struct lb_env *env)
+{
+ return false;
+}
+#endif
+
/*
* can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
*/
@@ -4125,11 +4169,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* Aggressive migration if:
- * 1) task is cache cold, or
- * 2) too many balance attempts have failed.
+ * 1) destination numa is preferred
+ * 2) task is cache cold, or
+ * 3) too many balance attempts have failed.
*/
-
tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+
+ if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+ if (tsk_cache_hot) {
+ schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+ schedstat_inc(p, se.statistics.nr_forced_migrations);
+ }
+#endif
+ return 1;
+ }
+
if (!tsk_cache_hot ||
env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index cba5c61..d9278ce 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -67,4 +67,11 @@ SCHED_FEAT(LB_MIN, false)
*/
#ifdef CONFIG_NUMA_BALANCING
SCHED_FEAT(NUMA, false)
+
+/*
+ * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a
+ * higher number of hinting faults are recorded during active load
+ * balancing.
+ */
+SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
#endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2f06f3..42f616a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -391,6 +391,13 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "numa_balancing_settle_count",
+ .data = &sysctl_numa_balancing_settle_count,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_SCHED_DEBUG */
{
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 22/63] sched: Favour moving tasks towards the preferred node
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-07 18:39 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:39 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> This patch favours moving tasks towards the NUMA node that recorded a higher
> number of NUMA faults during active load balancing. Ideally this is
> self-reinforcing as the longer the task runs on that node, the more faults
> it should incur, causing task_numa_placement to keep the task running on that
> node. In reality a big weakness is that the node's CPUs can be overloaded
> and it would be more efficient to queue tasks on an idle node and migrate
> to the new node. This would require additional smarts in the balancer so
> for now the balancer will simply prefer to place the task on the preferred
> node for a number of PTE scans, which is controlled by the
> numa_balancing_settle_count sysctl. Once the settle_count number of scans has
> completed the scheduler is free to place the task on an alternative node if
> the load is imbalanced.
>
> [srikar@linux.vnet.ibm.com: Fixed statistics]
> [peterz@infradead.org: Tunable and use higher faults instead of preferred]
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Favour moving tasks towards the preferred node
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:27 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:27 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 3a7053b3224f4a8b0e8184166190076593621617
Gitweb: http://git.kernel.org/tip/3a7053b3224f4a8b0e8184166190076593621617
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:00 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:26 +0200
sched/numa: Favour moving tasks towards the preferred node
This patch favours moving tasks towards the NUMA node that recorded a higher
number of NUMA faults during active load balancing. Ideally this is
self-reinforcing as the longer the task runs on that node, the more faults
it should incur, causing task_numa_placement to keep the task running on that
node. In reality a big weakness is that the node's CPUs can be overloaded
and it would be more efficient to queue tasks on an idle node and migrate
to the new node. This would require additional smarts in the balancer so
for now the balancer will simply prefer to place the task on the preferred
node for a number of PTE scans, which is controlled by the
numa_balancing_settle_count sysctl. Once the settle_count number of scans
has completed the scheduler is free to place the task on an alternative
node if the load is imbalanced.
[srikar@linux.vnet.ibm.com: Fixed statistics]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ Tunable and use higher faults instead of preferred. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
Documentation/sysctl/kernel.txt | 8 +++++-
include/linux/sched.h | 1 +
kernel/sched/core.c | 3 +-
kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 7 +++++
kernel/sysctl.c | 7 +++++
6 files changed, 83 insertions(+), 6 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 8cd7e5f..d48bca4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -375,7 +375,8 @@ feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the numa_balancing_scan_period_min_ms,
numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
==============================================================
@@ -420,6 +421,11 @@ scanned for a given scan.
numa_balancing_scan_period_reset is a blunt instrument that controls how
often a tasks scan delay is reset to detect sudden changes in task behaviour.
+numa_balancing_settle_count is how many scan periods must complete before
+the scheduler's load balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
==============================================================
osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a463bc3..aecdc5a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -777,6 +777,7 @@ enum cpu_idle_type {
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
+#define SD_NUMA 0x4000 /* cross-node balancing */
extern int __weak arch_sd_sibiling_asym_packing(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 064a0af..b7e6b6f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1631,7 +1631,7 @@ static void __sched_fork(struct task_struct *p)
p->node_stamp = 0ULL;
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
- p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+ p->numa_migrate_seq = 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
@@ -5656,6 +5656,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
| 0*SD_SHARE_PKG_RESOURCES
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
+ | 1*SD_NUMA
| sd_local_flags(level)
,
.last_balance = jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3abc651..6ffddca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -877,6 +877,15 @@ static unsigned int task_scan_max(struct task_struct *p)
return max(smin, smax);
}
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -888,6 +897,7 @@ static void task_numa_placement(struct task_struct *p)
if (p->numa_scan_seq == seq)
return;
p->numa_scan_seq = seq;
+ p->numa_migrate_seq++;
p->numa_scan_period_max = task_scan_max(p);
/* Find the node with the highest number of faults */
@@ -907,8 +917,10 @@ static void task_numa_placement(struct task_struct *p)
}
/* Update the tasks preferred node if necessary */
- if (max_faults && max_nid != p->numa_preferred_nid)
+ if (max_faults && max_nid != p->numa_preferred_nid) {
p->numa_preferred_nid = max_nid;
+ p->numa_migrate_seq = 0;
+ }
}
/*
@@ -4071,6 +4083,38 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
return delta < (s64)sysctl_sched_migration_cost;
}
+#ifdef CONFIG_NUMA_BALANCING
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+ int src_nid, dst_nid;
+
+ if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults ||
+ !(env->sd->flags & SD_NUMA)) {
+ return false;
+ }
+
+ src_nid = cpu_to_node(env->src_cpu);
+ dst_nid = cpu_to_node(env->dst_cpu);
+
+ if (src_nid == dst_nid ||
+ p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+ return false;
+
+ if (dst_nid == p->numa_preferred_nid ||
+ p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+ return true;
+
+ return false;
+}
+#else
+static inline bool migrate_improves_locality(struct task_struct *p,
+ struct lb_env *env)
+{
+ return false;
+}
+#endif
+
/*
* can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
*/
@@ -4128,11 +4172,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* Aggressive migration if:
- * 1) task is cache cold, or
- * 2) too many balance attempts have failed.
+ * 1) destination numa is preferred
+ * 2) task is cache cold, or
+ * 3) too many balance attempts have failed.
*/
-
tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+
+ if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+ if (tsk_cache_hot) {
+ schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+ schedstat_inc(p, se.statistics.nr_forced_migrations);
+ }
+#endif
+ return 1;
+ }
+
if (!tsk_cache_hot ||
env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index cba5c61..d9278ce 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -67,4 +67,11 @@ SCHED_FEAT(LB_MIN, false)
*/
#ifdef CONFIG_NUMA_BALANCING
SCHED_FEAT(NUMA, false)
+
+/*
+ * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a
+ * higher number of hinting faults are recorded during active load
+ * balancing.
+ */
+SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
#endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2f06f3..42f616a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -391,6 +391,13 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "numa_balancing_settle_count",
+ .data = &sysctl_numa_balancing_settle_count,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_SCHED_DEBUG */
{
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 23/63] sched: Resist moving tasks towards nodes with fewer hinting faults
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
Just as "sched: Favour moving tasks towards the preferred node" favours
moving tasks towards nodes with a higher number of recorded NUMA hinting
faults, this patch resists moving tasks towards nodes with lower faults.
[mgorman@suse.de: changelog]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/fair.c | 33 +++++++++++++++++++++++++++++++++
kernel/sched/features.h | 8 ++++++++
2 files changed, 41 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c2b779..21cad59 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4106,12 +4106,43 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
return false;
}
+
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+ int src_nid, dst_nid;
+
+ if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
+ return false;
+
+ if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+ return false;
+
+ src_nid = cpu_to_node(env->src_cpu);
+ dst_nid = cpu_to_node(env->dst_cpu);
+
+ if (src_nid == dst_nid ||
+ p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+ return false;
+
+ if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+ return true;
+
+ return false;
+}
+
#else
static inline bool migrate_improves_locality(struct task_struct *p,
struct lb_env *env)
{
return false;
}
+
+static inline bool migrate_degrades_locality(struct task_struct *p,
+ struct lb_env *env)
+{
+ return false;
+}
#endif
/*
@@ -4174,6 +4205,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* 3) too many balance attempts have failed.
*/
tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+ if (!tsk_cache_hot)
+ tsk_cache_hot = migrate_degrades_locality(p, env);
if (migrate_improves_locality(p, env)) {
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d9278ce..5716929 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -74,4 +74,12 @@ SCHED_FEAT(NUMA, false)
* balancing.
*/
SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
+
+/*
+ * NUMA_RESIST_LOWER will resist moving tasks towards nodes where a
+ * lower number of hinting faults have been recorded. As this has
+ * the potential to prevent a task ever migrating to a new node
+ * due to CPU overload it is disabled by default.
+ */
+SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 23/63] sched: Resist moving tasks towards nodes with fewer hinting faults
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-07 18:40 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:40 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Just as "sched: Favour moving tasks towards the preferred node" favours
> moving tasks towards nodes with a higher number of recorded NUMA hinting
> faults, this patch resists moving tasks towards nodes with lower faults.
>
> [mgorman@suse.de: changelog]
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Resist moving tasks towards nodes with fewer hinting faults
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:27 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:27 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 7a0f308337d11fd5caa9f845c6d08cc5d6067988
Gitweb: http://git.kernel.org/tip/7a0f308337d11fd5caa9f845c6d08cc5d6067988
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:01 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:27 +0200
sched/numa: Resist moving tasks towards nodes with fewer hinting faults
Just as "sched: Favour moving tasks towards the preferred node" favours
moving tasks towards nodes with a higher number of recorded NUMA hinting
faults, this patch resists moving tasks towards nodes with lower faults.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-24-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 33 +++++++++++++++++++++++++++++++++
kernel/sched/features.h | 8 ++++++++
2 files changed, 41 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ffddca..8943124 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4107,12 +4107,43 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
return false;
}
+
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+ int src_nid, dst_nid;
+
+ if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
+ return false;
+
+ if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+ return false;
+
+ src_nid = cpu_to_node(env->src_cpu);
+ dst_nid = cpu_to_node(env->dst_cpu);
+
+ if (src_nid == dst_nid ||
+ p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+ return false;
+
+ if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+ return true;
+
+ return false;
+}
+
#else
static inline bool migrate_improves_locality(struct task_struct *p,
struct lb_env *env)
{
return false;
}
+
+static inline bool migrate_degrades_locality(struct task_struct *p,
+ struct lb_env *env)
+{
+ return false;
+}
#endif
/*
@@ -4177,6 +4208,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* 3) too many balance attempts have failed.
*/
tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+ if (!tsk_cache_hot)
+ tsk_cache_hot = migrate_degrades_locality(p, env);
if (migrate_improves_locality(p, env)) {
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d9278ce..5716929 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -74,4 +74,12 @@ SCHED_FEAT(NUMA, false)
* balancing.
*/
SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
+
+/*
+ * NUMA_RESIST_LOWER will resist moving tasks towards nodes where a
+ * lower number of hinting faults have been recorded. As this has
+ * the potential to prevent a task ever migrating to a new node
+ * due to CPU overload it is disabled by default.
+ */
+SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 24/63] sched: Reschedule task on preferred NUMA node once selected
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
A preferred node is selected based on the node where the most NUMA hinting
faults were incurred. There is no guarantee that the task is running
on that node at the time, so this patch reschedules the task to run on
the most idle CPU of the node as soon as it is selected. This avoids
waiting for the balancer to make a decision.
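The "most idle CPU of the selected node" step can be pictured with a small
standalone sketch. The CPU-to-node table and the load figures here are
invented; the kernel walks cpumask_of_node() under RCU and uses
weighted_cpuload() rather than a static array:
#include <limits.h>
#include <stdio.h>

#define NR_CPUS 8

/* Invented topology: CPUs 0-3 on node 0, CPUs 4-7 on node 1 */
static const int cpu_node[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };
/* Invented per-CPU load figures standing in for weighted_cpuload() */
static const unsigned long cpu_load[NR_CPUS] = { 10, 70, 30, 55, 90, 5, 60, 40 };

/* Return the least loaded CPU on nid, falling back to this_cpu */
static int find_idlest_cpu_on_node(int this_cpu, int nid)
{
        unsigned long min_load = ULONG_MAX;
        int idlest = this_cpu;

        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                if (cpu_node[cpu] != nid)
                        continue;
                if (cpu_load[cpu] < min_load) {
                        min_load = cpu_load[cpu];
                        idlest = cpu;
                }
        }
        return idlest;
}

int main(void)
{
        /* Task on CPU 2 (node 0) with preferred node 1: CPU 5 has the lowest load */
        printf("migrate to CPU %d\n", find_idlest_cpu_on_node(2, 1));
        return 0;
}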
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/core.c | 19 +++++++++++++++++++
kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 1 +
3 files changed, 65 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3515c41..60e640d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4353,6 +4353,25 @@ fail:
return ret;
}
+#ifdef CONFIG_NUMA_BALANCING
+/* Migrate current task p to target_cpu */
+int migrate_task_to(struct task_struct *p, int target_cpu)
+{
+ struct migration_arg arg = { p, target_cpu };
+ int curr_cpu = task_cpu(p);
+
+ if (curr_cpu == target_cpu)
+ return 0;
+
+ if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
+ return -EINVAL;
+
+ /* TODO: This is not properly updating schedstats */
+
+ return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
/*
* migration_cpu_stop - this will be executed by a highprio stopper thread
* and performs thread migration by bumping thread off CPU then
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 21cad59..63677ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,31 @@ static unsigned int task_scan_max(struct task_struct *p)
*/
unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+static unsigned long weighted_cpuload(const int cpu);
+
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+ unsigned long load, min_load = ULONG_MAX;
+ int i, idlest_cpu = this_cpu;
+
+ BUG_ON(cpu_to_node(this_cpu) == nid);
+
+ rcu_read_lock();
+ for_each_cpu(i, cpumask_of_node(nid)) {
+ load = weighted_cpuload(i);
+
+ if (load < min_load) {
+ min_load = load;
+ idlest_cpu = i;
+ }
+ }
+ rcu_read_unlock();
+
+ return idlest_cpu;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -916,10 +941,29 @@ static void task_numa_placement(struct task_struct *p)
}
}
- /* Update the tasks preferred node if necessary */
+ /*
+ * Record the preferred node as the node with the most faults,
+ * requeue the task to be running on the idlest CPU on the
+ * preferred node and reset the scanning rate to recheck
+ * the working set placement.
+ */
if (max_faults && max_nid != p->numa_preferred_nid) {
+ int preferred_cpu;
+
+ /*
+ * If the task is not on the preferred node then find the most
+ * idle CPU to migrate to.
+ */
+ preferred_cpu = task_cpu(p);
+ if (cpu_to_node(preferred_cpu) != max_nid) {
+ preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+ max_nid);
+ }
+
+ /* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 0;
+ migrate_task_to(p, preferred_cpu);
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6a955f4..dca80b8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -554,6 +554,7 @@ static inline u64 rq_clock_task(struct rq *rq)
}
#ifdef CONFIG_NUMA_BALANCING
+extern int migrate_task_to(struct task_struct *p, int cpu);
static inline void task_numa_free(struct task_struct *p)
{
kfree(p->numa_faults);
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 24/63] sched: Reschedule task on preferred NUMA node once selected
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-07 18:40 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:40 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> A preferred node is selected based on the node where the most NUMA hinting
> faults were incurred. There is no guarantee that the task is running
> on that node at the time, so this patch reschedules the task to run on
> the most idle CPU of the node as soon as it is selected. This avoids
> waiting for the balancer to make a decision.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Reschedule task on preferred NUMA node once selected
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:27 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:27 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: e6628d5b0a2979f3e0ee6f7783ede5df50cb9ede
Gitweb: http://git.kernel.org/tip/e6628d5b0a2979f3e0ee6f7783ede5df50cb9ede
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:02 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:28 +0200
sched/numa: Reschedule task on preferred NUMA node once selected
A preferred node is selected based on the node where the most NUMA hinting
faults were incurred. There is no guarantee that the task is running
on that node at the time, so this patch reschedules the task to run on
the most idle CPU of the node as soon as it is selected. This avoids
waiting for the balancer to make a decision.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-25-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/core.c | 19 +++++++++++++++++++
kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 1 +
3 files changed, 65 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7e6b6f..66b878e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4348,6 +4348,25 @@ fail:
return ret;
}
+#ifdef CONFIG_NUMA_BALANCING
+/* Migrate current task p to target_cpu */
+int migrate_task_to(struct task_struct *p, int target_cpu)
+{
+ struct migration_arg arg = { p, target_cpu };
+ int curr_cpu = task_cpu(p);
+
+ if (curr_cpu == target_cpu)
+ return 0;
+
+ if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
+ return -EINVAL;
+
+ /* TODO: This is not properly updating schedstats */
+
+ return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
/*
* migration_cpu_stop - this will be executed by a highprio stopper thread
* and performs thread migration by bumping thread off CPU then
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8943124..8b15e9e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,31 @@ static unsigned int task_scan_max(struct task_struct *p)
*/
unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+static unsigned long weighted_cpuload(const int cpu);
+
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+ unsigned long load, min_load = ULONG_MAX;
+ int i, idlest_cpu = this_cpu;
+
+ BUG_ON(cpu_to_node(this_cpu) == nid);
+
+ rcu_read_lock();
+ for_each_cpu(i, cpumask_of_node(nid)) {
+ load = weighted_cpuload(i);
+
+ if (load < min_load) {
+ min_load = load;
+ idlest_cpu = i;
+ }
+ }
+ rcu_read_unlock();
+
+ return idlest_cpu;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -916,10 +941,29 @@ static void task_numa_placement(struct task_struct *p)
}
}
- /* Update the tasks preferred node if necessary */
+ /*
+ * Record the preferred node as the node with the most faults,
+ * requeue the task to be running on the idlest CPU on the
+ * preferred node and reset the scanning rate to recheck
+ * the working set placement.
+ */
if (max_faults && max_nid != p->numa_preferred_nid) {
+ int preferred_cpu;
+
+ /*
+ * If the task is not on the preferred node then find the most
+ * idle CPU to migrate to.
+ */
+ preferred_cpu = task_cpu(p);
+ if (cpu_to_node(preferred_cpu) != max_nid) {
+ preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+ max_nid);
+ }
+
+ /* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 0;
+ migrate_task_to(p, preferred_cpu);
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 199099c..66458c9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -557,6 +557,7 @@ static inline u64 rq_clock_task(struct rq *rq)
}
#ifdef CONFIG_NUMA_BALANCING
+extern int migrate_task_to(struct task_struct *p, int cpu);
static inline void task_numa_free(struct task_struct *p)
{
kfree(p->numa_faults);
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 25/63] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared. This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.
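A small standalone sketch of the array layout the patch introduces may help: one allocation holds both the decayed per-node counts and the per-scan buffer, indexed by 2 * nid + priv. The fault numbers below are invented and this is only a model of the indexing scheme, not kernel code:

#include <stdio.h>
#include <stdlib.h>

#define NR_NODES 2

/* Mirrors the 2 * nid + priv indexing used by the patch. */
static int faults_idx(int nid, int priv)
{
	return 2 * nid + priv;
}

int main(void)
{
	/* One allocation: first half live counts, second half scan buffer. */
	size_t half = sizeof(unsigned long) * 2 * NR_NODES;
	unsigned long *faults = calloc(2, half);
	unsigned long *buffer;
	int nid, priv;

	if (!faults)
		return 1;
	buffer = faults + 2 * NR_NODES;

	/* Pretend one scan window recorded some faults (invented values). */
	buffer[faults_idx(0, 1)] = 8;	/* node 0, private */
	buffer[faults_idx(1, 0)] = 3;	/* node 1, shared  */

	/* Decay the old window and fold in the new samples, as the patch does. */
	for (nid = 0; nid < NR_NODES; nid++) {
		for (priv = 0; priv < 2; priv++) {
			int i = faults_idx(nid, priv);

			faults[i] >>= 1;
			faults[i] += buffer[i];
			buffer[i] = 0;
		}
		printf("node %d: shared %lu private %lu\n", nid,
		       faults[faults_idx(nid, 0)],
		       faults[faults_idx(nid, 1)]);
	}

	free(faults);
	return 0;
}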
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/sched.h | 5 +++--
kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++-----------
mm/huge_memory.c | 5 +++--
mm/memory.c | 8 ++++++--
4 files changed, 47 insertions(+), 17 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d5ae4bd..8a3aa9e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1435,10 +1435,11 @@ struct task_struct {
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
#ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
extern void set_numabalancing_state(bool enabled);
#else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages,
+ bool migrated)
{
}
static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63677ed..dce3545 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,20 @@ static unsigned int task_scan_max(struct task_struct *p)
*/
unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+static inline int task_faults_idx(int nid, int priv)
+{
+ return 2 * nid + priv;
+}
+
+static inline unsigned long task_faults(struct task_struct *p, int nid)
+{
+ if (!p->numa_faults)
+ return 0;
+
+ return p->numa_faults[task_faults_idx(nid, 0)] +
+ p->numa_faults[task_faults_idx(nid, 1)];
+}
+
static unsigned long weighted_cpuload(const int cpu);
@@ -928,13 +942,19 @@ static void task_numa_placement(struct task_struct *p)
/* Find the node with the highest number of faults */
for_each_online_node(nid) {
unsigned long faults;
+ int priv, i;
- /* Decay existing window and copy faults since last scan */
- p->numa_faults[nid] >>= 1;
- p->numa_faults[nid] += p->numa_faults_buffer[nid];
- p->numa_faults_buffer[nid] = 0;
+ for (priv = 0; priv < 2; priv++) {
+ i = task_faults_idx(nid, priv);
- faults = p->numa_faults[nid];
+ /* Decay existing window, copy faults since last scan */
+ p->numa_faults[i] >>= 1;
+ p->numa_faults[i] += p->numa_faults_buffer[i];
+ p->numa_faults_buffer[i] = 0;
+ }
+
+ /* Find maximum private faults */
+ faults = p->numa_faults[task_faults_idx(nid, 1)];
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
@@ -970,16 +990,20 @@ static void task_numa_placement(struct task_struct *p)
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
{
struct task_struct *p = current;
+ int priv;
if (!numabalancing_enabled)
return;
+ /* For now, do not attempt to detect private/shared accesses */
+ priv = 1;
+
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * nr_node_ids;
+ int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -987,7 +1011,7 @@ void task_numa_fault(int node, int pages, bool migrated)
return;
BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+ p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
}
/*
@@ -1005,7 +1029,7 @@ void task_numa_fault(int node, int pages, bool migrated)
task_numa_placement(p);
- p->numa_faults_buffer[node] += pages;
+ p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
}
static void reset_ptenuma_scan(struct task_struct *p)
@@ -4145,7 +4169,7 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
return false;
if (dst_nid == p->numa_preferred_nid ||
- p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+ task_faults(p, dst_nid) > task_faults(p, src_nid))
return true;
return false;
@@ -4169,7 +4193,7 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
return false;
- if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+ if (task_faults(p, dst_nid) < task_faults(p, src_nid))
return true;
return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8677dbf..9142167 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1282,7 +1282,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
- int target_nid;
+ int target_nid, last_nid = -1;
bool page_locked;
bool migrated = false;
@@ -1293,6 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = pmd_page(pmd);
BUG_ON(is_huge_zero_page(page));
page_nid = page_to_nid(page);
+ last_nid = page_nid_last(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1361,7 +1362,7 @@ out:
page_unlock_anon_vma_read(anon_vma);
if (page_nid != -1)
- task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+ task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index ed51f15..24bc9b8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,6 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page = NULL;
spinlock_t *ptl;
int page_nid = -1;
+ int last_nid;
int target_nid;
bool migrated = false;
@@ -3566,6 +3567,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
BUG_ON(is_zero_pfn(page_to_pfn(page)));
+ last_nid = page_nid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
@@ -3581,7 +3583,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
out:
if (page_nid != -1)
- task_numa_fault(page_nid, 1, migrated);
+ task_numa_fault(last_nid, page_nid, 1, migrated);
return 0;
}
@@ -3596,6 +3598,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
+ int last_nid;
spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3643,6 +3646,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(page_mapcount(page) != 1))
continue;
+ last_nid = page_nid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(pte, ptl);
@@ -3655,7 +3659,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
if (page_nid != -1)
- task_numa_fault(page_nid, 1, migrated);
+ task_numa_fault(last_nid, page_nid, 1, migrated);
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 25/63] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-07 18:41 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:41 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Ideally it would be possible to distinguish between NUMA hinting faults
> that are private to a task and those that are shared. This patch prepares
> infrastructure for separately accounting shared and private faults by
> allocating the necessary buffers and passing in relevant information. For
> now, all faults are treated as private and detection will be introduced
> later.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Add infrastructure for split shared/private accounting of NUMA hinting faults
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:28 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:28 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: ac8e895bd260cb8bb19ade6a3abd44e7abe9a01d
Gitweb: http://git.kernel.org/tip/ac8e895bd260cb8bb19ade6a3abd44e7abe9a01d
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:03 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:30 +0200
sched/numa: Add infrastructure for split shared/private accounting of NUMA hinting faults
Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared. This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-26-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
include/linux/sched.h | 5 +++--
kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++-----------
mm/huge_memory.c | 5 +++--
mm/memory.c | 8 ++++++--
4 files changed, 47 insertions(+), 17 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index aecdc5a..d946195 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1445,10 +1445,11 @@ struct task_struct {
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
#ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
extern void set_numabalancing_state(bool enabled);
#else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages,
+ bool migrated)
{
}
static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b15e9e..89eeb89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,20 @@ static unsigned int task_scan_max(struct task_struct *p)
*/
unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+static inline int task_faults_idx(int nid, int priv)
+{
+ return 2 * nid + priv;
+}
+
+static inline unsigned long task_faults(struct task_struct *p, int nid)
+{
+ if (!p->numa_faults)
+ return 0;
+
+ return p->numa_faults[task_faults_idx(nid, 0)] +
+ p->numa_faults[task_faults_idx(nid, 1)];
+}
+
static unsigned long weighted_cpuload(const int cpu);
@@ -928,13 +942,19 @@ static void task_numa_placement(struct task_struct *p)
/* Find the node with the highest number of faults */
for_each_online_node(nid) {
unsigned long faults;
+ int priv, i;
- /* Decay existing window and copy faults since last scan */
- p->numa_faults[nid] >>= 1;
- p->numa_faults[nid] += p->numa_faults_buffer[nid];
- p->numa_faults_buffer[nid] = 0;
+ for (priv = 0; priv < 2; priv++) {
+ i = task_faults_idx(nid, priv);
- faults = p->numa_faults[nid];
+ /* Decay existing window, copy faults since last scan */
+ p->numa_faults[i] >>= 1;
+ p->numa_faults[i] += p->numa_faults_buffer[i];
+ p->numa_faults_buffer[i] = 0;
+ }
+
+ /* Find maximum private faults */
+ faults = p->numa_faults[task_faults_idx(nid, 1)];
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
@@ -970,16 +990,20 @@ static void task_numa_placement(struct task_struct *p)
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
{
struct task_struct *p = current;
+ int priv;
if (!numabalancing_enabled)
return;
+ /* For now, do not attempt to detect private/shared accesses */
+ priv = 1;
+
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * nr_node_ids;
+ int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -987,7 +1011,7 @@ void task_numa_fault(int node, int pages, bool migrated)
return;
BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+ p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
}
/*
@@ -1005,7 +1029,7 @@ void task_numa_fault(int node, int pages, bool migrated)
task_numa_placement(p);
- p->numa_faults_buffer[node] += pages;
+ p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
}
static void reset_ptenuma_scan(struct task_struct *p)
@@ -4146,7 +4170,7 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
return false;
if (dst_nid == p->numa_preferred_nid ||
- p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+ task_faults(p, dst_nid) > task_faults(p, src_nid))
return true;
return false;
@@ -4170,7 +4194,7 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
return false;
- if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+ if (task_faults(p, dst_nid) < task_faults(p, src_nid))
return true;
return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8677dbf..9142167 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1282,7 +1282,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
- int target_nid;
+ int target_nid, last_nid = -1;
bool page_locked;
bool migrated = false;
@@ -1293,6 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = pmd_page(pmd);
BUG_ON(is_huge_zero_page(page));
page_nid = page_to_nid(page);
+ last_nid = page_nid_last(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1361,7 +1362,7 @@ out:
page_unlock_anon_vma_read(anon_vma);
if (page_nid != -1)
- task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+ task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index ed51f15..24bc9b8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,6 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page = NULL;
spinlock_t *ptl;
int page_nid = -1;
+ int last_nid;
int target_nid;
bool migrated = false;
@@ -3566,6 +3567,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
BUG_ON(is_zero_pfn(page_to_pfn(page)));
+ last_nid = page_nid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
@@ -3581,7 +3583,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
out:
if (page_nid != -1)
- task_numa_fault(page_nid, 1, migrated);
+ task_numa_fault(last_nid, page_nid, 1, migrated);
return 0;
}
@@ -3596,6 +3598,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
+ int last_nid;
spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3643,6 +3646,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(page_mapcount(page) != 1))
continue;
+ last_nid = page_nid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(pte, ptl);
@@ -3655,7 +3659,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
if (page_nid != -1)
- task_numa_fault(page_nid, 1, migrated);
+ task_numa_fault(last_nid, page_nid, 1, migrated);
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 26/63] sched: Check current->mm before allocating NUMA faults
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
task_numa_placement checks current->mm, but only after buffers for faults
have already been uselessly allocated. Move the check earlier.
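As a generic illustration of the pattern (a made-up userspace sketch, not the kernel code): reject the caller before any per-task buffer is allocated.

#include <stdio.h>
#include <stdlib.h>

struct task {
	void *mm;			/* NULL for kernel-thread-like callers */
	unsigned long *numa_faults;
};

/* Illustrative only: bail out before allocating the per-task buffers. */
static int record_fault(struct task *t)
{
	if (!t->mm)			/* e.g. ksmd faulting in a user's mm */
		return -1;

	if (!t->numa_faults) {
		t->numa_faults = calloc(16, sizeof(unsigned long));
		if (!t->numa_faults)
			return -1;
	}
	t->numa_faults[0]++;
	return 0;
}

int main(void)
{
	struct task kthread = { 0 };
	struct task user = { .mm = (void *)1 };

	printf("kthread: %d, user task: %d\n",
	       record_fault(&kthread), record_fault(&user));
	free(user.numa_faults);
	return 0;
}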
[peterz@infradead.org: Identified the problem]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/fair.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dce3545..9eb384b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -930,8 +930,6 @@ static void task_numa_placement(struct task_struct *p)
int seq, nid, max_nid = -1;
unsigned long max_faults = 0;
- if (!p->mm) /* for example, ksmd faulting in a user's mm */
- return;
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
return;
@@ -998,6 +996,10 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
if (!numabalancing_enabled)
return;
+ /* for example, ksmd faulting in a user's mm */
+ if (!p->mm)
+ return;
+
/* For now, do not attempt to detect private/shared accesses */
priv = 1;
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 26/63] sched: Check current->mm before allocating NUMA faults
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-07 18:41 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:41 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> task_numa_placement checks current->mm but after buffers for faults
> have already been uselessly allocated. Move the check earlier.
>
> [peterz@infradead.org: Identified the problem]
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Check current->mm before allocating NUMA faults
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:28 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:28 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 9ff1d9ff3c2c8ab3feaeb2e8056a07ca293f7bde
Gitweb: http://git.kernel.org/tip/9ff1d9ff3c2c8ab3feaeb2e8056a07ca293f7bde
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:04 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:31 +0200
sched/numa: Check current->mm before allocating NUMA faults
task_numa_placement checks current->mm, but only after buffers for faults
have already been uselessly allocated. Move the check earlier.
[peterz@infradead.org: Identified the problem]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-27-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89eeb89..3383079 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -930,8 +930,6 @@ static void task_numa_placement(struct task_struct *p)
int seq, nid, max_nid = -1;
unsigned long max_faults = 0;
- if (!p->mm) /* for example, ksmd faulting in a user's mm */
- return;
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
return;
@@ -998,6 +996,10 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
if (!numabalancing_enabled)
return;
+ /* for example, ksmd faulting in a user's mm */
+ if (!p->mm)
+ return;
+
/* For now, do not attempt to detect private/shared accesses */
priv = 1;
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 27/63] mm: numa: Scan pages with elevated page_mapcount
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
Currently automatic NUMA balancing is unable to distinguish between falsely
shared and genuinely private pages except by ignoring pages with an elevated
page_mapcount entirely. This avoids shared pages bouncing between the
nodes whose tasks are using them, but it also ignores quite a lot of data.
This patch kicks away the training wheels in preparation for the shared/private
fault detection that is added later in the series. The ordering is so
that the impact of the shared/private detection can be easily measured. Note
that the patch does not migrate shared, file-backed pages within VMAs marked
VM_EXEC as these are generally shared library pages. Migrating such pages
is not beneficial as there is an expectation that they are read-shared between
caches, and iTLB and iCache pressure is generally low.
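The filter added below reduces to a single predicate: skip migration only when a page is mapped by more than one process, is file backed and sits in an executable mapping. A simplified standalone sketch of that check (the struct fields and the VMA_EXEC flag are stand-ins, not the kernel's definitions):

#include <stdbool.h>
#include <stdio.h>

#define VMA_EXEC 0x1		/* stand-in for VM_EXEC */

struct fake_page { int mapcount; bool file_backed; };
struct fake_vma { unsigned long flags; };

/* Mirrors the check added to migrate_misplaced_page() in simplified form. */
static bool skip_migration(const struct fake_page *page,
			   const struct fake_vma *vma)
{
	return page->mapcount != 1 &&
	       page->file_backed &&
	       (vma->flags & VMA_EXEC);
}

int main(void)
{
	struct fake_page libpage = { .mapcount = 12, .file_backed = true };
	struct fake_page anon    = { .mapcount = 3,  .file_backed = false };
	struct fake_vma  exec    = { .flags = VMA_EXEC };

	printf("shared library page skipped: %d\n", skip_migration(&libpage, &exec));
	printf("shared anonymous page skipped: %d\n", skip_migration(&anon, &exec));
	return 0;
}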
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/migrate.h | 7 ++++---
mm/huge_memory.c | 12 +++++-------
mm/memory.c | 7 ++-----
mm/migrate.c | 17 ++++++-----------
mm/mprotect.c | 4 +---
5 files changed, 18 insertions(+), 29 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 8d3c57f..f5096b5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -90,11 +90,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
#endif /* CONFIG_MIGRATION */
#ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+ struct vm_area_struct *vma, int node);
extern bool migrate_ratelimited(int node);
#else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+ struct vm_area_struct *vma, int node)
{
return -EAGAIN; /* can't migrate now */
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9142167..2a28c2c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1484,14 +1484,12 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
struct page *page = pmd_page(*pmd);
/*
- * Only check non-shared pages. Do not trap faults
- * against the zero page. The read-only data is likely
- * to be read-cached on the local CPU cache and it is
- * less useful to know about local vs remote hits on
- * the zero page.
+ * Do not trap faults against the zero page. The
+ * read-only data is likely to be read-cached on the
+ * local CPU cache and it is less useful to know about
+ * local vs remote hits on the zero page.
*/
- if (page_mapcount(page) == 1 &&
- !is_huge_zero_page(page) &&
+ if (!is_huge_zero_page(page) &&
!pmd_numa(*pmd)) {
entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_mknuma(entry);
diff --git a/mm/memory.c b/mm/memory.c
index 24bc9b8..3e3b4b8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3577,7 +3577,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
/* Migrate to the requested node */
- migrated = migrate_misplaced_page(page, target_nid);
+ migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
page_nid = target_nid;
@@ -3642,16 +3642,13 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = vm_normal_page(vma, addr, pteval);
if (unlikely(!page))
continue;
- /* only check non-shared pages */
- if (unlikely(page_mapcount(page) != 1))
- continue;
last_nid = page_nid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(pte, ptl);
if (target_nid != -1) {
- migrated = migrate_misplaced_page(page, target_nid);
+ migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
page_nid = target_nid;
} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index ce8c3a0..f212944 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1599,7 +1599,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
* node. Caller is expected to have an elevated reference count on
* the page that will be dropped by this function before returning.
*/
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+ int node)
{
pg_data_t *pgdat = NODE_DATA(node);
int isolated;
@@ -1607,10 +1608,11 @@ int migrate_misplaced_page(struct page *page, int node)
LIST_HEAD(migratepages);
/*
- * Don't migrate pages that are mapped in multiple processes.
- * TODO: Handle false sharing detection instead of this hammer
+ * Don't migrate file pages that are mapped in multiple processes
+ * with execute permissions as they are probably shared libraries.
*/
- if (page_mapcount(page) != 1)
+ if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+ (vma->vm_flags & VM_EXEC))
goto out;
/*
@@ -1661,13 +1663,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
int page_lru = page_is_file_cache(page);
/*
- * Don't migrate pages that are mapped in multiple processes.
- * TODO: Handle false sharing detection instead of this hammer
- */
- if (page_mapcount(page) != 1)
- goto out_dropref;
-
- /*
* Rate-limit the amount of data that is being migrated to a node.
* Optimal placement is no good if the memory bus is saturated and
* all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2da33dc..41e0292 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (last_nid != this_nid)
all_same_node = false;
- /* only check non-shared pages */
- if (!pte_numa(oldpte) &&
- page_mapcount(page) == 1) {
+ if (!pte_numa(oldpte)) {
ptent = pte_mknuma(ptent);
updated = true;
}
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 27/63] mm: numa: Scan pages with elevated page_mapcount
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-07 18:43 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:43 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Currently automatic NUMA balancing is unable to distinguish between false
> shared versus private pages except by ignoring pages with an elevated
> page_mapcount entirely. This avoids shared pages bouncing between the
> nodes whose task is using them but that is ignored quite a lot of data.
>
> This patch kicks away the training wheels in preparation for adding support
> for identifying shared/private pages is now in place. The ordering is so
> that the impact of the shared/private detection can be easily measured. Note
> that the patch does not migrate shared, file-backed within vmas marked
> VM_EXEC as these are generally shared library pages. Migrating such pages
> is not beneficial as there is an expectation they are read-shared between
> caches and iTLB and iCache pressure is generally low.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: numa: Scan pages with elevated page_mapcount
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:28 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:28 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 1bc115d87dffd1c43bdc3c9c9d1e3a51c195d18e
Gitweb: http://git.kernel.org/tip/1bc115d87dffd1c43bdc3c9c9d1e3a51c195d18e
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:05 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:32 +0200
mm: numa: Scan pages with elevated page_mapcount
Currently automatic NUMA balancing is unable to distinguish between falsely
shared and genuinely private pages except by ignoring pages with an elevated
page_mapcount entirely. This avoids shared pages bouncing between the
nodes whose tasks are using them, but it also ignores quite a lot of data.
This patch kicks away the training wheels in preparation for the shared/private
fault detection that is added later in the series. The ordering is so
that the impact of the shared/private detection can be easily measured. Note
that the patch does not migrate shared, file-backed pages within VMAs marked
VM_EXEC as these are generally shared library pages. Migrating such pages
is not beneficial as there is an expectation that they are read-shared between
caches, and iTLB and iCache pressure is generally low.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-28-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
include/linux/migrate.h | 7 ++++---
mm/huge_memory.c | 12 +++++-------
mm/memory.c | 7 ++-----
mm/migrate.c | 17 ++++++-----------
mm/mprotect.c | 4 +---
5 files changed, 18 insertions(+), 29 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 8d3c57f..f5096b5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -90,11 +90,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
#endif /* CONFIG_MIGRATION */
#ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+ struct vm_area_struct *vma, int node);
extern bool migrate_ratelimited(int node);
#else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+ struct vm_area_struct *vma, int node)
{
return -EAGAIN; /* can't migrate now */
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9142167..2a28c2c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1484,14 +1484,12 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
struct page *page = pmd_page(*pmd);
/*
- * Only check non-shared pages. Do not trap faults
- * against the zero page. The read-only data is likely
- * to be read-cached on the local CPU cache and it is
- * less useful to know about local vs remote hits on
- * the zero page.
+ * Do not trap faults against the zero page. The
+ * read-only data is likely to be read-cached on the
+ * local CPU cache and it is less useful to know about
+ * local vs remote hits on the zero page.
*/
- if (page_mapcount(page) == 1 &&
- !is_huge_zero_page(page) &&
+ if (!is_huge_zero_page(page) &&
!pmd_numa(*pmd)) {
entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_mknuma(entry);
diff --git a/mm/memory.c b/mm/memory.c
index 24bc9b8..3e3b4b8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3577,7 +3577,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
/* Migrate to the requested node */
- migrated = migrate_misplaced_page(page, target_nid);
+ migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
page_nid = target_nid;
@@ -3642,16 +3642,13 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = vm_normal_page(vma, addr, pteval);
if (unlikely(!page))
continue;
- /* only check non-shared pages */
- if (unlikely(page_mapcount(page) != 1))
- continue;
last_nid = page_nid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(pte, ptl);
if (target_nid != -1) {
- migrated = migrate_misplaced_page(page, target_nid);
+ migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
page_nid = target_nid;
} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index 7bd90d3..fcba2f4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1599,7 +1599,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
* node. Caller is expected to have an elevated reference count on
* the page that will be dropped by this function before returning.
*/
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+ int node)
{
pg_data_t *pgdat = NODE_DATA(node);
int isolated;
@@ -1607,10 +1608,11 @@ int migrate_misplaced_page(struct page *page, int node)
LIST_HEAD(migratepages);
/*
- * Don't migrate pages that are mapped in multiple processes.
- * TODO: Handle false sharing detection instead of this hammer
+ * Don't migrate file pages that are mapped in multiple processes
+ * with execute permissions as they are probably shared libraries.
*/
- if (page_mapcount(page) != 1)
+ if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+ (vma->vm_flags & VM_EXEC))
goto out;
/*
@@ -1661,13 +1663,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
int page_lru = page_is_file_cache(page);
/*
- * Don't migrate pages that are mapped in multiple processes.
- * TODO: Handle false sharing detection instead of this hammer
- */
- if (page_mapcount(page) != 1)
- goto out_dropref;
-
- /*
* Rate-limit the amount of data that is being migrated to a node.
* Optimal placement is no good if the memory bus is saturated and
* all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2da33dc..41e0292 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (last_nid != this_nid)
all_same_node = false;
- /* only check non-shared pages */
- if (!pte_numa(oldpte) &&
- page_mapcount(page) == 1) {
+ if (!pte_numa(oldpte)) {
ptent = pte_mknuma(ptent);
updated = true;
}
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 28/63] sched: Remove check that skips small VMAs
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
task_numa_work skips small VMAs. At the time, the rationale was to reduce
the scanning overhead, which was considerable. It is a dubious hack at best.
It would make much more sense to cache where faults have been observed
and only rescan those regions during subsequent PTE scans. Remove this
hack as motivation to do it properly in the future.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/fair.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9eb384b..fb4fc66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1127,10 +1127,6 @@ void task_numa_work(struct callback_head *work)
if (!vma_migratable(vma))
continue;
- /* Skip small VMAs. They are not likely to be of relevance */
- if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
- continue;
-
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 28/63] sched: Remove check that skips small VMAs
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-07 18:44 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:44 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> task_numa_work skips small VMAs. At the time the logic was to reduce the
> scanning overhead which was considerable. It is a dubious hack at best.
> It would make much more sense to cache where faults have been observed
> and only rescan those regions during subsequent PTE scans. Remove this
> hack as motivation to do it properly in the future.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Remove check that skips small VMAs
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:28 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:28 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 073b5beea735c7e1970686c94ff1f3aaac790a2a
Gitweb: http://git.kernel.org/tip/073b5beea735c7e1970686c94ff1f3aaac790a2a
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:06 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:33 +0200
sched/numa: Remove check that skips small VMAs
task_numa_work skips small VMAs. At the time the logic was to reduce the
scanning overhead which was considerable. It is a dubious hack at best.
It would make much more sense to cache where faults have been observed
and only rescan those regions during subsequent PTE scans. Remove this
hack as motivation to do it properly in the future.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-29-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3383079..862d20d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1127,10 +1127,6 @@ void task_numa_work(struct callback_head *work)
if (!vma_migratable(vma))
continue;
- /* Skip small VMAs. They are not likely to be of relevance */
- if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
- continue;
-
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 29/63] sched: Set preferred NUMA node based on number of private faults
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.
To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that in general the private accesses
are more important for cpu->memory locality. Also,
no new infrastructure is required to treat private pages properly but
interleaving for shared pages requires additional infrastructure.
To detect private accesses the pid of the last accessing task is required
but the storage requirements are high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults were measured. Shared
faults are not taken into consideration for a few reasons.
First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and then
scheduled away later only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way accounting for shared faults the same as
private faults can result in lower performance overall.
The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.
The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.
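As a rough userspace sketch of the encoding described above (8 pid bits as in the patch; the 10-bit node field below is an assumption standing in for the config-dependent NODES_SHIFT):

#include <stdbool.h>
#include <stdio.h>

#define PID_SHIFT 8			/* 8 pid bits, as in the patch */
#define PID_MASK  ((1 << PID_SHIFT) - 1)
#define NID_SHIFT 10			/* assumed width; the kernel uses NODES_SHIFT */
#define NID_MASK  ((1 << NID_SHIFT) - 1)

/* Pack node and low pid bits into one value, as stored in page->flags. */
static int nid_pid_to_nidpid(int nid, int pid)
{
	return ((nid & NID_MASK) << PID_SHIFT) | (pid & PID_MASK);
}

static int nidpid_to_pid(int nidpid) { return nidpid & PID_MASK; }
static int nidpid_to_nid(int nidpid) { return (nidpid >> PID_SHIFT) & NID_MASK; }

/*
 * A fault is treated as private when the page has never been touched
 * (pid field still all ones after reset) or the stored pid bits match
 * the current task. Different pids sharing the same low 8 bits collide
 * and are misclassified as private, which the changelog accepts.
 */
static bool fault_is_private(int last_nidpid, int current_pid)
{
	if (nidpid_to_pid(last_nidpid) == (-1 & PID_MASK))
		return true;
	return (current_pid & PID_MASK) == nidpid_to_pid(last_nidpid);
}

int main(void)
{
	int stored = nid_pid_to_nidpid(1, 4242);	/* node 1, pid 4242 */

	printf("node=%d pid-bits=%d\n", nidpid_to_nid(stored), nidpid_to_pid(stored));
	printf("same task private?  %d\n", fault_is_private(stored, 4242));
	printf("other task private? %d\n", fault_is_private(stored, 4243));
	return 0;
}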
[riel@redhat.com: Fix compilation error when !NUMA_BALANCING]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mm.h | 89 +++++++++++++++++++++++++++++----------
include/linux/mm_types.h | 4 +-
include/linux/page-flags-layout.h | 28 +++++++-----
kernel/sched/fair.c | 12 ++++--
mm/huge_memory.c | 8 ++--
mm/memory.c | 16 +++----
mm/mempolicy.c | 8 ++--
mm/migrate.c | 4 +-
mm/mm_init.c | 18 ++++----
mm/mmzone.c | 14 +++---
mm/mprotect.c | 26 ++++++++----
mm/page_alloc.c | 4 +-
12 files changed, 149 insertions(+), 82 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..bb412ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -581,11 +581,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
* sets it, so none of the operations on it need to be atomic.
*/
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NID_PGOFF (ZONES_PGOFF - LAST_NID_WIDTH)
+#define LAST_NIDPID_PGOFF (ZONES_PGOFF - LAST_NIDPID_WIDTH)
/*
* Define the bit shifts to access each section. For non-existent
@@ -595,7 +595,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NID_PGSHIFT (LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
+#define LAST_NIDPID_PGSHIFT (LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
#ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -617,7 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NID_MASK ((1UL << LAST_NID_WIDTH) - 1)
+#define LAST_NIDPID_MASK ((1UL << LAST_NIDPID_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)
static inline enum zone_type page_zonenum(const struct page *page)
@@ -661,48 +661,93 @@ static inline int page_to_nid(const struct page *page)
#endif
#ifdef CONFIG_NUMA_BALANCING
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int nid_pid_to_nidpid(int nid, int pid)
{
- return xchg(&page->_last_nid, nid);
+ return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
}
-static inline int page_nid_last(struct page *page)
+static inline int nidpid_to_pid(int nidpid)
{
- return page->_last_nid;
+ return nidpid & LAST__PID_MASK;
}
-static inline void page_nid_reset_last(struct page *page)
+
+static inline int nidpid_to_nid(int nidpid)
+{
+ return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+ return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+}
+
+static inline bool nidpid_nid_unset(int nidpid)
{
- page->_last_nid = -1;
+ return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+}
+
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+static inline int page_nidpid_xchg_last(struct page *page, int nid)
+{
+ return xchg(&page->_last_nidpid, nid);
+}
+
+static inline int page_nidpid_last(struct page *page)
+{
+ return page->_last_nidpid;
+}
+static inline void page_nidpid_reset_last(struct page *page)
+{
+ page->_last_nidpid = -1;
}
#else
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
{
- return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+ return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
}
-extern int page_nid_xchg_last(struct page *page, int nid);
+extern int page_nidpid_xchg_last(struct page *page, int nidpid);
-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
{
- int nid = (1 << LAST_NID_SHIFT) - 1;
+ int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
- page->flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
- page->flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+ page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+ page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
}
-#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
#else
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
{
return page_to_nid(page);
}
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
{
return page_to_nid(page);
}
-static inline void page_nid_reset_last(struct page *page)
+static inline int nidpid_to_nid(int nidpid)
+{
+ return -1;
+}
+
+static inline int nidpid_to_pid(int nidpid)
+{
+ return -1;
+}
+
+static inline int nid_pid_to_nidpid(int nid, int pid)
+{
+ return -1;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+ return 1;
+}
+
+static inline void page_nidpid_reset_last(struct page *page)
{
}
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b7adf1d..38a902a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
void *shadow;
#endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
- int _last_nid;
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+ int _last_nidpid;
#endif
}
/*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 93506a1..02bc918 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -38,10 +38,10 @@
* The last is when there is insufficient space in page->flags and a separate
* lookup is necessary.
*
- * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
- * " plus space for last_nid: | NODE | ZONE | LAST_NID ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * " plus space for last_nid: | SECTION | NODE | ZONE | LAST_NID ... | FLAGS |
+ * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_nidpid: | NODE | ZONE | LAST_NIDPID ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
* classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
*/
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -62,15 +62,21 @@
#endif
#ifdef CONFIG_NUMA_BALANCING
-#define LAST_NID_SHIFT NODES_SHIFT
+#define LAST__PID_SHIFT 8
+#define LAST__PID_MASK ((1 << LAST__PID_SHIFT)-1)
+
+#define LAST__NID_SHIFT NODES_SHIFT
+#define LAST__NID_MASK ((1 << LAST__NID_SHIFT)-1)
+
+#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
#else
-#define LAST_NID_SHIFT 0
+#define LAST_NIDPID_SHIFT 0
#endif
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NID_WIDTH LAST_NID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
#else
-#define LAST_NID_WIDTH 0
+#define LAST_NIDPID_WIDTH 0
#endif
/*
@@ -81,8 +87,8 @@
#define NODE_NOT_IN_PAGE_FLAGS
#endif
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NID_WIDTH == 0
-#define LAST_NID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
+#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
#endif
#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fb4fc66..f83da25 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,7 +988,7 @@ static void task_numa_placement(struct task_struct *p)
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int last_nid, int node, int pages, bool migrated)
+void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
{
struct task_struct *p = current;
int priv;
@@ -1000,8 +1000,14 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
if (!p->mm)
return;
- /* For now, do not attempt to detect private/shared accesses */
- priv = 1;
+ /*
+ * First accesses are treated as private, otherwise consider accesses
+ * to be private if the accessing pid has not changed
+ */
+ if (!nidpid_pid_unset(last_nidpid))
+ priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+ else
+ priv = 1;
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a28c2c..0baf0e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1282,7 +1282,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
- int target_nid, last_nid = -1;
+ int target_nid, last_nidpid = -1;
bool page_locked;
bool migrated = false;
@@ -1293,7 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = pmd_page(pmd);
BUG_ON(is_huge_zero_page(page));
page_nid = page_to_nid(page);
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1362,7 +1362,7 @@ out:
page_unlock_anon_vma_read(anon_vma);
if (page_nid != -1)
- task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
+ task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
return 0;
}
@@ -1682,7 +1682,7 @@ static void __split_huge_page_refcount(struct page *page,
page_tail->mapping = page->mapping;
page_tail->index = page->index + i;
- page_nid_xchg_last(page_tail, page_nid_last(page));
+ page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index 3e3b4b8..cc7f206 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@
#include "internal.h"
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nid.
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
#endif
#ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page = NULL;
spinlock_t *ptl;
int page_nid = -1;
- int last_nid;
+ int last_nidpid;
int target_nid;
bool migrated = false;
@@ -3567,7 +3567,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
BUG_ON(is_zero_pfn(page_to_pfn(page)));
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
@@ -3583,7 +3583,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
out:
if (page_nid != -1)
- task_numa_fault(last_nid, page_nid, 1, migrated);
+ task_numa_fault(last_nidpid, page_nid, 1, migrated);
return 0;
}
@@ -3598,7 +3598,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
- int last_nid;
+ int last_nidpid;
spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3643,7 +3643,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!page))
continue;
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(pte, ptl);
@@ -3656,7 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
if (page_nid != -1)
- task_numa_fault(last_nid, page_nid, 1, migrated);
+ task_numa_fault(last_nidpid, page_nid, 1, migrated);
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0472964..aff1f1e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2348,9 +2348,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
/* Migrate the page towards the node whose CPU is referencing it */
if (pol->flags & MPOL_F_MORON) {
- int last_nid;
+ int last_nidpid;
+ int this_nidpid;
polnid = numa_node_id();
+ this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
/*
* Multi-stage node selection is used in conjunction
@@ -2373,8 +2375,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
* it less likely we act on an unlikely task<->page
* relation.
*/
- last_nid = page_nid_xchg_last(page, polnid);
- if (last_nid != polnid)
+ last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
+ if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
goto out;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index f212944..22abf87 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1498,7 +1498,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
__GFP_NOWARN) &
~GFP_IOFS, 0);
if (newpage)
- page_nid_xchg_last(newpage, page_nid_last(page));
+ page_nidpid_xchg_last(newpage, page_nidpid_last(page));
return newpage;
}
@@ -1675,7 +1675,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
if (!new_page)
goto out_fail;
- page_nid_xchg_last(new_page, page_nid_last(page));
+ page_nidpid_xchg_last(new_page, page_nidpid_last(page));
isolated = numamigrate_isolate_page(pgdat, page);
if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 633c088..467de57 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,26 +71,26 @@ void __init mminit_verify_pageflags_layout(void)
unsigned long or_mask, add_mask;
shift = 8 * sizeof(unsigned long);
- width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NID_SHIFT;
+ width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
- "Section %d Node %d Zone %d Lastnid %d Flags %d\n",
+ "Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
SECTIONS_WIDTH,
NODES_WIDTH,
ZONES_WIDTH,
- LAST_NID_WIDTH,
+ LAST_NIDPID_WIDTH,
NR_PAGEFLAGS);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
- "Section %d Node %d Zone %d Lastnid %d\n",
+ "Section %d Node %d Zone %d Lastnidpid %d\n",
SECTIONS_SHIFT,
NODES_SHIFT,
ZONES_SHIFT,
- LAST_NID_SHIFT);
+ LAST_NIDPID_SHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
- "Section %lu Node %lu Zone %lu Lastnid %lu\n",
+ "Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
(unsigned long)SECTIONS_PGSHIFT,
(unsigned long)NODES_PGSHIFT,
(unsigned long)ZONES_PGSHIFT,
- (unsigned long)LAST_NID_PGSHIFT);
+ (unsigned long)LAST_NIDPID_PGSHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
"Node/Zone ID: %lu -> %lu\n",
(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -102,9 +102,9 @@ void __init mminit_verify_pageflags_layout(void)
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
"Node not in page flags");
#endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
- "Last nid not in page flags");
+ "Last nidpid not in page flags");
#endif
if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 2ac0afb..25bb477 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -97,20 +97,20 @@ void lruvec_init(struct lruvec *lruvec)
INIT_LIST_HEAD(&lruvec->lists[lru]);
}
-#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NID_NOT_IN_PAGE_FLAGS)
-int page_nid_xchg_last(struct page *page, int nid)
+#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NIDPID_NOT_IN_PAGE_FLAGS)
+int page_nidpid_xchg_last(struct page *page, int nidpid)
{
unsigned long old_flags, flags;
- int last_nid;
+ int last_nidpid;
do {
old_flags = flags = page->flags;
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
- flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
- flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+ flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+ flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
- return last_nid;
+ return last_nidpid;
}
#endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 41e0292..f0b087d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,15 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
- int dirty_accountable, int prot_numa, bool *ret_all_same_node)
+ int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte, oldpte;
spinlock_t *ptl;
unsigned long pages = 0;
- bool all_same_node = true;
+ bool all_same_nidpid = true;
int last_nid = -1;
+ int last_pid = -1;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -63,11 +64,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
page = vm_normal_page(vma, addr, oldpte);
if (page) {
- int this_nid = page_to_nid(page);
+ int nidpid = page_nidpid_last(page);
+ int this_nid = nidpid_to_nid(nidpid);
+ int this_pid = nidpid_to_pid(nidpid);
+
if (last_nid == -1)
last_nid = this_nid;
- if (last_nid != this_nid)
- all_same_node = false;
+ if (last_pid == -1)
+ last_pid = this_pid;
+ if (last_nid != this_nid ||
+ last_pid != this_pid) {
+ all_same_nidpid = false;
+ }
if (!pte_numa(oldpte)) {
ptent = pte_mknuma(ptent);
@@ -107,7 +115,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
- *ret_all_same_node = all_same_node;
+ *ret_all_same_nidpid = all_same_nidpid;
return pages;
}
@@ -134,7 +142,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pmd_t *pmd;
unsigned long next;
unsigned long pages = 0;
- bool all_same_node;
+ bool all_same_nidpid;
pmd = pmd_offset(pud, addr);
do {
@@ -158,7 +166,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (pmd_none_or_clear_bad(pmd))
continue;
pages += change_pte_range(vma, pmd, addr, next, newprot,
- dirty_accountable, prot_numa, &all_same_node);
+ dirty_accountable, prot_numa, &all_same_nidpid);
/*
* If we are changing protections for NUMA hinting faults then
@@ -166,7 +174,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
* node. This allows a regular PMD to be handled as one fault
* and effectively batches the taking of the PTL
*/
- if (prot_numa && all_same_node)
+ if (prot_numa && all_same_nidpid)
change_pmd_protnuma(vma->vm_mm, addr, pmd);
} while (pmd++, addr = next, addr != end);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ee638f..f6301d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -626,7 +626,7 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
- page_nid_reset_last(page);
+ page_nidpid_reset_last(page);
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
@@ -4015,7 +4015,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
- page_nid_reset_last(page);
+ page_nidpid_reset_last(page);
SetPageReserved(page);
/*
* Mark the block movable so that blocks are reserved for
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
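The mm/mprotect.c part of the patch above also tracks, per PMD, whether every page was last touched from the same node by the same pid, and only then hands the PMD to change_pmd_protnuma() so a later access is taken as one batched fault. A rough userspace model of that decision, with pages reduced to (nid, pid) pairs (the types and helper below are illustrative, not kernel code):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the per-page last_nidpid information. */
struct fault_record {
	int nid;
	int pid;
};

/*
 * Rough model of the all_same_nidpid flag computed by change_pte_range():
 * only when every page under the PMD was last touched from the same node
 * by the same pid is the whole PMD marked for one batched hinting fault.
 */
static bool all_same_nidpid(const struct fault_record *pages, size_t n)
{
	size_t i;

	for (i = 1; i < n; i++)
		if (pages[i].nid != pages[0].nid || pages[i].pid != pages[0].pid)
			return false;
	return n > 0;
}

int main(void)
{
	struct fault_record uniform[] = { {1, 146}, {1, 146}, {1, 146} };
	struct fault_record mixed[]   = { {1, 146}, {0, 77}, {1, 146} };

	printf("batch uniform PMD: %d\n", all_same_nidpid(uniform, 3));
	printf("batch mixed PMD:   %d\n", all_same_nidpid(mixed, 3));
	return 0;
}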
* Re: [PATCH 29/63] sched: Set preferred NUMA node based on number of private faults
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-07 18:45 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:45 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Ideally it would be possible to distinguish between NUMA hinting faults that
> are private to a task and those that are shared. If treated identically
> there is a risk that shared pages bounce between nodes depending on
> the order they are referenced by tasks. Ultimately what is desirable is
> that task private pages remain local to the task while shared pages are
> interleaved between sharing tasks running on different nodes to give good
> average performance. This is further complicated by THP as even
> applications that partition their data may not be partitioning on a huge
> page boundary.
>
> To start with, this patch assumes that multi-threaded or multi-process
> applications partition their data and that in general the private accesses
> are more important for cpu->memory locality. Also,
> no new infrastructure is required to treat private pages properly but
> interleaving for shared pages requires additional infrastructure.
>
> To detect private accesses the pid of the last accessing task is required
> but the storage requirements are high. This patch borrows heavily from
> Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
> to encode some bits from the last accessing task in the page flags as
> well as the node information. Collisions will occur but it is better than
> just depending on the node information. Node information is then used to
> determine if a page needs to migrate. The PID information is used to detect
> private/shared accesses. The preferred NUMA node is selected based on where
> the maximum number of approximately private faults were measured. Shared
> faults are not taken into consideration for a few reasons.
>
> First, if there are many tasks sharing the page then they'll all move
> towards the same node. The node will be compute overloaded and then
> scheduled away later only to bounce back again. Alternatively the shared
> tasks would just bounce around nodes because the fault information is
> effectively noise. Either way accounting for shared faults the same as
> private faults can result in lower performance overall.
>
> The second reason is based on a hypothetical workload that has a small
> number of very important, heavily accessed private pages but a large shared
> array. The shared array would dominate the number of faults and be selected
> as a preferred node even though it's the wrong decision.
>
> The third reason is that multiple threads in a process will race each
> other to fault the shared page making the fault information unreliable.
>
> [riel@redhat.com: Fix compilation error when !NUMA_BALANCING]
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 29/63] sched: Set preferred NUMA node based on number of private faults
@ 2013-10-07 18:45 ` Rik van Riel
0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:45 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Ideally it would be possible to distinguish between NUMA hinting faults that
> are private to a task and those that are shared. If treated identically
> there is a risk that shared pages bounce between nodes depending on
> the order they are referenced by tasks. Ultimately what is desirable is
> that task private pages remain local to the task while shared pages are
> interleaved between sharing tasks running on different nodes to give good
> average performance. This is further complicated by THP as even
> applications that partition their data may not be partitioning on a huge
> page boundary.
>
> To start with, this patch assumes that multi-threaded or multi-process
> applications partition their data and that in general the private accesses
> are more important for cpu->memory locality in the general case. Also,
> no new infrastructure is required to treat private pages properly but
> interleaving for shared pages requires additional infrastructure.
>
> To detect private accesses the pid of the last accessing task is required
> but the storage requirements are a high. This patch borrows heavily from
> Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
> to encode some bits from the last accessing task in the page flags as
> well as the node information. Collisions will occur but it is better than
> just depending on the node information. Node information is then used to
> determine if a page needs to migrate. The PID information is used to detect
> private/shared accesses. The preferred NUMA node is selected based on where
> the maximum number of approximately private faults were measured. Shared
> faults are not taken into consideration for a few reasons.
>
> First, if there are many tasks sharing the page then they'll all move
> towards the same node. The node will be compute overloaded and then
> scheduled away later only to bounce back again. Alternatively the shared
> tasks would just bounce around nodes because the fault information is
> effectively noise. Either way accounting for shared faults the same as
> private faults can result in lower performance overall.
>
> The second reason is based on a hypothetical workload that has a small
> number of very important, heavily accessed private pages but a large shared
> array. The shared array would dominate the number of faults and be selected
> as a preferred node even though it's the wrong decision.
>
> The third reason is that multiple threads in a process will race each
> other to fault the shared page making the fault information unreliable.
>
> [riel@redhat.com: Fix compilation error when !NUMA_BALANCING]
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Set preferred NUMA node based on number of private faults
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:28 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:28 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: b795854b1fa70f6aee923ae5df74ff7afeaddcaa
Gitweb: http://git.kernel.org/tip/b795854b1fa70f6aee923ae5df74ff7afeaddcaa
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:07 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:35 +0200
sched/numa: Set preferred NUMA node based on number of private faults
Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.
To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that private accesses are generally
more important for cpu->memory locality. Also,
no new infrastructure is required to treat private pages properly but
interleaving for shared pages requires additional infrastructure.
To detect private accesses the pid of the last accessing task is required
but the storage requirements are high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults were measured. Shared
faults are not taken into consideration for a few reasons.
First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and then
scheduled away later only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way accounting for shared faults the same as
private faults can result in lower performance overall.
The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.
The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.
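As a rough userspace sketch of the scheme described above (illustration only, not part of the patch; the ex_ prefixed names are hypothetical and the 8-bit pid field mirrors the LAST__PID_SHIFT/LAST__PID_MASK definitions below):
#include <stdio.h>

#define EX_PID_BITS     8
#define EX_PID_MASK     ((1 << EX_PID_BITS) - 1)

/* Pack the node id and the low bits of the accessing pid into one value */
static int ex_nidpid(int nid, int pid)
{
        return (nid << EX_PID_BITS) | (pid & EX_PID_MASK);
}

/* A later fault is "approximately private" if the low pid bits match */
static int ex_fault_is_private(int last_nidpid, int pid)
{
        if ((last_nidpid & EX_PID_MASK) == EX_PID_MASK)
                return 1;       /* unset entry: first touch is treated as private */
        return (last_nidpid & EX_PID_MASK) == (pid & EX_PID_MASK);
}

int main(void)
{
        int last = ex_nidpid(1, 1234);  /* task 1234 touched the page on node 1 */

        printf("task 1234 refaults: %s\n",
               ex_fault_is_private(last, 1234) ? "private" : "shared");
        printf("task 5678 refaults: %s\n",
               ex_fault_is_private(last, 5678) ? "private" : "shared");
        return 0;
}
Because only the low 8 bits of the pid are kept, unrelated tasks can collide, which is why the result is only "approximately" private.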
Signed-off-by: Mel Gorman <mgorman@suse.de>
[ Fix compilation error when !NUMA_BALANCING. ]
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
include/linux/mm.h | 89 +++++++++++++++++++++++++++++----------
include/linux/mm_types.h | 4 +-
include/linux/page-flags-layout.h | 28 +++++++-----
kernel/sched/fair.c | 12 ++++--
mm/huge_memory.c | 8 ++--
mm/memory.c | 16 +++----
mm/mempolicy.c | 8 ++--
mm/migrate.c | 4 +-
mm/mm_init.c | 18 ++++----
mm/mmzone.c | 14 +++---
mm/mprotect.c | 26 ++++++++----
mm/page_alloc.c | 4 +-
12 files changed, 149 insertions(+), 82 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..bb412ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -581,11 +581,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
* sets it, so none of the operations on it need to be atomic.
*/
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NID_PGOFF (ZONES_PGOFF - LAST_NID_WIDTH)
+#define LAST_NIDPID_PGOFF (ZONES_PGOFF - LAST_NIDPID_WIDTH)
/*
* Define the bit shifts to access each section. For non-existent
@@ -595,7 +595,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NID_PGSHIFT (LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
+#define LAST_NIDPID_PGSHIFT (LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
#ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -617,7 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NID_MASK ((1UL << LAST_NID_WIDTH) - 1)
+#define LAST_NIDPID_MASK ((1UL << LAST_NIDPID_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)
static inline enum zone_type page_zonenum(const struct page *page)
@@ -661,48 +661,93 @@ static inline int page_to_nid(const struct page *page)
#endif
#ifdef CONFIG_NUMA_BALANCING
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int nid_pid_to_nidpid(int nid, int pid)
{
- return xchg(&page->_last_nid, nid);
+ return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
}
-static inline int page_nid_last(struct page *page)
+static inline int nidpid_to_pid(int nidpid)
{
- return page->_last_nid;
+ return nidpid & LAST__PID_MASK;
}
-static inline void page_nid_reset_last(struct page *page)
+
+static inline int nidpid_to_nid(int nidpid)
+{
+ return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+ return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+}
+
+static inline bool nidpid_nid_unset(int nidpid)
{
- page->_last_nid = -1;
+ return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+}
+
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+static inline int page_nidpid_xchg_last(struct page *page, int nid)
+{
+ return xchg(&page->_last_nidpid, nid);
+}
+
+static inline int page_nidpid_last(struct page *page)
+{
+ return page->_last_nidpid;
+}
+static inline void page_nidpid_reset_last(struct page *page)
+{
+ page->_last_nidpid = -1;
}
#else
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
{
- return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+ return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
}
-extern int page_nid_xchg_last(struct page *page, int nid);
+extern int page_nidpid_xchg_last(struct page *page, int nidpid);
-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
{
- int nid = (1 << LAST_NID_SHIFT) - 1;
+ int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
- page->flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
- page->flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+ page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+ page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
}
-#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
#else
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
{
return page_to_nid(page);
}
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
{
return page_to_nid(page);
}
-static inline void page_nid_reset_last(struct page *page)
+static inline int nidpid_to_nid(int nidpid)
+{
+ return -1;
+}
+
+static inline int nidpid_to_pid(int nidpid)
+{
+ return -1;
+}
+
+static inline int nid_pid_to_nidpid(int nid, int pid)
+{
+ return -1;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+ return 1;
+}
+
+static inline void page_nidpid_reset_last(struct page *page)
{
}
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b7adf1d..38a902a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
void *shadow;
#endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
- int _last_nid;
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+ int _last_nidpid;
#endif
}
/*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 93506a1..02bc918 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -38,10 +38,10 @@
* The last is when there is insufficient space in page->flags and a separate
* lookup is necessary.
*
- * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
- * " plus space for last_nid: | NODE | ZONE | LAST_NID ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * " plus space for last_nid: | SECTION | NODE | ZONE | LAST_NID ... | FLAGS |
+ * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_nidpid: | NODE | ZONE | LAST_NIDPID ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
* classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
*/
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -62,15 +62,21 @@
#endif
#ifdef CONFIG_NUMA_BALANCING
-#define LAST_NID_SHIFT NODES_SHIFT
+#define LAST__PID_SHIFT 8
+#define LAST__PID_MASK ((1 << LAST__PID_SHIFT)-1)
+
+#define LAST__NID_SHIFT NODES_SHIFT
+#define LAST__NID_MASK ((1 << LAST__NID_SHIFT)-1)
+
+#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
#else
-#define LAST_NID_SHIFT 0
+#define LAST_NIDPID_SHIFT 0
#endif
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NID_WIDTH LAST_NID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
#else
-#define LAST_NID_WIDTH 0
+#define LAST_NIDPID_WIDTH 0
#endif
/*
@@ -81,8 +87,8 @@
#define NODE_NOT_IN_PAGE_FLAGS
#endif
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NID_WIDTH == 0
-#define LAST_NID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
+#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
#endif
#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 862d20d..b1de7c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,7 +988,7 @@ static void task_numa_placement(struct task_struct *p)
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int last_nid, int node, int pages, bool migrated)
+void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
{
struct task_struct *p = current;
int priv;
@@ -1000,8 +1000,14 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
if (!p->mm)
return;
- /* For now, do not attempt to detect private/shared accesses */
- priv = 1;
+ /*
+ * First accesses are treated as private, otherwise consider accesses
+ * to be private if the accessing pid has not changed
+ */
+ if (!nidpid_pid_unset(last_nidpid))
+ priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+ else
+ priv = 1;
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a28c2c..0baf0e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1282,7 +1282,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
- int target_nid, last_nid = -1;
+ int target_nid, last_nidpid = -1;
bool page_locked;
bool migrated = false;
@@ -1293,7 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = pmd_page(pmd);
BUG_ON(is_huge_zero_page(page));
page_nid = page_to_nid(page);
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1362,7 +1362,7 @@ out:
page_unlock_anon_vma_read(anon_vma);
if (page_nid != -1)
- task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
+ task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
return 0;
}
@@ -1682,7 +1682,7 @@ static void __split_huge_page_refcount(struct page *page,
page_tail->mapping = page->mapping;
page_tail->index = page->index + i;
- page_nid_xchg_last(page_tail, page_nid_last(page));
+ page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index 3e3b4b8..cc7f206 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@
#include "internal.h"
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nid.
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
#endif
#ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page = NULL;
spinlock_t *ptl;
int page_nid = -1;
- int last_nid;
+ int last_nidpid;
int target_nid;
bool migrated = false;
@@ -3567,7 +3567,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
BUG_ON(is_zero_pfn(page_to_pfn(page)));
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
@@ -3583,7 +3583,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
out:
if (page_nid != -1)
- task_numa_fault(last_nid, page_nid, 1, migrated);
+ task_numa_fault(last_nidpid, page_nid, 1, migrated);
return 0;
}
@@ -3598,7 +3598,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
- int last_nid;
+ int last_nidpid;
spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3643,7 +3643,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!page))
continue;
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(pte, ptl);
@@ -3656,7 +3656,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
if (page_nid != -1)
- task_numa_fault(last_nid, page_nid, 1, migrated);
+ task_numa_fault(last_nidpid, page_nid, 1, migrated);
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0472964..aff1f1e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2348,9 +2348,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
/* Migrate the page towards the node whose CPU is referencing it */
if (pol->flags & MPOL_F_MORON) {
- int last_nid;
+ int last_nidpid;
+ int this_nidpid;
polnid = numa_node_id();
+ this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
/*
* Multi-stage node selection is used in conjunction
@@ -2373,8 +2375,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
* it less likely we act on an unlikely task<->page
* relation.
*/
- last_nid = page_nid_xchg_last(page, polnid);
- if (last_nid != polnid)
+ last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
+ if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
goto out;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index fcba2f4..025d1e3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1498,7 +1498,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
__GFP_NOWARN) &
~GFP_IOFS, 0);
if (newpage)
- page_nid_xchg_last(newpage, page_nid_last(page));
+ page_nidpid_xchg_last(newpage, page_nidpid_last(page));
return newpage;
}
@@ -1675,7 +1675,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
if (!new_page)
goto out_fail;
- page_nid_xchg_last(new_page, page_nid_last(page));
+ page_nidpid_xchg_last(new_page, page_nidpid_last(page));
isolated = numamigrate_isolate_page(pgdat, page);
if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 633c088..467de57 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,26 +71,26 @@ void __init mminit_verify_pageflags_layout(void)
unsigned long or_mask, add_mask;
shift = 8 * sizeof(unsigned long);
- width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NID_SHIFT;
+ width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
- "Section %d Node %d Zone %d Lastnid %d Flags %d\n",
+ "Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
SECTIONS_WIDTH,
NODES_WIDTH,
ZONES_WIDTH,
- LAST_NID_WIDTH,
+ LAST_NIDPID_WIDTH,
NR_PAGEFLAGS);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
- "Section %d Node %d Zone %d Lastnid %d\n",
+ "Section %d Node %d Zone %d Lastnidpid %d\n",
SECTIONS_SHIFT,
NODES_SHIFT,
ZONES_SHIFT,
- LAST_NID_SHIFT);
+ LAST_NIDPID_SHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
- "Section %lu Node %lu Zone %lu Lastnid %lu\n",
+ "Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
(unsigned long)SECTIONS_PGSHIFT,
(unsigned long)NODES_PGSHIFT,
(unsigned long)ZONES_PGSHIFT,
- (unsigned long)LAST_NID_PGSHIFT);
+ (unsigned long)LAST_NIDPID_PGSHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
"Node/Zone ID: %lu -> %lu\n",
(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -102,9 +102,9 @@ void __init mminit_verify_pageflags_layout(void)
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
"Node not in page flags");
#endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
- "Last nid not in page flags");
+ "Last nidpid not in page flags");
#endif
if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 2ac0afb..25bb477 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -97,20 +97,20 @@ void lruvec_init(struct lruvec *lruvec)
INIT_LIST_HEAD(&lruvec->lists[lru]);
}
-#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NID_NOT_IN_PAGE_FLAGS)
-int page_nid_xchg_last(struct page *page, int nid)
+#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NIDPID_NOT_IN_PAGE_FLAGS)
+int page_nidpid_xchg_last(struct page *page, int nidpid)
{
unsigned long old_flags, flags;
- int last_nid;
+ int last_nidpid;
do {
old_flags = flags = page->flags;
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
- flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
- flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+ flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+ flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
- return last_nid;
+ return last_nidpid;
}
#endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 41e0292..f0b087d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,15 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
- int dirty_accountable, int prot_numa, bool *ret_all_same_node)
+ int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte, oldpte;
spinlock_t *ptl;
unsigned long pages = 0;
- bool all_same_node = true;
+ bool all_same_nidpid = true;
int last_nid = -1;
+ int last_pid = -1;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -63,11 +64,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
page = vm_normal_page(vma, addr, oldpte);
if (page) {
- int this_nid = page_to_nid(page);
+ int nidpid = page_nidpid_last(page);
+ int this_nid = nidpid_to_nid(nidpid);
+ int this_pid = nidpid_to_pid(nidpid);
+
if (last_nid == -1)
last_nid = this_nid;
- if (last_nid != this_nid)
- all_same_node = false;
+ if (last_pid == -1)
+ last_pid = this_pid;
+ if (last_nid != this_nid ||
+ last_pid != this_pid) {
+ all_same_nidpid = false;
+ }
if (!pte_numa(oldpte)) {
ptent = pte_mknuma(ptent);
@@ -107,7 +115,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
- *ret_all_same_node = all_same_node;
+ *ret_all_same_nidpid = all_same_nidpid;
return pages;
}
@@ -134,7 +142,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pmd_t *pmd;
unsigned long next;
unsigned long pages = 0;
- bool all_same_node;
+ bool all_same_nidpid;
pmd = pmd_offset(pud, addr);
do {
@@ -158,7 +166,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (pmd_none_or_clear_bad(pmd))
continue;
pages += change_pte_range(vma, pmd, addr, next, newprot,
- dirty_accountable, prot_numa, &all_same_node);
+ dirty_accountable, prot_numa, &all_same_nidpid);
/*
* If we are changing protections for NUMA hinting faults then
@@ -166,7 +174,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
* node. This allows a regular PMD to be handled as one fault
* and effectively batches the taking of the PTL
*/
- if (prot_numa && all_same_node)
+ if (prot_numa && all_same_nidpid)
change_pmd_protnuma(vma->vm_mm, addr, pmd);
} while (pmd++, addr = next, addr != end);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fa..89bedd0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -626,7 +626,7 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
- page_nid_reset_last(page);
+ page_nidpid_reset_last(page);
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
@@ -4015,7 +4015,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
- page_nid_reset_last(page);
+ page_nidpid_reset_last(page);
SetPageReserved(page);
/*
* Mark the block movable so that blocks are reserved for
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 30/63] sched: Do not migrate memory immediately after switching node
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
From: Rik van Riel <riel@redhat.com>
The load balancer can move tasks between nodes and does not take NUMA
locality into account. With automatic NUMA balancing this may result in the
task's working set being migrated to the new node. However, as the fault
buffer will still store faults from the old node, the scheduler may decide to
reset the preferred node and migrate the task back resulting in more
migrations.
The ideal would be that the scheduler did not migrate tasks with a heavy
memory footprint, but this may result in nodes being overloaded. We could
also discard the fault information on task migration but this would still
cause all of the task's working set to be migrated. This patch simply avoids
migrating the memory for a short time after a task is migrated.
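A minimal userspace sketch of the idea (illustration only, not the kernel code; the ex_ names are hypothetical stand-ins for numa_preferred_nid and numa_migrate_seq):
#include <stdio.h>
#include <stdbool.h>

struct ex_task {
        int preferred_nid;      /* node the fault statistics point at */
        int migrate_seq;        /* 0 right after the load balancer moved the task */
};

/* Decide whether a misplaced page should be migrated toward this_nid */
static bool ex_should_migrate_page(const struct ex_task *t, int this_nid)
{
        /*
         * If the scheduler just moved the task away from its preferred
         * node, hold off on memory migration so a short-lived placement
         * does not drag the whole working set with it.
         */
        if (this_nid != t->preferred_nid && t->migrate_seq == 0)
                return false;
        return true;
}

int main(void)
{
        struct ex_task just_moved = { .preferred_nid = 0, .migrate_seq = 0 };
        struct ex_task settled    = { .preferred_nid = 0, .migrate_seq = 2 };

        printf("just moved to node 1: %s\n",
               ex_should_migrate_page(&just_moved, 1) ? "migrate" : "defer");
        printf("settled on node 1:    %s\n",
               ex_should_migrate_page(&settled, 1) ? "migrate" : "defer");
        return 0;
}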
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 18 ++++++++++++++++--
mm/mempolicy.c | 12 ++++++++++++
3 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 60e640d..124bb40 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1626,7 +1626,7 @@ static void __sched_fork(struct task_struct *p)
p->node_stamp = 0ULL;
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
- p->numa_migrate_seq = 0;
+ p->numa_migrate_seq = 1;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f83da25..b7052ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -884,7 +884,7 @@ static unsigned int task_scan_max(struct task_struct *p)
* the preferred node but still allow the scheduler to move the task again if
* the nodes CPUs are overloaded.
*/
-unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
static inline int task_faults_idx(int nid, int priv)
{
@@ -980,7 +980,7 @@ static void task_numa_placement(struct task_struct *p)
/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
- p->numa_migrate_seq = 0;
+ p->numa_migrate_seq = 1;
migrate_task_to(p, preferred_cpu);
}
}
@@ -4120,6 +4120,20 @@ static void move_task(struct task_struct *p, struct lb_env *env)
set_task_cpu(p, env->dst_cpu);
activate_task(env->dst_rq, p, 0);
check_preempt_curr(env->dst_rq, p, 0);
+#ifdef CONFIG_NUMA_BALANCING
+ if (p->numa_preferred_nid != -1) {
+ int src_nid = cpu_to_node(env->src_cpu);
+ int dst_nid = cpu_to_node(env->dst_cpu);
+
+ /*
+ * If the load balancer has moved the task then limit
+ * migrations from taking place in the short term in
+ * case this is a short-lived migration.
+ */
+ if (src_nid != dst_nid && dst_nid != p->numa_preferred_nid)
+ p->numa_migrate_seq = 0;
+ }
+#endif
}
/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index aff1f1e..196d8da 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2378,6 +2378,18 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
goto out;
+
+#ifdef CONFIG_NUMA_BALANCING
+ /*
+ * If the scheduler has just moved us away from our
+ * preferred node, do not bother migrating pages yet.
+ * This way a short and temporary process migration will
+ * not cause excessive memory migration.
+ */
+ if (polnid != current->numa_preferred_nid &&
+ !current->numa_migrate_seq)
+ goto out;
+#endif
}
if (curnid != polnid)
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 30/63] sched: Do not migrate memory immediately after switching node
@ 2013-10-07 10:29 ` Mel Gorman
0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
From: Rik van Riel <riel@redhat.com>
The load balancer can move tasks between nodes and does not take NUMA
locality into account. With automatic NUMA balancing this may result in the
task's working set being migrated to the new node. However, as the fault
buffer will still store faults from the old node, the scheduler may decide to
reset the preferred node and migrate the task back resulting in more
migrations.
The ideal would be that the scheduler did not migrate tasks with a heavy
memory footprint, but this may result in nodes being overloaded. We could
also discard the fault information on task migration but this would still
cause all of the task's working set to be migrated. This patch simply avoids
migrating the memory for a short time after a task is migrated.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 18 ++++++++++++++++--
mm/mempolicy.c | 12 ++++++++++++
3 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 60e640d..124bb40 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1626,7 +1626,7 @@ static void __sched_fork(struct task_struct *p)
p->node_stamp = 0ULL;
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
- p->numa_migrate_seq = 0;
+ p->numa_migrate_seq = 1;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f83da25..b7052ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -884,7 +884,7 @@ static unsigned int task_scan_max(struct task_struct *p)
* the preferred node but still allow the scheduler to move the task again if
* the nodes CPUs are overloaded.
*/
-unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
static inline int task_faults_idx(int nid, int priv)
{
@@ -980,7 +980,7 @@ static void task_numa_placement(struct task_struct *p)
/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
- p->numa_migrate_seq = 0;
+ p->numa_migrate_seq = 1;
migrate_task_to(p, preferred_cpu);
}
}
@@ -4120,6 +4120,20 @@ static void move_task(struct task_struct *p, struct lb_env *env)
set_task_cpu(p, env->dst_cpu);
activate_task(env->dst_rq, p, 0);
check_preempt_curr(env->dst_rq, p, 0);
+#ifdef CONFIG_NUMA_BALANCING
+ if (p->numa_preferred_nid != -1) {
+ int src_nid = cpu_to_node(env->src_cpu);
+ int dst_nid = cpu_to_node(env->dst_cpu);
+
+ /*
+ * If the load balancer has moved the task then limit
+ * migrations from taking place in the short term in
+ * case this is a short-lived migration.
+ */
+ if (src_nid != dst_nid && dst_nid != p->numa_preferred_nid)
+ p->numa_migrate_seq = 0;
+ }
+#endif
}
/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index aff1f1e..196d8da 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2378,6 +2378,18 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
goto out;
+
+#ifdef CONFIG_NUMA_BALANCING
+ /*
+ * If the scheduler has just moved us away from our
+ * preferred node, do not bother migrating pages yet.
+ * This way a short and temporary process migration will
+ * not cause excessive memory migration.
+ */
+ if (polnid != current->numa_preferred_nid &&
+ !current->numa_migrate_seq)
+ goto out;
+#endif
}
if (curnid != polnid)
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Do not migrate memory immediately after switching node
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:28 ` tip-bot for Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:28 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 6fe6b2d6dabf392aceb3ad3a5e859b46a04465c6
Gitweb: http://git.kernel.org/tip/6fe6b2d6dabf392aceb3ad3a5e859b46a04465c6
Author: Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:08 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:36 +0200
sched/numa: Do not migrate memory immediately after switching node
The load balancer can move tasks between nodes and does not take NUMA
locality into account. With automatic NUMA balancing this may result in the
task's working set being migrated to the new node. However, as the fault
buffer will still store faults from the old node, the scheduler may decide to
reset the preferred node and migrate the task back resulting in more
migrations.
The ideal would be that the scheduler did not migrate tasks with a heavy
memory footprint, but this may result in nodes being overloaded. We could
also discard the fault information on task migration but this would still
cause all of the task's working set to be migrated. This patch simply avoids
migrating the memory for a short time after a task is migrated.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-31-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 18 ++++++++++++++++--
mm/mempolicy.c | 12 ++++++++++++
3 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66b878e..9060a7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1631,7 +1631,7 @@ static void __sched_fork(struct task_struct *p)
p->node_stamp = 0ULL;
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
- p->numa_migrate_seq = 0;
+ p->numa_migrate_seq = 1;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1de7c5..61ec0d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -884,7 +884,7 @@ static unsigned int task_scan_max(struct task_struct *p)
* the preferred node but still allow the scheduler to move the task again if
* the nodes CPUs are overloaded.
*/
-unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
static inline int task_faults_idx(int nid, int priv)
{
@@ -980,7 +980,7 @@ static void task_numa_placement(struct task_struct *p)
/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
- p->numa_migrate_seq = 0;
+ p->numa_migrate_seq = 1;
migrate_task_to(p, preferred_cpu);
}
}
@@ -4121,6 +4121,20 @@ static void move_task(struct task_struct *p, struct lb_env *env)
set_task_cpu(p, env->dst_cpu);
activate_task(env->dst_rq, p, 0);
check_preempt_curr(env->dst_rq, p, 0);
+#ifdef CONFIG_NUMA_BALANCING
+ if (p->numa_preferred_nid != -1) {
+ int src_nid = cpu_to_node(env->src_cpu);
+ int dst_nid = cpu_to_node(env->dst_cpu);
+
+ /*
+ * If the load balancer has moved the task then limit
+ * migrations from taking place in the short term in
+ * case this is a short-lived migration.
+ */
+ if (src_nid != dst_nid && dst_nid != p->numa_preferred_nid)
+ p->numa_migrate_seq = 0;
+ }
+#endif
}
/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index aff1f1e..196d8da 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2378,6 +2378,18 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
goto out;
+
+#ifdef CONFIG_NUMA_BALANCING
+ /*
+ * If the scheduler has just moved us away from our
+ * preferred node, do not bother migrating pages yet.
+ * This way a short and temporary process migration will
+ * not cause excessive memory migration.
+ */
+ if (polnid != current->numa_preferred_nid &&
+ !current->numa_migrate_seq)
+ goto out;
+#endif
}
if (curnid != polnid)
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 31/63] mm: numa: only unmap migrate-on-fault VMAs
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
There is a 90% regression observed with a large Oracle performance test
on a 4 node system. Profiles indicated that the overhead was due to
contention on sp_lock when looking up shared memory policies. These
policies do not have the appropriate flags to allow them to be
automatically balanced so trapping faults on them is pointless. This
patch skips VMAs that do not have MPOL_F_MOF set.
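Roughly, the effect on the scanner is the following (illustration only; the ex_ names and the flag value are hypothetical stand-ins for the vma_migratable() and vma_policy_mof() checks in task_numa_work):
#include <stdio.h>
#include <stdbool.h>

#define EX_MPOL_F_MOF   (1 << 3)        /* "migrate on fault" policy flag, value illustrative */

struct ex_vma {
        const char *name;
        unsigned int pol_flags;         /* flags of the governing memory policy */
        bool migratable;
};

int main(void)
{
        struct ex_vma vmas[] = {
                { "anon heap",      EX_MPOL_F_MOF, true  },
                { "shared SGA",     0,             true  },     /* no MPOL_F_MOF: skip */
                { "device mapping", EX_MPOL_F_MOF, false },     /* not migratable: skip */
        };

        for (unsigned int i = 0; i < sizeof(vmas) / sizeof(vmas[0]); i++) {
                /* Only trap hinting faults where they can actually lead to migration */
                if (!vmas[i].migratable || !(vmas[i].pol_flags & EX_MPOL_F_MOF)) {
                        printf("%-14s: skipped\n", vmas[i].name);
                        continue;
                }
                printf("%-14s: marked for NUMA hinting faults\n", vmas[i].name);
        }
        return 0;
}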
[riel@redhat.com: Initial patch]
Reviewed-by: Rik van Riel <riel@redhat.com>
Reported-and-tested-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mempolicy.h | 1 +
kernel/sched/fair.c | 2 +-
mm/mempolicy.c | 24 ++++++++++++++++++++++++
3 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index da6716b..ea4d249 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -136,6 +136,7 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
struct mempolicy *get_vma_policy(struct task_struct *tsk,
struct vm_area_struct *vma, unsigned long addr);
+bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma);
extern void numa_default_policy(void);
extern void numa_policy_init(void);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7052ed..1789e3c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1130,7 +1130,7 @@ void task_numa_work(struct callback_head *work)
vma = mm->mmap;
}
for (; vma; vma = vma->vm_next) {
- if (!vma_migratable(vma))
+ if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
continue;
do {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 196d8da..0e895a2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1679,6 +1679,30 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
return pol;
}
+bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma)
+{
+ struct mempolicy *pol = get_task_policy(task);
+ if (vma) {
+ if (vma->vm_ops && vma->vm_ops->get_policy) {
+ bool ret = false;
+
+ pol = vma->vm_ops->get_policy(vma, vma->vm_start);
+ if (pol && (pol->flags & MPOL_F_MOF))
+ ret = true;
+ mpol_cond_put(pol);
+
+ return ret;
+ } else if (vma->vm_policy) {
+ pol = vma->vm_policy;
+ }
+ }
+
+ if (!pol)
+ return default_policy.flags & MPOL_F_MOF;
+
+ return pol->flags & MPOL_F_MOF;
+}
+
static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
{
enum zone_type dynamic_policy_zone = policy_zone;
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 31/63] mm: numa: only unmap migrate-on-fault VMAs
@ 2013-10-07 10:29 ` Mel Gorman
0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
There is a 90% regression observed with a large Oracle performance test
on a 4 node system. Profiles indicated that the overhead was due to
contention on sp_lock when looking up shared memory policies. These
policies do not have the appropriate flags to allow them to be
automatically balanced so trapping faults on them is pointless. This
patch skips VMAs that do not have MPOL_F_MOF set.
[riel@redhat.com: Initial patch]
Reviewed-by: Rik van Riel <riel@redhat.com>
Reported-and-tested-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mempolicy.h | 1 +
kernel/sched/fair.c | 2 +-
mm/mempolicy.c | 24 ++++++++++++++++++++++++
3 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index da6716b..ea4d249 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -136,6 +136,7 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
struct mempolicy *get_vma_policy(struct task_struct *tsk,
struct vm_area_struct *vma, unsigned long addr);
+bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma);
extern void numa_default_policy(void);
extern void numa_policy_init(void);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7052ed..1789e3c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1130,7 +1130,7 @@ void task_numa_work(struct callback_head *work)
vma = mm->mmap;
}
for (; vma; vma = vma->vm_next) {
- if (!vma_migratable(vma))
+ if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
continue;
do {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 196d8da..0e895a2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1679,6 +1679,30 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
return pol;
}
+bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma)
+{
+ struct mempolicy *pol = get_task_policy(task);
+ if (vma) {
+ if (vma->vm_ops && vma->vm_ops->get_policy) {
+ bool ret = false;
+
+ pol = vma->vm_ops->get_policy(vma, vma->vm_start);
+ if (pol && (pol->flags & MPOL_F_MOF))
+ ret = true;
+ mpol_cond_put(pol);
+
+ return ret;
+ } else if (vma->vm_policy) {
+ pol = vma->vm_policy;
+ }
+ }
+
+ if (!pol)
+ return default_policy.flags & MPOL_F_MOF;
+
+ return pol->flags & MPOL_F_MOF;
+}
+
static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
{
enum zone_type dynamic_policy_zone = policy_zone;
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: numa: Limit NUMA scanning to migrate-on-fault VMAs
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:29 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:29 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, jmario, riel, srikar,
aarcange, mgorman, tglx
Commit-ID: fc3147245d193bd0f57307859c698fa28a20b0fe
Gitweb: http://git.kernel.org/tip/fc3147245d193bd0f57307859c698fa28a20b0fe
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:09 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:38 +0200
mm: numa: Limit NUMA scanning to migrate-on-fault VMAs
There is a 90% regression observed with a large Oracle performance test
on a 4 node system. Profiles indicated that the overhead was due to
contention on sp_lock when looking up shared memory policies. These
policies do not have the appropriate flags to allow them to be
automatically balanced so trapping faults on them is pointless. This
patch skips VMAs that do not have MPOL_F_MOF set.
[riel@redhat.com: Initial patch]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-and-tested-by: Joe Mario <jmario@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-32-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
include/linux/mempolicy.h | 1 +
kernel/sched/fair.c | 2 +-
mm/mempolicy.c | 24 ++++++++++++++++++++++++
3 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index da6716b..ea4d249 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -136,6 +136,7 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
struct mempolicy *get_vma_policy(struct task_struct *tsk,
struct vm_area_struct *vma, unsigned long addr);
+bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma);
extern void numa_default_policy(void);
extern void numa_policy_init(void);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61ec0d4..d98175d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1130,7 +1130,7 @@ void task_numa_work(struct callback_head *work)
vma = mm->mmap;
}
for (; vma; vma = vma->vm_next) {
- if (!vma_migratable(vma))
+ if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
continue;
do {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 196d8da..0e895a2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1679,6 +1679,30 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
return pol;
}
+bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma)
+{
+ struct mempolicy *pol = get_task_policy(task);
+ if (vma) {
+ if (vma->vm_ops && vma->vm_ops->get_policy) {
+ bool ret = false;
+
+ pol = vma->vm_ops->get_policy(vma, vma->vm_start);
+ if (pol && (pol->flags & MPOL_F_MOF))
+ ret = true;
+ mpol_cond_put(pol);
+
+ return ret;
+ } else if (vma->vm_policy) {
+ pol = vma->vm_policy;
+ }
+ }
+
+ if (!pol)
+ return default_policy.flags & MPOL_F_MOF;
+
+ return pol->flags & MPOL_F_MOF;
+}
+
static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
{
enum zone_type dynamic_policy_zone = policy_zone;
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 32/63] sched: Avoid overloading CPUs on a preferred NUMA node
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
find_idlest_cpu_node has two critical limitations. It does not take the
scheduling class into account when calculating the load, and it is unsuitable
for use when comparing loads between NUMA nodes.
task_numa_find_cpu uses similar load calculations to wake_affine() when
selecting the least loaded CPU within a scheduling domain common to the
source and destination nodes. It avoids causing CPU load imbalances in
the machine by refusing to migrate if the relative load on the target
CPU is higher than the source CPU.
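A back-of-the-envelope version of that comparison (illustration only; the numbers and ex_ helpers are made up and the cgroup adjustment from effective_load() is ignored):
#include <stdio.h>

/* Rough model of the balanced-destination test in task_numa_migrate() */
static long ex_src_eff_load(long load, long weight, long power, int imb_pct)
{
        /* The source side gets half of the domain's imbalance allowance */
        return (100 + (imb_pct - 100) / 2) * power * (load - weight);
}

static long ex_dst_eff_load(long load, long weight, long power)
{
        return 100 * power * (load + weight);
}

int main(void)
{
        long weight = 1024;             /* nice-0 task weight */
        long power = 1024;              /* one plain CPU's capacity */
        long src_load = 3072, dst_load = 1024;
        int imb_pct = 125;              /* typical sched_domain imbalance_pct */

        long src = ex_src_eff_load(src_load, weight, power, imb_pct);
        long dst = ex_dst_eff_load(dst_load, weight, power);

        printf("src eff_load = %ld, dst eff_load = %ld -> %s\n",
               src, dst, dst <= src ? "balanced, candidate CPU" : "would imbalance, skip");
        return 0;
}
With these numbers the destination stays below the source's weighted load, so the CPU remains a candidate and the least loaded such CPU is chosen.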
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/fair.c | 131 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 102 insertions(+), 29 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1789e3c..fd6e9e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,28 +901,114 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
}
static unsigned long weighted_cpuload(const int cpu);
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
+struct numa_stats {
+ unsigned long load;
+ s64 eff_load;
+ unsigned long faults;
+};
-static int
-find_idlest_cpu_node(int this_cpu, int nid)
-{
- unsigned long load, min_load = ULONG_MAX;
- int i, idlest_cpu = this_cpu;
+struct task_numa_env {
+ struct task_struct *p;
- BUG_ON(cpu_to_node(this_cpu) == nid);
+ int src_cpu, src_nid;
+ int dst_cpu, dst_nid;
- rcu_read_lock();
- for_each_cpu(i, cpumask_of_node(nid)) {
- load = weighted_cpuload(i);
+ struct numa_stats src_stats, dst_stats;
- if (load < min_load) {
- min_load = load;
- idlest_cpu = i;
+ unsigned long best_load;
+ int best_cpu;
+};
+
+static int task_numa_migrate(struct task_struct *p)
+{
+ int node_cpu = cpumask_first(cpumask_of_node(p->numa_preferred_nid));
+ struct task_numa_env env = {
+ .p = p,
+ .src_cpu = task_cpu(p),
+ .src_nid = cpu_to_node(task_cpu(p)),
+ .dst_cpu = node_cpu,
+ .dst_nid = p->numa_preferred_nid,
+ .best_load = ULONG_MAX,
+ .best_cpu = task_cpu(p),
+ };
+ struct sched_domain *sd;
+ int cpu;
+ struct task_group *tg = task_group(p);
+ unsigned long weight;
+ bool balanced;
+ int imbalance_pct, idx = -1;
+
+ /*
+ * Find the lowest common scheduling domain covering the nodes of both
+ * the CPU the task is currently running on and the target NUMA node.
+ */
+ rcu_read_lock();
+ for_each_domain(env.src_cpu, sd) {
+ if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
+ /*
+ * busy_idx is used for the load decision as it is the
+ * same index used by the regular load balancer for an
+ * active cpu.
+ */
+ idx = sd->busy_idx;
+ imbalance_pct = sd->imbalance_pct;
+ break;
}
}
rcu_read_unlock();
- return idlest_cpu;
+ if (WARN_ON_ONCE(idx == -1))
+ return 0;
+
+ /*
+ * XXX the below is mostly nicked from wake_affine(); we should
+ * see about sharing a bit if at all possible; also it might want
+ * some per entity weight love.
+ */
+ weight = p->se.load.weight;
+ env.src_stats.load = source_load(env.src_cpu, idx);
+ env.src_stats.eff_load = 100 + (imbalance_pct - 100) / 2;
+ env.src_stats.eff_load *= power_of(env.src_cpu);
+ env.src_stats.eff_load *= env.src_stats.load + effective_load(tg, env.src_cpu, -weight, -weight);
+
+ for_each_cpu(cpu, cpumask_of_node(env.dst_nid)) {
+ env.dst_cpu = cpu;
+ env.dst_stats.load = target_load(cpu, idx);
+
+ /* If the CPU is idle, use it */
+ if (!env.dst_stats.load) {
+ env.best_cpu = cpu;
+ goto migrate;
+ }
+
+ /* Otherwise check the target CPU load */
+ env.dst_stats.eff_load = 100;
+ env.dst_stats.eff_load *= power_of(cpu);
+ env.dst_stats.eff_load *= env.dst_stats.load + effective_load(tg, cpu, weight, weight);
+
+ /*
+ * Destination is considered balanced if the destination CPU is
+ * less loaded than the source CPU. Unfortunately there is a
+ * risk that a task running on a lightly loaded CPU will not
+ * migrate to its preferred node due to load imbalances.
+ */
+ balanced = (env.dst_stats.eff_load <= env.src_stats.eff_load);
+ if (!balanced)
+ continue;
+
+ if (env.dst_stats.eff_load < env.best_load) {
+ env.best_load = env.dst_stats.eff_load;
+ env.best_cpu = cpu;
+ }
+ }
+
+migrate:
+ return migrate_task_to(p, env.best_cpu);
}
static void task_numa_placement(struct task_struct *p)
@@ -966,22 +1052,10 @@ static void task_numa_placement(struct task_struct *p)
* the working set placement.
*/
if (max_faults && max_nid != p->numa_preferred_nid) {
- int preferred_cpu;
-
- /*
- * If the task is not on the preferred node then find the most
- * idle CPU to migrate to.
- */
- preferred_cpu = task_cpu(p);
- if (cpu_to_node(preferred_cpu) != max_nid) {
- preferred_cpu = find_idlest_cpu_node(preferred_cpu,
- max_nid);
- }
-
/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 1;
- migrate_task_to(p, preferred_cpu);
+ task_numa_migrate(p);
}
}
@@ -3292,7 +3366,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
{
struct sched_entity *se = tg->se[cpu];
- if (!tg->parent) /* the trivial, non-cgroup case */
+ if (!tg->parent || !wl) /* the trivial, non-cgroup case */
return wl;
for_each_sched_entity(se) {
@@ -3345,8 +3419,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
}
#else
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
- unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
{
return wl;
}
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 32/63] sched: Avoid overloading CPUs on a preferred NUMA node
@ 2013-10-07 10:29 ` Mel Gorman
0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
find_idlest_cpu_node has two critical limitations. It does not take the
scheduling class into account when calculating the load, and it is unsuitable
for use when comparing loads between NUMA nodes.
task_numa_find_cpu uses similar load calculations to wake_affine() when
selecting the least loaded CPU within a scheduling domain common to the
source and destination nodes. It avoids causing CPU load imbalances in
the machine by refusing to migrate if the relative load on the target
CPU is higher than the source CPU.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/fair.c | 131 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 102 insertions(+), 29 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1789e3c..fd6e9e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,28 +901,114 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
}
static unsigned long weighted_cpuload(const int cpu);
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
+struct numa_stats {
+ unsigned long load;
+ s64 eff_load;
+ unsigned long faults;
+};
-static int
-find_idlest_cpu_node(int this_cpu, int nid)
-{
- unsigned long load, min_load = ULONG_MAX;
- int i, idlest_cpu = this_cpu;
+struct task_numa_env {
+ struct task_struct *p;
- BUG_ON(cpu_to_node(this_cpu) == nid);
+ int src_cpu, src_nid;
+ int dst_cpu, dst_nid;
- rcu_read_lock();
- for_each_cpu(i, cpumask_of_node(nid)) {
- load = weighted_cpuload(i);
+ struct numa_stats src_stats, dst_stats;
- if (load < min_load) {
- min_load = load;
- idlest_cpu = i;
+ unsigned long best_load;
+ int best_cpu;
+};
+
+static int task_numa_migrate(struct task_struct *p)
+{
+ int node_cpu = cpumask_first(cpumask_of_node(p->numa_preferred_nid));
+ struct task_numa_env env = {
+ .p = p,
+ .src_cpu = task_cpu(p),
+ .src_nid = cpu_to_node(task_cpu(p)),
+ .dst_cpu = node_cpu,
+ .dst_nid = p->numa_preferred_nid,
+ .best_load = ULONG_MAX,
+ .best_cpu = task_cpu(p),
+ };
+ struct sched_domain *sd;
+ int cpu;
+ struct task_group *tg = task_group(p);
+ unsigned long weight;
+ bool balanced;
+ int imbalance_pct, idx = -1;
+
+ /*
+ * Find the lowest common scheduling domain covering the nodes of both
+ * the CPU the task is currently running on and the target NUMA node.
+ */
+ rcu_read_lock();
+ for_each_domain(env.src_cpu, sd) {
+ if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
+ /*
+ * busy_idx is used for the load decision as it is the
+ * same index used by the regular load balancer for an
+ * active cpu.
+ */
+ idx = sd->busy_idx;
+ imbalance_pct = sd->imbalance_pct;
+ break;
}
}
rcu_read_unlock();
- return idlest_cpu;
+ if (WARN_ON_ONCE(idx == -1))
+ return 0;
+
+ /*
+ * XXX the below is mostly nicked from wake_affine(); we should
+ * see about sharing a bit if at all possible; also it might want
+ * some per entity weight love.
+ */
+ weight = p->se.load.weight;
+ env.src_stats.load = source_load(env.src_cpu, idx);
+ env.src_stats.eff_load = 100 + (imbalance_pct - 100) / 2;
+ env.src_stats.eff_load *= power_of(env.src_cpu);
+ env.src_stats.eff_load *= env.src_stats.load + effective_load(tg, env.src_cpu, -weight, -weight);
+
+ for_each_cpu(cpu, cpumask_of_node(env.dst_nid)) {
+ env.dst_cpu = cpu;
+ env.dst_stats.load = target_load(cpu, idx);
+
+ /* If the CPU is idle, use it */
+ if (!env.dst_stats.load) {
+ env.best_cpu = cpu;
+ goto migrate;
+ }
+
+ /* Otherwise check the target CPU load */
+ env.dst_stats.eff_load = 100;
+ env.dst_stats.eff_load *= power_of(cpu);
+ env.dst_stats.eff_load *= env.dst_stats.load + effective_load(tg, cpu, weight, weight);
+
+ /*
+ * Destination is considered balanced if the destination CPU is
+ * less loaded than the source CPU. Unfortunately there is a
+ * risk that a task running on a lightly loaded CPU will not
+ * migrate to its preferred node due to load imbalances.
+ */
+ balanced = (env.dst_stats.eff_load <= env.src_stats.eff_load);
+ if (!balanced)
+ continue;
+
+ if (env.dst_stats.eff_load < env.best_load) {
+ env.best_load = env.dst_stats.eff_load;
+ env.best_cpu = cpu;
+ }
+ }
+
+migrate:
+ return migrate_task_to(p, env.best_cpu);
}
static void task_numa_placement(struct task_struct *p)
@@ -966,22 +1052,10 @@ static void task_numa_placement(struct task_struct *p)
* the working set placement.
*/
if (max_faults && max_nid != p->numa_preferred_nid) {
- int preferred_cpu;
-
- /*
- * If the task is not on the preferred node then find the most
- * idle CPU to migrate to.
- */
- preferred_cpu = task_cpu(p);
- if (cpu_to_node(preferred_cpu) != max_nid) {
- preferred_cpu = find_idlest_cpu_node(preferred_cpu,
- max_nid);
- }
-
/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 1;
- migrate_task_to(p, preferred_cpu);
+ task_numa_migrate(p);
}
}
@@ -3292,7 +3366,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
{
struct sched_entity *se = tg->se[cpu];
- if (!tg->parent) /* the trivial, non-cgroup case */
+ if (!tg->parent || !wl) /* the trivial, non-cgroup case */
return wl;
for_each_sched_entity(se) {
@@ -3345,8 +3419,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
}
#else
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
- unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
{
return wl;
}
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 32/63] sched: Avoid overloading CPUs on a preferred NUMA node
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-07 18:58 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:58 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
> find_idlest_cpu_node has two critical limitations. It does not take the
> scheduling class into account when calculating the load and it is unsuitable
> for use when comparing loads between NUMA nodes.
>
> task_numa_find_cpu uses similar load calculations to wake_affine() when
> selecting the least loaded CPU within a scheduling domain common to the
> source and destination nodes. It avoids causing CPU load imbalances in
> the machine by refusing to migrate if the relative load on the target
> CPU is higher than the source CPU.
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 32/63] sched: Avoid overloading CPUs on a preferred NUMA node
@ 2013-10-07 18:58 ` Rik van Riel
0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:58 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
> find_idlest_cpu_node has two critical limitations. It does not take the
> scheduling class into account when calculating the load and it is unsuitable
> for use when comparing loads between NUMA nodes.
>
> task_numa_find_cpu uses similar load calculations to wake_affine() when
> selecting the least loaded CPU within a scheduling domain common to the
> source and destination nodes. It avoids causing CPU load imbalances in
> the machine by refusing to migrate if the relative load on the target
> CPU is higher than the source CPU.
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Avoid overloading CPUs on a preferred NUMA node
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:29 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:29 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 58d081b5082dd85e02ac9a1fb151d97395340a09
Gitweb: http://git.kernel.org/tip/58d081b5082dd85e02ac9a1fb151d97395340a09
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:10 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:39 +0200
sched/numa: Avoid overloading CPUs on a preferred NUMA node
This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
find_idlest_cpu_node has two critical limitations. It does not take the
scheduling class into account when calculating the load and it is unsuitable
for use when comparing loads between NUMA nodes.
task_numa_find_cpu uses similar load calculations to wake_affine() when
selecting the least loaded CPU within a scheduling domain common to the
source and destination nodes. It avoids causing CPU load imbalances in
the machine by refusing to migrate if the relative load on the target
CPU is higher than the source CPU.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-33-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 131 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 102 insertions(+), 29 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d98175d..51a7600 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,28 +901,114 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
}
static unsigned long weighted_cpuload(const int cpu);
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
+struct numa_stats {
+ unsigned long load;
+ s64 eff_load;
+ unsigned long faults;
+};
-static int
-find_idlest_cpu_node(int this_cpu, int nid)
-{
- unsigned long load, min_load = ULONG_MAX;
- int i, idlest_cpu = this_cpu;
+struct task_numa_env {
+ struct task_struct *p;
- BUG_ON(cpu_to_node(this_cpu) == nid);
+ int src_cpu, src_nid;
+ int dst_cpu, dst_nid;
- rcu_read_lock();
- for_each_cpu(i, cpumask_of_node(nid)) {
- load = weighted_cpuload(i);
+ struct numa_stats src_stats, dst_stats;
- if (load < min_load) {
- min_load = load;
- idlest_cpu = i;
+ unsigned long best_load;
+ int best_cpu;
+};
+
+static int task_numa_migrate(struct task_struct *p)
+{
+ int node_cpu = cpumask_first(cpumask_of_node(p->numa_preferred_nid));
+ struct task_numa_env env = {
+ .p = p,
+ .src_cpu = task_cpu(p),
+ .src_nid = cpu_to_node(task_cpu(p)),
+ .dst_cpu = node_cpu,
+ .dst_nid = p->numa_preferred_nid,
+ .best_load = ULONG_MAX,
+ .best_cpu = task_cpu(p),
+ };
+ struct sched_domain *sd;
+ int cpu;
+ struct task_group *tg = task_group(p);
+ unsigned long weight;
+ bool balanced;
+ int imbalance_pct, idx = -1;
+
+ /*
+ * Find the lowest common scheduling domain covering the nodes of both
+ * the CPU the task is currently running on and the target NUMA node.
+ */
+ rcu_read_lock();
+ for_each_domain(env.src_cpu, sd) {
+ if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
+ /*
+ * busy_idx is used for the load decision as it is the
+ * same index used by the regular load balancer for an
+ * active cpu.
+ */
+ idx = sd->busy_idx;
+ imbalance_pct = sd->imbalance_pct;
+ break;
}
}
rcu_read_unlock();
- return idlest_cpu;
+ if (WARN_ON_ONCE(idx == -1))
+ return 0;
+
+ /*
+ * XXX the below is mostly nicked from wake_affine(); we should
+ * see about sharing a bit if at all possible; also it might want
+ * some per entity weight love.
+ */
+ weight = p->se.load.weight;
+ env.src_stats.load = source_load(env.src_cpu, idx);
+ env.src_stats.eff_load = 100 + (imbalance_pct - 100) / 2;
+ env.src_stats.eff_load *= power_of(env.src_cpu);
+ env.src_stats.eff_load *= env.src_stats.load + effective_load(tg, env.src_cpu, -weight, -weight);
+
+ for_each_cpu(cpu, cpumask_of_node(env.dst_nid)) {
+ env.dst_cpu = cpu;
+ env.dst_stats.load = target_load(cpu, idx);
+
+ /* If the CPU is idle, use it */
+ if (!env.dst_stats.load) {
+ env.best_cpu = cpu;
+ goto migrate;
+ }
+
+ /* Otherwise check the target CPU load */
+ env.dst_stats.eff_load = 100;
+ env.dst_stats.eff_load *= power_of(cpu);
+ env.dst_stats.eff_load *= env.dst_stats.load + effective_load(tg, cpu, weight, weight);
+
+ /*
+ * Destination is considered balanced if the destination CPU is
+ * less loaded than the source CPU. Unfortunately there is a
+ * risk that a task running on a lightly loaded CPU will not
+ * migrate to its preferred node due to load imbalances.
+ */
+ balanced = (env.dst_stats.eff_load <= env.src_stats.eff_load);
+ if (!balanced)
+ continue;
+
+ if (env.dst_stats.eff_load < env.best_load) {
+ env.best_load = env.dst_stats.eff_load;
+ env.best_cpu = cpu;
+ }
+ }
+
+migrate:
+ return migrate_task_to(p, env.best_cpu);
}
static void task_numa_placement(struct task_struct *p)
@@ -966,22 +1052,10 @@ static void task_numa_placement(struct task_struct *p)
* the working set placement.
*/
if (max_faults && max_nid != p->numa_preferred_nid) {
- int preferred_cpu;
-
- /*
- * If the task is not on the preferred node then find the most
- * idle CPU to migrate to.
- */
- preferred_cpu = task_cpu(p);
- if (cpu_to_node(preferred_cpu) != max_nid) {
- preferred_cpu = find_idlest_cpu_node(preferred_cpu,
- max_nid);
- }
-
/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 1;
- migrate_task_to(p, preferred_cpu);
+ task_numa_migrate(p);
}
}
@@ -3292,7 +3366,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
{
struct sched_entity *se = tg->se[cpu];
- if (!tg->parent) /* the trivial, non-cgroup case */
+ if (!tg->parent || !wl) /* the trivial, non-cgroup case */
return wl;
for_each_sched_entity(se) {
@@ -3345,8 +3419,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
}
#else
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
- unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
{
return wl;
}
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 33/63] sched: Retry migration of tasks to CPU on a preferred node
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
When a preferred node is selected for a task there is an attempt to migrate
the task to a CPU there. This may fail, in which case the task will only
migrate if the active load balancer takes action. This may never happen if
the conditions are not right. This patch will check at NUMA hinting fault
time if another attempt should be made to migrate the task. It will only
make an attempt once every five seconds.
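A minimal user-space sketch of this retry throttle, with wall-clock seconds standing in for jiffies; the struct and helper names are illustrative assumptions rather than the kernel's:

#include <stdio.h>
#include <time.h>

#define RETRY_INTERVAL_SEC 5

struct task {
        time_t migrate_retry;   /* 0 means no retry is pending */
        int preferred_nid;
        int current_nid;
};

/* Stand-in for task_numa_migrate(); pretend the migration failed */
static int try_migrate(struct task *t)
{
        (void)t;
        return -1;
}

static void numa_migrate_preferred(struct task *t)
{
        t->migrate_retry = 0;
        if (t->preferred_nid == -1 || t->current_nid == t->preferred_nid)
                return;

        /* Arm the retry timestamp only if the migration attempt failed */
        if (try_migrate(t) != 0)
                t->migrate_retry = time(NULL) + RETRY_INTERVAL_SEC;
}

static void on_hinting_fault(struct task *t)
{
        /* Retry at fault time once the five second backoff has expired */
        if (t->migrate_retry && time(NULL) > t->migrate_retry)
                numa_migrate_preferred(t);
}

int main(void)
{
        struct task t = { .preferred_nid = 1, .current_nid = 0 };

        numa_migrate_preferred(&t);
        printf("retry armed until t=%ld\n", (long)t.migrate_retry);
        on_hinting_fault(&t);   /* too early, nothing happens yet */
        return 0;
}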
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 24 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8a3aa9e..4dd0c94 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,6 +1331,7 @@ struct task_struct {
int numa_migrate_seq;
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
+ unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd6e9e1..559175b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1011,6 +1011,23 @@ migrate:
return migrate_task_to(p, env.best_cpu);
}
+/* Attempt to migrate a task to a CPU on the preferred node. */
+static void numa_migrate_preferred(struct task_struct *p)
+{
+ /* Success if task is already running on preferred CPU */
+ p->numa_migrate_retry = 0;
+ if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+ return;
+
+ /* This task has no NUMA fault statistics yet */
+ if (unlikely(p->numa_preferred_nid == -1))
+ return;
+
+ /* Otherwise, try migrate to a CPU on the preferred node */
+ if (task_numa_migrate(p) != 0)
+ p->numa_migrate_retry = jiffies + HZ*5;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -1045,17 +1062,12 @@ static void task_numa_placement(struct task_struct *p)
}
}
- /*
- * Record the preferred node as the node with the most faults,
- * requeue the task to be running on the idlest CPU on the
- * preferred node and reset the scanning rate to recheck
- * the working set placement.
- */
+ /* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 1;
- task_numa_migrate(p);
+ numa_migrate_preferred(p);
}
}
@@ -1111,6 +1123,10 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
task_numa_placement(p);
+ /* Retry task to preferred node migration if it previously failed */
+ if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+ numa_migrate_preferred(p);
+
p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
}
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 33/63] sched: Retry migration of tasks to CPU on a preferred node
@ 2013-10-07 10:29 ` Mel Gorman
0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
When a preferred node is selected for a task there is an attempt to migrate
the task to a CPU there. This may fail, in which case the task will only
migrate if the active load balancer takes action. This may never happen if
the conditions are not right. This patch will check at NUMA hinting fault
time if another attempt should be made to migrate the task. It will only
make an attempt once every five seconds.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 24 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8a3aa9e..4dd0c94 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,6 +1331,7 @@ struct task_struct {
int numa_migrate_seq;
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
+ unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd6e9e1..559175b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1011,6 +1011,23 @@ migrate:
return migrate_task_to(p, env.best_cpu);
}
+/* Attempt to migrate a task to a CPU on the preferred node. */
+static void numa_migrate_preferred(struct task_struct *p)
+{
+ /* Success if task is already running on preferred CPU */
+ p->numa_migrate_retry = 0;
+ if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+ return;
+
+ /* This task has no NUMA fault statistics yet */
+ if (unlikely(p->numa_preferred_nid == -1))
+ return;
+
+ /* Otherwise, try migrate to a CPU on the preferred node */
+ if (task_numa_migrate(p) != 0)
+ p->numa_migrate_retry = jiffies + HZ*5;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -1045,17 +1062,12 @@ static void task_numa_placement(struct task_struct *p)
}
}
- /*
- * Record the preferred node as the node with the most faults,
- * requeue the task to be running on the idlest CPU on the
- * preferred node and reset the scanning rate to recheck
- * the working set placement.
- */
+ /* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 1;
- task_numa_migrate(p);
+ numa_migrate_preferred(p);
}
}
@@ -1111,6 +1123,10 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
task_numa_placement(p);
+ /* Retry task to preferred node migration if it previously failed */
+ if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+ numa_migrate_preferred(p);
+
p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
}
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 33/63] sched: Retry migration of tasks to CPU on a preferred node
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-07 18:58 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:58 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> When a preferred node is selected for a task there is an attempt to migrate
> the task to a CPU there. This may fail, in which case the task will only
> migrate if the active load balancer takes action. This may never happen if
> the conditions are not right. This patch will check at NUMA hinting fault
> time if another attempt should be made to migrate the task. It will only
> make an attempt once every five seconds.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 33/63] sched: Retry migration of tasks to CPU on a preferred node
@ 2013-10-07 18:58 ` Rik van Riel
0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 18:58 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> When a preferred node is selected for a task there is an attempt to migrate
> the task to a CPU there. This may fail, in which case the task will only
> migrate if the active load balancer takes action. This may never happen if
> the conditions are not right. This patch will check at NUMA hinting fault
> time if another attempt should be made to migrate the task. It will only
> make an attempt once every five seconds.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Retry migration of tasks to CPU on a preferred node
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:29 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:29 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 6b9a7460b6baf6c77fc3d23d927ddfc3f3f05bf3
Gitweb: http://git.kernel.org/tip/6b9a7460b6baf6c77fc3d23d927ddfc3f3f05bf3
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:11 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:40 +0200
sched/numa: Retry migration of tasks to CPU on a preferred node
When a preferred node is selected for a task there is an attempt to migrate
the task to a CPU there. This may fail, in which case the task will only
migrate if the active load balancer takes action. This may never happen if
the conditions are not right. This patch will check at NUMA hinting fault
time if another attempt should be made to migrate the task. It will only
make an attempt once every five seconds.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-34-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 24 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d946195..14251a8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1341,6 +1341,7 @@ struct task_struct {
int numa_migrate_seq;
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
+ unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 51a7600..f84ac3f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1011,6 +1011,23 @@ migrate:
return migrate_task_to(p, env.best_cpu);
}
+/* Attempt to migrate a task to a CPU on the preferred node. */
+static void numa_migrate_preferred(struct task_struct *p)
+{
+ /* Success if task is already running on preferred CPU */
+ p->numa_migrate_retry = 0;
+ if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+ return;
+
+ /* This task has no NUMA fault statistics yet */
+ if (unlikely(p->numa_preferred_nid == -1))
+ return;
+
+ /* Otherwise, try migrate to a CPU on the preferred node */
+ if (task_numa_migrate(p) != 0)
+ p->numa_migrate_retry = jiffies + HZ*5;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -1045,17 +1062,12 @@ static void task_numa_placement(struct task_struct *p)
}
}
- /*
- * Record the preferred node as the node with the most faults,
- * requeue the task to be running on the idlest CPU on the
- * preferred node and reset the scanning rate to recheck
- * the working set placement.
- */
+ /* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 1;
- task_numa_migrate(p);
+ numa_migrate_preferred(p);
}
}
@@ -1111,6 +1123,10 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
task_numa_placement(p);
+ /* Retry task to preferred node migration if it previously failed */
+ if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+ numa_migrate_preferred(p);
+
p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
}
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 34/63] sched: numa: increment numa_migrate_seq when task runs in correct location
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
From: Rik van Riel <riel@redhat.com>
When a task is already running on its preferred node, increment
numa_migrate_seq to indicate that the task is settled if migration is
temporarily disabled, and memory should migrate towards it.
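A minimal user-space sketch of the settle logic; it assumes, as in this series, that numa_migrate_seq == 0 means memory migration towards the task is temporarily disabled:

#include <stdio.h>

struct task {
        int numa_migrate_seq;   /* 0: memory migration temporarily disabled */
        int preferred_nid;
        int current_nid;
};

static void numa_migrate_preferred(struct task *t)
{
        if (t->current_nid == t->preferred_nid) {
                /* Settled on the preferred node: re-enable memory migration */
                if (!t->numa_migrate_seq)
                        t->numa_migrate_seq++;
                return;
        }
        /* otherwise a task migration would be attempted here */
}

int main(void)
{
        struct task t = { .numa_migrate_seq = 0,
                          .preferred_nid = 0, .current_nid = 0 };

        numa_migrate_preferred(&t);
        printf("numa_migrate_seq=%d\n", t.numa_migrate_seq);
        return 0;
}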
[mgorman@suse.de: Only increment migrate_seq if migration temporarily disabled]
Signed-off-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 559175b..9a2e68e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1016,8 +1016,16 @@ static void numa_migrate_preferred(struct task_struct *p)
{
/* Success if task is already running on preferred CPU */
p->numa_migrate_retry = 0;
- if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+ if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid) {
+ /*
+ * If migration is temporarily disabled due to a task migration
+ * then re-enable it now as the task is running on its
+ * preferred node and memory should migrate locally
+ */
+ if (!p->numa_migrate_seq)
+ p->numa_migrate_seq++;
return;
+ }
/* This task has no NUMA fault statistics yet */
if (unlikely(p->numa_preferred_nid == -1))
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 34/63] sched: numa: increment numa_migrate_seq when task runs in correct location
@ 2013-10-07 10:29 ` Mel Gorman
0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
From: Rik van Riel <riel@redhat.com>
When a task is already running on its preferred node, increment
numa_migrate_seq to indicate that the task is settled if migration is
temporarily disabled, and memory should migrate towards it.
[mgorman@suse.de: Only increment migrate_seq if migration temporarily disabled]
Signed-off-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 559175b..9a2e68e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1016,8 +1016,16 @@ static void numa_migrate_preferred(struct task_struct *p)
{
/* Success if task is already running on preferred CPU */
p->numa_migrate_retry = 0;
- if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+ if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid) {
+ /*
+ * If migration is temporarily disabled due to a task migration
+ * then re-enable it now as the task is running on its
+ * preferred node and memory should migrate locally
+ */
+ if (!p->numa_migrate_seq)
+ p->numa_migrate_seq++;
return;
+ }
/* This task has no NUMA fault statistics yet */
if (unlikely(p->numa_preferred_nid == -1))
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Increment numa_migrate_seq when task runs in correct location
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:29 ` tip-bot for Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Rik van Riel @ 2013-10-09 17:29 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 06ea5e035b4e66cc77790457a89fc7e368060c4b
Gitweb: http://git.kernel.org/tip/06ea5e035b4e66cc77790457a89fc7e368060c4b
Author: Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 7 Oct 2013 11:29:12 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:41 +0200
sched/numa: Increment numa_migrate_seq when task runs in correct location
When a task is already running on its preferred node, increment
numa_migrate_seq to indicate that the task is settled if migration is
temporarily disabled, and memory should migrate towards it.
Signed-off-by: Rik van Riel <riel@redhat.com>
[ Only increment migrate_seq if migration temporarily disabled. ]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-35-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f84ac3f..de9b4d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1016,8 +1016,16 @@ static void numa_migrate_preferred(struct task_struct *p)
{
/* Success if task is already running on preferred CPU */
p->numa_migrate_retry = 0;
- if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+ if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid) {
+ /*
+ * If migration is temporarily disabled due to a task migration
+ * then re-enable it now as the task is running on its
+ * preferred node and memory should migrate locally
+ */
+ if (!p->numa_migrate_seq)
+ p->numa_migrate_seq++;
return;
+ }
/* This task has no NUMA fault statistics yet */
if (unlikely(p->numa_preferred_nid == -1))
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 35/63] sched: numa: Do not trap hinting faults for shared libraries
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
NUMA hinting faults will not migrate a shared executable page mapped by
multiple processes on the grounds that the data is probably in the CPU
cache already and the page may just bounce between tasks running on multiple
nodes. Even if the migration is avoided, there is still the overhead of
trapping the fault, updating the statistics, making scheduler placement
decisions based on the information etc. If we are never going to migrate
the page, it is overhead for no gain and worse a process may be placed on
a sub-optimal node for shared executable pages. This patch avoids trapping
faults for shared libraries entirely.
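A minimal user-space sketch of the skip test; the flag values and the trimmed-down vma structure are assumptions for illustration only:

#include <stdbool.h>
#include <stdio.h>

#define VM_READ  0x1UL
#define VM_WRITE 0x2UL

struct vma {
        bool has_mm;            /* stand-in for vma->vm_mm != NULL */
        bool file_backed;       /* stand-in for vma->vm_file != NULL */
        unsigned long vm_flags;
};

/* Return true if NUMA hinting faults should not be trapped on this VMA */
static bool skip_hinting_faults(const struct vma *vma)
{
        if (!vma->has_mm)       /* e.g. the vdso */
                return true;

        /* Read-only file mappings: shared libraries and similar */
        if (vma->file_backed &&
            (vma->vm_flags & (VM_READ | VM_WRITE)) == VM_READ)
                return true;

        return false;
}

int main(void)
{
        struct vma libc = { .has_mm = true, .file_backed = true,
                            .vm_flags = VM_READ };
        struct vma heap = { .has_mm = true, .file_backed = false,
                            .vm_flags = VM_READ | VM_WRITE };

        printf("libc skipped: %d, heap skipped: %d\n",
               skip_hinting_faults(&libc), skip_hinting_faults(&heap));
        return 0;
}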
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/fair.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a2e68e..8760231 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1231,6 +1231,16 @@ void task_numa_work(struct callback_head *work)
if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
continue;
+ /*
+ * Shared library pages mapped by multiple processes are not
+ * migrated as it is expected they are cache replicated. Avoid
+ * hinting faults in read-only file-backed mappings or the vdso
+ * as migrating the pages will be of marginal benefit.
+ */
+ if (!vma->vm_mm ||
+ (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+ continue;
+
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 35/63] sched: numa: Do not trap hinting faults for shared libraries
@ 2013-10-07 10:29 ` Mel Gorman
0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
NUMA hinting faults will not migrate a shared executable page mapped by
multiple processes on the grounds that the data is probably in the CPU
cache already and the page may just bounce between tasks running on multiple
nodes. Even if the migration is avoided, there is still the overhead of
trapping the fault, updating the statistics, making scheduler placement
decisions based on the information etc. If we are never going to migrate
the page, it is overhead for no gain and worse a process may be placed on
a sub-optimal node for shared executable pages. This patch avoids trapping
faults for shared libraries entirely.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/fair.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a2e68e..8760231 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1231,6 +1231,16 @@ void task_numa_work(struct callback_head *work)
if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
continue;
+ /*
+ * Shared library pages mapped by multiple processes are not
+ * migrated as it is expected they are cache replicated. Avoid
+ * hinting faults in read-only file-backed mappings or the vdso
+ * as migrating the pages will be of marginal benefit.
+ */
+ if (!vma->vm_mm ||
+ (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+ continue;
+
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 35/63] sched: numa: Do not trap hinting faults for shared libraries
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-07 19:04 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:04 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> NUMA hinting faults will not migrate a shared executable page mapped by
> multiple processes on the grounds that the data is probably in the CPU
> cache already and the page may just bounce between tasks running on multiple
> nodes. Even if the migration is avoided, there is still the overhead of
> trapping the fault, updating the statistics, making scheduler placement
> decisions based on the information etc. If we are never going to migrate
> the page, it is overhead for no gain and worse a process may be placed on
> a sub-optimal node for shared executable pages. This patch avoids trapping
> faults for shared libraries entirely.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 35/63] sched: numa: Do not trap hinting faults for shared libraries
@ 2013-10-07 19:04 ` Rik van Riel
0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:04 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> NUMA hinting faults will not migrate a shared executable page mapped by
> multiple processes on the grounds that the data is probably in the CPU
> cache already and the page may just bounce between tasks running on multiple
> nodes. Even if the migration is avoided, there is still the overhead of
> trapping the fault, updating the statistics, making scheduler placement
> decisions based on the information etc. If we are never going to migrate
> the page, it is overhead for no gain and worse a process may be placed on
> a sub-optimal node for shared executable pages. This patch avoids trapping
> faults for shared libraries entirely.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] sched/numa: Do not trap hinting faults for shared libraries
2013-10-07 10:29 ` Mel Gorman
(?)
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:29 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 4591ce4f2d22dc9de7a6719161ce409b5fd1caac
Gitweb: http://git.kernel.org/tip/4591ce4f2d22dc9de7a6719161ce409b5fd1caac
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:13 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:42 +0200
sched/numa: Do not trap hinting faults for shared libraries
NUMA hinting faults will not migrate a shared executable page mapped by
multiple processes on the grounds that the data is probably in the CPU
cache already and the page may just bounce between tasks running on multiple
nodes. Even if the migration is avoided, there is still the overhead of
trapping the fault, updating the statistics, making scheduler placement
decisions based on the information etc. If we are never going to migrate
the page, it is overhead for no gain and worse a process may be placed on
a sub-optimal node for shared executable pages. This patch avoids trapping
faults for shared libraries entirely.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-36-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de9b4d8..fbc0c84 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1231,6 +1231,16 @@ void task_numa_work(struct callback_head *work)
if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
continue;
+ /*
+ * Shared library pages mapped by multiple processes are not
+ * migrated as it is expected they are cache replicated. Avoid
+ * hinting faults in read-only file-backed mappings or the vdso
+ * as migrating the pages will be of marginal benefit.
+ */
+ if (!vma->vm_mm ||
+ (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+ continue;
+
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 36/63] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
Base page PMD faulting is meant to batch handle NUMA hinting faults from
PTEs. However, even if no PTE faults would ever be handled within a
range, the kernel still traps PMD hinting faults. This patch avoids the
overhead.
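A minimal user-space sketch of the gating; the page counts returned by the stand-in change_pte_range() are invented for the example:

#include <stdio.h>

/* Stand-in for change_pte_range(): number of PTEs updated in this range */
static unsigned long change_pte_range(int range, int prot_numa)
{
        /* Pretend odd-numbered ranges have nothing that needs updating */
        return (prot_numa && (range % 2) == 0) ? 16 : 0;
}

int main(void)
{
        int prot_numa = 1;
        unsigned long pages = 0;

        for (int range = 0; range < 4; range++) {
                unsigned long this_pages = change_pte_range(range, prot_numa);

                pages += this_pages;

                /* Mark the PMD for batched faulting only if PTEs changed here */
                if (prot_numa && this_pages)
                        printf("range %d: mark pmd_numa\n", range);
                else
                        printf("range %d: leave pmd alone\n", range);
        }

        printf("total pages updated: %lu\n", pages);
        return 0;
}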
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/mprotect.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f0b087d..5aae390 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -146,6 +146,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pmd = pmd_offset(pud, addr);
do {
+ unsigned long this_pages;
+
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
@@ -165,8 +167,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}
if (pmd_none_or_clear_bad(pmd))
continue;
- pages += change_pte_range(vma, pmd, addr, next, newprot,
+ this_pages = change_pte_range(vma, pmd, addr, next, newprot,
dirty_accountable, prot_numa, &all_same_nidpid);
+ pages += this_pages;
/*
* If we are changing protections for NUMA hinting faults then
@@ -174,7 +177,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
* node. This allows a regular PMD to be handled as one fault
* and effectively batches the taking of the PTL
*/
- if (prot_numa && all_same_nidpid)
+ if (prot_numa && this_pages && all_same_nidpid)
change_pmd_protnuma(vma->vm_mm, addr, pmd);
} while (pmd++, addr = next, addr != end);
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 36/63] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults
@ 2013-10-07 10:29 ` Mel Gorman
0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
Base page PMD faulting is meant to batch handle NUMA hinting faults from
PTEs. However, even if no PTE faults would ever be handled within a
range, the kernel still traps PMD hinting faults. This patch avoids the
overhead.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/mprotect.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f0b087d..5aae390 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -146,6 +146,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pmd = pmd_offset(pud, addr);
do {
+ unsigned long this_pages;
+
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
@@ -165,8 +167,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}
if (pmd_none_or_clear_bad(pmd))
continue;
- pages += change_pte_range(vma, pmd, addr, next, newprot,
+ this_pages = change_pte_range(vma, pmd, addr, next, newprot,
dirty_accountable, prot_numa, &all_same_nidpid);
+ pages += this_pages;
/*
* If we are changing protections for NUMA hinting faults then
@@ -174,7 +177,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
* node. This allows a regular PMD to be handled as one fault
* and effectively batches the taking of the PTL
*/
- if (prot_numa && all_same_nidpid)
+ if (prot_numa && this_pages && all_same_nidpid)
change_pmd_protnuma(vma->vm_mm, addr, pmd);
} while (pmd++, addr = next, addr != end);
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 36/63] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-07 19:06 ` Rik van Riel
-1 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:06 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Base page PMD faulting is meant to batch handle NUMA hinting faults from
> PTEs. However, even if no PTE faults would ever be handled within a
> range, the kernel still traps PMD hinting faults. This patch avoids the
> overhead.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* Re: [PATCH 36/63] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults
@ 2013-10-07 19:06 ` Rik van Riel
0 siblings, 0 replies; 340+ messages in thread
From: Rik van Riel @ 2013-10-07 19:06 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML
On 10/07/2013 06:29 AM, Mel Gorman wrote:
> Base page PMD faulting is meant to batch handle NUMA hinting faults from
> PTEs. However, even if no PTE faults would ever be handled within a
> range, the kernel still traps PMD hinting faults. This patch avoids the
> overhead.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] mm: numa: Trap pmd hinting faults only if we would otherwise trap PTE faults
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:29 ` tip-bot for Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Mel Gorman @ 2013-10-09 17:29 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 25cbbef1924299249756bc4030fcb2436c019813
Gitweb: http://git.kernel.org/tip/25cbbef1924299249756bc4030fcb2436c019813
Author: Mel Gorman <mgorman@suse.de>
AuthorDate: Mon, 7 Oct 2013 11:29:14 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:44 +0200
mm: numa: Trap pmd hinting faults only if we would otherwise trap PTE faults
Base page PMD faulting is meant to batch handle NUMA hinting faults from
PTEs. However, even if no PTE faults would ever be handled within a
range, the kernel still traps PMD hinting faults. This patch avoids the
overhead.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-37-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
mm/mprotect.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f0b087d..5aae390 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -146,6 +146,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pmd = pmd_offset(pud, addr);
do {
+ unsigned long this_pages;
+
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
@@ -165,8 +167,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}
if (pmd_none_or_clear_bad(pmd))
continue;
- pages += change_pte_range(vma, pmd, addr, next, newprot,
+ this_pages = change_pte_range(vma, pmd, addr, next, newprot,
dirty_accountable, prot_numa, &all_same_nidpid);
+ pages += this_pages;
/*
* If we are changing protections for NUMA hinting faults then
@@ -174,7 +177,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
* node. This allows a regular PMD to be handled as one fault
* and effectively batches the taking of the PTL
*/
- if (prot_numa && all_same_nidpid)
+ if (prot_numa && this_pages && all_same_nidpid)
change_pmd_protnuma(vma->vm_mm, addr, pmd);
} while (pmd++, addr = next, addr != end);
^ permalink raw reply [flat|nested] 340+ messages in thread
* [PATCH 37/63] stop_machine: Introduce stop_two_cpus()
2013-10-07 10:28 ` Mel Gorman
@ 2013-10-07 10:29 ` Mel Gorman
-1 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
Introduce stop_two_cpus() in order to allow controlled swapping of two
tasks. It repurposes the stop_machine() state machine but only stops
the two cpus, which we can do with on-stack structures, avoiding
machine-wide synchronization issues.
The ordering of CPUs is important to avoid deadlocks. If unordered, two
cpus calling stop_two_cpus() on each other simultaneously would attempt
to queue in the opposite order on each CPU, causing an AB-BA style deadlock.
By always having the lowest-numbered CPU do the queueing of works, we can
guarantee that works are always queued in the same order, and deadlocks
are avoided.
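A minimal user-space sketch of the ordering rule, with two pthread mutexes standing in for the per-CPU stopper queues; the real code queues cpu_stop_work items rather than taking locks, so this is illustrative only:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t cpu_queue[2] = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

static void queue_work_on_both(int cpu1, int cpu2)
{
        int first = cpu1 < cpu2 ? cpu1 : cpu2;
        int second = cpu1 < cpu2 ? cpu2 : cpu1;

        /* Always take the lower-numbered CPU's queue first */
        pthread_mutex_lock(&cpu_queue[first]);
        pthread_mutex_lock(&cpu_queue[second]);

        printf("queued work on cpu%d then cpu%d\n", first, second);

        pthread_mutex_unlock(&cpu_queue[second]);
        pthread_mutex_unlock(&cpu_queue[first]);
}

int main(void)
{
        /* Two callers with opposite arguments still acquire in the same order */
        queue_work_on_both(0, 1);
        queue_work_on_both(1, 0);
        return 0;
}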
[riel@redhat.com: Deadlock avoidance]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/stop_machine.h | 1 +
kernel/stop_machine.c | 272 +++++++++++++++++++++++++++----------------
2 files changed, 175 insertions(+), 98 deletions(-)
diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 3b5e910..d2abbdb 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -28,6 +28,7 @@ struct cpu_stop_work {
};
int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg);
void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
struct cpu_stop_work *work_buf);
int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..32a6c44 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -115,6 +115,166 @@ int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
return done.executed ? done.ret : -ENOENT;
}
+/* This controls the threads on each CPU. */
+enum multi_stop_state {
+ /* Dummy starting state for thread. */
+ MULTI_STOP_NONE,
+ /* Awaiting everyone to be scheduled. */
+ MULTI_STOP_PREPARE,
+ /* Disable interrupts. */
+ MULTI_STOP_DISABLE_IRQ,
+ /* Run the function */
+ MULTI_STOP_RUN,
+ /* Exit */
+ MULTI_STOP_EXIT,
+};
+
+struct multi_stop_data {
+ int (*fn)(void *);
+ void *data;
+ /* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
+ unsigned int num_threads;
+ const struct cpumask *active_cpus;
+
+ enum multi_stop_state state;
+ atomic_t thread_ack;
+};
+
+static void set_state(struct multi_stop_data *msdata,
+ enum multi_stop_state newstate)
+{
+ /* Reset ack counter. */
+ atomic_set(&msdata->thread_ack, msdata->num_threads);
+ smp_wmb();
+ msdata->state = newstate;
+}
+
+/* Last one to ack a state moves to the next state. */
+static void ack_state(struct multi_stop_data *msdata)
+{
+ if (atomic_dec_and_test(&msdata->thread_ack))
+ set_state(msdata, msdata->state + 1);
+}
+
+/* This is the cpu_stop function which stops the CPU. */
+static int multi_cpu_stop(void *data)
+{
+ struct multi_stop_data *msdata = data;
+ enum multi_stop_state curstate = MULTI_STOP_NONE;
+ int cpu = smp_processor_id(), err = 0;
+ unsigned long flags;
+ bool is_active;
+
+ /*
+ * When called from stop_machine_from_inactive_cpu(), irq might
+ * already be disabled. Save the state and restore it on exit.
+ */
+ local_save_flags(flags);
+
+ if (!msdata->active_cpus)
+ is_active = cpu == cpumask_first(cpu_online_mask);
+ else
+ is_active = cpumask_test_cpu(cpu, msdata->active_cpus);
+
+ /* Simple state machine */
+ do {
+ /* Chill out and ensure we re-read multi_stop_state. */
+ cpu_relax();
+ if (msdata->state != curstate) {
+ curstate = msdata->state;
+ switch (curstate) {
+ case MULTI_STOP_DISABLE_IRQ:
+ local_irq_disable();
+ hard_irq_disable();
+ break;
+ case MULTI_STOP_RUN:
+ if (is_active)
+ err = msdata->fn(msdata->data);
+ break;
+ default:
+ break;
+ }
+ ack_state(msdata);
+ }
+ } while (curstate != MULTI_STOP_EXIT);
+
+ local_irq_restore(flags);
+ return err;
+}
+
+struct irq_cpu_stop_queue_work_info {
+ int cpu1;
+ int cpu2;
+ struct cpu_stop_work *work1;
+ struct cpu_stop_work *work2;
+};
+
+/*
+ * This function is always run with irqs and preemption disabled.
+ * This guarantees that both work1 and work2 get queued, before
+ * our local migrate thread gets the chance to preempt us.
+ */
+static void irq_cpu_stop_queue_work(void *arg)
+{
+ struct irq_cpu_stop_queue_work_info *info = arg;
+ cpu_stop_queue_work(info->cpu1, info->work1);
+ cpu_stop_queue_work(info->cpu2, info->work2);
+}
+
+/**
+ * stop_two_cpus - stops two cpus
+ * @cpu1: the cpu to stop
+ * @cpu2: the other cpu to stop
+ * @fn: function to execute
+ * @arg: argument to @fn
+ *
+ * Stops both the current and specified CPU and runs @fn on one of them.
+ *
+ * returns when both are completed.
+ */
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg)
+{
+ int call_cpu;
+ struct cpu_stop_done done;
+ struct cpu_stop_work work1, work2;
+ struct irq_cpu_stop_queue_work_info call_args;
+ struct multi_stop_data msdata = {
+ .fn = fn,
+ .data = arg,
+ .num_threads = 2,
+ .active_cpus = cpumask_of(cpu1),
+ };
+
+ work1 = work2 = (struct cpu_stop_work){
+ .fn = multi_cpu_stop,
+ .arg = &msdata,
+ .done = &done
+ };
+
+ call_args = (struct irq_cpu_stop_queue_work_info){
+ .cpu1 = cpu1,
+ .cpu2 = cpu2,
+ .work1 = &work1,
+ .work2 = &work2,
+ };
+
+ cpu_stop_init_done(&done, 2);
+ set_state(&msdata, MULTI_STOP_PREPARE);
+
+ /*
+ * Queuing needs to be done by the lowest numbered CPU, to ensure
+ * that works are always queued in the same order on every CPU.
+ * This prevents deadlocks.
+ */
+ call_cpu = min(cpu1, cpu2);
+
+ smp_call_function_single(call_cpu, &irq_cpu_stop_queue_work,
+ &call_args, 0);
+
+ wait_for_completion(&done.completion);
+ return done.executed ? done.ret : -ENOENT;
+}
+
/**
* stop_one_cpu_nowait - stop a cpu but don't wait for completion
* @cpu: cpu to stop
@@ -359,98 +519,14 @@ early_initcall(cpu_stop_init);
#ifdef CONFIG_STOP_MACHINE
-/* This controls the threads on each CPU. */
-enum stopmachine_state {
- /* Dummy starting state for thread. */
- STOPMACHINE_NONE,
- /* Awaiting everyone to be scheduled. */
- STOPMACHINE_PREPARE,
- /* Disable interrupts. */
- STOPMACHINE_DISABLE_IRQ,
- /* Run the function */
- STOPMACHINE_RUN,
- /* Exit */
- STOPMACHINE_EXIT,
-};
-
-struct stop_machine_data {
- int (*fn)(void *);
- void *data;
- /* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
- unsigned int num_threads;
- const struct cpumask *active_cpus;
-
- enum stopmachine_state state;
- atomic_t thread_ack;
-};
-
-static void set_state(struct stop_machine_data *smdata,
- enum stopmachine_state newstate)
-{
- /* Reset ack counter. */
- atomic_set(&smdata->thread_ack, smdata->num_threads);
- smp_wmb();
- smdata->state = newstate;
-}
-
-/* Last one to ack a state moves to the next state. */
-static void ack_state(struct stop_machine_data *smdata)
-{
- if (atomic_dec_and_test(&smdata->thread_ack))
- set_state(smdata, smdata->state + 1);
-}
-
-/* This is the cpu_stop function which stops the CPU. */
-static int stop_machine_cpu_stop(void *data)
-{
- struct stop_machine_data *smdata = data;
- enum stopmachine_state curstate = STOPMACHINE_NONE;
- int cpu = smp_processor_id(), err = 0;
- unsigned long flags;
- bool is_active;
-
- /*
- * When called from stop_machine_from_inactive_cpu(), irq might
- * already be disabled. Save the state and restore it on exit.
- */
- local_save_flags(flags);
-
- if (!smdata->active_cpus)
- is_active = cpu == cpumask_first(cpu_online_mask);
- else
- is_active = cpumask_test_cpu(cpu, smdata->active_cpus);
-
- /* Simple state machine */
- do {
- /* Chill out and ensure we re-read stopmachine_state. */
- cpu_relax();
- if (smdata->state != curstate) {
- curstate = smdata->state;
- switch (curstate) {
- case STOPMACHINE_DISABLE_IRQ:
- local_irq_disable();
- hard_irq_disable();
- break;
- case STOPMACHINE_RUN:
- if (is_active)
- err = smdata->fn(smdata->data);
- break;
- default:
- break;
- }
- ack_state(smdata);
- }
- } while (curstate != STOPMACHINE_EXIT);
-
- local_irq_restore(flags);
- return err;
-}
-
int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
{
- struct stop_machine_data smdata = { .fn = fn, .data = data,
- .num_threads = num_online_cpus(),
- .active_cpus = cpus };
+ struct multi_stop_data msdata = {
+ .fn = fn,
+ .data = data,
+ .num_threads = num_online_cpus(),
+ .active_cpus = cpus,
+ };
if (!stop_machine_initialized) {
/*
@@ -461,7 +537,7 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
unsigned long flags;
int ret;
- WARN_ON_ONCE(smdata.num_threads != 1);
+ WARN_ON_ONCE(msdata.num_threads != 1);
local_irq_save(flags);
hard_irq_disable();
@@ -472,8 +548,8 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
}
/* Set the initial state and stop all online cpus. */
- set_state(&smdata, STOPMACHINE_PREPARE);
- return stop_cpus(cpu_online_mask, stop_machine_cpu_stop, &smdata);
+ set_state(&msdata, MULTI_STOP_PREPARE);
+ return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);
}
int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
@@ -513,25 +589,25 @@ EXPORT_SYMBOL_GPL(stop_machine);
int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
const struct cpumask *cpus)
{
- struct stop_machine_data smdata = { .fn = fn, .data = data,
+ struct multi_stop_data msdata = { .fn = fn, .data = data,
.active_cpus = cpus };
struct cpu_stop_done done;
int ret;
/* Local CPU must be inactive and CPU hotplug in progress. */
BUG_ON(cpu_active(raw_smp_processor_id()));
- smdata.num_threads = num_active_cpus() + 1; /* +1 for local */
+ msdata.num_threads = num_active_cpus() + 1; /* +1 for local */
/* No proper task established and can't sleep - busy wait for lock. */
while (!mutex_trylock(&stop_cpus_mutex))
cpu_relax();
/* Schedule work on other CPUs and execute directly for local CPU */
- set_state(&smdata, STOPMACHINE_PREPARE);
+ set_state(&msdata, MULTI_STOP_PREPARE);
cpu_stop_init_done(&done, num_active_cpus());
- queue_stop_cpus_work(cpu_active_mask, stop_machine_cpu_stop, &smdata,
+ queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
&done);
- ret = stop_machine_cpu_stop(&smdata);
+ ret = multi_cpu_stop(&msdata);
/* Busy wait for completion. */
while (!completion_done(&done.completion))
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
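The conversion above keeps the set_state()/ack_state() handshake intact and
only renames it: every participating thread spins on the shared state, performs
the action for that state exactly once, acknowledges it, and the last thread to
acknowledge advances the whole group to the next state. A minimal user-space
sketch of that lock-step pattern, assuming C11 atomics and pthreads (the state
names, thread count and "active" thread below are invented for illustration;
the kernel runs this in per-CPU stopper threads with interrupts disabled):

/* User-space analogue of the multi_cpu_stop() ack/advance handshake. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

enum { STATE_NONE, STATE_PREPARE, STATE_RUN, STATE_EXIT };
#define NTHREADS 4

static atomic_int state = STATE_NONE;
static atomic_int thread_ack;

static void set_state(int newstate)
{
	/* Reset the ack counter before publishing the new state. */
	atomic_store(&thread_ack, NTHREADS);
	atomic_store(&state, newstate);
}

/* Last thread to acknowledge the current state advances everyone.
 * Advancing one step past STATE_EXIT is harmless; nobody reads it. */
static void ack_state(void)
{
	if (atomic_fetch_sub(&thread_ack, 1) == 1)
		set_state(atomic_load(&state) + 1);
}

static void *stopper(void *arg)
{
	long id = (long)arg;
	int curstate = STATE_NONE;

	do {
		/* Spin until the shared state moves past what we last saw. */
		int s = atomic_load(&state);
		if (s != curstate) {
			curstate = s;
			if (curstate == STATE_RUN && id == 0)
				printf("thread %ld runs the payload\n", id);
			ack_state();
		}
	} while (curstate != STATE_EXIT);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	set_state(STATE_PREPARE);	/* like set_state(&msdata, MULTI_STOP_PREPARE) */
	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, stopper, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}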
* [PATCH 37/63] stop_machine: Introduce stop_two_cpus()
@ 2013-10-07 10:29 ` Mel Gorman
0 siblings, 0 replies; 340+ messages in thread
From: Mel Gorman @ 2013-10-07 10:29 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
Johannes Weiner, Linux-MM, LKML, Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
Introduce stop_two_cpus() in order to allow controlled swapping of two
tasks. It repurposes the stop_machine() state machine but stops only the
two CPUs involved, which can be done with on-stack structures and avoids
machine-wide synchronization issues.
The ordering of CPUs is important to avoid deadlocks. If the queueing were
unordered, two CPUs calling stop_two_cpus() on each other simultaneously
would attempt to queue the works in opposite orders, causing an AB-BA style
deadlock. By always having the lowest-numbered CPU do the queueing, the
works are always queued in the same order and deadlocks are avoided.
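For illustration, a hedged sketch of how a caller might use the new primitive
(the helper names and the per-CPU counter below are invented for this example
and are not part of the patch; within this series the primitive is used later
to swap two tasks between CPUs). The callback runs on one of the two CPUs
while both have interrupts disabled:

/* Hypothetical illustration only: swap two per-CPU counters while both CPUs
 * are stopped. The callback runs on one of the two CPUs, with interrupts
 * disabled on both, so neither CPU can observe a half-done swap. */
#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/stop_machine.h>

static DEFINE_PER_CPU(u64, demo_counter);

struct demo_swap_args {
	int cpu1, cpu2;
};

static int demo_swap_stop(void *data)
{
	struct demo_swap_args *args = data;
	u64 tmp;

	tmp = per_cpu(demo_counter, args->cpu1);
	per_cpu(demo_counter, args->cpu1) = per_cpu(demo_counter, args->cpu2);
	per_cpu(demo_counter, args->cpu2) = tmp;
	return 0;
}

static int demo_swap_counters(int cpu1, int cpu2)
{
	struct demo_swap_args args = { .cpu1 = cpu1, .cpu2 = cpu2 };

	/* Both CPUs enter multi_cpu_stop(); demo_swap_stop() runs on one of
	 * them and the call returns once both have finished. */
	return stop_two_cpus(cpu1, cpu2, demo_swap_stop, &args);
}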
[riel@redhat.com: Deadlock avoidance]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/stop_machine.h | 1 +
kernel/stop_machine.c | 272 +++++++++++++++++++++++++++----------------
2 files changed, 175 insertions(+), 98 deletions(-)
diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 3b5e910..d2abbdb 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -28,6 +28,7 @@ struct cpu_stop_work {
};
int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg);
void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
struct cpu_stop_work *work_buf);
int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..32a6c44 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -115,6 +115,166 @@ int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
return done.executed ? done.ret : -ENOENT;
}
+/* This controls the threads on each CPU. */
+enum multi_stop_state {
+ /* Dummy starting state for thread. */
+ MULTI_STOP_NONE,
+ /* Awaiting everyone to be scheduled. */
+ MULTI_STOP_PREPARE,
+ /* Disable interrupts. */
+ MULTI_STOP_DISABLE_IRQ,
+ /* Run the function */
+ MULTI_STOP_RUN,
+ /* Exit */
+ MULTI_STOP_EXIT,
+};
+
+struct multi_stop_data {
+ int (*fn)(void *);
+ void *data;
+ /* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
+ unsigned int num_threads;
+ const struct cpumask *active_cpus;
+
+ enum multi_stop_state state;
+ atomic_t thread_ack;
+};
+
+static void set_state(struct multi_stop_data *msdata,
+ enum multi_stop_state newstate)
+{
+ /* Reset ack counter. */
+ atomic_set(&msdata->thread_ack, msdata->num_threads);
+ smp_wmb();
+ msdata->state = newstate;
+}
+
+/* Last one to ack a state moves to the next state. */
+static void ack_state(struct multi_stop_data *msdata)
+{
+ if (atomic_dec_and_test(&msdata->thread_ack))
+ set_state(msdata, msdata->state + 1);
+}
+
+/* This is the cpu_stop function which stops the CPU. */
+static int multi_cpu_stop(void *data)
+{
+ struct multi_stop_data *msdata = data;
+ enum multi_stop_state curstate = MULTI_STOP_NONE;
+ int cpu = smp_processor_id(), err = 0;
+ unsigned long flags;
+ bool is_active;
+
+ /*
+ * When called from stop_machine_from_inactive_cpu(), irq might
+ * already be disabled. Save the state and restore it on exit.
+ */
+ local_save_flags(flags);
+
+ if (!msdata->active_cpus)
+ is_active = cpu == cpumask_first(cpu_online_mask);
+ else
+ is_active = cpumask_test_cpu(cpu, msdata->active_cpus);
+
+ /* Simple state machine */
+ do {
+ /* Chill out and ensure we re-read multi_stop_state. */
+ cpu_relax();
+ if (msdata->state != curstate) {
+ curstate = msdata->state;
+ switch (curstate) {
+ case MULTI_STOP_DISABLE_IRQ:
+ local_irq_disable();
+ hard_irq_disable();
+ break;
+ case MULTI_STOP_RUN:
+ if (is_active)
+ err = msdata->fn(msdata->data);
+ break;
+ default:
+ break;
+ }
+ ack_state(msdata);
+ }
+ } while (curstate != MULTI_STOP_EXIT);
+
+ local_irq_restore(flags);
+ return err;
+}
+
+struct irq_cpu_stop_queue_work_info {
+ int cpu1;
+ int cpu2;
+ struct cpu_stop_work *work1;
+ struct cpu_stop_work *work2;
+};
+
+/*
+ * This function is always run with irqs and preemption disabled.
+ * This guarantees that both work1 and work2 get queued, before
+ * our local migrate thread gets the chance to preempt us.
+ */
+static void irq_cpu_stop_queue_work(void *arg)
+{
+ struct irq_cpu_stop_queue_work_info *info = arg;
+ cpu_stop_queue_work(info->cpu1, info->work1);
+ cpu_stop_queue_work(info->cpu2, info->work2);
+}
+
+/**
+ * stop_two_cpus - stops two cpus
+ * @cpu1: the cpu to stop
+ * @cpu2: the other cpu to stop
+ * @fn: function to execute
+ * @arg: argument to @fn
+ *
+ * Stops both @cpu1 and @cpu2 and runs @fn on one of them.
+ *
+ * Returns when both CPUs have completed.
+ */
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg)
+{
+ int call_cpu;
+ struct cpu_stop_done done;
+ struct cpu_stop_work work1, work2;
+ struct irq_cpu_stop_queue_work_info call_args;
+ struct multi_stop_data msdata = {
+ .fn = fn,
+ .data = arg,
+ .num_threads = 2,
+ .active_cpus = cpumask_of(cpu1),
+ };
+
+ work1 = work2 = (struct cpu_stop_work){
+ .fn = multi_cpu_stop,
+ .arg = &msdata,
+ .done = &done
+ };
+
+ call_args = (struct irq_cpu_stop_queue_work_info){
+ .cpu1 = cpu1,
+ .cpu2 = cpu2,
+ .work1 = &work1,
+ .work2 = &work2,
+ };
+
+ cpu_stop_init_done(&done, 2);
+ set_state(&msdata, MULTI_STOP_PREPARE);
+
+ /*
+ * Queuing needs to be done by the lowest numbered CPU, to ensure
+ * that works are always queued in the same order on every CPU.
+ * This prevents deadlocks.
+ */
+ call_cpu = min(cpu1, cpu2);
+
+ smp_call_function_single(call_cpu, &irq_cpu_stop_queue_work,
+ &call_args, 0);
+
+ wait_for_completion(&done.completion);
+ return done.executed ? done.ret : -ENOENT;
+}
+
/**
* stop_one_cpu_nowait - stop a cpu but don't wait for completion
* @cpu: cpu to stop
@@ -359,98 +519,14 @@ early_initcall(cpu_stop_init);
#ifdef CONFIG_STOP_MACHINE
-/* This controls the threads on each CPU. */
-enum stopmachine_state {
- /* Dummy starting state for thread. */
- STOPMACHINE_NONE,
- /* Awaiting everyone to be scheduled. */
- STOPMACHINE_PREPARE,
- /* Disable interrupts. */
- STOPMACHINE_DISABLE_IRQ,
- /* Run the function */
- STOPMACHINE_RUN,
- /* Exit */
- STOPMACHINE_EXIT,
-};
-
-struct stop_machine_data {
- int (*fn)(void *);
- void *data;
- /* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
- unsigned int num_threads;
- const struct cpumask *active_cpus;
-
- enum stopmachine_state state;
- atomic_t thread_ack;
-};
-
-static void set_state(struct stop_machine_data *smdata,
- enum stopmachine_state newstate)
-{
- /* Reset ack counter. */
- atomic_set(&smdata->thread_ack, smdata->num_threads);
- smp_wmb();
- smdata->state = newstate;
-}
-
-/* Last one to ack a state moves to the next state. */
-static void ack_state(struct stop_machine_data *smdata)
-{
- if (atomic_dec_and_test(&smdata->thread_ack))
- set_state(smdata, smdata->state + 1);
-}
-
-/* This is the cpu_stop function which stops the CPU. */
-static int stop_machine_cpu_stop(void *data)
-{
- struct stop_machine_data *smdata = data;
- enum stopmachine_state curstate = STOPMACHINE_NONE;
- int cpu = smp_processor_id(), err = 0;
- unsigned long flags;
- bool is_active;
-
- /*
- * When called from stop_machine_from_inactive_cpu(), irq might
- * already be disabled. Save the state and restore it on exit.
- */
- local_save_flags(flags);
-
- if (!smdata->active_cpus)
- is_active = cpu == cpumask_first(cpu_online_mask);
- else
- is_active = cpumask_test_cpu(cpu, smdata->active_cpus);
-
- /* Simple state machine */
- do {
- /* Chill out and ensure we re-read stopmachine_state. */
- cpu_relax();
- if (smdata->state != curstate) {
- curstate = smdata->state;
- switch (curstate) {
- case STOPMACHINE_DISABLE_IRQ:
- local_irq_disable();
- hard_irq_disable();
- break;
- case STOPMACHINE_RUN:
- if (is_active)
- err = smdata->fn(smdata->data);
- break;
- default:
- break;
- }
- ack_state(smdata);
- }
- } while (curstate != STOPMACHINE_EXIT);
-
- local_irq_restore(flags);
- return err;
-}
-
int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
{
- struct stop_machine_data smdata = { .fn = fn, .data = data,
- .num_threads = num_online_cpus(),
- .active_cpus = cpus };
+ struct multi_stop_data msdata = {
+ .fn = fn,
+ .data = data,
+ .num_threads = num_online_cpus(),
+ .active_cpus = cpus,
+ };
if (!stop_machine_initialized) {
/*
@@ -461,7 +537,7 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
unsigned long flags;
int ret;
- WARN_ON_ONCE(smdata.num_threads != 1);
+ WARN_ON_ONCE(msdata.num_threads != 1);
local_irq_save(flags);
hard_irq_disable();
@@ -472,8 +548,8 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
}
/* Set the initial state and stop all online cpus. */
- set_state(&smdata, STOPMACHINE_PREPARE);
- return stop_cpus(cpu_online_mask, stop_machine_cpu_stop, &smdata);
+ set_state(&msdata, MULTI_STOP_PREPARE);
+ return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);
}
int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
@@ -513,25 +589,25 @@ EXPORT_SYMBOL_GPL(stop_machine);
int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
const struct cpumask *cpus)
{
- struct stop_machine_data smdata = { .fn = fn, .data = data,
+ struct multi_stop_data msdata = { .fn = fn, .data = data,
.active_cpus = cpus };
struct cpu_stop_done done;
int ret;
/* Local CPU must be inactive and CPU hotplug in progress. */
BUG_ON(cpu_active(raw_smp_processor_id()));
- smdata.num_threads = num_active_cpus() + 1; /* +1 for local */
+ msdata.num_threads = num_active_cpus() + 1; /* +1 for local */
/* No proper task established and can't sleep - busy wait for lock. */
while (!mutex_trylock(&stop_cpus_mutex))
cpu_relax();
/* Schedule work on other CPUs and execute directly for local CPU */
- set_state(&smdata, STOPMACHINE_PREPARE);
+ set_state(&msdata, MULTI_STOP_PREPARE);
cpu_stop_init_done(&done, num_active_cpus());
- queue_stop_cpus_work(cpu_active_mask, stop_machine_cpu_stop, &smdata,
+ queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
&done);
- ret = stop_machine_cpu_stop(&smdata);
+ ret = multi_cpu_stop(&msdata);
/* Busy wait for completion. */
while (!completion_done(&done.completion))
--
1.8.4
^ permalink raw reply [flat|nested] 340+ messages in thread
* [tip:sched/core] stop_machine: Introduce stop_two_cpus()
2013-10-07 10:29 ` Mel Gorman
@ 2013-10-09 17:30 ` tip-bot for Peter Zijlstra
-1 siblings, 0 replies; 340+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-09 17:30 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, hannes, riel, aarcange, srikar,
mgorman, tglx
Commit-ID: 1be0bd77c5dd7c903f46abf52f9a3650face3c1d
Gitweb: http://git.kernel.org/tip/1be0bd77c5dd7c903f46abf52f9a3650face3c1d
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Oct 2013 11:29:15 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 Oct 2013 12:40:45 +0200
stop_machine: Introduce stop_two_cpus()
Introduce stop_two_cpus() in order to allow controlled swapping of two
tasks. It repurposes the stop_machine() state machine but stops only the
two CPUs involved, which can be done with on-stack structures and avoids
machine-wide synchronization issues.
The ordering of CPUs is important to avoid deadlocks. If the queueing were
unordered, two CPUs calling stop_two_cpus() on each other simultaneously
would attempt to queue the works in opposite orders, causing an AB-BA style
deadlock. By always having the lowest-numbered CPU do the queueing, the
works are always queued in the same order and deadlocks are avoided.
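That ordering rule is the classic fix for AB-BA deadlocks on any pair of
resources: pick a total order and always acquire in that order. A minimal
user-space analogy, with pthread mutexes standing in for the two per-CPU
stopper queues (every name below is invented for this sketch; none of it is
kernel code):

/* Whichever way the caller names the pair, both locks are always taken in
 * one global order (lowest address first), mirroring call_cpu =
 * min(cpu1, cpu2) in stop_two_cpus(). */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void lock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
{
	/* Canonical order: "lowest numbered" (here, lowest address) first. */
	pthread_mutex_t *first  = ((uintptr_t)m1 < (uintptr_t)m2) ? m1 : m2;
	pthread_mutex_t *second = ((uintptr_t)m1 < (uintptr_t)m2) ? m2 : m1;

	pthread_mutex_lock(first);
	pthread_mutex_lock(second);
}

static void unlock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
{
	pthread_mutex_unlock(m1);
	pthread_mutex_unlock(m2);
}

static void *worker(void *arg)
{
	pthread_mutex_t **pair = arg;	/* one thread gets (a,b), the other (b,a) */

	for (int i = 0; i < 100000; i++) {
		lock_pair(pair[0], pair[1]);
		unlock_pair(pair[0], pair[1]);
	}
	return NULL;
}

int main(void)
{
	pthread_mutex_t *ab[] = { &lock_a, &lock_b };
	pthread_mutex_t *ba[] = { &lock_b, &lock_a };
	pthread_t t1, t2;

	/* Without the ordering in lock_pair() this pattern is a textbook
	 * AB-BA deadlock; with it, both threads always acquire A first. */
	pthread_create(&t1, NULL, worker, ab);
	pthread_create(&t2, NULL, worker, ba);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	puts("no deadlock");
	return 0;
}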
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[ Implemented deadlock avoidance. ]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/1381141781-10992-38-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
include/linux/stop_machine.h | 1 +
kernel/stop_machine.c | 272 +++++++++++++++++++++++++++----------------
2 files changed, 175 insertions(+), 98 deletions(-)
diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 3b5e910..d2abbdb 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -28,6 +28,7 @@ struct cpu_stop_work {
};
int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg);
void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
struct cpu_stop_work *work_buf);
int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..32a6c44 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -115,6 +115,166 @@ int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
return done.executed ? done.ret : -ENOENT;
}
+/* This controls the threads on each CPU. */
+enum multi_stop_state {
+ /* Dummy starting state for thread. */
+ MULTI_STOP_NONE,
+ /* Awaiting everyone to be scheduled. */
+ MULTI_STOP_PREPARE,
+ /* Disable interrupts. */
+ MULTI_STOP_DISABLE_IRQ,
+ /* Run the function */
+ MULTI_STOP_RUN,
+ /* Exit */
+ MULTI_STOP_EXIT,
+};
+
+struct multi_stop_data {
+ int (*fn)(void *);
+ void *data;
+ /* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
+ unsigned int num_threads;
+ const struct cpumask *active_cpus;
+
+ enum multi_stop_state state;
+ atomic_t thread_ack;
+};
+
+static void set_state(struct multi_stop_data *msdata,
+ enum multi_stop_state newstate)
+{
+ /* Reset ack counter. */
+ atomic_set(&msdata->thread_ack, msdata->num_threads);
+ smp_wmb();
+ msdata->state = newstate;
+}
+
+/* Last one to ack a state moves to the next state. */
+static void ack_state(struct multi_stop_data *msdata)
+{
+ if (atomic_dec_and_test(&msdata->thread_ack))
+ set_state(msdata, msdata->state + 1);
+}
+
+/* This is the cpu_stop function which stops the CPU. */
+static int multi_cpu_stop(void *data)
+{
+ struct multi_stop_data *msdata = data;
+ enum multi_stop_state curstate = MULTI_STOP_NONE;
+ int cpu = smp_processor_id(), err = 0;
+ unsigned long flags;
+ bool is_active;
+
+ /*
+ * When called from stop_machine_from_inactive_cpu(), irq might
+ * already be disabled. Save the state and restore it on exit.
+ */
+ local_save_flags(flags);
+
+ if (!msdata->active_cpus)
+ is_active = cpu == cpumask_first(cpu_online_mask);
+ else
+ is_active = cpumask_test_cpu(cpu, msdata->active_cpus);
+
+ /* Simple state machine */
+ do {
+ /* Chill out and ensure we re-read multi_stop_state. */
+ cpu_relax();
+ if (msdata->state != curstate) {
+ curstate = msdata->state;
+ switch (curstate) {
+ case MULTI_STOP_DISABLE_IRQ:
+ local_irq_disa